[ Question ]

Does the lottery ticket hypothesis suggest the scaling hypothesis?

by Daniel Kokotajlo1 min read28th Jul 20202 comments



The lottery ticket hypothesis, as I (vaguely) understand it, is that artificial neural networks tend to work in the following way: When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.

By the scaling hypothesis I mean that in the next five years, many other architectures besides the transformer will also be shown to get substantially better as they get bigger. I'm also interested in defining it differently, as whatever Gwern is talking about.

New Answer
Ask Related Question
New Comment

1 Answers

The implication depends on the distribution of lottery tickets. If there is a short-tailed distribution, then the rewards of scaling will be relatively small; bigger would still get better, but very slowly. A long-tailed distribution, on the other hand, would suggest continued returns to getting more lottery tickets.

I ask a question here about what's true in practice.