The lottery ticket hypothesis, as I (vaguely) understand it, is that artificial neural networks tend to work in the following way: When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.
[EDIT: This understanding goes beyond what the original paper proved; it draws from things proved (or allegedly proved) in later papers. See thread below. EDIT EDIT: Daniel Filan has now convinced me that my understanding of the LTH as expressed above was importantly wrong, or at least importantly goes-beyond-the-evidence.]
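For concreteness, here is a minimal sketch of the kind of procedure the original paper actually runs: train densely, prune the smallest-magnitude weights, rewind the survivors to their initial values, and retrain only that sub-network (the paper iterates this; a single round is shown here). The toy data, layer sizes, optimizer settings, and 80% pruning fraction below are illustrative assumptions, not anything from the paper.

```python
# Sketch of lottery-ticket-style magnitude pruning with weight rewinding.
# Assumes PyTorch; all hyperparameters and the toy task are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, :2].sum(dim=1) > 0).long()      # toy binary classification task

def make_net():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def weight_layers(net):
    return [m for m in net if isinstance(m, nn.Linear)]

def apply_masks(net, masks):
    # Zero out pruned weights so only the sub-network participates.
    with torch.no_grad():
        for layer, mask in zip(weight_layers(net), masks):
            layer.weight.mul_(mask)

def train(net, masks, steps=300):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()
        apply_masks(net, masks)             # keep pruned weights at zero
    return net

net = make_net()
init_state = {k: v.clone() for k, v in net.state_dict().items()}

# 1) Train the dense network (all-ones masks change nothing).
masks = [torch.ones_like(l.weight) for l in weight_layers(net)]
train(net, masks)

# 2) Prune the smallest-magnitude 80% of weights in each layer.
for i, layer in enumerate(weight_layers(net)):
    w = layer.weight.abs()
    threshold = w.flatten().kthvalue(int(0.8 * w.numel())).values
    masks[i] = (w > threshold).float()

# 3) Rewind surviving weights to their original initialization and retrain.
net.load_state_dict(init_state)
apply_masks(net, masks)
train(net, masks)
acc = (net(X).argmax(dim=1) == y).float().mean()
print(f"sparse sub-network accuracy on toy data: {acc.item():.2f}")
```

The paper's empirical claim is that the rewound sparse sub-network trains to roughly the dense network's accuracy; the stronger "already decent at initialization, training just amplifies it" reading above is the part that goes beyond this.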
By the scaling hypothesis I mean that in the next five years, many other architectures besides the transformer will also be shown to get substantially better as they get bigger. I'm also interested in defining it differently, as whatever Gwern is talking about.
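(The concrete shape of claim I have in mind is the power-law fit from the language-model scaling work, e.g. Kaplan et al. 2020, where test loss falls smoothly with parameter count $N$ for some fitted constants $N_c$ and $\alpha_N$:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

The hypothesis is that other architectures will turn out to have their own well-behaved exponents of this sort, not that they will match the transformer's.)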
This paper seems to be arguing that variance initially increases as network width goes up, then starts decreasing for very large networks, suggesting that overall variance is likely to decrease as we approach more advanced AI systems and networks get very large.
'Variance' is used in an amusing number of ways in these discussions. You use 'variance' in one sense (the bias-variance tradeoff), but "Explaining Neural Scaling Laws", Bahri et al 2021 talks about a different kind of variance limit in scaling, while "Learning Curve Theory", Hutter 2021's toy model provides statements on yet other kinds of variance about scaling curves themselves (and I think you could easily dig up a paper from the neural tangent kernel people about scaling approximating infinite width models which only need to make infinitesimally small...
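To pin down just the first of those senses: the bias-variance-tradeoff 'variance' is the middle term of the standard decomposition of expected squared error, where $D$ is the training sample, $\hat f_D$ the fitted predictor, $f$ the true function, and $\sigma^2$ the irreducible noise:

$$\mathbb{E}_{D,\varepsilon}\big[(y-\hat f_D(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}_D[\hat f_D(x)]-f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}_D\big[(\hat f_D(x)-\mathbb{E}_D[\hat f_D(x)])^2\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}}$$

The 'variances' in the other papers (Bahri et al's variance-limited scaling regime, fluctuations of the learning curve itself in Hutter's model) are different objects, which is part of why these discussions can talk past each other.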