- Any results / lines of research I haven't heard of in areas like interpretability, inductive biases, and ML theory
- Obscure papers or posts on things like double descent, simplicity bias, induction heads, and grokking
I’d suggest my LessWrong post on grokking and SGD: Hypothesis: gradient descent prefers general circuits. It argues that SGD has an inductive bias towards general circuits, and that this bias explains grokking. I’m not certain the hypothesis is correct, but the post is very obscure and a favourite of mine, so it seems appropriate for this question.
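To make the framing concrete, here is a hypothetical toy sketch (my own illustration, not from the post) of the distinction between a "memorising" circuit and a "general" circuit: both agree on every training input, but the general circuit also determines the model's output on inputs it has never seen.

```python
# Toy contrast (illustrative assumption, not the post's experiment):
# modular addition, the task in which grokking was first observed.
P = 97

# Training set: an arbitrary ~1/7 slice of the input space.
train = [(a, b) for a in range(P) for b in range(P) if (a * P + b) % 7 == 0]

memorised = {(a, b): (a + b) % P for (a, b) in train}  # lookup-table circuit
general = lambda a, b: (a + b) % P                     # single general rule

# Both circuits are perfect on the training data...
assert all(memorised[x] == general(*x) for x in train)

# ...but the general circuit covers the whole input space,
# while the memorised one only covers the training slice.
all_inputs = [(a, b) for a in range(P) for b in range(P)]
covered_by_memorised = sum(x in memorised for x in all_inputs) / len(all_inputs)
covered_by_general = 1.0
```

The hypothesis, loosely, is that SGD's updates favour the second kind of circuit, which is why test accuracy eventually jumps long after the training set is fit.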
SGD’s bias, a post by John Wentworth, explores a similar question by using an analogy to a random walk.
I suspect you’re familiar with it, but Is SGD a Bayesian sampler? Well, almost argues that the distribution over functions induced by random DNN initialisations strongly favours simple functions, and that this bias explains DNN generalisation.
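The paper's core claim can be probed with a very small experiment. The following is a minimal sketch of my own (not the paper's setup): sample many randomly initialised MLPs, record the boolean function each one computes on all 3-bit inputs, and observe that the empirical distribution is far from uniform, with simple functions (e.g. constants) heavily overrepresented.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# All 2^3 boolean inputs for a 3-bit function.
X = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)

def random_mlp_function(rng, hidden=16):
    """Sample a small MLP at random; return the boolean function it computes on X."""
    W1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    h = np.tanh(X @ W1 + b1)
    return tuple((h @ W2 > 0).astype(int))

counts = Counter(random_mlp_function(rng) for _ in range(20000))

# A uniform prior over the 2^8 = 256 possible functions would give each
# ~78 samples; instead a handful of simple functions dominate.
top5 = counts.most_common(5)
```

Running this, the two constant functions typically sit at the top of `top5` with far more than their uniform share, which is the qualitative pattern the paper formalises.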
Locating and Editing Factual Knowledge in GPT is very recent interpretability work investigating where GPT models store factual knowledge and how that knowledge can be accessed and modified.
Edit: I see you’re the person who wrote How complex are myopic imitators?. In that case, I’d again point you towards Hypothesis: gradient descent prefers general circuits, with the added context that I now suspect (but am not confident) that “generality bias from SGD” and “simplicity bias from initialisation” are functionally identical in terms of their impact on a model’s internals. If so, I think the main value of generality bias as a concept is that it makes it easier for humans to intuit which circuits are favoured, namely circuits that contribute to a higher fraction of the model’s outputs.
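That last criterion can be operationalised crudely. Here is a hypothetical toy sketch (all names and the ablation proxy are my assumptions, not anything from the posts): treat each hidden unit of a small random network as a "circuit", and score its generality as the fraction of inputs whose output flips when the unit is ablated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: a fixed random ReLU MLP; the "circuits" are individual hidden units.
X = rng.normal(size=(256, 4))   # 256 sample inputs
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=8)

def outputs(mask):
    """Binary outputs of the network with masked-out hidden units ablated to zero."""
    h = np.maximum(X @ W1, 0.0) * mask
    return (h @ W2 > 0).astype(int)

baseline = outputs(np.ones(8))

# Generality of unit i: fraction of inputs whose output changes when
# unit i is ablated — a crude proxy for "contributes to a higher
# fraction of the model's outputs".
generality = []
for i in range(8):
    mask = np.ones(8)
    mask[i] = 0.0
    generality.append((outputs(mask) != baseline).mean())
```

Under the generality-bias framing, SGD would preferentially build and reinforce units with high scores on a measure like this, though the real claim concerns multi-unit circuits rather than single neurons.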