Path dependence in ML inductive biases

evhub

Interesting post! I'm pretty curious about these.

A great resource for answering these questions is a set of model runs put out by the Stanford Center for Research into Foundation Models - they trained 5 runs of GPT-2 small and GPT-2 medium with 600 checkpoints and different random seeds, and released the weights. It seems like a good way to get some surface area on these questions with interesting real models. A few ideas that are somewhere on my maybe/someday research ideas list:

For each pair of models, feed in a bunch of text and look at the log prob for predicting each next token, and look at the scatter plot of these - does it look highly correlated? Poke at any outliers and see if there are any consistent patterns of things one model can do and the other cannot
- Repeat this for a checkpoint halfway through training. If you find capabilities in one model and not in another, have they converged by the end of training?
- Look at the PCA of these per-token losses across, say, 1M tokens of text, and see if you can find anything interesting about the components
Evaluate the models for a bunch of behaviours - ability to use punctuation correctly, to match open and close parentheses, patterns in the syntax and structure of the data (capital letters at the start of a sentence, email addresses having an @ and a .com in them, taking text in other languages and continuing it with text of that language, etc), specific behaviour like the ability to memorise specific phrases, complete acronyms, use induction-like behaviour, basic factual knowledge about the world, etc
- The medium models will have more interesting + sophisticated behaviour, and are probably a better place to look for specific circuits
Look at the per-token losses for some text over training (esp for tokens with significant deviation between final models) and see whether it looks smooth or S-shaped - S-shaped would suggest higher path dependence to me
Look for induction head phase changes in each model during training, and compare when they happen.

I'm currently writing a library for mechanistic interpretability of LLMs, with support for loading these models + their checkpoints - if anyone might be interested on working on this, happy to share ideas. This is a small subset of OpenWebText that seems useful for testing.

Unrelatedly, a mark against path dependence is the induction head bump result, where we found that models have a phase change where they suddenly form induction heads, and that across a range of model sizes and architecture it forms consistently and around the same point (though not all architectures tested). Anecdotally, I've found that the time of formation is very sensitive to the exact positional embeddings used though.

[-]RobertKirk3y1311

When you talk about whether we're in a high or low path-dependence "world", do you think that there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it's more likely that some training processes are highly path-dependent and some aren't. We definitely have evidence that some are path-dependent, e.g. Ethan's comment and other examples like https://arxiv.org/abs/2002.06305, and almost any RL paper where different random seeds of the training process often result in quite different results. Arguably I don't think we have conclusive of any particular existing training process being low-path dependence, because the burden of proof is heavy for proving that two models are basically equivalent on basically all inputs (given that they're very unlikely to literally have identical weights, so the equivalence would have to be at a high level of abstraction).

Reasoning about the path dependence of a training process specifically, rather than whether all of the ML/AGI development world is path dependent, seems more precise, and also allows us to reason about whether we want a high or low path-dependence training process, and considering that as an intervention, rather than a state of the world we can't change.

[-]evhub3y34

Yeah, I agree with that. I think path dependence will likely vary across training processes and that we should in fact view that as an important intervention point.

[-]johnswentworth3y11-2

Betting markets on these questions would be nice. I'd bid pretty strongly on "nope, basically no path dependence" for most current architectures; replicability already gives us a ton of bits on the question.

[-]Ethan Caballero3y103

Sections 3.1 and 6.6 titled "Ossification" of "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that current training of current DNNs exhibits high path dependence.

^{^}

Mechanistically, high path dependence corresponds to significant influence and continuity from the structure and ontology of the early model to its final structure, and low path-dependence to destruction/radical transformation of early structures and overdetermination/stability of the final result.

^{^}

"Dissimilar solutions" = Different factorizations of the task into circuits. The low-path-dependence argument here is that "paths are always abundant in high-dimensional spaces".

^{^}

I'm avoiding the term "goal" since I don't presume consequentialism.

^{^}

See one of these reviews. I mostly disbelieve the qualitative conclusions people draw from this work though, for reasons that deserve their own post.

^{^}

All this means right now is "It gives us strong path-dependence vibes, so it's probably correlated with the other stuff"

^{^}

Really a spectrum.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

36

Path dependence in ML inductive biases

36

Path dependence

Diagrams for training dynamics

Reasons to care about path dependence

Specific aspects of training dynamics

A question about predictiveness

Conclusion