Lawrence Chan

I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.

I'm also a part-time fund manager for the LTFF.

Obligatory research billboard website: https://chanlawrence.me/

30

(The Anthropic paper I cited predates ChatGPT by 7 months)

40

Pretty sure Anthropic's early assistant stuff used the word this way too: See e.g. Bai et al https://arxiv.org/abs/2204.05862

But yes, people complained about it a lot at the time

30

Very cool work; I'm glad it was done.

That being said, I agree with Fabien that the title is a bit overstated, insofar as it's about your results in particular::

Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel.

It's a general fact of ML that small changes in finetuning setup can greatly affect performance if you're not careful. In particular, it seems likely to me that the empirical details that Fabien asks for may affect your results. But this has little to do with formatting, and much more to deal with the intrinsic difficulty of finetuning LLMs properly.

As shown in Fabien's password experiments, there are many ways to mess up on finetuning (including by having a bad seed), and different finetuning techniques are likely to lead to different levels of performance. (And the problem gets worse as you start using RL and not just SFT) So it's worth being very careful on claiming that the results of any particular finetuning run upper bounds model capabilities. But it's still plausible that trying very hard on finetuning elicits capabilities more efficiently than trying very hard on prompting, for example, which I think is closer to what people mean when they say that finetuning is an upper bound on model capabilities.

20

Have you tried instead 'skinny' NNs with a bias towards depth,

I haven't -- the problem with skinny NNs is stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot -> grokking were done with the hope of interpreting the model before/after the slingshots.

That being said, you're probably correct that having more layers does seem related to slingshots.

(Particularly for MLPs, which are notorious for overfitting due to their power.)

What do you mean by power here?

40

Worth noting that both some of Anthropic's results and Lauren Greenspan's results here (assuming I understand her results correctly) give a clear demonstration of learned (even very toy) transformers not being well-modeled as sets of skip trigrams.

91

I'm having a bit of difficulty understanding the exact task/set up of this post, and so I have a few questions.

Here's a summary of your post as I understand it:

- In Anthropic's Toy Model of Attention Head "Superposition",
^{[1]}they consider a task where the model needs to use interference between heads to implement multiple skip trigrams. In particular, they consider, and call this "OV-incoherent, because the OV seems to need to use information "not present" in V of the source token. (This was incorrect, because you can implement their task perfectly using only a copying head and a negative copying head.) They call this attention head superposition because they didn't understand the algorithm, and so mistakenly thought they needed more attention heads than you actually need to implement the task (to their credit, they point out their mistake in their July 2023 update, and give the two head construction). - In this work, you propose a model of "OV-coherent" superposition, where the OV still needs to use information "not present" at the attended to location and which also requires more skip trigrams than attention heads to implement. Namely, you consider learning sequences of the form [A] ... [B] ... [Readoff]-> [C], which cannot naturally be implemented via skip-trigrams (and instead needs to be implemented via what Neel calls hierarchical skip tri-grams or what I normally call just "interference").
- You construct your sequences as follows:
- There are 12 tokens for the input and 10 "output tokens". Presumably you parameterized it so that dvocab=12, and just reassigned the inputs? For the input sequence, you use one token [0] as the read-off token, 4 tokens [1-4] as signal tokens, and the rest [5-11] as noise tokens.
- In general, you don't bother training the model above all tokens except for above the read-off [0](I think it's more likely you trained it to be uniform on other tokens, actually. But at most this just rescales the OV and QK circuits (EVOU and EQKE respectively), and so we can ignore it when analyzing the attention heads).
- Above the read-off, you train the model to minimize cross entropy loss, using the labels:
- 0 -> 1,1 present in sequence
- 1 -> 1, 2 present in sequence
- ...
- 8 -> 3, 4 present in sequence
- 9 -> 4, 4 present in sequence

- So for example, if you see the sequence [5] [1] [2] [11] [0], the model should assign a high logit to [1], if you see the sequence [7] [4] [10] [3] [0], the model should assign a high logit to [8], etc.

- You find that models can indeed learn your sequences of the form [A] ... [B] ... [Readoff]-> [C], often by implementing constructive interference "between skip-trigrams" both between the two attention heads and
*within*each single head.- Specifically, in your mainline model in the post, head 1 implements something like the following algorithm:
- Attend to tokens in [1-4], but attend to token [1] the most, then [4], then [3], then [2]. Call this the order of head 1.
- The head increases the logits corresponding to pairs containing the tokens it attends to, except for the pairs that contain tokens higher in the order. That is, when attending to token [1], increase the logits for outputs [0-3] (corresponding to the logits indicating that there's a 1 present in the sequence) and decrease logits for outputs [4-9] (corresponding to all other logits). Similarly, when attending to 4, increase the logits for outputs [6], [8], and [9] (corresponding to logits indicating that there's a 4 present but not a 1). When attending to 3, increase logits for outputs [5] and [7] (there's a 3 but not a 1 or 4), and when attending to 2, increase logits for outputs [2], [3], [4]. In fact, it increases the logits that it attends to less strongly
*more*, which partially cancels out the fact that it attends more to those logits. - So on the sequence [7] [4] [10] [3] [0], head 1 will increase the logits for [6], [8], [9] a lot and [5] and [7] a little, while suppressing all other logits.

- Head 0 implements the same algorithm, but attends in order [2], [3], [4], [1] (the reverse of head 1).
- That being said, it's a lot less clean in terms of what it outputs, e.g. it slightly increases logits [7-9] if it sees a 1. This is probably for error correction/calibration reasons, increasing logits [7-9] helps cancel out the strong bias of head 1 in suppressing logits of [5-9].
- On the sequence [7] [4] [10] [3] [0], head 0 increases the logits for [2], [7], [8] a lot and [3] 9] a little.

- Adding together the two heads causes them to output the correct answer.
- On the sequence [7] [4] [10] [3] [0], since both heads increase logit [8] a lot, and increase the other logits only a little, the model outputs [8] (corresponding to 3, 4 being in sequence).

- Specifically, in your mainline model in the post, head 1 implements something like the following algorithm:
- You conclude that this is an example of a different kind of "attention head superposition", because this task is implemented across two attention heads, even though it takes 10 skip trigrams to naively implement this task.

Questions/comments:

- I'm not sure my understanding of the task is correct, does the description above seem right to you?
- Assuming the description above is correct, it seems that there's an easy algorithm for implementing this with one head.
- When you see a token, increase the logits corresponding to pairs containing that token. Then, attend to all tokens in [1-4] uniformly.
- You can explain this with skip-bigrams -- the model needs to implement the 16 skip bigrams mapping each of 4 tokens to the 4 logits corresponding to a pair containing the token.
- You need a slight correction to handle the case where there are two repeated tokens, so you in fact want to increase the logits non-uniformly, so as to assign slightly higher logits to the pair containing the attended to token twice.
- though, if you trained the model to be uniform on all tokens except for [0], it'll need to check for [0] when deciding to output a non-uniform logit and move this information from other tokens, so it needs to stash its "bigrams" in EVOU and not EU

- It's pretty easy to implement 16 skip-bigrams in a matrix of size 4 x 10 (you only need 16 non-zero entries out of 40 total entries). You want EVOU to look something like:

3 2 2 2 0 0 0 0 0 0

0 2 0 0 3 2 2 0 0 0

0 0 2 0 0 2 0 3 2 0

0 0 0 2 0 0 2 0 2 3

Then with EQKE uniform on[1-4] and 0 otherwise, the output of the head (attention-weighted EVOU) in cases where there are two different tokens in the input will be 4 on the true logit and 2 or 3 on the logits for pairs containing one of the tokens but not the other, and 0 on other bigrams. In cases where the same token appears twice, then you get 6 on the true logit, 4 on the three other pairs containing the token once, and 0 otherwise.^{[2]}You can then scale EVOU upwards to decrease loss until weight decay kicks in. - In your case, you have two EVOUs of size 4 x 10 but which are constrained to be rank 5 due to d_head=5. This is part of why the model wants to split the computation evenly across both heads.
- From eyeballing, adding together the two EVOUs indeed produces something akin to the above diagram.
- Given you split the computation and the fact that EVOU being rank 5 for each head introduces non-zero bias/noise, you want the two heads to have opposite biases/noise terms such that they cancel out. This is why you see one head specializing in copying over 1, then 4, then 3, then 2, and the other 2 3 4 1.

- This also explains your observation: "We were also surprised that this problem can be solved with one head, as long as d_head >= 4. Intuitively, once a head has enough dimensions to store every "interesting" token orthogonally, its OV circuit can simply learn to map each of these basis vectors to the corresponding completions."
- It makes sense why d_head >= 4 is required here, because you definitely cannot implement anything approaching the above EVOU with a rank 3 matrix (since you can't even "tell apart" the 4 input tokens). Presumably the model can learn low-rank approximations of the above EVOU, though I don't know how to construct them by hand.

- So it seems to me that, if my understanding is correct, this is also
*not*an example of "true" superposition, in the sense I distinguish here: https://www.lesswrong.com/posts/8EyCQKuWo6swZpagS/superposition-is-not-just-neuron-polysemanticity

- When you see a token, increase the logits corresponding to pairs containing that token. Then, attend to all tokens in [1-4] uniformly.
- What exactly do you mean by superposition?
- It feels that you're using the term interchangeably with "polysemanticity" or "decomposability". But part of the challenge of superposition is that there are more sparse "things" the model wants to compute or store than it has "dimensions"/"components", which means there's no linear transformation of the input space that recovers all the features. This is meaningfully distinct from the case where the model wants to represent one thing across multiple components/dimensions for error correction or other computational efficiency reasons(i.e. see example 1 here), which are generally easier to handle using linear algebra techniques.

- It feels like you're claiming superposition because there are more skip trigrams than n_heads, is there a different kind of superposition I'm missing here?
- I think your example in the post is not an example of superposition in the traditional sense (again assuming that my interpretation is correct), and is in fact not even true polysemanticity. Instead of each head representing >1 feature, the low-rank nature of your heads means that each head basically has to represent 0.5 features.
- The example in the post is an example of
*superposition of skip trigrams*, but it's pretty easy to construct toy examples where any -- would you consider any example where you can't represent the task with <= nheads skip trigrams as an example of superposition?

Some nitpicks:

- What is "nan" in the EVOU figure (in the chapter "OV circuit behaviour")? I presume this is the (log-)sum(-exp) of the logits corresponding to outputs [9] and [10]?
- It's worth noting that (I'm pretty sure though I haven't sat down to write the proof) as softmax attention is a non-polynomial function of inputs, 1-layer transformers with unbounded number of heads can implement
*arbitrary*functions of the inputs. On the other hand, skip n-grams for any fixed n obviously are not universal (i.e. they can't implement XOR, as in the example of the "1-layer transformers =/= skip trigrams post). So even theoretically (without constructing any examples), it seems unlikely that you should think of 1L transformers as*only*skip-trigrams, though whether or not this occurs often in real networks is an empirical question (to which I'm pretty sure the answer is yes, because e.g. copy suppression heads are a common motif).

^{^}Scare quotes are here because their example is really disanalogous to MLP superposition. IE as they point out in their second post, their task is well thought of as naturally being decomposed into two attention heads; and a model that has n >= 2 heads isn't really "placing circuits in superposition" so much as doing a natural task decomposition that they didn't think of.

In fact, it feels like that result is a cautionary tale that just because a model implements an algorithm in a non-basis aligned manner, does not mean the model is implementing an approximate algorithm that requires exploiting near-orthogonality in high-dimensionality space (the traditional kind of residual stream/MLP activation superposition), nor does it mean that the algorithm is "implementing more circuits than is feasible" (i.e. the sense that they try to construct in the May 2023 update). You might just not understand the algorithm the model is implementing!

If I were to speculate more, it seems like they were screwed over by continuing to think about one-layer attention model as a set of skip trigrams, which they are not. More poetically, if your "natural" basis isn't natural, then of course your model won't use your "natural" basis.

^{^}Note that this construction isn't optimal, in part because of the fact that output tokens corresponding to the same token occuring twice occur half as often as those with two different tokens, while this construction gets lower log loss in the one-token case as in the two distinct token case. But the qualitative analysis carries through regardless.

1512

The general version of this statement is something like: if your beliefs satisfy the law of total expectation, the variance of the whole process should equal the variance of all the increments involved in the process.^{[1]} In the case of the random walk where at each step, your beliefs go up or down by 1% starting from 50% until you hit 100% or 0% -- the variance of each increment is 0.01^2 = 0.0001, and the variance of the entire process is 0.5^2 = 0.25, hence you need 0.25/0.0001 = 2500 steps in expectation. If your beliefs have probability p of going up or down by 1% at each step, and 1-p of staying the same, the variance is reduced by a factor of p, and so you need 2500/p steps.

(Indeed, something like this standard way to derive the expected steps before a random walk hits an absorbing barrier).

Similarly, you get that if you start at 20% or 80%, you need 1600 steps in expectation, and if you start at 1% or 99%, you'll need 99 steps in expectation.

One problem with your reasoning above is that as the 1%/99% shows, needing 99 steps in expectation *does not* mean you will take 99 steps with high probability -- in this case, there's a 50% chance you need only one update before you're certain (!), there's just a tail of very long sequences. In general, the expected value of variables need not look like

I also think you're underrating how much the math changes when your beliefs do not come in the form of uniform updates. In the most extreme case, suppose your current 50% doom number comes from imagining that doom is uniformly distributed over the next 10 years, and zero after -- then the median update size per week is only 0.5/520 ~= 0.096%/week, and the expected number of weeks with a >1% update is 0.5 (it only happens when you observe doom). Even if we buy a time-invariant random walk model of belief updating, as the expected size of your updates get larger, you also expect there to be quadratically fewer of them -- e.g. if your updates came in increments of size 0.1 instead of 0.01, you'd expect only 25 such updates!

Applying stochastic process-style reasoning to beliefs is empirically very tricky, and results can vary a lot based on seemingly reasonable assumptions. E.g. I remember Taleb making a bunch of mathematically sophisticated arguments^{[2]} that began with "Let your beliefs take the form of a Wiener process^{[3]}" and then ending with an absurd conclusion, such as that 538's forecasts are obviously wrong because their updates aren't Gaussian distributed or aren't around 50% until immediately before the elction date. And famously, reasoning of this kind has often been an absolute terrible idea in financial markets. So I'm pretty skeptical of claims of this kind in general.

^{^}There's some regularity conditions here, but calibrated beliefs that things you eventually learn the truth/falsity of should satisfy these by default.

^{^}Often in an attempt to Euler people who do forecasting work but aren't super mathematical, like Philip Tetlock.

^{^}This is what happens when you take the limit of the discrete time random walk, as you allow for updates on ever smaller time increments. You get Gaussian distributed increments per unit time -- W_t+u - W_t ~ N(0, u) -- and since the tail of your updates is very thin, you continue to get qualitatively similar results to your discrete-time random walk model above.

And yes, it is ironic that Taleb, who correctly points out the folly of normality assumptions repeatedly, often defaults to making normality assumptions in his own work.

73

When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.

52

I don't know what the "real story" is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying "something something weight norm ergo generalizing". Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.

One issue is that our picture doesn't consider learning dynamics that seem actually important here. For example, it seems that one of the mechanisms that may explain why weight decay seems to matter so much in the Omnigrok paper is because fixing the norm to be large leads to an effectively tiny learning rate when you use Adam (which normalizes the gradients to be of fixed scale), **especially*** *when there's a substantial radial component (which there is, when the init is too small or too big). This both probably explains why they found that training error was high when they constrain the weights to be sufficiently large in all their non-toy cases (see e.g. the mod add landscape below) and probably explains why we had difficulty using SGD+momentum (which, given our bad initialization, led to gradients that were way too big at some parts of the model especially since we didn't sweep the learning rate very hard).

There's also some theoretical results from SLT-related folk about how generalizing circuits achieve lower train loss per parameter (i.e. have higher circuit efficiency) than memorizing circuits (at least for large p), which seems to be a part of the puzzle that neither our work nor the Omnigrok touched on -- why is it that generalizing solutions have lower norm? IIRC one of our explanations was that weight decay "favored more distributed solutions" (somewhat false) and "it sure seems empirically true", but we didn't have anything better than that.

There was also the really basic idea of how a relu/gelu network may do multiplication (by piecewise linear approximations of x^2, or by using the quadratic region of the gelu for x^2), which (I think) was first described in late 2022 in Ekin Ayurek's "Transformers can implement Sherman-Morris for closed-form ridge regression" paper? (That's not the name, just the headline result.)

Part of the story for grokking in general may also be related to the Tensor Program results that claim the gradient on the embedding is too small relative to the gradient on other parts of the model, with standard init. (Also the embed at init is too small relative to the unembed.) Because the embed is both too small and do, there's no representation learning going on, as opposed to just random feature regression (which overfits in the same way that regression on random features overfits absent regularization).

In our case, it turns out not to be true (because our network is tiny? because our weight decay is set aggressively at lamba=1?), since the weights that directly contribute to logits (W_E, W_U, W_O, W_V, W_in, W_out) all quickly converge to the same size (weight decay encourages spreading out weight norm between things you multiply together), while the weights that do not all converge to zero.

**Bringing it back to the topic at hand: **There's often a lot more "small" confusions that remain, even after doing good toy models work. It's not clear how much any of these confusions matter (and do any of the grokking results our paper, Ziming Liu et al, or the GDM grokking paper found matter?).

^{^}Haven't checked, might do this later this week.

Ah, you're correct, it's from the original instructGPT release in Jan 2022:

https://openai.com/index/instruction-following/