iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup)

Reply

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Tom Lieberum2y1-2

I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

Reply

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Tom Lieberum2y00

I can't speak to the option for remote work but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). The relevant visa to google is the Skilled Worker / Tier 2 visa if you want to know more.

ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.

Reply

Hypothesis: gradient descent prefers general circuits

Tom Lieberum2y20

Oh I thought figure 1 was S5 but it actually is modular division. I'll give that a go..

Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing lr to 5x from the beginning works super well but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work (nor when switching later)

So maybe the initial observation is more a general/global property of the loss landscape for the task and not of the particular region during grokking?

Reply

Hypothesis: gradient descent prefers general circuits

Tom Lieberum2y30

So I ran some experiments for the permutation group S_5 with the task x o y = ?

Interestingly here increasing the learning rate just never works. I'm very confused.

Reply

Hypothesis: gradient descent prefers general circuits

Tom Lieberum2y00

I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.

There is actually an overlap between the train/val curves going up. Might be an artifact of the simplicity of the task or that I didn't properly split the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.

Reply

Hypothesis: gradient descent prefers general circuits

Tom Lieberum2y00

Yep I used my own re-implementation, which somehow has slightly different behavior.

I'll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.

Reply

Hypothesis: gradient descent prefers general circuits

Tom Lieberum2y20

I'm not sure I understand.

I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000 step mark in the default setting, whereas it is complete by the 450 step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale

Reply

Hypothesis: gradient descent prefers general circuits

Tom Lieberum2y170

It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.

I ran a quick-and-dirty experiment and it does in fact look like you can just crank up the learning rate at the point where some part of grokking happens to speed up convergence significantly. See the wandb report:

https://wandb.ai/tomfrederik/interpreting_grokking/reports/Increasing-Learning-Rate-at-Grokking--VmlldzoxNTQ2ODY2?accessToken=y3f00qfxot60n709pu8d049wgci69g53pki6pq6khsemnncca1dnmocu7a3d43y8

I set the LR to 5x the normal value (100x tanked the accuracy, 10x still works though). Of course you would want to anneal it after grokking was finished.

Reply

Deducing Impact

Tom Lieberum5y10

While I agree that using percentages would make impact more comparable between agents and timesteps, it also leads to counterintuitive results (at least counterintuitive to me)

Consider a sequence of utilities at times 0, 1, 2 with $U_{0} = 1$ , $U_{1} = 0.01$ and $U_{2} = 0$ .

Now the drop from $U_{1}$ to $U_{2}$ would be more dramatic (decrease by 100%) compared to the drop from $U_{0}$ to $U_{1}$ (decrease by 99%) if we were using percentages. But I think the agent should 'care more' about the larger drop in absolute utility (i.e. spend more resources to prevent it from happening) and I suppose we might want to let impact correspond to something like 'how much we care about this event happening'.

Reply