Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm

3Rohin Shah

New Comment

In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.

I disagree at least with the "almost completely" version of this claim:

Notice that the operation you want to learn is

manifestlya convolution operation, i.e.

This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" representation back into the "magnitude" representation, and then simply adds them as normal.

Some evidence for this:

- I've heard it claimed that an LLM represented the number of characters in a token as a magnitude, which was used for deciding whether a line of code was > 80 characters (useful for predicting line breaks).
- This paper trains on non-modular addition and gets this result. (Note however that the paper has a highly unusual setting that isn't representative of typical network training, and arguably the setup is such that you have to get this result: in particular the architecture enforces that the embeddings of the two inputs are added together, which wouldn't work in a Fourier basis. I cite it as evidence that networks do learn magnitude representations when forced to do so.)

It seems like if you believe "when the operation you want to learn is a convolution operation, then you will learn the Fourier basis", you should also believe that you'll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that's false.

Fwiw, I agree that the algorithm is quite "mathematically natural" (indeed, one person came up with the algorithm independently, prompted by "how would you solve this problem if you were a Transformer"), though I feel like the "modular" part is pretty crucial for me (and the story I'd tell would be the one in Daniel's comment).

These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.

In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking' , Nanda and Lieberum write:

And

But the casual reader should use caution! It is in fact the case that "Inputs x,y are given as one-hot encoded vectors in Rp ". This point is of course emphasized more in the full notebook (it has to be, that's where the code is), and the arXiv paper that followed is also

(1x∗1y)(t)=∑s1x(t−s)1y(s)=1x+y(t).much clearerabout this point. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given x and y as inputs. It isnottrained to 'act' on the numbersx, ythemselves.When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting

is already a significant transformation.You have moved away from having the number x be represented by its own magnitude. You instead have a situation in which x and y now really live 'in the domain' (its almost like a dual point of view: The number x is not the size of the signal, butthe positionat whichthe input signal is non-zero).So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the

indicator functionst↦1x(t) and t↦1y(t). Here, t is something like the 'token variable' in the sense that these are functionsonthe vocabulary. And if we essentially ignore the additional tokens for | and =, we can think that these are functions on the group Z/pZ and that we would like the network to learn to produce the function t↦1x+y(t) at its output neurons.In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.

Notice that the operation you want to learn is

manifestlya convolution operation, i.e.And (as I distinctly remember being made to practically chant in an 'Analysis of Boolean Functions' class given by Tom Sanders) the Fourier Transform is the (essentially unique) change of basis that

simultaneously diagonalizes all convolution operations.This is coming close to saying something like: There isonespecial basis that makes the operation you want to learnuniquely easy to do using matrix multiplications,and that basis is the Fourier basis.