Paper: Transformers learn in-context by gradient descent

LawrenceC

16th Dec 2022

This is a linkpost for https://arxiv.org/abs/2212.07677

The paper argues that auto-regressive transformers implement in-context learning via gradient-based optimization on in-context data.

The authors start by pointing out that with a single linear self-attention (LSA) layer (that is, no softmax), a Transformer can implement one step of gradient descent on the L2 regression loss (a fancy way of saying w -= LR * (w x - y) x^T), and confirm this result empirically. They extend this result by showing that an N-layer LSA-only transformer behaves similarly to N steps of gradient descent on small linear regression tasks, both in and out of distribution. They also find that the results pretty much hold with softmax self-attention (which isn’t super surprising given you can make a softmax pretty linear).
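
A minimal sketch of why the one-step claim holds in the simplest case (not taken from the paper): starting from w = 0, one gradient step on the L2 loss gives w = LR * sum_i y_i x_i, so the prediction on a query point is LR * sum_i y_i (x_i . x_query). That is an unnormalized linear attention readout with keys x_i, values y_i, and the query point as the attention query. The function names, the toy data, the zero initialization, and the scalar targets are all assumptions made for illustration; the paper's construction is more general.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(p * q for p, q in zip(a, b))


def gd_step(w, xs, ys, lr):
    """One gradient step on L(w) = 1/2 * sum_i (w . x_i - y_i)^2."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = dot(w, x) - y  # residual (w . x_i - y_i)
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [wj - lr * gj for wj, gj in zip(w, grad)]


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: keys x_i, values y_i, query x_query."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Toy in-context examples (illustrative values only).
xs = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]]
ys = [3.0, -1.5, 0.7]
x_query = [0.2, 0.4]
lr = 0.1

w1 = gd_step([0.0, 0.0], xs, ys, lr)  # start from w = 0
pred_gd = dot(w1, x_query)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr)
print(pred_gd, pred_attn)
```

The two printed numbers agree up to floating-point error, which is the sense in which a single softmax-free attention layer can reproduce one gradient step in this toy setting.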

Next, they show empirically that the forward pass of a small transformer with MLPs behaves similarly to a meta-learned MLP + one step of gradient descent on a toy non-linear regression task, again in terms of both in-distribution and OOD performance.

They then show how you can interpret an induction head as a single step of gradient descent, and provide circumstantial evidence that this explains some of the in-context learning observed in Olsson et al 2022. Specifically, they show that 1) two-layer attention-only transformers converge to a loss consistent with one step of GD on this task, and 2) the first layer of the network learns to copy tokens one sequence position over, prior to the emergence of in-context learning.
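
For readers who haven't seen induction heads before, here is a rough functional sketch of the two-layer behavior described above: a first step that gives each position a copy of the previous token, and a second step that looks for an earlier position whose copied token matches the current one and outputs what followed it. This only illustrates the input/output behavior; it is not the paper's gradient-descent construction, and the function names and the "most recent match" tie-breaking rule are assumptions.

```python
def previous_token_layer(tokens):
    """Layer 1: pair each token with the token one position earlier."""
    return [
        (tok, tokens[i - 1] if i > 0 else None)
        for i, tok in enumerate(tokens)
    ]


def induction_predict(tokens):
    """Layer 2: find an earlier position whose previous token equals the
    current token, and predict the token stored at that position."""
    annotated = previous_token_layer(tokens)
    current = tokens[-1]
    # Scan earlier positions, most recent first (an illustrative choice).
    for tok, prev in reversed(annotated[:-1]):
        if prev == current:
            return tok
    return None  # no earlier occurrence to copy from


prediction = induction_predict(["A", "B", "C", "A"])
print(prediction)  # "B": the token that followed the earlier "A"
```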

(EDIT:) davidad says below:

this is strong empirical evidence that mesa-optimizers are real in practice

Personally, while I think you could place this in the same category as papers like RL^2 or In-context RL with Algorithmic Distillation, which also show mesa optimization, I think the more interesting results are the mechanistic ones -- i.e., that some forms of mesa optimization in the model seem to be implemented via something like gradient descent.

(EDIT 2) nostalgebraist pushes back on this claim in this comment:

Calling something like this an optimizer strikes me as vacuous: if you don't require the ability to adapt to a change of objective function, you can always take any program and say it's "optimizing" some function. Just pick a function that's maximal when you do whatever it is that the program does.
It's not vacuous to say that the transformers in the paper "implement gradient descent," as long as one means they "implement [gradient descent on $L_{2}$ loss]" rather than "implement [gradient descent] on [ $L_{2}$ loss]." They don't implement general gradient descent, but happen to coincide with the gradient step for $L_{2}$ loss.

(Nitpick: I do want to push back a bit on their claim to "mechanistically understand the inner workings of optimized Transformers that learn in-context", since they've only really looked at the mechanism by which single-layer attention-only transformers perform in-context learning.)

davidad (4y)

I think it's too easy for someone to skim this entire post and still completely miss the headline "this is strong empirical evidence that mesa-optimizers are real in practice".

tailcalled (4y)

I don't think so.

Like technically yes, it shows that there is an internal optimization process that is running in the networks, but much of the meat of optimization such as instrumental convergence/power-seeking depends on the structure of the function one is optimizing over.

If the function is not consequentialist - if it doesn't attempt to compute the real-world consequences of different outputs and grade things based on those consequences - then much of the discussion about optimizers does not apply.

LawrenceC (4y)

Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:

A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system

And the following definition of a mesa-optimizer:

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.

In this paper, the authors show that transformers implement gradient descent (an optimization algorithm) to optimize a particular objective ($L_{2}$ loss). (This is very similar to the outer optimization loop that's altering the network parameters.) So the way davidad uses the word "mesa-optimization" is consistent with prior work.

I also think that it's pretty bad to claim that something is only an optimizer if it's a power-seeking consequentialist agent. For example, this would imply that the outer loop that produces neural network policies (by gradient descent on network parameters) is not an optimizer!

Of course, I agree that it's not the case that these transformers are power-seeking consequentialist agents. And so this paper doesn't provide direct evidence that transformers contain power-seeking consequentialist agents (except for people who disbelieved in power-seeking consequentialist agents because they thought NNs are basically incapable of implementing any optimizer whatsoever).

nostalgebraist (4y)

That definition of "optimizer" requires

some objective function that is explicitly represented within the system

but that is not the case here.

There is a fundamental difference between

1. Programs that implement the computation of taking the derivative. ($f \to f^{'}$, or perhaps $f, x \to f^{'}(x)$.)
2. Programs that implement some particular function $g$, which happens to be the derivative of some other function. ($x \to g(x)$, where it so happens that $g = F^{'}$ for some $F$.)

The transformers in this paper are programs of the 2nd type. They don't contain any logic about taking the gradient of an arbitrary function, and one couldn't "retarget" them toward $L_{1}$ loss or some other function.

(One could probably construct similar layers that implement the gradient step for $L_{1}$ , but they'd again be programs of the 2nd type, just with a different hardcoded $g$ .)
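
A small sketch of the distinction between the two kinds of programs, using a scalar toy example. The names `derivative` and `hardcoded_g`, the finite-difference approximation, and the specific objectives are assumptions chosen for illustration and do not come from the paper.

```python
def derivative(f, x, h=1e-6):
    """First kind: takes the objective f as an input and differentiates it
    (approximated here with a central finite difference)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def hardcoded_g(x):
    """Second kind: a fixed function that happens to equal F'(x)
    for F(x) = x**2. It contains no reference to F at all."""
    return 2 * x


def squared(x):
    return x ** 2


def abs_dist(x):
    return abs(x - 3.0)  # an L1-style objective


x0 = 1.5
print(derivative(squared, x0), hardcoded_g(x0))   # both close to 3.0
print(derivative(abs_dist, x0), hardcoded_g(x0))  # about -1.0 vs. still 3.0
```

The first kind can be pointed at a new objective by passing in a different `f`; the second kind would have to be rewritten.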

Calling something like this an optimizer strikes me as vacuous: if you don't require the ability to adapt to a change of objective function, you can always take any program and say it's "optimizing" some function. Just pick a function that's maximal when you do whatever it is that the program does.

It's not vacuous to say that the transformers in the paper "implement gradient descent," as long as one means they "implement [gradient descent on $L_{2}$ loss]" rather than "implement [gradient descent] on [ $L_{2}$ loss]." They don't implement general gradient descent, but happen to coincide with the gradient step for $L_{2}$ loss.

If in-context learning in real transformers involves figuring out the objective function from the context, then this result cannot explain it. If we assume some fixed objective function (perhaps LM loss itself?) and ask whether the model might be doing gradient steps on this function internally, then these results are relevant.

LawrenceC (4y)

I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I've edited the post to link to this clarification.

That being said, I'm still confused about the details.

Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on some prompt or other, was able to alternate between performing gradient descent on three types of objectives (say, $L_{1}$, $L_{2}$, $L_{\infty}$) -- would this suffice? How about if, instead, there wasn't any prompt that let me switch between three types of objectives, but there was a parameter inside of the neural network that I could change that causes the circuit to optimize different objectives? How much of the circuit do I have to change before it becomes a new circuit instead of retargeting the optimizer?

I guess part of the answer to these questions might look like, "there might not be a clear cutoff, in the same sense that there's not a clear cutoff for most other definitions that we use in AI alignment ('agent' or 'deceptive alignment' for example)", while another part might be "this is left for future work".

*This is also similar to the definition used for inner misalignment in Shah et al's Goal Misgeneralization paper:

We now characterize goal misgeneralization. Intuitively, goal misgeneralization occurs when we learn a function $f_{\theta_{\text{bad}}}$ that has robust capabilities but pursues an undesired goal.
It is quite challenging to define what a “capability” is in the context of neural networks. We provide a provisional definition following Chen et al. [11]. We say that the model is capable of some task X in setting Y if it can be quickly tuned to perform task X well in setting Y (relative to learning X from scratch). For example, tuning could be done by prompt engineering or by fine-tuning on a small quantity of data [52].

LawrenceC (4y)

Sure, edited the post to clarify.

joswald (4y)

Hi there - I am the first author! Thanks for this very nice write up. Regarding: "mechanistically understand the inner workings of optimized Transformers that learn in-context" - it's definitely fair to say that we do this (only) for self-attention-only Transformers! Also, I try to be more careful and (hopefully consistently) only claim this for the simple problems we studied... working on v2 including language experiments, and I am also trying to find a way to verify the hypotheses in pretrained models. Thanks again!

LawrenceC (4y)

You're welcome, and I'm glad you think the writeup is good.

Thank you for the good work.
