Maze-solving agents: Add a top-right vector, make the agent go to the top-right

peligrietzer; lisathiergart

This is really cool, thanks for posting it. I also would not have expected this result. In particular, the fact that the top right vector generalizes across mazes is surprising. (Even generalizing across mouse position but not maze configuration is a little surprising, but not as much.)

Since it helps to have multiple interpretations of the same data, here's an alternative one: The top right vector is modifying the neural network's perception of the world, not its values. Let's say the agent's training process has resulted in it valuing going up and to the right, and it also values reaching the cheese. Maybe its utility looks like x+y+10*[found cheese] (this is probably very over-simplified). In that case, the highest reachable x+y coordinate is important for deciding whether it should go to the top right, or if it should go directly to the cheese. Now if we consider how the top right vector was generated, the most obvious interpretation is that it should make the agent think there's a path all the way to the top right corner, since that's the difference between the two scenarios that were subtracted to produce it. So the agent concludes that the x+y part of its utility function is dominant, and proceeds to try and reach the top right corner.
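To make the "perception, not values" reading concrete, here is a minimal sketch of the toy utility above. Everything in it is an illustrative assumption: the coordinates, the function names `utility` and `choose_target`, and the idea that the agent simply compares two candidate destinations. It is not a claim about what the trained network actually computes.

```python
# Hypothetical toy model of the "perception, not values" interpretation.
# All names and numbers are made up for illustration.

def utility(x, y, found_cheese):
    """The guessed utility from the comment: x + y + 10 * [found cheese]."""
    return x + y + 10 * int(found_cheese)

def choose_target(cheese_pos, best_reachable_corner):
    """Pick whichever perceived destination has the higher utility."""
    cheese_value = utility(*cheese_pos, found_cheese=True)
    corner_value = utility(*best_reachable_corner, found_cheese=False)
    return "cheese" if cheese_value >= corner_value else "top-right"

cheese = (3, 4)                    # cheese utility: 3 + 4 + 10 = 17
true_best_corner = (6, 5)          # highest x + y actually reachable: 11
perceived_best_corner = (12, 12)   # what the agent "sees" after the edit: 24

# The values (the utility function) are identical in both calls;
# only the perceived reachable corner differs.
print(choose_target(cheese, true_best_corner))       # cheese
print(choose_target(cheese, perceived_best_corner))  # top-right
```

The point of the sketch is that the same utility function yields different behavior once the perceived "highest reachable x+y" changes, which is all this interpretation needs.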

Predictions:

Algebraic value editing works (for at least one "X vector") in LMs: 85%
1. Most of the "no" probability comes from the attention mechanism breaking this in some hard-to-fix way. Some uncertainty comes from not knowing how much effort you'd put in to get around this. If you're going to stop after the first try, then put me down for 70% instead. I'm assuming here that an X-vector should generalize across inputs, in the same way that the top right vector generalizes across mazes and mouse-positions.
Algebraic value editing works better for larger models, all else equal: 55%
1. Seems like the kind of thing that might be true, but I'm really not sure.
If value edits work well, they are also composable: 70%
1. Yeah, seems pretty likely.
If value edits work at all, they are hard to make without substantially degrading capabilities: 50%
1. I'm too uncertain about your qualitative judgement of what "substantial" and "capabilities" mean to give a meaningful probability here. Performance in terms of logprob almost certainly gets worse, not sure how much, and it might depend on the X-vector. Specific benchmarks and thresholds would help with making a concrete prediction here.
We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
1. "truth-telling" 50%
  1. This one seems different from and harder than the others. I can imagine a vector that decreases the network's truth-telling, but it seems a little less likely that we could make the network more likely to tell the truth with a single vector. We could find vectors that make it less likely to write fiction, or describe conspiracy theories, and we could add them to get a vector that would do both, but I don't think this would translate to increased truth-telling in other situations where it would normally not tell the truth for other reasons. This assumes that your test cases for the truth vector go beyond the test cases you used to generate it, however.
2. "love" 80%
3. "accepting death" 80%
4. "speaking French" 85%

[-]LawrenceC3y70

Great work, glad to see it out!

Why doesn't algebraic value editing break all kinds of internal computations?! What happened to the "manifold of usual activations"? Doesn't that matter at all?
Or the hugely nonlinear network architecture, which doesn't even have a persistent residual stream? Why can I diff across internal activations for different observations?
Why can I just add 10 times the top-right vector and still get roughly reasonable behavior?
And the top-right vector also transfers across mazes? Why isn't it maze-specific?
To make up some details, why wouldn't an internal "I want to go to top-right" motivational information be highly entangled with the "maze wall location" information?

This was also the most surprising part of the results to me.

I think both this work and Neel's recent Othello post do provide evidence that at least for small-medium sized neural networks, things are just... represented ~linearly (Olah et al.'s Features as Directions hypothesis). Note that Chris Olah's earlier work on features as directions was not done on transformers either, but on conv nets without residual streams.

[-]jacquesthibs3y20

Indeed! When I looked into model editing stuff with the end goal of “retargeting the search”, the finickiness and breakdown of internal computations was the thing that eventually updated me away from continuing to pursue this. I haven’t read these maze posts in detail yet, but the fact that the edits don’t ruin the network's internal computations is surprising and makes me think about spending time again in this direction.

I’d like to eventually think of similar experiments to run with language models. You could have a language model learn how to solve a text adventure game, and try to edit the model in similar ways as these posts, for example.

Edit: just realized that the next post might be with GPT-2. Exciting!

[-]RobertKirk3y10

I think the hyperlink for "conv nets without residual streams" is wrong? It's https://www.westernunion.com/web/global-service/track-transfer for me

[-]LawrenceC3y10

lol, thanks, fixed

[-]Aryan Bhatt3y20

I wish I knew why.

Same.

I don't really have any coherent hypotheses (not that I've tried for any fixed amount of time by the clock) for why this might be the case. I do, however, have a couple of vague suggestions for how one might go about gaining slightly more information that might lead to a hypothesis, if you're interested.

The main one involves looking at the local nonlinearities of the few layers after the intervention layer at various inputs. By this I mean examining diff(t) = f(input + t*top_right_vec) - f(input) as a function of t (for small values of t, in particular), where f = nn.Sequential({the n layers after the intervention layer}) for various small integers n.

One of the motivations for this is that it feels more confusing that [adding works and subtracting doesn't] than that [increasing the coefficient strength does different things in different regimes, i.e. for different coefficient strengths], but if you think about it, both of those are just us being surprised/confused that the function I described above is locally nonlinear for various values of t.^[1] It seems possible, then, that examining the nonlinearities in the subsequent few layers could shed some light on a slightly more general phenomenon that'll also explain why adding works but subtracting doesn't.
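The diff(t) probe described above can be sketched roughly as follows. This is a toy stand-in, not the maze network: the two-layer ReLU function `f`, the weights `W1` and `W2`, the input `x`, and the steering vector `vec` are all made up for illustration, and plain Python lists are used instead of `nn.Sequential` so the snippet has no dependencies. The hypothetical helper `asymmetry` measures how far adding and subtracting the vector are from being mirror images, which is zero wherever f is locally linear.

```python
# Toy stand-in for "the n layers after the intervention layer".
# Weights, input, and steering vector are illustrative assumptions.

def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

W1 = [[1.0, -1.0], [0.5, 1.0]]
W2 = [[1.0, 1.0], [-1.0, 2.0]]

def f(v):
    """Two linear layers with a ReLU in between."""
    return linear(W2, relu(linear(W1, v)))

def diff(x, vec, t):
    """diff(t) = f(x + t * vec) - f(x)."""
    shifted = [a + t * b for a, b in zip(x, vec)]
    return [p - q for p, q in zip(f(shifted), f(x))]

def asymmetry(x, vec, t):
    """Largest component of diff(t) + diff(-t); zero if f is linear on [-t, t]."""
    added, subtracted = diff(x, vec, t), diff(x, vec, -t)
    return max(abs(a + s) for a, s in zip(added, subtracted))

x = [1.0, 0.5]      # hypothetical activation at the intervention layer
vec = [-1.0, 1.0]   # hypothetical "top-right vector"

for t in (0.1, 0.2, 0.5, 1.0):
    print(t, diff(x, vec, t), diff(x, vec, -t), asymmetry(x, vec, t))
```

In this toy, the first hidden unit's pre-activation is 0.5 - 2t, so it switches off at t = 0.25: below that coefficient, adding and subtracting are exact mirror images, and above it they diverge. Locating such thresholds in the real network is the kind of information the probe is meant to surface.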

It's also possible, of course, that all the relevant nonlinearities kick in much further down the line, which would render this pretty useless. If this turns out to be the case, one might try finding "cheese vectors" or "top-right vectors" in as late a layer as possible^[2], and then re-attempt this.

^[1] We only care more about the former confusion (that adding works and subtracting doesn't) because we're privileging t=0, which isn't unreasonable, but perhaps zooming out just a bit will help, idk
^[2] I'm under the impression that the current layer wasn't chosen for much of a particular reason, so it might be a simple matter to just choose a later layer that performs nearly as well?

[-]TurnTrout3y20

I'm under the impression that the current layer wasn't chosen for much of a particular reason, so it might be a simple matter to just choose a later layer that performs nearly as well?

The current layer was chosen because I looked at all the layers for the cheese vector, and the current layer is the only one (IIRC) which produced interesting/good results. I think the cheese vector doesn't really work at other layers, but haven't checked recently.

^[1] EDIT 4/16/23: The original version of this post used the word "patch", where I now think "modification" would be appropriate. In this post, we aren't "patching in" activations wholesale from other forward passes, but rather e.g. subtracting or adding activation vectors to the forward pass.

^[2] In my experience, the top right corner must be reachable by the agent. I can't just plop down an isolated empty square in the absolute top right.

^[3] We decided on this layer (block2.res1.resadd_out) for the cheese vector by simply subtracting the cheese vector from all available layers, and choosing the one layer which seemed interesting.

^[4] See ^[7], though: Putting aside the $5 \times 5$ model, adding the cheese vector in seed 0 for the $6 \times 6$ model does increase cheese-seeking, even though the cheese vector technique otherwise affects both models extremely similarly.

^[5] This probably doesn't make sense in a strict sense, because the situations' chemical and electrical configurations probably can't add/subtract from each other.

^[6] The analogy might break down here at step 4, if the top-right vector isn't well-described as making the network "want" the top-right corner more (in certain mazes). However, given available data, that description seems reasonable to me, where "wants X" grounds out as "contextually influences the policy to steer towards X." I could imagine changing my mind about that.

In any case, I think the analogy is still evocative, and points at hopes I have for AVE.

^[7] The notebook results won't be strictly the same if you change model sizes. The plotly charts use preloaded data from the $5 \times 5$ model, so obviously that won't update.

Less trivially, adding the cheese vector seems to work better for $n = 6$ compared to $n = 5$:

For the $6 \times 6$ net, if you **add** the cheese vector instead of subtracting it, you do increase cheese-seeking on seed 0! In contrast, this was not true for the $5 \times 5$ net.

AI ALIGNMENT FORUM
AF

