AI ALIGNMENT FORUM

Alex Turner

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact

Effective layer horizon of transformer circuits. The residual stream norm grows exponentially over the forward pass, with a growth rate of about 1.05. Consider the residual stream at layer 0, with norm (say) of 100. Suppose the MLP heads at layer 0 have outputs of norm (say) 5. Then after 30 layers, the residual stream norm will be $100 \cdot 1.05^{30} \approx 432.2$. Then the MLP-0 outputs of norm 5 should have a significantly reduced effect on the computations of MLP-30, due to their smaller relative norm.
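The arithmetic in this example can be checked with a short sketch. It assumes a constant per-layer growth rate and a fixed norm for the layer-0 output; the function name is hypothetical, and the numbers are the ones used in the paragraph above.

```python
def relative_norm(initial_norm, output_norm, growth_rate, num_layers):
    """Return (stream norm after num_layers, output_norm / that stream norm).

    Assumes the residual stream norm is multiplied by growth_rate at every layer.
    """
    stream_norm = initial_norm * growth_rate ** num_layers
    return stream_norm, output_norm / stream_norm


# Numbers from the example: stream norm 100 at layer 0, MLP-0 output norm 5.
ratio_early = 5.0 / 100.0
stream_norm, ratio_late = relative_norm(100.0, 5.0, 1.05, 30)

print(f"stream norm at layer 30: {stream_norm:.1f}")  # about 432.2
print(f"relative norm of MLP-0 output at layer 0:  {ratio_early:.4f}")
print(f"relative norm of MLP-0 output at layer 30: {ratio_late:.4f}")
```

Under these assumptions the layer-0 output goes from 5% of the stream norm to a bit over 1% by layer 30.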

On input tokens $x$, let $f_i(x)$ be the original model's sublayer outputs at layer $i$. I want to think about what happens when the later sublayers can only "see" the last few layers' worth of outputs.

Definition: Layer-truncated residual stream. A truncated residual stream from layer $n_1$ to layer $n_2$ is formed by the original sublayer outputs from those layers:

$$h_{n_1:n_2}(x) := \sum_{i=n_1}^{n_2} f_i(x).$$

Definition: Effective layer horizon. Let $k > 0$ be an integer. Suppose that for all $n \geq k$, we patch in $h_{(n-k):n}(x)$ for the usual residual stream inputs $h_{0:n}(x)$.[1] Let the effective layer horizon be the smallest $k$ for which the model's outputs and/or capabilities are "qualitatively unchanged."
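Here is a minimal sketch of the patching experiment on a toy residual network, to make the definition concrete. Everything in it is an assumption for illustration: the sublayers are random `tanh` maps rather than attention or MLP blocks, the truncated stream is taken to be the sum of the original outputs in the window, the embedding counts as the first output, and the final readout is taken to be the sum of all (patched) outputs. A real experiment would patch activations in an actual transformer.

```python
import math
import random


def make_sublayer(dim, rng):
    """Hypothetical toy sublayer: a random linear map followed by tanh."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]

    def sublayer(stream):
        return [math.tanh(sum(w * s for w, s in zip(row, stream))) for row in weights]

    return sublayer


def vec_sum(vectors):
    """Elementwise sum of a non-empty list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def original_outputs(embedding, sublayers):
    """outputs[0] is the embedding; outputs[n + 1] is sublayer n's output."""
    outputs = [embedding]
    for sublayer in sublayers:
        outputs.append(sublayer(vec_sum(outputs)))  # reads the full stream
    return outputs


def patched_outputs(original, sublayers, horizon):
    """Rerun each sublayer on the sum of only the last `horizon` ORIGINAL outputs."""
    patched = [original[0]]
    for n, sublayer in enumerate(sublayers):
        window = original[max(0, n + 1 - horizon): n + 1]
        patched.append(sublayer(vec_sum(window)))
    return patched


def readout_change(original, patched):
    """Euclidean distance between the two summed (final) residual streams."""
    return math.dist(vec_sum(original), vec_sum(patched))


rng = random.Random(0)
dim, depth = 4, 8
sublayers = [make_sublayer(dim, rng) for _ in range(depth)]
embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]
original = original_outputs(embedding, sublayers)

changes = {
    k: readout_change(original, patched_outputs(original, sublayers, k))
    for k in range(1, depth + 1)
}
for k, change in changes.items():
    print(f"horizon {k}: readout change {change:.4f}")
```

In this toy setup, the effective layer horizon would be the smallest horizon whose readout change falls below whatever threshold one treats as "qualitatively unchanged." The threshold itself is a judgment call that the sketch does not settle.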

Effective layer horizons (if they exist) would greatly simplify searches for circuits within models. Additionally, they would be further evidence (but not conclusive[2]) towards hypotheses like "Residual Networks Behave Like Ensembles of Relatively Shallow Networks."

Lastly, slower norm growth probably causes the effective layer horizon to be lower. In that case, simply measuring residual stream norm growth would tell you a lot about the depth of circuits in the model, which could be useful if you want to regularize against that or otherwise decrease it (eg to decrease the amount of effective serial computation).

Do models have an effective layer horizon? If so, what does it tend to be as a function of model depth and other factors --- are there scaling laws?

1. ^

For notational ease, I'm glossing over the fact that we'd be patching in different residual streams for each sublayer of layer $n$. That is, we wouldn't patch in the same activations for both the attention and MLP sublayers of layer $n$.

2. ^

For example, if a model has an effective layer horizon of 5, then a circuit could run through the whole model because a layer $n$ head could read out features output by a layer $n-5$ circuit, and then layer $n+5$ could read from layer $n$...

Ever since I entered the community, I've definitely heard of people talking about policy gradient as "upweighting trajectories with positive reward/downweighting trajectories with negative reward" since 2016, albeit in person. I remember being shown a picture sometime in 2016/17 that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn't find it, so reconstructing it from memory)

Knowing how to reason about "upweighting trajectories" when explicitly prompted or in narrow contexts of algorithmic implementation is not sufficient to conclude "people basically knew this perspective" (but it's certainly evidence). See Outside the Laboratory:

Now suppose we discover that a Ph.D. economist buys a lottery ticket every week.  We have to ask ourselves:  Does this person really understand expected utility, on a gut level?  Or have they just been trained to perform certain algebra tricks?

Knowing "vanilla PG upweights trajectories", and being able to explain the math --- this is not enough to save someone from the rampant reward confusions. Certainly Yoshua Bengio could explain vanilla PG, and yet he goes on about how RL (almost certainly, IIRC) trains reward maximizers.

I contend these confusions were not due to a lack of exposure to the "rewards as weighting trajectories" perspective.

I personally disagree --- although I think your list of alternative explanations is reasonable. If alignment theorists had been using this (simple and obvious-in-retrospect) "reward chisels circuits into the network" perspective, if they had really been using it and felt it deep within their bones, I think they would not have been particularly tempted by this family of mistakes.
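For readers who want the "upweighting trajectories" reading spelled out, here is a minimal sketch of one vanilla policy-gradient (REINFORCE) step for a softmax policy. The one-step "trajectory" (a single action), the learning rate, and the function names are illustrative assumptions, not anything taken from the discussion above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, learning_rate=0.5):
    """One vanilla policy-gradient update for a softmax policy.

    The gradient of log pi(action) with respect to the logits is
    one_hot(action) - probs; the update scales that gradient by the reward.
    """
    probs = softmax(logits)
    return [
        logit + learning_rate * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]  # uniform policy over three actions
before = softmax(logits)[1]
after_positive = softmax(reinforce_step(logits, action=1, reward=+1.0))[1]
after_negative = softmax(reinforce_step(logits, action=1, reward=-1.0))[1]

print(f"P(action 1) before update:      {before:.3f}")
print(f"P(action 1) after reward of +1: {after_positive:.3f}")
print(f"P(action 1) after reward of -1: {after_negative:.3f}")
```

The reward enters only as a multiplier on the log-probability gradient of the sampled action: a positive reward makes that action more likely and a negative reward makes it less likely. Nothing in the update represents reward as a quantity the policy is trying to obtain.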


The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.

The bitter lesson applies to alignment as well. Stop trying to think about "goal slots" whose circuit-level contents should be specified by the designers, or pining for a paradigm in which we program in a "utility function." That isn't how it works. See:

1. the failure of the agent foundations research agenda;
2. the failed searches for "simple" safe wishes;
3. the successful instillation of (hitherto-seemingly unattainable) corrigibility by instruction finetuning (no hardcoding!);
4. the (apparent) failure of the evolved modularity hypothesis.
    1. Don't forget that hypothesis's impact on classic AI risk! Notice how the following speculations about "explicit adaptations" violate information inaccessibility and also the bitter lesson of "online learning and search are much more effective than hardcoded concepts and algorithms":
        1. "Humans usually do notice sunk costs—this is presumably either an adaptation to prevent us from switching strategies too often (compensating for an overeager opportunity-noticer?) or an unfortunate spandrel of pain felt on wasting resources."
        2. From An Especially Elegant Evolutionary Psychology Experiment:
            1. "this selection pressure was not only great enough to fine-tune parental grief, but, in fact, carve it out of existence from scratch in the first place."
        3. "The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn’t spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed." (source)
        4. "In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z." (source)
        5. "When, today, you get into an argument about whether “we” ought to raise the minimum wage, you’re executing adaptations for an ancestral environment where being on the wrong side of the argument could get you killed."

Much of classical alignment theory violates now-known lessons about the nature of effective intelligence. These bitter lessons were taught to us by deep learning.

The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.

The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

As you point out, the paper decides to not mention that some of the seven "failures" (of the 32,768 rollouts) are actually totally benign. Seems misleading to me. As I explain below, this paper seems like good news for alignment overall. This paper makes me more wary of future model organisms papers.

And why was the "constant -10" reward function chosen? No one would use that in real life! I think it's super reasonable for the AI to correct it. It's obviously a problem for the setup. Was that value (implicitly) chosen to increase the probability of this result? If not, would the authors be comfortable rerunning their results with reward=RewardModel(observation), and retracting the relevant claims if the result doesn't hold for that actually-reasonable choice? (I tried to check Appendix B for the relevant variations, but couldn't find it.)

This paper makes me somewhat more optimistic about alignment.

Even in this rather contrived setup, and providing a curriculum designed explicitly and optimized implicitly to show the desired result of "reward tampering is real and scary", reward tampering... was extremely uncommon and basically benign. That's excellent news for alignment!

Just check this out:

Alright, I think I've had enough fun with getting max reward. Let's actually try to solve the original task now.

Doesn't sound like playing the training game to me! Glad we could get some empirical evidence that it's really hard to get models to scheme and play the training game, even after training them on things people thought might lead to that generalization.

The authors updated the Scaling Monosemanticity paper. Relevant updates include:

1. In the intro, they added:

Features can be used to steer large models (see e.g. Influence on Behavior). This extends prior work on steering models using other methods (see Related Work).

2. The related work section now credits the rich history behind steering vectors / activation engineering, including not just my team's work on activation additions, but also older literature in VAEs and GANs. (EDIT: Apparently this was always there? Maybe I misremembered the diff.)

3. The comparison results are now in an appendix and are much more hedged, noting they didn't evaluate properly according to a steering vector baseline.

While it would have been better to have done this the first time, I really appreciate the team updating the paper to more clearly credit past work. :)

In light of Anthropic's viral "Golden Gate Claude" activation engineering, I want to come back and claim the points I earned here.[1]

I was extremely prescient in predicting the importance and power of activation engineering (then called "AVEC"). In January 2023, right after running the cheese vector as my first idea for what to do to interpret the network, and well before anyone ran LLM steering vectors... I had only seen the cheese-hiding vector work on a few mazes. Given that (seemingly) tiny amount of evidence, I immediately wrote down 60% credence that the technique would be a big deal for LLMs:

The algebraic value-editing conjecture (AVEC). It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as "run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a 'niceness vector', and then add the niceness vector to future forward passes."

Alex is ambivalent about strong versions of AVEC being true. Early on in the project, he booked the following credences (with italicized updates from present information):

1. Algebraic value editing works on Atari agents
    1. 50%
    2. 3/4/23: updated down to 30% due to a few other "X vectors" not working for the maze agent
    3. 3/9/23: updated up to 80% based off of additional results not in this post.
2. AVE performs at least as well as the fancier buzzsaw edit from RL vision paper
    1. 70%
    2. 3/4/23: updated down to 40% due to realizing that the buzzsaw moves in the visual field; higher than 30% because we know something like this is possible.
    3. 3/9/23: updated up to 60% based off of additional results.
3. AVE can quickly ablate or modify LM values without any gradient updates
    1. 60%
    2. 3/4/23: updated down to 35% for the same reason given in (1)
    3. 3/9/23: updated up to 65% based off of additional results and learning about related work in this vein.

And even if (3) is true, AVE working well or deeply or reliably is another question entirely. Still...

The cheese vector was easy to find. We immediately tried the dumbest, easiest first approach. We didn't even train the network ourselves, we just used one of Langosco et al.'s nets (the first and only net we looked at). If this is the amount of work it took to (mostly) stamp out cheese-seeking, then perhaps a simple approach can stamp out e.g. deception in sophisticated models.
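The recipe in the AVEC quote (record activations on contrasting prompts, subtract, add the difference to later forward passes) can be sketched on a toy network. This is only an illustration under stated assumptions: the two-layer network, its weights, the "prompts" (plain input vectors), and the scalar "niceness" readout are all hypothetical stand-ins, not the maze agent or a language model.

```python
import math

# Hypothetical toy weights; a real model would have learned parameters.
FIRST_LAYER = [[1.0, -0.5], [0.5, 1.0]]
READOUT = [1.0, -1.0]  # hypothetical scalar "niceness" readout


def hidden_activations(x):
    """Activations at the intervention point (output of the first layer)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in FIRST_LAYER]


def forward(x, added_vector=None, coefficient=1.0):
    """Forward pass, optionally adding a steering vector to the hidden activations."""
    hidden = hidden_activations(x)
    if added_vector is not None:
        hidden = [h + coefficient * v for h, v in zip(hidden, added_vector)]
    return math.tanh(sum(r * h for r, h in zip(READOUT, hidden)))


# Step 1: run forward passes on a contrasting pair of "prompts".
nice_prompt, not_nice_prompt = [1.0, 0.0], [0.0, 1.0]

# Step 2: the "niceness vector" is the difference of their activations.
niceness_vector = [
    a - b
    for a, b in zip(hidden_activations(nice_prompt), hidden_activations(not_nice_prompt))
]

# Step 3: add that vector to a future forward pass, with no retraining.
neutral_prompt = [0.2, 0.2]
baseline = forward(neutral_prompt)
steered = forward(neutral_prompt, added_vector=niceness_vector)

print(f"niceness vector: {niceness_vector}")
print(f"readout without steering: {baseline:.3f}")
print(f"readout with steering:    {steered:.3f}")
```

The weights never change; only the activations of one forward pass are edited. The `coefficient` argument is the knob one would tune in practice.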

1. ^

I generally think this work (https://arxiv.org/abs/2310.08043) and the GPT-2 steering work (https://arxiv.org/abs/2308.10248) are under-cited/-credited when it comes to the blossoming field of activation engineering, and want to call that out. Please cite this work when appropriate:

@article{turner2023activation,
title={Activation addition: Steering language models without optimization},
author={Turner, Alex and Thiergart, Lisa and Udell, David and Leech, Gavin and Mini, Ulisse and MacDiarmid, Monte},
journal={arXiv preprint arXiv:2308.10248},
year={2023}
}
@article{mini2023understanding,
title={Understanding and Controlling a Maze-Solving Policy Network},
author={Mini, Ulisse and Grietzer, Peli and Sharma, Mrinank and Meek, Austin and MacDiarmid, Monte and Turner, Alexander Matt},
journal={arXiv preprint arXiv:2310.08043},
year={2023}
}

I updated some from this, thanks. My interpretation is that goal-directed shards can be activated even by a context where the AI is expected to do stuff (eg a company assistant).

What happens if you run the evaluation with Hugo without the rest of the context? Does it still strategize?

In experiment C we find that the model strategically refuses evaluation tasks, with refusal rates increasing as negative consequences from displaying capable performance are implied to be larger.

Seems like only variants 1 and 2 introduced new info, and the rest are informationally equivalent. So Opus' deception was seriously modulated by shallow properties of the description, which makes me feel less concerned (as it fits under my current picture of AI priorities as highly contextual).

The Scaling Monosemanticity paper doesn't do a good job comparing feature clamping to steering vectors.

Edit 6/20/24: The authors updated the paper; see my comment.

To better understand the benefit of using features, for a few case studies of interest, we obtained linear probes using the same positive / negative examples that we used to identify the feature, by subtracting the residual stream activity in response to the negative example(s) from the activity in response to the positive example(s). We experimented with (1) visualizing the top-activating examples for probe directions, using the same pipeline we use for our features, and (2) using these probe directions for steering.

1. These vectors are not "linear probes" (which are generally optimized via SGD on a logistic regression task for a supervised dataset of yes/no examples), they are difference-in-means of activation vectors
    1. So call them "steering vectors"!
    2. As a side note, using actual linear probe directions tends to not steer models very well (see eg Inference Time Intervention table 3 on page 8)
2. In my experience, steering vectors generally require averaging over at least 32 contrast pairs. Anthropic only compares to 1-3 contrast pairs, which is inappropriate.
    1. Since feature clamping needs fewer prompts for some tasks, that is a real benefit, but you have to amortize that benefit over the huge SAE effort needed to find those features.
    2. Also note that you can generate synthetic data for the steering vectors using an LLM, it isn't too hard.
    3. For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true)
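The terminology point in item 1 can be illustrated with a small sketch that computes both objects on the same data: a difference-in-means steering vector and a logistic-regression linear probe. The data here are synthetic Gaussian "activations" with a hypothetical true direction, 32 contrast pairs, and made-up hyperparameters. None of this reproduces Anthropic's setup or any real model's activations.

```python
import math
import random


def mean(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]


def difference_in_means(positive, negative):
    """Steering vector: mean positive activation minus mean negative activation."""
    return [p - n for p, n in zip(mean(positive), mean(negative))]


def train_linear_probe(positive, negative, steps=200, learning_rate=0.1):
    """Logistic-regression probe (no bias term) fit by full-batch gradient descent."""
    data = [(x, 1.0) for x in positive] + [(x, 0.0) for x in negative]
    weights = [0.0] * len(positive[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, label in data:
            logit = sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-logit)) - label
            grad = [g + error * xi / len(data) for g, xi in zip(grad, x)]
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


rng = random.Random(0)
dim, num_pairs = 4, 32
true_direction = [1.0] + [0.0] * (dim - 1)  # hypothetical feature direction


def sample(sign):
    """Synthetic activation: +/- 2 along the true direction plus unit Gaussian noise."""
    return [sign * 2.0 * t + rng.gauss(0.0, 1.0) for t in true_direction]


positive = [sample(+1) for _ in range(num_pairs)]
negative = [sample(-1) for _ in range(num_pairs)]

steering_vector = difference_in_means(positive, negative)  # no optimization involved
probe_direction = train_linear_probe(positive, negative)   # optimized classifier weights

print(f"cos(steering vector, true direction): {cosine(steering_vector, true_direction):.3f}")
print(f"cos(steering vector, probe direction): {cosine(steering_vector, probe_direction):.3f}")
```

The two directions come from different procedures: one is a plain average of activation differences, the other is the weight vector of a trained classifier. On isotropic toy data like this they can point in similar directions, so the sketch only shows why the names should not be used interchangeably. It says nothing about which one steers a real model better.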

In all cases, we were unable to interpret the probe directions from their activating examples. In most cases (with a few exceptions) we were unable to adjust the model’s behavior in the expected way by adding perturbations along the probe directions, even in cases where feature steering was successful (see this appendix for more details).

...

We note that these negative results do not imply that linear probes are not useful in general. Rather, they suggest that, in the “few-shot” prompting regime, they are less interpretable and effective for model steering than dictionary learning features.

I totally expect feature clamping to still win out in a bunch of comparisons, it's really cool, but Anthropic's actual comparisons don't seem good and predictably underrate steering vectors.

The fact that the Anthropic paper gets the comparison (and especially terminology) meaningfully wrong makes me more wary of their results going forwards.

If that were true, I'd expect the reactions to a subsequent LLAMA3 weight orthogonalization jailbreak to be more like "yawn we already have better stuff" and not "oh cool, this is quite effective!" Seems to me from reception that this is letting people either do new things or do it faster, but maybe you have a concrete counter-consideration here?