AI ALIGNMENT FORUM
Christopher Olah — AI Alignment Forum

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.

This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including...

Replying to Anthropic's Core Views on AI Safety

Christopher Olah, 3y

We certainly think that abrupt changes of safety properties are very possible! See discussion of how the most pessimistic scenarios may seem optimistic until very powerful systems are created in this post, and also our paper on Predictability and Surprise.

With that said, I think we tend to expect a bit of continuity. Empirically, even the "abrupt changes" we observe with respect to model size tend to take place over order-of-magnitude changes in compute. (There are examples, such as the formation of induction heads, where qualitative changes in model properties can happen quite fast over the course of training.)

But we certainly wouldn't claim to know this with any confidence, and wouldn't take the possibility of extremely abrupt changes off the table!

Replying to Anthropic's Core Views on AI Safety

Christopher Olah, 3y

how likely does Anthropic think each is? What is the main evidence currently contributing to that world view?

I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.

(Obviously, within the company, there's a wide range of views. Some people are very pessimistic. Others are optimistic. We debate this quite a bit internally, and I think that's really positive! But I think there's a broad...

Replying to One-layer transformers aren’t equivalent to a set of skip-trigrams

Christopher Olah, 3y

I moderately disagree with this? I think most induction heads are at least primarily induction heads (and this points strongly at the underlying attentional features and circuits), although there may be some superposition going on. (I also think that the evidence you're providing is mostly orthogonal to this argument.)

I think if you're uncomfortable with induction heads, previous token heads (especially in larger models) are an even crisper example of an attentional feature which appears, at least on casual inspection, to typically be monosemantically represented by attention heads. :)

As a meta point – I've left some thoughts below, but in general, I'd rather advance this dialogue by just writing future papers.

(1) The main...

Replying to One-layer transformers aren’t equivalent to a set of skip-trigrams

Christopher Olah, 3y

Can I summarize your concerns as something like "I'm not sure that looking into the behavior of 'real' models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it"? Or perhaps you think it's slightly better, but not considerably?

Between the two, I might actually prefer training a toy model on a narrow distribution! But it depends a lot on exactly how the analysis is done and what lessons one wants to draw from it.

Real language models seem to make extensive use of superposition. I expect there to be lots of circuits superimposed with the one you're studying, and I worry that...

Replying to One-layer transformers aren’t equivalent to a set of skip-trigrams

Christopher Olah, 3y

Regarding the more general question of "how much should interpretability make reference to the data distribution?", here are a few thoughts:

Firstly, I think we should obviously make use of the data distribution to some extent (and much of my work has done so!). If you're trying to reverse engineer a regular computer program, it's extremely useful to have traces of that program running. So too with neural networks!

However, the fundamental thing I care about is understanding whether models will be safe off-distribution, so it is less clear how an understanding which is tied to a specific distribution – and especially to a narrow distribution – advances my core goals. Explanations which...

Replying to One-layer transformers aren’t equivalent to a set of skip-trigrams

Christopher Olah, 3y

Thanks for writing this up. It seems like a valuable contribution to our understanding of one-layer transformers. I particularly like your toy example – it's a good demonstration of how more complicated behavior can occur here.

For what it's worth, I understand this behavior as competition between skip-trigrams. We introduce "skip-trigrams" as a way to think of pairs of entries in the OV and QK-circuit matrices. The QK-circuit describes how much the attention head wants to attend to a given token in the attention softmax and implement a particular skip-trigram. The phenomenon you describe occurs when there are multiple skip-trigrams present with different QK-circuit values.
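A minimal sketch of this competition, assuming a single head that attends only over the source tokens listed. All scores are made-up illustrative numbers, not values from any real model, and `head_output` is a hypothetical helper rather than anything from the paper:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_output(qk_scores, ov_effects):
    """Weight each source token's OV effect on one output logit by its attention."""
    weights = softmax(qk_scores)
    logit = sum(w * v for w, v in zip(weights, ov_effects))
    return weights, logit


# Hypothetical setup: skip-trigram A raises the logit of some output token,
# skip-trigram B lowers it, and B has the larger QK-circuit score.
alone_weights, alone_logit = head_output([2.0], [4.0])             # only A in context
both_weights, both_logit = head_output([2.0, 5.0], [4.0, -4.0])    # A and B compete

print(alone_weights, alone_logit)
print(both_weights, both_logit)
```

With these assumed numbers, the second source token takes roughly 95% of the attention once it is present, so the head's contribution to the logit moves from 4.0 to roughly -3.6 even though skip-trigram A's own entries are unchanged.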

An analogy I find useful for thinking about this is...

Replying to Paper: Superposition, Memorization, and Double Descent (Anthropic)

Christopher Olah, 3y

I'm curious how you'd define memorisation? To me, I'd actually count this as the model learning features ...

Qualitatively, when I discuss "memorization" in language models, I'm primarily referring to the phenomenon of language models producing long quotes verbatim if primed with a certain start. I mean it as a more neutral term than overfitting.

Mechanistically, the simplest version I imagine is a feature which activates when the preceding N tokens match a particular pattern, and predicts a specific N+1 token. Such a feature is analogous to the "single data point features" in this paper. In practice, I expect you can have the same feature also make predictions about the N+2, N+3, etc. tokens...
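As a toy illustration of the simplest version described above, here is a sketch of a "feature" as a plain function over a token list. The name `make_memorization_feature` and the example tokens are hypothetical, and a real model would represent this as an activation pattern rather than an exact-match lookup:

```python
def make_memorization_feature(pattern, next_token):
    """Build a feature that fires only when the last N tokens equal `pattern`."""
    pattern = tuple(pattern)
    n = len(pattern)

    def feature(context):
        # Fire only on an exact match of the preceding N tokens.
        if len(context) >= n and tuple(context[-n:]) == pattern:
            return next_token  # the specific N+1 token this feature predicts
        return None  # feature is inactive

    return feature


# Hypothetical memorized quote fragment with N = 4.
feature = make_memorization_feature(["to", "be", "or", "not"], "to")

hit = feature(["whether", "to", "be", "or", "not"])
miss = feature(["to", "be", "or", "maybe"])
print(hit, miss)
```

The exact-match condition is what makes this a memorization feature in the sense above: it says nothing about contexts that merely resemble the pattern.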

Replying to Paper: Superposition, Memorization, and Double Descent (Anthropic)

Christopher Olah, 3y

In this toy model, is it really the case that the datapoint feature solutions are "more memorizing, less generalizing" than the axis-aligned feature solutions? I don't feel totally convinced of this.

Well, empirically in this setup, (1) does generalize and get a lower test loss than (2). In fact, it's the only version that does better than random. 🙂

But I think what you're maybe saying is that from the neural network's perspective, (2) is a very reasonable hypothesis when T < N, regardless of what is true in this specific setup. And you could perhaps imagine other data generating processes which would look similar for small data sets, but generalize differently. I think...

Replying to Paper: Superposition, Memorization, and Double Descent (Anthropic)

Christopher Olah, 3y

I feel pretty confused, but my overall view is that many of the routes I currently feel are most promising don't require solving superposition.

It seems quite plausible there might be ways to solve mechanistic interpretability which frame things differently. However, I presently expect that they'll need to do something which is equivalent to solving superposition, even if they don't solve it explicitly. (I don't fully understand your perspective, so it's possible I'm misunderstanding something though!)

To give a concrete example (although this is easier than what I actually envision), let's consider this model from Adam Jermyn's repeated data extension of our paper:

If you want to know whether the model is "generalizing" rather than...

Replying to Paper: Superposition, Memorization, and Double Descent (Anthropic)

Christopher Olah, 3y*

This is a good summary of our results, but just to try to express a bit more clearly why you might care...

I think there are presently two striking facts about overfitting and mechanistic interpretability:

(1) The successes of mechanistic interpretability have thus far tended to focus on circuits which seem to describe clean, generalizing algorithms which one might think of as the "non-overfitting parts of neural networks". We don't really know what "overfitting mechanistically is", and you could imagine a world where it's so fundamentally messy we just can't understand it!

(2) There's evidence that more overfit neural networks are harder to understand.

A pessimistic interpretation of this could be something like: Overfitting is fundamentally...