## AI Alignment Forum

Lawrence Chan

I do AI Alignment research. Currently at ARC Evals, though I still dabble in interpretability in my spare time.

I'm also currently on leave from my PhD at UC Berkeley's CHAI.

# Sequences

• (Lawrence's) Reflections on Research
• [Redwood Research] Causal Scrubbing

# Wiki Contributions

Great work, glad to see it out!

> • Why doesn't algebraic value editing break all kinds of internal computations?! What happened to the "manifold of usual activations"? Doesn't that matter at all?
> • Or the hugely nonlinear network architecture, which doesn't even have a persistent residual stream? Why can I diff across internal activations for different observations?
> • Why can I just add 10 times the top-right vector and still get roughly reasonable behavior?
> • And the top-right vector also transfers across mazes? Why isn't it maze-specific?
> • To make up some details, why wouldn't an internal "I want to go to top-right" motivational information be highly entangled with the "maze wall location" information?

This was also the most surprising part of the results to me.

I think both this work and Neel's recent Othello post provide evidence that, at least for small-to-medium-sized neural networks, things are just... represented ~linearly (Olah et al.'s features-as-directions hypothesis). Note that Chris Olah's earlier work on features as directions was done not on transformers but on conv nets, which also lack a residual stream.
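To make the features-as-directions picture concrete, here is a toy sketch of the activation-addition move: diff the activations from two contrasting inputs to get a steering vector, then add a scaled copy at inference time. Everything here (the linear readout, the fake activations, the scale of 10) is made up for illustration; it is not the actual code from the post being discussed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the rest of the network: a fixed linear readout over a
# 16-dim hidden activation. Under the linear-representation hypothesis,
# behaviors can be steered by adding a direction to the activations.
W_readout = rng.normal(size=(4, 16))

def forward(activation):
    """Readout logits from a hidden activation (stand-in for later layers)."""
    return W_readout @ activation

# Activations recorded on two contrasting inputs (stand-ins for an
# "agent heads top-right" observation vs. a baseline observation).
act_topright = rng.normal(size=16) + 2.0 * np.eye(16)[0]  # extra mass on axis 0
act_baseline = rng.normal(size=16)

# The "algebraic value editing" move: diff the two activations to get a
# steering vector, then add a scaled copy at inference time.
steering_vec = act_topright - act_baseline

steered = forward(act_baseline + 10.0 * steering_vec)
unsteered = forward(act_baseline)
```

Because this toy readout is exactly linear, adding 10× the steering vector shifts the output by exactly 10× the vector's contribution; the surprising empirical finding is that real, highly nonlinear networks tolerate this intervention too.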

Thanks!

(As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.

Quick clarifications:

• For challenge 1, was the MNIST CNN trained on all 60,000 examples in the MNIST train/validation sets?
• For both challenges, do the models achieve perfect train accuracy? How long did you train them for?
• What sort of interp tools are allowed? Can I use pre-mech interp tools like saliency maps?

Edit: played around with the models, it seems like the transformer only gets 99.7% train accuracy and 97.5% test accuracy!

I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:

> Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.

The simulators frame does make it very clear that there's a distinction between the simulator (GPT-3) and the simulacra (the characters or situations it's making predictions about)! On the other hand, using "prediction" can obscure that distinction and lead to confused questions like "is GPT just an agent that wants to minimize predictive loss?"

The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)
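As a toy illustration of the "text as state, model as time-evolution rule" framing (all tokens and probabilities here are made up): the state is the whole token sequence so far, and one evolution step samples the next token conditional on it. The visible text only has the Markov property here because each step conditions on the entire sequence, which is exactly the weak sense mentioned above.

```python
import random

# Toy autoregressive model: a conditional distribution P(next token | state).
# The "state" is the full token sequence; these transitions are invented.
TRANSITIONS = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.5, "dog": 0.5},
}

def step(state, rng):
    """One time-evolution step: sample the next token given the whole state."""
    dist = TRANSITIONS.get(tuple(state), {"<eos>": 1.0})
    tokens, probs = zip(*dist.items())
    return state + [rng.choices(tokens, weights=probs)[0]]

rng = random.Random(0)
state = []
while "<eos>" not in state:
    state = step(state, rng)  # the trajectory *is* the sequence of states
```

There is no extra hidden structure: the "physics" of this little universe is nothing but the conditional distribution applied repeatedly.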

I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does clarify that it's not literally physical reality), e.g.:

> I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:
>
> • The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
> • It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the former.
> • It’s not confusing that multiple simulacra can be instantiated at once, or an agent embedded in a tragedy, etc.
>
> [...]
>
> The next post will be all about the physics analogy, so here I’ll only tie what I said earlier to the simulation objective.
>
> the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum.
>
> To know the conditional structure of the universe[27] is to know its laws of physics, which describe what is expected to happen under what conditions.

I think insofar as people end up thinking the simulation is an exact match for physical reality, the problem was not with the simulators frame itself, but with the fact that the word "physics" was used 47 times in the post, while only the first few instances make it clear that literal physics is intended only as a metaphor.

Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting.
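That nitpick can be stated in one line: with a finite context window, the "state" the model actually conditions on is only a truncated suffix of the text. The window size of 4096 below is purely illustrative.

```python
def visible_state(tokens, context_len=4096):
    """The model only conditions on the last context_len tokens, so the
    effective 'state' is a sliding window, not the full history."""
    return tokens[-context_len:]
```

Once the history exceeds the window, earlier tokens silently stop influencing the next-token distribution.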

We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.

Hopefully this will be fixed with the forthcoming arXiv paper!

Based on my conversations with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer is important for interp -- superposition almost entirely stops you from using the standard features-as-directions approach!