## AI Alignment Forum

Lawrence Chan

I do AI Alignment research. Currently at ARC Evals, though I still dabble in interpretability in my spare time.

I'm also currently on leave from my PhD at UC Berkeley's CHAI.

# Sequences

[Redwood Research] Causal Scrubbing

# Wiki Contributions

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).

The general pattern is the following (terminology borrowed from Terry Tao):

1. The pre-rigorous stage: Really new people don't know how ~anything works in a field, and so use high-level abstractions that aren't necessarily grounded in reality.
2. The rigorous stage: Intermediate people learn the concrete grounding behind a field, but get bogged down in minutia.
3. The post-rigorous stage: Experts form correct high-level abstractions informed by their understanding of the grounding, but still fall back on the grounding when the high-level abstractions break down.

I think that many experts mainly notice the 2->3 transition, but not the 1->2 one, and so often do newbies a disservice by encouraging them not to work in the rigorous stage. I claim this is really, really bad, and that a solid understanding of the rigorous stage is a good idea for ~everyone doing technical work.

Here's a few examples:

• Terry Tao talks about this in math: Early students start out not knowing what a proof is and have to manipulate high-level, handwavy concepts. More advanced students learn the rigorous foundations behind various fields of math but are encouraged to focus on the formalism as opposed to 'focusing too much on what such objects actually “mean”'. Eventually, mathematicians are able to revisit their early intuitions and focus on the big picture, converting their intuitions to rigorous arguments when needed.
• The exact same thing is true in almost any discipline with mathematical proofs, e.g. physics or theoretical CS.
• Something very similar happens with programming as well.
• In many strategy games (e.g. Chess), you see high-level strategizing at both the very low and very high skill levels, while the middle levels are focused on improving technique.
• I'd also claim that something similar happens in psychology: freshmen undergrads come up with grand theories of human cognition that are pretty detached from reality, many intermediate researchers get bogged down in experimental results, while the good psychology researchers form high level representations and theories based on their knowledge of experiments (and are able to trivially translate intuitions into empirical claims).

Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation.

(Redwood is fully aware of this problem and there have been several efforts to fix it.)
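To make the blowup concrete: in causal scrubbing, the model is "treeified" so that each input to each node of the interpretation can be resampled independently, and the number of distinct input copies multiplies at every level. A toy sketch of the counting argument (the full-tree shape is a simplifying assumption, not Redwood's actual implementation):

```python
def num_treeified_leaves(branching: int, depth: int) -> int:
    """Count input copies in a treeified computation.

    If every node of a full `branching`-ary interpretation tree of the
    given depth resamples each of its inputs independently, the
    treeified graph ends up with branching**depth distinct input
    copies at the leaves -- exponential in interpretation depth.
    """
    leaves = 1
    for _ in range(depth):
        leaves *= branching
    return leaves

# Even a modest depth-10 binary interpretation needs 1024 input copies.
print(num_treeified_leaves(2, 10))  # 1024
```

Since each distinct combination of resampled inputs can require its own forward pass, the naive evaluation cost tracks this leaf count.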

Yeah, I think it was implicitly assumed that there existed some ε > 0 such that no token ever had probability less than ε.
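Such a lower bound on token probabilities immediately caps the per-token log loss, which is presumably why the assumption matters. A minimal sketch (the ε value is just an example):

```python
import math

def worst_case_log_loss(eps: float) -> float:
    """If the model never assigns any token probability below eps,
    then the per-token log loss -log(p) is bounded above by -log(eps)."""
    return -math.log(eps)

# With eps = 1e-4, no single token can contribute more than ~9.21 nats.
bound = worst_case_log_loss(1e-4)
print(round(bound, 2))  # 9.21
```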

Thanks for the clarification!

I agree that your model of subagents in the two posts share a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI Safety formalisms I've seen as well as John Wentworth's Why Subagents?.) My bad.

That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about.

Thanks!

> just procrastination/lacking urgency

This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure this differentially leads to delaying contact with reality more than say, delaying writing up your ideas in a Google doc.

> Some more strategies I like for touching reality faster

I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!

I think this is a good word of caution. I'll edit in a link to this comment.

Thanks for posting this! I agree that it's good to get it out anyway; I thought it was valuable. I especially resonate with the point in the Pure simulators section.

Some responses:

> In general I'm skeptical that the simulator framing adds much relative to 'the model is predicting what token would appear next in the training data given the input tokens'. I think it's pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world.

I think that the main value of the simulators framing was to push back against confused claims that treat (base) GPT3 and other generative models as traditional rational agents. That being said, I do think there are some reasons why the simulator framework adds value relative to "the model is doing next token prediction":

• The simulator framework incorporates specific facts about the token prediction task. We train generative models on tokens from a variety of agents, as opposed to a single unitary agent a la traditional behavior cloning. Therefore, we should expect different behaviors when the context implies that different agents are "natural". In other words, the model learns to produce the text that the contextually-implied agent would produce.
• The simulator framework pushes back against "stochastic parrot" claims. In academia or on ML twitter (or, even more so, academic ML twitter), you often encounter claims that language models are "just" stochastic parrots -- i.e. they don't have "understanding" or "grounding". My guess is this comes from experience with earlier generations of language models, especially early n-gram/small HMM models that really do lack understanding or grounding. (This is less of a thing that happens on LW/AF.) The simulator framework provides a mechanistic model for how a sophisticated language model that does well on the next-token prediction task could end up developing a complicated world model and agentic behavior.
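A toy sketch of the first point: a single next-token predictor trained on text from several "agents" behaves differently depending on which agent the context implies. (The corpus and the trigram model here are hypothetical stand-ins for a real LM and its training data.)

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: two "agents" with different verbal habits.
corpus = [
    "alice says yes", "alice says yes", "alice says maybe",
    "bob says no", "bob says no", "bob says never",
]

# One trigram model trained on all agents' text at once.
counts = defaultdict(Counter)
for line in corpus:
    toks = line.split()
    for i in range(len(toks) - 2):
        counts[(toks[i], toks[i + 1])][toks[i + 2]] += 1

def most_likely_next(context):
    """Greedy next-token prediction given the last two tokens."""
    return counts[context].most_common(1)[0][0]

# The same model "simulates" different agents depending on context.
print(most_likely_next(("alice", "says")))  # yes
print(most_likely_next(("bob", "says")))    # no
```

The point is just that a unitary predictor plus agent-indicating context yields agent-dependent behavior, with no single "goal" baked into the model.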

My guess is you have a significantly more sophisticated, empirical model of LMs, such that the simulators framework feels like a simplification to your empirical knowledge + "the model is doing next token prediction". But I think the simulator framework is valuable because it incorporates additional knowledge about the LM task while pushing back against two significantly more confused framings. (Indeed, Janus makes these claims explicitly in the simulators post!)

> (Paul has a post which talks about this 'what is actually the correct generalization' thing somewhere that I wanted to link, but I can't currently find it)

Are you thinking of A naive alignment strategy and optimism about generalization?

(Paul does talk about intended vs unintended generalization in a bunch of posts, so it's conceivable you're thinking about something more specific.)

## GPT-style transformers are purely myopic

I'm not sure this is that important, or that anyone else actually thinks this, but it was something I got wrong for a while. I was thinking of everything that happens at sequence position n as being solely about myopically predicting the nth token.

I do think people think variants of this, see the comments of Steering Behaviour: Testing for (Non-)Myopia in Language Models for example.
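One way to see why the purely-myopic picture is wrong: in a causal transformer, the activations computed at position n are read by attention at every later position, so they also serve future predictions. A pure-Python toy with scalar "embeddings" and identity Q/K/V (an illustration of causal attention's information flow, not a real transformer layer):

```python
import math

def causal_attention(xs):
    """Minimal single-head self-attention over scalar inputs.
    Position i attends only to positions j <= i (causal mask)."""
    out = []
    for i in range(len(xs)):
        scores = [xs[i] * xs[j] for j in range(i + 1)]
        m = max(scores)                      # stabilize the softmax
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        out.append(sum(w * xs[j] for j, w in enumerate(ws)) / z)
    return out

# Changing only the input at position 0 changes the output at the
# last position: early-position computation feeds later predictions.
a = causal_attention([1.0, 0.5, 2.0])
b = causal_attention([9.0, 0.5, 2.0])
print(a[-1] != b[-1])  # True
```

So even though the training loss at position n only scores the next-token prediction, the features built at position n are under pressure to be useful for all later positions too.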

> C* What is the role of Negative/Backup/regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?

So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases?
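A hedged toy illustration of that calibration story: if the "positive" circuit is overconfident on a borderline case, a head that writes negatively against the favored logit reduces log loss, and ablating it (keeping only the overconfident logits) increases loss. All the numbers below are made up for illustration:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_loss(logits, true_idx):
    """Cross-entropy of the softmax distribution against the true label."""
    return -math.log(softmax(logits)[true_idx])

# Borderline case whose true label is class 1, but the "positive"
# circuit is overconfident in class 0 (hypothetical numbers).
overconfident = [4.0, 0.0]           # negative head ablated
calibrated = [4.0 - 3.0, 0.0]        # negative head pulls class 0 back down

print(log_loss(calibrated, 1) < log_loss(overconfident, 1))  # True
```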

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

> Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Or see this post by Daniel Filan.