All of LawrenceC's Comments + Replies


Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).

Thanks for writing this up! 

I'm curious about this:

I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.

What motivated people in particular? What was surprising?

1Jenny Nitishinskaya2d
I had cached impressions that AI safety people were interested in auditing, ELK, and scalable oversight. A few AIS people who volunteered to give feedback before the workshop (so biased towards people who were interested in the title) each named a unique top choice: scientific understanding (specifically threat models), model editing, and auditing (so 2/3 were unexpected for me). During the workshop, attendees (again, biased, as they self-selected into the session) expressed excitement most about auditing, unlearning, MAD, ELK, and general scientific understanding. I was surprised at the interest in MAD and ELK, I thought there would be more skepticism around those; though I can see how they might be aesthetically appealing for the slightly more academic audience.

Minor clarifying point: Act-adds cannot be cast as ablations.

Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or standard word that I can't think of right now.
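
To make "move along the discovered direction and see what happens to the outputs" concrete, here is a minimal sketch on a toy two-layer network in plain Python. The weights, the direction, and the `forward` helper are all made up for illustration; this is not the implementation from any of the papers under discussion.

```python
# Toy illustration: add a multiple of a direction to a hidden activation
# and compare the outputs. All weights and the direction are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, v) for v in vector]

W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 inputs -> 3 hidden units
W_OUT = [[1.0, -1.0, 0.5]]                   # 3 hidden units -> 1 output

def forward(x, direction=None, coeff=0.0):
    """Run the toy network, optionally shifting the hidden activations."""
    hidden = relu(matvec(W_IN, x))
    if direction is not None:
        # The intervention: move the activation along `direction`.
        hidden = [h + coeff * d for h, d in zip(hidden, direction)]
    return matvec(W_OUT, hidden)[0]

direction = [1.0, 0.0, 0.0]  # stand-in for a "discovered" direction
x = [1.0, 2.0]
baseline = forward(x)
added = forward(x, direction, coeff=2.0)
subtracted = forward(x, direction, coeff=-2.0)
print(baseline, added, subtracted)
```

Zeroing out or resampling the component along the direction would be an ablation in the usual sense; adding or subtracting a multiple of the direction is the intervention described here.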

Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper? 

This is why I'm pessimistic about most interpretability work. It just isn't focused enough

Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or... (read more)

3Alex Turner10d
Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing? Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).

Glad to see that this work is out! 

I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on but the direct impacts of the work are unclear). I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model. 

The main funders are LTFF, SFF/Lightspeed/other S-process stuff from Jaan Tallinn, and Open Phil. LTFF is the main one that solicits independent researcher grant applications.

There are a lot of orgs. Off the top of my head, there's Anthropic/OpenAI/GDM as the scaling labs with decent-sized alignment teams, and then there's a bunch of smaller/independent orgs:

  • Alignment Research Center
  • Apollo Research
  • CAIS
  • CLR
  • Conjecture
  • FAR
  • Orthogonal
  • Redwood Research

And there's always academia.

(I'm sure I'm missing a few though!)

(EDIT: added in RR and CLR)

Redwood Research?

I think this has gotten both worse and better in several ways.

It's gotten better in that ARC and Redwood (and to a lesser extent, Anthropic and OpenAI) have put out significantly more of their research. FAR Labs also exists and is doing some of the research proliferation that would've gone on inside of Constellation.

It's worse in that there's been some amount of deliberate effort to build more of an AIS community in Constellation, e.g. with explicit Alignment Days where people are encouraged to present work-in-progress and additional fellowships and workshops. 

On net I think it's gotten better, mainly because there's just been a lot more content put out in 2023 (per unit research) than in 2022. 

I suspect the underfitting explanation is probably a lot of what's going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect it to be underfitting instead of generalization (properly fitting)? 

1Thomas Kwa1mo
Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao paper there is a gap between 3B and 6B reward model) so it seems reasonable that even the current largest reward models are not optimal. I guess it hasn't been tested whether DPO scales better than RLHF. I don't have enough experience with these techniques to have a view on whether it does.

Thanks for posting this, this seems very correct. 

I don't think so, unfortunately, and it's been so long that I don't think I can find the code, let alone get it running. 

I think the deciding difference is that the amount of fans and supporters who want to be actively involved and who think the problem is the most important in the world is much larger than the number of researchers; while popular physics book readers and nature documentary viewers are plentiful, I doubt most of them feel a compelling need to become involved!

Great work, glad to see it out!

  • Why doesn't algebraic value editing break all kinds of internal computations?! What happened to the "manifold of usual activations"? Doesn't that matter at all? 
    • Or the hugely nonlinear network architecture, which doesn't even have a persistent residual stream? Why can I diff across internal activations for different observations?
    • Why can I just add 10 times the top-right vector and still get roughly reasonable behavior? 
    • And the top-right vector also transfers across mazes? Why isn't it maze-specific? 
      • To make up
... (read more)
2Jacques Thibodeau6mo
Indeed! When I looked into model editing stuff with the end goal of “retargeting the search”, the finickiness and break down of internal computations was the thing that eventually updated me away from continuing to pursue this. I haven’t read these maze posts in detail yet, but the fact that the internal computations don’t ruin the network is surprising and makes me think about spending time again in this direction. I’d like to eventually think of similar experiments to run with language models. You could have a language model learn how to solve a text adventure game, and try to edit the model in similar ways as these posts, for example. Edit: just realized that the next post might be with GPT-2. Exciting!
1Robert Kirk6mo
I think the hyperlink for "conv nets without residual streams" is wrong? It's for me


(As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF. 

2Oliver Habryka7mo
Yep, I think it's pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?  I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT. 

Quick clarifications:

  • For challenge 1, was the MNIST CNN trained on all 60,000 examples in the MNIST train/validation sets?
  • For both challenges, do the models achieve perfect train accuracy? When did you train them until?
  • What sort of interp tools are allowed? Can I use pre-mech interp tools like saliency maps? 

Edit: played around with the models, it seems like the transformer only gets 99.7% train accuracy and 97.5% test accuracy!

1Stephen Casper7mo
The MNIST CNN was trained only on the 50k training examples.  I did not guarantee that the models had perfect train accuracy. I don't believe they did.  I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data.  At the end of the day, I (and possibly Neel) will have the final say in things.  Thanks :)

I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:

Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.

The simulators frame does make it very clear that there's a distinction between the simulator/GPT-3 and the simulacra/characters or situations it's making predictions abo... (read more)

The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)

I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does... (read more)

Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting. 
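
A toy sketch of this nitpick: if the model only ever conditions on a fixed-length window, then two histories that agree on that window produce the same continuation. The `next_token` function below is a made-up deterministic stand-in for a model, and the window of 4 tokens is a stand-in for a real 4k context.

```python
# Toy illustration: with a fixed context length, the next token depends only
# on the last CONTEXT_LENGTH tokens, so that window plays the role of "state".

CONTEXT_LENGTH = 4  # stand-in for a real model's context window

def next_token(window):
    """Hypothetical stand-in for a model: a deterministic function of the window."""
    return sum(window) % 10

def step(tokens):
    """Append one token, conditioning only on the truncated context."""
    window = tokens[-CONTEXT_LENGTH:]
    return tokens + [next_token(window)]

a = [9, 9, 9, 1, 2, 3, 4]
b = [0, 1, 2, 3, 4]
# The histories differ, but the last four tokens agree, so every later
# window (and hence every later token) agrees as well.
for _ in range(5):
    a, b = step(a), step(b)
print(a[-5:], b[-5:])
```

A real model samples from a distribution rather than picking a token deterministically, but the same point holds for the distribution.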

We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.

Hopefully this will be fixed with the forthcoming arXiv paper!

2Xuan (Tan Zhi Xuan)8mo
Great to know, and good to hear!

At least based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer in them is important for interp -- superposition just stops you from using the standard features-as-directions approach almost entirely!

Don't think there have been public writeups, but here are two relevant Manifold markets:


In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I will talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work. 

Fwiw this does not seem to be in the Dan Hendrycks post you linked!

1Stephen Casper8mo
Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.

Google’s event where they’re presumably unveiling their response will happen Feb 8th at 2:30 PM CET/5:30 AM PT:

That being said, it's possible that both group composition tasks (like the mod add stuff) and MNIST are pretty special datasets, in that generalizing solutions have small weight norm and memorization solutions have large weight norm. It might be worth constructing tasks where generalizing solutions have large weight norm, and seeing what happens.

The negative result tells us that the strong form of the claim "regularization = navigability" is probably wrong. Having a smaller weight norm actually is good for generalization (just as the learning theorists would have you believe). You'll have better luck moving along the set of minimum loss weights in the way that minimizes the norm than in any other way.

Have you seen the Omnigrok work? It argues that weight norm is directly related to grokking:

Similarly, Figure 7 from also makes this point, but less str... (read more)

As for other forms of noise inducing grokking: we do see grokking with dropout! So there's some reason to think noise -> grokking. 

(Source: Figure 28 from 

Also worth noting that grokking is pretty hyperparameter sensitive -- it's possible you just haven't found the right size/form of noise yet!

In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?

Worth noting that you can get grokking in some cases without explicit regularization with full batch gradient descent, if you use an adaptive optimizer, due to the slingshot mechanism: 

Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1 layer t... (read more)

Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation. 

(Redwood is fully aware of this problem and there have been several efforts to fix it.) 

Thanks for the clarification! I'll have to think more about this. 

Yeah, I think it was implicitly assumed that there existed some  such that no token ever had probability .

1Jan Hendrik Kirchner7mo
Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it's reasonable to assume that such an ε always exists in practice (if we also assume that the probability gets rounded to something < 1).

Thanks for the clarification!

I agree that your model of subagents in the two posts shares a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI Safety formalisms I've seen as well as John Wentworth's Why Subagents?.) My bad. 

That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about. 

1Kaj Sotala9mo
No worries! Yeah, I did drift towards more generic terms like "subsystems" or "parts" later in the series for this reason, and might have changed the name of the sequence if only I'd managed to think of something better. (Terms like "subagents" and "multi-agent models of mind" still gesture away from rational agent models in a way that more generic terms like "subsystems" don't.)


just procrastination/lacking urgency

This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure this differentially leads to delaying contact with reality more than, say, delaying writing up your ideas in a Google doc. 

Some more strategies I like for touching reality faster

I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!

4Neel Nanda9mo
Importantly, the bar for "good person to explain ideas to" is much lower than the bar for "is a good collaborator". Finding good collaborators is hard!

I think this is a good word of caution. I'll edit in a link to this comment.

Thanks for posting this! I agree that it's good to get it out anyways, I thought it was valuable. I especially resonate with the point in the Pure simulators section.


Some responses:

In general I'm skeptical that the simulator framing adds much relative to 'the model is predicting what token would appear next in the training data given the input tokens'. I think it's pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world. 

I think that the main value of the simula... (read more)

  • C* What is the role of Negative/ Backup/ regular Name Movers Heads outside IOI?  Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?

So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases? 
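
The arithmetic behind "overconfident predictions on borderline cases increase log loss" can be checked with a few lines of Python. This is only an illustration of the calibration argument with made-up numbers (a binary case where the top guess is right 60% of the time), not a reproduction of the ResNeXt result.

```python
import math

def log_loss(p_correct):
    """Negative log-likelihood assigned to the true label."""
    return -math.log(p_correct)

def average_loss(confidence, accuracy):
    """Expected log loss when the model puts `confidence` on its top guess
    and that guess is right with probability `accuracy` (binary case)."""
    return (accuracy * log_loss(confidence)
            + (1 - accuracy) * log_loss(1 - confidence))

# Borderline cases: the top guess is right only 60% of the time.
calibrated = average_loss(confidence=0.6, accuracy=0.6)
overconfident = average_loss(confidence=0.99, accuracy=0.6)
print(f"calibrated: {calibrated:.3f}, overconfident: {overconfident:.3f}")
```

The overconfident predictor pays a large penalty on the 40% of cases it gets wrong, which outweighs the small gain on the cases it gets right.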

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).

The general pattern is the following (terminology borrowed from Terry Tao):

  1. The pre-rigorous stage: Really new people don't know how ~anything works in a field, and so use high-level abstractions that aren't ne
... (read more)
1Neel Nanda9mo
I like the analogy! I hadn't explicitly made the connection, but strongly agree (both that this is an important general phenomena, and that it specifically applies here). Though I'm pretty unsure how much I/other MI researchers are in 1 vs 3 when we try to reason about systems! To be clear, I definitely do not want to suggest that people don't try to rigorously reverse engineer systems a bunch, and be super details oriented. Linked to your comment in the post.

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Or see this post by ... (read more)

1Neel Nanda9mo
Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)

I've expanded the TL;DR at the top to include the nine theses. Thanks for the suggestion!

Thanks Nate!

I didn't add a 1-sentence bullet point for each thesis because I thought the table of contents on the left was sufficient, though in retrospect I should've written it up mainly for learning value. Do you still think it's worth doing after the fact? 

Ditto the tweet thread, assuming I don't plan on tweeting this.

4Nate Soares9mo
It would still help people like me to have a "short version" section at the top :-)

See also Superexponential Concept Space, and Simple Words, from the Sequences:

By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power.  To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out.  You'd have to see every possible example, in fact.
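
The numbers in the quote are easy to check. A short sketch, assuming (as the quote does) that a "concept" is any way of labelling each possible example In or Out:

```python
import math

# 40 binary attributes give 2**40 possible examples.
n_attributes = 40
n_examples = 2 ** n_attributes
print(n_examples)              # 1099511627776, just past a trillion
print(n_examples > 10 ** 12)   # True

# A concept assigns In/Out to every example, so there are 2**n_examples
# concepts. That number is far too large to write out, so count its
# decimal digits instead.
digits = math.floor(n_examples * math.log10(2)) + 1
print(digits)
```

The count of concepts has hundreds of billions of digits, which is the sense in which the concept space is superexponential in the number of attributes.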


From this perspective, learning doesn't just rely on inductive bias, it is nea

... (read more)

wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can deliberately avoid situations that would get it addicted to wireheading.

I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where... (read more)

I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans "would want if fully informed". 

I actually want to controversy that. I'm now going to write quickly about selection arguments in alignment more generally (thi... (read more)

Thanks for the clarification. I've edited in a link to this comment. 

Right, that's a decent objection.

I have three responses:

  • I think that the specific claim that I'm responding to doesn't depend on the AI designers choosing RL algorithms that don't train until convergence. If the specific claim was closer to "yes, RL algorithms if run until convergence have a good chance of 'caring' about reward, but in practice we'll never run RL algorithms that way", then I think this would be a much stronger objection. 
  • How are the programmers evaluating the trained RL agents in order to decide how to tune their RL runs? For example,
... (read more)
2Alex Turner9mo
This is part of the reasoning (and I endorse Steve's sibling comment, while disagreeing with his original one). I guess I was baking "convergence doesn't happen in practice" into my reasoning, that there is no force which compels agents to keep accepting policy gradients from the same policy-gradient-intensity-producing function (aka the "reward" function). From Reward is not the optimization target:

Perhaps my claim was even stronger, that "we can't really run real-world AGI-training RL algorithms 'until convergence', in the sense of 'surmounting exploration issues'." Not just that we won't, but that it often doesn't make sense to consider that "limit."

Also, to draw conclusions about e.g. running RL for infinite data and infinite time so as to achieve "convergence"... Limits have to be well-defined, with arguments for why the limits are reasonable abstraction, with care taken with the order in which we take limits.

  • What about the part where the sun dies in finite time? To avoid that, are we assuming a fixed data distribution (e.g. over observation-action-observation-reward tuples in a maze-solving environment, with more tuples drawn from embodied navigation)?
  • What is our sampling distribution of data, since "infinite data" admits many kinds of relative proportions?
  • Is the agent allowed to stop or modify the learning process?
    • (Even finite-time learning theory results don't apply if the optimized network can reach into the learning process and set its learning rate to zero, thereby breaking the assumptions of the theorems.)
  • Limit to infinite data and then limit to infinite time, or vice versa, or both at once?

I disagree. Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar [EDIT: seems in many ways extremely dissimilar] to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward. Where is the rein

There’s no such thing as convergence in the real world. It’s essentially infinitely complicated. There are always new things to discover.

I would ask “how is it that I don’t want to take cocaine right now”? Well, if I took cocaine, I would get addicted. And I know that. And I don’t want to get addicted. So I have been deliberately avoiding cocaine for my whole life. By the same token, maybe we can raise our baby AGIs in a wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can delibe... (read more)

I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I've edited the post to link to this clarification.
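
One way to read "retargetable search process" is sketched below. This is my own toy interpretation, not the definition from the linked clarification: the search procedure takes the objective as an explicit argument, so the same code can be pointed at different targets. The `search` function and the three objectives are hypothetical.

```python
def search(candidates, objective):
    """Generic search: return the candidate that scores highest under
    `objective`. The objective is an explicit argument, so it can be swapped."""
    return max(candidates, key=objective)

candidates = [-3, -1, 0, 2, 5]

# Same search procedure, three different targets.
closest_to_two = search(candidates, lambda x: -abs(x - 2))
largest = search(candidates, lambda x: x)
smallest_square = search(candidates, lambda x: -x * x)
print(closest_to_two, largest, smallest_square)
```

A system that has only a single objective baked into its weights, with no way to swap it, would not be retargetable in this sense.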

That being said, I'm still confused about the details. 

Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on some prompt or the other, was able to alternate between performing gradient descent on three types of objectives (say, L1, L2, L∞) -- would this suffice? How about if, instead, there wasn't any prompt that let me swi... (read more)

Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:

A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system

And the following definition of a mesa-optimizer:

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) fi

... (read more)

That definition of "optimizer" requires

some objective function that is explicitly represented within the system

but that is not the case here.

There is a fundamental difference between

  1. Programs that implement the computation of taking the derivative.  ($f \mapsto f'$, or perhaps $(f, x) \mapsto f'(x)$.)
  2. Programs that implement some particular function g, which happens to be the derivative of some other function.  ($x \mapsto g(x)$, where it so happens that $g = f'$ for some $f$.)
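
The two kinds of program in the list above can be sketched in a few lines of Python. The names `derivative`, `g`, `square`, and `cube` are illustrative choices of mine, and the finite-difference approximation is just one simple way to write a type-1 program.

```python
def derivative(f, h=1e-6):
    """Type 1: takes a function and returns (an approximation to) its
    derivative, using central finite differences."""
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

def g(x):
    """Type 2: a fixed function that merely happens to be the derivative
    of x**2. Nothing in it represents x**2 or the act of differentiating."""
    return 2 * x

def square(x):
    return x * x

def cube(x):
    return x ** 3

# On x**2 the two agree (up to numerical error)...
print(derivative(square)(3.0), g(3.0))
# ...but only the type-1 program can be pointed at a different function.
print(derivative(cube)(3.0))
```

The type-2 program contains no representation of the function being differentiated, which is why it has no explicitly represented objective to point to.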

The transformers in this paper are programs of the 2nd type.  They don't contain any l... (read more)

I really do empathize with the authors, since writing an abstract fundamentally requires trading off faithfulness to the paper content and the length and readability of the abstract. But I do agree that they could've been more precise without a significant increase in length.

Nitpick: I think instead of expanding on the sentence 

As a result we are able to train a more harmless and less evasive AI assistant than previous attempts that engages with harmful queries by more often explaining its objections to them than avoiding answering

My proposed rewrite ... (read more)
