All of LawrenceC's Comments + Replies


(As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF. 

2Oliver Habryka1mo
Yep, I think it's pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?  I would be happy to bet that conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current Chat-GPT. 

Quick clarifications:

  • For challenge 1, was the MNIST CNN trained on all 60,000 examples in the MNIST train/validation sets?
  • For both challenges, do the models achieve perfect train accuracy? When did you train them until?
  • What sort of interp tools are allowed? Can I use pre-mech interp tools like saliency maps? 

Edit: played around with the models, it seems like the transformer only gets 99.7% train accuracy and 97.5% test accuracy!

1Stephen Casper1mo
The MNIST CNN was trained only on the 50k training examples.  I did not guarantee that the models had perfect train accuracy. I don't believe they did.  I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data.  At the end of the day, I (and possibly Neel) will have the final say in things.  Thanks :)

I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:

Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.

The simulators frame does make it very clear that there's a distinction between the simulator/GPT-3 and the simulacra/characters or situations it's making predictions abo... (read more)

The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)

I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does... (read more)

Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting. 

We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.

Hopefully this will be fixed with the forthcoming arXiv paper!

2Xuan (Tan Zhi Xuan)1mo
Great to know, and good to hear!

At least based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer in them is important for interp -- superposition just stops you from using the standard features as directions approach almost entirely!

Don't think there have been public writeups, but here's two relevant manifold markets:


In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I will talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work. 

Fwiw this does not seem to be in the Dan Hendrycks post you linked!

1Stephen Casper1mo
Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.

Google’s event where they’re presumably unveiling their response will happen Feb 8th at 2:30 PM CET/5:30 AM PT:

That being said, it's possible that both group composition tasks (like the mod add stuff) and MNIST are pretty special datasets, in that generalizing solutions have small weight norm and memorization solutions have large weight norm. It might be worth constructing tasks where generalizing solutions have large weight norm, and seeing what happens.

The negative result tells us that the strong form of the claim "regularization = navigability" is probably wrong. Having a smaller weight norm actually is good for generalization (just as the learning theorists would have you believe). You'll have better luck moving along the set of minimum loss weights in the way that minimizes the norm than in any other way.

Have you seen the Omnigrok work? It directly argues that weight norm is directly related to grokking:

Similarly, Figure 7 from also makes this point, but less str... (read more)


As for other forms of noise inducing grokking: we do see grokking with dropout! So there's some reason to think noise -> grokking. 

(Source: Figure 28 from 

Also worth noting that grokking is pretty hyperparameter sensitive -- it's possible you just haven't found the right size/form of noise yet!

In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?

Worth noting that you can get grokking in some cases without explicit regularization with full batch gradient descent, if you use an adaptive optimizer, due to the slingshot mechanism: 

Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1 layer t... (read more)


Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation.

(Redwood is fully aware of this problem and there have been several efforts to fix it.) 

Thanks for the clarification! I'll have to think more about this. 

Yeah, I think it was implicitly assumed that there existed some ε > 0 such that no token ever had probability greater than 1 − ε.

1Jan Hendrik Kirchner24d
Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it's reasonable to assume that such an ε always exists in practice (if we also assume that the probability gets rounded to something < 1).

Thanks for the clarification!

I agree that your model of subagents in the two posts share a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI Safety formalisms I've seen as well as John Wentworth's Why Subagents?.) My bad. 

That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about. 

1Kaj Sotala3mo
No worries! Yeah, I did drift towards more generic terms like "subsystems" or "parts" later in the series for this reason, and might have changed the name of the sequence if only I'd managed to think of something better. (Terms like "subagents" and "multi-agent models of mind" still gesture away from rational agent models in a way that more generic terms like "subsystems" don't.)


just procrastination/lacking urgency

This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure this differentially leads to delaying contact with reality more than say, delaying writing up your ideas in a Google doc. 

Some more strategies I like for touching reality faster

I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!

4Neel Nanda3mo
Importantly, the bar for "good person to explain ideas to" is much lower than the bar for "is a good collaborator". Finding good collaborators is hard!

I think this is a good word of caution. I'll edit in a link to this comment.

Thanks for posting this! I agree that it's good to get it out anyways, I thought it was valuable. I especially resonate with the point in the Pure simulators section.


Some responses:

In general I'm skeptical that the simulator framing adds much relative to 'the model is predicting what token would appear next in the training data given the input tokens'. I think it's pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world. 

I think that the main value of the simula... (read more)

  • C* What is the role of Negative/ Backup/ regular Name Movers Heads outside IOI?  Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?

So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases? 

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).

The general pattern is the following (terminology borrowed from Terry Tao):

  1. The pre-rigorous stage: Really new people don't know how ~anything works in a field, and so use high-level abstractions that aren't ne
... (read more)
1Neel Nanda3mo
I like the analogy! I hadn't explicitly made the connection, but strongly agree (both that this is an important general phenomenon, and that it specifically applies here). Though I'm pretty unsure how much I/other MI researchers are in 1 vs 3 when we try to reason about systems! To be clear, I definitely do not want to suggest that people don't try to rigorously reverse engineer systems a bunch, and be super detail-oriented. Linked to your comment in the post.

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Or see this post by ... (read more)

1Neel Nanda3mo
Thanks! That's a great explanation, I've integrated some of this wording into my MI explainer (hope that's fine!)

I've expanded the TL;DR at the top to include the nine theses. Thanks for the suggestion!

Thanks Nate!

I didn't add a 1-sentence bullet point for each thesis because I thought the table of contents on the left was sufficient, though in retrospect I should've written it up mainly for learning value. Do you still think it's worth doing after the fact? 

Ditto the tweet thread, assuming I don't plan on tweeting this.

3Nate Soares3mo
It would still help people like me to have a "short version" section at the top :-)

See also Superexponential Concept Space, and Simple Words, from the Sequences:

By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power.  To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out.  You'd have to see every possible example, in fact.
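The counting in the quoted passage can be checked directly. The sketch below is only an illustration: the function name `concept_space_size` is made up here, and it assumes a "concept" means any In/Out labelling of the possible examples. Because the number of concepts is far too large to print, it reports the base-2 logarithm instead.

```python
def concept_space_size(n_attributes: int) -> tuple[int, int]:
    """Return (number of possible examples, log2 of the number of possible concepts)."""
    n_examples = 2 ** n_attributes
    # A concept labels each example In or Out, so there are 2 ** n_examples concepts.
    # That number is too large to write out, so report its base-2 logarithm instead.
    log2_n_concepts = n_examples
    return n_examples, log2_n_concepts


n_examples, log2_n_concepts = concept_space_size(40)
print(f"possible examples: {n_examples:,}")
print(f"possible concepts: 2 ** {log2_n_concepts:,}")
```

With forty binary attributes the example count is 2^40, which is a little over a trillion, matching the quote.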


From this perspective, learning doesn't just rely on inductive bias, it is nea

... (read more)

wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can deliberately avoid situations that would get it addicted to wireheading.

I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where... (read more)

I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans "would want if fully informed". 

I actually want to controversy that. I'm now going to write quickly about selection arguments in alignment more generally (thi... (read more)

Thanks for the clarification. I've edited in a link to this comment. 

Right, that's a decent objection.

I have three responses:

  • I think that the specific claim that I'm responding to doesn't depend on the AI designers choosing RL algorithms that don't train until convergence. If the specific claim was closer to "yes, RL algorithms if run until convergence have a good chance of 'caring' about reward, but in practice we'll never run RL algorithms that way", then I think this would be a much stronger objection. 
  • How are the programmers evaluating the trained RL agents in order to decide how to tune their RL runs? For example,
... (read more)
2Alex Turner3mo
This is part of the reasoning (and I endorse Steve's sibling comment, while disagreeing with his original one). I guess I was baking "convergence doesn't happen in practice" into my reasoning, that there is no force which compels agents to keep accepting policy gradients from the same policy-gradient-intensity-producing function (aka the "reward" function). From Reward is not the optimization target:

Perhaps my claim was even stronger, that "we can't really run real-world AGI-training RL algorithms 'until convergence', in the sense of 'surmounting exploration issues'." Not just that we won't, but that it often doesn't make sense to consider that "limit."

Also, to draw conclusions about e.g. running RL for infinite data and infinite time so as to achieve "convergence"... Limits have to be well-defined, with arguments for why the limits are reasonable abstractions, with care taken with the order in which we take limits.

  • What about the part where the sun dies in finite time? To avoid that, are we assuming a fixed data distribution (e.g. over observation-action-observation-reward tuples in a maze-solving environment, with more tuples drawn from embodied navigation)?
  • What is our sampling distribution of data, since "infinite data" admits many kinds of relative proportions?
  • Is the agent allowed to stop or modify the learning process?
      • (Even finite-time learning theory results don't apply if the optimized network can reach into the learning process and set its learning rate to zero, thereby breaking the assumptions of the theorems.)
  • Limit to infinite data and then limit to infinite time, or vice versa, or both at once?

I disagree. Early stopping on a separate stopping criterion which we don't run gradients through, is not at all similar [EDIT: seems in many ways extremely dissimilar
7Steve Byrnes3mo
There’s no such thing as convergence in the real world. It’s essentially infinitely complicated. There are always new things to discover. I would ask “how is it that I don’t want to take cocaine right now”? Well, if I took cocaine, I would get addicted. And I know that. And I don’t want to get addicted. So I have been deliberately avoiding cocaine for my whole life. By the same token, maybe we can raise our baby AGIs in a wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can deliberately avoid situations that would get it addicted to wireheading. We can call this “incomplete exploration”, but we’re taking advantage of the fact that the AGI itself can foresightedly and self-aware-ly ensure that the exploration remains incomplete. I feel like you’re arguing that what I’m saying could potentially fail, and I’m arguing that what I’m saying could potentially succeed. In which case, maybe we can both agree that it’s a potential but not inevitable failure mode that we should absolutely keep thinking about.

I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I've edited the post to link to this clarification.

That being said, I'm still confused about the details. 

Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on some prompt or the other, was able to alternate between performing gradient descent on three types of objectives (say, L1, L2, L\infty) -- would this suffice? How about if, instead, there wasn't any prompt that let me swi... (read more)

Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:

A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system

And the following definition of a mesa-optimizer:

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) fi

... (read more)

That definition of "optimizer" requires

some objective function that is explicitly represented within the system

but that is not the case here.

There is a fundamental difference between

  1. Programs that implement the computation of taking the derivative.  ($f \mapsto f'$, or perhaps $(f, x) \mapsto f'(x)$.)
  2. Programs that implement some particular function g, which happens to be the derivative of some other function.  ($x \mapsto g(x)$, where it so happens that $g = f'$ for some $f$.)
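To make the distinction in the list above concrete, here is a minimal sketch. Everything in it is illustrative rather than taken from the paper under discussion: `derivative` stands in for the first kind of program (it accepts any function and differentiates it numerically), while `g` stands in for the second kind (a fixed function that merely happens to be the derivative of x squared, with no representation of x squared anywhere inside it).

```python
import math


def derivative(f, h=1e-6):
    """Type 1: takes an arbitrary function f and returns an approximation of f'."""
    def f_prime(x):
        # Central finite difference; f is an explicit input to this program.
        return (f(x + h) - f(x - h)) / (2.0 * h)
    return f_prime


def g(x):
    """Type 2: one particular function. It equals d/dx of x**2, but nothing
    in its body represents x**2 or any other objective."""
    return 2.0 * x


# The type-1 program can be pointed at different functions...
print(derivative(lambda x: x ** 2)(3.0))
print(derivative(math.sin)(0.0))
# ...while the type-2 program only ever computes its one fixed map.
print(g(3.0))
```

The two agree on the input where they overlap, which is why output behaviour alone does not tell you which kind of program you are looking at.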

The transformers in this paper are programs of the 2nd type.  They don't contain any l... (read more)

I really do empathize with the authors, since writing an abstract fundamentally requires trading off faithfulness to the paper content and the length and readability of the abstract. But I do agree that they could've been more precise without a significant increase in length.

Nitpick: I think instead of expanding on the sentence 

As a result we are able to train a more harmless and less evasive AI assistant than previous attempts that engages with harmful queries by more often explaining its objections to them than avoiding answering

My proposed rewrite ... (read more)

You're welcome, and I'm glad you think the writeup is good. 

Thank you for the good work.

I think your claim is something like:

Without some form of regularization, some forms of RL can lead to trajectories that have zero probability wrt the base distribution (e.g. because they break a correlation that occurs on the pretraining distribution with 100% accuracy). However, sampling cannot lead to trajectories with zero probability?

As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability greater than actual 0. So with enough sa... (read more)

2Erik Jenner3mo
No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward. Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the correlation). Or I could change the parameters of the distribution such that a single sample will typically have an X-value as high as the 99.9th percentile of the old distribution. In that case, the Y-value I typically get will depend a lot on how I changed the parameters. E.g. if I just shifted the X-component of the mean and nothing else, I won't get higher Y-values at all. I'm pretty unsure what kinds of parameter changes RLHF actually induces, I'm just saying that parameter updates can destroy correlations in a way that conditioning doesn't. This is with the same amount of selection pressure on the proxy in both cases.
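The toy example in this reply can be simulated in a few lines. The sketch below is one possible reading of it, not the commenter's code: the correlation of 0.8, the trial count, and the helper names are all assumptions made here, and the "parameter change" is the specific one mentioned in the reply, shifting only the mean of X (by about 3.09, roughly the 99.9th percentile of a standard normal).

```python
import math
import random
import statistics


def sample_pair(rng, rho, x_shift=0.0):
    """Draw (X, Y) with unit variances and correlation rho; x_shift moves only X's mean."""
    x = rng.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return x + x_shift, y


def best_of_n(rng, rho, n):
    """Conditioning-style selection: draw n pairs, keep the one with the largest X."""
    return max((sample_pair(rng, rho) for _ in range(n)), key=lambda pair: pair[0])


def compare(rho=0.8, n=1000, trials=200, seed=0):
    """Return (mean Y under best-of-n on X, mean Y after shifting only X's mean)."""
    rng = random.Random(seed)
    x_shift = 3.09  # assumed shift, roughly the 99.9th percentile of N(0, 1)
    y_selected = [best_of_n(rng, rho, n)[1] for _ in range(trials)]
    y_shifted = [sample_pair(rng, rho, x_shift)[1] for _ in range(trials)]
    return statistics.mean(y_selected), statistics.mean(y_shifted)


mean_y_selected, mean_y_shifted = compare()
print(f"mean Y under best-of-1000 on X: {mean_y_selected:.2f}")
print(f"mean Y after shifting only X's mean: {mean_y_shifted:.2f}")
```

Under these assumptions, selection drags Y up along with X, while the mean shift leaves Y centred near zero even though the typical X value is similar in both cases.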

Can you explain why RLHF is worse from a Causal Goodhart perspective?

1Erik Jenner3mo
As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this, I'm much less sure now (and more confused). Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correlation. I certainly don't think there will literally be cleanly separable circuits for truth and high rating, but I think the general idea plausibly applies. I don't see how anything comparable happens with filtering.

I'm surprised no one has brought up the quantilizer results, specifically the quantilizer optimality theorem from Taylor 2015:

Theorem 1 (Quantilizer optimality). Choose q=1/t. Then, a q-quantilizer maximizes expected U-utility subject to constraint 2. 

where constraint 2 is that you don't do more than t worse in expectation on any possible cost function, relative to the original distribution of actions. That is, quantilizers (which are in turn approximated by BoN), are the optimal solution to a particular robust RL problem. 
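A q-quantilizer can be approximated by sampling, which also shows the link to best-of-N. The following is a rough Monte Carlo sketch, not the construction from Taylor's paper: the function `quantilize`, the uniform base policy, and the identity utility are all hypothetical stand-ins chosen for illustration.

```python
import random


def quantilize(base_sampler, utility, q, n_samples, rng):
    """Monte Carlo sketch of a q-quantilizer.

    Draws n_samples actions from the base distribution, keeps the top q
    fraction ranked by utility, and returns one of those uniformly at random.
    """
    actions = [base_sampler(rng) for _ in range(n_samples)]
    actions.sort(key=utility, reverse=True)
    top_k = max(1, int(q * n_samples))
    return rng.choice(actions[:top_k])


rng = random.Random(0)
# Hypothetical base policy: actions are numbers drawn uniformly from [0, 1).
action = quantilize(lambda r: r.random(), utility=lambda a: a,
                    q=0.1, n_samples=1000, rng=rng)
print(action)
```

In this sketch, setting q to 1/n_samples keeps only the single best sample, which is exactly best-of-N; larger q trades utility for staying closer to the base distribution.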

However, it turns out t... (read more)

Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn't generally induce more causal Goodhart problems than best-of-N selection does.

This seems correct insofar as your proxy reward does not have huge upward errors (that you don't remove via some sort of clipping). For example, if there's 1 million normal sentences with reward uniformly distributed between [0, 100] and one adversarial sentence with reward r=10^5, conditioning on reward>99 leads to a 1/10,000 chance of sampling the adversarial sentence, while it's very tricky (i... (read more)
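The 1/10,000 figure for conditioning can be reproduced with a short calculation. This sketch assumes the base distribution puts equal weight on every sentence, which the comment does not state explicitly; the variable names are made up here.

```python
n_normal = 1_000_000          # ordinary sentences, reward ~ Uniform[0, 100]
threshold = 99.0
adversarial_reward = 10 ** 5  # the single adversarial sentence
assert adversarial_reward > threshold  # it always survives the filter

# Expected number of ordinary sentences that clear the threshold.
expected_normal_above = n_normal * (100.0 - threshold) / 100.0

# Conditioning keeps every sentence above the threshold with equal weight,
# no matter how far above the threshold its reward is.
p_adversarial = 1.0 / (expected_normal_above + 1.0)
print(f"P(adversarial | reward > {threshold}) = {p_adversarial:.6f}")
```

About 10,000 ordinary sentences pass the filter in expectation, so the adversarial one is sampled roughly one time in 10,000.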

It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering.

It's also true that maximizing Reward - KL is Bayesian updating as the linked post shows, and it's true that maximizing reward subject to a KL constraint is also equivalent to Bayesian updating as well (by Lagrangian multipliers). You see similar results with Max Ent RL (where you maximize Reward + Entropy, which is equal to a constant minus the KL relative to a ... (read more)
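For reference, the first of these equivalences can be written out in a few lines. The notation is an assumption made here, not taken from the linked post: $\pi_0$ is the base policy, $r$ the reward, and $\beta > 0$ the weight on the KL term.

```latex
\begin{align*}
J(\pi) &= \mathbb{E}_{x \sim \pi}[r(x)] - \beta\, \mathrm{KL}(\pi \,\|\, \pi_0) \\
       &= -\beta\, \mathbb{E}_{x \sim \pi}\!\left[ \log \frac{\pi(x)}{\pi_0(x)\, e^{r(x)/\beta}} \right] \\
       &= -\beta\, \mathrm{KL}(\pi \,\|\, \pi^*) + \beta \log Z,
\qquad
\pi^*(x) = \frac{\pi_0(x)\, e^{r(x)/\beta}}{Z},
\quad
Z = \sum_x \pi_0(x)\, e^{r(x)/\beta}.
\end{align*}
```

Since $\beta \log Z$ does not depend on $\pi$, the objective is maximized at $\pi = \pi^*$, which has the form of a Bayesian posterior with prior $\pi_0$ and likelihood proportional to $e^{r(x)/\beta}$.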

This doesn't seem to be what Gao et al found: Figure 9 shows that the KL between RL and initial policy, at a given proxy reward score, still is significantly larger than the equivalent KL for a BoN-policy, as shown in Figure 1.

I agree with the general point, but I'll note that at equal proxy reward model scores, the RL policy has significantly more KL divergence with the base policy. 

0davidad (David A. Dalrymple)3mo
That’s not the case when using a global KL penalty—as (I believe) OpenAI does in practice, and as Buck appeals to in this other comment. In the paper linked here a global KL penalty is only applied in section 3.6, because they observe a strictly larger gap between proxy and gold reward when doing so.

I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM.

Wouldn't surprise me if this was true, but I agree with you that it's possible the ship has already sailed on LLMs. I think this is more so the case if you have a novel insight about what paths are more promising to AGI (similar to the scaling hypothesis in 2018)---getting ~everyone to adopt that insight would significantly advance timelines, though I'd argue that publishing it (such that only the labs explicitly aimi... (read more)

Publishing capabilities work is notably worse than just doing the work.

  • I'd argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.

What's the mechanism you're thinking of, through which hype does damage?

I also doubt that good capabilities work will be published "without fanfare", given how watched this space is. 

My read is that fairly little current alignment work really feels "serial" to me. Assuming that you're mostly referring to conceptual alignment work, my read is that a lot of it is fairly confus

... (read more)
4Neel Nanda3mo
This ship may have sailed at this point, but to me the main mechanism is getting other actors to pay attention, focus on the most effective kind of capabilities work, and making it more politically feasible to raise support. Eg, I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital + support within Google Brain to train PaLM. Legibly making a ton of money with it falls in a similar category to me. Gopher is a good example of not really seeing much fanfare, I think? (Though I don't spend much time on ML Twitter, so maybe there was loads lol) Ah, my key argument here is that most conceptual work is bad because of lacking good empirical examples, grounding and feedback loops, and that if we were closer to AGI we could have this. I agree that risks from learned optimisation is important and didn't need this, and plausibly feels like a good example of serial work to me.

Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI.
But, while this doesn't change the fact that we view OpenAI's effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:

I wanted to give this a big +1. I think OpenAI is doing better than literally every single other major AI research org except probably Anthropic and Deepmind on trying to solve the AI-n... (read more)

Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to say, Google Brain or FAIR or any of the startups like Adept or Cohere.

I'm not sure I agree that this is unfair.

OpenAI is clearly on the cutting edge of AI research.

This is obviously a good reason to focus on them more.

OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.

Perhaps we have responsibility to scrutinize/criticize them more because of this... (read more)

LLM prompt engineering can replace weaker ML models

Epistemic status: Half speculation, half solid advice. I'm writing this up as I've said this a bunch IRL. 

Current large language models (LLMs) are sufficiently good at in-context learning that for many NLP tasks, it's often better and cheaper to just query an LM with the appropriate prompt, than to train your own ML model. A lot of this comes from my personal experience (i.e. replacing existing "SoTA" models in other fields with prompted LMs, and getting better performance), but there's also examples ... (read more)

Also, a cheeky way to say this:

What Grokking Feels Like From the Inside

What does grokking_NN feel like from the inside? It feels like grokking_Human a concept! :)
