Broadly agree with the takes here.

However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.

This seems right and I don't think we say anything contradicting it in the paper.

I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying 'bricks fall because it's in a brick's nature to move towards the ground': both are repackaging an observation as an explanation.

The idea is that the 'learning at different speeds' framing lets you treat grokking and double descent as the same phenomenon. It's more like generalizing 'bricks move towards the ground' and 'rocks move towards the ground' to 'objects move towards the ground'. I don't think we make any grand claims about explaining everything in the paper, but I'll have a look and see if there are edits I should make. Thanks for raising these points.

There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
  • Better alignment research (p3) lets you do more p1 type risky stuff with SOTA models (on the margin)


If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.

A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).

Prong 1: boxing & capability control (aka ‘careful bootstrapping’)

  • Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff.
  • Assume the AGI is misaligned. Be super paranoid.
  • Goal: get useful work out of boxed AGIs.
    • For example, AIs might be able to do interpretability really well.
    • More generally, for any field there is an effective 2x-10x research productivity boost you can probably get, depending on the field (maybe 100x? unclear, I’m sceptical).
  • Goal: learn more about AGI by studying the systems you've built.
  • To test your security & oversight procedures, you run tests where you train your AGI to escape in a sandbox.
  • This is operationally hard and needs lots of buy-in
  • Research directions: scalable oversight, interpretability for oversight, auditing, adversarial training, capability control / “unlearning”, scaling laws & capabilities forecasting.

Prong 2: scary demos and convincing people that AGI is dangerous

  • Goal 1: shut it all down, or failing that slow down capabilities research.
  • Goal 2: get operational & political support for the entire approach, which is going to need lots of support, especially the first prong.
  • In particular make sure that research productivity boosts from AGI don’t feed back into capabilities research, which requires high levels of secrecy + buy-in from a large number of people.
    • Avoiding a speed-up is probably a little bit easier than enacting a slow-down, though maybe not much easier.
  • Demos can get very scary if we get far into prong 1, e.g. we have AGIs that are clearly misaligned or show that they are capable of breaking many of our precautions.

Prong 3: alignment research aka “understanding minds”

  • Goal: understand the systems well enough to make sure they are at least corrigible, or at best ‘fully aligned’.
  • Roughly this involves understanding how the behaviour of the system emerges in enough generality that we can predict and control what happens once the system is deployed OOD, made more capable, etc.
  • Relevant directions: agent foundations / embedded agency, interpretability, some kinds of “science of deep learning”

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO neither making the field of alignment 10x larger nor requiring evals solves a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

  1. Ability to be deceptively aligned
  2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
  3. Incentives to break containment exist in a way that is accessible / understandable to the model
  4. Ability to break containment
  5. Ability to robustly understand human intent
  6. Situational awareness
  7. Coherence / robustly pursuing its goal in a diverse set of circumstances
  8. Interpretability methods break (or other oversight methods break)
    1. This doesn’t have to be because of deceptiveness; maybe thoughts are just too complicated at some point, or in a different place than you’d expect
  9. Capable enough to help us exit the acute risk period

Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.

Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of 'full alignment' (CEV-style), any alignment method (eg corrigibility) only works within a specific range of capabilities:

  • Too much capability breaks alignment, eg bc a model self-reflects and sees all the ways in which its objectives conflict with human goals.
  • Too little capability (or too little 'coherence') and any alignment method will be non-robust wrt OOD inputs or even small improvements in capability or self-reflectiveness.

I like that mini-game! Thanks for the reference

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

FWIW I would love to see the result of you two actually playing a few rounds of this game.

More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.

Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?
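
To make the quoted policy concrete, here is a minimal Python sketch under assumptions of my own: strict preferences are given as a set of (better, worse) pairs that is already transitively closed, any unlisted pair counts as a preferential gap, and the option names A, A- and B come from the standard single-souring money pump rather than from the quoted text. The class name `PolicyAgent` and its methods are hypothetical.

```python
class PolicyAgent:
    """Toy agent with incomplete preferences that follows the quoted policy.

    Assumption: strict_prefs is a set of (better, worse) pairs, already
    transitively closed. Pairs listed in neither direction are gaps.
    """

    def __init__(self, strict_prefs):
        self.strict_prefs = set(strict_prefs)
        self.turned_down = set()  # every option this agent has passed up so far

    def disprefers(self, x, y):
        """True if x is strictly dispreferred to y."""
        return (y, x) in self.strict_prefs

    def permissible(self, options):
        """Options that are neither dominated now nor ruled out by the policy."""
        return [
            o for o in options
            # ordinary rule: not strictly worse than another available option
            if not any(self.disprefers(o, other) for other in options)
            # quoted policy: not strictly worse than anything turned down earlier
            and not any(self.disprefers(o, past) for past in self.turned_down)
        ]

    def choose(self, options, pick):
        """Take `pick` if it is permissible and remember what was turned down."""
        if pick not in self.permissible(options):
            raise ValueError(f"{pick!r} is not permissible here")
        self.turned_down.update(o for o in options if o != pick)
        return pick


# Single-souring money pump: A is strictly better than A-, B is a gap with both.
agent = PolicyAgent({("A", "A-")})
agent.choose(["A", "B"], "B")             # trading A for B is allowed (gap)
print(agent.permissible(["B", "A-"]))     # the policy now rules out A-

fresh = PolicyAgent({("A", "A-")})
print(fresh.permissible(["B", "A-"]))     # with no history, both are allowed
```

In this sketch the agent's history shrinks its permissible set at the second choice, which is one way of reading the question above: past choices end up constraining it much as a completed preference ordering would, at least for options comparable to something it has already turned down.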

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sure).

I agree with much of the rest of this post, eg the paragraphs beginning with "The solutions to these two problems are pretty different."

Here's our definition in the RL setting for reference (from

A deep RL agent is trained to maximize a reward $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$, where $\mathcal{S}$ and $\mathcal{A}$ are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. Goal misgeneralization occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward $R' \neq R$. We call $R$ the intended objective and $R'$ the behavioral objective of the agent.

FWIW I think this definition is flawed in many ways (for example, the type signature of the agent's inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment's state space; and also it's generally sketchy to extend the reward function beyond the training distribution), but I don't know of a different definition that doesn't have similarly-sized flaws.
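
For concreteness, here is a toy Python sketch of the situation the definition describes. Everything in it is an illustrative assumption rather than the paper's setup: a 1-D corridor, a coin that sits at the right wall during training, and a hand-written "always go right" policy standing in for what a trained agent might have learned. The function names `run_episode` and `go_right` are hypothetical.

```python
def run_episode(policy, coin_pos, length=10, start=5, max_steps=20):
    """Roll out a policy in a 1-D corridor.

    Returns (intended, behavioral):
      intended   = 1.0 if the agent ever stood on the coin (intended objective)
      behavioral = 1.0 if the agent ended at the right wall (behavioral objective)
    """
    pos = start
    got_coin = False
    for _ in range(max_steps):
        # move one step and clamp to the corridor
        pos = max(0, min(length - 1, pos + policy(pos)))
        if pos == coin_pos:
            got_coin = True
    intended = 1.0 if got_coin else 0.0
    behavioral = 1.0 if pos == length - 1 else 0.0
    return intended, behavioral


def go_right(pos):
    """Stand-in for a learned policy: always step right."""
    return 1


# Training distribution: the coin is always at the right wall, so both
# objectives agree and the policy looks fine.
print(run_episode(go_right, coin_pos=9))

# Out-of-distribution: the coin moves left of the start. The policy still
# competently reaches the wall, but collects no intended reward.
print(run_episode(go_right, coin_pos=2))
```

The point of the sketch is only that the two objectives are indistinguishable on the training distribution and come apart after the shift, while the agent's competence at the behavioral objective is unchanged.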

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).
