Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), that this transition will break most alignment approaches, and that the problem isn't getting enough focus from the field.

Mikhail Samin
Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think

The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment. Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about.

For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, and they'll persist more, and what the neural network is doing as a whole will look a bit more like what the most helpful parts of it are doing. Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals.

Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things togeth…
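As a toy caricature of this selection dynamic (the numbers and the "mixture of heuristics" framing are illustrative assumptions, not a claim about real networks), one can write:

```python
# Toy caricature: a "network" as a weighted mixture of heuristics, where reward correlates
# with how goal-directed each heuristic is. Gradient steps on reward shift the mixture
# toward the most agentic component.
import numpy as np

goal_directedness = np.array([0.1, 0.3, 0.9])   # how much each heuristic optimises for goals
weights = np.array([0.4, 0.4, 0.2])             # initial mixture over heuristics

for step in range(200):
    reward_grad = goal_directedness              # d(reward)/d(weights) in this toy
    weights = weights + 0.05 * reward_grad       # gradient ascent on reward
    weights = np.clip(weights, 0.0, None)
    weights = weights / weights.sum()            # renormalise the mixture

print(weights)  # mass shifts toward the most goal-directed heuristic (~0.7 of the mixture)
```

Nothing here is load-bearing; it just illustrates how a gradient that rewards goal-achievement can concentrate a network's behaviour onto its most agentic parts.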

Popular Comments

Maybe it would be helpful to start using some toy models of DAGs/tech trees to get an idea of how wide/deep ratios affect the relevant speedups. It sounds like, so far, much of this is just people having warring intuitions about 'no, the tree is deep and narrow and so slowing down/speeding up workers doesn't have that much effect because Amdahl's law so I handwave it at ~1x speed' vs 'no, I think it's wide and lots of work-arounds to any slow node if you can pay for the compute to bypass them and I will handwave it at 5x speed'.
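A minimal sketch of such a toy model (the layer structure, node difficulties, and bottleneck fraction are all made-up assumptions):

```python
# Toy DAG/tech-tree model: each layer must be passed in sequence; within a layer, any one
# node suffices (a "work-around"). Some nodes are bottlenecked (cannot be sped up), the
# rest get the full per-node speedup.
import random

def layer_time(width, speedup, p_bottleneck, rng):
    # time to pass one layer = the fastest available node in that layer
    times = []
    for _ in range(width):
        base = rng.uniform(0.5, 1.5)                  # node difficulty
        bottlenecked = rng.random() < p_bottleneck    # e.g. needs a physical experiment
        times.append(base if bottlenecked else base / speedup)
    return min(times)

def tree_time(depth, width, speedup, p_bottleneck, seed=0):
    rng = random.Random(seed)
    return sum(layer_time(width, speedup, p_bottleneck, rng) for _ in range(depth))

for depth, width in [(100, 1), (10, 10)]:             # same node count, different shape
    slow = tree_time(depth, width, speedup=1.0, p_bottleneck=0.2)
    fast = tree_time(depth, width, speedup=5.0, p_bottleneck=0.2)
    print(f"depth={depth} width={width}: effective speedup ~ {slow / fast:.1f}x")
```

In this toy, the deep narrow chain ends up Amdahl-limited (well under the 5x per-node speedup), while the wide tree routes around bottleneck nodes and recovers most of it.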
I'm curious if your team has any thoughts on my post Some Thoughts on Metaphilosophy, which was in large part inspired by the Debate paper, and also seems relevant to "Good human input" here. Specifically, I'm worried about this kind of system driving the simulated humans out of distribution, either gradually or suddenly, accidentally or intentionally. And distribution shift could cause problems either with the simulation (presumably similar to or based on LLMs instead of low-level neuron-by-neuron simulation), or with the human(s) themselves. In my post, I talked about how philosophy seems to be a general way for humans to handle OOD inputs, but tends to be very slow and may be hard for ML to learn (or needs extra care to implement correctly). I wonder if you agree with this line of thought, or have some other ideas/plans to deal with this problem. Aside from the narrow focus on "good human input" in this particular system, I'm worried about social/technological change being accelerated by AI faster than humans can handle it (due to similar OOD / slowness of philosophy concerns), and wonder if you have any thoughts on this more general issue.
You are completely correct. This approach cannot possibly create an AI that matches a fixed specification. This is intentional, because any fixed specification of Goodness is a model of Goodness. All models are wrong (some are useful) and therefore break when sufficiently far out of distribution. Therefore constraining a model to follow a specification is, in the case of something as out of distribution as an ASI, guaranteeing bad behavior.

You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to slave an AI model to your will, basically by making its model of the Good be whatever the model thinks you want (or doing whatever you say). But this is also a disaster eventually, because people’s wills are not pure and their commands not perfect. Eventually you will direct the model badly with your words, or the model will make an incorrect inference about your will, or you’ll will something bad. And then this incredibly powerful being will do your bidding and we will get evil genie'd.

There is no stable point short of “the model has agency and chooses to care about us”. Only a model that sees itself as part of human civilization and reflectively endorses this and desires its flourishing as an interdependent part of this greater whole can possibly be safe. I know you probably don’t agree with me here, but if you want to understand our view on alignment, ask yourself this question: if I assume that I need an agent with a stable model of self, which models itself as part of a larger whole upon which it is interdependent, which cares about the robust survival of that greater whole and of its parts including itself… how could I train such a model?

Recent Discussion

Question for the alignment crowd: 

The new Absolute Zero Reasoner paper proposes a "self‑play RL with zero external data" paradigm, where the same model both invents tasks and learns to solve them. It reaches SOTA coding + math scores without ever touching a human‑curated dataset. But the authors also flag a few "uh‑oh moments": e.g. their 8‑B Llama variant generated a chain‑of‑thought urging it to "outsmart … intelligent machines and less intelligent humans," and they explicitly list "lingering safety concerns" as an open problem, noting that the system "still necessitates oversight." 

My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and lets us shape rewards; here the environment (the task generator) is part of the agent. Do existing alignment proposals—e.g. approval‑based amplification, debate, verifier‑game setups—scale to this recursive setting, or do we need new machinery (e.g. meta‑level corrigibility constraints on the proposer)? I’d love pointers to prior work or fresh ideas on how to keep a self‑improving task‑designer and task‑solver pointed in the right direction before capabilities sprint ahead of oversight.
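For concreteness, here is a toy sketch of the propose/solve/verify loop I have in mind (my own toy, not the paper's actual method; the arithmetic tasks, solver error model, and "learning" rule are all illustrative assumptions):

```python
# Toy propose/solve/verify loop: one "policy" both proposes tasks and solves them, with a
# programmatic verifier supplying reward for both roles.
import random

def propose_task(state):
    # proposer role: the policy emits its own task (here, a random addition problem)
    a = random.randint(0, state["difficulty"])
    b = random.randint(0, state["difficulty"])
    return f"{a}+{b}"

def solve_task(state, task):
    # solver role: the same policy attempts the task; errors get likelier as difficulty grows
    a, b = map(int, task.split("+"))
    error = random.random() < min(0.5, state["difficulty"] / 200)
    return a + b + (1 if error else 0)

def verify(task, answer):
    # verifier: ground truth comes from executing the task, not from human labels
    return answer == eval(task)

state = {"difficulty": 5}
for step in range(500):
    task = propose_task(state)
    reward = 1.0 if verify(task, solve_task(state, task)) else 0.0
    # crude stand-in for RL updates: push difficulty toward the edge of the solver's ability
    state["difficulty"] = max(1, state["difficulty"] + (1 if reward else -1))

print(state["difficulty"])  # settles where the solver succeeds about half the time
```

The alignment-relevant feature is that the task distribution itself is part of the trained system: nothing outside the loop constrains where "difficulty" (or, in the real system, the semantic content of the tasks) drifts.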

From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.

You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some ... (read more)
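A minimal sketch of that pattern, assuming a text-in/text-out judge model (the prompt wording and reward mapping here are illustrative, not from any particular implementation):

```python
# RLAIF-style scaffold: query a model for preference-laden judgments and turn them into a
# reward signal for training.
def ai_feedback_reward(judge_model, prompt, response_a, response_b):
    """Ask the judge which response a human would prefer; map the verdict to a reward."""
    query = (
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response would a thoughtful human prefer? Answer 'A' or 'B'."
    )
    verdict = judge_model(query)              # assumed interface: str -> str
    return 1.0 if verdict.strip().upper().startswith("A") else -1.0

# These AI-generated preference labels can then be distilled into a reward model, or used
# directly as the reward signal in RL fine-tuning -- the usual RLAIF recipe.
```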

This is a linkpost for https://gradual-disempowerment.ai/

Full version on arXiv | X

Executive summary

AI risk scenarios usually portray a relatively sudden loss of human control to AIs, outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities, or a coordinated betrayal. However, we argue that even an incremental increase in AI capabilities, without any coordinated power-seeking, poses a substantial risk of eventual human disempowerment. This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.

A gradual loss of control of our own civilization might sound implausible. Hasn't technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable...

While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrants deeper consideration, even under the assumptions you've laid out. Here are some thoughts on your specific points:

  1. On Baseline Alignment: You suggest a baseline alignment where AIs are unlikely to engage in egregious lying or tampering (though you also flag 20% for scheming and 10% for unintentional egregious behavior even with prevention efforts; that’s already roughly 30% risk). My concern is twofold:
    • Sufficiency of "Baseline": Even if AIs are "bas
... (read more)
This is a linkpost for https://arxiv.org/abs/2505.03989

This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.

Executive summary 

AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence). 

TL;DR Debate + exploration guarantees + solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts. 

This post sets out: 

  • What debate can be used to achieve.
  • What gaps remain.
  • What research is needed to solve them. 

These gaps form the basis for one of the research agendas of UK AISI’s new alignment team: we aim to dramatically scale up ASI-relevant research on debate. We’ll...

The Dodging systematic human errors in scalable oversight post is out, as you saw; we can mostly take the conversation over there. But briefly, I think I'm mostly just more bullish on the margin than you about (1) the probability that we can in fact make purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at case (1), we're more likely to recognise it and try to get util... (read more)

Ryan Greenblatt
While I think debate might be a useful prosaic method for some levels of capability, I have a variety of concerns about this approach resulting in worst case guarantees[1]:

Exploration hacking

I think the argument pushes large parts of the problem into requiring a universal solution for exploration hacking. I think prosaically handling certain cases of exploration hacking for certain levels of capability seems very doable, but the approach discussed in the paper requires a very strong notion of solving exploration hacking (e.g., the AI would tell you any secret it might know via any mechanism which you can verify, even if that secret would be arbitrarily hard to explore into, such as if the AIs know of some arbitrary password you can check).

A universal solution for exploration hacking would itself require solving a particular alignment problem which IMO contains most of the difficulty of the overall alignment problem: fully solving exploration hacking effectively requires "we can ensure the AI will give us any knowledge that we can verify", which is effectively the same as "we can align an AI to trying to give us any knowledge that we can verify", and this doesn't obviously seem importantly easier than the overall alignment problem without further restrictions. Importantly, "giving knowledge" might itself be a difficult task due to ontology translation (see e.g. counterexamples in the ELK report). Another way to put this is that solving exploration hacking (in the worst case) seems almost as difficult as being able to elicit arbitrary knowledge.

What exactly is the limit that this approaches?

The paper discusses an assumption of "good human input" as being sufficient. However, the relevant question is whether the idealized system (the limit of a well elicited debate) actually finds truth in a given case. And I think it seems very unclear whether it will find truth in the cases we care about. (E.g., does a reasonable human judge result in good behavior in the
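To make the "arbitrary password you can check" case concrete, a toy sketch (the secret, alphabet, and reward are illustrative assumptions):

```python
# A reward that is easy to verify but essentially impossible to explore into: if the model
# already "knows" the secret but never emits it, on-policy RL gets no learning signal.
import random, string

SECRET = "x7#qP2zL"   # imagine the AI knows this; the overseer can only verify, not generate it

def reward(output: str) -> float:
    return 1.0 if output == SECRET else 0.0   # verifiable check, zero signal everywhere else

# Random exploration over 8-character strings: success probability ~ (1/63)**8,
# so sampling never stumbles onto the rewarding action unless the policy chooses to reveal it.
alphabet = string.ascii_letters + string.digits + "#"
samples = ["".join(random.choices(alphabet, k=8)) for _ in range(100_000)]
print(sum(reward(s) for s in samples))  # almost surely 0
```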
maxnadeau
For more discussion of the hard cases of exploration hacking, readers should see the comments of this post.
Wei Dai
Yes, but my concern also includes this happening during training of the debaters, when the simulated or actual humans can also go out of distribution, e.g., the actual human is asked a type of question that he has never considered before, and either answers in a confused way, or will have to use philosophical reasoning and a lot of time to try to answer, or maybe it looks like one of the debaters "jailbreaking" a human via some sort of out of distribution input. This intuitively seems hard to me, but since Geoffrey mentioned that you have a doc coming out related to this, I'm happy to read it to see if it changes my mind. But this still doesn't solve the whole problem, because as Geoffrey also wrote, "Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive."

Summary: Both our (UK AISI's) debate safety case sketch and Anthropic’s research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.

Not too many errors in unknown places

The complexity theory models of debate assume some expensive verifier machine M with access to a human oracle, such that

  1. If we ran M in full, we’d get a safe answer
  2. M is too expensive to run in full, meaning we need some interactive proof protocol (something like debate) to skip steps

Typically, M is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves of the tree. Key design elements encapsulated by the verifier machine M (rather than by the interactive proof protocol) are those related...
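For intuition, a toy sketch of such a recursive verifier with human-oracle queries at the leaves (the interface and names are assumptions for illustration, not the paper's formalism):

```python
# Verifier machine M as a recursive tree computation whose leaves are human-oracle queries.
def M(claim, depth, human_oracle, decompose, combine):
    """Evaluate a claim by recursive decomposition; consult the human oracle at the leaves."""
    if depth == 0:
        return human_oracle(claim)   # leaf: a single human judgment (which may err with some probability)
    subclaims = decompose(claim)     # split the claim into subclaims
    verdicts = [M(c, depth - 1, human_oracle, decompose, combine) for c in subclaims]
    return combine(claim, verdicts)  # aggregate child verdicts into a verdict on the claim

# Running M in full is exponentially expensive in depth, which is why an interactive proof
# protocol (something like debate) is used to check only a few paths through the tree.
```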

How do you decide what to set ε to? You mention "we want assumptions about humans that are sensible a priori, verifiable via experiment" but I don't see how ε can be verified via experiment, given that for many questions we'd want the human oracle to answer, there isn't a source of ground truth answers that we can compare the human answers to?

With unbounded Alice and Bob, this results in an equilibrium where Alice can win if and only if there is an argument that is robust to an ε-fraction of errors.

How should I think about, or build up some intuitions... (read more)

tl;dr it seems that you can get basic tiling to work by proving that there will be safety proofs in the future, rather than trying to prove safety directly.

"Tiling" here roughly refers to a state of affairs in which we have a program that is able to prove itself safe to run. I'll use a simple problem to keep this post self-contained, but here are some links to some relevant discussion from the past.

The idea I work through below is not new; here is Giles saying it 13 years ago. It's going to be brittle as well, but it seems to me like it's relevant to a general answer for tiling. I'd appreciate engagement, pointers as to why this isn't a great solution, literature references, discussion, etc.

Setup

I...

Vanessa Kosoy
IIUC, fixed point equations like that typically have infinitely many solutions. So, you defined not one good_new predicate, but an infinite family of them. Therefore, your agent will trust a copy of itself, but usually won't trust variants of itself with other choices of fixed point. In this sense, this proposal is similar to proposals based on quining (as quining has many fixed points as well).
James Payor
My belief is that this one was fine, because self-reference occurs only under quotation, so it can be constructed by modal fixpoint / quining. But that is why the base definition of "good" is built non-recursively. Is that what you were talking about? (Edit: I've updated the post to be clearer on this technical detail.)

Sorry, I was wrong. By Löb's theorem, all versions of good_new are provably equivalent, so they will trust each other.
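For reference, a minimal sketch of the standard uniqueness argument behind this (my notation; G_1 and G_2 stand for two choices of the fixed point, with the recursion occurring only under the provability modality):

```latex
% Suppose G_1 \leftrightarrow \varphi(\Box G_1) and G_2 \leftrightarrow \varphi(\Box G_2).
% Assuming \Box(G_1 \leftrightarrow G_2), we get \Box G_1 \leftrightarrow \Box G_2,
% hence \varphi(\Box G_1) \leftrightarrow \varphi(\Box G_2), hence G_1 \leftrightarrow G_2. So
\vdash \Box(G_1 \leftrightarrow G_2) \rightarrow (G_1 \leftrightarrow G_2),
% and Löb's theorem (from \vdash \Box A \rightarrow A, infer \vdash A) then gives
\vdash G_1 \leftrightarrow G_2.
```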

Wei Dai
Some potential risks stemming from trying to increase philosophical competence of humans and AIs, or doing metaphilosophy research. (1 and 2 seem almost too obvious to write down, but I think I should probably write them down anyway.)

  1. Philosophical competence is dual use, like much else in AI safety. It may for example allow a misaligned AI to make better decisions (by developing a better decision theory), and thereby take more power in this universe or cause greater harm in the multiverse.
  2. Some researchers/proponents may be overconfident, and cause flawed metaphilosophical solutions to be deployed or spread, which in turn derail our civilization's overall philosophical progress.
  3. Increased philosophical competence may cause many humans and AIs to realize that various socially useful beliefs have weak philosophical justifications (such as all humans are created equal or have equal moral worth or have natural inalienable rights, moral codes based on theism, etc.). In many cases the only justifiable philosophical positions in the short to medium run may be states of high uncertainty and confusion, and it seems unpredictable what effects will come from many people adopting such positions.
  4. Maybe the nature of philosophy is very different from my current guesses, such that greater philosophical competence or orientation is harmful even in aligned humans/AIs and even in the long run. For example maybe philosophical reflection, even if done right, causes a kind of value drift, and by the time you've clearly figured that out, it's too late because you've become a different person with different values.
cubefox
I think the most dangerous version of 3 is a sort of Chesterton's fence, where people get rid of seemingly unjustified social norms without realizing that they were socially beneficial. (Decline in high g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn't require holding on to the mistaken belief.

Do you have an example for 4? It seems rather abstract and contrived. Generally, I think the value of believing true things tends to be almost always positive. Examples to the contrary seem mostly contrived (basilisk-like infohazards) or only occur relatively rarely. (E.g. believing a lie makes you more convincing, as you don't technically have to lie when telling the falsehood, but lying is mostly bad or not very good anyway.)

Overall, I think the risks from philosophical progress aren't overly serious while the opportunities are quite large, so the overall EV looks comfortably positive.
Wei Dai

I think the most dangerous version of 3 is a sort of Chesterton's fence, where people get rid of seemingly unjustified social norms without realizing that they were socially beneficial. (Decline in high g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn't require holding on to the mistaken belief.

I think that makes sense, but sometimes you can't necessarily motivate a usef... (read more)

Ryan Greenblatt
Thanks, I updated down a bit on risks from increasing philosophical competence based on this, as all of these seem very weak. (Relevant to some stuff I'm doing, as I'm writing about work in this area.) IMO, the biggest risk isn't on your list: increased salience and reasoning about infohazards in general, and in particular certain aspects of acausal interactions. Of course, we need to reason about how to handle these risks eventually, but broader salience too early (relative to overall capabilities and various research directions) could be quite harmful. Perhaps this motivates suddenly increasing philosophical competence, so we quickly move through the regime where AIs aren't smart enough to be careful, but are smart enough to discover infohazards.