Chris Olah recently released a tweet thread describing how the Anthropic team thinks about AI alignment difficulty. On this view, there is a spectrum of possible scenarios ranging from ‘alignment is very easy’ to ‘alignment is impossible’, and we can frame AI alignment research as a process of increasing the probability of beneficial outcomes by progressively addressing these scenarios. I think this framing is really useful, and here I have expanded on it by providing a more detailed scale of AI alignment difficulty and explaining some considerations that arise from it.
The discourse around AI safety is dominated by detailed conceptions of potential AI systems and their failure modes, along with ways to ensure their safety. This article by the DeepMind safety team provides an overview of some of these failure modes. I believe that we can understand these various threat models through the lens of "alignment difficulty" - with varying sources of AI misalignment sorted from easy to address to very hard to address, and attempt to match up technical AI safety interventions with specific alignment failure mode scenarios. Making this uncertainty clearer makes some debates between alignment researchers easier to understand.
An easier scenario could involve AI models generalising and learning goals in ways that fit with common sense. For example, it could be the case that LLMs of any level of complexity are best understood as generative frameworks over potential writers, with Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI (CAI) selecting only among potential writers. This is sometimes called ‘alignment by default’.
A hard scenario could look like that outlined in ‘Deep Deceptiveness’, where systems rapidly and unpredictably generalise in ways that quickly obsolete previous alignment techniques, and they also learn deceptive reward-hacking strategies that look superficially identical to good behaviour according to external evaluations, red-teaming, adversarial testing or interpretability examinations.
When addressing the spectrum of alignment difficulty, we should examine each segment separately.
If we assume that transformative AI will be produced, then the misuse risk associated with aligned transformative AI does not depend on how difficult alignment is. Therefore, misuse risk is a relatively bigger problem the easier AI alignment is.
In easy scenarios, we should therefore allocate more resources to issues like structural risk, economic implications, misuse, and geopolitical problems. On the ‘harder’ end of easy, where RLHF-trained systems typically end up honestly and accurately pursuing oversimplified proxies for what we want, like ‘improve reported life satisfaction’ or ‘raise the stock price of company X’, we also have to worry about scenarios like Production Web or What Failure Looks Like 1, which require a mix of technical and governance interventions to address.
Intermediate scenarios are cases where behavioural safety isn’t good enough and the easiest ways to produce Transformative AI result in dangerous deceptive misalignment. This is when systems work against our interests but pretend to be useful and safe. This scenario requires us to push harder on alignment work and explore promising strategies like scalable oversight, AI assistance on alignment research and interpretability-based oversight processes. We should also focus on governance interventions to ensure the leading projects have the time they need to actually implement these solutions and then use them (in conjunction with governments and civil society) to change the overall strategic landscape and eliminate the risk of misaligned AI.
In contrast, if alignment is as hard as pessimistic scenarios suggest, intense research effort in the coming years to decades may not be enough to make us confident that Transformative AI can be developed safely. Alignment being very hard would call for robust testing and interpretability techniques to be applied to frontier systems. This would reduce uncertainty, demonstrate the truth of the pessimistic scenario, and build the motivation to stop progress towards Transformative AI altogether.
If we knew confidently whether we were in an optimistic or pessimistic scenario, our available options would be far simpler. Strategies that are strongly beneficial, and even necessary, for successful alignment in an easy scenario are dangerous and harmful in a hard scenario, and in the extreme will cause an existential catastrophe. This makes figuring out what to do much more difficult.
Here I will present a scale of increasing AI alignment difficulty, with ten levels corresponding to what techniques are sufficient to ensure sufficiently powerful (i.e. Transformative) AI systems are aligned. I will later get a bit more precise about what this scale is measuring.
At each level, I will describe what techniques are sufficient for the alignment of TAI, and then describe roughly what that scenario entails about how TAI behaves, and then list the main failure modes at that level.
Misuse and/or recklessness with training objectives.
RL of powerful models towards badly specified or antisocial objectives is still possible, including accidentally through poor oversight, recklessness or structural factors.
We need to ensure that the AI behaves well even in edge cases by guiding it more carefully using human feedback in a wide range of situations, not just crude instructions or hand-specified reward functions.
When done diligently, RLHF fine-tuning works. One reason to think alignment will be this easy is if systems are naturally inductively biased towards honesty and representing the goals humans give them. In that case, they will tend to learn simple, honest and obedient strategies even if these are not the optimal policy to maximise reward.
Even if human feedback is sufficient to ensure models roughly do what their overseer intends, systems widely deployed in the economy may still for structural reasons end up being trained to pursue crude and antisocial proxies that don’t capture what we really want.
Misspecified rewards / ‘outer misalignment’ / structural failures where systems don’t learn adversarial policies but do learn to pursue overly crude and clearly underspecified versions of what we want, e.g. the production web or WFLL1.
Behavioural Safety is Insufficient
Past this point, we assume, following Ajeya Cotra, that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human-simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that this system would still under-perform on sufficiently superhuman behavioral evaluations.)
We need methods to ensure that human-like oversight of AIs continues even for problems unaided humans can’t supervise.
Therefore, we need methods which, unlike Constitutional AI, get AIs to apply humanlike supervision more effectively than humans can. Some strategies along these lines are discussed here.
There are many sub-approaches here, outlined for example in the ELK report.
Human feedback is an insufficiently clear signal to align superhuman systems and so must be augmented.
Deceptive Human-Simulators arise by default in situationally aware AIs, but this tendency can be eliminated with superhuman behavioural feedback.
At this stage, we are entrusting the AIs aligned using techniques like those in 1-4 to perform research on better methods of oversight and to augment human understanding. We are then using those research outputs to improve our oversight processes or improve the overseer AI’s understanding of the behaviour of the AI in training.
There are many potential approaches here, including techniques like IDA and debate, which are discussed here.
Conceptual or Mechanistic Interpretability tools are used as part of the (AI augmented) oversight process.
Processes internal to the AIs that seem to correlate with deceptiveness can be detected and penalised by the AI or Human+AI overseers developed in 4 and 5.
The ELK report discusses some particular approaches to this, such as penalising correlates of deceptive thinking (like excessive computation time spent on simple questions).
Deceptive Human-Simulators arise by default in situationally aware AIs, even given perfect behavioural feedback.
At this level, even when we use the techniques in 2-6, AI systems routinely defeat oversight and continue unwanted behaviour. They do this by altering their internal processes to avoid detection by interpretability tools, and by ‘playing the training game’ to seem behaviourally safe. Crucially, though, it is still possible to contain these systems.
Therefore we can conduct (potentially dangerous) experiments with these AI systems to understand how they might generalise post-deployment. Here we would employ the interpretability and oversight tools in 4, 5 and 6 and attempt to elicit misgeneralization and reward hacking behaviour from AIs. But we wouldn’t try to remove those behaviours using oversight-based training.
Instead, using techniques like red-teaming we try to understand the basis on which they occur, learning practical tricks and new theoretical insights about how and why misalignment arises, allowing us to develop new mitigations.
Sharp Left Turn
Past this point, we assume, following Nate Soares, that when systems become capable enough there is a sharp left turn. This is where AIs suddenly become much more strategically aware and able to consider a very wide range of plans, such that they quickly become able to avoid any attempts to align their behaviour.
All of the problematic behaviours either emerge too quickly to respond, or emerge only when systems are deployed, so experiments like those in 7 are useless or cannot be performed safely.
For this capability level and following Soares’ view, we need to have an understanding in advance of how AI models generalise their learning. Hopefully, these foundational insights will inspire new techniques for alignment.
There is a ‘sharp left turn’ where a system suddenly gains new concepts and much greater intelligence and generality, obsoleting previous alignment techniques.
Systems post-SLT are superintelligent and cannot be experimented on safely or contained.
Chris Olah’s original thread was quite informal, but I thought it would be useful to flesh this model out in more detail. His original 'how difficult is alignment' scale described various technical or mathematical projects, including building a working steam engine or solving P vs. NP, which represent an informal scale measuring, roughly, ‘how hard the alignment problem is overall’. But we can be more precise about what this scale measures. And we can place existing approaches to alignment along this alignment difficulty scale.
So, to be more precise about what the scale measures, I define alignment difficulty as the degree of complexity and effort required to successfully guide an AI system's behaviour, objectives, and decisions to conform with human values and expectations, well enough to effectively mitigate the risks posed by unaligned, potentially dangerous AI systems.
Scaling techniques indefinitely
One perspective on 'well enough' is that a technique should scale indefinitely, implying that it will continue to be effective regardless of the system's intelligence. For example, if RLHF always works on arbitrarily powerful systems, then RLHF is sufficient. This supersedes other criteria: if a technique always works, it will also work for a system powerful enough to help mitigate any risk.
Techniques which produce positively transformative AI
An alternative perspective is that a technique works ‘well enough' if it can robustly align AI systems that are powerful enough to be deployed safely (maybe in a research lab, maybe in the world as a whole), and that these AI systems are transformative in ways that reduce the overall risk from unaligned AI. Call such systems ‘positively transformative AI’.
Positively transformative AI systems could reduce the overall risk from AI by: preventing the construction of a more dangerous AI; changing something about how global governance works; instituting surveillance or oversight mechanisms widely; rapidly and safely performing alignment research or other kinds of technical research; greatly improving cyberdefense; persuasively exposing misaligned behaviour in other AIs and demonstrating alignment solutions, and through many other actions that incrementally reduce risk.
One common way of imagining this process is that an aligned AI could perform a ‘pivotal act’ that solves AI existential safety in one swift stroke. However, it is important to consider this much wider range of ways in which one or several transformative AI systems could reduce the total risk from unaligned transformative AI.
As well as disagreeing about how hard it is to align an AI of a certain capability, people disagree on how capable an aligned AI must be to be positively transformative in some way. For example, Jan Leike et al. think that there is no known indefinitely scalable solution to the alignment problem, but that we won’t need an indefinitely scalable alignment technique, just one that is good enough to align a system that can automate alignment research.
Therefore, the scale should be seen as covering increasingly sophisticated techniques which work to align more and more powerful and/or adversarial systems.
There’s a common debate in the alignment community about what counts as ‘real alignment research’ vs just advancing capabilities, see e.g. here for one example of the debate around RLHF, but similar debates exist around more sophisticated methods like interpretability-based oversight.
This scale helps us understand where this debate comes from.
Any technique on this scale which is insufficient to solve TAI alignment will (if it succeeds) make a system appear to be ‘aligned’ in the short term. Many techniques, such as RLHF, also make a system more practically useful and therefore more commercially viable and more capable. Spending time on these alignment techniques also trades off with other kinds of alignment work.
Given our uncertainty about what techniques are and aren’t sufficient, the boundary between what counts as ‘just capabilities research’ vs ‘alignment research’ isn’t exact. In other words, before the (unknown) point at which a given technique becomes sufficient, advancing less effective alignment techniques at best diverts resources away from more useful work, and more likely just advances AI capabilities whilst helping to conceal misaligned behaviour.
However, crucially, since we don’t know where on the scale the difficulty lies, we don’t know whether working on a given technique is counterproductive or not. Additionally, as we go further along the scale it becomes harder and harder to make progress. RLHF is already widely used and applied to cutting-edge systems today, with constitutional AI not far away (2-3), whereas the hopes for coming up with a new AI paradigm that’s fundamentally safer than deep learning (9) seem pretty thin.
Therefore, we get the phenomenon of, "everyone more optimistic about alignment than me is just working on capabilities."
I think that acknowledging our uncertainty about alignment difficulty takes some of the force out of arguments that e.g., constitutional AI research is net-negative. This kind of research is beneficial in some worlds, even though it creates negative externalities and could make AIs more dangerous in worlds where alignment is harder.
Chris Olah’s original point was that given this uncertainty we should aim to push the frontier further than it is already, so the optimal strategy is to promote any method which is past the ‘present margin of safety research’, i.e. would counterfactually not get worked on and works to reduce the risk in some scenarios.
In summary, I have:
I’d be interested to know what people think of my attempted ordering of alignment techniques by increasing sophistication and matching them up with the failure modes they’re meant to address. I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.
Oversight which employs interpretability tools would catch failure modes that oversight without interpretability tools wouldn't, i.e. failure modes where, before deployment, a system seems safe according to superhuman behavioural oversight, but is actually being deceptive and power seeking.
Systems that aren't being deceptive in a strategically aware way might be misaligned in subtler ways that are still very difficult to deal with for strategic or governance reasons (e.g. strong competitive economic pressures to build systems that pursue legible goals), so you might object to calling this an 'easy' problem just because in this scenario RLHF doesn't lead to deceptive strategically aware agents. However, this is a scale of the difficulty of technical AI alignment, not how difficult AI existential safety is overall.
Ajeya Cotra’s report on AI takeover lists this as a potential solution: ‘We could provide training to human evaluators to make them less susceptible to manipulative and dishonest tactics, and instruct them to give reward primarily or entirely based on whether the model followed honest procedures rather than whether it got good results. We could try to use models themselves to help explain what kinds of tactics are more and less likely to be manipulative' - as an example of something that might do better than behavioral safety.
Therefore, I do not take her report to be arguing that external behavioural evaluation will fail to eliminate deception. Instead, I interpret her as arguing that any human-level external behavioural evaluation won’t work but that superior oversight processes can work to eliminate deception. However, even superhuman behavioural oversight is still a kind of 'behavioral safety'.
It is essential to distinguish the sharp left turn from very fast AI progress due to e.g., AIs contributing a continuously but rapidly increasing fraction of work to AI research efforts. While fast progress could pose substantial governance problems, it wouldn’t mean that oversight-based approaches to alignment fail.
If progress is fast enough but continuous, then we might have a takeoff that looks discontinuous and almost identical to the sharp left turn from the perspective of the wider world outside AI labs. However, within the labs it would be quite different, because running oversight strategies where more and more powerful AIs oversee each other would still be feasible.
Take any person’s view of how difficult alignment is, accounting for their uncertainty over this question. It could be possible to model the expected benefit of putting resources into a given alignment project, knowing that it could help to solve the problem, but it could also make the problem worse if it merely produces systems which appear safe. Additionally, this modelling has to account for how putting resources into this alignment solution takes away resources from other solutions (the negative externality). Would it be productive to try to model the optimal allocation of resources in this way, and what would the result of this modelling be?
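As a minimal sketch of what such a model could look like, here is a toy version in Python. Everything in it is an assumption made for illustration: the function name `expected_value`, the `benefit`, `harm` and `tractability` parameters, and the two belief distributions are all made up, and the model ignores the opportunity cost of diverting resources from other techniques. It only captures the core trade-off: a technique helps if the true difficulty is at or below the level it suffices for, and otherwise just produces systems that appear safe.

```python
from typing import Sequence


def expected_value(
    belief: Sequence[float],
    sufficient_up_to: int,
    benefit: float = 1.0,
    harm: float = 0.5,
    tractability: float = 1.0,
) -> float:
    """Toy expected net value of funding one alignment technique.

    belief[i] is the probability that the true difficulty is level i + 1.
    The technique is assumed to solve alignment at levels 1..sufficient_up_to
    and, at any harder level, to only make systems look safe (a net harm).
    All parameters are hypothetical and purely illustrative.
    """
    if abs(sum(belief) - 1.0) > 1e-9:
        raise ValueError("belief must sum to 1")
    if not 0 <= sufficient_up_to <= len(belief):
        raise ValueError("sufficient_up_to must lie within the scale")
    # Probability that the true difficulty is within the technique's reach.
    p_sufficient = sum(belief[:sufficient_up_to])
    # Both the benefit and the harm only materialise if the research succeeds,
    # so tractability scales the whole expression.
    return tractability * (p_sufficient * benefit - (1.0 - p_sufficient) * harm)


# Hypothetical beliefs over the ten difficulty levels (made-up numbers).
optimist = [0.3, 0.3, 0.2, 0.1, 0.05, 0.05, 0.0, 0.0, 0.0, 0.0]
pessimist = list(reversed(optimist))

# The same level-2 technique, evaluated under each set of beliefs.
ev_optimist = expected_value(optimist, sufficient_up_to=2)
ev_pessimist = expected_value(pessimist, sufficient_up_to=2)
print(ev_optimist, ev_pessimist)
```

Under these made-up numbers the same technique has positive expected value for the optimist and negative expected value for the pessimist, which is the disagreement described in the main text. A more serious version would need to allocate a fixed budget across several techniques at once, so that the negative externality of crowding out other work shows up explicitly.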
I would order these differently.
Within the first section (prompting/RLHF/Constitutional):
The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness - e.g. that thing with the hand which looks like it's grabbing a ball but isn't is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness. Among the three options, "Constitutional" AI applies the most optimization pressure toward deceiving humans (IIUC), RLHF the next most, whereas prompting alone provides zero direct selection pressure for deception; it is by far the safest option of the three. (Worlds Where Iterative Design Fails talks more broadly about the views behind this.)
Next up, I'd put "Experiments with Potentially Catastrophic Systems to Understand Misalignment" as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn't notice when it's in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn't reveal), then we don't really need oversight tools in the first place. Just test the thing and see if it misbehaves.
The oversight stuff would be the next three hardest worlds (5th-7th). As written I think they're correctly ordered, though I'd flag that "AI research assistance" as a standalone seems far safer than using AI for oversight. The last three seem correctly-ordered to me.
I'd also add that all of these seem very laser-focused on intentional deception as the failure mode, which is a reasonable choice for limiting scope, but sure does leave out an awful lot.
deception induced by human feedback does not require strategic awareness - e.g. that thing with the hand which looks like it's grabbing a ball but isn't is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness
The phenomenon that a 'better' technique is actually worse than a 'worse' technique if both are insufficient is something I talk about in a later section of the post and I specifically mention RLHF. I think this holds true in general throughout the scale, e.g. Eliezer and Nate have said that even complex interpretability-based oversight with robustness testing and AI research assistance is also just incentivizing more and better deception, so this isn't unique to RLHF.
But I tend to agree with Richard's view in his discussion with you under that post: if you condition on deception occurring by default, RLHF is worse than just prompting (i.e. prompting is better in harder worlds), but RLHF is better than just prompting in easy worlds. I also wouldn't call non-strategically aware pursuit of inaccurate proxies for what we want 'deception', because in this scenario the system isn't being intentionally deceptive.
In easy worlds, the proxies RLHF learns are good enough in practice and cases like the famous thing with the hand which looks like it's grabbing a ball but isn't just disappear if you're diligent enough with how you provide feedback. In that world, not using RLHF would get systems pursuing cruder and worse proxies for what we want that fail often (e.g. systems just overtly lie to you all the time, say and do random things etc.). I think that's more or less the situation we're in right now with current AIs!
If the proxies that RLHF ends up pursuing are in fact close enough, then RLHF works and will make systems behave more reliably and be harder to e.g. jailbreak or provoke into random antisocial behavior than with just prompting. I did flag in a footnote that the 'you get what you measure' problem that RLHF produces could also be very difficult to deal with for structural or institutional reasons.
Next up, I'd put "Experiments with Potentially Catastrophic Systems to Understand Misalignment" as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn't notice when it's in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn't reveal), then we don't really need oversight tools in the first place.
I'm assuming you meant fourth-easiest here not fourth hardest. It's important to note that I'm not here talking about testing systems to see if they misbehave in a sandbox and then if they don't assuming you've solved the problem and deploying. Rather, I'm talking about doing science with models that exhibit misaligned power seeking, with the idea being that we learn general rules about e.g. how specific architectures generalize, why certain phenomena arise etc. that are theoretically sound and we expect to hold true even post deployment with much more powerful systems. Incidentally this seems quite similar to what the OpenAI superalignment team is apparently planning.
So it's basically, "can we build a science of alignment through a mix of experimentation and theory". So if, e.g., we study in a lab setting a model that's been fooled into thinking it's been deployed and then commits a treacherous turn, and we do this enough times, maybe we can figure out the underlying cause of the behavior and get new foundational insights? Maybe we can try to deliberately get AIs to exhibit misalignment and learn from that. It's hard to anticipate in advance what scientific discoveries will and won't tell you about systems, and I think we've already seen cases of experiment-driven theoretical insights, like simulator theory, that seem to offer new handles for solving alignment. How much quicker and how much more useful will these be if we get the chance to experiment on very powerful systems?
This helpful article by Holden Karnofsky also describes an increasing scale of alignment difficulty, although it's focused on a narrower range of the scale than mine (his scale covers 4-7) and is a bit more detailed about the underlying causes of the misalignment. Here's how my scale relates to his:
The "playing the training game" threat model, where systems behave deceptively only to optimize in-episode reward, corresponds to an alignment difficulty level of 4 or higher. This is because scalable oversight without interpretability tools (level 4) should be sufficient to detect and address this failure mode. The AI may pretend to be helpful during training episodes, but oversight exposing it to new situations will reveal its tendency toward deception.
(Situationally aware) Deception by default corresponds to a difficulty level of 6. If misaligned AIs form complex inner goals and engage in long-term deception, then scalable oversight alone will not catch intentionally deceptive systems that can maintain consistent deceitful behavior. Only interpretability tools used as part of the oversight process (level 6) give us the ability to look inside the system and identify deceptive thought patterns and tendencies.
Finally, the gradient hacking threat model, where AIs actively manipulate their training to prevent alignment, represents an alignment difficulty of 7 or higher. Even interpretability-based oversight can be defeated by sufficiently sophisticated systems that alter their internals to dodge detection.