Thanks to Evan Hubinger for comments on these ideas.
A “sharp left turn” is a point where capabilities generalize beyond alignment. In a sharp left turn an AI becomes much more capable than it is aligned, and so starts to exploit flaws in its alignment.
This can look like Goodharting, where strong optimization pressure causes outer alignment failure because the base goal isn’t identical to what we want. Or it can look like deceptive alignment, where the model is aligned to proxy goals that aren’t identical to the base goal, but we don’t notice until the model is capable enough to make the proxies fail.
However they happen, sharp left turns are bad.
An important question, though, is: when will the first sharp left turn happen?
Consider three scenarios:
- Weak: The sharp left turn happens before models are able to cause existential risks. For instance, maybe GPT-4 has a sharp left turn but isn’t able to reliably execute on its plans.
- Strong: The sharp left turn happens once capabilities are very super-human and dangerous if misused. In this world AIs are doing the bulk of the world’s scientific research by the time alignment unravels.
- Human-level: The sharp left turn happens while capabilities are around human level, meaning models are dangerous but not yet game-changingly helpful for research.
I think that the most dangerous of these is the human-level world.
In the “weak” scenario, we get to see the sharp left turn coming. We potentially get lots of experience with models looking aligned and then that alignment failing with increasing scale. In this world, we effectively get multiple shots at alignment because the first badly-misaligned models just aren’t that dangerous. We get to experiment with deceptively aligned models and get a grounded empirical understanding of why models become deceptively aligned and how we might stop that. In this world, alignment research gets a chance to become a paradigmatic scientific field before the models get too capable.
In the “strong” scenario, we get to do alignment research with super-human models that are still aligned. In this world, we don’t have to solve the technical challenge of alignment entirely on our own. Alignment research potentially gets much further along before we have to have an airtight solution and prevent sharp left turns.
Both of these seem like pretty hopeful worlds. By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools. We have to solve the problem entirely on our own, and we have to do it without any useful empirical feedback. This seems really bad.
The timing of sharp left turns depends on a few capability scales, some of which I conflated above:
- The danger scale, above which the model is capable enough to pose an existential risk.
- The research scale, above which the model is able to do scientific research better than humans.
- The left-turn scale, above which the model is able to undergo a sharp left turn. A useful anchor here is given by deceptive alignment, which suggests the scale at which a model can successfully reason about its own training process.
The three scenarios I outlined are characterized by different orderings of these scales:
- Weak: (left-turn < danger)
- Strong: (research < left-turn)
- Human-level: (left-turn ~ research, left-turn ~ danger)
There can be other orderings, leading to other outcomes. For instance, we could live in a world with (danger < left-turn < research). In that world, the left turn happens at a scale that’s dangerous but incapable of super-human research. Imagine a model that’s very capable of developing and executing plans using information freely available on the internet. Such a model could pose an existential risk by engineering pathogens that we already know how to make, without doing any novel R&D, and deploying them with competent but not superhuman logistics.
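The orderings above can be sketched as a small classifier. This is only an illustration under stated assumptions: the function name `classify_scenario`, the idea of placing the three scales on a single numeric axis, and the tolerance `tol` used to interpret "~" are all hypothetical choices, not something the argument depends on. Since the weak and strong conditions are not mutually exclusive (research < left-turn < danger satisfies both), the sketch returns every label that applies.

```python
from typing import List


def approx(a: float, b: float, tol: float) -> bool:
    """Interpret 'a ~ b' as 'within tol of each other' (tol is an assumed cutoff)."""
    return abs(a - b) <= tol


def classify_scenario(danger: float, research: float, left_turn: float,
                      tol: float = 1.0) -> List[str]:
    """Label a world by the ordering of its three capability scales.

    All three arguments are positions on one hypothetical capability axis
    (arbitrary units); larger means the threshold is reached later.
    """
    # Human-level: the left-turn scale sits near both other scales.
    if approx(left_turn, research, tol) and approx(left_turn, danger, tol):
        return ["human-level"]

    labels = []
    if left_turn < danger:
        # Weak: alignment unravels before models are dangerous.
        labels.append("weak")
    if research < left_turn:
        # Strong: models out-research humans before alignment unravels.
        labels.append("strong")
    # Anything else, e.g. danger < left_turn < research, is left unnamed.
    return labels or ["other"]


print(classify_scenario(danger=10, research=20, left_turn=5))     # ['weak']
print(classify_scenario(danger=10, research=20, left_turn=30))    # ['strong']
print(classify_scenario(danger=10, research=11, left_turn=10.5))  # ['human-level']
print(classify_scenario(danger=5, research=20, left_turn=10))     # ['other']
```

The last call corresponds to the (danger < left-turn < research) world described above: dangerous at the time of the left turn, but not yet capable of super-human research.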
Capabilities are (sort of) multi-dimensional
Of course capabilities aren’t strictly one-dimensional, so talking about “the scale of capabilities” conflates a few different ideas. Concretely, I think a model with fixed research ability could be either more or less dangerous, depending on other dimensions of its capabilities like “competence interacting with people” and “robust plan design”. These skills vary significantly in humans, and it seems plausible that we can engineer models to be more or less capable along these different dimensions.
That’s not to say that there’s complete freedom here. There’s definitely correlation between capabilities (e.g. lots of capabilities improve with scale). And in the limit of extreme research ability a model should be able to learn whatever other capabilities it needs. Intelligence is closely related to general purpose optimization ability, and to the extent that this holds we should expect different capabilities to scale together.
But we might live in worlds where it only takes models twice as capable as humans to solve the technical challenges of alignment, and those models may not be reflectively stable and may have lots of flawed abilities. In those worlds there’s a lot of benefit to thinking about which capabilities we want to weaken in models, and to trying to arrange that.
If we could know when to expect a sharp left turn, that might change what we do. If we knew that a sharp left turn would happen long before dangerous capabilities, we might focus on developing alignment benchmarks and more empirical study of alignment failures. If we knew that it would happen long after powerful research capabilities, we might focus on applying those capabilities towards alignment work.
It seems plausible we can learn more about this timing question by studying low-likelihood outputs, studying how far capabilities are one-dimensional versus multi-dimensional, and reasoning about the kinds of capabilities that are dangerous and the kinds that are helpful.
Along similar lines, even if we don’t know when to expect a sharp left turn, there may be things we can do to shift that timing in a favorable direction. For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically (shifting towards the "weak" worlds). And we can preferentially focus alignment efforts on whatever the state of the art is at any time (currently large language models), rather than on very different architectures, so as to delay left turns until after we get superhuman models we can use to help with alignment research.
The usual framing is to align agent policies in a way that also ensures they don't undergo an out-of-distribution phase change. With deceptive alignment, sharp left turns, and Goodharting, the capability to consider complicated plans, or mere optimization-in-actuality, leads to situations that are too far out-of-distribution. On-distribution behavior stops being a good indication of what will happen there, independently of modern ML's difficulties with robustness. In this framing, the training distribution also needs to confer alignment, so it is anchored to the more usual situations that can currently be captured in practice, while more capable or over-optimizing agents discover more unusual situations than that. This anchoring of the character of the training distribution might be a problem.
It seems useful to deconfuse agents that don't have that sort of phase change in their behavior, without being distracted by the separate problem of their alignment. When not focusing on aligned agents, it becomes more natural to consider arbitrary agents in arbitrary situations, including situations that can't be obtained from the world as training data and so won't be useful for ensuring alignment, and to treat such situations as training data for ML purposes.
The simulator framing is an example of this point, except it still pays too much attention to currently observable reality (even as simulators themselves are bad at that task). A simulator is a map of counterfactuals and not an agent. It's not about describing particular agents; instead it describes all sorts of agents and their interactions in all sorts of situations simultaneously. Can a simulator be trained to be particularly good at describing agents that make coherent decisions across a wide variety of situations, including situations that can't be collected as training data from the world, or won't occur in actuality, and need to instead be generated, with little hope or intent of remaining predictive of the real world? Or maybe such coherence is more naturally found in something smaller than agents: a collective of intents/purposes/concepts/norms that each shapes a scope of situations where it's relevant, competing/bargaining with others of its kind to find an equilibrium that doesn't have the more jarring sorts of behavioral phase changes, especially capability-related ones?