Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Under John Wentworth
When trying to tackle a hard problem, a generally effective opening tactic is to Hold Off On Proposing Solutions: to fully discuss the problem and its different facets first. This is intended to prevent you from anchoring to a particular pet solution and (if you're lucky) to gather enough evidence that you can see what a Real Solution would look like. We wanted to directly tackle the hardest part of the alignment problem, and make progress towards a Real Solution, so when we had to choose a project for SERI MATS, we began by arguing in a Google doc about what the core problem is. This post is a cleaned-up version of that doc.
The overall problem of alignment is making sure that an Artificial General Intelligence with potentially superhuman capabilities does not use these capabilities to do things that humanity would not want. There are many reasons this might happen, such as instrumentally convergent goals or orthogonality.
In each section below we make a different case for what the "core of the alignment problem" is. It's possible we misused some terminology when naming each section.
The document is laid out as follows: We have two supra-framings on alignment: Outer Alignment and Inner Alignment. Each of these is then broken down further into subproblems. Some of these specific problems are quite broad, and cut through both Outer and Inner alignment. We've tried to put problems in the sections we think fit best (and, when neither fits best, collected them in an Other category), though reasonable people may disagree with our classifications. In each section, we've laid out some cruxes, which are statements that support that frame on the core of the alignment problem. These cruxes are not necessary or sufficient conditions for a problem to be central.
The core of the alignment problem is being able to precisely specify what we value, so that we can train an AGI on this, deploy it, and have it do things we actually want it to do. The hardest part of this is being mathematically precise about 'what we value', so that it is robust to optimization pressure.
The hardest part of this problem is being able to point robustly at anything in the world at all (cf. Diamond Maximizer). We currently have no way to robustly specify even simple, crisply defined tasks, and if we want an AI to be able to do something like 'maximize human values in the Universe,' the first hurdle we need to overcome is having a way to point at something that doesn't break in calamitous ways off-distribution and under optimization pressure. Once we can actually point at something, the hope is that this will enable us to point the AGI at some goal that we are actually okay with applying superhuman levels of optimization power to.
There are different levels at which people try to tackle the pointers problem: some tackle it on the level of trying to write down a utility function that is provably resilient to large optimization pressure, and some tackle it on the level of trying to prove things about how systems must represent data in general (e.g. selection theorems).
Cruxes (around whether this is the highest priority problem to work on)
The hardest part of alignment is getting the AGI to generalize the values we give it to new and different environments. We can only ever test the AGI's behavior on a limited number of samples, and these samples cannot cover every situation the AGI will encounter once deployed. This means we need to find a way to obtain guarantees that the AGI will generalize these concepts when out-of-distribution in the way that we'd want, and well enough to be robust to intense optimization pressure (from the vast capabilities of the AGI).
If we are learning a value function then this problem falls under outer alignment, because the distribution shift breaks the value function. On the other hand, if you are training an RL agent, this becomes more of an inner alignment problem.
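As a toy illustration of the limited-sample problem described above (a minimal sketch, not a model of any real training setup), the snippet below fits a straight line to samples of a curved function on a narrow range and then queries it far outside that range. The function `true_value`, the ranges, and every other name here are hypothetical choices made purely for illustration.

```python
import math


def true_value(x: float) -> float:
    """Hypothetical 'intended' function; stands in for what we actually want."""
    return math.sin(x)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Training distribution: 51 evenly spaced points in [0, 0.5].
train_xs = [i / 100 for i in range(51)]
train_ys = [true_value(x) for x in train_xs]
a, b = fit_line(train_xs, train_ys)


def learned_value(x: float) -> float:
    """The fitted stand-in for true_value."""
    return a * x + b


def max_abs_error(xs):
    """Worst-case gap between the learned and intended functions on xs."""
    return max(abs(learned_value(x) - true_value(x)) for x in xs)


in_dist_error = max_abs_error(train_xs)

# "Deployment" inputs far outside the training range: [3.0, 3.5].
ood_xs = [3 + i / 100 for i in range(51)]
ood_error = max_abs_error(ood_xs)

print(f"max error in distribution:     {in_dist_error:.4f}")
print(f"max error out of distribution: {ood_error:.4f}")
```

On the training range the line is nearly indistinguishable from the intended function, so no amount of testing on that range reveals the problem. The error only appears on inputs the training samples never covered.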
The best way to solve this problem is to specify a utility function that, for the most part, avoids instrumentally convergent goals (power seeking, preventing being turned off). This will allow us to make an AGI that is deferential to humans, so that we can safely perform a pivotal act, and hopefully buy us enough time to solve the alignment problem more robustly.
Encoding what we actually want in a loss function is, in fact, too hard and will not be solved, so we will ultimately end up training the AGI on a proxy. But proxies for what we value are generally correlated with what we actually value, until you start optimizing on the proxy somewhat unreasonably. An AGI will apply this unreasonable optimization pressure, since it has no reason to 'understand what we mean, rather than what we said.'
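The way a proxy can track what we value under mild optimization and then come apart under extreme optimization can be sketched with a deliberately simple, deterministic toy. Both `proxy` and `true_value` are hypothetical functions chosen for illustration, and `search_limit` is an assumed stand-in for the amount of optimization pressure applied.

```python
def proxy(x: float) -> float:
    """What we wrote down: 'more x is better'."""
    return x


def true_value(x: float) -> float:
    """What we meant (hypothetical): x helps at first, then hurts."""
    return x - x * x / 10.0


def optimize_proxy(search_limit: int) -> int:
    """Pick the candidate with the highest proxy score among 0..search_limit.

    A larger search_limit stands in for more optimization pressure.
    """
    return max(range(search_limit + 1), key=proxy)


for limit in (2, 5, 10, 100):
    choice = optimize_proxy(limit)
    print(
        f"limit={limit:>3}  proxy={proxy(choice):>6.1f}  "
        f"true={true_value(choice):>8.1f}"
    )
```

In this toy, pushing the proxy from 2 to 5 also improves the true value, but pushing it to 100 keeps raising the proxy while the true value becomes strongly negative. The proxy never stops looking better from the optimizer's point of view.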
The core of the alignment problem is figuring out how to induce inner values into an AGI from an outer training loop. If we cannot do this, then an AGI might end up optimizing for something that corresponded to the outer objective on the training set but generalizes poorly.
The hardest part of this problem is avoiding the instantiation of malign learned optimizers within the AGI. These arise when training on the base reward function does not in fact cause the AGI to learn to optimize for that reward function, but instead to optimize for some mesa-objective that obtains good performance in the training environment.
One key insight for why this is the core of the alignment problem is that human intelligence is a mesa-optimizer induced by evolution. Evolution found intelligence as a good method for performing well on the 'training set' of the ancestral environment, but now that human intelligence has moved to a new environment where we have things like contraception, the mesa-objective of having sex has decoupled from the base objective of maximizing inclusive genetic fitness, and we pursue the former and do not much care about the latter.
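The decoupling of a mesa-objective from the base objective can be sketched in a toy corridor world. This is only an illustration under assumed names: the policies `seek_goal` and `always_right`, the corridor width, and the episode sets are all hypothetical. Both policies score perfectly in training, where the goal always happens to be on the right, so training performance alone cannot tell them apart.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int, int], int]  # (position, goal) -> step of -1 or +1


def seek_goal(position: int, goal: int) -> int:
    """Policy that pursues the base objective: step toward the goal."""
    return 1 if goal > position else -1


def always_right(position: int, goal: int) -> int:
    """Policy with a different objective: 'go right', ignoring the goal."""
    return 1


def reaches_goal(policy: Policy, start: int, goal: int,
                 width: int = 11, max_steps: int = 20) -> bool:
    """Run one episode in a corridor of `width` cells; True if the goal is hit."""
    position = start
    for _ in range(max_steps):
        if position == goal:
            return True
        step = policy(position, goal)
        position = min(width - 1, max(0, position + step))  # walls at both ends
    return position == goal


def success_rate(policy: Policy, episodes: List[Tuple[int, int]]) -> float:
    """Fraction of (start, goal) episodes in which the policy reaches the goal."""
    return sum(reaches_goal(policy, s, g) for s, g in episodes) / len(episodes)


# Training: the goal is always the rightmost cell (index 10).
train_episodes = [(start, 10) for start in range(10)]
# Deployment: the goal is the leftmost cell (index 0).
deploy_episodes = [(start, 0) for start in range(1, 11)]

for name, policy in (("seek_goal", seek_goal), ("always_right", always_right)):
    print(
        f"{name:>12}: train={success_rate(policy, train_episodes):.1f} "
        f"deploy={success_rate(policy, deploy_episodes):.1f}"
    )
```

A training process that selects only on training success has no reason to prefer `seek_goal` over `always_right`. The difference shows up only once the environment changes.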
Some ways that mesa-optimization can happen:
The hardest part of this problem is being able to translate from a human ontology to an AGI's ontology. An AGI is likely to use a different ontology from us, potentially radically different, and also to learn new ontologies once deployed. Translating from a human ontology to an AGI's ontology is going to be hard enough, but we also need translation mechanisms that are robust to (potentially very large) ontology shifts.
We think that a lot of the paths to impact for interpretability research are as solutions to this problem. Whether it is: using interpretability to obtain guarantees that an AGI has learned precisely the concepts that we want, using interpretability tools in training to incentivize the AGI to learn the concepts we want, using interpretability to improve the efficiency/effectiveness of human feedback on AGI models during training, or other more detailed interpretability agendas; most of interpretability's impact comes down to its use in ontology identification.
We will end up using reward-circuitry mechanisms to train an AGI to optimize for what we want. However, when we do this, it may instead learn to optimize only for maximizing its reward-circuitry. The hardest part of inner alignment is preventing these 'reward-optimizers' from developing, since as these get more capable, they will start pretending to be aligned to the outer objective in the training process, and then once deployed and given access to their reward-circuitry, they will wirehead. This entails the AGI maximizing its reward circuitry, and then taking actions to maintain that high reward value, including eliminating threats that may try to turn it off or modify it in any way (i.e. us).
When an agent interacts with the world, there are two possible ways the agent makes mistakes: its values are bad, or its world model is bad.
Thus, the training process of an AGI will improve its values or its world model, and since it eventually gets diminishing marginal returns from both of these, both the world model and the values must improve together. Therefore, it is very likely that the agent will have a sufficiently good world model to understand that it is in a training loop before it has fully aligned inner values.
This means we are likely to see deception: the agent will act aligned to the outer objective while it's in the training phase, in order to get humans to deploy it in the real world where it can pursue its true objectives. If we can see this misalignment then hitting it with gradient descent should be able to fix it, but deception gets around this, so the core problem in alignment is avoiding deceptive alignment during training.
The core problem in alignment is to figure out how to make an AI that does not act like an agent (and avoids malign subagents), and get this AI to solve the alignment problem (or perform a pivotal act). This tries to avoid the problem of corrigibility, by developing AIs that aren't (generally capable) optimizers (and hence won't stop you from turning them off).
The core of alignment is the specific distribution shift that happens at general intelligence: the sharp left turn, in which the AI goes from only being able to do narrow tasks similar to what it was trained on to having general capabilities that allow it to succeed on different tasks.
Capabilities generalize by default: in a broad range of environments, there are feedback loops that incentivize the agent to be capable. When operating in the real world, you can keep on getting information about how well you are doing, according to your current utility function.
However, you can't keep on getting information about how "good" your utility function is. There is no analogous feedback loop for alignment, nothing pushing the agent towards "what we meant" in situations far from the training distribution. In a separate utility function model, this problem appears when the utility function doesn't generalize well, and in a direct policy selection model, this problem appears in the policy selector.
The frame we think gets at the core problem best is (drumroll please) distribution shift: robustly pointing to the right goal/concepts when OOD or under extreme optimization pressure. This frame gives us a good sense of why mesa-optimizers are bad, fits well with the sharp left turn framing, and explains why ontology identification is important. Even though this is what we landed on, it should not be the main takeaway -- the real post was the framings we made along the way.