Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Under John Wentworth

# Introduction

When trying to tackle a hard problem, a generally effective opening tactic is to Hold Off On Proposing Solutions: to fully discuss a problem and the different facets and aspects of it. This is intended to prevent you from anchoring to a particular pet solution and (if you're lucky) to gather enough evidence that you can see what a Real Solution would look like. We wanted to directly tackle the hardest part of the alignment problem, and make progress towards a Real Solution, so when we had to choose a project for SERI MATS, we began by arguing in a Google doc about what the core problem is. This post is a cleaned-up version of that doc.

## The Technical Alignment Problem

The overall problem of alignment is, for an Artificial General Intelligence with potentially superhuman capabilities, making sure that the AGI does not use these capabilities to do things that humanity would not want. There are many reasons this might happen, such as instrumentally convergent goals or the orthogonality thesis.

## Layout

In each section below we make a different case for what the "core of the alignment problem" is. It's possible we misused some terminology when naming each section.

The document is laid out as follows. We have two supra-framings on alignment: Outer Alignment and Inner Alignment. Each of these is then broken down further into subproblems. Some of these specific problems are quite broad and cut through both Outer and Inner Alignment. We've tried to put each problem in the section we think fits best (and, when neither fits, collected them in an Other category), though reasonable people may disagree with our classifications. In each section, we've laid out some cruxes, which are statements that support that frame on the core of the alignment problem. These cruxes are not necessary or sufficient conditions for a problem to be central.

# Frames on outer alignment

The core of the alignment problem is being able to precisely specify what we value, so that we can train an AGI on this, deploy it, and have it do things we actually want it to do. The hardest part of this is being mathematically precise about 'what we value', so that it is robust to optimization pressure.

## The Pointers Problem

The hardest part of this problem is being able to point robustly at anything in the world at all (cf. Diamond Maximizer). We currently have no way to robustly specify even simple, crisply defined tasks, and if we want an AI to be able to do something like 'maximize human values in the Universe,' the first hurdle we need to overcome is having a way to point at something that doesn't break in calamitous ways off-distribution and under optimization pressure. Once we can actually point at something, the hope is that this will enable us to point the AGI at some goal that we are actually okay with applying superhuman levels of optimization power to.

There are different levels at which people try to tackle the pointers problem: some tackle it on the level of trying to write down a utility function that is provably resilient to large optimization pressure, and some tackle it on the level of trying to prove things about how systems must represent data in general (e.g. selection theorems).

Cruxes (around whether this is the highest priority problem to work on)

• This problem being tractable relies on some form of the Natural Abstractions Hypothesis.
• There is, ultimately, going to end up being a thing like "Human Values," that can be pointed to and holds up under strong optimization pressure.
• We are sufficiently confused about 'pointing to things in the real world' that we could not reliably train a diamond maximizer right now, if that were our goal.

## Distribution Shift

The hardest part of alignment is getting the AGI to generalize the values we give it to new and different environments. We can only ever test the AGI's behavior on a limited number of samples, and these samples cannot cover every situation the AGI will encounter once deployed. This means we need to find a way to obtain guarantees that the AGI will generalize these concepts when out-of-distribution in the way that we'd want, and well enough to be robust to intense optimization pressure (from the vast capabilities of the AGI).

If we are learning a value function, then this problem falls under outer alignment, because the distribution shift breaks the value function. On the other hand, if we are training an RL agent, this becomes more of an inner alignment problem.

Cruxes:

• Creating an AGI necessarily induces distribution shift because:
  • The real world changes from train to test time.
  • The agent becomes more intelligent at test time, which is itself a distribution shift.
• Understanding inductive biases well enough to get guarantees on generalization is tractable.
• We will be able to obtain bounds even for deep and fundamental distribution shifts.
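
To make the generalization worry concrete, here is a minimal sketch, not a model of any real training setup. It fits a straight line to a hypothetical ground truth (`true_fn`, chosen arbitrarily as a quadratic) using only samples from a narrow range, then compares the error on that range with the error far outside it. Every name and number in it is an assumption made for illustration.

```python
# Toy sketch: a model fit on a narrow training range looks fine there
# and fails badly far outside it. All names and values are illustrative.

def true_fn(x: float) -> float:
    """Hypothetical ground truth the model is supposed to capture."""
    return x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Training distribution: x only ever takes values in [0, 1].
train_xs = [i / 10 for i in range(11)]
a, b = fit_line(train_xs, [true_fn(x) for x in train_xs])


def model(x: float) -> float:
    """The learned linear approximation."""
    return a * x + b


def max_abs_error(xs) -> float:
    """Worst-case gap between the model and the ground truth on xs."""
    return max(abs(model(x) - true_fn(x)) for x in xs)


in_dist_error = max_abs_error(train_xs)
out_dist_error = max_abs_error([5.0, 10.0, 20.0])  # far outside [0, 1]

print(f"max error on the training range: {in_dist_error:.3f}")
print(f"max error off-distribution:      {out_dist_error:.3f}")
```

Nothing in the training samples distinguishes the line from the quadratic well enough to warn us. The failure only shows up once the inputs leave the range we were able to test.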

## Corrigibility

The best way to solve this problem is to specify a utility function that, for the most part, avoids instrumentally convergent goals (power seeking, preventing being turned off). This will allow us to make an AGI that is deferential to humans, so that we can safely perform a pivotal act, and hopefully buy us enough time to solve the alignment problem more robustly.

Cruxes:

• Corrigibility is further along the easier/safer Pareto-frontier than Coherent Extrapolated Volition of Humanity.
• Corrigibility is a concept that is more "natural" than human values.

## Goodharting

Encoding what we actually want in a loss function is, in fact, too hard and will not be solved, so we will ultimately end up training the AGI on a proxy. But proxies for what we value are generally correlated with what we actually value, until you start optimizing on the proxy somewhat unreasonably. An AGI will apply this unreasonable optimization pressure, since it has no reason to 'understand what we mean, rather than what we said.'

Cruxes:

• The inner values of an AGI will be a proxy for our values. In other words, they will not be a True Name for what we care about: when we apply optimization pressure, the AGI will perform worse, not better, as measured by our true preferences.
• We can get a soft-optimization proposal that works to solve this problem (instead of having the AGI hard-optimize something safe).
• It is either impossible, or too hard, to specify the correct loss function and so we will end up using a proxy.
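
The following is a minimal sketch of the pattern described above, under strong simplifying assumptions. `true_value` and `proxy` are hypothetical one-dimensional functions that agree in direction for small `x` and come apart for large `x`. "Optimization pressure" is modeled as nothing more than the number of hill-climbing steps taken on the proxy.

```python
# Toy sketch of Goodharting: the proxy keeps improving while the true
# value peaks and then falls. Both functions are made up for illustration.

def true_value(x: float) -> float:
    """Hypothetical 'what we actually want': rises until x = 5, then declines."""
    return x - 0.1 * x * x


def proxy(x: float) -> float:
    """Hypothetical measurable proxy: always rewards more x."""
    return x


def optimize_proxy(steps: int, step_size: float = 1.0) -> float:
    """Hill-climb on the proxy from x = 0; more steps means more pressure."""
    x = 0.0
    for _ in range(steps):
        # Take the step only if it increases the proxy (here it always does).
        if proxy(x + step_size) > proxy(x):
            x += step_size
    return x


for steps in (1, 5, 10, 20):
    x = optimize_proxy(steps)
    print(f"steps={steps:2d}  proxy={proxy(x):5.1f}  true value={true_value(x):6.1f}")
```

With mild optimization the proxy and the true value move together. Past the point where they decouple, every further step improves the proxy and makes things worse by the measure we actually care about.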

## General Cruxes for Outer Alignment

• Getting the AGI to do what we want it to do (when we learn how to specify that) is at least one of:
  • Not as hard.
  • Going to be solved anyway by making sufficient progress on these problems.
  • Solvable through architectural restrictions.
• The best way to make progress on alignment is to write down a utility function for an AI that:
  • Generalizes.
  • Is robust to large optimization pressure.
  • Specifies precisely what we want.
# Frames on inner alignment

The core of the alignment problem is figuring out how to induce inner values into an AGI from an outer training loop. If we cannot do this, then an AGI might end up optimizing for something that corresponded to the outer objective on the training set but generalizes poorly.

## Mesa-Optimizers

The hardest part of this problem is avoiding the instantiation of malign learned optimizers within the AGI. These arise when training on the base reward function does not in fact cause the AGI to learn to optimize for that reward function, but instead to optimize for some mesa-objective that obtains good performance in the training environment.

One key insight for why this is the core of the alignment problem is that human intelligence is a mesa-optimizer induced by evolution. Evolution found intelligence as a good method for performing well on the 'training set' of the ancestral environment. Now that human intelligence has moved to a new environment where we have things like contraception, the mesa-objective of having sex has decoupled from the base objective of maximizing inclusive genetic fitness, and we pursue the former and do not much care about the latter.

Some ways that mesa-optimization can happen:

• There is a learned inner optimizer, e.g. in a language model, that values things in the outside world, and so outputs things to hijack the output of the LM.
• You train an RL agent to accomplish a task, e.g. pick strawberries, but there is a distribution shift from training to test time. Though the goal was aligned on the training distribution, the actual inner goal (e.g. take red things and put them in a shiny metal basket) extrapolates off-distribution to pulling off someone's nose and throwing it at a lamppost.
• You prompt an LLM to accomplish a task, e.g. be a Twitter bot that spreads memes. You've now instantiated something that is optimizing the real world. The LLM's outer objective was just text prediction, but via prompting, we've induced a totally different mesa-objective.
  • Some people think that this doesn't count because the optimizer is still optimizing the outer objective of text autocompletion.
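
The strawberry example can be sketched as a toy in which two different goals are indistinguishable on the training items and disagree only on a new one. The `Item` type, the two goal functions, and the item lists are all hypothetical. The sketch shows only why training performance cannot tell the goals apart, not how an agent would come to learn one rather than the other.

```python
# Toy sketch of goal misgeneralization: the intended and learned goals
# agree on every training item and diverge on a novel one.
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    is_strawberry: bool
    is_red: bool


def intended_goal(item: Item) -> bool:
    """What we wanted the agent to pick: strawberries."""
    return item.is_strawberry


def learned_goal(item: Item) -> bool:
    """What the agent (hypothetically) learned to pick: red things."""
    return item.is_red


# On the training distribution, the only red things are strawberries.
train_items = [
    Item("strawberry", is_strawberry=True, is_red=True),
    Item("leaf", is_strawberry=False, is_red=False),
    Item("stone", is_strawberry=False, is_red=False),
]

# At test time a new kind of red object shows up.
test_items = train_items + [Item("nose", is_strawberry=False, is_red=True)]


def agreement(items) -> float:
    """Fraction of items on which the two goals make the same choice."""
    return sum(intended_goal(i) == learned_goal(i) for i in items) / len(items)


def disagreements(items):
    """Names of the items on which the two goals differ."""
    return [i.name for i in items if intended_goal(i) != learned_goal(i)]


print("agreement on training items:", agreement(train_items))
print("disagreements at test time: ", disagreements(test_items))
```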

Cruxes:

• We will not be able to prevent the instantiation of learned optimizers through architectural adaptations.
• Gradient descent selects for compressed and generalizable strategies, and optimization/search capabilities meet both of these requirements. See also: Conditions for Mesa-Optimization.

## Ontology Identification

The hardest part of this problem is being able to translate from a human ontology to an AGI's ontology. An AGI is likely to use a different ontology from us, potentially radically different, and also to learn new ontologies once deployed. Translating from a human ontology to an AGI's ontology is going to be hard enough, but we also need translation mechanisms that are robust to (potentially very large) ontology shifts.

We think that a lot of the paths to impact for interpretability research are as solutions to this problem. Whether it is: using interpretability to obtain guarantees that an AGI has learned precisely the concepts that we want, using interpretability tools in training to incentivize the AGI to learn the concepts we want, using interpretability to improve the efficiency/effectiveness of human feedback on AGI models during training, or other more detailed interpretability agendas; most of interpretability's impact comes down to its use in ontology identification.

Cruxes:

• The Natural Abstraction Hypothesis (NAH) will make this problem tractable; or, even if the NAH ends up holding only very weakly, we can overcome that with enough work.
• If working on circuits-style ontology identification, then at least one of:
  • AGI will look a lot like modern systems.
  • We will get good information about how to interpret future systems (although they may look very different) by working on modern systems.

## Wireheading

We will end up using reward-circuitry mechanisms to train an AGI to optimize for what we want; however, when we do this, it may instead learn to optimize only for maximizing its reward circuitry. The hardest part of inner alignment is preventing these 'reward-optimizers' from developing, since as these get more capable, they will start pretending to be aligned to the outer objective in the training process, and then once deployed and given access to their reward circuitry, they will wirehead. This entails the AGI maximizing its reward circuitry, and then taking actions to maintain that high reward value, including eliminating threats that may try to turn it off or modify it in any way (i.e. us).

Cruxes:

## Deception

When an agent interacts with the world, there are two possible ways the agent makes mistakes:

• Its values were not aligned with the outer objective, and so it does something intentionally wrong,
• Its world model was incorrect, so it makes an accidental mistake.

Thus, the training process of an AGI will improve its values or its world model, and since it eventually gets diminishing marginal returns from both of these, both the world model and the values must improve together. Therefore, it is very likely that the agent will have a sufficiently good world model to understand that it is in a training loop before it has fully aligned inner values.

This means we are likely to see deception: the agent will act aligned to the outer objective while it's in the training phase, in order to get humans to deploy it in the real world where it can pursue its true objectives. If we can see this misalignment then hitting it with gradient descent should be able to fix it, but deception gets around this, so the core problem in alignment is avoiding deceptive alignment during training.

Cruxes:

• Deception becomes a natural thing for an AGI to learn at high-enough capabilities levels.
• Detecting deception will be very hard at high-enough capabilities levels.

## Cruxes for Inner Alignment

• What matters is not so much the explicit outer utility function that we train the AGI on, but instead the values that the training process instantiates in the AGI.
• These values actually exist, and we're not just anthropomorphizing.
• The agent will learn to model the training process as a whole before it learns to value the utility function we are training it on.

# Other Frames

## Non-agentic AI/Oracle AI

The core problem in alignment is to figure out how to make an AI that does not act like an agent (and avoids malign subagents), and get this AI to solve the alignment problem (or perform a pivotal act). This tries to avoid the problem of corrigibility, by developing AIs that aren't (generally capable) optimizers (and hence won't stop you from turning them off).

Cruxes:

• A non-agentic AI can be intelligent enough to do something pivotal, e.g. writing the alignment textbook from the future.
• Training an LLM using methods like self-supervised learning (SSL) is accurately described as learning a distribution over text completions, and then conditioning on the prompt.
• You can simulate an optimizer without being an optimizer yourself.
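
To pin down what "learning a distribution over text completions, and then conditioning on the prompt" means, here is a deliberately tiny sketch. It is not a claim about how real LLMs work internally. The corpus is made up, `learn_distribution` just counts whole texts, and `condition_on_prompt` restricts that distribution to texts starting with the prompt and renormalizes.

```python
# Toy sketch: learn an empirical distribution over texts, then condition
# it on a prompt. Corpus and function names are illustrative assumptions.
from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat", "a dog ran"]


def learn_distribution(texts):
    """Empirical probability of each whole text in the corpus."""
    counts = Counter(texts)
    total = sum(counts.values())
    return {text: count / total for text, count in counts.items()}


def condition_on_prompt(dist, prompt):
    """Distribution over completions, given that the text starts with prompt."""
    matching = {t: p for t, p in dist.items() if t.startswith(prompt)}
    normalizer = sum(matching.values())
    if normalizer == 0:
        raise ValueError("prompt has zero probability under the distribution")
    # Strip the prompt so the keys are just the completions.
    return {t[len(prompt):]: p / normalizer for t, p in matching.items()}


dist = learn_distribution(corpus)
completions = condition_on_prompt(dist, "the cat")

for completion, prob in completions.items():
    print(repr(completion), prob)
```

On this picture the prompt does not change the learned distribution; it only selects which part of it we sample from. Whether that description is accurate for real systems is what the crux asks.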

## The Sharp Left Turn

The core of alignment is the specific distribution shift that happens at general intelligence: the sharp left turn. The AI goes from being able to do only narrow tasks similar to what it was trained on to having general capabilities that allow it to succeed on different tasks.

Capabilities generalize by default: in a broad range of environments, there are feedback loops that incentivize the agent to be capable. When operating in the real world, you can keep on getting information about how well you are doing, according to your current utility function.

However, you can't keep on getting information about how "good" your utility function is. There is nothing like this feedback loop for alignment, nothing pushing the agent towards “what we meant” in situations far from the training distribution. In a separate utility function model, this problem appears when the utility function doesn’t generalize well, and in a direct policy selection model, this problem appears in the policy selector.

Cruxes:

• There will be a large, sudden distribution shift from below human level to far superhuman level.
• There will be no way to keep the AGI on-distribution for its training data.
• Capabilities generalize faster than alignment.
• For a given RL training environment, “strategies” are more overdetermined than “goals”.

# Conclusion

The frame we think gets at the core problem best is (drumroll please) distribution shift: robustly pointing to the right goal/concepts when OOD or under extreme optimization pressure. This frame gives us a good sense of why mesa-optimizers are bad, fits well with the sharp left turn framing, and explains why ontology identification is important. Even though this is what we landed on, it should not be the main takeaway -- the real post was the framings we made along the way.
