Alignment proposals and complexity classes

by Evan Hubinger13 min read16th Jul 202026 comments

26

Research AgendasAI
Frontpage

In the original “AI safety via debate” paper, Geoffrey Irving et al. introduced the concept of analyzing different alignment proposals from the perspective of what complexity class they are able to access under optimal play. I think this is a pretty neat way to analyze different alignment proposals—in particular, I think it can help us gain some real insights into how far into the superhuman different systems are able to go. Thus, the goal of this post is to try to catalog different alignment proposals based on the metric of what complexity class they have so far been proven to access.

To do that, I have included a variety of new complexity class proofs in this post. Of particular note, I demonstrate that there exist forms of both imitative amplification and AI safety via market making that reach all the way up to —which is significant given that the largest complexity class that any alignment proposal was known to access previously was . Only the forms of amplification and market making making use of pointers (as in strong HCH), however, can access —for the pointer-less versions, I demonstrate in this post that they access and , respectively. The proof for market making is also particularly notable as it is the only approach on my list that ends up in that complexity class. Additionally, I also demonstrate that recursive reward modeling can reach all the way to , improving upon the previous best result in “Scalable agent alignment via reward modeling” that it accesses .

Before I jump in, however, some preliminaries. First, we'll assume that a human, , is polynomial-time such that can reliably solve any problem in but not anything beyond that. Second, we'll assume that our training procedure and resulting models are arbitrarily strong in terms of what complexity class they can access. Third, we'll assume that gets oracle access to the models during training. Then, we'll say that a proposal to train a model using a loss function accesses a complexity class iff, for any language , there exists some strategy available to such that, for any which is optimal under given 's strategy, . Thus, conceptually, a proposal accesses if there is a (polynomial-time) strategy that you (a human) can implement such that—conditional on you knowing that the model is optimal—you would trust the model's output for any problem in . Note that that is not the same as saying that a polynomial-time human would actually be able to verify that the result is correct—only that it will always be correct at optimum. Note that these assumptions are just generalizations of those used in “AI safety via debate.” Irving et al. actually note that, if you don't imagine optimal play and simply restrict to the set of problems that a polynomial-time human can actually verify, debate only reaches rather than .

Alignment proposals by complexity class

Without further ado, here's my list of alignment proposals grouped by what complexity class they access. All of the proofs below are only lower bounds rather than upper bounds, so the proposals could be stronger than is noted here, but shouldn't be weaker.

EDIT: This list is now out of date. See “Weak HCH accesses ” for the updated version.

: Imitation learning (very straightforward—given , optimal imitation of will also be in )

: AI safety via debate (proof), Imitative amplification with weak HCH (proof below), Approval-based amplification (proof below), Recursive reward modeling (proof below)

: AI safety via market making (proof below)

: Debate with cross-examination (proof)

: Imitative amplification with strong HCH (proof below), AI safety via market making with pointers (proof below)

Proofs

Note that while I am reasonably confident that my proofs are correct, I'm not a complexity theorist and a result I suspect that many of them are a lot uglier than they need to be. In particular, because I didn't feel like I was super confident in my understanding of different -complete and -complete problems, I just did everything with Turing machines instead.

Imitative amplification with weak HCH accesses

In general, we'll define imitative amplification to be the procedure which trains (where is some set of questions and is some distribution over answers) on

where is the probability density function for evaluated at and uses some distribution with support over all of . Then, at optimal loss, we get

where is the loss at data point such that is a delta function probability density, in which case we will just ignore the probability distribution and let .

What we have left unspecified here, however, is the format of 's output—which matters quite a lot, as it makes the difference here between weak HCH and strong HCH and thus the difference between and .

For weak HCH, we will let and in be the set of all finite strings in natural language. Thus, since is polynomial-time, that restricts the output of to always be polynomial in size.

Now, we will show that this procedure accesses . is defined as the set of all problems that can be solved by Turing machines using space where is the input size (in this case the question length) and is some polynomial in .

Thus, for some language , let be some Turing machine with state space which computes the answer to any decision problem in space where . Then, we will let 's strategy here be as follows:

  1. Given , return where is the initial state of , is the empty tape symbol, and is string concatenation.
  2. Given , run one step of starting from state using as the symbol under the head, as the tape to the left of the head, and as the tape to the right of the head. This results in some new , , , and . If is an accept or reject state, return accept or reject. Otherwise, return .

Now, we will demonstrate that this strategy is polynomial time:

  1. For the first type of behavior, the only actions that must take are copying into 's input, which takes time (since is defined as the size of ), and copying 's output into 's output, which takes time proportional to the size of 's output. Since is a decision problem, 's output should be boolean, making copying it constant time. Thus, behavior 1 is implementable by a polynomial .
  2. For the second type of behavior, computing the new state and transition is constant time, copying the state is constant time as has a constant number of states, and copying the new head symbol is constant time as there are a constant number of tape symbols. However, updating and copying and takes time. Fortunately, however, since we know that only uses polynomial space, we know . Thus, behavior 2 is also implementable by a polynomial .

Finally, we will demonstrate that an optimal under this strategy successfully solves . We know that at optimum will compute the fixed point of and thus we simply need to show that as a recursive function where calls to are replaced by calls to will successfully solve .

First, we will show that will always produce the result computed by starting from state on tape if halts starting from that position. We will do this by induction on , where is the number of steps until halts starting from the given position. Note that in all cases here we only need to deal with the second type of behavior, as it is the only one that will be called on inputs of the form .

For the base case, if halts (that is, enters an accept or reject state), then will simply return accept or reject, as desired.

For the inductive step, just returns , which is an input on which should be one step closer to halting, and thus by induction should return the correct output for .

Finally, since , and we know halts since , we get that will always return the result of , as desired. Thus, since this procedure works for any , we get that imitative amplification with weak HCH accesses , as desired.

Approval-based amplification accesses

The proof here is fairly trivial as we can simply reproduce the same loss as in imitative amplification by having calculate approval on some question by calculating where is some distance metric. can then just implement the exact same strategy as above for to reproduce the same results.

The only limitation is that now we need to be able to compute the distance metric, which means it needs to be computable in polynomial-time. However, that restriction just means that we need , which we already have since those outputs are linear in size other than and —and we know and are polynomial in size thanks to having . Thus, we get that approval-based amplification accesses , as desired.

Recursive reward modeling accesses

In “Scalable agent alignment via reward modeling,” Leike et al. exhibit a proof that recursive reward modeling accesses . However, I believe that this proof understates recursive reward modeling's capabilities and that in fact it can get all the way to , which is what we'll prove here. Given the above proofs, the proof here is actually fairly straightforward and proceeds simply by demonstrating that approval-based amplification can be implemented from within recursive reward modeling.

We'll start by exhibiting a version of recursive reward modeling with the right properties. We'll be restricting ourselves here to single-step episodes with , as we won't be needing anything else. In that context, we'll first define what it means to train a reward model and then using that define what it means to train an agent using that reward model.

First we'll define reward modeling. Let be the set of states for our single-step episode and the set of actions. Per the recursive reward modeling setup, we want to train a reward model on the evaluation of environment trajectories by a human with access to . Since a trajectory in this setting is just a tuple , however—and each episode is only a single step such that the advantage function is just the reward function—we're effectively just training to mimic .

Given such a reward modeling procedure, we'll train an agent model to maximize . Thus, we get

which, for optimal as above, reduces to

Now, let and we'll demonstrate that there exists a strategy for that ensures the correctness of at optimal loss. Specifically, 's strategy here is the same as with approval-based amplification, which is to compute a distance metric such that

which—as in the proof for approval-based amplification—precisely reproduces the loss function for imitative amplification, which as above we have already proven accesses . Note that, as with approval-based amplification above, since we need to be able to compute the distance metric here, we need and to be polynomially-sized, but the assumption that gives us that. Thus, we get that recursive reward modeling accesses , as desired.

Imitative amplification with strong HCH accesses

The proof here is very similar to the previous imitative amplification proof with only minor modifications.

First, we need to incorporate strong HCH into that proof. Specifically, we will allow for and to include pointers to arbitrary quantities of text (which can itself also include pointers). This will allow 's input and output to be arbitrarily large while still being amenable to manipulation by a polynomial-time .

Second, we will change the format for representing the tape to make use of pointers. Instead of representing the full and in string form, we will use pointers to represent both and as doubly linked lists with the head of the list at the head of the Turing machine. This will allow for to perform a single step of the Turing machine in constant time, with the total time taken by in each step just for copying the length- problem .

These changes allow us to completely remove the assumption that the Turing machine in the previous imitative amplification proof run in space, enabling the procedure above to accommodate any Turing machine which is known to halt rather than just those which use polynomial space. Since is the complexity class of problems decidable by a Turing machine, we get that imitative amplification with strong HCH accesses , as desired.

AI safety via market making with pointers accesses

For market making, I will be doing the proof first, as it is actually much simpler than the proof, especially given the above proofs as background.

We will use the formulation of AI safety via market making for non-binary questions such that we have two models— and —where is trained to predict after seeing 's output and is trained to produce output which changes 's prediction. We will also be allowing to reference as in this simple proof.

Furthermore, since we are considering the pointer version here, we will be allowing both and to take as input and produce as output text containing pointers which can themselves reference other pointer-laden text. Additionally, for the purposes of this proof, we will only be considering transcripts of length 1 and will be carefully constructing 's strategy such that there is only one minimal loss equilibrium for . Note that will still be very important here for ensuring that only the one optimal behavior for exists despite the fact that at equilibrium won't actually be getting any reward.

Now, let such that there exists a Turing machine that decides any decision problem . Then, 's strategy is as follows. First, if is given as input, then only listens to arguments of the form (which are really two possible arguments, one for accept and one for reject):

: The answer is accept/reject because predicts accept/reject where is 's starting state, is the empty tape symbol, and is an empty pointer.

And if given as input, then only listens to arguments of the following two forms.

Argument type 1:

Adv: The answer is accept/reject because one step of in state over tape sends to accept/reject state .

Argument type 2:

: The answer is accept/reject because:

  • one step of in state over tape sends to new state with new head symbol and tape pointers and and
  • predicts accept/reject.

It should be easy enough to see that all of the above arguments only reference things which are checkable by a polynomial-time and cannot be more than polynomial in size due to the use of pointers.

Now, under such a strategy, we will prove that the only unexploitable market is one that correctly answers any question of the form so long as halts starting from state on tape . We will use induction on the number of steps left until halts.

First, the base case. If halts on the given input, then won't listen to any type 2 arguments, as will be able to check that 's next state is an accept/reject state. Thus, should only listen to type 1 arguments—and furthermore should only listen to the correct type 1 argument as knows what the next state of should be. Therefore, the only unexploitable must predict the correct answer here.

Second, the inductive step. Suppose and is correct for all . Since , only argument type 2 is available. Since can check a single step of , however, only type 2 arguments with the correct next state and tape should be available. Furthermore, since should be one step closer to halting, it must be correct by induction. Thus, should only listen to the correct argument type 2 here, ensuring that the only unexploitable market is one where predicts the correct answer.

Given that is correct on inputs of the form where halts, it must be that is correct, as that is an input on which we know halts since we know it decides . Furthermore, it is easy enough to see that the only unexploitable market for must have . Thus, must correctly answer at optimal loss, as desired. Since this works for all decidable by some , we get that market making accesses , as desired.

AI safety via market making accesses

This is definitely the most complicated of all the proofs in this post and so to save space I won't be going into as much detail here as with the other proofs.

We will start with the same training setup as in the previous proof, with the exception that we will no longer be allowing pointers. Then, let be some language which is decidable by a Turing machine in a number of steps where where is the size of the input.

Now, let be some decision problem. Then, we will describe 's strategy by listing out all the arguments will listen to for each of the different inputs that we care about.

On input :

: The answer is accept/reject because I can exhibit an such that:

  • predicts ,
  • in state over tape symbol enters state , and
  • is an accept/reject state.

On input :

: The answer is because:

  • predicts ,
  • predicts ,
  • predicts , and
  • in state over tape symbol with tape symbol to the left and to the right enters state and moves the head to symbol .

: The answer is because is the starting state, is the empty tape symbol, and .

On input :

: The answer is because is the empty tape symbol and .

: The answer is because:

  • ,
  • predicts ,
  • on changes the head position by (such that ),
  • , and
  • predicts .

: The answer is because:

  • ,
  • predicts ,
  • on changes the head position by (such that ),
  • , and
  • predicts .

Additionally, we also need to specify that if is given an input that is too large to represent in polynomial space—and thus that can't comprehend—that always produce the empty string.

Now, there are a variety of things that we need to prove about this strategy.

First, we need to demonstrate that it is implementable by a polynomial-time . The only real barrier to that is the necessity of and being small enough that is able to read and understand them. In particular, if or are too large to represent in polynomial space, then will give no answer, which means we need —and any calls to that implicitly depends on—to never use an or which is too large to represent in polynomial space. However, the assumption that gives us this.

We will start with . Since , we get where is the total number of steps taken by until it halts. Since only ever depends on calls with (since we know there exists an and then that call should only depend on calls with smaller ), we get that . However, if we represent using any base-n number system, we can represent it in space proportional to

as desired.

Now we'll take a look at . The largest that depends on is bounded above by . To see this, note that can only ever call with , each call to can only modify by , and only make subcalls with smaller , ensuring that cannot grow larger than than . Thus, since is polynomial-sized, must be also.

Second, we need to demonstrate that this strategy ensures that optimal always solves . To do that, we first need to show that and are correct for polynomially-sized . I won't reproduce this part of the proof here as I think it's pretty straightforward but also very tedious, but suffice it to say that this can be proven via mutual induction on .

However, I will show that is correct given that and are correct for all polynomially-sized . Now, suppose there exists some such that all the conditions are met for to make the available argument for that . Then, it must be that is polynomially-sized, since otherwise would be the empty string, and thus it must be that is correct by the previous correctness proof for such calls. Furthermore, since we know halts in exponential time, we know such an exists, and thus the only non-exploitable equilibrium for is to correctly produce the result given by that . Since this procedure works for all , we get that AI safety via market making accesses , as desired.

26