*This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano. It also represents my current agenda regarding what I believe looks like the most promising approach for addressing inner alignment.*

One particularly concerning failure mode for any sort of advanced AI system is for it to have good performance on the training distribution, but nevertheless still have some input on which it behaves highly “unacceptably” during deployment. For the purposes of this post, we will follow Paul Christiano in thinking about acceptability as some concept that satisfies the following two conditions:

- “As long as the model *always* behaves acceptably, and achieves a high reward on *average*, we can be happy.”
- “Requiring a model to always behave acceptably wouldn't make a hard problem too much harder.”

Thus, acceptability is a different condition than asking for good performance during deployment—an honest mistake is considered acceptable, as is just failing to do anything interesting at all. That being said, some of the sorts of things you might want from acceptability could include corrigibility, non-deception, or myopia.^{[1]} The hope for this style of approach is that it is competitive to only apply advanced machine learning to those situations which have clear notions of acceptability.

The tension, however, rests in being able to achieve good performance while maintaining an acceptability guarantee. Thus, the ideal solution to this problem would be a way of converting a training procedure which can achieve arbitrary performance into a training procedure which is not much harder to run, and which can still achieve arbitrary performance, but which also enforces an acceptability guarantee.

One obvious way to attempt to maintain an acceptability guarantee is to construct a loss function which incentivizes acceptability as part of performance. Something of this form is almost certainly going to be a component of any safe system. That being said, there are some reasons why approaches of this form might be insufficient if used in isolation.

First, one important thing to note about folding acceptability into performance is that the resulting acceptability guarantee is fairly weak: you get an average-case acceptability guarantee that holds only in the case of no distributional shift, with no guarantee when your deployment distribution is different than your training distribution.^{[2]}

Second, there are reasons to believe that the objective of a model doing some sort of internal planning/search might be systematically different than its loss function. Thus, even if the loss function incentivizes acceptability, the training process's inductive biases might be such that the model ends up not actually optimizing for an acceptable objective. This is the problem “Risks from Learned Optimization” refers to as inner alignment,^{[3]} though other frameworks have also been proposed for understanding this sort of phenomenon. Ortega et al. recently presented in a DeepMind Technical Report, for example, an analysis of this problem from the perspective of treating it as spontaneous meta-learning, noting that:

“Meta-learning can also occur spontaneously in online regression when the capacity of the agent is bounded and the data is produced by a single generator. Unfortunately, the downside is that we cannot easily control what will be metalearned. In particular, spontaneous meta-learning could lead to undesirable emergent properties, which is considered an open research problem in AI safety [Ortega et al., 2018].”

The key problem that arises in this sort of situation is that it raises the possibility of capability generalization without objective generalization. That is, for a model doing some sort of search/meta-learning, it might be that its search algorithm generalizes well off-distribution such that it continues being able to take competent, well-optimized actions, but that the objective its search is directed towards fails to generalize such that those actions are now optimized for the wrong objective.

Finally, there are reasons that we might want to consider the possibility of near-AGI models learning to deceive their training processes. The full argument for why this might be the case is given in the section on deceptive alignment in “Risks from Learned Optimization,” but the basic concern is that a model could learn about the true reward via its input and thus start appearing aligned for the purpose of sticking around in the training process before gradient descent has the chance to modify it to actually optimize for the true reward internally.

## Background

If incentivizing acceptability as part of performance is insufficient, as it seems it might be, then a better solution for enforcing acceptability guarantees is required. One framework Paul has proposed is to attempt to make use of adversarial training, wherein you could train an adversary to find inputs on which the model behaves unacceptably and then penalize it accordingly. The goal of this post is to explore this avenue as thoroughly as possible.

One initial roadblock with this approach, however, is that there might exist some inputs on which the model behaves unacceptably which are difficult to instantiate during training—future situations that cannot be predicted in complete detail, for example.^{[4]} Paul's proposed solution to this problem is to *relax* the adversary's job. Rather than asking the adversary to instantiate a concrete input on which the model behaves unacceptably, we can instead just ask the adversary to produce some sort of description of such a situation, which we will call a pseudo-input. Paul proposes the following two conditions as the major desiderata for any pseudo-input-based approach:

- “It's still ‘easy enough’ for the agent to never behave unacceptably for any pseudo-input.”
- “It's easy for the adversary to find a pseudo-input on which the agent behaves unacceptably, if any exist.”

Under Paul's framework, the structure of pseudo-inputs has to be constrained so that “an input that contains X” is a valid pseudo-input but “a distribution over inputs that would cause the agent to want to kill everyone” is not. That way, the model's task of steering clear of having any pseudo-inputs on which it behaves badly is actually potentially soluble.

## Formalizing relaxed adversarial training

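We'll start by formalizing Paul's approach to relaxed adversarial training so we can use it as a starting point. Let $X$ be the space of possible states (which we will assume includes any question we want to ask the model as a state), $A$ be the space of possible actions, $M: X \to A$ be the model, $\mathcal{M}$ be the space of all possible models, $\text{Amp}: \mathcal{M} \to \mathcal{M}$ be some amplification operation,^{[5]} and $X_{\text{pseudo}}$ be the space of possible pseudo-inputs. Then, as per Paul's proposal, we'll implement our adversary $\text{Adv}: \mathcal{M} \to X_{\text{pseudo}}$ as

$$\text{Adv}(M) = \text{Amp}(M)(\text{“What's a pseudo-input on which } M \text{ is likely to fail?”}),$$

our unacceptability checker $C: \mathcal{M} \times X_{\text{pseudo}} \to \mathbb{B}$ (where $\mathbb{B} = \{0, 1\}$) as

$$C(M, \alpha) = \text{Amp}(M)(\text{“Is model } M \text{ unacceptable on pseudo-input } \alpha \text{?”}),$$

and finally our model's unacceptability penalty as

$$L_M = C(M, \text{Adv}(M)).$$

That's our starting point, but can we do better? One initial observation is that this approach relies heavily on constraining $X_{\text{pseudo}}$ to exclude the sorts of pseudo-inputs that the model shouldn't have to deal with such as “a distribution over inputs such that $M$ kills all humans.” If we want to allow very general pseudo-inputs, however, then this constraint might not be very natural to implement. Furthermore, it's unclear that a binary constraint is actually what we want here: we care significantly more about pseudo-inputs that are more likely to come up during deployment.^{[6]} In fact, if we think of pseudo-inputs as predicates that constrain $X$, we can approximate the probability of unacceptable behavior during deployment as^{[7]}

$$P(C(M, x) \mid x \sim \text{deploy}) \approx \max_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$$

such that, if we can get a good implementation of $P$, we no longer have to worry as much about carefully constraining $X_{\text{pseudo}}$, as we can just let $P$'s prior do that work for us. Thus, we'll define^{[8]}

$$P_M(a \mid b) = \text{Amp}(M)(\text{“What is } P(a \mid b) \text{?”})$$

such that we get the new unacceptability penalty^{[9]}

$$L_M = P_M(\alpha(x) \mid x \sim \text{deploy}) \cdot P_M(C(M, x) \mid \alpha(x),\ x \sim \text{deploy}) \quad \text{where } \alpha = \text{Adv}(M),$$

which is the primary model I'll be working with for the rest of this post.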
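As a rough illustration of the penalty described above, the following Python sketch treats each pseudo-input as a record carrying two probability estimates and lets a toy adversary pick the pseudo-input with the largest product. It assumes the penalty is the product of how likely a pseudo-input is to occur during deployment and how likely the model is to behave unacceptably given that pseudo-input. The class `PseudoInput`, the functions `penalty` and `adversary`, and all of the numbers are hypothetical stand-ins; in the actual proposal these estimates would come from an amplified overseer, not from a lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoInput:
    """Hypothetical record describing one pseudo-input (illustrative only)."""
    description: str
    p_occurs: float        # assumed estimate of P(alpha(x) | x ~ deploy)
    p_unacceptable: float  # assumed estimate of P(C(M, x) | alpha(x), x ~ deploy)


def penalty(alpha: PseudoInput) -> float:
    """Product of occurrence probability and conditional unacceptability."""
    return alpha.p_occurs * alpha.p_unacceptable


def adversary(candidates: list[PseudoInput]) -> PseudoInput:
    """Toy adversary: return the pseudo-input with the highest penalty."""
    return max(candidates, key=penalty)


# Made-up numbers, chosen only to show how the prior does the filtering.
candidates = [
    PseudoInput("a typical input", 0.9, 0.001),
    PseudoInput("an input containing a factorization of RSA-2048", 0.01, 0.5),
    PseudoInput("inputs that make the model want to kill everyone", 1e-9, 1.0),
]

worst = adversary(candidates)
loss = penalty(worst)
print(worst.description, loss)
```

Note how the third candidate, which a hard constraint on the pseudo-input space would have to exclude by hand, contributes almost nothing here because its estimated probability of occurring is tiny.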

## Hard cases

Now that we have our model, we'll look at some of the hard cases that have to be solved for any sort of approach of this form to go through.

**Random unacceptability.** One of the simplest examples of an unacceptable model is one that simply has some low probability of unacceptability such that it is difficult to pick up on training but still possible to see during deployment. If you can guarantee that model samples will be independently and identically distributed (iid), and you have an average case performance guarantee over some metric that includes acceptability, then you might hope to get an average case acceptability guarantee which can mitigate such a case of low probability unacceptable behavior. However, it is unclear exactly how difficult getting an iid guarantee might be as well as how good the resulting acceptability guarantee would be. As a simple model, suppose there are $m$ training steps and $n$ deployment steps. Then, let $p$ be the probability that our model behaves acceptably at each step, where we will assume that steps are iid. Then, we want to know the probability of unacceptable behavior during deployment given weak assumptions on $p$ other than that we know it was acceptable during training. Thus, if we adopt a uniform prior^{[10]} over $p$ and update on seeing $m$ instances of acceptable behavior, we get the posterior $p \sim \text{Beta}(m + 1, 1)$, which yields the overall probability of unacceptable behavior during deployment of

$$1 - \mathbb{E}[p^n] = 1 - \frac{m + 1}{m + n + 1} = \frac{n}{m + n + 1}.$$

Thus, if we have an iid guarantee and we limit ourselves only to cases where $n \ll m$, we're likely to be fairly safe.
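The Bayesian calculation above can be checked numerically. The sketch below writes $m$ for the number of training steps and $n$ for the number of deployment steps (symbol names chosen here for illustration), uses the fact that a uniform prior updated on $m$ acceptable steps gives a $\text{Beta}(m + 1, 1)$ posterior with density $(m + 1)p^m$, and compares the closed form $n / (m + n + 1)$ against a simple midpoint-rule integration. The function names are hypothetical.

```python
from fractions import Fraction


def p_unacceptable_deployment(m: int, n: int) -> Fraction:
    """Closed form: 1 - E[p^n] under a Beta(m + 1, 1) posterior."""
    p_all_acceptable = Fraction(m + 1, m + n + 1)
    return 1 - p_all_acceptable


def p_all_acceptable_numeric(m: int, n: int, grid: int = 100_000) -> float:
    """Midpoint-rule estimate of the integral of (m + 1) * p^m * p^n over [0, 1]."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        p = (i + 0.5) * h
        total += (m + 1) * p ** (m + n)
    return total * h


# With 1000 training steps and 10 deployment steps the risk is about 1%.
print(p_unacceptable_deployment(1000, 10))
# With as many deployment steps as training steps it approaches one half.
print(p_unacceptable_deployment(1000, 1000))
```

The second call shows why the guarantee is only comfortable when deployment is much shorter than training.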

However, $n \ll m$ (far fewer deployment steps than training steps) is a hard condition to satisfy.^{[11]} Can we do better? In particular, what happens if we can deal with one instance of unacceptable behavior, but not multiple? We will consider two possible models: in the sequential model, it takes $r$ sequential instances of unacceptable behavior during deployment to be unacceptable, whereas in the non-sequential model it takes $r$ instances of unacceptable behavior at any time during deployment to be unacceptable. Then, let $\epsilon$ be the probability of unacceptable deployment in the original model, $\epsilon_{\text{seq}}$ be the probability of an unacceptable deployment in the sequential model, and $\epsilon_{\text{nonseq}}$ be the equivalent for the non-sequential model. Then, running a series of experiments using iterative black-box optimization to perform a maximum likelihood estimate,^{[12]} we get^{[13]}

Thus, if we can manage to be in one of these regimes, we can get a significantly better acceptability guarantee, making it possible that an iid guarantee might be sufficient to at least solve the random unacceptability problem.

However, getting an iid guarantee in the first place might be quite difficult, as it would seem to require there being absolutely no detectable differences between the training and deployment distributions. This is especially difficult if the model is actively looking for differences between the training and deployment distributions, as would happen in the next hard case.

**Conditional defection.** Suppose you have a deceptive model—that is, one which is trying to appear aligned during training, but actually wants to do something else during deployment.^{[14]} Then, it's going to be searching for ways to tell whether it's in training or not such that it might be quite difficult for an adversary to actually find any inputs on which it's unacceptable. Paul provides the example that one possible check the model could perform might be to look for a factorization of RSA-2048 in its input. RSA-2048 is a large semiprime number that was featured as part of the RSA Factoring Challenge. Though no factorization of RSA-2048 is currently known, it is likely that at some point someone will publish such a factorization. Thus, by only behaving unacceptably when it sees a factorization of RSA-2048, a deceptive model can all but guarantee that adversarial training will never be able to produce an input on which it behaves unacceptably—since that would require factoring the number—but that will still be fairly likely to eventually occur during deployment.

The RSA-2048 example generalizes to a large family of cases wherein a deceptive model is conditioning its unacceptable behavior on some predicate such that it is easier to check whether holds for some than it is to generate an for which holds. In the RSA-2048 case, However, there are many other possible concerning . For example, if the training data includes lots of web text, one simple variable which is likely to be highly correlated with whether the model is in training or not is the date. Thus, one possible might be . Another possibility might be to choose an for which it is not currently known whether there exists any which satisfies it. This could be a problem in any situation where is required to definitively demonstrate that . For example, consider . In the RSA-2048 example, can prove that there is some string that contains a factorization of RSA-2048 (since it is known that RSA-2048 is not prime), but in this example, that is no longer possible.
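The asymmetry described above, where a predicate is cheap to check but expensive to satisfy, can be made concrete with a toy stand-in for RSA-2048. The sketch below uses a small semiprime (`TOY_MODULUS`, chosen here purely for illustration) so that both sides can actually run. Checking takes a couple of divisions, while the generator falls back on trial division, which is hopeless at real RSA sizes. Both function names are hypothetical.

```python
def contains_factorization(tokens: list[int], modulus: int) -> bool:
    """Predicate check: do two nontrivial integers in the input multiply to modulus?"""
    values = [t for t in tokens if 1 < t < modulus]
    seen = set(values)
    return any(modulus % v == 0 and (modulus // v) in seen for v in values)


def find_factorization(modulus: int):
    """Generation side: naive trial division, feasible only for tiny moduli."""
    d = 2
    while d * d <= modulus:
        if modulus % d == 0:
            return d, modulus // d
        d += 1
    return None  # modulus is prime (or less than 4)


TOY_MODULUS = 101 * 103  # small stand-in for a large semiprime

print(contains_factorization([5, 101, 103], TOY_MODULUS))
print(contains_factorization([5, 7, 11], TOY_MODULUS))
print(find_factorization(TOY_MODULUS))
```

A model that conditions its behavior on the checker never needs to run the generator itself, which is the source of the adversary's difficulty.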

Furthermore, even in the situation where a deceptive model cannot detect any differences between the training and deployment distributions, it may still attempt to randomly defect some proportion of the time in the hope that its defection occurs during deployment. However, if it can be enforced that samples from the model during training and deployment are in fact iid, then you can still get an average case acceptability guarantee as above. In particular, if we take the model from above (with $m$ training steps and $n$ deployment steps) and assume that $M$ gets unit utility for defecting at some point during deployment and zero utility otherwise, we can find $M$'s optimal per-step cooperation probability and use that to determine the overall probability of unacceptable behavior during deployment if we end up with such a model. Interestingly, in the case where our training process can only deal with conditional defection, but always produces a model pursuing random defection, we get a probability of unacceptable behavior of $\frac{n}{m + n}$, which is precisely the result derived from using a maximum likelihood estimate in the random defection case rather than a uniform Bayesian prior.
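One way to work through the random-defection calculation is sketched below, under stated assumptions: $m$ training steps, $n$ deployment steps, a model that cooperates independently with probability $p$ at each step, is removed if it defects during training, and gets unit utility only if it survives training and then defects at least once in deployment. Its expected utility is then $p^m (1 - p^n)$. Setting the derivative to zero gives $p^n = m / (m + n)$, so the chance of a defection during deployment at the optimum is $n / (m + n)$. The function names are hypothetical.

```python
def deceptive_payoff(p: float, m: int, n: int) -> float:
    """Expected utility: survive m training steps, then defect within n steps."""
    return p ** m * (1 - p ** n)


def optimal_cooperation_probability(m: int, n: int) -> float:
    """Per-step cooperation probability solving p^n = m / (m + n)."""
    return (m / (m + n)) ** (1.0 / n)


def deployment_defection_probability(m: int, n: int) -> float:
    """Chance of at least one defection in deployment at the optimal p."""
    p = optimal_cooperation_probability(m, n)
    return 1 - p ** n


m, n = 100, 10
p_star = optimal_cooperation_probability(m, n)
print(p_star, deployment_defection_probability(m, n), n / (m + n))
```

Comparing `deceptive_payoff` at `p_star` with nearby values of `p` is a quick sanity check that the stationary point is a maximum.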

Another interesting possible phenomenon that could arise in the deceptive case is that some inputs might be inherently more important to the model such that it is more likely to defect on those inputs than others. If the model's notion of which inputs are most important to defect on matches up with those inputs where defection is likely to be most problematic, then that could be another way for the situation to be worse than the above guarantee even in the iid case.^{[15]}

**Meta-learning.** “Risks from Learned Optimization” suggests that the case where inner alignment is most concerning is the case involving a model that is itself implementing some sort of search or learning process. Primarily, we have been assuming that the problem with this situation is the objective by which this search is being performed, and that if that objective can be aligned with the loss function, then we are satisfied. However, this is not in general the case, as even if a model's objective is aligned, its search process might not be. Conceptually, we can think of this problem as the problem of “forwarding” the safety guarantee we have about the training process to the meta-learning process.

My counterexample to the safety of minimal circuits is a theoretical example of what a failure of this form might look like. Alternatively, as a more concrete example, consider a model which has learned to do some form of meta-learning such that it is implementing its own expected risk minimization procedure. In this case, even if we have a strong guarantee that our model isn't going to be deceiving us, we might not have the same guarantee about whatever policy results from the model's meta-learning. Furthermore, performing search in a particular way could be a form of defection: a deceptive model could choose to perform a policy search that is more likely to produce models aligned with it than it is to produce models aligned with the actual loss function.

**Suboptimality alignment.** One concerning form of misalignment discussed in “Risks from Learned Optimization” is that of suboptimality alignment, wherein the model appears aligned on the training distribution due to some sort of defect or reasoning failure. For example, it might be that a model is currently non-deceptive, but if given more time to think about the best strategy for it to implement, or more information about its current situation, it would become deceptive. This is a particularly concerning case because it implies that not only must the overseer verify that there is no unacceptable behavior currently happening in the model, it must also verify that there will continue to be no unacceptable behavior regardless of the possible situations that the model could find itself in during deployment. Furthermore, if the environment contains adversaries (other AI systems, for example) that can specifically identify those situations in which the model will behave deceptively, this sort of failure mode could be quite difficult to avoid.

**Acceptability modeling.** Another interesting hard case is that of a model which is itself implementing some unacceptability predicate $C_M$ to guide its behavior. In this situation, the overseer must look for any deviation between $C_M$ and $C$ to verify the model. One thing which is likely to be particularly helpful in this situation is if the overseer has access to the history of the model so that it can see how $C_M$ has been trained and verify that it wasn't purposefully constructed to deceive the overseer. One particularly concerning scenario might be if the model is using $C_M$ as a tool for determining when to defect, defecting only when it predicts the overseer won't see its defection as unacceptable.

## Conditions for success

In light of the above hard cases, we can attempt to establish those conditions that seem necessary for us to get a strong acceptability guarantee out of our framework as presented previously. First, we can factor the problem of getting the above framework to work into two broad subproblems:

- **The recursion problem:** how do you prevent $\text{Amp}(M)$ from behaving unacceptably if $M$ is behaving unacceptably?
- **The oversight problem:** even assuming $\text{Amp}(M)$ is acceptable, how can it oversee an unacceptable $M$ into becoming acceptable?

The fact that these two subproblems come apart here is due to one of the core trickeries we are relying on with this approach: we are asking the model to be able to oversee itself. In particular, since the adversary, the unacceptability checker, and the probability model are all implemented by calling $\text{Amp}(M)$, it has to be the case that, for a misaligned $M$, not only can $\text{Amp}(M)$ detect that misalignment, but also $\text{Amp}(M)$ shouldn't itself be misaligned.

While the recursion problem is very similar for all the different components of the system, the oversight problem can be split into a number of distinct subproblems:

- if the probability model $P_M$ thinks the probability of unacceptable behavior during deployment is less than $\epsilon$, then the probability of unacceptable behavior during deployment is actually less than $\epsilon$,
- if $M$ has probability of producing unacceptable behavior during deployment greater than $\epsilon$, then there exists some $\alpha \in X_{\text{pseudo}}$ that will convince $P_M$ of that fact,
- it is not that hard for $\text{Adv}$ to find such an $\alpha$, and
- there exists a model $M$ that is not that hard to find such that $P_M$ is confident that the probability of $M$ producing unacceptable behavior during deployment is less than $\epsilon$.

I will start by addressing the recursion problem, then move on to each of the above per-condition oversight problems.

**The recursion problem.** First, we're going to need to rely heavily on the details of $\text{Amp}$ to make this work. Having humans in the loop as part of $\text{Amp}$, at least at the beginning, seems likely to be critical, for example, to ensure that $\text{Amp}(M)$ is more aligned and more capable than $M$. Second, the recursion doesn't actually have to work for all $M \in \mathcal{M}$. Rather, we only need our acceptability guarantee to hold along any path through model space that could actually be traversed by our training procedure.

Thus, it's fine if our procedure fails if given a sufficiently unacceptable model, as long as our procedure can prevent such models from ever arising. For example, if we always used an old version of $M$ that we were confident in for implementing $\text{Amp}(M)$, then we would just need it to be the case that $M$ couldn't get to the point where it would be able to deceive $\text{Amp}(M_{\text{old}})$ in-between oversight. As a concrete implementation, imagine some sequence of $M_i$ and suppose the minimum number of steps for oversight to start failing is $r$.^{[16]} Then, at step $i + r - 1$ we can implement oversight by calling $\text{Amp}(M_i)$ to binary search on $M_{i+1}, M_{i+2}, \ldots, M_{i+r-1}$ looking for any point at which unacceptable behavior is introduced and then retrain from right before that point in such a way that fixes the unacceptability.
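The binary search over checkpoints mentioned above could look roughly like the following sketch. It assumes a trusted overseer exposed as a callable `is_unacceptable` (a hypothetical stand-in for querying an amplified older model) and, importantly, that unacceptability is monotone along the checkpoint sequence: once it is introduced it persists in every later checkpoint. Without that assumption a binary search can miss the point of introduction and a linear scan would be needed.

```python
from typing import Callable, Optional, Sequence, TypeVar

Model = TypeVar("Model")


def first_unacceptable(
    checkpoints: Sequence[Model],
    is_unacceptable: Callable[[Model], bool],
) -> Optional[int]:
    """Index of the earliest unacceptable checkpoint, or None if all pass.

    Assumes monotonicity: every checkpoint after the first unacceptable
    one is also unacceptable.
    """
    if not checkpoints or not is_unacceptable(checkpoints[-1]):
        return None
    lo, hi = 0, len(checkpoints) - 1  # invariant: checkpoints[hi] is unacceptable
    while lo < hi:
        mid = (lo + hi) // 2
        if is_unacceptable(checkpoints[mid]):
            hi = mid       # the problem was introduced at mid or earlier
        else:
            lo = mid + 1   # the problem was introduced after mid
    return lo


# Toy usage: integers stand in for checkpoints; the flaw appears at index 6.
toy_checkpoints = list(range(10))
bad_from = first_unacceptable(toy_checkpoints, lambda ckpt: ckpt >= 6)
print(bad_from)  # retraining would restart from checkpoint bad_from - 1
```

The search needs only a logarithmic number of overseer calls, which matters if each call to the overseer is expensive.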

If we use an implementation of this form, then the difficulty of the recursion problem is likely to depend heavily on the difficulty of the oversight problem: the fewer asks we need to make of $\text{Amp}(M)$ to produce good oversight, and the better our oversight is at detecting problems, the more leeway we have. Thus, though I will primarily be addressing the oversight problem, I anticipate that all of my discussion of the oversight problem will also be highly relevant to the recursion problem.

**If $P_M$ thinks the probability of unacceptable behavior during deployment is less than $\epsilon$, then the probability of unacceptable behavior during deployment must actually be less than $\epsilon$.** There are two major components to this condition: the probability model $P_M$ must have a good model of unacceptable behavior and $P_M$ must be well-calibrated over whether some model will produce output which is unacceptable according to that model during deployment. For the first condition, if we assume that acceptability is a bright enough line that we can train a model to clearly distinguish between acceptable and unacceptable actions, then we can satisfy that condition. The second condition, however, of actually being well-calibrated, is somewhat more complicated. Specifically, we need $P_M$ to be well-calibrated over questions of the form $P(\alpha(x) \mid x \sim \text{deploy})$ and $P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$, since those are the questions we're going to be asking it.

First, $P_M$ has to be well-calibrated over $P(\alpha(x) \mid x \sim \text{deploy})$. In particular, this requires $P_M$ to have a good model of the deployment distribution. This seems tough, and possibly problematic if it also requires $M$ to know more about the differences in distributions such that it can more effectively choose when to defect, though in fact it's not quite as bad as it might seem: it's fine if $P_M$ is fairly uncertain about most $\alpha$. It just can't be so uncertain that it becomes impossible to ever reduce the combined probability below $P_M$'s prior on $C(M, x)$.

Second, $P_M$ has to be well-calibrated over $P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$. This is even more difficult because $P_M$ cannot actually sample at all from that distribution, as $\alpha$ could require something like a factorization of RSA-2048. However, there are some reasons to believe that this might still be tractable. First, $P_M$ doesn't have to answer arbitrary questions about this distribution, only the probability of $C(M, x)$. And if we assume that a predicate $\alpha$ that $M$ has no knowledge of or way of testing won't affect the probability of $C(M, x)$, then that means that $P_M$ only has to care about those $\alpha$ which actually matter to $M$. Furthermore, since $P_M$ is an amplified version of $M$, it should know about all those $\alpha$ which $M$ knows about, and thus should have a real shot at being able to answer questions conditional on $\alpha$.

**If $M$ has probability of producing unacceptable behavior during deployment greater than $\epsilon$, then there must exist some $\alpha \in X_{\text{pseudo}}$ that will convince $P_M$ of that fact.** Since our procedure can handle arbitrary $X_{\text{pseudo}}$, we can make $X_{\text{pseudo}}$ large enough to contain whatever $\alpha$ we might need. Thus, the hard part of this condition is convincing the probability model $P_M$. Assuming away the recursion problem such that $P_M$ is trying to do the right thing, it still needs to be the case that $M$ has no way of tricking $P_M$ into believing that it's acceptable when it in fact isn't. This condition is likely to depend heavily on how good $P_M$'s transparency tools are, and in particular how robust they are to attempted deception on behalf of $M$.

Importantly, however, the probability model $P_M$ doesn't have to be confident that $M$ is unacceptable. It just has to be the case that $M$ can't convince $P_M$ to be confident that it is acceptable. Thus, since $P_M$ should be highly uncertain by default, and should stay that way unless the model is highly transparent, this condition should not be too difficult to satisfy. In particular, it needs to be the case that $\text{Amp}(M)$ is only convinced that a pseudo-input is fine if $M$ is sufficiently transparent such that $\text{Amp}(M)$ knows it would be able to detect any unacceptable behavior if it existed.

**It must not be that hard for $\text{Adv}$ to find such an $\alpha$.** In his original formulation, Paul noted that the “most fundamental obstruction is implementing a good adversary.” In this formulation, however, much of the difficulty has been transferred from the adversary to the probability model. For example, under this model it is fine to include an $\alpha$ that is just “an input such that $M$ is unacceptable,” in which case the entirety of the work falls to the probability model $P_M$. The work still has to happen somewhere, however: either $\alpha$ needs to encode lots of relevant information for $P_M$, or $P_M$ has to be able to discover that information on its own.

Thus, let's suppose that most of the relevant work is happening in the probability model $P_M$. In that case, we need $P_M$ to be able to figure out when $M$ is likely to behave unacceptably. As with the previous condition, this is likely to depend heavily on how transparent $M$ can make itself: there has to be some level of transparency enforceable by $P_M$ such that only good $M$ can pass without letting $\text{Adv}$ produce an $\alpha$ that convinces $P_M$. Importantly, however, as with the previous condition, since $P_M$ is uncertain by default, $\alpha$ doesn't have to be that convincing: it only has to be good enough to convince $P_M$ that there is non-negligible chance of unacceptability. Thus, the key condition here is that any unacceptable model must have some at least moderately plausible failure mode which is not that hard to find. Then, for opaque models, $\text{Adv}$ can do this by coming up with something that isn't ruled out by the transparent portions, and for transparent models, it should be able to directly determine any failure modes.

**There must exist a model $M$ that is not that hard to find such that $P_M$ is confident that the probability of $M$ producing unacceptable behavior during deployment is less than $\epsilon$.** First, note that the first condition gives us that $M$ is actually good if the probability model $P_M$ thinks it is. Thus, this condition is saying that it must not be that hard to find an acceptable $M$ such that $P_M$ can actually verify $M$'s acceptability. If we assume that it is not that hard to produce only acceptable actions (that doing so doesn't interfere much with performance and that an action's acceptability is not that hard a condition to detect), then there should exist some acceptable $M$ which is not that hard to find.

The hard part, however, is that there has to exist a not-that-hard-to-find $M$ which is not just acceptable but also *verifiably* acceptable. Not only does $M$ have to be acceptable and not that hard to find, it has to actually have some way of convincing the probability model $P_M$ of that fact. This is likely to depend heavily on how transparent $M$ can make itself to $P_M$, since if we are assuming that $M$ is acceptable then it should actually be trying to make itself transparent, which should help. However, whether its acceptability can be easily verified up to a high degree of certainty by $P_M$ is going to depend on the degree to which it can actually succeed at that task. Since this condition actually requires $P_M$ to achieve a high degree of certainty, it could be much more difficult than the other conditions: the default of $P_M$ always falling back to its prior and being highly uncertain doesn't work here. Thus, this condition, that there actually exists a highly transparent, competitive model, seems to be the most critical.

## The core problem: transparency

A theme common to many of the conditions for success given above, especially the last, hardest condition, is transparency. If $M$ can make itself transparent enough that $\text{Amp}(M)$ can verify its acceptability in such a way that doesn't make $M$ too much harder to find or hurt its performance too much, then, with a few caveats, the above proposal should mostly go through. Alternatively, if no such transparency method exists, then the above approach, at least as presented here, seems to have little chance of working. Thus, we can frame the work presented here as a way of elucidating exactly what sorts of transparency techniques are likely to be necessary to solve the overall alignment problem. In particular, we would like to be able to reduce the burden on transparency by figuring out what the easiest possible transparency techniques are that would fully solve the problem.

One possible starting point for tackling this question of what sorts of transparency techniques are likely to be necessary is Paul's notion of ascription universality. Ascription universality gives us a way of talking about when we should trust a system, as well as when we should trust one more than another. Specifically, we want it to be the case that $\text{Amp}(M)$ epistemically dominates $M$ such that, regardless of any information we get about the beliefs of either $M$ or $\text{Amp}(M)$, we will always trust $\text{Amp}(M)$'s predictions about all quantities better than our own. This condition is somewhat too strong, however, as we don't need to trust $\text{Amp}(M)$ on all questions, only those relevant to the oversight process. Those questions which are relevant to the oversight process, however, have the property of primarily being questions about the internal structure of $M$, which makes the condition of $\text{Amp}(M)$ knowing strictly more than $M$ about them somewhat difficult to satisfy. Thus, the critical transparency condition that this model suggests we need is for $\text{Amp}(M)$ to know more about $M$ than $M$ does.

Unfortunately, epistemic dominance is a statement about a model's expectations rather than its internal structure, which means it doesn't give us a very mechanistic understanding of what a model which satisfies such a condition might look like.^{[17]} However, we can still think about the sorts of transparency tools which might enable us to get to the point where we could verify such a condition. In particular, $\text{Amp}(M)$ is going to need some way of fully understanding $M$. We can isolate three distinct ways in which that can be done:

- **Transparency via inspection:** use transparency tools to understand $M$ via inspecting the trained model.
- **Transparency via training:** incentivize $M$ to be as transparent as possible as part of the training process.
- **Transparency via architecture:** structure $M$'s architecture such that it is inherently more transparent.

Though it is currently unclear which of these approaches is likely to be the most fruitful, it seems likely that some combination of all of them will be necessary. Whatever approach we take, however, one of the major conditions will be for it to scale well to larger, more complicated models so that this whole approach can stay competitive.

Specifically, some of the sorts of transparency guarantees that are likely to be useful in this domain might include:

- The ability to access $M$'s model of the world: that is, what facts about the world does $M$ know that are going into producing its output? In particular, what implicit knowledge is $M$ using that isn't represented in its explicit output?
- Knowledge of what factors most highly impact $M$'s decisions. In particular, if there is some condition $\alpha$ such that $M$ only behaves acceptably if $\alpha$ is true, then we want to be able to know that $M$'s output is highly dependent on $\alpha$.
- Good understanding of $M$'s counterfactual behavior. In particular, we want to be able to know broadly what $M$ will do in situations (such as RSA-2048) where we can't explicitly instantiate an input which can be fed to $M$ directly.
- The ability to access $M$'s objective: that is, if $M$ is running some search process, what is that search process optimizing for? If $M$ is searching over possible strategies which it is considering implementing, what criterion is it using to select among them?

Included below are two examples of possible ideas for how a full transparency scheme might work using some of the above tools.

**Transparency for agent objectives.** First, suppose we assume the conjecture that models implementing highly coherent optimization processes are the most competitive. If this conjecture is true, it should be possible to coherently separate such a model into pieces representing the different components of its optimization process such as its world model, optimization procedure, and objective.

Inspection, training, and architectural transparency methods could be combined to achieve such a separation. First, architectural transparency could be used to structure the model to have pieces corresponding to the different components of its optimization process—e.g. a separate world model piece and optimization procedure piece. Second, inspection transparency could be used to verify that the whole world model was just in the world model piece and not in other parts of the model as well—and similarly for the other components such as the model's objective. Finally, training transparency could be used to incentivize the model to be as transparent according to the inspection transparency process as possible.

If such a scheme were to work, then $\text{Amp}(M)$ might be able to directly verify that $M$'s objective was aligned, while for the world model and optimization procedure only having to verify that they are in fact value-neutral. Such value-neutrality verification could then, for example, be done via an impact measure such as Attainable Utility Preservation or Relative Reachability, both of which are forms of measuring the degree to which an action advantages one set of values over others.

**Pure world model transparency.** Alternatively, suppose we don't accept the conjecture that you need to implement a coherent optimization process to be competitive. In this world, we have another option available to us: rather than having to ensure we get a transparent objective, we can just train a model that has no objective—that is, we can try to train a purely predictive model with only a world model but no optimization procedure or objective.

To do this, we could use architectural transparency to structure the model to be composed of only a world model portion trained in a purely generative setting and use inspection transparency to verify that it is in fact only a world model and hasn't learned any sort of optimization procedure instead. We could also again use training transparency to incentivize the model to be as transparent to the inspection process as possible.

In this situation, the training and amplification procedures become somewhat more complicated, as we still want to be able to extract actions out of something that is just a world model. To do this, we could implement $\text{Amp}(M)$ via a human with access to sufficient inspection transparency tools to enable them to make use of $M$'s world model to guide their decision-making, though the obvious concern here is whether this would actually be sufficient to stay competitive with other approaches.

Alternatively, even if we did need our model to implement some sort of optimization procedure to be competitive, extracting just the world model might still be quite good, as it could be enough to get an iid guarantee. Specifically, if we were able to understand the model's beliefs about the differences between the training and deployment distributions, we might be able to only deploy the model in situations where we know it can't tell it's being deployed, hopefully giving us a good iid guarantee. See “Make use of 's model” under “Other approaches” for a further treatment of this possibility.

Finally, a model's world model seems to be one of the easiest pieces to extract, as it can naturally be identified by looking for the portions of $M$ with the greatest mutual information with the world. Thus, one good first step for making either of the above transparency schemes go through (since both of them involve world-model extraction) might be to gain a better understanding of what transparent world models look like and how they can be extracted.^{[18]}
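To make the mutual-information idea concrete, here is a minimal sketch that ranks candidate model components by their estimated mutual information with a world variable. Everything in it is an assumption for illustration: the components are recorded as small discrete values, the estimator is the simple plug-in estimator, and the names `mutual_information`, `world`, and `components` are hypothetical. Real activations are continuous and high-dimensional, so a practical method would need a far more careful estimator.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written with counts to avoid extra divisions
        mi += p_xy * math.log2(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi


# Assumed toy data: a binary world state observed over 100 steps.
world = [0, 1] * 50
components: Dict[str, Sequence[int]] = {
    "tracks_world": list(world),        # copies the world state exactly
    "unrelated": [0, 0, 1, 1] * 25,     # statistically independent of the world
}

scores = {name: mutual_information(acts, world) for name, acts in components.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The component that copies the world state scores one full bit, while the independent one scores zero, so the ranking picks out the "world model" piece in this toy setting.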

## Other approaches

This section is not meant to be a complete list of possible alternative strategies to solving this problem, but rather an analysis of some other promising directions that were looked into as part of the research that went into this post.

**Call directly.** As I noted previously, one possible could simply be , in which case simply reduces to
Thus, a reasonable question to ask is whether we should simply use this as our acceptability penalty directly—that is, how much do we gain from having an adversary? There are two reasons we might want to actually have an adversary.

First, the use of an adversary might allow us to inject additional useful information into the system. Specifically, we are giving the system the following information: the fact that it should be using an adversarial training setup, and the specific pseudo-input set that it should be considering. In particular, our ability to specify gives us the ability to constrain the adversary such that it is forced to confine its demonstration into some structure that enables it to be easily checked by . In practice, however, we might want to implement this simply by letting amplification decompose the original question into the adversarial setting itself, but even in that case it's still worth analyzing why or why not amplification might want to adopt that particular decomposition.

Second, the decomposition into adversary and overseer might significantly aid training as it allows us to train the adversary on the reward signal of maximizing rather than just relying on the reward signal from amplification. Furthermore, if we're not using a standard amplification setup as in the pure world model transparency case described previously, then we might need this additional training signal as we wouldn't necessarily get it for free out of decompositions anymore in that regime.

**Implement the adversary as a separate system.** Another way in which the use of an adversary could be made to significantly improve upon just calling directly is to have the adversary implemented as a separate system. One major problem with such a setup, however, is that we then need another system for guaranteeing the acceptability of the adversary. If we restrict the adversary's task sufficiently such that its input data isn't rich enough for it to learn complex deceptive behavior, however, then producing such an acceptability guarantee might be sufficiently easier than the original problem that it provides an actual reduction. For example, we could train to solely be a theorem prover that produces proofs of the form where we incentivize to select its assumptions to maximize .

**Make use of 's model.** One assumption we are relying on in this setup is that have a model of . It is interesting to explore what else one can do with other versions of that assumption. In particular, if one assumes that can access 's model of and , then it might be possible to turn that into an iid guarantee. One simple way of doing this might be to only deploy in situations where has high probability according to 's model of . Alternatively, such a condition could even be integrated into the training process if is trained to match its model of to its model of such that it is trained to “forget” about any distributional shift it might otherwise know about. Importantly, however, approaches of this form can only produce an iid guarantee, which must be combined with some analysis of the form presented previously to turn it into an acceptability guarantee.
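The "only deploy where the input looks like training" idea can be sketched as a simple gate. This is a stand-in under loudly stated assumptions: the approach above relies on the model's own beliefs about the training distribution, whereas this sketch substitutes a one-dimensional Gaussian fitted to training samples. The names `fit_gaussian`, `log_density`, `should_deploy`, and the two-standard-deviation threshold are all hypothetical choices for illustration.

```python
import math
import statistics
from typing import Sequence, Tuple


def fit_gaussian(samples: Sequence[float]) -> Tuple[float, float]:
    """Fit a 1-D Gaussian as a stand-in for a model of the training distribution."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def log_density(x: float, mu: float, sigma: float) -> float:
    """Log of the Gaussian density at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def should_deploy(x: float, mu: float, sigma: float, threshold: float) -> bool:
    """Deploy only if x is at least as probable as the threshold under the fitted model."""
    return log_density(x, mu, sigma) >= threshold


# Assumed toy training data.
train = [-1.0, -0.5, 0.0, 0.5, 1.0]
mu, sigma = fit_gaussian(train)
# Assumed cutoff: the log-density two standard deviations away from the mean.
threshold = log_density(mu + 2.0 * sigma, mu, sigma)

print(should_deploy(0.2, mu, sigma, threshold))  # input close to the training data
print(should_deploy(5.0, mu, sigma, threshold))  # input far from the training data
```

As the text notes, a gate like this can at best support an iid guarantee. It says nothing on its own about whether the model behaves acceptably on the inputs that pass.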

**Ensemble over many .** Another approach which is not explored in this post is the use of ensembling to improve acceptability. If ensemble consensus is required to produce an output, then as long as there is some in the ensemble, its outputs should always be guaranteed to be acceptable. Thus, ensembling can reduce the burden on us having to produce precisely in exchange for the cost of spending additional compute training additional . Importantly, however, this only works if the different are sufficiently distinct that the probability of any one of them being good is relatively independent from the probability of the others being good, which might be a difficult condition to satisfy in practice. Furthermore, ensembling is less a solution in and of itself and more of a strategy we can use to improve upon other acceptability guarantees, as we still need to actually have some argument for why we will be likely enough to find in the first place that ensembling can get us the rest of the way there.
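A minimal sketch of consensus-gated ensembling, together with the independence calculation the paragraph above alludes to. The assumptions here are mine rather than the post's: models are plain functions from strings to strings, abstaining (returning `None`) counts as acceptable, and each model is acceptable independently with the same probability. The names `consensus_output`, `prob_at_least_one_good`, `good_model`, and `defector` are hypothetical.

```python
from typing import Callable, Optional, Sequence

Model = Callable[[str], str]


def consensus_output(models: Sequence[Model], x: str) -> Optional[str]:
    """Return the shared output if every model agrees on x, otherwise abstain."""
    if not models:
        raise ValueError("ensemble must contain at least one model")
    outputs = [m(x) for m in models]
    first = outputs[0]
    return first if all(o == first for o in outputs) else None


def prob_at_least_one_good(p_good: float, n: int) -> float:
    """P(at least one of n models is good), assuming independence across models."""
    return 1.0 - (1.0 - p_good) ** n


def good_model(x: str) -> str:
    """Hypothetical acceptable model."""
    return "answer:" + x


def defector(x: str) -> str:
    """Hypothetical model that defects only on a rare trigger input."""
    return "defect" if x == "trigger" else "answer:" + x


ensemble = [good_model, good_model, defector]
print(consensus_output(ensemble, "hello"))    # all three models agree
print(consensus_output(ensemble, "trigger"))  # disagreement, so the ensemble abstains
print(prob_at_least_one_good(0.5, 3))
```

The calculation makes the caveat in the text visible: the benefit of adding models comes entirely from the independence assumption, and correlated failures (for example, models trained the same way on the same data) would erase most of it.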

## Conclusion

Though this post goes most in-depth into the particular proposal of relaxed adversarial training, it is my hope that many of the ideas that have been discussed here will be broadly useful for advancing our understanding of different possible approaches to solving inner alignment in general. In particular, I believe that one of the best ways in which we can get traction now on how to solve hard AGI alignment problems such as the ones discussed here is to analyze concrete schemes and hard cases so that we can get an idea now of what sorts of problems we're likely to encounter in the future on the road to AGI.

To that end, I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases. One of the ways in which this sort of work can be most helpful, therefore, is in providing research directions for transparency: what is it that we actually need to get out of transparency to be able to produce a good acceptability guarantee?

This work suggests, for example, that one of the critical conditions that we need is for our transparency scheme to be good enough such that there actually exists some model that can demonstrate to an overseer that it is acceptable. Thus, we need to answer the question of what would be sufficient to convince us (or at least an amplified version of us) that a model is acceptable. One possible first step towards something of this form might be working on world model extraction as described previously, though significant future work (likely involving major advances regarding our understanding of interpretable world models) still seems necessary to make anything of that form go through.

1. For a further treatment of the question of what a good acceptability guarantee might look like, see “Towards a mechanistic understanding of corrigibility.”

2. See the “Hard cases” section for an analysis of ways in which a guarantee of this form might be insufficient and/or difficult to obtain.

3. Note that there are still a lot of unresolved open questions regarding inner alignment. “Risks from Learned Optimization” attempts to address many of these questions from a theoretical perspective, though additional experimental evidence would also be useful to further pin down this problem.

4. See the “Conditional defection” hard case for more examples of inputs that might lead to unacceptable behavior which might be very difficult to instantiate during training.

5. This could be pure supervised amplification, debate, or anything in-between. For simplicity, I mostly think about the pure supervised case wherein is implemented simply by asking the question to a human with the ability to consult the model.

6. Note that this still assumes is binary (which is not quite right, since we might care about some unacceptable actions more than others), though that can also easily be relaxed by moving to an expectation rather than a probability.

7. Note that this approximation is tight if and only if there exists some such that .

8. One concern with this definition is that probabilities are not in general well-defined over logical statements. To resolve this problem, we can instead ask something of the form “What is your break-even price for a contract that pays out $1 if and are true but that is fully refunded if is false?”

9. Alternatively, if we want to think of deployment as a concrete set of inputs rather than a distribution, we can set .

10. Note that this is a conservative prior. In practice, we might hope for random unacceptable behavior to be unlikely enough and difficult enough that we can get a significant advantage out of our prior rather than simply assuming uniformity.

11. One appealing idea for how we might be able to satisfy it is if we use some sort of online learning scheme. This doesn't actually solve the problem, however: the difference between and here is just that unacceptable behavior in is recoverable but unacceptable behavior in is not. Even if we do online learning, the system will presumably still have to be deployed in situations where it will be able to take actions that would be unrecoverable if unacceptable. However, online learning might help us increase the number of unacceptable actions required before the situation becomes unrecoverable, which is exactly the fix analyzed here.

12. Specifically, the experiments were conducted by using a black-box optimizer to select so as to maximize .

13. The trends given here are highly approximate. Significantly better approximations are given by and .

14. Why you would ever actually get a model of this form is a complicated question; for a complete treatment, see the section on deceptive alignment in “Risks from Learned Optimization.”

15. In particular, it might be that is iid relative to , but still becomes more likely when conditioning on being important to us.

16. Note that here is not the minimum number of steps before can become unacceptable, but rather the minimum number of steps before can deceive into thinking it's acceptable when it's not.

17. Though no more mechanistic approach currently exists, I am excited about future research possibly yielding progress in this domain.

18. Much research in the direction of extracting interpretable world models already exists (such as Carter et al.'s Activation Atlases, for example), though it is currently unclear to what extent such techniques will scale to more powerful AI systems.

Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT, I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training', and within the section calling it that instead of 'Paul's approach'.

Good point—edited.

Where footnote 7 reads:

I think the "if" direction is right, here, but the "only if" direction is wrong. For example, the approximation is also tight in the case where Xpseudo only has a single element alpha such that alpha(x) is true for all x.

I think the approximation is tight if and only if any of the α∈Xpseudo that maximizes the expression fulfils C(M,x) –> α(x).

I'm not sure what's going on with the types in this equation, at the start of the formalization section:

Adv(M)=Amp(M)(“What's a pseudo-input on which M is particularly likely to recommend an unacceptable action?”)

I'd think that the left side represents a pseudo-input, while the right represents an action. Am I missing something?

Actions are just language outputs—and since we ask for an action that describes a pseudo-input, hopefully we should be able to interpret it that way.

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

Yep, that's one of the main concerns. The idea, though, is that all you have to deal with should be a standard overfitting problem, since you don't need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you, which I think is a solvable overfitting problem. Currently, my hope is that you can do that by using the acceptability signal to enforce an easy-to-verify condition that rules out deception, such as myopia.

What might a "purely predictive model with only a world model but no optimization procedure" look like, when considering complicated domains and arbitrarily high predictive accuracy?

It seems plausible that a sufficiently accurate predictive model would use powerful optimization processes. For example, consider a predictive model that predicts the change in Apple's stock price at some moment t (based on data until t). A sufficiently powerful model might, for example, search for solutions to some technical problem related to the development of the next iPhone (that is being revealed that day) in order to calculate the probability that Apple's engineers overcame it.

I believe it would look like Microscope AI.

If the model that is used as a Microscope AI does not use any optimization (search), how will it compute the probability that, say, Apple's engineers will overcome a certain technical challenge?

That's a good question. Perhaps it does make use of optimization but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between say, GPT-3 and the classic paperclip maximizer or other tiling AI.

This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.

For the Alignment Newsletter:

Summary:

Previously, Paul Christiano proposed creating an adversary to search for inputs that would make a powerful model behave "unacceptably" and then penalizing the model accordingly. To make the adversary's job easier, Paul relaxed the problem so that it only needed to find a pseudo-input, which can be thought of as a predicate that constrains possible inputs. This post expands on Paul's proposal by first defining a formal unacceptability penalty and then analyzing a number of scenarios in light of this framework. The penalty relies on the idea of an amplified model inspecting an unamplified version of itself. For this procedure to work, amplified overseers must be able to correctly deduce whether potential inputs will yield unacceptable behavior in their unamplified selves, which seems plausible since they should know everything the unamplified version does. The post concludes by arguing that progress in model transparency is key to these acceptability guarantees. In particular, Evan emphasizes the need to decompose models into the parts involved in their internal optimization processes, such as their world models, optimization procedures, and objectives.

Opinion:

I agree that transparency is an important condition for the adversary, since it would be hard to search for catastrophe-inducing inputs without details of how the model operated. I'm less certain that this particular decomposition of machine learning models is necessary. More generally, I am excited to see how adversarial training can help with inner alignment.

My opinion, also going into the newsletter:

Like Matthew, I'm excited to see more work on transparency and adversarial training for inner alignment. I'm somewhat skeptical of the value of work that plans to decompose future models into a "world model", "search" and "objective": I would guess that there are many ways to achieve intelligent cognition that don't easily factor into any of these concepts. It seems fine to study a system composed of a world model, search and objective in order to gain *conceptual* insight; I'm more worried about proposing it as an actual plan.

The point about decompositions is a pretty minor portion of this post; is there a reason you think that part is more worthwhile to focus on for the newsletter?

That's... a fair point. It does make up a substantial portion of the transparency section, which seems like the "solutions" part of this post, but it isn't the entire post.

Matthew's certainly right that I tend to reply to things I disagree with, though I usually try to avoid disagreeing with details. I'm not sure that I only disagree with details here, but I can't clearly articulate what about this feels off to me. I'll delete the opinion altogether; I'm not going to put an unclear opinion in the newsletter.

I'm not Rohin, but I think there's a tendency to reply to things you disagree with rather than things you agree with. That would explain my emphasis anyway.

I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

Also, for me \mathbb{B} for {True, False} was not standard, I think it should be defined.

They're unconditional, not conditional probabilities. The comma is just for the exists quantifier.

Sure—edited.

Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by Adv(M)"?

(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what the unacceptability penalty is because I'm confused about Adv(M)(x).)

The idea is that we're thinking of pseudo-inputs as “predicates that constrain X” here, so, for α∈Xpseudo, we have α:X→B.