Towards a mechanistic understanding of corrigibility

evhub

Acceptability

To be able to use something like relaxed adversarial training to verify a model, a necessary condition is having a good notion of acceptability. Paul Christiano describes the following two desiderata for any notion of acceptability:

"As long as the model always behaves acceptably, and achieves a high reward on average, we can be happy."
"Requiring a model to always behave acceptably wouldn't make a hard problem too much harder."

While these are good conditions that any notion of acceptability must satisfy, there may be many different possible acceptability predicates that meet both of these conditions—how do we distinguish between them? Two additional major conditions that I use for evaluating different acceptability criteria are as follows:

It must be not that hard for an amplified overseer to verify that a model is acceptable.
It must be not that hard to find such an acceptable model during training.

These conditions are different from Paul's second condition in that they are statements about the ease of training an acceptable model rather than the ease of choosing an acceptable action. If you want to be able to do some form of informed oversight to produce an acceptable model, however, these are some of the most important conditions to pay attention to. Thus, I generally think about choosing an acceptability condition as trying to answer the question: what is the easiest-to-train-and-verify property such that all models that satisfy that property^[1] (and achieve high average reward) are safe?

Act-Based Corrigibility

One possible candidate property that Paul has proposed is act-based corrigibility, wherein an agent respects our short-term preferences, including those over how the agent itself should be modified. Not only is such an agent corrigible, Paul argues, but it will also want to make itself more corrigible, since having it be more corrigible is a component of our short-term preferences (Paul calls this the "broad basin" of corrigibility). While such act-based corrigibility would definitely be a nice property to have, it's unclear how exactly an amplified overseer could go about verifying such a property. In particular, if we want to verify such a property, we need a mechanistic understanding of act-based corrigibility rather than a behavioral one, since behavioral properties can only be verified by testing every input, whereas mechanistic properties can be verified just by inspecting the model.
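A toy sketch of the behavioral/mechanistic distinction, under strong simplifying assumptions: the "model" is a hypothetical linear function with inputs restricted to [-1, 1], and the property being verified is just an upper bound on its output. Nothing here is specific to corrigibility; it only illustrates that testing inputs covers the points tried, while inspecting the parameters can settle the question for all inputs at once.

```python
from itertools import product


def behavioral_check(model, inputs, bound):
    """Behavioral verification: evaluate the model on every listed input."""
    return all(model(x) <= bound for x in inputs)


def mechanistic_check(weights, bias, bound):
    """Mechanistic verification: inspect the parameters directly.

    For f(x) = w . x + b with each x_i in [-1, 1], the largest possible
    output is sum(|w_i|) + b, so looking at the weights covers all inputs.
    """
    return sum(abs(w) for w in weights) + bias <= bound


# Hypothetical parameters chosen purely for illustration.
weights, bias = [0.5, -0.25, 0.125], 0.0


def model(x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Behavioral testing on a finite grid only covers the points we tried.
grid = list(product([-1.0, 0.0, 1.0], repeat=len(weights)))
behavioral_ok = behavioral_check(model, grid, bound=1.0)

# The mechanistic check covers the whole input box in one pass.
mechanistic_ok = mechanistic_check(weights, bias, bound=1.0)
```

For realistic models the mechanistic check is of course far harder to state than a sum of absolute weights, which is why the choice of property matters so much.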

One possible mechanistic understanding of corrigibility is corrigible alignment as described in "Risks from Learned Optimization," which is defined as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." While this gives us a starting point for understanding what a corrigible model might actually look like, there are still a bunch of missing pieces that have to be filled in. Furthermore, this notion of corrigibility looks more like instrumental corrigibility rather than act-based corrigibility, which as Paul notes is significantly less likely to be robust. Mechanistically, we can think of this lack of robustness as coming from the fact that "pointing" to the base objective is a pretty unstable operation: if you point even a little bit incorrectly, you'll end up with some sort of corrigible pseudo-alignment rather than corrigible robust alignment.

We can make this model more act-based, and at least somewhat mitigate this robustness problem, however, if we imagine pointing to only the human's short-term preferences. The hope for this sort of a setup is that, as long as the initial pointer is "good enough," there will be pressure for the mesa-optimizer to make its pointer better in the way in which its current understanding of short-term human preferences recommends, which is exactly Paul's "broad basin" of corrigibility argument. This requires it to be not that hard, however, to find a model with a notion of the human's short-term preferences as opposed to their long-term preferences that is also willing to correct that notion based on feedback.

In particular, it needs to be the case that it is not that hard to find an agent which will correct mistakes in its own prior over what the human's short-term preferences are. From a naive Bayesian perspective, this seems unlikely, as it seems strange for an agent to be incentivized to change its own prior. However, this is actually a very natural state for an agent to be in: if I trust your beliefs about X more than I trust my own, then that means I would endorse a modification of my prior to match yours. In the context of act-based corrigibility, we can think about this from a mechanistic perspective as having a pre-prior that encodes a belief that the human prior over human short-term preferences is to be preferred. Furthermore, pre-priors are generally epistemically valuable for agents to have, as a pre-prior can encourage an agent to correct its own cognitive biases. Thus, agents with pre-priors should be incentivized by most training processes, and thus shouldn't be too difficult to find.
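One way to picture the pre-prior idea is as a weight on how much the agent prefers the human's prior to its own. The sketch below uses simple linear pooling, which is an assumption made for illustration and not a formalization taken from the literature; the hypothesis names and probabilities are hypothetical.

```python
def corrected_prior(own_prior, human_prior, trust_in_human):
    """Pool two priors over the same hypotheses.

    trust_in_human plays the role of the pre-prior: 0.0 means the agent
    keeps its own prior, 1.0 means it fully adopts the human's prior.
    """
    return {
        h: (1.0 - trust_in_human) * own_prior[h] + trust_in_human * human_prior[h]
        for h in own_prior
    }


# Hypothetical beliefs about the human's short-term preferences.
own = {"wants_coffee": 0.75, "wants_tea": 0.25}
human = {"wants_coffee": 0.25, "wants_tea": 0.75}

# Full deference: the agent endorses replacing its prior with the human's.
pooled = corrected_prior(own, human, trust_in_human=1.0)

# Partial deference: the agent moves halfway toward the human's prior.
half = corrected_prior(own, human, trust_in_human=0.5)
```

With `trust_in_human=1.0` the result equals the human's prior exactly, which corresponds to the case of trusting someone else's beliefs about X more than one's own.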

Indifference Corrigibility

Instrumental and act-based corrigibility are not the only forms of corrigibility that have been discussed in the literature, however: there's also indifference corrigibility, wherein the agent is indifferent to modifications the human might make to it such as shutting it off. While this form of corrigibility doesn't in and of itself guarantee acceptability, if you had a way of verifying indifference corrigibility, there might be ways to turn that back into act-based corrigibility.

First, however, if we want to be able to verify indifference corrigibility, we are going to need a mechanistic understanding of it. Let's suppose we accept the argument that a competitive model will likely be a mesa-optimizer such that it will be running some sort of optimization process coupled with some sort of mesa-objective. Furthermore, let's suppose that its optimization process is computed over some sort of world model, which I think is a reasonable assumption for any competitive mesa-optimizer that wants to be able to take actions in the world. In such a case, we can think of indifference corrigibility as a form of stop gradient in the mesa-optimizer's world model which its optimization process doesn't optimize through,^[2] which is a highly mechanistic property that we might expect an amplified overseer to be able to verify.

Again, however, we need agents with stop gradients to form a natural enough class such that they are not that hard to find. While it might naively seem as if agents with stop gradients would be very rare, in fact the exact opposite might be the case. Consider an agent such as the mesa-optimizer described previously which is running an optimization process over some world model, but which has preferences over the state of the world rather than the state of its world model. Such an agent will have an incentive to keep its world model aligned with the world, but only in one direction: it will want to modify its world model to match the world, but not modify the world to match its world model. This is a type of stop gradient, as it is a restriction on the agent to only minimize the difference between its world model and the world through one pathway but not through another. This example seems to suggest that stop gradients aren't that uncommon among agents such that it might be fairly achievable to find such an agent.
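The one-directional pressure described above can be sketched with a scalar toy problem. This is a minimal illustration with hand-written gradients, assuming a squared-error mismatch between a single "world model" number and a single "world" number; the function name and learning rate are hypothetical.

```python
def step(model_state, world_state, lr=0.25, stop_gradient_on_world=True):
    """One gradient step on loss = (model_state - world_state) ** 2."""
    error = model_state - world_state
    grad_model = 2.0 * error
    # With the stop gradient, the world is treated as a constant, so no
    # optimization pressure flows into it.
    grad_world = 0.0 if stop_gradient_on_world else -2.0 * error
    return model_state - lr * grad_model, world_state - lr * grad_world


model_state, world_state = 0.0, 4.0
for _ in range(20):
    model_state, world_state = step(model_state, world_state)
# The world model has moved toward the world; the world has not moved.

# Without the stop gradient, the same loss also pushes the world toward
# the world model, which is the pathway the agent should not use.
_, moved_world = step(0.0, 4.0, stop_gradient_on_world=False)
```

In an automatic differentiation framework the same effect is usually obtained with a stop-gradient or detach operation on the world term instead of a hand-written flag.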

Even if we can train and verify indifference corrigibility, however, one of the big remaining problems with such an approach is that an indifference corrigible mesa-optimizer might do search in such a way that it finds a policy which is not itself indifference corrigible. We can think about this problem as arising from the fact that indifference corrigibility doesn't tile, though I more often think of this problem as the problem of "forwarding the guarantee," wherein even if we know how to do safe search on the outside, we have to make sure that any search inside of our model is similarly safe as well. This is a particularly interesting type of inner alignment problem in that, rather than being about how to ensure that a mesa-optimizer's objective is aligned, it is about how to ensure that a mesa-optimizer's search is safe even given that its objective is aligned. However, it seems plausible that this sort of problem could be resolved by ensuring that the model has a meta-preference towards any policies it produces also respecting the same stop gradient. In particular, the overseer could verify that any search over policies done by the model enforces the constraint that every policy have such a stop gradient.

Even once we can verify that our model is indifference corrigible and that it will forward that guarantee to any other search it might perform, however, there is still the question of how we might be able to use such a mechanism to produce a safe system. One way in which indifference corrigibility could be used to produce safety is to enforce that our model behave myopically. We can think of a myopic agent as one that only considers how best to answer the single question that you give to it rather than considering any sort of long-term consequences, which can be implemented as a stop gradient preventing any optimization outside of that domain. While myopia on its own is still insufficient to guarantee acceptability, it seems like it would at least prevent deceptive alignment, as one of the conditions for deceptive alignment is that the mesa-optimizer must have something to gain from cooperating now and then defecting later, which is not true for a myopic agent. Thus, if directed at a task which we are confident is outer aligned, such as pure supervised amplification (training a model to approximate a human consulting that model), and combined with a scheme for preventing standard pseudo-alignment (such as adversarial training), myopia verification might be sufficient to resolve the rest of the inner alignment problem by preventing deceptive alignment.
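The claim that a myopic agent has nothing to gain from cooperating now and defecting later can be sketched as a comparison of plan values under different horizons. The plan names and reward numbers below are hypothetical, and rewards are measured with respect to the mesa-objective; this is a toy illustration, not a model of any real training setup.

```python
def plan_value(per_step_rewards, horizon):
    """Total mesa-objective reward the agent counts within its horizon."""
    return sum(per_step_rewards[:horizon])


def best_plan(plans, horizon):
    """Pick the plan with the highest value under the given horizon."""
    return max(plans, key=lambda name: plan_value(plans[name], horizon))


# Hypothetical two-step episode: step 1 is training, step 2 is deployment.
plans = {
    # Look aligned now, then defect once oversight is gone.
    "cooperate_now_defect_later": [0.0, 10.0],
    # Pursue the mesa-objective immediately (and get corrected afterwards).
    "pursue_objective_now": [1.0, 0.0],
}

non_myopic_choice = best_plan(plans, horizon=2)
myopic_choice = best_plan(plans, horizon=1)
```

Under the two-step horizon the deceptive plan wins, while under the one-step horizon the later payoff is never counted, so deception has no advantage. The myopic agent in this toy example is still misaligned, which is why the argument above also relies on a separate scheme for catching standard pseudo-alignment.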

Conclusion

If we want to be able to do relaxed adversarial training to produce safe AI systems, we are going to need a notion of acceptability which is not that hard to train and verify. Corrigibility seems to be one of the most promising candidates for such an acceptability condition, but for that to work we need a mechanistic understanding of exactly what sort of corrigibility we're shooting for and how it will ensure safety. I think that both of the paths considered here—both act-based corrigibility and indifference corrigibility—look like promising research directions for attacking this problem.

^[1] Or at least all models that we can find that satisfy that property.
^[2] Thanks to Scott Garrabrant for the stop gradient analogy.

This requires it to be not that hard, however, to find a model with a notion of the human’s short-term preferences as opposed to their long-term preferences that is also willing to correct that notion based on feedback.

I'm curious if you have a better understanding of "short-term preferences" than I do. I'm not sure what a definition of it could be; however, from Paul's earlier writings I guess it's things like "gain resources" and "keep me in control". But it seems like a human might have really bad/wrong understandings of what constitutes "resources" (e.g., I may not realize that something is a really valuable resource for achieving my long-term goals, so it's not part of my short-term preferences to get more of it) and "control" (if I listen to some argument on the Internet or from my AI, and have my mind changed by it, maybe I won't be in control anymore but I don't realize this), so it's hard for me to see how having an AI optimize for my short-term preferences will lead to reaching my long-term goals, especially in a competitive environment with other AIs around.

So I would be interested to see:

1) a definition of "short-term preferences"
2) a (verifiable) mechanism by which an AI/model could learn just the short-term preferences of a human, or distinguish between short-term and long-term preferences after learning both
3) an explanation of why optimizing for humans' own understanding of short-term preferences is good enough for avoiding x-risk

Another way to explain my doubt on 3 is, if other AIs are optimizing for a superhuman understanding of what short-term preferences will lead to long-term goals, and my AI is only optimizing for a human understanding of that, how is my AI going to be competitive?

ETA: See also "Strategic implications of AIs’ ability to coordinate at low cost," which describes another way that a corrigible AI may not be competitive. Would be interested in your thoughts on that topic as well.

Myopia feels like it has the wrong shape. As I understand it, deceptive alignment stems from the instrumental convergence of defecting later: the model is incentivized to accrue power. We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.

I think this requires less of a capabilities hit than myopia does. This predicate might be precisely the right shape to cut off the mesa optimizer's instrumental incentive for power and deceptive alignment. At the least, it feels like a much better fit. However, this might not tile by default? Not sure.

I think that we have different pictures of what outer alignment scheme we're considering. In the context of something like value learning, myopia would be a big capabilities hit, and what you're suggesting might be better. In the context of amplification, however, myopia actually helps capabilities. For example, consider a pure supervised amplification model—i.e. I train the model to approximate a human consulting the model. In that case, a non-myopic model will try to produce outputs which make the human easier to predict in the future, which might not look very competent (e.g. output a blank string so the model only has to predict the human rather than predicting itself as well). On the other hand, if the model is properly myopic such that it is actually just trying to match the human as closely as possible, then you actually get an approximation of HCH, which is likely to be a lot more capable. That being said, unless you have a myopia guarantee like the one above, a competitive model might be deceptively myopic rather than actually myopic.

I like this reply and I think there's something subtle going on with the meaning of "myopic" here and I'm going to try to think about it more.

I intuitively agree that myopia seems to have the wrong shape, although I seem to be thinking about it differently. I think myopia seems risky in that a system with a time horizon of one second would happily consume all the world's resources in a half second, if doing so helped it achieve its goal in the next half second. You have to bank on making the time window so short that nothing like that is possible. But how short is short enough?

This feeds into my general impression that we should in most cases be thinking about getting the system to really do what we want, rather than warping its utility function to try and de-motivate it from making trouble.

This feeds into my general impression that we should in most cases be thinking about getting the system to really do what we want, rather than warping its utility function to try and de-motivate it from making trouble.

A decomposition that's been on my mind lately: we can center our framing on the alignment and motivation of the system's actual goal (what you're leaning towards), and we can also center our framing on why misspecifications are magnified into catastrophically bad behavior, as opposed to just bad behavior.

We can look at attempts to e.g. find one simple easy wish that gets what we want ("AI alignment researchers hate him! Find out how he aligns superintelligence with one simple wish!"), but by combining concepts like superexponential concept space/fragility of value and Goodhart's law, we can see why there shouldn't be a low complexity object-level solution. So, we know not to look.

My understanding of the update being done on your general impression here is: "there are lots of past attempts to apply simple fixes to avoid disastrous / power-seeking behavior, and those all break, and also complexity of value. In combination with those factors, there shouldn't be a simple way to avoid catastrophes because nearest-unblocked-solution."

But I suggest there might be something missing from that argument, because there isn't yet common gears-level understanding of why catastrophes happen by default, so how do we know that we can't prevent catastrophes from being incentivized? Like, it seems imaginable that we could understand the gears so well that we can avoid problems; after all, the gears underlying catastrophic incentives are not the same as the gears underlying specification difficulty.

It may in fact just be the case that yes, preventing catastrophic incentives does not admit a simple and obviously-correct solution! A strong judgment seems premature; it isn't obvious to me whether this is true. I do think that we should be thinking about why these incentives exist, regardless of whether there is a simple object-level solution.

I think your characterization of my position is a little off. I'm specifically pointing heuristically against a certain kind of utility function patching, whereas you seem to be emphasizing complexity in your version.

I think my claim is something like "hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle."

I agree that a really good understanding could provide a solution.

However, I also suspect that any effective solution (and many ineffective solutions) which works by warping the utility function (adding penalties, etc.) will be "interpretable" as an epistemic state (change the beliefs rather than the utility function). And I suspect the good solutions correspond to beliefs which accurately describe critical aspects of the problem! E.g., there should just be a state of belief which a rational agent can be in which makes it behave corrigibly. I realize this claim has not been borne out by evidence thus far, however.

So IIUC, you're advocating trying to operate on beliefs rather than utility functions? But I don't understand why.

I think my claim is something like "hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle."

There seem to be different ways you can modify the objective. Take the solution to the easy problem of wireheading: I think we're comfortable saying there's a solution because the AI is obviously grading the future before it happens. No matter how smart you are, you're grading the future in an obviously-better way. So, we say the problem is solved. On the other extreme is AI boxing, where you put a bunch of traffic cones in the way of a distant oncoming car and say, "there's no way anyone could drive around this"!

I agree that exotic decision algorithms or preference transformations are probably not going to be useful for alignment, but I think this kind of activity is currently more fruitful for theory building than directly trying to get decision theory right. It's just that the usual framing is suspect: instead of exploration of the decision theory landscape by considering clearly broken/insane-acting/useless but not yet well-understood constructions, these things are pitched (and chosen) for their perceived use in alignment.

What do you mean by "these things"?

Also, to clarify, when you say "not going to be useful for alignment", do you mean something like "...for alignment of arbitrarily capable systems"? i.e. do you think they could be useful for aligning systems that aren't too much smarter than humans?

We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.

As phrased, this sounds like it would require correctly (or at least conservatively) tuning the trade-off between these two goals, which might be difficult.
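To make the tuning difficulty concrete, here is a toy sketch. It assumes the penalty enters as a simple linear term, score = objective gain minus a weight times power gain; that form, the plan names, and all numbers are hypothetical and chosen only for illustration.

```python
def penalized_score(objective_gain, power_gain, penalty_weight):
    """Objective gain minus a penalty for becoming more able to optimize."""
    return objective_gain - penalty_weight * power_gain


def chosen_plan(plans, penalty_weight):
    """Return the plan with the highest penalized score."""
    return max(
        plans,
        key=lambda name: penalized_score(*plans[name], penalty_weight),
    )


# Hypothetical plans as (objective_gain, power_gain) pairs.
plans = {
    "do_nothing": (0.0, 0.0),
    "reasonable_plan": (5.0, 1.0),
    "power_grab": (50.0, 20.0),
}

too_small = chosen_plan(plans, penalty_weight=0.5)   # penalty barely bites
in_range = chosen_plan(plans, penalty_weight=3.0)    # the intended regime
too_large = chosen_plan(plans, penalty_weight=10.0)  # agent becomes useless
```

In this example the reasonable plan is only selected for penalty weights in a middle band, so an error in either direction gives either power-seeking or inaction.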

I generally don't read links when there's no context provided, and think it's almost always worth it (from a cooperative perspective) to provide a bit of context.

Can you give me a TL;DR of why this is relevant or what your point is in posting this link?

The post answers to what extent safely tuning that trade-off is feasible, and the surrounding sequence motivates that penalization scheme in greater generality. From Conclusion to 'Reframing Impact':

OK, thanks.

The TL;DR seems to be: "We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans."

This seems good... can you confirm my understanding below is correct?

1) RE: "A lower bound": this seems good because we don't need to know how extreme catastrophes could be, we can just say: "If (e.g.) the earth or the human species ceased to exist as we know it within the year, that would be catastrophic".

2) RE: "How much utility is available": I guess we can just set a targeted level of utility gain, and it won't matter if there are plans we'd consider reasonable that would exceed that level? (e.g. "I'd be happy if we can make 50% more paperclips at the same cost in the next year.")

One possible candidate property that Paul has proposed is act-based corrigibility, wherein an agent respects our short-term preferences, including those over how the agent itself should be modified.

Similar to my (now largely resolved) confusion about how Paul uses "corrigibility", I also have a confusion about how "corrigibility" is used here. In particular, is "act-based corrigibility" synonymous with "respects our short-term preferences" (and if so do you mean "preferences-on-reflection") or is it a different property (i.e., corrigibility_MIRI or a broader version of that) that you think "an agent respects our short-term preferences" is likely to have? It seems to me from context that you mean the former (synonymous), because earlier you wrote:

what is the easiest-to-train-and-verify property such that all models that satisfy that property[1] (and achieve high average reward) are safe?

and "respects our short-term preferences" seems to be the "candidate property" that you're naming "act-based corrigibility" because it's a "mechanistic" property that might be easy to train and verify whereas corrigibility in the MIRI sense (or my current understanding of Paul's sense) does not seem mechanistic or easy to verify.

Can you please confirm whether my guess of your original intended meaning is correct? And if it is, please consider changing your wording here (or in the future) to be more consistent with Paul's clarification of how he uses "corrigibility"?

Part of the point that I was trying to make in this post is that I'm somewhat dissatisfied with many of the existing definitions and treatments of corrigibility, as I feel like they don't give enough of a basis for actually verifying them. So I can't really give you a definition of act-based corrigibility that I'd be happy with, as I don't think there currently exists such a definition.

That being said, I think there is something real in the act-based corrigibility cluster, which (as I describe in the post) I think looks something like corrigible alignment in terms of having some pointer to what the human wants (not in a perfectly reflective way, but just in terms of actually trying to help the human) combined with some sort of pre-prior creating an incentive to improve that pointer.

I thought Evan's response was missing my point (that "act-based corrigibility" as used in OP doesn't seem to be a kind of corrigibility as defined in the original corrigibility paper but just a way to achieve corrigibility) and had a chat with Evan about this on MIRIxDiscord (with Abram joining in). It turns out that by "act-based corrigibility" Evan meant both "a way of achieving something in the corrigibility cluster [by using act-based agents] as well as the particular thing in that cluster that you achieve if you actually get act-based corrigibility to work."

The three of us talked a bit about finding better terms for these concepts but didn't come up with any good candidates. My current position is that using "act-based corrigibility" this way is quite confusing and until we come up with better terms we should probably just stick with "achieving corrigibility using act-based agents" and "the kind of corrigibility that act-based agents may be able to achieve" depending on which concept one wants to refer to.

Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.

I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting it be corrigible for instrumental reasons, I think largely because it seems very elegant and that when it works many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and even possibly improve its corrigibility - as long as the pointer really is correct. I agree though that this solution doesn't seem stable to mistakes in the 'pointing', which is very concerning and makes me start to lean toward something more like act-based corrigibility being safer.

I'm still very pessimistic about indifference corrigibility though, in that it still seems extremely fragile/low-measure-in-agent-space. I think maybe I'm stuck imagining complex/unnatural indifference, as in finding agents indifferent to whether a stop-button is pressed, and that my intuition might change if I spend more time thinking about examples like myopia or world-model <-> world interaction, where the indifference seems to have more 'natural' boundaries in some sense.

I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting it be corrigible for instrumental reasons, I think largely because it seems very elegant and that when it works many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and even possibly improve its corrigibility - as long as the pointer really is correct.

The 'type of corrigibility' you are referring to there isn't corrigibility at all; rather, it's alignment. Indeed, the term corrigibility was coined to contrast with this, motivated by the fragility of this approach to getting the pointer right.

I'm still very pessimistic about indifference corrigibility though, in that it still seems extremely fragile/low-measure-in-agent-space.

I tend to agree. I'm hoping that thinking about myopia and related issues could help me understand more natural notions of corrigibility.

I'm not sure it's the same thing as alignment... it seems there's at least 3 concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:

1) "classic notion of alignment": The AI has the correct goal (represented internally, e.g. as a reward function)
2) "CIRL notion of alignment": AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner's mind)
3) "corrigibility": something else

One thing I found confusing about this post + Paul's post "Worst-case guarantees" (2nd link in the OP: https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d) is that Paul says "This is the second guarantee from “Two guarantees,” and is basically corrigibility." But you say: "Corrigibility seems to be one of the most promising candidates for such an acceptability condition". So it seems like you guys might have somewhat different ideas about what corrigibility means.

Can you clarify what you think is the relationship?

I don't think there's really a disagreement there—I think what Paul's saying is that he views corrigibility as the right way to get an acceptability guarantee.
