Impact measurement and value-neutrality verification

by Evan Hubinger 1mo15th Oct 20196 min read10 comments

16


Recently, I've been reading and enjoying Alex Turner's Reframing Impact sequence, but I realized that I have some rather idiosyncratic views regarding impact measures that I haven't really written up much yet. This post is my attempt at trying to communicate those views, as well as a response to some of the ideas in Alex's sequence.

What can you do with an impact measure?

In the "Technical Appendix" to his first Reframing Impact post, Alex argues that an impact measure might be "the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things—without assuming anything about the objective."

Personally, I am quite skeptical of this use case for impact measures. As it is phrased—and especially including the link to Robust Delegation—Alex seems to be implying that an impact measure could be used to solve inner alignment issues arising from a model with a mesa-objective that is misaligned relative to the loss function used to train it. However, the standard way in which one uses an impact measure is by including it in said loss function, which doesn't do very much if the problem you're trying to solve is your model not being aligned with that loss.[1]

That being said, using an impact measure as part of your loss could be helpful for outer alignment. In my opinion, however, it seems like that requires your impact measure to capture basically everything you might care about (if you want it to actually solve outer alignment), in which case I don't really see what the impact measure is buying you anymore. I think this is especially true for me because I generally see amplification as being the right solution to outer alignment, which I don't think really benefits at all from adding an impact measure.[2]

Alternatively, if you had a way of mechanistically verifying that a model behaves according to some impact measure, then I would say that you could use something like that to help with inner alignment. However, this is quite different from the standard procedure of including an impact measure as part of your loss. Instead of training your agent to behave according to your impact measure, you would instead have to train it to convince some overseer that it is internally implementing some algorithm which satisfies some minimal impact criterion. It's possible that this is what Alex actually has in mind in terms of how he wants to use impact measures, though it's worth noting that this use case is quite different than the standard one.

That being said, I'm skeptical of this use case as well. In my opinion, developing a mechanistic understanding of corrigibility seems more promising than developing a mechanistic understanding of impact. Alex mentions corrigibility as a possible alternative to impact measures in his appendix, though he notes that he's currently unsure what exactly the core principle behind corrigibility actually is. I think my post on mechanistic corrigibility gets at this somewhat, though there's definitely more work to be done there.

So, I've explained why I don't think impact measures are very promising for solving outer alignment or inner alignment—does that mean I think they're useless? No. In fact, I think a better understanding of impact could be extremely helpful, just not for any of the reasons I've talked about above.

Value-neutrality verification

In Relaxed adversarial training for inner alignment, I argued that one way of mechanistically verifying an acceptability condition might be to split a model into a value-neutral piece (its optimization procedure) and a value-laden piece (its objective). If you can manage to get such a separation, then verifying acceptability just reduces to verifying that the value-laden piece has the right properties[3] and that the the value-neutral piece is actually value-neutral.

Why is this sort of a separation useful? Well, not only might it make mechanistically verifying acceptability much easier, it might also make strategy-stealing possible in a way which it otherwise might not be. In particular, one of the big problems with making strategy-stealing work under an informed-oversight-style scheme is that some strategies which are necessary to stay competitive might nevertheless be quite difficult to justify to an informed overseer. However, if we have a good understanding of the degree to which different algorithms are value-laden vs. value-neutral, then we can use that to short-circuit the normal evaluation process, enabling your agent to pursue any strategies which it can definitely demonstrate are value-neutral.

This is all well and good, but what does it even mean for an algorithm to be value-neutral and how would a model ever actually be able to demonstrate that? Well, here's what I want out of a value-neutrality guarantee: I want to consider some optimization procedure to be value-neutral if, relative to some set of objectives , it doesn't tend to advantage any subset of those objectives over any other. In particular, I want it to be the case that if I start with some distribution of resources/utility/etc. over the different objectives then I don't want that distribution to change if I give each access to the optimization process . Specifically, what this does is that it guarantees that the given optimization process is compatible with strategy-stealing in that, if we deploy a corrigible AI running such an optimization process in service of many different values in , it won't systematically advantage some over others.

Interestingly, however, what I've just described is quite similar to Attainable Utility Preservation (AUP), the impact measure put forward by Turner et al. Specifically, AUP measures the extent to which an algorithm relative to some set of objectives advantages those objectives relative to doing nothing. This is slightly different from what I want, but it's quite similar in a way which I think is no accident. In particular, I think it's not hard to extend the math of AUP to apply to value-neutrality verification. That is, let be some optimization procedure over objectives , states , and actions . Then, we can compute 's value-neutrality by calculating

where measures the expected future discounted utility for some policy ,[4] is some null policy, and is the operator that finds the standard deviation of the given set. What's being measured here is precisely the extent to which , if given to each , would enable some to get more value relative to others. Now, compare this to the AUP penalty term, which, for a state and action is calculated as

where measures the expected future discounted utility under the optimal policy after having taken action in state and is some scaling constant.

Comparing these two equations, we can see that there's many similarities between and , but also some major differences. First, as presented here is a function of an agent's entire policy, whereas is only a function of an agent's actions.[5] Conceptually, I don't think this is a real distinction—I think this just comes from the fact that I want neutrality to be an algorithmic/mechanistic property, whereas AUP was developed as something you could use as part of an RL loss. Second—and I think this is a real distinction— takes a standard deviation, whereas takes a mean. This lets us think of both and as effectively being moments of the same distribution—it's just that is the first moment and is the second. Third, drops the absolute value present in , since we care about benefiting all values equally, not just impacting them equally.[6] Outside of those differences, however, the two equations are quite similar—in fact, I wrote just by straightforwardly adopting the AUP penalty to the value-neutrality verification case.

This is why I'm optimistic about impact measurement work: not because I expect it to greatly help with alignment via the straightforward methods in the first section, but because I think it's extremely applicable to value-neutrality verification, which I think could be quite important to making relaxed adversarial training work. Furthermore, though like I said I think a lot of the current impact measure work is quite applicable to value-neutrality verification, I would be even more excited to see more work on impact measurement specifically from this perspective. (EDIT: I think there's a lot more work to be done here than just my writing down of . Some examples of future work: removing the need to compute of an entire policy over a distribution (the deployment distribution) that we can't even sample from, removing the need to have some set which contains all the values that we care about, translating other impact measures into the value-neutrality setting and seeing what they look like, more exploration of what these sorts of neutrality metrics are really doing, actually running RL experiments, etc.)

Furthermore, not only do I think that value-neutrality verification is the most compelling use case for impact measures, I also think that specifically objective impact can be understood as being about value-neutrality. In "The Gears of Impact" Alex argues that "objective impact, instrumental convergence, opportunity cost, the colloquial meaning of 'power'—these all prove to be facets of one phenomenon, one structure." In my opinion, I think value-neutrality should be added to that list. We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption. Thus, even if you disagree with me that value-neutrality verification is the most compelling use case for impact measures, I still think you should believe that if you want to understand objective impact, it's worth trying to understand strategy-stealing and value neutrality, because I think they're all secretly talking about the same thing.


  1. This isn't entirely true, since changing the loss might shift the loss landscape sufficiently such that the easiest-to-find model is now aligned, though I am generally skeptical of that approach, as it seems quite hard to ever know whether it's actually going to work or not. ↩︎

  2. Or, if it does, then if you're doing things right the amplification tree should just compute the impact itself. ↩︎

  3. On the value-laden piece, you might verify some mechanistic corrigibility property, for example. ↩︎

  4. Also suppose that is normalized to have comparable units across objectives. ↩︎

  5. This might seem bad—and it is if you want to try to use this as part of an RL loss—but if what you want to do instead is verify internal properties of a model, then it's exactly what you want. ↩︎

  6. Thanks to Alex Turner for pointing out that the absolute value bars don't belong in . ↩︎

16