Operationalizing compatibility with strategy-stealing

14Vanessa Kosoy

4Evan Hubinger

4Adam Shimi

2Adam Shimi

8Mark Xu

4Daniel Kokotajlo

Notice that Eliezer's definition + the universal prior is essentially the same as the computationally unbounded variant of my definition of goal-directed intelligence, *if you fix the utility function and prior*. I think that when you're comparing different utility functions in your definition of strategy-stealing, it's better to emulate my definition and correct by the description complexity of the utility function. Indeed, if you have an algorithm that works equally well for all utility functions but contains a specification of the utility function inside its source code, then you get "spurious" variance if you don't correct for it (usually superior algorithms would also have to contain a specification of the utility function, so the more complex the utility function, the fewer of them there will be).

> I think that when you're comparing different utility functions in your definition of strategy-stealing, it's better to emulate my definition and correct by the description complexity of the utility function.

Yep, that's a good point—I agree that correcting for the description length of the utility function is a good idea there.

> Indeed, if you have an algorithm that works equally well for all utility functions but contains a specification of the utility function inside its source code, then you get "spurious" variance if you don't correct for it

Is the issue in that case that the hardcoded utility will require far less optimization power?

If it's that, I'm not sure why this is considered a problem. After all, such an algorithm is not really compatible with strategy stealing, even if it could become so quite easily. This seems like a distinction we should ask from the definition.

Using the perspective from "The ground of optimization" means you can get rid of the action space and just say "given some prior and some utility function, what percentile of the distribution does this system tend to evolve towards?" (where the optimization power is again the log of this percentile).

We might then say that an optimizing system is compatible with strategy stealing if it's retargetable for a wide set of utility functions in a way that produces an optimizing system that has the same amount of optimization power.

An AI that is compatible with strategy stealing is one way of producing an optimizing system that is compatible with strategy stealing, with a particularly easy form of retargeting. The difficulty of retargeting, however, provides another useful dimension along which optimizing systems vary: instead of just the optimization power the AI can direct towards different goals, you have a space of the "size" of allowed retargeting and the optimization power applied toward the goal, for all goals and retargeting sizes.

If we use a normal universal prior, won't that lead to some pretty weird results? If someone in a very simple universe decides to write down a policy in a prominent location, then that policy is very simple in the universal prior, even if it takes trillions of pages of code.

Maybe using a speed prior fixes these problems?

Thanks to Noa Nabeshima and Kate Woolverton for helpful comments and feedback.

## Defining optimization power

One of Eliezer's old posts which I think has stood the test of time the best is his “Measuring Optimization Power.” In it, Eliezer defines optimization power as follows.^{[1]}

Let $A$ be some action space and $p$ be some probability measure over actions. Then, for some utility function $U: A \to \mathbb{R}$ and particular action $a^* \in A$, Eliezer defines the bits of optimization power in $a^*$ as

$$\operatorname{opt}_{p,U}(a^*) = -\log_2 \int_{\{a \in A \,\mid\, U(a) \ge U(a^*)\}} p(a)\, da$$

which, intuitively, is the number of times that you have to cut the space in half before you get an action as good according to $U$ as $a^*$.
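As a rough illustration of the definition above, here is a minimal sketch for a finite action space, where the integral becomes a sum over actions. The function name `optimization_power_bits` and the toy setup (eight equally likely actions, utility equal to the action's index) are assumptions made for this example, not anything from Eliezer's post.

```python
import math

def optimization_power_bits(actions, p, utility, a_star):
    """Discrete analogue of opt_{p,U}(a*): minus log2 of the p-mass of
    actions that are at least as good as a_star according to utility."""
    threshold = utility(a_star)
    mass = sum(p[a] for a in actions if utility(a) >= threshold)
    return -math.log2(mass)

# Hypothetical toy setup: eight equally likely actions, utility = index.
actions = list(range(8))
p = {a: 1 / 8 for a in actions}
utility = lambda a: a

print(optimization_power_bits(actions, p, utility, 7))  # 3.0 (top 1/8 of the mass)
print(optimization_power_bits(actions, p, utility, 4))  # 1.0 (top half of the mass)
```

Picking the single best of eight equally likely actions takes three halvings of the space, which matches the "cut the space in half" intuition.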

In my opinion, however, a better, more intuitive version of the above definition can be obtained by using quantilizers. A $q$-quantilizer relative to some utility function $U$ and base distribution over actions $p$ is a system which randomly selects an action from the top $q$ fraction of actions from $p$ sorted by $U$. Thus, a 0.1-quantilizer selects actions randomly from the top 10% of actions according to $U$. Intuitively, you can think about this procedure as being basically equivalent to randomly sampling $1/q$ actions from $p$ and picking the best according to $U$.
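To make the quantilizer picture concrete, the following sketch implements a q-quantilizer over a finite action set alongside the "sample 1/q actions and keep the best" procedure. The helper names `quantilize` and `best_of_n`, and the 100-action toy setup, are assumptions for illustration only. The two procedures are similar in spirit rather than identical, so their average utilities should come out comparable but not equal.

```python
import random

def quantilize(actions, p, utility, q, rng):
    """Sample from p restricted to its top-q fraction (by probability mass)."""
    ranked = sorted(actions, key=utility, reverse=True)
    kept, weights, remaining = [], [], q
    for a in ranked:
        if remaining <= 1e-12:  # tolerance for floating-point leftovers
            break
        w = min(p[a], remaining)  # the boundary action may be only partly included
        kept.append(a)
        weights.append(w)
        remaining -= w
    return rng.choices(kept, weights=weights, k=1)[0]

def best_of_n(actions, p, utility, n, rng):
    """Draw n actions from p and keep the best one according to utility."""
    draws = rng.choices(actions, weights=[p[a] for a in actions], k=n)
    return max(draws, key=utility)

# Hypothetical toy setup: 100 equally likely actions, utility = index.
actions = list(range(100))
p = {a: 1 / 100 for a in actions}
utility = lambda a: a
rng = random.Random(0)

quantilized = [quantilize(actions, p, utility, 0.1, rng) for _ in range(2000)]
best_of_ten = [best_of_n(actions, p, utility, 10, rng) for _ in range(2000)]
print(sum(quantilized) / len(quantilized))   # average utility of the 0.1-quantilizer
print(sum(best_of_ten) / len(best_of_ten))   # average utility of best-of-10 sampling
```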

Now, using quantilizers, we can give a nice definition of optimization power for an entire model. That is, given a model $M$, let $q^* \in (0, 1)$ be the smallest fraction^{[2]} such that a $q^*$-quantilizer with base distribution $p$ is at least as good^{[3]} at satisfying $U$ as $M$. Then, let $\operatorname{opt}_{p,U}(M) = -\log_2 q^*$. What's nice about this is that it gives us a measure of optimization power for a whole model and a nice intuitive picture of what it would look like for a model to have that much optimization power: it would look like a $q^*$-quantilizer.

Both of these definitions do still leave the distribution $p$ unspecified, but if we want a very general notion of optimization power then I would say that $p$ should probably be some sort of universal prior such that simple policies are weighted more heavily than their more complex counterparts. If we use the universal prior, we get the nice property that the more complex the policy needed to optimize some utility function, the more optimization power is needed. Thus, we can replace $\operatorname{opt}_{p,U}$ with just $\operatorname{opt}_U$ where $p$ is assumed to be some universal prior.
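The model-level definition can be sketched numerically under some simplifying assumptions. Here the model is summarized by a single number, `model_value`, its expected utility (a stand-in for the expected reward on a POMDP mentioned in footnote 3), and the action set is finite. The sketch reads $q^*$ as the boundary fraction at which the quantilizer just matches the model, found by bisection as the largest $q$ whose quantilizer still does at least as well. That reading, the function names, and the toy numbers are all assumptions of this example.

```python
import math

def quantilizer_value(actions, p, utility, q):
    """Expected utility of a q-quantilizer over a finite action set."""
    ranked = sorted(actions, key=utility, reverse=True)
    total, remaining = 0.0, q
    for a in ranked:
        if remaining <= 0:
            break
        w = min(p[a], remaining)  # the boundary action may be only partly included
        total += w * utility(a)
        remaining -= w
    return total / q

def model_optimization_power(actions, p, utility, model_value, iters=60):
    """Bits of optimization power: -log2 of the boundary fraction q*."""
    if quantilizer_value(actions, p, utility, 1.0) >= model_value:
        return 0.0  # sampling straight from p already does at least as well
    lo, hi = 0.0, 1.0  # the hi-quantilizer is always worse than the model
    for _ in range(iters):
        mid = (lo + hi) / 2
        if quantilizer_value(actions, p, utility, mid) >= model_value:
            lo = mid  # still matches the model, so try a less selective quantilizer
        else:
            hi = mid
    return -math.log2(lo) if lo > 0 else math.inf

# Hypothetical toy setup: 100 equally likely actions, utility = index,
# and a model whose expected utility is 94.5 (the mean of the top ten actions).
actions = list(range(100))
p = {a: 1 / 100 for a in actions}
utility = lambda a: a

bits = model_optimization_power(actions, p, utility, model_value=94.5)
print(bits)  # about 3.32 bits, since q* is about 0.1
```

Note that bisection relies on the quantilizer's expected utility decreasing as $q$ grows, which fails when the utility function is flat over part of the space.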

## Compatibility with strategy-stealing

Now, given such a definition of optimization power, I think we can give a nice definition of what it would mean for an AI system/training procedure to be compatible with the strategy-stealing assumption. Intuitively, we will say that an AI system/training procedure $\mathrm{train}: U \to M$ which maps utility functions onto models is *compatible with strategy-stealing* if $\operatorname{opt}_U(\mathrm{train}(U))$ doesn't vary much over some set of utility functions $Y$; that is, if $\mathrm{train}$ isn't better at optimizing for (or producing models which optimize for) some objectives in $Y$ than others. We can make this definition more precise for a set of utility functions $Y$ if we ask for $\operatorname{stdev}\{\operatorname{opt}_U(\mathrm{train}(U)) \mid U \in Y\}$ to be small.^{[4]}

This definition is very similar to my definition of value-neutrality, as they are both essentially pointing at the same concept. What's nice about using $\operatorname{opt}_U$ here, though, is that it lets us compare very difficult-to-satisfy utility functions with much easier-to-satisfy ones on equal footing, as we're just asking for $\mathrm{train}$ to produce actions which always score in the top whatever percent, which should be equally easy to achieve regardless of how inherently difficult $U$ is to satisfy.^{[5]}

Notably, this definition of compatibility with strategy-stealing is somewhat different than others' notions in that it is about a property of a *single AI system/training procedure* rather than a property of a deployment scenario or a collection of AI systems.^{[6]} As a result, however, I think it makes this notion of compatibility with strategy-stealing much more meaningful in more homogenous and/or unipolar takeoff scenarios.

In particular, consider a situation in which we manage to build a relatively powerful and relatively aligned AI system such that you can tell it what you want and it will try to do that. However, suppose it's not compatible with strategy-stealing such that it's much better at achieving some of your values than others, because it's much better at achieving the easy-to-measure ones, as our current training procedures are, for example. As a result, such a world could easily end up going quite poorly simply because the easy-to-measure values end up getting all of the resources at the expense of the hard-to-measure ones.

As a specific example, suppose we manage to build such a system and align it with Sundar Pichai. Sundar, as Google's CEO, wants Google to make money, but presumably he also cares about lots of other things: he wants other people to be happy, himself and his family to be safe, the world to be in a generally good spot, and so on. Now, even if we manage to build an AI system which is aligned with Sundar in the sense that it tries to do whatever Sundar tells it to do, if it's much better at satisfying some of Sundar's values than others (much better at making Google money than at putting the world in a generally good spot, for example), then I think that could end quite poorly. Sundar might choose to give this AI lots of flexible power and influence so that it can make Google a bunch of money, for example, and not realize that doing so will cause his other values to lose out in the long run. Or Sundar could be effectively forced to do so because Google needs to compete with other companies using similar systems.

In such a situation, compatibility with strategy-stealing is a pretty important desideratum and also one which is relatively independent of (at least a naive version of) intent alignment. The Sundar example also suggests what Y should be, which is the set of possible values that we might actually want our AI systems to pursue. As long as we produce systems that are compatible with strategy-stealing under such a Y, then that should ensure that scenarios like the above don't happen.
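The standard-deviation criterion from this section can be sketched on a toy problem. As a simplification, each hypothetical `train` function below returns a single action rather than a model, so the action-level definition of optimization power is used. The names `opt_bits`, `strategy_stealing_spread`, `fair_train`, and `biased_train`, along with the three utility functions standing in for $Y$, are assumptions made up for this example.

```python
import math
import statistics

def opt_bits(actions, p, utility, a_star):
    """Bits of optimization power of a_star under base measure p."""
    mass = sum(p[a] for a in actions if utility(a) >= utility(a_star))
    return -math.log2(mass)

def strategy_stealing_spread(actions, p, utilities, train):
    """Population standard deviation of opt_U(train(U)) over the utilities in Y."""
    scores = [opt_bits(actions, p, U, train(U)) for U in utilities]
    return statistics.pstdev(scores)

# Hypothetical toy setup: 16 equally likely actions and three utility functions.
actions = list(range(16))
p = {a: 1 / 16 for a in actions}
Y = [lambda a: a, lambda a: -a, lambda a: -(a - 8) ** 2]

# A procedure that optimizes every objective equally well (always finds the best action).
fair_train = lambda U: max(actions, key=U)
# A procedure that always outputs action 15, which is only good for the first objective.
biased_train = lambda U: 15

print(strategy_stealing_spread(actions, p, Y, fair_train))    # 0.0
print(strategy_stealing_spread(actions, p, Y, biased_train))  # noticeably larger
```

A spread near zero corresponds to a procedure that is equally good at every objective in $Y$, while the biased procedure scores 4 bits on one objective and close to 0 bits on the others.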

[1] Note that I'm taking some liberties in making Eliezer's definition somewhat more formal.

[2] It is worth noting that there is a possibility for the smallest fraction to be undefined here if $U$ is flat past some point (if all actions past the top 0.1% perform equally according to $U$, for example).

[3] If we want to be precise, we can define the quantilizer being “at least as good” as $M$ according to $U$ to mean that the quantilizer gets at least as much reward in expectation as $M$ on some given POMDP with reward function $U$. Alternatively, if we're okay with a more intuitive definition, we can just say that the quantilizer is at least as good as $M$ if a $U$ maximizer would choose to instantiate the quantilizer over $M$.

[4] We can also let $Y$ be a distribution instead of a set.

[5] There is a difficulty here if some of the $U \in Y$ are flat over large regions of the space such that actions look much more optimized than they actually are, such as the case that was mentioned previously in which $U$ is flat past some point such that $\operatorname{opt}_U$ is undefined past that point. This sort of difficulty can be ruled out, however, if we have the condition that $\forall U \in Y,\ \forall q_1 > q_2,\ \mathbb{E}[U(q_1\text{-quantilizer}(s)) \mid s] < \mathbb{E}[U(q_2\text{-quantilizer}(s)) \mid s]$.

[6] I am currently mentoring Noa Nabeshima in writing up a better summary/analysis of these different notions of strategy-stealing which should hopefully help resolve some of these confusions.