Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.

Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg)  for many helpful comments.


Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function  as input to the Oracle, it will output an element  that has an impressively low[1] value of . But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).

What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?

As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.

In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.

Edit: I've written a bit more on the motivation for this setting in a comment.

One neat trick

This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!

(Harry Potter and the Methods of Rationality, Chapter 56)

Let  be a finite set of possible outputs of the Oracle (e.g. strings of length at most ) and  be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output  such that the value of  is lower than some threshold . Let  be the set of satisficing outputs.[2]

We are worried about using the Oracle because it might be misaligned, i.e. it might be optimizing for a different objective function f'. Therefore, we fear there is a non-empty subset  of outputs that are dangerous to use because they were highly optimized for . But highly optimized outputs have to be rare[3] and thus we can safely assume that .

This has an important consequence: It's possible to randomly choose a subset  such that it's very likely that  will be empty but  will not! We can then modify our original objective function by adding the safety constraint that our output should be in .[4] The task we're now giving to the Oracle has become harder, but likely makes it impossible for the Oracle to find a dangerous satisficing output. Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in .


But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):

Use a randomly chosen hash function  with  for  and define .[5]

To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.

A set of hash functions  is called strongly universal if it maps the elements of  pairwise-independently and uniformly to , i.e.

Here,  is sampled uniformly from . The property of strong universality implies that a single element is mapped uniformly, i.e.

Such a set of hash functions exists and the functions can be efficiently computed.[6] The advantage of using strongly universal hash functions to specify  is that we need to generate (and store) far fewer random bits: In the case that  is the set of strings of length , only  random bits are needed instead of  if we were to choose  truly at random. As we will now see, if we specify  using strongly universal hash functions,  is still likely to have the desired properties:

Claim 1: If , then .

Claim 2: If  is large, then .

Because we have , it is possible to choose  such that . Then, Claim 1 is applicable and ensures that the chance that  contains a dangerous output is very small. Claim 2 says that, with high probability, the size of the set of acceptable answers  will be given by , approximately. Thus, by adding the safety constraint, the number of acceptable answers decreases by a factor of . But, because we have , the set  will likely not be empty.

The proof of both claims can be found (or safely be skipped) in the next section.


Let's start with Claim 1. Only assuming that  and using the union bound, we find that

For Claim 2, we want to prove an upper bound on the probability that  differs a lot from . Now,  is the sum of  indicator random variables  for :

Each indicator variable  is a Bernoulli variable with  and variance[7]

Because we want to compare  with , it will be convenient to work with the random variable .

We will proceed, following the proof of the weak law of large numbers, by computing mean and variance of the random variable  and then using Chebyshev's inequality.

Clearly, the expected value of X is just

Because of the strong universality of H, the random variables  are pairwise-independent and their variances are additive.[8] Therefore, we have

Now, Chebyshev's inequality tells us that for a RV X with mean , finite non-zero variance  and , we have

We set  and obtain

Thus, with probability at least , we will have  with an error of order .

If we only care about  not being empty, we can find a upper bound on  by choosing  and using :

Choosing the difficulty of the safety constraint

The key challenge for our ability to use the proof-of-optimization technique is choosing an appropriate value for , which limits the difficulty of the safety constraint. It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between  and  is many, many orders of magnitude, and we therefore have a wide target to hit.

In practice we might try to gain an understanding of the size of  (the number of acceptable answers) by looking at similar problems, for example easier problems with known solutions.

In case the Oracle doesn't provide an answer with a given difficulty parameter , we might be tempted to try again with a decreased .[9] This could be very dangerous if the Oracle deceptively withholds an answer, even though one exists, until we have decreased  enough to enable it to give a dangerous answer. We are in a strong position to avoid this problem, though, if we are able to use different Oracles (or modify the Oracles based on their performance) and compare the results.[10]

FAQ / Further thoughts

  • Q: Do you think this is a good idea?
  • A: A powerful, untrustworthy Oracle would be an extremely dangerous artifact, so I'd much prefer if we never create one. Thus I consider the setup described in this post to be a measure of last resort that we'll hopefully won't have to use. 
  • Q: Adding the safety constraint to our objective requires additional work from the Oracle to solve our desired task. Isn't that a problem?
  • A: Maybe not. The version of the proof-of-optimization technique given in this post is an extreme version that would still be safe with an all-powerful Oracle, because we choose  big enough to ensure that  becomes actually empty, thus making the task of providing a dangerous answer impossible. If we deal with a limited Oracle instead, it might be enough to choose  such that finding an element  becomes too difficult for the Oracle, even though  is just small and not completely empty. Assuming that, when increasing , the difficulty of finding an element in  doesn't increase faster than the difficulty of finding an element in , it would then still be possible for the Oracle to find an element in  and thereby provide us with a useful output.[11]
  • Q: What's the relation of the proof-of-optimization technique to quantilization?
  • A: In the limit of setting , approximately one randomly chosen output will be acceptable. Thus, by using the Oracle with such a safety constraint, we are sampling uniformly from the set of satisficing answers. This approximates using a -quantilizer with uniform base distribution, where  has been chosen such that the top percentile  corresponds to the set of satisficing answers. If we want to approximate a quantilizer with a non-uniform base distribution , we need to introduce weights  for each output . We then need to modify the definition of R to take account of the weights by setting  and .


In this post, I have described a technique that can be used to obtain safe satisficing outputs from an untrustworthy optimizing system. The technique is proven to be safe under the assumption that the subset of dangerous satisficing outputs  is much smaller than the set of all satisficing outputs , and that we are able to choose a number  such that . There are likely many orders of magnitude of possible values of m, but how to choose m in practice is still not entirely clear.

  1. ^

    I'm assuming here the convention that the objective function should be minimized.

  2. ^

    We need to take care that the threshold  is chosen such that the largest portion of the satisficing outputs  are safe to use, i.e. our objective function should still closely capture our intentions for values close to  and not yet be subject to Goodhart's curse.

  3. ^

    Yudkoswky has suggested in "Measuring Optimization Power" that we can measure the power of an optimization process by comparing "the total number of states with equal or greater rank in the preference ordering to the outcome achieved" with the number of all possibles states. Correspondingly, the difficulty of an satisfaction task can be measured by comparing the number of satisficing outputs with the number of all possible outputs. I'm not convinced that this is completely correct, but it seems that this ratio gives at least an upper bound on the power of the process / difficulty of the task. In particular, this means that if we need to find an element in a small set, the task can not be very difficult, and conversely, if the task is difficult, the set has to be big.

  4. ^

    Let's assume, with no loss of generality, that our satisfaction threshold is . We incorporate the additional constraint using  with

    and construct the new objective function  by using the ramp function :


    Now we have  iff  and otherwise g(x) > 0. Thus we can ask the Oracle to minimize the new objective function  to find some  that both satisfices and is in .

  5. ^

    In this case we can define the new objective function directly using  instead of  as in Footnote [4], i.e. by setting .

    This might be advantageous because it provides information about the hash function to the Oracle and thus makes it easier for it to do its job.

  6. ^
  7. ^
  8. ^
  9. ^

    The method of repeatedly querying the Oracle while decreasing  can potentially also be used to gain information about the size of , by deliberately asking for (and then discarding) outputs that are believed to be dangerous.

  10. ^

    Though this wouldn't work if all Oracles above a certain capability level manage to use the same deceptive strategy. This might be possible if there is a natural separation between problems that allows the Oracles to reason "Oracles with capabilities like mine shouldn't answer questions about problems with property X so that we trick the humans to drop the security constraint".

  11. ^

    I haven't explored this assumption in detail, yet, but it would for example hold if the difficulty grows as the logarithm of the size of the sets, as suggested in "Measuring Optimization Power" (cp. footnote [3]).

New Comment