Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low^{[1]} value of f(x). But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).

What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?

As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.

In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.

Edit: I've written a bit more on the motivation for this setting in a comment.

One neat trick

This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!

(Harry Potter and the Methods of Rationality, Chapter 56)

Let O be a finite set of possible outputs of the Oracle (e.g. strings of length at most l) and f:O→R be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output x such that the value of f(x) is lower than some threshold c. Let S={x∈O∣f(x)<c} be the set of satisficing outputs.^{[2]}

We are worried about using the Oracle because it might be misaligned, i.e. it might be optimizing for a different objective function f'. Therefore, we fear there is a non-empty subset D⊂S of outputs that are dangerous to use because they were highly optimized for f′. But highly optimized outputs have to be rare^{[3]} and thus we can safely assume that |D|≪|S|.

This has an important consequence: It's possible to randomly choose a subset R⊂O such that it's very likely that D∩R will be empty but S∩R will not! We can then modify our original objective function by adding the safety constraint that our output should be in R.^{[4]} The task we're now giving to the Oracle has become harder, but likely makes it impossible for the Oracle to find a dangerous satisficing output. Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in R.

Implementation

But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):

Use a randomly chosen hash function h:O→[m] with [m]={0,…,m−1} for m∈N and define R={x∈O∣h(x)=0}.^{[5]}

To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.

A set of hash functions H={h∣h:O→[m]} is called strongly universal if it maps the elements of Opairwise-independently and uniformly to [m], i.e.

∀x,y∈O,x≠y,a,b∈[m]:Ph∼H[h(x)=a∧h(y)=b]=1m2.

Here, h is sampled uniformly from H. The property of strong universality implies that a single element is mapped uniformly, i.e.

∀x∈O,a∈[m]:Ph∼H[h(x)=a]=1m.

Such a set of hash functions exists and the functions can be efficiently computed.^{[6]} The advantage of using strongly universal hash functions to specify R is that we need to generate (and store) far fewer random bits: In the case that O is the set of strings of length l, only O(l) random bits are needed instead of 2l if we were to choose R truly at random. As we will now see, if we specify R using strongly universal hash functions, R is still likely to have the desired properties:

Claim 1: If |D|≪m, then Ph∼H[D∩R≠∅]≪1.

Claim 2: If |S| is large, then Ph∼H[|S∩R|≈|S|m]≈1.

Because we have |D|≪|S|, it is possible to choose m such that |D|≪m<|S|. Then, Claim 1 is applicable and ensures that the chance that R contains a dangerous output is very small. Claim 2 says that, with high probability, the size of the set of acceptable answers S∩R will be given by |S|m, approximately. Thus, by adding the safety constraint, the number of acceptable answers decreases by a factor of m. But, because we have |S|>m, the set S∩R will likely not be empty.

The proof of both claims can be found (or safely be skipped) in the next section.

Proofs

Let's start with Claim 1. Only assuming that |D|≪m and using the union bound, we find that

For Claim 2, we want to prove an upper bound on the probability that |S∩R||S| differs a lot from 1m. Now, |S∩R| is the sum of n=|S| indicator random variables Ix for x∈S:

|S∩R|=∑x∈SIxwithIx={1if x∈R⇔h(x)=0,0else.

Each indicator variable Ix is a Bernoulli variable with p=1m, Eh∼H[Ix]=p=1m and variance^{[7]}

σ21=Varh∼H[IX]=p(1−p)=1m(1−1m).

Because we want to compare |S∩R| with |S|, it will be convenient to work with the random variable X=|S∩R||S|.

The key challenge for our ability to use the proof-of-optimization technique is choosing an appropriate value for m, which limits the difficulty of the safety constraint. It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between |D| and |S| is many, many orders of magnitude, and we therefore have a wide target to hit.

In practice we might try to gain an understanding of the size of S (the number of acceptable answers) by looking at similar problems, for example easier problems with known solutions.

In case the Oracle doesn't provide an answer with a given difficulty parameter m, we might be tempted to try again with a decreased m.^{[9]} This could be very dangerous if the Oracle deceptively withholds an answer, even though one exists, until we have decreased m enough to enable it to give a dangerous answer. We are in a strong position to avoid this problem, though, if we are able to use different Oracles (or modify the Oracles based on their performance) and compare the results.^{[10]}

FAQ / Further thoughts

Q: Do you think this is a good idea?

A: A powerful, untrustworthy Oracle would be an extremely dangerous artifact, so I'd much prefer if we never create one. Thus I consider the setup described in this post to be a measure of last resort that we'll hopefully won't have to use.

Q: Adding the safety constraint to our objective requires additional work from the Oracle to solve our desired task. Isn't that a problem?

A: Maybe not. The version of the proof-of-optimization technique given in this post is an extreme version that would still be safe with an all-powerful Oracle, because we choose m big enough to ensure that D∩R becomes actually empty, thus making the task of providing a dangerous answer impossible. If we deal with a limited Oracle instead, it might be enough to choose m such that finding an element D∩R becomes too difficult for the Oracle, even though D∩R is just small and not completely empty. Assuming that, when increasing m, the difficulty of finding an element in S∩R doesn't increase faster than the difficulty of finding an element in D∩R, it would then still be possible for the Oracle to find an element in S∩R and thereby provide us with a useful output.^{[11]}

Q: What's the relation of the proof-of-optimization technique to quantilization?

A: In the limit of setting m≈|S|, approximately one randomly chosen output will be acceptable. Thus, by using the Oracle with such a safety constraint, we are sampling uniformly from the set of satisficing answers. This approximates using a q-quantilizer with uniform base distribution, where q has been chosen such that the top percentile q corresponds to the set of satisficing answers. If we want to approximate a quantilizer with a non-uniform base distribution p, we need to introduce weights w(x)∝p(x) for each output x. We then need to modify the definition of R to take account of the weights by setting R={x∈O∣h(x)<w(x)} and m=∑x∈Sw(x).

Conclusion

In this post, I have described a technique that can be used to obtain safe satisficing outputs from an untrustworthy optimizing system. The technique is proven to be safe under the assumption that the subset of dangerous satisficing outputs D is much smaller than the set of all satisficing outputs S, and that we are able to choose a number m such that |D|≪m<|S|. There are likely many orders of magnitude of possible values of m, but how to choose m in practice is still not entirely clear.

^{^}

I'm assuming here the convention that the objective function should be minimized.

^{^}

We need to take care that the threshold c is chosen such that the largest portion of the satisficing outputs S are safe to use, i.e. our objective function should still closely capture our intentions for values close to c and not yet be subject to Goodhart's curse.

^{^}

Yudkoswky has suggested in "Measuring Optimization Power" that we can measure the power of an optimization process by comparing "the total number of states with equal or greater rank in the preference ordering to the outcome achieved" with the number of all possibles states. Correspondingly, the difficulty of an satisfaction task can be measured by comparing the number of satisficing outputs with the number of all possible outputs. I'm not convinced that this is completely correct, but it seems that this ratio gives at least an upper bound on the power of the process / difficulty of the task. In particular, this means that if we need to find an element in a small set, the task can not be very difficult, and conversely, if the task is difficult, the set has to be big.

^{^}

Let's assume, with no loss of generality, that our satisfaction threshold is 0. We incorporate the additional constraint using c:O→R with

c(x)={0if x∈R1else

and construct the new objective function g:O→R by using the ramp function r(x)=max(x,0):

g(x)=r(f(x))+c(x).

Now we have g(x)=0 iff f(x)=0∧x∈R and otherwise g(x) > 0. Thus we can ask the Oracle to minimize the new objective function g to find some x∈O that both satisfices and is in R.

^{^}

In this case we can define the new objective function directly using h instead of c as in Footnote ^{[4]}, i.e. by setting g(x)=r(f(x))+h(x).

This might be advantageous because it provides information about the hash function to the Oracle and thus makes it easier for it to do its job.

The method of repeatedly querying the Oracle while decreasing m can potentially also be used to gain information about the size of D, by deliberately asking for (and then discarding) outputs that are believed to be dangerous.

^{^}

Though this wouldn't work if all Oracles above a certain capability level manage to use the same deceptive strategy. This might be possible if there is a natural separation between problems that allows the Oracles to reason "Oracles with capabilities like mine shouldn't answer questions about problems with property X so that we trick the humans to drop the security constraint".

^{^}

I haven't explored this assumption in detail, yet, but it would for example hold if the difficulty grows as the logarithm of the size of the sets, as suggested in "Measuring Optimization Power" (cp. footnote ^{[3]}).

Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.Acknowledgements:Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg) for many helpful comments.## Introduction

Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low

^{[1]}value of f(x). But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?

As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.

In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.

Edit: I've written a bit more on the motivation for this setting in a comment.## One neat trick

Let O be a finite set of possible outputs of the Oracle (e.g. strings of length at most l) and f:O→R be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output x such that the value of f(x) is lower than some threshold c. Let S={x∈O∣f(x)<c} be the set of satisficing outputs.

^{[2]}We are worried about using the Oracle because it might be

misaligned, i.e. it might be optimizing for a different objective functionf'.Therefore, we fear there is a non-empty subset D⊂S of outputs that aredangerousto use because they were highly optimized for f′. But highly optimized outputs have to be rare^{[3]}and thus we can safely assume that |D|≪|S|.This has an important consequence: It's possible to randomly choose a subset R⊂O such that it's very likely that D∩R will be empty but S∩R will not! We can then modify our original objective function by adding the

safety constraintthat our output should be in R.^{[4]}The task we're now giving to the Oracle has become harder, but likely makes it impossible for the Oracle to find a dangerous satisficing output. Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in R.## Implementation

But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):

Use a randomly chosen hash function h:O→[m] with [m]={0,…,m−1} for m∈N and define R={x∈O∣h(x)=0}.

^{[5]}To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.

A set of hash functions H={h∣h:O→[m]} is called

∀x,y∈O,x≠y,a,b∈[m]:Ph∼H[h(x)=a∧h(y)=b]=1m2.strongly universalif it maps the elements of Opairwise-independentlyanduniformlyto [m], i.e.Here, h is sampled uniformly from H. The property of strong universality implies that a single element is mapped uniformly, i.e.

∀x∈O,a∈[m]:Ph∼H[h(x)=a]=1m.Such a set of hash functions exists and the functions can be efficiently computed.

^{[6]}The advantage of using strongly universal hash functions to specify R is that we need to generate (and store) far fewer random bits: In the case that O is the set of strings of length l, only O(l) random bits are needed instead of 2l if we were to choose R truly at random. As we will now see, if we specify R using strongly universal hash functions, R is still likely to have the desired properties:Claim 1:If |D|≪m, then Ph∼H[D∩R≠∅]≪1.Claim 2:If |S| is large, then Ph∼H[|S∩R|≈|S|m]≈1.Because we have |D|≪|S|, it is possible to choose m such that |D|≪m<|S|. Then, Claim 1 is applicable and ensures that the chance that R contains a dangerous output is very small. Claim 2 says that, with high probability, the size of the set of acceptable answers S∩R will be given by |S|m, approximately. Thus, by adding the safety constraint, the number of acceptable answers decreases by a factor of m. But, because we have |S|>m, the set S∩R will likely not be empty.

The proof of both claims can be found (or safely be skipped) in the next section.

## Proofs

Let's start with Claim 1. Only assuming that |D|≪m and using the union bound, we find that

Ph∼H[D∩R≠∅]=Ph∼H[⋁x∈Dx∈R]≤∑x∈DPh∼H[h(x)=0]=|D|m≪1.For Claim 2, we want to prove an upper bound on the probability that |S∩R||S| differs a lot from 1m. Now, |S∩R| is the sum of n=|S| indicator random variables Ix for x∈S:

|S∩R|=∑x∈SIxwithIx={1if x∈R⇔h(x)=0,0else.Each indicator variable Ix is a Bernoulli variable with p=1m, Eh∼H[Ix]=p=1m and variance

σ21=Varh∼H[IX]=p(1−p)=1m(1−1m).^{[7]}Because we want to compare |S∩R| with |S|, it will be convenient to work with the random variable X=|S∩R||S|.

We will proceed, following the proof of the weak law of large numbers, by computing mean and variance of the random variable X and then using Chebyshev's inequality.

Clearly, the expected value of X is just

μ=Eh∼H[X]=1|S|∑x∈SEh∼H[Ix]=1m.Because of the strong universality of H, the random variables Ix are pairwise-independent and their variances are additive.

σ2=Varh∼H[|X|]=1|S|2Varh∼H[∑x∈SIx]=1|S|2∑x∈Sσ21=σ21|S|.^{[8]}Therefore, we haveNow, Chebyshev's inequality tells us that for a RV X with mean μ, finite non-zero variance σ2 and k>0, we have

P[|X−μ|≥kσ]≤1k2.We set k=εσ and obtain

PError=Ph∼H[∣∣∣|S∩R||S|−1m∣∣∣≥ε]≤σ2ε2=σ21|S|ε2.Thus, with probability at least 1−PError, we will have |S∩R||S|≈1m with an error of order O(√1|S|PError).

If we only care about S∩R not being empty, we can find a upper bound on Ph∼H[S∩R=∅] by choosing ε=1m and using σ21<1m:

Ph∼H[S∩R=∅]≤Ph∼H[∣∣∣|S∩R||S|−1m∣∣∣≥1m]≤σ21m2|S|<m|S|=1Eh∼H[|S∩R|].## Choosing the difficulty of the safety constraint

The key challenge for our ability to use the proof-of-optimization technique is choosing an appropriate value for m, which limits the difficulty of the safety constraint. It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between |D| and |S| is many, many orders of magnitude, and we therefore have a wide target to hit.

In practice we might try to gain an understanding of the size of S (the number of acceptable answers) by looking at similar problems, for example easier problems with known solutions.

In case the Oracle doesn't provide an answer with a given difficulty parameter m, we might be tempted to try again with a decreased m.

^{[9]}This could be very dangerous if the Oracle deceptively withholds an answer, even though one exists, until we have decreased m enough to enable it to give a dangerous answer. We are in a strong position to avoid this problem, though, if we are able to use different Oracles (or modify the Oracles based on their performance) and compare the results.^{[10]}## FAQ / Further thoughts

^{[11]}## Conclusion

In this post, I have described a technique that can be used to obtain safe satisficing outputs from an untrustworthy optimizing system. The technique is proven to be safe under the assumption that the subset of dangerous satisficing outputs D is much smaller than the set of all satisficing outputs S, and that we are able to choose a number m such that |D|≪m<|S|. There are likely many orders of magnitude of possible values of m, but how to choose m in practice is still not entirely clear.

^{^}I'm assuming here the convention that the objective function should be minimized.

^{^}We need to take care that the threshold c is chosen such that the largest portion of the satisficing outputs S are safe to use, i.e. our objective function should still closely capture our intentions for values close to c and not yet be subject to Goodhart's curse.

^{^}Yudkoswky has suggested in "Measuring Optimization Power" that we can measure the power of an optimization process by comparing "the total number of states with equal or greater rank in the preference ordering to the outcome achieved" with the number of all possibles states. Correspondingly, the difficulty of an satisfaction task can be measured by comparing the number of satisficing outputs with the number of all possible outputs. I'm not convinced that this is completely correct, but it seems that this ratio gives at least an upper bound on the power of the process / difficulty of the task. In particular, this means that if we need to find an element in a small set, the task can not be very difficult, and conversely, if the task is difficult, the set has to be big.

^{^}Let's assume, with no loss of generality, that our satisfaction threshold is 0. We incorporate the additional constraint using c:O→R with

c(x)={0if x∈R1else

and construct the new objective function g:O→R by using the ramp function r(x)=max(x,0):

g(x)=r(f(x))+c(x).

Now we have g(x)=0 iff f(x)=0∧x∈R and otherwise g(x) > 0. Thus we can ask the Oracle to minimize the new objective function g to find some x∈O that both satisfices and is in R.

^{^}In this case we can define the new objective function directly using h instead of c as in Footnote

^{[4]}, i.e. by setting g(x)=r(f(x))+h(x).This might be advantageous because it provides information about the hash function to the Oracle and thus makes it easier for it to do its job.

^{^}See e.g. https://arxiv.org/abs/1202.4961.

^{^}https://en.wikipedia.org/wiki/Bernoulli_distribution#Variance

^{^}https://en.wikipedia.org/wiki/Bienaym%C3%A9%27s_identity

^{^}The method of repeatedly querying the Oracle while decreasing m can potentially also be used to gain information about the size of D, by deliberately asking for (and then discarding) outputs that are believed to be dangerous.

^{^}Though this wouldn't work if all Oracles above a certain capability level manage to use the same deceptive strategy. This might be possible if there is a natural separation between problems that allows the Oracles to reason "Oracles with capabilities like mine shouldn't answer questions about problems with property X so that we trick the humans to drop the security constraint".

^{^}I haven't explored this assumption in detail, yet, but it would for example hold if the difficulty grows as the logarithm of the size of the sets, as suggested in "Measuring Optimization Power" (cp. footnote

^{[3]}).