Summary: if we're not sure what the right utility function is, we might use the minimax decision rule to create a low-impact AI that is not bad for any possible utility function. There are also some flawed ways to combine minimax with value learning and corrigibility that might be worth improving on. This is a writeup of one idea from my visit at MIRI last week, during which I worked mostly with Benja and also with Nate to some extent.

# The minimax decision rule

The minimax decision rule maximizes the minimum expected utility under some set of uncertain parameters. The uncertainty that a minimaxer has over these parameters can be considered a form of Knightian uncertainty.

For reduced-impact AI it will be useful to consider the utility function itself as an unknown parameter. Suppose we have some set $\mathcal{R}$ of utility function representatives. A utility function representative is a function mapping each outcome to a real number, defining a VNM preference relation; notably, a single VNM preference relation can correspond to multiple utility function representatives, which are translations and positive scalings of each other. Now we could initially define the minimax rule as:

$$\arg\max_{\pi \in \Pi} \min_{R \in \mathcal{R}} \mathbb{E}[R(O) \mid \pi, x]$$

where $\pi$ is the agent's policy (contained in the set $\Pi$), $x$ consists of some universe-locating observations programmed into the AI, and $O$ is the outcome. $x$ is meant to locate our universe in a way that prevents the minimaxer from making decisions from "behind the veil" (which might cause it to, say, optimize $R_1$ at the expense of $R_2$ in our universe and vice versa in another universe). Minimaxers are not VNM in general because they fail the axiom of independence.
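As a minimal sketch of the rule above, assume finite sets of policies and representatives and assume the expected utilities (given the universe-locating observations) have already been computed. The policy names, representative names, numbers, and the helpers `worst_case` and `minimax_policy` are all hypothetical and chosen only for illustration.

```python
# Hypothetical table: expected_utility[policy][representative] stands in for
# the expected utility of the outcome under that policy, as judged by that
# utility function representative, given the universe-locating observations.
expected_utility = {
    "shutdown": {"R1": 0.0, "R2": 0.0},
    "help_R1": {"R1": 2.0, "R2": -1.0},
    "help_both": {"R1": 0.5, "R2": 0.25},
}


def worst_case(table, policy):
    """Minimum expected utility of `policy` over all representatives."""
    return min(table[policy].values())


def minimax_policy(table):
    """Policy whose worst-case expected utility is largest."""
    return max(table, key=lambda policy: worst_case(table, policy))


best = minimax_policy(expected_utility)
print(best, worst_case(expected_utility, best))
```

In this made-up table, `help_R1` has the highest expected utility for one representative but a negative worst case, so the rule passes over it in favor of the policy that is mildly good for both.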

Why might we consider programming a reduced-impact AI to use minimax? If we are uncertain about the correct utility function, but know that it is in some set, then we could define the set of representatives $\mathcal{R}$ to contain a representative for each utility function in that set. To reduce impact, we would like each representative $R$ to assign 0 utility to the expected status quo given the universe-locating observations $x$ (i.e. $\mathbb{E}[R(O) \mid \pi_0, x] = 0$, where $\pi_0$ is the policy of shutting down immediately). This way, the AI will only take actions if no utility function in our set loses out relative to the status quo (conditioned on $x$). This is similar to Stuart Armstrong's satisficer design. With such a design, the AI might be unable to take any actions other than shutting down (for example, if for some $R$ both $R$ and $-R$ appear in $\mathcal{R}$), but at least it will not reduce any utility in expectation given $x$ (compared to when it shuts down immediately).
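The status-quo normalization can be sketched in the same finite setting. This is an illustration under assumptions, not the post's own construction: the raw expected utilities, the policy names, and the helpers `normalize` and `safe_policies` are hypothetical. The example includes a representative together with its negation to show how that pair rules out any policy that changes that utility in expectation.

```python
# Hypothetical raw expected utilities, raw[policy][representative].
# "neg_R" is the negation of "R"; "S" is an unrelated representative.
raw = {
    "shutdown": {"R": 3.0, "neg_R": -3.0, "S": 1.0},
    "act": {"R": 5.0, "neg_R": -5.0, "S": 4.0},
    "idle_variant": {"R": 3.0, "neg_R": -3.0, "S": 2.0},
}


def normalize(table, status_quo):
    """Translate each representative so the status-quo policy scores 0."""
    base = table[status_quo]
    return {
        policy: {rep: value - base[rep] for rep, value in row.items()}
        for policy, row in table.items()
    }


def safe_policies(normalized):
    """Policies under which no representative loses relative to the status quo."""
    return [p for p, row in normalized.items() if min(row.values()) >= 0]


normalized = normalize(raw, "shutdown")
print(safe_policies(normalized))
```

Here `act` raises `R` in expectation and therefore lowers `neg_R`, so it is excluded; only policies that leave the expectation of `R` unchanged survive.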

Note that since these expected utility guarantees talk about expected utilities given the observations $x$, we will want the AI to have all the information we have in order to be confident that, from our perspective, expected utility does not decrease. It is not enough for the creation of the minimaxer to be good a priori: it must also be good given the information we already know about the world. The fact that it is dangerous not to give the system enough observations is worrying because (1) we don't expect to be able to write down all the information in our memory, and (2) it might also indicate that imperfect inference algorithms would make the minimaxer dangerous. Also note that evaluating the expected utility of the status quo relies on hard-to-calculate logical information (such as the probability that people create friendly AI), and it is not clear what happens when we don't have logical omniscience.

# Combining minimax with value learning

Often, we don't just want to minimax over possible utility functions; we also want to learn the utility function. These goals are not necessarily opposed. If we are in a value learning framework, then we believe that the distribution of the outcome $O$ as a function of the policy $\pi$ depends on the correct utility function representative $R$, so we should rewrite the rule to condition on $R$ as well:

$$\arg\max_{\pi \in \Pi} \min_{R \in \mathcal{R}} \mathbb{E}[R(O) \mid \pi, x, R]$$

where, as before, $x$ denotes the universe-locating observations. Note that we are not concerned with the prior distribution over $R$, only the distribution over outcomes and observations as a function of $R$ and $\pi$. This resembles frequentism. In fact, we can use frequentist statistics to select minimax policies (similar to minimax estimators).

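Consider the following toy example. There are 2 utility functions, $R_1$ and $R_2$. Assume that there are no universe-locating observations. We believe that our next observation (which may be either $y_1$ or $y_2$) depends on the utility function: $y_1$ is more likely if $R_1$ is correct, and $y_2$ is more likely if $R_2$ is correct. After seeing the observation, we may choose the outcome $o_1$, $o_2$, or $o_3$.

The utilities are such that $o_1$ is good for $R_1$, $o_2$ is good for $R_2$, and $o_3$ is a compromise. If we could not see any observations, then the minimax rule would select $o_3$. However, consider an alternative policy $\pi'$: if we see $y_1$, select $o_1$, and if we see $y_2$, select $o_2$. Whichever utility function is correct, this policy usually selects the outcome that is good for it, so its worst-case expected utility exceeds that of always selecting $o_3$. Therefore, minimax will prefer this policy to the one that always selects $o_3$.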
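To make the toy example concrete, here is a small sketch that enumerates every deterministic policy mapping an observation to an outcome. All numbers are hypothetical stand-ins, not the values from the original example: each utility function's "own" observation is assumed to occur with probability 0.8, the favored outcome is worth 1, the disfavored outcome 0, and the compromise 0.6. The names `R1`, `R2`, `y1`, `y2`, `o1`, `o2`, `o3` are likewise illustrative labels.

```python
from itertools import product

OBSERVATIONS = ["y1", "y2"]
OUTCOMES = ["o1", "o2", "o3"]

# Hypothetical P(observation | correct utility function).
P_OBS = {
    "R1": {"y1": 0.8, "y2": 0.2},
    "R2": {"y1": 0.2, "y2": 0.8},
}
# Hypothetical utilities: o1 favors R1, o2 favors R2, o3 is a compromise.
UTILITY = {
    "R1": {"o1": 1.0, "o2": 0.0, "o3": 0.6},
    "R2": {"o1": 0.0, "o2": 1.0, "o3": 0.6},
}


def expected_utility(policy, rep):
    """Expected utility of `policy` if `rep` is the correct utility function."""
    return sum(P_OBS[rep][y] * UTILITY[rep][policy[y]] for y in OBSERVATIONS)


def worst_case(policy):
    """Minimum over utility functions of the conditional expected utility."""
    return min(expected_utility(policy, rep) for rep in UTILITY)


# Every deterministic map from observation to outcome (3 * 3 = 9 policies).
policies = [
    dict(zip(OBSERVATIONS, choice))
    for choice in product(OUTCOMES, repeat=len(OBSERVATIONS))
]
best = max(policies, key=worst_case)
always_o3 = {"y1": "o3", "y2": "o3"}
print(best, worst_case(best), worst_case(always_o3))
```

With these assumed numbers, the observation-following policy has a worst case of about 0.8, compared with about 0.6 for always choosing the compromise. The sketch only searches deterministic policies; in general a minimax rule may also want to consider randomized ones.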

What this tells us is that, if we know what distribution over possible worlds (including counterfactuals) we expect for each utility function, we might be able to use minimax to conservatively learn the utility function without committing to any prior over utility functions. In fact, we could even have Knightian uncertainty over the distribution over possible worlds to expect for each utility function.

However, since we're also conditioning on the universe-locating observations $x$, we can't learn anything about the utility function that $x$ screens off. If we want $x$ to contain quite a lot of information, then this is a serious problem: we won't actually learn anything new about the correct utility function $R$ from future observations, so we do not take into account the fact that the outcome distribution depends on $R$. I currently don't see a clear way to resolve this that doesn't relax the safety guarantees we get from conditioning on $x$.

# Combining minimax with corrigibility

When creating a corrigible system, we want an AI to try to maximize some objective $U$ while behaving relatively conservatively and responding to shutdown signals (because it might be maximizing the wrong objective). As in Stuart Armstrong's satisficer design, we can add an $\epsilon U$ term to make the system try to conservatively optimize $U$. Specifically, we choose our policy as follows:

$$\arg\max_{\pi \in \Pi} \min_{R \in \mathcal{R}} \mathbb{E}[R(O) + \epsilon U(O) \mid \pi, x, R]$$

This policy will avoid reducing any $R$ too much, while trying to optimize $U$. If (say) the user pressing a shutdown button is less likely given that $U$ is the correct utility function, then the optimal policy might optimize the candidate utility functions $R$ more in worlds where the shutdown button is pressed and less in worlds where the shutdown button is not pressed (i.e. it will behave more conservatively when the shutdown button is pressed). Of course, this assumes we already know that the shutdown button counts as evidence that we have the wrong utility function, and in a way not screened off by the universe-locating observations $x$.
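The effect of the small weight on the objective can be sketched with a finite table. This is an assumption-laden illustration: the score used here is the worst case, over candidate utility functions, of "candidate's expected utility plus a weight `epsilon` times the objective's expected utility", and the policies, numbers, and helper `choose_policy` are hypothetical. The shutdown button itself is not modeled.

```python
# Hypothetical expected values, indexed as table[policy][candidate].
# E_R: expected utility according to each candidate utility function,
#      normalized so that shutting down scores 0.
# E_U: expected value of the objective being (conservatively) optimized.
E_R = {
    "shutdown": {"R1": 0.0, "R2": 0.0},
    "cautious": {"R1": 0.0, "R2": -0.01},
    "aggressive": {"R1": 0.5, "R2": -2.0},
}
E_U = {
    "shutdown": {"R1": 0.0, "R2": 0.0},
    "cautious": {"R1": 0.5, "R2": 0.5},
    "aggressive": {"R1": 1.0, "R2": 1.0},
}


def choose_policy(epsilon):
    """Maximize the worst case of E_R + epsilon * E_U over candidates."""
    def score(policy):
        return min(
            E_R[policy][rep] + epsilon * E_U[policy][rep]
            for rep in E_R[policy]
        )
    return max(E_R, key=score)


for epsilon in (0.01, 0.1, 10.0):
    print(epsilon, choose_policy(epsilon))
```

In this made-up table, a very small weight leaves shutting down as the best option, a moderate weight permits the cautious policy that costs one candidate very little, and a large weight lets the objective dominate and selects the aggressive policy.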

Unfortunately, we are not actually guaranteed that the AI will immediately shut down. Perhaps it reduced some of the candidate utility functions' values $R$ in the course of trying to optimize the objective $U$ before learning that $U$ was the wrong utility function. Then the next action (upon finding that $U$ is wrong) might be to increase some of the lowest $R$ values. This could be quite dangerous! Although we can place bounds on how much any $R$ can decrease in expectation (which get tighter the lower the weight $\epsilon$ on the objective is), we might not have enough confidence in the system to trust it to make these calculations instead of just shutting down.
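One way to see where such a bound could come from is the following sketch, which rests on assumptions not stated above: the objective $U$ takes values in $[0, 1]$; the chosen policy $\pi^*$ maximizes $\min_{R} \mathbb{E}[R(O) + \epsilon U(O) \mid \pi, x, R]$ over a policy set that includes the immediate-shutdown policy $\pi_0$; and every candidate $R$ is normalized so that $\mathbb{E}[R(O) \mid \pi_0, x, R] = 0$. Then for every candidate $R$:

$$
\begin{aligned}
\mathbb{E}[R(O) + \epsilon U(O) \mid \pi^*, x, R]
&\ge \min_{R'} \mathbb{E}[R'(O) + \epsilon U(O) \mid \pi^*, x, R'] \\
&\ge \min_{R'} \mathbb{E}[R'(O) + \epsilon U(O) \mid \pi_0, x, R'] \\
&= \epsilon \min_{R'} \mathbb{E}[U(O) \mid \pi_0, x, R'] \\
&\ge 0.
\end{aligned}
$$

Rearranging gives $\mathbb{E}[R(O) \mid \pi^*, x, R] \ge -\epsilon\, \mathbb{E}[U(O) \mid \pi^*, x, R] \ge -\epsilon$, so under these assumptions no candidate loses more than $\epsilon$ in expectation relative to immediate shutdown, and the bound tightens as $\epsilon$ shrinks. It constrains only the expectation taken before acting, which is why it does not by itself rule out the mid-course behavior described above.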