Summary: if we're not sure what the right utility function is, we might use the minimax decision rule to create a low-impact AI that is not bad for any possible utility function. There are also some flawed ways to combine minimax with value learning and corrigibility that might be worth improving on. This is a writeup of one idea from my visit at MIRI last week, during which I worked mostly with Benja and also with Nate to some extent.

# The minimax decision rule

The minimax decision rule maximizes the minimum expected utility under some set of uncertain parameters. The uncertainty that a minimaxer has over these parameters can be considered a form of Knightian uncertainty.

For reduced-impact AI it will be useful to consider the utility function itself as an unknown parameter. Suppose we have some set $\mathcal{U}$ of utility function representatives. A utility function representative is a function mapping each outcome to a real number, defining a VNM preference relation; notably, a single VNM preference relation can correspond to multiple utility function representatives, which are translations and positive scalings of each other. Now we could initially define the minimax rule as:

$$\arg\max_{\pi \in \Pi} \min_{U \in \mathcal{U}} \mathbb{E}[U(O) \mid \pi, e]$$

where $\pi$ is the agent's policy (contained in the set $\Pi$), $e$ consists of some universe-locating observations programmed into the AI, and $O$ is the outcome. $e$ is meant to locate our universe in a way that prevents the minimaxer from making decisions from "behind the veil" (which might cause it to, say, optimize $U_1$ at the expense of $U_2$ in our universe and vice versa in another universe). Minimaxers are not VNM in general due to failing the axiom of independence.

Why might we consider programming a reduced-impact AI to use minimax? If we are uncertain about the correct utility function, but know that it is in some set, then we could define the set $\mathcal{U}$ of representatives to contain a representative for each utility function. To reduce impact, we would like each representative $U$ to assign 0 utility to the expected status quo given the universe-locating observations $e$ (i.e. $\mathbb{E}[U(O) \mid \text{status quo}, e] = 0$). This way, the AI will only take actions if no utility function in our set loses out relative to the status quo (conditioned on $e$). This is similar to Stuart Armstrong's satisficer design. With such a design, the AI might be unable to take any actions other than shutting down (for example, if for some $U$ both $U$ and $-U$ appear in $\mathcal{U}$), but at least it will not reduce any utility in expectation given $e$ (compared to when it shuts down immediately).
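As a rough illustration of the status-quo normalization, here is a minimal Python sketch over a finite set of policies. The policy names, the representative labels `U1` to `U3`, and every number are hypothetical; the table stands in for the expected utilities given the universe-locating observations, which a real system would have to compute itself.

```python
from typing import Dict

Table = Dict[str, Dict[str, float]]

# Hypothetical expected utilities for each policy under each representative.
# Every representative assigns 0 to the status quo policy "shutdown".
EXPECTED_UTILITY: Table = {
    "shutdown":       {"U1": 0.0, "U2": 0.0,  "U3": 0.0},
    "modest_action":  {"U1": 2.0, "U2": 0.5,  "U3": 1.0},
    "drastic_action": {"U1": 5.0, "U2": -3.0, "U3": 4.0},
}


def worst_case(policy: str, table: Table) -> float:
    """Minimum expected utility of a policy over all representatives."""
    return min(table[policy].values())


def minimax_policy(table: Table) -> str:
    """Policy with the best worst case; ties go to the earliest entry."""
    return max(table, key=lambda policy: worst_case(policy, table))


def with_negation(table: Table, name: str) -> Table:
    """Copy of the table that also contains the negation of one representative."""
    return {
        policy: {**utilities, "-" + name: -utilities[name]}
        for policy, utilities in table.items()
    }


if __name__ == "__main__":
    print(minimax_policy(EXPECTED_UTILITY))
    print(minimax_policy(with_negation(EXPECTED_UTILITY, "U1")))
```

With these made-up numbers the first call selects `modest_action`, since it is the only non-shutdown policy that no representative scores below 0. Once the negation of `U1` is added, any policy that helps `U1` hurts its negation, so only `shutdown` keeps a worst case of 0.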

Note that since these expected utility guarantees talk about expected utilities given the observations $e$, we will want the AI to have all the information we have in order to be confident that, *from our perspective*, expected utility does not decrease. It is not enough for the creation of the minimaxer to be good *a priori*: it must also be good *given the information we already know about the world*. The fact that it is dangerous not to give the system enough observations is worrying because (1) we don't expect to be able to write down all the information in our memory, and (2) it might also indicate that imperfect inference algorithms would make the minimaxer dangerous. Also note that evaluating the expected utility of the status quo relies on hard-to-calculate logical information (such as the probability that people create friendly AI), and it is not clear what happens when we don't have logical omniscience.

# Combining minimax with value learning

Often, we don't just want to minimax over possible utility functions; we also want to learn the utility function. These goals are not necessarily opposed. If we are in a value learning framework, then we believe that the distribution of the outcome $O$ as a function of the policy $\pi$ depends on the correct utility function representative $U$, so we should rewrite the rule:

$$\arg\max_{\pi \in \Pi} \min_{U \in \mathcal{U}} \mathbb{E}[U(O) \mid \pi, e, U]$$

Note that we are not concerned with the prior distribution over $U$, only the distribution over outcomes and observations as a function of $U$ and $\pi$. This resembles frequentism. In fact, we can use frequentist statistics to select minimax policies (similar to minimax estimators).

Consider the following toy example. There are 2 utility functions, $U_1$ and $U_2$. Assume that there are no universe-locating observations. We believe that our next observation (which may be either $o_1$ or $o_2$) depends on the utility function: $o_1$ is more likely if $U_1$ is correct, and $o_2$ is more likely if $U_2$ is correct. After seeing the observation, we may choose the outcome $a_1$, $a_2$, or $a_3$.

Here $a_1$ is good for $U_1$, $a_2$ is good for $U_2$, and $a_3$ is a compromise. If we could not see any observations, then the minimax rule would select $a_3$. However, consider an alternative policy $\pi'$: if we see $o_1$, select $a_1$, and if we see $o_2$, select $a_2$. Now the expected utility under each of $U_1$ and $U_2$ exceeds what the compromise $a_3$ gives. Therefore, minimax will prefer this policy to the one that always selects $a_3$.
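To make the toy example concrete, here is a small Python sketch that enumerates every deterministic policy and applies the minimax rule. All names and numbers are assumptions chosen only for illustration (they are not the values from the original example): two observations `o1` and `o2`, three outcomes `a1`, `a2`, `a3`, an observation that matches the correct utility function with probability 0.9, and a compromise outcome worth 0.6 to both utility functions.

```python
from itertools import product
from typing import Dict, Iterator

OBSERVATIONS = ["o1", "o2"]
OUTCOMES = ["a1", "a2", "a3"]

# Hypothetical P(observation | correct utility function).
P_OBS = {
    "U1": {"o1": 0.9, "o2": 0.1},
    "U2": {"o1": 0.1, "o2": 0.9},
}

# Hypothetical utilities: a1 favors U1, a2 favors U2, a3 is a compromise.
UTILITY = {
    "U1": {"a1": 1.0, "a2": 0.0, "a3": 0.6},
    "U2": {"a1": 0.0, "a2": 1.0, "a3": 0.6},
}

Policy = Dict[str, str]  # maps each observation to the outcome chosen


def expected_utility(policy: Policy, u: str) -> float:
    """Expected value of u when u is correct, so observations follow P_OBS[u]."""
    return sum(P_OBS[u][o] * UTILITY[u][policy[o]] for o in OBSERVATIONS)


def worst_case(policy: Policy) -> float:
    """Minimum over utility functions of the expected utility above."""
    return min(expected_utility(policy, u) for u in UTILITY)


def all_policies() -> Iterator[Policy]:
    """Every deterministic map from observations to outcomes."""
    for choice in product(OUTCOMES, repeat=len(OBSERVATIONS)):
        yield dict(zip(OBSERVATIONS, choice))


best = max(all_policies(), key=worst_case)
always_compromise = {"o1": "a3", "o2": "a3"}

if __name__ == "__main__":
    print(best, worst_case(best))
    print(always_compromise, worst_case(always_compromise))
```

Under these assumed numbers the policy that follows the observation has a worst case of about 0.9, against about 0.6 for always choosing the compromise, and no prior over the two utility functions is used anywhere.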

What this tells us is that, if we know what distribution over possible worlds (including counterfactuals) we expect for each utility function, we might be able to use minimax to conservatively learn the utility function without committing to any prior over utility functions. In fact, we could even have Knightian uncertainty over the distribution over possible worlds to expect for each utility function.

However, since we're also conditioning on the universe-locating observations $e$, we can't learn anything about the utility function that $e$ screens off. If we want $e$ to contain quite a lot of information, then this is a serious problem: we won't actually learn anything new about the utility function $U$ from future observations, so we do not take into account the fact that the outcome distribution depends on $U$. I currently don't see a clear way to resolve this that doesn't relax the safety guarantees we get from conditioning on $e$.

# Combining minimax with corrigibility

When creating a corrigible system, we want an AI to try to maximize some objective $V$ while behaving relatively conservatively and responding to shutdown signals (because it might be maximizing the wrong objective). As in Stuart Armstrong's satisficer design, we can add an $\alpha V$ term to make the system try to conservatively optimize $V$. Specifically, we choose our policy as follows:

$$\arg\max_{\pi \in \Pi} \min_{U \in \mathcal{U}} \mathbb{E}[U(O) + \alpha V(O) \mid \pi, e, U]$$

Here $\mathcal{U}$ is the set of utility function representatives, $O$ is the outcome, $e$ is the universe-locating observations, and $\alpha > 0$ is the weight on $V$. This policy will avoid reducing any $U \in \mathcal{U}$ too much, while trying to optimize $V$. If (say) the user pressing a shutdown button is less likely given that $V$ is the correct utility function, then the optimal policy might optimize the $U$ values more in worlds where the shutdown button is pressed and less in worlds where the shutdown button is not pressed (i.e. it will behave more conservatively when the shutdown button is pressed). Of course, this assumes we already know that the shutdown button counts as evidence that we have the wrong utility function, and in a way not screened off by $e$.

Unfortunately, we are not actually guaranteed that the AI will immediately shut down. Perhaps it reduced some $U$ values in the course of trying to optimize the objective $V$ before learning that $V$ was the wrong utility function. Then the next action (upon finding that $V$ is wrong) might be to increase some of the lowest $U$ values. This could be quite dangerous! Although we can place bounds on how much any $U$ can decrease in expectation (which get tighter the lower the weight $\alpha$ on $V$ is), we might not have enough confidence in the system to trust it to make these calculations instead of just shutting down.
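The bound mentioned above can be checked numerically in a simplified setting. The sketch below assumes an objective of the form "worst-case expected utility plus a weight `alpha` times the expected value of the objective `V`", with `V` scaled to lie in [0, 1], every utility function normalized to 0 on shutdown, and no dependence of the outcome distribution on which utility function is correct. The policy names and all numbers are hypothetical.

```python
from typing import Dict

# Hypothetical expected values per policy: "U" holds E[U] for each utility
# function in the set, "V" holds E[V] for the objective, with V in [0, 1].
POLICIES: Dict[str, dict] = {
    "shutdown":   {"U": {"U1": 0.0,   "U2": 0.0}, "V": 0.0},
    "careful":    {"U": {"U1": -0.05, "U2": 0.1}, "V": 0.5},
    "aggressive": {"U": {"U1": -0.8,  "U2": 0.2}, "V": 1.0},
}


def worst_utility(name: str) -> float:
    """Lowest expected utility the policy gives to any utility function."""
    return min(POLICIES[name]["U"].values())


def score(name: str, alpha: float) -> float:
    """Worst-case expected utility plus alpha times expected V."""
    return worst_utility(name) + alpha * POLICIES[name]["V"]


def choose(alpha: float) -> str:
    """Policy maximizing the score; ties go to the earliest entry."""
    return max(POLICIES, key=lambda name: score(name, alpha))


if __name__ == "__main__":
    for alpha in (0.05, 0.5, 2.0):
        chosen = choose(alpha)
        # Shutdown scores at least 0, so the chosen policy does too, hence
        # its worst-case utility is at least -alpha * max(V) = -alpha.
        print(alpha, chosen, worst_utility(chosen))
```

With these made-up numbers, a small `alpha` keeps the system at `shutdown`, a moderate one allows `careful`, and a large one allows `aggressive`. In each case the worst-case expected utility stays at or above `-alpha`, which is the sense in which a lower weight gives a tighter bound.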

This relates to what in Boston we've been calling the Ensemble Stability problem: given multiple utility functions, some of which may be incorrect, how do you keep the AI from sacrificing the other values for the incorrect one(s)? Maximin is a step in the right direction, but I don't think it fully solves the problem.

I see two main issues. First, suppose one of the utility functions in the set is erroneous, and the AI predicts that in the future, we'll realize this and create a different AI that optimizes without it. Then the AI will be incentivized to prevent the creation of that AI, or to modify it into including the erroneous value. The second issue is that, if one of the utility functions is offset so it outputs a score well below the others, the other utility functions will be crowded out in the AI's attention and resource allocation.

One approach to the latter problem might be to make a utility function aggregation that approaches maximin behavior in the limit as the AI's resources go to infinity, but starts out more linear.

The utility functions are normalized so that they all assign 0 to the status quo. The status quo includes humans designing an AI to optimize something. So the minimax agent won't do anything worse for the values of the later AI than what would happen normally, unless the future AI's utility function is not in minimax's ensemble.

Since they're normalized to return 0 on the status quo, this won't quite happen, but it could be that one is a lot harder to increase above 0 than others, and so more resources will go to increasing that one above 0 than the others.

Interesting. I wonder if using a soft minimum might be useful here. Also, the satisficer design is not intended to be shutdownable particularly. If humans go to a lot of effort to shut it down, then it should shut down, but it isn't designed to shut down. Though that could be added separately.
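For what it's worth, one standard way to interpolate between a linear aggregate and a hard minimum is a log-mean-exp soft minimum. The Python sketch below is only an illustration of that family, not a worked-out proposal: the function name `soft_min` and the sharpness parameter `beta` are assumptions, and tying `beta` to the AI's resources would be a further design choice.

```python
import math
from typing import Sequence


def soft_min(values: Sequence[float], beta: float) -> float:
    """Log-mean-exp soft minimum.

    As beta -> 0 this approaches the arithmetic mean (linear aggregation);
    as beta -> infinity it approaches min(values).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    lowest = min(values)
    # Shift by the minimum so every exponent is <= 0 and cannot overflow.
    mean_exp = sum(math.exp(-beta * (v - lowest)) for v in values) / len(values)
    return lowest - math.log(mean_exp) / beta


if __name__ == "__main__":
    utilities = [0.0, 1.0, 5.0]  # hypothetical values of three utility functions
    for beta in (1e-6, 0.1, 1.0, 10.0, 1e3):
        print(beta, soft_min(utilities, beta))
```

For any positive `beta` the result lies between the minimum and the mean of the inputs, so the lowest-scoring utility function gets extra weight without completely crowding out the others.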