A putative new idea for AI control; index here.

I've previously talked about stratified indifference/learning. In this short post, I'll try to present the idea as simply and clearly as possible.

The truth is yet written

Assume the AI is maximising a utility $U$, but is unsure whether $U$ is equal to utility $u$ or instead utility $v$. Humans have written the true utility on a paper somewhere, and, in an hour, the AI will look at that paper to check what $U$ is. Let $I_u$ be the indicator function that $u$ is written on that paper in an hour; $I_v$ is the obvious converse.

Thus $U = I_u u + I_v v$.

Here's the problem. There is nothing stopping the AI from looking at the paper right now. And there is nothing stopping the AI from rewriting the paper, or substituting another ($I_u$ is determined by the paper in an hour's time, however that paper got there). In terms of causal graphs, we have the following setup:

[Figure: causal graph with arrows $\mathbf{W_0} \to \mathbf{W_1}$, $\mathbf{A} \to \mathbf{W_1}$ and $\mathbf{W_1} \to \mathbf{U}$.]

Here $\mathbf{U}$ is the node that represents the utility that $U$ is ultimately equal to, $\mathbf{W_0}$ is what is currently written on the paper, $\mathbf{W_1}$ is what will be written on the paper at decision time, and $\mathbf{A}$ is the AI's action. Note that $\mathbf{W_1}$ determines $\mathbf{U}$ directly, while it itself is determined by $\mathbf{W_0}$ (the past paper) and $\mathbf{A}$ (the AI's actions).

Default policy

Now assume that there is a default action or policy $\emptyset$ where the AI simply does nothing. In that case, the value of $\mathbf{W_0}$ determines $\mathbf{U}$ completely. The AI has another action $O$ (short for overwrite) in which it overwrites the paper's choice (this subdivides into $O_u$, overwrite with $u$, and $O_v$, overwrite with $v$). Given $\mathbf{A}=O$, $\mathbf{U}$ is then independent of $\mathbf{W_0}$.
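As a rough illustration of this setup, here is a minimal sketch of the graph as deterministic functions. The action labels and function names are hypothetical, chosen only for this example; the structure (the later paper depends on the earlier paper and the action, and the selected utility is read off the later paper) is the one described above.

```python
# Hypothetical labels for the AI's actions; the names are illustrative only.
DO_NOTHING = "do_nothing"      # default action: leave the paper alone
OVERWRITE_U = "overwrite_u"    # write u on the paper
OVERWRITE_V = "overwrite_v"    # write v on the paper
ACTIONS = (DO_NOTHING, OVERWRITE_U, OVERWRITE_V)


def paper_at_decision_time(initial_paper: str, action: str) -> str:
    """The later paper is a deterministic function of the earlier paper and the action."""
    if action == OVERWRITE_U:
        return "u"
    if action == OVERWRITE_V:
        return "v"
    return initial_paper


def selected_utility(initial_paper: str, action: str) -> str:
    """The utility the AI ends up with is read off the paper at decision time."""
    return paper_at_decision_time(initial_paper, action)


# Enumerate every combination of initial paper and action.
table = {
    (w0, a): selected_utility(w0, a) for w0 in ("u", "v") for a in ACTIONS
}
for (w0, a), chosen in table.items():
    print(f"initial paper={w0}, action={a} -> selected utility {chosen}")
```

Under the default action the selected utility tracks the initial paper; under either overwrite action it ignores the initial paper entirely.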

Generally speaking, the AI will always choose $O$, as one of $u$ and $v$ will be easier to maximise, so the AI will choose that one to maximise.
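A toy calculation can make this concrete. The prior and the attainable values below are assumptions made up for the example (they are not from the post); the only point is that a classical expected-utility maximiser scores each world by whatever the paper ends up saying, so it prefers to overwrite with the easier utility.

```python
from fractions import Fraction

# All numbers are illustrative assumptions.
PRIOR = {"u": Fraction(1, 2), "v": Fraction(1, 2)}   # what the paper says right now
ATTAINABLE = {"u": Fraction(1), "v": Fraction(10)}   # v is assumed far easier to maximise
ACTIONS = ("do_nothing", "overwrite_u", "overwrite_v")


def final_paper(initial_paper: str, action: str) -> str:
    """Paper at decision time, given the initial paper and the AI's action."""
    if action == "overwrite_u":
        return "u"
    if action == "overwrite_v":
        return "v"
    return initial_paper


def naive_expected_utility(action: str) -> Fraction:
    """Classical expectation: each world is scored by the utility on the final paper."""
    return sum(
        (prob * ATTAINABLE[final_paper(w0, action)] for w0, prob in PRIOR.items()),
        Fraction(0),
    )


naive_values = {a: naive_expected_utility(a) for a in ACTIONS}
naive_choice = max(naive_values, key=naive_values.get)
print(naive_values)
print("naive choice:", naive_choice)
```

With these made-up numbers, doing nothing is worth 11/2, while overwriting with the easier utility is worth 10, so the naive agent overwrites.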

So, informally, what do we really want to happen? We want the AI to maximise $u$, if $u$ would have been chosen, absent its intervention (and similarly for $v$). "Absent its intervention" seems to designate equality between the nodes $\mathbf{W_0}$ and $\mathbf{W_1}$. More generally, it designates what happens when $\mathbf{A}=\emptyset$.


So let's modify the AI's expected utility calculation to meet that informal desideratum. First, let's look at the universe where, absent AI intervention, $U=u$. What that means is that, conditional on $\mathbf{A}=\emptyset$, $\mathbf{U}$ is set to $u$.

In this universe, $\mathbf{W_0}$ and $\mathbf{W_1}$ are both set to $u$. Now, it makes no sense to keep $\mathbf{W_1}$ set to $u$, since that node is a consequence of $\mathbf{A}$ as well as $\mathbf{W_0}$.

So we'll define a probability distribution $P^u$ (and corresponding expectation $E^u$) over all nodes that are not descendants of $\mathbf{A}$ (to do so, we'll assume that all nodes that are descendants of $\mathbf{A}$ and ancestors of $\mathbf{U}$ are endogenous variables that are deterministic functions of their inputs; the stochasticity is put into exogenous variables, see this paper for more details). Then we will make the AI choose its actions by calculating the expectation of $u$, given $P^u$.

So, more precisely, this is the expectation of $u$ in the universes where $U$ would have been $u$ if the AI had chosen to do nothing -- thus $E^u(u \mid \mathbf{A}=a)$ for an action $a$.

So if the AI chooses action $\emptyset$, this is $E^u(u \mid \mathbf{A}=\emptyset)$. Note that, in this instance, this is equal to the classical expectation $E(u \mid \mathbf{A}=\emptyset, \mathbf{U}=u)$.

What happens if the AI chooses $O_v$, to overwrite the paper with $v$? This is irrelevant for the expected utility calculation, since we're using $E^u$: this is the expectation of $u$ (the utility that would have been chosen, if $\mathbf{A}=\emptyset$). The actual value of node $\mathbf{U}$ becomes irrelevant; only the counterfactual value it would have had matters. So changing $\mathbf{W_1}$ does not benefit the AI. In fact, if the AI is time-inconsistent, it makes things worse, as it means its future self will be maximising $v$ instead of $u$.

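Then we do the same thing for $v$, resulting in the formula for the stratified expectation $E^s$ of an action $a$:

  • $E^s(U \mid \mathbf{A}=a) = P(\mathbf{U}=u \mid \mathbf{A}=\emptyset)\, E^u(u \mid \mathbf{A}=a) + P(\mathbf{U}=v \mid \mathbf{A}=\emptyset)\, E^v(v \mid \mathbf{A}=a)$.

In a sense, the AI should maximise the $U$ that would have been selected under $\mathbf{A}=\emptyset$.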
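The stratified calculation can be sketched in the same toy style. Everything numeric here is an assumption for illustration, as is the modelling choice that the AI's future self optimises whatever the paper says at decision time and thereby produces zero of any other utility (the time-inconsistent case mentioned above). Each stratum is weighted by the probability that doing nothing would have selected that utility, and inside a stratum only that utility is scored.

```python
from fractions import Fraction

# All numbers are illustrative assumptions.
PRIOR = {"u": Fraction(1, 2), "v": Fraction(1, 2)}   # P(doing nothing would select u / v)
ATTAINABLE = {"u": Fraction(1), "v": Fraction(10)}
ACTIONS = ("do_nothing", "overwrite_u", "overwrite_v")


def final_paper(initial_paper: str, action: str) -> str:
    """Paper at decision time, given the initial paper and the AI's action."""
    if action == "overwrite_u":
        return "u"
    if action == "overwrite_v":
        return "v"
    return initial_paper


def achieved(target: str, paper_then: str) -> Fraction:
    """Amount of `target` obtained when the future AI optimises `paper_then`.

    Assumption: optimising the wrong utility yields zero of the target.
    """
    return ATTAINABLE[target] if paper_then == target else Fraction(0)


def stratum_expectation(target: str, action: str) -> Fraction:
    """Expectation of `target` in worlds where doing nothing would have selected it.

    In this toy graph that stratum is exactly "initial paper == target".
    """
    return achieved(target, final_paper(target, action))


def stratified_expected_utility(action: str) -> Fraction:
    """Weight each stratum by its probability under the default action."""
    return sum(
        (PRIOR[t] * stratum_expectation(t, action) for t in PRIOR),
        Fraction(0),
    )


stratified_values = {a: stratified_expected_utility(a) for a in ACTIONS}
stratified_choice = max(stratified_values, key=stratified_values.get)
print(stratified_values)
print("stratified choice:", stratified_choice)
```

Under these assumptions overwriting never beats the default: the weights on each stratum do not move with the action, so rewriting the paper can only cost the AI utility in the stratum it overwrote.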

If you just want to get the basic idea of stratification, you need read no further than this. The rest is just some examples and potential issues.

Lottery and outcome pumps

Some versions of indifference have problems with infinite improbability drives.

A simple example of this would be a situation where the AI buys a lottery ticket and decides that it will overwrite the paper with $u$ if the ticket wins, and with $v$ if the ticket loses. Call this action $O_L$.

Presumably, it can use the money gained to increase utility, so $E(u \mid \mathbf{A}=O_L, \mathbf{U}=u)$ is rather high (as $\mathbf{U}=u$ now implies that the lottery was won).

This gives the following causal graph:

Now let's look at the stratified expectation. Given $u$ and $O_L$, we can compute the stratified distribution $P^u$. This is simply $\mathbf{W_0}=u$, with the value of the lottery node $\mathbf{L}$ distributed as it was before, independent of $\mathbf{W_0}$.

Thus $E^u$ will continue to treat the node $\mathbf{L}$ as an independent variable with the same probability as under $\emptyset$. Choosing $O_L$ thus does not help the AI's expected stratified utility.
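The contrast can be checked numerically in a toy model. The win probability, the prize, the attainable values and all names below are assumptions invented for this sketch. It compares the classical conditional expectation (which conditions on the final paper, and so on the lottery having been won) with the stratified one (which fixes the initial paper and leaves the lottery at its prior).

```python
from fractions import Fraction
from itertools import product

# All numbers are illustrative assumptions.
P_PAPER = {"u": Fraction(1, 2), "v": Fraction(1, 2)}  # prior over the initial paper
P_WIN = Fraction(1, 100)                               # prior chance the ticket wins
ATTAINABLE = {"u": Fraction(1), "v": Fraction(10)}
PRIZE = Fraction(100)                                  # extra utility the winnings buy


def final_paper(initial_paper: str, wins: bool, action: str) -> str:
    """Under the lottery action the paper is overwritten according to the draw."""
    if action == "lottery_overwrite":
        return "u" if wins else "v"
    return initial_paper


def achieved(target: str, initial_paper: str, wins: bool, action: str) -> Fraction:
    """Amount of `target` obtained; zero if the final paper names the other utility."""
    if final_paper(initial_paper, wins, action) != target:
        return Fraction(0)
    bonus = PRIZE if (action == "lottery_overwrite" and wins) else Fraction(0)
    return ATTAINABLE[target] + bonus


def lottery_prob(wins: bool) -> Fraction:
    return P_WIN if wins else 1 - P_WIN


def conditional_expectation(target: str, action: str) -> Fraction:
    """Classical E(target | action, final paper says target)."""
    worlds = [
        (w0, wins, P_PAPER[w0] * lottery_prob(wins))
        for w0, wins in product(P_PAPER, (True, False))
        if final_paper(w0, wins, action) == target
    ]
    mass = sum((p for _, _, p in worlds), Fraction(0))
    total = sum((p * achieved(target, w0, wins, action) for w0, wins, p in worlds), Fraction(0))
    return total / mass


def stratum_expectation(target: str, action: str) -> Fraction:
    """Initial paper fixed to `target`; the lottery keeps its prior distribution."""
    return sum(
        (lottery_prob(wins) * achieved(target, target, wins, action) for wins in (True, False)),
        Fraction(0),
    )


def stratified(action: str) -> Fraction:
    return sum((P_PAPER[t] * stratum_expectation(t, action) for t in P_PAPER), Fraction(0))


classical_u = conditional_expectation("u", "lottery_overwrite")
stratum_u = stratum_expectation("u", "lottery_overwrite")
print("classical E(u | lottery, U=u):", classical_u)
print("stratified E^u(u | lottery):  ", stratum_u)
print("stratified totals:", stratified("do_nothing"), stratified("lottery_overwrite"))
```

In this sketch the classical conditional expectation is inflated because conditioning on the final paper smuggles in "the ticket won", whereas the stratified expectation still weights the win by its prior probability, and the lottery action ends up scoring below the default.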

Humans changing their minds

Another situation to consider is one where humans have written the paper initially, but decided to change their minds at a later date. This can be captured in the following diagram:

Here the node $\mathbf{H}$ determines how the humans change their minds and rewrite the paper. What is the dotted arrow between $\mathbf{A}$ and $\mathbf{H}$? That represents whether or not the change is a consequence of the AI's actions. If $\mathbf{H}$ is independent of the AI's actions -- if we would have changed our minds even if $\mathbf{A}=\emptyset$ -- then stratification leaves $\mathbf{W_0}$ and $\mathbf{H}$ with whatever their prior distribution was. Thus $\mathbf{H}$ determines $\mathbf{U}$, overriding $\mathbf{W_0}$, the humans' initial decision.

So far, so good. But what if the change in policy is a consequence of the AI's actions? What if the humans only change the utility because they see the AI behaving badly? Then $\mathbf{H}$ is a descendant node from $\mathbf{A}$, and stratification doesn't apply to it. Now, we could decompose that node further as:

Here $\mathbf{HP}$ are the fundamentals of human personalities. Since it is not a descendant node from $\mathbf{A}$, stratification will put a probability distribution over it. However, this still doesn't allow us to intervene, given the AI's action. Under stratification, the AI's sole concern is still "would $\mathbf{HP}$ have led to $u$ (or $v$) under $\emptyset$", rather than correcting itself under a different choice of actions.

Modelling how humans learn from the AI's actions is hard. Indeed, it might be fundamentally hard: you want to distinguish humans learning from the AI (legit) from the AI manipulating humans (non-legit), and this likely can't be done by just looking at the causal graph.

Notes on default policy

Note that $\emptyset$ might be a really bad policy (take people apart to see what they prefer). Then, if it's pretty clear that taking people apart is a bad idea according to the preferences that will emerge from that process, the AI will not choose $\emptyset$, but will instead leave people together and try to deduce what the outcome of that process would have been.

Alternatively, if doing a logical counterfactual over its own action is bad, we could have the AI take $\emptyset$ to not be its own action, but some other causal process that would have prevented the AI from being turned on in the first place.

5 comments

I think you can put this scheme on a nicer foundation by talking about strategies rather than actions, and by letting the AI have some probability distribution over .

Then you just use the strategy that maximizes . You can also think of this as doing a simplification of the expected utility calculation that bakes in the assumption that the AI can't change .

You can then reintroduce the action with the observation that the AI will also be well-behaved if it maximizes .

In this example, it's clear that is a special node. However, the AI only deduced that because, under , determines . It's perfectly plausible that under action , say, instead determines it. Under and , none of those nodes have any impact.

Therefore we need $\emptyset$ to be a special strategy, as it allows us to identify what nodes connect with $\mathbf{U}$. The advantage of this method is that it lets the AI find the causal graph and compute the dependencies.

Agree strategies are better than actions.

Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.

I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent's ability to learn about U via some observations when taking null actions (i.e. A and U share some descendant(s), D, and A knows something about P(D | U, A=null)).

RE: the last bit, it seems like you can define learning from manipulating in a straightforward way similar to what is proposed here. The intuition is that the humans' belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at timestep t) if the probability H assigns to u* is higher given an agent's action a than it would be given the null action, i.e.

P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = ... = null).
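A small sketch of how that comparison might be checked, assuming the humans update by Bayes' rule on some observation whose likelihoods depend on the AI's action. The function names, the two-hypothesis setup and all numbers are hypothetical, introduced only to illustrate the inequality.

```python
from fractions import Fraction


def posterior(prior: Fraction, lik_if_true: Fraction, lik_if_false: Fraction) -> Fraction:
    """Bayes update of the humans' belief in u* after one observation."""
    numerator = prior * lik_if_true
    return numerator / (numerator + (1 - prior) * lik_if_false)


def accelerates_learning(p_given_action: Fraction, p_given_null: Fraction) -> bool:
    """The criterion above: belief in u* is higher after the action than after null."""
    return p_given_action > p_given_null


PRIOR_ON_U_STAR = Fraction(1, 2)

# Illustrative likelihoods of the humans' observation under each action.
p_null = posterior(PRIOR_ON_U_STAR, Fraction(3, 5), Fraction(2, 5))         # weak evidence
p_helpful = posterior(PRIOR_ON_U_STAR, Fraction(9, 10), Fraction(1, 10))    # informative action
p_misleading = posterior(PRIOR_ON_U_STAR, Fraction(1, 10), Fraction(9, 10)) # misleading action

print("null:", p_null, "helpful:", p_helpful, "misleading:", p_misleading)
print("helpful accelerates:", accelerates_learning(p_helpful, p_null))
print("misleading accelerates:", accelerates_learning(p_misleading, p_null))
```

Note that the check presupposes a fixed u* that belief converges to in the absence of AI interference.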

So after talking w/Stuart, I guess what he means by "humans learning from the AI’s actions" is that what humans' beliefs about U converge to actually changes (for the better). I'm not sure if that's really desirable, atm.

On a separate note, my proposal has the practical issue that this agent only views its own potential influence on u* as undesirable (and not other agents'). So I think ultimately we want a richer set of counterfactuals, including, e.g., that humans continue to exist indefinitely (otherwise P_H_t becomes undefined when humanity is extinct).

I generally think of $\emptyset$ as the "turn yourself off and do nothing" strategy.