Thanks to Rebecca Gorman for co-developing this idea

On the 26th of September 1983, Stanislav Petrov observed the early warning satellites reporting the launch of five nuclear missiles towards the Soviet Union. He decided to disobey orders and not pass the warning on to higher command; had he passed it on, the result could easily have been a nuclear war, since the Soviet nuclear posture was "launch on warning".

Now, did Petrov have free will when he decided to save the world?

Maintaining free will when knowledge increases

I don't intend to go into the subtle philosophical debate on the nature of free will. See this post for a good reductionist account. Instead, consider the following scenarios:

  1. The standard Petrov incident.
  2. The standard Petrov incident, except that it is still ongoing and Petrov hasn't reached a decision yet.
  3. The standard Petrov incident, after it was over, except that we don't yet know what his final decision was.
  4. The standard Petrov incident, except that we know that, if Petrov had had eggs that morning (instead of porridge[1]), he would have made a different decision.
  5. The same as scenario 4, except that some entity deliberately gave Petrov porridge that morning, aiming to determine his decision.
  6. The standard Petrov incident, except that a guy with a gun held Petrov hostage and forced him not to pass on the report.

There is an interesting contrast between scenarios 1, 2, and 3. Clearly, 1 and 3 only differ in our knowledge of the incident. It does not seem that Petrov's free will should depend on the degree of knowledge of some other person.

Scenarios 1 and 2 only differ in time: in one case the decision is made, in the second it is yet to be made. If we say that Petrov has free will, whatever that is, in scenario 2, then it seems that in scenario 1, we have to say that he "had" free will. So whatever our feeling on free will, it seems that knowing the outcome doesn't change whether there was free will or not.

That intuition is challenged by scenario 4. It's one thing to know that Petrov's decision was deterministic (or deterministic-stochastic if there's a true random element to it). It's another to know the specific causes of the decision.

And it's yet another thing if the specific causes have been influenced to manipulate the outcome, as in scenario 5. Again, all we have done here is add knowledge: we know the causes of Petrov's decision, and we know that his breakfast was chosen with that outcome in mind. But someone has to decide what Petrov had that morning[2]; why does it matter that it was done for a specific purpose?

Maybe this whole free will thing isn't important, after all? But it's clear in scenario 6 that something is wrong, even though Petrov has just as much free will in the philosophical sense: before, he could choose whether or not to pass on the warning; now, he can equally choose between not passing on the message and dying. This suggests that free will is something that is determined by outside features, not just internal ones. This is related to the concept of coercion and its philosophical analysis.

What free will we'd want from an AI

Scenarios 5 and 6 are problematic: call them manipulation and coercion, respectively. We might not want the AI to guarantee us free will, but we do want it to avoid manipulation and coercion.

Coercion is probably the easiest to define, and hence to avoid. We feel coercion when it is imposed on us, when our options narrow. Any reasonably aligned AI should avoid that. There remains the problem of when we don't realise that our options are narrowing - but that seems to be a case of manipulation, not coercion.

So, how do we avoid manipulation? Just giving Petrov eggs is not manipulation, if the AI doesn't know the consequences of doing so. Nor does it become manipulation if the AI suddenly learns those consequences - knowledge doesn't remove free will or cause manipulation. And, indeed, it would be foolish to try and constrain an AI by restricting its knowledge.

So it seems we must accept that:

  1. The AI will likely know ahead of time what decision we will reach in certain circumstances.
  2. The AI will also know how to influence that decision.
  3. In many circumstances, the AI will have to influence that decision, simply because it has to do certain actions (or inactions). A butler AI will have to give Petrov breakfast, or make him go hungry (which will have its own consequences), even if it knows the consequences of its own decision.

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

It will be important to define exactly what we mean by that.
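
As a very rough illustration of what such indifference could look like, here is a toy sketch: the AI knows how breakfast affects Petrov's decision, but scores its actions under a fixed reference distribution over that decision, so the predicted influence cannot tip the choice of action. The names, numbers, and scoring rule are all assumptions made for illustration, not a definition.

```python
# Toy sketch of "indifference to influence on human decisions".
# Everything here (names, numbers, scoring rule) is an illustrative
# assumption, not a definition taken from the post.

ACTIONS = ["serve_porridge", "serve_eggs"]

def predicted_decision(ai_action):
    """The AI's (known) model of how breakfast affects Petrov's decision."""
    return {
        "serve_porridge": {"report": 0.3, "dont_report": 0.7},
        "serve_eggs":     {"report": 0.8, "dont_report": 0.2},
    }[ai_action]

def utility(ai_action, decision):
    """The AI's utility depends both on its own action and on Petrov's decision."""
    breakfast_value = {"serve_porridge": 0.6, "serve_eggs": 0.5}[ai_action]
    decision_value = {"report": 1.0, "dont_report": 0.0}[decision]
    return breakfast_value + decision_value

def naive_score(ai_action):
    # Ordinary expected utility: uses the AI's influence on the decision,
    # so it picks the breakfast that steers Petrov toward "report".
    return sum(p * utility(ai_action, d)
               for d, p in predicted_decision(ai_action).items())

REFERENCE = {"report": 0.5, "dont_report": 0.5}  # fixed, action-independent

def indifferent_score(ai_action):
    # The AI still *knows* predicted_decision(ai_action), but scores actions
    # under the fixed REFERENCE distribution, so its influence on Petrov's
    # decision cannot be a reason to prefer one breakfast over another.
    return sum(p * utility(ai_action, d) for d, p in REFERENCE.items())

print(max(ACTIONS, key=naive_score))        # serve_eggs: chosen to steer Petrov
print(max(ACTIONS, key=indifferent_score))  # serve_porridge: chosen on breakfast merits
```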


  1. I have no idea what Petrov actually had for breakfast, that day or any other. ↩︎

  2. Even if Petrov himself decided what to have for breakfast, he chose among the options that were possible for him that morning. ↩︎

Comments

I was slightly confused by the beginning of the post, but by the end I was on board with the questions asked and the problems posed.

On impact measures, there's already some discussion in this comment thread, but I'll put some more thoughts about that here. My first reaction to reading the last section was to think of attainable utility: non-manipulation as preservation of attainable utility. Sitting on this idea, I'm not sure it works as a non-manipulation condition, since it lets the AI manipulate us into having what we want. There should be no risk of it changing our utility, since that would be a big change in attainable utility; but still, we might not want to be manipulated even for our own good (like some people's reactions to nudges).

Maybe there can be an alternative version of attainable utility, something like "attainable choice", which ensures that other agents (us included) are still able to make choices. Or, to put it in terms of free will, that these agents' choices are still primarily determined by internal causes, and so by them, instead of primarily determined by external causes like the AI.

We can even imagine integrating attainable utility and attainable choice together (by weighting them, for example), so that manipulation is avoided in a lot of cases, but the AI still manipulates Petrov into not reporting if not reporting saves the world (because that maintains attainable utility). So it solves the issue mentioned in this comment thread.
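
A toy sketch of that weighted combination, under the assumption that some "attainable choice" penalty can be computed at all; the penalty terms and numbers below are placeholders, not worked-out definitions:

```python
# Toy sketch of weighting an attainable-utility penalty against a hypothetical
# "attainable choice" penalty. All quantities are illustrative placeholders.

def shaped_reward(task_reward, au_penalty, ac_penalty,
                  au_weight=1.0, ac_weight=1.0):
    # task_reward: reward for the AI's nominal task (e.g. being a good butler)
    # au_penalty:  drop in others' attainable utility caused by the action
    # ac_penalty:  how much the action makes others' decisions driven by the
    #              AI rather than by their own internal reasons
    return task_reward - au_weight * au_penalty - ac_weight * ac_penalty

# Everyday case: nudging Petrov's breakfast gains little and costs attainable
# choice, so the non-manipulative action scores higher.
print(shaped_reward(task_reward=1.0, au_penalty=0.0, ac_penalty=0.0))    # serve normally: 1.0
print(shaped_reward(task_reward=1.1, au_penalty=0.0, ac_penalty=0.5))    # nudge breakfast: 0.6

# World-at-stake case: letting the report go through destroys attainable
# utility, so the manipulative action wins despite its attainable-choice cost.
print(shaped_reward(task_reward=1.0, au_penalty=100.0, ac_penalty=0.0))  # don't intervene: -99.0
print(shaped_reward(task_reward=1.0, au_penalty=0.0, ac_penalty=0.5))    # manipulate Petrov: 0.5
```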

(I have a big google doc analyzing corrigibility & manipulation from the attainable utility landscape frame; I’ll link it here when the post goes up on LW)

When do you plan on posting this? I'm interested in reading it

Ideally within the next month!

So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.

Two thoughts.

One, this seems likely to have some overlap with notions of impact and impact measures.

Two, it seems like there's no real way to eliminate manipulation in a very broad sense, because we'd expect our AI to be causally entangled with the human, so there's no action the AI could take that would not influence the human in some way. Deciding whether or not there is manipulation seems to require making a choice about what kinds of changes in the human's behavior matter, similar to the problems we face in specifying values or defining concepts.

Not Stuart, but I agree there's overlap here. Personally, I think about manipulation as when an agent's policy robustly steers the human into taking a certain kind of action, in a way that's robust to the human's counterfactual preferences. Like if I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.
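
A crude sketch of that "robust to counterfactual preferences" test, with an assumed simulation interface and a made-up shoe example; it is only meant to make the idea concrete, not to be a watertight definition:

```python
# Rough operationalization of the idea above: if the human's final action
# doesn't depend on which preferences they started with, flag the AI's policy
# as probably manipulative. The simulation interface is an assumption made
# for illustration, not an existing library.

def seems_manipulative(simulate_interaction, preference_settings):
    """simulate_interaction(prefs) -> the human's final action under the AI's policy."""
    outcomes = {simulate_interaction(prefs) for prefs in preference_settings}
    # If every counterfactual human ends up taking the same action, the
    # outcome was driven by the AI, not by the human's preferences.
    return len(outcomes) == 1

# Toy shoe-shopping example: the "pushy" policy funnels everyone into blue
# shoes; the "informative" policy lets the choice track the human's preferences.
prefs = ["likes_red", "likes_blue", "likes_cheap"]

def pushy(p):
    return "buy_blue_shoes"

def informative(p):
    return {"likes_red": "buy_red_shoes",
            "likes_blue": "buy_blue_shoes",
            "likes_cheap": "buy_cheap_shoes"}[p]

print(seems_manipulative(pushy, prefs))        # True
print(seems_manipulative(informative, prefs))  # False
```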

Hmm, I see some problems here.

By looking for manipulation on the basis of counterfactuals, you're at the mercy of your ability to find such counterfactuals, and that ability can also be manipulated, such that you notice neither the object-level counterfactuals that would make you suspect manipulation, nor the counterfactuals about your counterfactual reasoning that would make you suspect manipulation. This seems like an insufficiently robust way to detect manipulation, or even to define it, since the mechanism for detecting it can itself be manipulated into not noticing what would otherwise have been considered manipulation.

Perhaps my point is to generally express doubt that we can cleanly detect manipulation outside the context of human behavioral norms. I suspect the cognitive machinery that implements those norms is malleable enough that it can be manipulated into not noticing what it would previously have thought was manipulation. Nor is it clear this is always bad, since in some cases we might be mistaken, in some sense, about what is really manipulative - though this runs into the problem that it's not clear what it means to be mistaken about normative claims.

OK, but there's a difference between "here's a definition of manipulation that's so waterproof you couldn't break it if you optimized against it with arbitrarily large optimization power" and "here's my current best way of thinking about manipulation." I was presenting the latter, because it helps me be less confused than if I just stuck to my previous gut-level, intuitive understanding of manipulation.

Edit: Put otherwise, I was replying more to your point (1) than your point (2) in the original comment. Sorry for the ambiguity!

I agree. The important part of cases 5 & 6, where some other agent "manipulates" Petrov, is that suddenly, to us human readers, it seems like the protagonist of the story (and we do model it as a story) is the cook/kidnapper, not Petrov.

I'm fine with the AI choosing actions using a model of the world that includes me. I'm not fine with it supplanting me from my agent-shaped place in the story I tell about my life.