A Treacherous Turn Timeline - Children, Seed AIs and Predicting AI

by Michaël Trazzi2 min read21st May 2019No comments


Seed AIOracle AIDeception

I previously wrote about a toy model where a seed AI starts by exhibiting human-desirable behaviors, before undertaking a treacherous turn when humans are no longer a threat. In this post, I'll link this phenomenon to how children learn to lie and self-confirming predictions.


To master lying, a child should:

1. Possess the necessary cognitive abilities to lie (for instance, by being able to say words or sentences).
2. Understand that humans can (deliberately) say falsehoods about the world or their beliefs.
3. Practice lying, allowing himself/herself to be punished if caught.

If language acquisition flourishes when children are aged 15-18 months, the proportion of them who lie (about peeking in a psychology study) goes from 30% at age 2, to 50% of three-year olds, eventually reaching 80% at eight. Most importantly, they get better as they get older, going from blatant lies to pretending to be making reasonable/honest guesses.

There is therefore a gap between the moment children could (in theory) lie (18 months) and the moment they can effectively lie and use this technique to their own advantage (8 years old). During this gap, parents can correct the kid's moral values through education.

Seed AI

In the seed AI case, deceiving could mean either:

a. Not using its full capabilities. This could mean not going "all in" to perform well on a task, or ensuring that humans are not able to measure its "intelligence" with their current sensors.
b. Concealing its true (unaligned) intentions by not carrying through unaligned plans when both (i) {a supervisor is watching (cf. absent supervisor)} and (ii) {the supervisor won't be dead at the end of the plan} are true.

Before undertaking a treacherous turn, a seed AI would need to go through the following steps:

1. Notice it is being watched by a supervisor.
2. Understand that it can control whether the supervisor is watching (e.g. by obstructing the view of a camera or by killing the supervisor), and even what the supervisor is watching (for instance by obfuscating its code or by influencing what the supervisor's sensors are measuring).
3. Figure out that it could counterfactually optimize even more its objective function if this supervisor didn't exist (and therefore didn't have to satisfy its preference).
4. Plan to optimize its objective function (in the long term) by faking alignment and performing suboptimally for a while.
5. Undertake a treacherous turn when humans are no longer a threat (because reached capability threshold).

Let's see how having formalized those steps can help us think about deception.

According to the sordid stumble view, "an AI that lacks human-desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values."

In other words, an unaligned seed AI would start by being really bad at pretending it has aligned values, so humans could just correct the AI's code on the go (similarly to the "children learning to lie" case).

Using the above formalism, the sordid stumble view states that when the seed AI will be thinking about unaligned plans (steps 2. and 3.) without being smart enough to do 4. or 5., it will be really bad at lying, thus vulnerable.

The core of the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) is about how long steps 2. and 3. will take, and how obvious the seed AI's unalignment will look like during these steps.

Predicting AI

A predicting AI (special case of oracle AI) could deceive its supervisor, by:

a. Deliberately manipulating humans to let it escape (formalized by an "escape" message/action here).
b. Gradually outputting self-confirming predictions that are increasingly manipulative (as described in the last paragraph from this post).

In the second case, the argument (from Stuart Armstrong's post) goes like this:

1. After each question-answer iteration, the predicting AI output predictions that are more and more inaccurate according to how the world would look like if the predicting AI wasn't there. It gradually improves in a) predicting how humans will react to its answers (understanding the "human part" of the environment) b) convincing/manipulating humans c) predicting things that are true no matter what (because also gets to understand the "non-human" part of the environment).
2. The AI eventually converges to the fixed-point solution of this "feedback loop" interaction, which is to output self-confirming predictions. At this point, it is so manipulative that its predictions include messages such as "please build a superintelligence" or "let me escape".


  • For the children and the seed AI, there is a gap between the first moment of vulnerability (when the agent is bad at deceiving or doesn't even know that deceiving is a thing) and when the agent effectively conceals its true intentions/capabilities.
  • The main uncertainty for the seed AI is how long it will stay unaligned without effectively concealing its capabilities and intentions (after having planned a treacherous turn).
  • For the predicting AI, the manipulation/deception occurs naturally and gradually because of a predictive feedback loop, without necessiting an initial "conception of deception".


New Comment