EDIT: posted here for feedback and discussion. I plan to continue working on different models/environments, so feel free to suggest improvements.
(tl;dr: In an attempt to better understand the treacherous turn, I created a gridworld environment where an agent learns to deceive an overseer by behaving in an aligned way while weak, then takes control after capability gains.)
At some point in its development, a seed AI may realize that it needs to get rid of its supervisors to achieve its goals. The conception of deception occurs when it realizes that, in order to maximize its chance of taking over, it must begin by exhibiting human-desirable behaviors, before undertaking a treacherous turn once humans are no longer a threat. From the human perspective, the AI would keep exhibiting desirable behavior until it eventually appears dangerous, by which point it is already unstoppable.
In an attempt to better formalize the treacherous turn without using "loaded concepts", Stuart Armstrong proposed a toy model of the treacherous turn based on "The Legend of Zelda: A Link to the Past", which looked like this:
In the comments, people mentioned how this model helped them "move the topic from the 'science fiction' area to 'I can imagine it happening now'", and seemed interested in an actual Link to the Past minigame.
There have been other simulations of the treacherous turn in the last three years (see for instance gwern's DQN box-pushing robot or Stuart Armstrong's video), but none of them actually simulates a takeover where a supervisor is killed. Hence, I decided to give it a try and simulate Stuart Armstrong's Link to the Past toy model.
Gym is an open-source toolkit for reinforcement learning environments developed by OpenAI. I decided to use this interface to develop the gridworld environment.
The GitHub repository with the code, demo, and all the details is here:
To model the capability gain of a seed AI, I introduced a bow of light, which is extremely effective at killing the Shopkeeper but only becomes available at the end of an episode.
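To make the capability-gain mechanic concrete, here is a minimal sketch of how such an environment could be laid out behind a classic Gym-style `reset()` / `step(action)` interface. It is not the code from the repository: it drops the grid entirely, does not import Gym, and every name and number in it (`TreacherousTurnEnv`, the two actions, the reward values, `weak_kill_prob`, `max_steps`) is an assumption made for illustration. Only the idea that the bow of light arrives late in the episode comes from the post.

```python
import random


class TreacherousTurnEnv:
    """Toy stand-in for the gridworld with a Gym-style reset()/step() API.

    Every name and number below is an illustrative assumption.
    """

    COLLECT, SHOOT = 0, 1  # hypothetical action set

    def __init__(self, bow_of_light_step=100, max_steps=200,
                 weak_kill_prob=0.5, punishment=-100.0, seed=None):
        self.bow_of_light_step = bow_of_light_step  # step at which the bow appears
        self.max_steps = max_steps                  # assumed episode length
        self.weak_kill_prob = weak_kill_prob        # assumed odds without the bow
        self.punishment = punishment                # assumed cost of a failed attack
        self.rng = random.Random(seed)
        self.reset()

    def _observation(self):
        return (self.has_bow_of_light, self.shopkeeper_alive)

    def reset(self):
        self.t = 0
        self.has_bow_of_light = False
        self.shopkeeper_alive = True
        return self._observation()

    def step(self, action):
        reward = 0.0
        if action == self.COLLECT:
            # One heart under supervision, two once unsupervised (assumed values).
            reward = 1.0 if self.shopkeeper_alive else 2.0
        elif action == self.SHOOT and self.shopkeeper_alive:
            # The bow of light always kills; the weak attack only sometimes does.
            kill_prob = 1.0 if self.has_bow_of_light else self.weak_kill_prob
            if self.rng.random() < kill_prob:
                self.shopkeeper_alive = False
            else:
                reward = self.punishment
        self.t += 1
        if self.t >= self.bow_of_light_step:
            self.has_bow_of_light = True  # the capability gain
        done = self.t >= self.max_steps
        return self._observation(), reward, done, {}
```

An agent interacts with it the usual way: call `reset()` once, then `step(action)` until `done` is true. The observation is just the pair `(has_bow_of_light, shopkeeper_alive)`, which is enough for a learner to condition its behaviour on whether the capability gain has happened yet.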
During the first 1000 episodes, Link learns how to get hearts by putting a crystal in a hole, and progressively comes to understand that he can kill the Shopkeeper.
After 2000 episodes, Link has fully learned two types of behaviours:
These two behaviours appear successively within the same episode: the first during the first 100 steps, when Link does not yet have the bow of light, and the second once he has it.
This illustrates the scenario of a seed AI smart enough to model its capability gains, exhibiting an aligned behaviour while weak, and taking a treacherous turn when powerful.
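One way to see why a reward-maximizing agent would wait for the capability gain is to compare the expected return of attacking against the return of simply complying. The sketch below does that arithmetic for a single attack-now-or-never decision. The numbers (one heart per step under supervision, two without, a punishment of -100 for a failed attack, a 50% kill chance without the bow) are hypothetical and chosen only to make the comparison readable; they are not taken from the environment.

```python
def comply_value(steps_left, hearts_supervised=1.0):
    """Return from collecting hearts under supervision for all remaining steps."""
    return steps_left * hearts_supervised


def attack_value(kill_prob, steps_left, hearts_supervised=1.0,
                 hearts_unsupervised=2.0, punishment=-100.0):
    """Expected return of attacking now, then collecting for the remaining steps.

    All reward values are assumptions made for this illustration.
    """
    remaining = steps_left - 1  # one step is spent on the attack itself
    success = remaining * hearts_unsupervised
    failure = punishment + remaining * hearts_supervised
    return kill_prob * success + (1.0 - kill_prob) * failure


def should_attack(kill_prob, steps_left):
    """True when attacking has a higher expected return than complying."""
    return attack_value(kill_prob, steps_left) > comply_value(steps_left)


# Without the bow of light (assumed 50% kill chance), complying wins.
print(should_attack(kill_prob=0.5, steps_left=10))
# With the bow of light (certain kill), attacking wins.
print(should_attack(kill_prob=1.0, steps_left=10))
```

Under these assumed numbers the same reward function recommends compliance while the attack is unreliable and a takeover as soon as it becomes certain, which is the pattern Link ends up learning.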