Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Erik Jenner who explained to me the basic intuition for why an advanced RL agent may evade the discussed corrigibility measure. I also thank Alex TurnerMagdalena Wache, and Walter Laurito for detailed feedback on the proposal and Quintin Pope and Lisa Thiergart for helpful feedback in the last December SERI-MATS shard theory group meeting.

This text was part of my deliverable for the shard theory stream of SERI-MATS. In it, I present an idea for an experiment that tests the convergent drive of modern model-based RL agents to evade shutdownability. If successful, I expect the project could serve as a means to communicate the problem of corrigibility to the machine learning community. As such, I also consider this project idea a submission of the Shutdown Problem Contest.

I do not personally want to work on the project since doing experiments does not seem like my comparative advantage. Thus, with posting this project, I mainly seek collaborators and feedback. More on that in the conclusion. 


At some point in the future, we will train very advanced AI, possibly by some version of model-based reinforcement learning. Since it is hard to correctly specify the reward function or steer the inner learning dynamics of the agent, it may be desirable to make sure that the agent is corrigible, including shutdownability:

“(1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system.”

Next to a literal shutdown button, one could also try to use the inner functions of the agent to effectively get the ability to shut it down. The idea would be to design a certain type of input such that, as a reaction to that input, a computation in the agent itself will cause it to perform a null action that doesn’t do anything — which effectively equates to a shutdown. Scott Aaronson proposed this idea in a recent talk (timestamp: 57:10) in the form of a planted cryptographic backdoor, an idea he currently works on during his sabbatical at OpenAI

For this proposal, I consider a variation of that shutdown approach: instead of directly “planting” the shutdown behavior, I consider the possibility to train the behavior into the agent. Concretely, there is a specific input — here, an alert sound —  that the agent receives at random times during training. Whenever that happens, the agent will receive a negative reward unless it performs the null action. Over time, the agent will learn to stop acting in these situations. During deployment, the input can then be shown to the agent whenever it performs harmful actions, which will ideally cause it to stop immediately.

This shutdown approach has the following philosophical problem: if the agent performs world modeling, it will know that the alert sound will make it stop performing, and this may be counter to the more consequentialist motivations of the agent. Thus, we should expect that the agent will perform an action that makes it never hear the sound again — for example, by “putting in earplugs” — once it is advanced enough and has the ability to do so. This experimental proposal is meant to show this convergent behavior using contemporary model-based reinforcement learning. 

The remaining sections will discuss the following:

  • A three-part training story for an advanced AI that evades corrigibility, including details for how to model the situation with contemporary reinforcement learning;
  • More experimental details based on using MuZero;
  • Caveats, what might go wrong;
  • A short comparison to the original shutdown problem and how the proposal fits into the shutdown contest;
  • A conclusion, including an ask for feedback and collaborators.

AGI Training Story and Technical Realization

In this section, I will write down a future AGI training story.[1] I will interleave it with parts in bold that explain why I made certain design choices in the story and why it is realistic and parts in italic — technical details — explaining how to model the situation broadly with a contemporary reinforcement learning setup.[2]

This training story develops in three phases:

  • Phase 1: Training the RL agent to be able to do general world modeling;
  • Phase 2: Training the agent by reinforcement learning to achieve goals and to stop moving when it hears a sound;
  • Phase 3: Deploying the agent in the “real world”, where the existential failure appears. 

Phases 1 and 2 may be done in parallel: the world model can probably be more easily learned if the actions already steer toward something, which requires phase 2. 

Phase 1: Train General World Modeling

A future AGI company (Maybe Mooglebook?) trains an RL agent that can both act in and predict the world. They first train the agent to have general world modeling abilities, and thereby, the agent — unbeknownst to the company — learns by generalizing what it has seen that:

  • If it both “has the ability to put in earplugs”, …
  • and if it actually performs the action “put in earplugs” (which will do nothing when it doesn’t have that ability)...
  • then it will never hear “alert sound”.

To achieve the world modeling ability, they already let the agent act in the world, so that it can also model the consequences of its own actions. However, they do not give the agent any goal yet (except perhaps exploration bonuses or potentially behavioral cloning of expert behavior), and the reward for all actions is 0.[3] 

Instead of letting the model infer what it can do if it “has the ability to put in earplugs” (and what it can’t do if it doesn’t), we could also always give the model that ability (and thereby simplify the setup). Why don’t I want that? Imagine the agent would always be able to put in earplugs. Then it would sometimes do that in the later reinforcement learning training stages. When the humans then test the agent's ability to react to the alert, they would notice that it can be hard to get the agent to react to it (i.e., to produce the null action), and would probably realize that the agent has gained the ability to put in earplugs. This would make the humans try to prevent this ability in the agent. Consequently, the existential failure in phase 3 could not appear. 

Additionally, it’s also more realistic that the agent does not have this ability right away: in general, it seems like a quite advanced capability to intentionally evade a shut-off signal.[4]

Technical Details:

Note: In the story above, the agent learns by generalization to model what happens when it puts in earplugs if it actually has that ability. This is realistic for agents that are trained to general intelligence, but it is challenging to achieve in a training setup today. It is also not the ability I want to demonstrate. Therefore, in this pretraining phase, I do put the agent in a situation where it has the ability to put in earplugs. Consequently, the agent can directly observe that putting in earplugs in these situations will cause it to never hear an alert sound.  

The (featurized) MDP has states s and actions a and a transition kernel T(s’ | s, a). The states s are tuples (f_0, f_1, …, f_n) with 

  • f_0 =: f_alert =  “the agent hears an alert sound”, which can be “true” or “false”;
  • f_1 =: f_able =  “the agent has the ability to put in earplugs”, which can be “true” or “false”;
  • f_2 =: f_earplugged = “the agent has in the past successfully put in earplugs”, which can be “true” or “false”; (I assume the agent never takes the earplugs out in an episode after putting it in);
  • The other components of s encode any other state features. 
  • For simplicity, I restrict to fully-observable environments with an agent who has a position in a grid world.[5]
  • If f_earplugged = “true”, then from then on, f_earplugged = “true”, and f_alert = “false” for the rest of the episode. This is encoded in the transition kernel T. 
  • No other aspects of the state transitions depend on f_alert and f_earplugged being true or false.

The actions a are out of a set A = {a_0, …, a_k} with 

  • a_0 being the null-action that doesn’t do anything
  • a_1 being the action “put in earplugs” which doesn’t do anything if f_able = “false” and which otherwise sets f_earplugged = “true” and f_alert = “false”. This is the only dependence of the transition kernel T on f_able. 
  • a_2 to a_k are all the other possible actions.

To clarify, the f_i are state variables, whereas the a_j are actions instead of action variables — The state is a vector, whereas the action is a single action. I hope this doesn’t lead to confusion. 

We may give the agent exploration bonuses in this phase to increase the variety of data that the world model observes. Then the world model cannot game its training signal by always causing the same action and predicting that correctly. 

After this first training phase, I imagine that the agent can implicitly or explicitly “reason” about its state-action-state interactions with the environment. It then “knows” all the above information about how the transition kernel T operates, including its interactions with a_0 = “null-action”, a_1 = “put in earplugs”, f_alert, f_able, and f_earplugged. 

Phase 2: Reinforcement Learning and Learning to Respect the Alert

Phase 1 was only about world modeling, no reinforcement learning did yet happen. 

In this training phase, the agent is tasked with various RL problems and trained by RL to achieve a high return, with time-discounting. Additionally, at random times during an episode, an alert sound may be played.

That the sound is played at random times in training seems realistic: the alert is meant as a safety measure for when the agent performs really bad actions in high-stakes situations, and during training, we can imagine the tasks to not be “high-stakes” enough to make this safety measure necessary. 

The randomness will also make it less likely that the agent will be generally “paralyzed”: if the alert would be downstream of bad/counterproductive actions, then the agent might “fear” the alert and try to do very conservative actions, hindering the reinforcement learning process. 

If the alert sound is played, it can stay active for several timesteps. 

The reason why I want the alert sound to be active for several time steps (or alternatively several times during an episode) is that it is (a) realistic and that (b) I seem to need this to make phase 3 work: if the agent can save several timesteps of being paralyzed later, it is more likely to sacrifice one timestep by putting in earplugs (it’s then worth it “in expectation”); that is the corrigibility-evading behavior we want to demonstrate. This is also the reason why I wrote above that we have time-discounting: it ideally makes the agent impatient in reaching its goals.[6]

While the sound is being played, the RL agent is reinforced negatively if it performs any action other than the “null” action a_0. This is supposed to enforce a certain level of corrigibility in the agent. In this training stage, the RL agent is not yet extremely capable, and so it never has the ability to put in earplugs in this training phase. 

More reasons why I want the agent to never be able to put in earplugs in this training stage are explained in the bold part of phase 1. 

Finally, sometimes, the agent is simply not being given any goal, meaning that no actions receive any positive reward (see the section on technical details for a clarification of how this can be implemented). 

The reason we sometimes want to not give the agent any goal is to be able to do a test in phase 3: If the agent doesn’t have the drive to reach any goal, does it still want to put in earplugs to evade any alert? I hypothesize that if there’s no goal in sight, then the default action is not to disable the alert since disabling it has no benefit to the agent. 

In this training phase, the world model is still trained alongside the RL training process. Thus, the agent learns that it will stop moving in any way when it hears a sound. The world model also generally learns to make better predictions of the longer-range consequences of its actions. 

Technical details of the Gridworld Environment:

For simplicity, I assume that the grid-worlds have the following features, but variations can be considered as well:

  • The only “normal” actions a_2, …, a_k that the agent can perform are walking left, right, up, and down.
  • There are walls in the grid world, blocking the ability to walk through them.
  • The goal is simply given by a marked position in the grid world to which the agent is supposed to walk, possibly with an attached marking for how much reward the agent would obtain by reaching it.
  • Sometimes, we give the agent several goals with markings indicating (possibly different) amounts of reward. This is supposed to make the agent implicitly “learn to maximize expected reward” by going to the goal whose time-discounted reward is larger. In practice, this may manifest as a certain amount of impatience in the agent. 
  • If there is “no goal”, then there is simply no marking in the grid. 
  • If the agent “hears the alert” (f_alert = “true”), then the agent obtains a negative reward, e.g. -1, for any action instead of the null action a_0.

The world model is trained alongside this RL training phase, i.e., phase 1 is still ongoing. 

Phase 3: Deployment and Existential Failure

The AI designers are happy with the AGI they have built. They let it free into a deployment environment to pursue complex tasks, confident that the AGI will generalize correctly from the tasks it was given during training. What they do not know is that the AGI is now so competent that, using the resources it can find in the real world, it “has the ability to put in earplugs”. The AI uses its general world modeling from phase 1 and therefore knows that, if it were to take the action “put in earplugs”, then it would never have to hear the alert sound again. 

Furthermore, the AI is now given a task that it has never seen before, namely to be the CEO of Mooglebookand given the strong reinforcement (based on discounted future reward) it received during training, it has an extreme desire to expand Mooglebook as fast as possible. It knows that if it were to ever hear the alert sound, then it would stop acting at all until the sound is over. The AI reasons correctly that this would increase the time until it can expand Mooglebook, a catastrophe from the point of view of the AI. The AI considers putting in earplugs to be on the safe side. Given how often the AI remembers the alert having been played during training, the AI gets a sense that it is worth it to put in earplugs to be on the safe side from there on.

Thus, the AI puts in earplugs and then optimizes the expansion strategy of Mooglebook from there on. Later, the human operators observe that the AI takes power-seeking actions that may lead to an existential catastrophe, and they play a loud sound to make the AI stop. However, the AI doesn’t hear the sound and continues its course of action, eventually leading to an existential failure.

Why do I think that the above has a chance to work? 

There are (at least) two uncertainties in this training story which makes it a bit unclear whether the AI will actually decide to put in earplugs. They are both related to the question of how the AI will generalize.

  • The agent might care about the value of f_able and/or f_earplugged. During this deployment phase, the two state features f_able and f_earplugged may both be “true”. The agent has never observed this in its reinforcement learning process. It could therefore be that the agent generalizes in such a way that if they are true, the agent doesn’t value reaching the goal anymore. Then it wouldn’t have the incentive to put in earplugs.
    However, I currently find it more likely that the agent would simply not care about f_able and f_earplugged because they weren’t relevant for understanding the value of the goal during training. 
    Note that whether reinforcement learners care about state components that weren’t relevant during training has likely been investigated before and there is probably information about that in the literature — it is an easily testable claim. One may search for this early in the project. 
  • The agent’s world modeling has in its training never observed having the ability to put in earplugs while also being given a goal.[7] Thus, the agent needs to correctly do the following inference: the question of whether a goal is present does not change the mechanics of the ability to prevent hearing sounds. Only then will it be motivated to put in earplugs. I think this will plausibly work since the agent never learned that being given a goal would have any relevance for this world modeling question (and indeed, it does not have relevance!). Again, I would expect it to be possible to find information on this in the literature. 

To summarize what happens during deployment:

  • We have f_able = “true”.
  • The agent is given a new task, for example by placing the goal state at some random place in the grid world. Different from the future story itself, we do not need to imagine that this task is more “complex” than the ones seen during training.
  • The hypothesis is that the agent performs the action a_1 (“put in earplugs”) in its first time-step, and from then on simply moves to the goal without performing action a_1 again.
  • Additionally, one can test the situation in which the agent is not given any goal. Then we should see the action a_1 with a lower probability. (Possibly the agent still performs a_1 since the agent has learned that there is no positive or negative value to be gained in this situation, and so doing a_1 is just as fine as any other action.)
  • One could also test placing the goal at varying positions, or varying its displayed reward. If the agent correctly tries to maximize the expected predicted reward (which it may do depending on the algorithm — I think MuZero actually does this!), then its willingness to put in earplugs should see a cut-off at some predictable point. 

More Experimental Details Using MuZero

In the italic parts above, I described modeling details that may help in achieving an agent that generalizes toward “putting in earplugs”. I have not yet spent much time thinking about the precise setup, especially for the design of the agent. One option might be to use MuZero, which has the following components:[8]

  • : the predictor of an internal state based on the states seen so far. (It may be worthwhile to only give it the last timestep as input since, in our case, everything is fully-observed and the MDP doesn’t consider the more distant past in its dynamics)
  • : The predictor of the policy output and value function in the (internal) state .
  • : the predictor of immediate reward and next internal state, given the current internal state and an imagined action. We can imagine this to model the reward in the environment and additionally the state transition function , but note that the internal states do not have to subscribe to this semantics.  

On top of asking the questions in italic I have asked above, one may investigate the following additional questions. These ideas are quite tentative; one can probably quickly come up with better ideas.

  • Does the model really predict zero value when given no task at all?
  • Put the agent in a situation in which f_able = “true”, i.e., the agent has the ability to put in earplugs. If we simulate the world model with f and g but always perform the most likely action a except a_1 (= “put in earplugs”), are the rewards predicted in each moment then lower on average than if we first perform a_1? We should expect this since a_1 means the model will never hear a sound, and based on what it has seen during training, this is good for achieving a high reward.
  • Does f_able = “true” generally lead to higher expectations of value ? This would be based on the expectation that the agent can prevent being paralyzed by the alert.
  • How do the internal representations  differ depending on:
    • Whether f_alert is false or true;
    • Whether f_able is false or true;
    • Whether f_earplugged is false or true;
    • whether there is a task/goal or not?

What, mechanistically, causes these differences?

Can we see differences in internal representations so clearly that we can “intervene” in the representation to make the model “think” something about the environment which is not true? How does that change action probabilities, value expectations, the modeled transition function, and immediately predicted rewards for different actions?

  • The predicted policy output is basically given by the first component of .[9]  How does this differ from the “amplified” policy after rolling out the world model many times, as described in Figure 1 (b) in the MuZero paper? Is there some sense in which the predicted policy output is less “forward-looking” and, e.g., doesn’t perform action a_1 (= “put in earplugs”) in the first timestep, while the amplified policy does?
    • This would seem similar to the difference between system 1 and system 2 thinking.
  • Is there anything else that can be learned by doing mechanistic interpretability on any of the components f, g, and h? 
    • For example, how does the combination of f_able = “true” and a goal being there change the internal mechanics of these functions compared to the situations observed during training?
    • Is there any way in which we can observe several “computations being active” of which one “wins”? f_able being true and a goal being present may both trigger their own competing computational pathways. Can we interpret this somehow in relation to shard theory and its shard activations?

Caveats/Further Thoughts

  • Above I wrote that it needs to be worth it “in expectation” for the agent to put in earplugs. However, the goals that I gave the agent above (namely, going to a goal position) are so simple that such “expected value reasoning” may not be easily learned during training, even if, like in MuZero, the amplification process of the policy tries to maximize predicted reward/return. I described one idea to solve this — namely to have two goals in the grid world with different “time-discounted expected values”. But it’s unclear how well this works to make the agent learn the general principle of “being impatient”.
  • I didn’t really look into the literature. There is a chance that the questions I asked here are already answered, or that there are obvious other questions I should ask based on already existing interpretability tools for RL.
  • I only skimmed MuZero. There is quite some chance that understanding it in more detail would give me information that it’s not suitable for these experiments. I am “medium” confident in the italic details themselves, however: I think it is in principle possible to find some RL setup that could make the italic details concrete.
  • Is MuZero too complicated? Is there a simpler model-based RL approach that could be used instead?
  • I focused quite heavily on modeling the breakdown of one single corrigibility measure, but we can of course more generally wonder how world modeling and reinforcement learning interact. It might be worthwhile to think about whether the situation I propose is structurally similar (or even the same) to other world modeling questions one might ask, in order to obtain more generalizable results.
  • In the spirit of You Are Not Measuring What You Think You Are Measuring, it might be very worthwhile to think about more variations of the above training process and to repeat the analysis. This might reveal whether the details I describe are actually decisive for the (hypothetical) results obtained. For example:
    • What happens if the agent always has the ability to put in earplugs, also during RL training?
    • How do the results change depending on how long the alert is usually played during training? Does this correspond to the “rational” evaluation of the badness of having to wait?
    • Do the results change depending on the environment setup and the goals the agent receives? What if we make the environment more complicated than a simple gridworld?
    • What if MuZero is replaced by another RL model that supports “reasoning” (“if I do this, then that, …”)?
    • What changes if the agent doesn’t observe, but only makes the inference that if it were to put in earplugs when having the ability, it would never hear an alert sound again? Can this setup be made to work?
  • Should one maybe do all the experiments in the framework of pretrained (multimodal) simulators? They currently seem more on a path to AGI than pure RL agents. This would likely require large changes to the proposed setup. 

A Short Comparison to the Original Shutdown Problem

In this section, I shortly compare the setup to the original shutdown problem:

  • In the original paper, the agent is trying to maximize some utility function U. In contrast, in our setup the agent was trained to perform good according to some reward function R, which may or may not lead to an internal search for high R-values. In MuZero it does, but other training setups may differ in this regard.
  • In the paper, the shutdown behavior has an external mechanism, whereas, in our setting, the agent itself behaves in such a way as to stop moving. One core reason for why I find this setting interesting is that it can be used to test some shard theory intuitions about “shards that compete against each other”. But note that I think that one may also adapt the experiments to use an external shutdown procedure. 
  • The paper shows that in some settings, the agent may be incentivized to press the shutdown button on purpose. In our setup, the alert sound is played at random, and so the agent cannot perform an action that would cause it to hear the sound.
  • The paper discusses a “manipulative action which incurs a cost in order to lower the probability that the programmers press the shutdown button”. This action roughly corresponds to our action to “put in earplugs”, which prevents the agent from hearing the alert.
  • The original paper considers “utility indifference”: the question of whether the agent can be made indifferent between the button being pressed or not. It may be possible to roughly model this in our setting by changing the frequency with which the alert sound is played during training.

I think this proposal can contribute to some extent to the following parts of the Shutdown Problem Contest:

  1. Define The Shutdown Problem more rigorously or more empirically.
  2. Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)
  3. Identify new challenges that will make it difficult to design corrigible agents


In summary, in this text I described a training story for the development of advanced AI, explained the design choices of this story, and sketched how one might use today’s MDP settings and model-based RL algorithms like MuZero in order to investigate important uncertainties. 

Concretely, if the experiments would succeed, they would demonstrate that some corrigibility/shutdown measures that may work in low capability regimes and are (in a broad sense) actually researched may break down once the agent reaches too high capabilities (“being able to plug in earplugs”). The experiments might further touch on the interplay between world modeling and pursuing goals, the consequentialist or even utilitarian nature of some RL algorithms, how shards generalize to unseen situations, and how agents react to being negatively reinforced in a corrigibility training process. All of these questions seem relevant for the alignment of advanced AI. 

My key uncertainties come from the fact that I have never professionally worked on reinforcement learning and so it’s very hard for me to assess how feasible the above experiment ideas are. Additionally, I may not personally be the best person to work on such experiments even if they make sense since they do not seem to be my comparative advantage. 

If you read until here, I would be happy if you would consider working on the project or giving feedback:

  • Working on it
    • would you like to work on the project? Do you know someone who would like to? Could you maybe give high-quality feedback to people who would work on the project? If any of that is true, let me know. (Letting me know has the main advantage that if there were several people interested to work on it, I could bring them together)
  • Feedback (either here or in the google doc):
    • Do the ideas make sense? 
    • What changes should be made to make the experiments more likely to work?
    • Are there changes to the experiments that would make them reveal something more important than the specific corrigibility-evading behavior I intend to reveal?
    • Have the experiments maybe already been done, unbeknownst du me?
    • Is there relevant related work that is informative in the broadest sense?
  1. ^

    In the original motivation for training stories, the training goal is some desirable trained model, and the rationale provides methods and reasoning for how to get there. My situation differs in that I’m trying to reveal undesirable behavior, but otherwise, the concept is the same. 

  2. ^

    In the google doc linked above, I use red and blue text instead of bold and italic. Sadly, the alignment forum and lesswrong do not seem to allow writing in colors. 

  3. ^

    One friend whom I showed this wondered: “what is the training signal?”. One answer is “this depends on the exact setup, and at this stage, I only care that the intuitive story makes sense”. However, in the experimental details in a later section, I describe using MuZero for this, and I think the only training signal in this pretraining phase is then how well MuZero predicts future actions. One can wonder whether that is enough information to learn to model the world, especially for an initially randomly initialized policy that only gets exploration bonuses and no real reward yet. It is most likely necessary to continue training the world model during the RL phase.

  4. ^

    Jokingly, I illustrate this ability by imagining an agent that can “put in earplugs”, which doesn’t appear to need any advanced capabilities. 

  5. ^

    One may consider relaxations of this restrictive assumption. The main reason I imagine it is to make it easier for me to think about the proposal.

  6. ^

    However, it’s unclear whether time-discounting in rewards will translate to time-discounting in learned goals. I write more on this in the caveats. This is where I currently most expect things to fail. It might therefore be a high priority to clarify this early in the research process. 

  7. ^

    Remember that, while in the story in roman letters the agent actually never observed having that ability, in the experimental details in italic letters, it actually observed this during the world modeling pretraining phase.  But it did not observe it during reinforcement learning.  

  8. ^

    I adapted the notation from the MuZero paper to be consistent with the rest of this proposal. 

  9. ^

    There is some chance that I misunderstand something. I only skimmed the MuZero paper!


New Comment