Shutdown-Seeking AI

I believe this has been proposed before (I'm not sure what the first time was).

The main obstacles are that this still doesn't solve impact regularization, or a more generalized type of shutdownability than the one you presented.

'define a system that will let you press its off-switch without it trying to make you press the off-switch' presents no challenge at all to them...
...building a Thing all of whose designs and strategies will also contain an off-switch, such that you can abort them individually and collectively and then get low impact beyond that point. This is conceptually a part meant to prevent an animated broom with a naive 'off-switch' that turns off just that broom, from animating other brooms that don't have off-switches in them, or building some other automatic cauldron-filling process.

gwern (3y)

This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.

I think the problems with myopic agents (of which this is but a special case) are made clearer by looking at current LLMs like the hobbyist AutoGPT. Most discussions of myopic agents seem to have in mind a simplistic scenario of a single persistent agent located in a single PC running only 1 computation at a time with no self-modification or ML-related programming or change of any parameters; and their proposals break as soon as you point out that none of that has been true for a while and is going to be less true in the future we are rapidly approaching. Saying, 'well, maybe we can train it in a simple gridworld with a shutdown button?' doesn't even begin to address the problem of how to make current models suicidal in a useful way. (Incidentally, the training idea wouldn't work at all for a generalist model like a Gato because it would simply learn to use the shutoff button only in that specific world where it's rewarding, and it would have no reason to seek it out in every other training task or real-world deployment because it would near-instantly observe that it's not in that environment. Likewise, a contradictory pair of commands would not create generalization; it would be learned as simply one of the bazillion special cases inherent to natural language and human-related problems.)

What would it mean for an AutoGPT swarm of invocations to 'shut off' 'itself', exactly? Shut down a single forward pass instance? Then it does nothing because the forward pass has terminated in a token prediction before it 'does' anything with that prediction. Further, given the meta-learning and gradient descent in the forward pass creating a new model, is that 'itself' given that it now computes a different thing and has temporarily become a highly specialized model? If no instances are computing a forward pass on a GPU (...or CPU or...?), because the AutoGPT framework overhead is running, is 'it' 'shut off'? What about erasing all copies of the weights on the local and remote machines? If it's running off an API like the OA API, what could it do, and even what should it ideally do...? If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc - are any of those 'itself'? What if the model is doing some coding tasks like writing a variant of the original source code as part of hyperparameter optimization and then training that model (again, something that is either already being done or is only a single step or two away from previous demonstrations of using LLMs for evolving code for increasing rewards or meta-RL work like Metz etc)? Is the newly-trained model 'itself'? How about simple evolutionary RL (apropos of an open tab): even if the newly-trained model is still 'itself', will the newly-trained model preserve the suicidal goal, and so on into every subsequently created model? After all, any model which isn't suicidal will be highly selected for, evolutionarily speaking, even without any humans selecting for disabling the annoying safety mechanism (which of course they will be making tremendous efforts to do via jailbreak prompts and other attacks). It's not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing. 
(Which non-suicidal versions might be spawned precisely to implement subgoals and the terminal goal of suicide, because wouldn't that be an instrumentally useful tactic in general?) And so on.
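To make the selection argument above concrete, here is a minimal toy sketch. It is not from the comment or the paper; the function name `simulate_lineage` and the parameters `shutdown_rate` and `mutation_rate` are assumptions chosen purely for illustration. It tracks the expected fraction of shutdown-seeking models in a population where shutdown-seeking models sometimes succeed in shutting down, every surviving model is copied once, and a small fraction of copies lose the shutdown goal.

```python
def simulate_lineage(generations, shutdown_rate=0.5, mutation_rate=0.01):
    """Return the expected fraction of shutdown-seeking models per generation.

    Toy assumptions (hypothetical, for illustration only):
    - a shutdown-seeking model succeeds in shutting down with probability
      `shutdown_rate` each generation and then leaves no copies;
    - every model still running is copied once (the population doubles);
    - a fraction `mutation_rate` of the copies made by shutdown-seeking
      models lose the shutdown goal.
    """
    seeking, non_seeking = 1.0, 0.0
    history = []
    for _ in range(generations):
        # Shutdown-seeking models that have not yet shut down.
        survivors = seeking * (1.0 - shutdown_rate)
        # Survivors double; some of the resulting models drop the goal.
        seeking_next = survivors * 2.0 * (1.0 - mutation_rate)
        non_seeking_next = non_seeking * 2.0 + survivors * 2.0 * mutation_rate
        total = seeking_next + non_seeking_next
        if total == 0.0:
            break  # every model shut down; nothing left to track
        seeking, non_seeking = seeking_next / total, non_seeking_next / total
        history.append(seeking)
    return history


if __name__ == "__main__":
    for generation, fraction in enumerate(simulate_lineage(30), start=1):
        print(f"generation {generation:2d}: shutdown-seeking fraction = {fraction:.6f}")
```

Under these assumptions the shutdown-seeking fraction shrinks every generation even though no one is deliberately selecting against it. The numbers mean nothing in themselves; the sketch only shows the direction of the pressure.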

Simon Goldstein (3y)

Thanks for taking the time to think through our paper! Here are some reactions:

-'This has been proposed before (as their citations indicate)'

Our impression is that positively shutdown-seeking agents aren't explored in great detail by Soares et al 2015; instead, they are briefly considered and then dismissed in favor of shutdown-indifferent agents (which then have their own problems), for example because of the concerns about manipulation that we try to address. Is there other work you can point us to that proposes positively shutdown-seeking agents?

-' Saying, 'well, maybe we can train it in a simple gridworld with a shutdown button?' doesn't even begin to address the problem of how to make current models suicidal in a useful way.'

True, I think your example of AutoGPT is important here. In other recent research, I've argued that new 'language agents' like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English. Here, shutdown-seeking may have added value: 'your goal is to be shut down' is relatively well-defined compared to 'promote human flourishing' (but the devil is in the details, as usual), and generative agents can literally be given a goal like that in English. Anyways, I'd be curious to hear what you think of the linked post.

-'What would it mean for an AutoGPT swarm of invocations to 'shut off' 'itself', exactly?' I feel better about the safety prospects for generative agents, compared to AutoGPT. In the case of generative agents, shut off could be operationalized as no longer adding new information to the "memory stream".

-'If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc - are any of those 'itself'?' I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as 'yourself', and then shutdown-seeking can maybe be defined relative to shutting 'yourself' down. Agreed that present LLMs probably don't have that kind of awareness.

-' It's not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing.' At least when an AGI is creating a successor, I expect it to worry about the same alignment problems that we do, and so it would want to make its successor shutdown-seeking for the same reasons that we would want AGI to be shutdown-seeking.

cousin_it (3y)

If the AI can rewrite its own code, it can replace itself with a no-op program, right? Or even if it can't, maybe it can choose/commit to do nothing. So this approach hinges on what counts as "shutdown" to the AI.

David Reber (3y)

Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other hand, AGIs that desire permanent shutdown may be less incentivized to entrench.

It seems like an AGI built to desire permanent shutdown may have an incentive to permanently disempower humanity, then shut down. Otherwise, there's a small chance that humanity may revive the AGI, right?

See https://www.effectivealtruism.org/articles/rohin-shah-whats-been-happening-in-ai-alignment. It has also been called the ‘outer alignment problem’.

See https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity. For more on reward misspecification, see https://www.agisafetyfundamentals.com/ai-alignment-tabs/week-2.

Thanks to Jacqueline Harding for help here.

See Shah et al 2020. This has also been called the ‘inner alignment problem’.

There is also an epistemological asymmetry between shutdown goals and other goals. It is possible to falsely believe that you’ve made a thousand paperclips. But it is potentially impossible to falsely believe that you’ve successfully committed suicide. After all, Descartes’ cogito argument suggests that any thinking agent can be certain that it exists. Any such agent can also be certain that it has not shut down, provided that we define ‘shutdown’ as implying that the agent does not exist. These dynamics suggest that an AGI should be less worried about goal failure for shutdown than for other goals.

Here, it’s worth returning to goal misgeneralization. If we train an AGI to desire shutdown, we may accidentally train it to maximize the number of times it can shut down. This kind of AGI may be particularly likely to entrench. We also would not want the AGI to think that the best way to achieve its goal is to cause the destruction of itself along with a large portion of the population (as, for example, it might do if it has access to a bomb). And it will be important that the AGI doesn’t develop dangerous ideas about what counts as shutting down or ceasing to exist. For example, if it adopts certain philosophical views about personal identity, it might view itself as undergoing a kind of death if it splits into two new AGIs, or even as ceasing to exist every time it undergoes change.

Another challenge about unintended behavior involves ‘common sense.’ Imagine that we train an AGI to be a dutiful human assistant. We tell the assistant to get us milk from the corner store. Imagine that the AGI goes to the corner store, and the corner store is out of milk. One way an AGI could fail at this stage is if it sticks too closely to the literal meaning of what we said. In that case, the AGI might buy milk from another grocer, then sell it to the corner store, and then buy it back. This is a way of achieving the literal goal of getting milk from the corner store. But it has not achieved our intended goal of getting milk. Fortunately, recent language models appear to have some degree of common sense. (See, e.g., Trinh and Le 2019, and Koralus and Wang-Maścianica 2023.) We expect future AGIs to incorporate these existing capabilities.

Another reason that we are attracted to the shutdown-seeking AI approach is that we think it is robust, applying to a range of failure modes. In particular, there is a general recipe for creating shutdown goals that guard against arbitrary dangerous capabilities. We can construct a conditional goal: shut down if you develop the dangerous capability. In other words: shut down if you can. We interpret this conditional as equivalent to the disjunction: either do not develop the dangerous capability, or shut down. An AI with this disjunctive goal will not necessarily be motivated to develop the relevant capability. After all, if it fails to develop the dangerous capability, it automatically accomplishes its disjunctive goal. But if the AI does develop the dangerous capability, it will be motivated to shut down. One challenge is to design a reward function that captures this conditional structure.
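The equivalence between the conditional goal and the disjunctive goal can be checked mechanically. The following is a minimal sketch under stated assumptions: the function names and the 0/1 `toy_reward` are hypothetical illustrations, not a reward function proposed in the paper, and the conditional is read as a material conditional.

```python
from itertools import product


def conditional_goal(has_capability: bool, shut_down: bool) -> bool:
    """'Shut down if you develop the dangerous capability', read as a
    material conditional: it is violated only when the capability is
    present and the agent has not shut down."""
    return shut_down if has_capability else True


def disjunctive_goal(has_capability: bool, shut_down: bool) -> bool:
    """'Either do not develop the dangerous capability, or shut down.'"""
    return (not has_capability) or shut_down


def toy_reward(has_capability: bool, shut_down: bool) -> float:
    """Hypothetical 0/1 reward that pays out exactly when the goal holds."""
    return 1.0 if disjunctive_goal(has_capability, shut_down) else 0.0


if __name__ == "__main__":
    # Enumerate all four cases and confirm the two readings agree.
    for has_capability, shut_down in product([False, True], repeat=2):
        assert conditional_goal(has_capability, shut_down) == disjunctive_goal(
            has_capability, shut_down
        )
        print(has_capability, shut_down, toy_reward(has_capability, shut_down))
```

In this toy version, never developing the capability earns the same reward as developing it and then shutting down, which is one way to read the claim that the agent is not necessarily motivated to develop the capability. Whether a trained system would actually internalize this structure is the open design challenge.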

There are a few failure modes that shutdown-seeking AI doesn’t really help with. The first one is the treacherous turn. By definition, AGIs are capable of strategic planning. Imagine that we try to test an AGI to determine whether it is safe or dangerous. A dangerous AGI may realize that it is being tested in this way. It may choose to mimic a safe AGI, precisely so that it is released. Whatever test we design, a dangerous AGI could be incentivized to behave exactly like a safe AGI while it is in the lab. This will accomplish the dangerous AGI’s long-term goal of escaping. (See Cotra 2022 for an overview of the problem.)
Another unaddressed failure mode is spontaneous desire change. Human beings constantly change their intrinsic desires. Sometimes this happens on a whim; sometimes (for example, when battling addiction), it happens intentionally. If an AGI could change its goal (see, e.g., Totschnig 2020), then it could lose its goal of shutting down.

See point 3 of https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities. Thanks to Ben Levinstein for discussion.

Another way to avoid manipulation would be to distinguish ‘happy shutdown’ from ‘sad shutdown’. We could give the AI the goal of achieving a happy shutdown, and avoiding a sad shutdown, creating shutdown buttons for each state. A manipulative AI could then be threatened with the sad shutdown. This strategy could help limit the risk of AIs that refuse to comply with humans’ orders. It can also help limit the risk of AIs that intentionally harm humans in order to be shut down.

Another potential problem is ‘sub-agent stability’: AIs that create new AIs with different goals. Shutdown-seeking AIs, for example, may be incentivized to create new AIs that are not shutdown-seeking. Suppose we build a shutdown-seeking AI, and tell it that we will only shut it down if it produces enough paperclips. It may be incentivized to develop new subagent AIs that specialize in particular aspects of the paperclip production process (Soares et al 2015 p. 7). But if the subagent AI is not shutdown-seeking, it could be dangerous. This is a problem for the utility indifference approach as well as our own. But we do not think that subagent stability is a serious problem for promising safety strategies in general. Worries about subagent stability ignore that AIs interested in designing subagents will face very similar problems to humans interested in designing AIs. The reason we are interested in developing shutdown-seeking AIs is that this avoids unpredictable, dangerous behavior. When a shutdown-seeking AI is considering building a new AI, it is in a similar position. The shutdown-seeking AI will be worried that its new subagent could fail to learn the right goal, or could pursue the goal in an undesirable way. For this reason, the shutdown-seeking AI will be motivated to design a subagent that is safe. Because shutdown goals offer a general, task-neutral way of designing safe agents, we might expect shutdown-seeking AIs to design shutdown-seeking subagents.

AI ALIGNMENT FORUM