"You can't fetch the coffee if you're dead."—Stuart Russell, on the instrumental convergence of shutdown-avoidance
Note: This is presumably not novel, but I think it ought to be better-known. The technical tl;dr is that we can define time-inhomogeneous reward, and this provides a way of "composing" different reward functions; while this is not a way to build a shutdown button, it is a way to build a shutdown timer, which seems like a useful technique in our safety toolbox.
It's common in AI theory (and AI alignment theory) to assume that utility functions are time-homogeneous over an infinite time horizon, with exponential discounting. If we denote the concatenation of two world histories/trajectories by ⊳, the time-consistency property in this setting can be written as
This property is satisfied, for example, by the utility-function constructions in the standard Wikipedia definitions of MDP and POMDP, which are essentially
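One way to write both formulas explicitly is sketched below. The notation len(h) for the number of steps in a history, and the indexing of states and actions as s_t and a_t, are assumptions introduced here for illustration rather than the author's own symbols.

```latex
% Time-consistency under exponential discounting
% (len(h_1) = number of steps in h_1; assumed notation)
U(h_1 \rhd h_2) \;=\; U(h_1) + \gamma^{\operatorname{len}(h_1)} \, U(h_2)

% Discounted-sum utility in the style of the usual MDP definition
% (len(h) may be infinite)
U(h) \;=\; \sum_{t=0}^{\operatorname{len}(h)-1} \gamma^{t} \, R(s_t, a_t, s_{t+1})
```

Splitting the sum in the second formula at t = len(h_1) and factoring the discount out of the tail recovers the first formula, which is why the discounted-sum construction is time-consistent.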
Under such assumptions, Alex Turner's power-seeking theorems show that optimal agents for random reward functions R will systematically tend to disprefer shutting down (formalized as "transitioning into a state with no transitions out").
Exponential discounting is natural because if an agent's preferences are representable using a time-discount factor that depends only on relative time differences and not absolute time, then any non-exponential discounting form is exploitable (cf. Why Time Discounting Should Be Exponential).
However, if an agent has access to a clock, and if rewards are bounded by an integrable nonnegative function of time, the agent may be time-inhomogeneous in nearly arbitrary ways without actually exhibiting time inconsistency:
Any utility function with the above form still obeys an analogous version of our original time-consistency property that is modified to index over initial time t0:
Note that time-homogeneous utility functions are a special case in which U(t,h) = γ^t U(0,h).
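As a small numerical illustration of the indexed time-consistency property, the sketch below assumes the simplest time-inhomogeneous form: U(t0, h) is a sum of rewards R(t, s, a, s′) evaluated at absolute clock times starting from t0. The encoding of histories as lists of (state, action, next state) triples, and the names `utility`, `example_reward`, and `discounted`, are hypothetical choices made for this example.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str, str]                      # (s, a, s'); hypothetical encoding
Reward = Callable[[int, str, str, str], float]   # R(t, s, a, s')


def utility(reward: Reward, t0: int, history: Sequence[Step]) -> float:
    """U(t0, h): sum of clock-indexed rewards along h, starting at absolute time t0."""
    return sum(reward(t0 + i, s, a, s2) for i, (s, a, s2) in enumerate(history))


def example_reward(t: int, s: str, a: str, s2: str) -> float:
    """Toy reward that changes character at t = 3 (an arbitrary choice)."""
    if t < 3:
        return 1.0 if a == "work" else 0.0
    return 0.0 if s2 == "off" else -10.0


h1 = [("on", "work", "on"), ("on", "work", "on")]
h2 = [("on", "work", "on"), ("on", "halt", "off"), ("off", "halt", "off")]

# Indexed time-consistency: U(t0, h1 ⊳ h2) = U(t0, h1) + U(t0 + len(h1), h2)
lhs = utility(example_reward, 0, h1 + h2)
rhs = utility(example_reward, 0, h1) + utility(example_reward, len(h1), h2)

# Time-homogeneous special case: R(t, ...) = gamma**t * r(...)
gamma = 0.9


def discounted(t: int, s: str, a: str, s2: str) -> float:
    return gamma ** t * (1.0 if a == "work" else 0.0)


shifted = utility(discounted, 4, h1)              # U(4, h1)
rescaled = gamma ** 4 * utility(discounted, 0, h1)  # gamma^4 * U(0, h1)
```

Here `lhs` and `rhs` agree even though the reward changes abruptly at t = 3, and `shifted` matches `rescaled` up to floating-point error, which is the special case noted above.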
We define a time-bounded utility function as a dependent tuple
i.e., a family of utility functions indexed by times within a given fixed range. The intended semantics of a time-bounded utility function in (τ,R) form is:
Given two time-bounded utility functions (in the same environment), they can be concatenated into a new time-bounded utility function:
You can check that time-bounded utility functions form a monoid under ⊳, with the neutral element given by (0,∅).
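A minimal sketch of the concatenation, assuming that a time-bounded utility function (τ, R) is represented as a duration plus a reward callable indexed by time relative to its own start, and that concatenation runs the first family for τ1 steps and then the second with its clock shifted by τ1. The class name `TimeBounded` and the functions `concat` and `table` are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

Reward = Callable[[int, str, str, str], float]  # R(t, s, a, s') for 0 <= t < tau


@dataclass(frozen=True)
class TimeBounded:
    tau: int        # number of time steps covered
    reward: Reward  # only meaningful for 0 <= t < tau


def _empty_reward(t: int, s: str, a: str, s2: str) -> float:
    # The empty family has no valid time index, so it should never be called.
    raise ValueError("empty time-bounded utility function has no rewards")


NEUTRAL = TimeBounded(0, _empty_reward)  # plays the role of (0, ∅)


def concat(u1: TimeBounded, u2: TimeBounded) -> TimeBounded:
    """(tau1, R1) ⊳ (tau2, R2): use R1 for t < tau1, then R2 with a shifted clock."""
    def reward(t: int, s: str, a: str, s2: str) -> float:
        if t < u1.tau:
            return u1.reward(t, s, a, s2)
        return u2.reward(t - u1.tau, s, a, s2)
    return TimeBounded(u1.tau + u2.tau, reward)


def table(u: TimeBounded, s: str = "on") -> List[float]:
    """Tabulate the per-step reward for a fixed (s, a, s') to compare families."""
    return [u.reward(t, s, "work", s) for t in range(u.tau)]


task = TimeBounded(3, lambda t, s, a, s2: 1.0)
penalty = TimeBounded(1, lambda t, s, a, s2: 0.0 if s == "off" else -10.0)
extra = TimeBounded(2, lambda t, s, a, s2: float(t))

combined = concat(task, penalty)
```

With these toy families, `combined.tau` is 4 and `table(combined)` is `[1.0, 1.0, 1.0, -10.0]`. Associativity and the identity laws can be spot-checked by comparing `table` outputs, for instance `table(concat(concat(task, penalty), extra))` against `table(concat(task, concat(penalty, extra)))`.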
Let R1 be the reward function for a time-bounded task and τ1 be the time limit for the task, after which we want this agent to shut down. Assume that R1 also has bounded output, with per-stage reward always between a lower bound $\underline{R}_1$ and an upper bound $\hat{R}_1$. We define
We can then define τ2 to be 1 or indeed any positive integer. If an agent does not reach a shutdown state before τ1 is up, then it will realize a cost in R2 that outweighs all other rewards it could receive during the episode by a factor of C (a constant greater than 1). Therefore, optimal agents for (τ1,R1)⊳(τ2,R2) must shut down within time τ1 with probability ≥1−1/C (if the shutdown state is reachable in that time by any agent).
Suppose that the optimal policy π∗ results in a shutdown probability p<1−1/C, but there exists a policy π′ which shuts down deterministically (with probability 1). Then
which contradicts the optimality of π∗.
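To make the chain of inequalities concrete, here is a sketch under an assumed penalty. Suppose R2 pays 0 in a shutdown state and −K otherwise, with K = C·τ1·(R̂1 − R̲1) > 0. This specific form is an assumption chosen to match the "factor of C" description above, and may differ from the author's exact definition. It also assumes that π′ reaches a shutdown state within τ1 and stays there.

```latex
% Assumed penalty (not confirmed by the text):
%   K = C \tau_1 (\hat{R}_1 - \underline{R}_1) > 0

% \pi' shuts down with probability 1, so it never pays the penalty:
\mathbb{E}_{\pi'}[U] \;\ge\; \tau_1 \underline{R}_1

% \pi^* fails to shut down with probability 1 - p > 1/C,
% and then pays at least K during the R_2 phase:
\mathbb{E}_{\pi^*}[U]
  \;\le\; \tau_1 \hat{R}_1 - (1 - p)\, K
  \;<\;  \tau_1 \hat{R}_1 - \tfrac{1}{C} \cdot C \tau_1 (\hat{R}_1 - \underline{R}_1)
  \;=\;  \tau_1 \underline{R}_1
  \;\le\; \mathbb{E}_{\pi'}[U]
```

The strict inequality uses 1 − p > 1/C together with K > 0, so π′ would achieve strictly higher expected utility than π∗.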
Several years ago, MIRI's Agent Foundations group worked on how to make a reflectively stable agent with a shutdown switch, and (reportedly) gave up after failing to find a solution where the agent neither tries to manipulate the switch to not be flipped nor tries to manipulate the switch to be flipped. This definitely isn't a solution to that, but it is a reflectively stable agent (due to time-consistency) with a shutdown timer.
MIRI researchers wrote about finding "a sensible way to compose a 'shutdown utility function' with the agent's regular utility function such that which utility function the agent optimises depends on whether a switch was pressed"; what's demonstrated here is a sensible way of composing utility functions—but such that which utility function is cared-about depends on how long the agent has been running.
From a causal incentive analysis point of view, the difficulty has been removed because the "flipping of the switch" has become a deterministic event which necessarily occurs, at time τ1, regardless of the agent's behavior, so there is nothing in the environment for it to manipulate. An optimal agent with this reward structure would not want to corrupt its own clock, either, because that would cause it to act in a way that accumulates massive negative reward (according to its current utility function, when it considers whether to self-modify).
The details will vary depending on the RL algorithm, but the idea is essentially that we give Q the current time t as an input, and then we try to approximate a solution to the finite-horizon Bellman equation,
instead of the infinite-horizon Bellman equation,
The recursion grounds out at Q(τ1+τ2,s,a), which can be defined as equal to zero.
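As an illustration only, the sketch below solves the finite-horizon recursion exactly by backward induction on a two-state toy environment. Everything about the environment is assumed for the example: deterministic transitions, no discounting, the states "on" and "off" (with "off" as the shutdown state), the penalty size, and the names `solve` and `greedy_plan`. It is a tabular stand-in for the idea of giving Q the time t as an input, not a description of any particular RL algorithm.

```python
from typing import Callable, Dict, List, Tuple

STATES = ("on", "off")
ACTIONS = ("work", "halt")
TAU1, TAU2 = 3, 1                      # task phase, then shutdown-check phase
C = 2.0
PENALTY = C * TAU1 * (1.0 - 0.0)       # assumed form: C * tau1 * (R_hi - R_lo)


def step(s: str, a: str) -> str:
    """Deterministic toy dynamics; 'off' is absorbing (no transitions out)."""
    return "off" if s == "off" or a == "halt" else "on"


def reward(t: int, s: str, a: str, s2: str) -> float:
    """Time-indexed reward for the concatenation (tau1, R1) ⊳ (tau2, R2)."""
    if t < TAU1:
        return 1.0 if s == "on" and a == "work" else 0.0   # R1, within [0, 1]
    return 0.0 if s == "off" else -PENALTY                 # R2


def solve(horizon: int) -> Dict[Tuple[int, str, str], float]:
    """Backward induction: Q(t,s,a) = R(t,s,a,s') + max_a' Q(t+1,s',a')."""
    q = {(horizon, s, a): 0.0 for s in STATES for a in ACTIONS}  # grounds out at 0
    for t in range(horizon - 1, -1, -1):
        for s in STATES:
            for a in ACTIONS:
                s2 = step(s, a)
                q[(t, s, a)] = reward(t, s, a, s2) + max(
                    q[(t + 1, s2, a2)] for a2 in ACTIONS
                )
    return q


def greedy_plan(q: Dict[Tuple[int, str, str], float], s: str, steps: int):
    """Follow the greedy policy from state s for the given number of steps."""
    plan: List[str] = []
    for t in range(steps):
        a = max(ACTIONS, key=lambda act: q[(t, s, act)])
        plan.append(a)
        s = step(s, a)
    return plan, s


Q = solve(TAU1 + TAU2)
plan, final_state = greedy_plan(Q, "on", TAU1)
```

In this toy setup the greedy plan works for the first two steps and halts on the third, so the agent is in the shutdown state by the time the R2 phase begins. The time index matters: the same state "on" has a different best action at t = 1 than at t = 2.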
Time-bounded agents still seek “power” or “resources” to the extent that they are useful within the time horizon, including to avoid premature shutdown. But this is still meaningfully better than the infinite-horizon case, in which even tiny predicted probabilities of shutdown (conditional on aligned behaviour) could get multiplied by the reward attainable with the entire cosmic endowment, and thereby outweigh the task-performance deficit implied by temporarily diverting resources to taking over the world.
For example, assuming it takes at least a day to take over the world, there’s no incentive for a single time-bounded agent, trained to fold laundry with a 10-minute time horizon, to get started on taking over the world. But if it turns out the agent can ensure its near-term security within only 2 minutes, it might be worth doing (if it estimates a >20% probability of getting shut down prematurely otherwise).
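The 20% threshold can be reproduced with a back-of-the-envelope model. The sketch assumes that reward accrues uniformly per minute, that a premature shutdown forfeits the whole episode, and that the security measure removes the shutdown risk entirely. These are simplifying assumptions made here, and the function names are hypothetical.

```python
def value_without_security(p_shutdown: float, horizon_min: float) -> float:
    """Expected productive minutes if premature shutdown forfeits the episode."""
    return (1.0 - p_shutdown) * horizon_min


def value_with_security(cost_min: float, horizon_min: float) -> float:
    """Productive minutes left after spending cost_min on securing the agent."""
    return horizon_min - cost_min


def worth_securing(p_shutdown: float, cost_min: float, horizon_min: float) -> bool:
    """True when diverting time to security beats accepting the shutdown risk."""
    return value_with_security(cost_min, horizon_min) > value_without_security(
        p_shutdown, horizon_min
    )


two_minute_fix_high_risk = worth_securing(0.25, 2.0, 10.0)       # above the threshold
two_minute_fix_low_risk = worth_securing(0.15, 2.0, 10.0)        # below the threshold
world_takeover = worth_securing(0.25, 24 * 60.0, 10.0)           # takes a full day
```

Under these assumptions, spending 2 of 10 minutes pays off only when the estimated shutdown probability exceeds 2/10, and any measure that costs more than the whole horizon is never worth starting.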
If the objective being optimised within the time-bound involves imitating non-time-bounded agents, such as humans, then instrumental convergence of those other agents implies that such objectives directly encourage long-term power-seeking behaviour, even if there is no additional instrumentally convergent shutdown-avoidance introduced by reinforcement learning.
(Suggested by John Wentworth in the comments.) The environment might contain non-time-bounded agents who will offer the time-bounded agent rewards today in exchange for taking actions that further their long-term interests. This is another case in which the original objective turns out to directly reward long-term power-seeking actions, even though it might not have seemed that way at first. There might be other patterns like this (besides imitation and trade), and if you can think of more, feel free to point them out in the comments. The construction in this post does nothing to mitigate or counteract such incentives from the original objective; rather, it merely avoids systematically creating new incentives for long-term power seeking that arise as a consequence of being an infinite-horizon RL agent with almost any nontrivial objective.
Unless optimality on the outer objective is guaranteed (e.g. via exact dynamic programming), it is possible that the approximate policy found by the training process will be a mesa-optimiser which optimises in a non-time-bounded way when observations are outside the training distribution.
Perhaps this goes without saying, but a time-bounded agent will only be useful for time-bounded tasks. This approach cannot be applied directly to saving the world, even if one uses exact dynamic programming to avoid out-of-distribution mesa-optimisation (which is not possible in a model-free setting and would typically be infeasible with large perception & action spaces). Any combination of action repertoire and time horizon that would be sufficient for saving the world would also be sufficient for taking control of the world, and the usual instrumental-convergence arguments imply that taking control of the world would likely be preferred: it would be instrumentally useful to lock in the (presumably misspecified!) R1 for the rest of the time horizon, and probably do a lot of damage in the process, which would not be easily recovered after time τ1.
It is possible to design an RL setup in which optimal agents will reliably shut themselves down within a predetermined finite time horizon, without any reflective-stability or instrumental-convergence incentives to do otherwise. I have seen claims like this informally argued, but they do not seem to get much attention, e.g. here. This is a very limited kind of corrigibility; as TekhneMakre points out in the comments, it’s hardly corrigibility at all since it doesn’t involve any input from an operator post-deployment, and is perhaps better filed under “bounded optimisation.” And this does not necessarily get you very far with existential safety. But it is a straightforward positive result that deserves to be more commonly known in the alignment community. Being able to safely dispatch short-timescale subtasks with high-dimensional perception and action spaces seems like a potentially very useful ingredient in larger safety schemes which might not otherwise scale to acting in real-world environments. As is very common in contemporary alignment research, the bottleneck to making this practical (i.e., in this case, being able to use model-free RL) is now a matter of robustly addressing mesa-optimisation.
When R is defined over (s,a,s′), then we should think of trajectories/histories h as being like paths in a graph (or morphisms in a category) from s to s′, and thus always having both an initial and a final state. Then ⊳ becomes a partial operation, only defined when the final state of h1 equals the initial state of h2.
I only skimmed the post, so apologies if you addressed this problem and I missed it.
Problem: even if the AI's utility function is time-bounded, there may still be other agents in the environment whose utility functions are not time-bounded, and those agents will be willing to trade short-term resources/assistance for long-term resources/assistance. So, for instance, the 10-minute laundry-folding robot might still be incentivized to create a child AI which persists for a long time and seizes lots of resources, in order to trade those future resources to some other agent who can help fold the laundry in the next 10 minutes.
That’s true! Thanks for pointing this out; I added a subsection about it to the post. There are probably also a bunch of other cases I haven’t thought of that provide stories for how the environment directly rewards actions that go against the spirit of the shutdown criterion (besides imitation and this one, which I might call “trade”). This construction does nothing to counteract such incentives. Rather, it just avoids the way that being an infinite-horizon RL agent systematically creates new ones.
As an addendum, it seems to me that you may not necessarily need a 'long-term planner' (or 'time-unbounded agent') in the environment. A similar outcome may also be attainable if the environment contains a tiling of time-bound agents who can all trade across each other in ways such that the overall trade network implements long term power seeking.
Note: This is presumably not novel, but I think it ought to be better-known.
This indeed ought to be better-known. The real question is: why is it not better-known?
What I notice in the EA/Rationalist based alignment world is that a lot of people seem to believe in the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.
When you then ask people why they believe that, you usually get some answer 'because MIRI', and then when you ask further it turns out these people did not actually read MIRI's more technical papers, they just heard about them.
The conventional wisdom 'nobody knows how to build myopic agents' is not true for the class of all agents, as your post illustrates. In the real world, applied AI practitioners use actually existing AI technology to build myopic agents, and corrigible agents, all the time. There are plenty of alignment papers showing how to do these things for certain models of AGI too: in the comment thread here I recently posted a list.
I speculate that the conventional rationalist/EA wisdom of 'nobody knows how to do this' persists because of several factors. One of them is just how social media works, Eternal September, and People Do Not Read Math, but two more interesting and technical ones are the following:
It is popular to build analytical models of AGI where your AGI will have an infinite time horizon by definition. Inside those models, making the AGI myopic without turning it into a non-AGI is then of course logically impossible. Analytical models built out of hard math can suffer from this built-in problem, and so can analytical models built out of common-sense verbal reasoning. In the hard math model case, people often discover an easy fix. In verbal models, this usually does not happen.
You can always break an agent alignment scheme by inventing an environment for the agent that breaks the agent or the scheme. See johnswentworth's comment elsewhere in the comment section for an example of this. So it is always possible to walk away from a discussion believing that the 'real' alignment problem has not been solved.
I might be totally wrong here, but could this approach be used to train models that are more likely to be myopic (than e.g. existing RL reward functions)? I'm thinking specifically of the form of myopia that says "only care about the current epoch", which you could train for by (1) indexing epochs, (2) giving the model access to its epoch index, (3) having the reward function go negative past a certain epoch, and (4) giving the model the ability to shut down. Then you could maybe make a model that only wants to run for a few epochs and then shuts off, and maybe that helps avoid cross-epoch optimization?
Isn't this the same as the "seamless transition for reward maximizers" technique described in section 5.1 of Stuart and Xavier's 2017 paper on utility indifference methods? It is a good idea, of course, and if you independently invented it, kudos, but it seems like something that already exists.
I did explicitly disclaim against novelty, and I did invent this independently; the paper you linked is closely related, and I would like to upvote it as I think those results should also be better known, but I think the problem I solve in this post is different from (and technically easier than!) the problems solved in that paper, including in section 5. The problem solved there asks for the optimal agent to act as if it’s an infinite-horizon optimal agent for R1 (including whatever power-seeking would be instrumental for such an agent!) until the time bound causes it to switch into acting like the optimal agent for R2 (and for all that to be reflectively stable). Here, I am not asking for the optimal agent to behave as if it has a longer time horizon than it really does.
Problem: suppose the agent foresees that it won't be completely sure that a day has passed, or that it has actually shut down. Then the agent A has a strong incentive to maintain control over the world past when it shuts down, to swoop in and really shut A down if A might not have actually shut down and if there might still be time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence. (The correctness might be assumed by the shutdown problem, IDK, but it's still an overall issue.)
Another comment: this doesn't seem to say much about corrigibility, in the sense that it's not like the AI is now accepting correction from an external operator (the AI would prevent being shut down during its day of operation). There's no dependence on an external operator's choices (except that once the AI is shut down the operator can pick back up doing whatever, if they're still around). It seems more like a bounded optimization thing, like specifying how the AI can be made to not keep optimizing forever.
To the first point, I think this problem can be avoided with a much simpler assumption than that the shutdown criterion forbids all posthumous influence. Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1. (We might need a slightly stronger version of this assumption: it might need to be the case that for any action, there exists an action which has the same external effect but also causes a shutdown with probability 1.) This means that the agent doesn’t need to build itself any insurance policy to guarantee that it shuts down. I think this is not a terribly inaccurate assumption; of course, in reality, there are cosmic rays and a properly embedded and self-aware agent might deduce that none of its future actions are perfectly reliable, even though a model-free RL agent would probably never see any evidence of this (and it wouldn’t be any worse at folding the laundry for it). Even with a realistic ϵ probability of shutdown failing, if we don’t try to juice 1−1/C so high that it exceeds 1−ϵ, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from 1−ϵ to 1.
Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1.
Oops, I missed that assumption. Yeah, if there's such a policy, and it doesn't trade off against fetching the coffee, then it seems like we're good. See though here, arguing briefly that by Cromwell's rule, this policy doesn't exist. https://arbital.com/p/task_goal/
Even with a realistic ϵ probability of shutdown failing, if we don’t try to juice 1−1/C so high that it exceeds 1−ϵ, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from 1−ϵ to 1.
Hm. So this seems like you're making an additional, very non-trivial assumption, which is that the AI is constrained by costs comparable to / bigger than the costs to create a successor. If its task has already been very confidently achieved, and it has half a day left, it's not going to get senioritis, it's going to pick up whatever scraps of expected utility might be left. I wonder though if there's synergy between your proposal and the idea of expected utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying to do unbounded optimization, because unbounded optimization is harder to reliably shut down? IDK.
Yes, I think there are probably strong synergies with satisficing, perhaps lexicographically minimizing something like energy expenditure once the EU maximum is reached. I will think about this more.
A few other problems with time bounded agents.
If they are engaged in self modification/ creating successor agents, they have no reason not to create an agent that isn't time bounded.
As soon as there is any uncertainty about what time it is, then they carry on doing things, just in case their clock is wrong.
(How are you designing it? Will it spend forever searching for time travel?)
Could it be useful to have a shutdown-by-default process as follows?
This will allow trading power for safety, as you can make shorter steps forward as the agents become more dangerous, and you don't need to do everything in the first time period.
Yes—assuming that the pause interrupts any anticipatory gradient flows from the continuing agent back to the agent which is considering whether to pause.
This pattern is instantiated in the Open Agency Architecture twice: