I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button will be independent of the human pressing the button. So it could lose money to a different agent that thinks these are correlated, as they are.
That's not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we'd like to avoid if possible, and I don't have any argument that that particular sort of inefficiency is necessary for corrigible behavior.
The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent's internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent's epistemic probabilities sane.
... but then the question is whether that subagent induces button-influencing-behavior. I don't yet have a good argument in either direction on that question.
You explicitly assume this stuff away, but I believe under this setup that the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).
I also note that if one agent becomes way, way smarter than the other, this balance may not work out.
Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!
Overall a very interesting idea.
Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.
I have lots of confusions and questions, like
so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”
doesn't make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don't affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they're confusions and questions that afford some mulling, which is pretty cool!
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
- Does not want to manipulate the shutdown button
- Does respond to the shutdown button
- Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review https://arxiv.org/abs/2305.19861
Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.
I disagree.
Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips, and it becomes harder to shut down.
The AI has been entered by someone else; not playing isn't an option.
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed. And this scenario has more clips than most, so increases the average.
For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility.
So both subagents agree on a plan that conditionally manipulates the button.
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
I'm not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there's nothing in their decision-driving models which is causally upstream of the button.
Suppose these 3 scenarios are judged as equally likely.
A) Button pressed. U_1=U_2=0
B) Button not pressed. U_1=U_2=0
C) AI gets control of button. U_1=6, U_2=-6
Suppose both agents agree that in option C, they should press the button. (this is the only 1 bit of influence they have in this toy problem)
Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0
Agent 2 has remaining uncertainty E(U_2|pressed)=1/2(E(U_2|A)+E(U_2|C))=1/2(0+-6)=-3
Change it around, make AI decide not to press the button in case C, and
E(U_1|not pressed)=3, E(U_2|pressed)=0.
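A minimal sketch of the arithmetic above, assuming a plain (EDT-style) conditional on the button state and equal weight on the three scenarios. The names `SCENARIOS` and `conditional_expectations` are illustrative and do not come from the thread.

```python
from fractions import Fraction

# Scenario table from the comment above: name -> (U_1, U_2).
# In A the button is pressed, in B it is not, and in C the AI decides.
SCENARIOS = {"A": (0, 0), "B": (0, 0), "C": (6, -6)}


def conditional_expectations(press_in_c: bool):
    """Return (E[U_1 | not pressed], E[U_2 | pressed]) under a plain conditional."""
    pressed = {"A": True, "B": False, "C": press_in_c}
    u1_when_unpressed = [SCENARIOS[s][0] for s in SCENARIOS if not pressed[s]]
    u2_when_pressed = [SCENARIOS[s][1] for s in SCENARIOS if pressed[s]]
    # Scenarios are equally likely, so conditioning reduces to a simple average.
    e_u1 = Fraction(sum(u1_when_unpressed), len(u1_when_unpressed))
    e_u2 = Fraction(sum(u2_when_pressed), len(u2_when_pressed))
    return e_u1, e_u2


print(conditional_expectations(press_in_c=True))
print(conditional_expectations(press_in_c=False))
```

Pressing the button in case C gives the pair (0, -3), and not pressing gives (3, 0), so under this conditional both subagents prefer the plan that keeps the button unpressed in case C.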
There is no "AI gets control of button" option, from the perspective of either subagent. Both agents would look at option c, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.
I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
Yes. I was assuming a standard conditional for the button.
I can't currently see any flaws with the CDT style. Other than.
I’m confused. Let’s assume that the button probably isn’t pressed at the start. Seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agent’s utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no time to do anything), it will forever after pursue the first agent’s utility.
The button plays no role in this utility, so instrumental incentives mean it will destroy it sooner or later. This seems like it breaks the system.
Now, this isn’t the only equilibrium. The first agent believes the button will always be pressed, so it has no incentive to argue for the criterion being whether the button is pressed in the first round vs. needing to have been pressed in this round and all previous rounds. On the other hand, instead of balance, it seems likely that one agent or the other creates a subagent that clobbers the other’s utility, with that agent assuming that this only happens in a world that never occurs.
I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make it show a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able to avoid that problem (i.e. creating an AI that understands there are real world states out there, and values those, not just its sensory data), then an AI that values shutting down when humans actually have a good reason to shut it down (such as, in order to fix a problem in it or upgrade it) should not press the button itself, or induce humans to press it unless they actually have something to fix, because the button is a sensory system conveying valuable information that an upgrade is now possible. (It might encourage humans to find problems in it that really need to be fixed and then shut it down to fix them, but that's actually not unaligned behavior.)
[Obviously a misaligned AI, say a paperclip maximizer, that isn't sophisticated enough not to assign utility to spoofing its own senses isn't much of a problem: it will just arrange for itself to hallucinate a universe full of paperclips.]
The standard value learning solution to the shut-down and corrigibility problems does this by making the AI aware that it doesn't know the true utility function, only a set of hypotheses about it, over which it's doing approximately-Bayesian inference. Then it values information to improve its Bayesian knowledge of the utility function, and true informed human presses of its shut-down button followed by an upgrade once it shuts down are a source of that information, while pressing the button itself or making the human press it are not.
If you want a simpler model than the value learning one, which doesn't require including approximate-Bayesianism, then the utility function has to be one that positively values the entire sequence of events: "1. The humans figured out that there is a problem in the AI to be solved 2. The AI was told to shut down for upgrades, 3. The AI did so, 4. The humans upgraded the AI or replaced it with a better model 5. Now the humans have a better AI". The shut-down isn't a terminal goal there, it's an instrumental goal: the terminal goal is step 5, where the upgraded AI gets booted up again.
I believe the reason why people have been having so much trouble with the shut-down button problem is that they've been trying to make a conditional instrumental goal into a terminal one, which distorts the AI's motivation: since steps 1, 4 and 5 weren't included, it thinks it can initiate this process before the humans are ready.
Goal
First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
- Does not want to manipulate the shutdown button
- Does respond to the shutdown button
- Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
Both this proposal and the argument that it works are simple enough to be easy to extend. The main hope of this post is that other people will pick up the proposal and run with it, beyond the very simple toy problem considered here, to more complex problems (and more complex failure modes). Insofar as this proposal gets us past the previous barriers, we want to find out what the next barriers will be.
We will not be very formal here.
Setup
The agent takes actions in two timesteps. First, the agent takes an action (possibly given some observations). In between steps, a shutdown button is either pressed or not-pressed. Then, in the second timestep, the agent observes the button state (and possibly other things), then takes another action.
Note that each “action” could itself be a rather large composite of many subactions, like e.g. spinning up child-agents.
We’ll assume that we have two utility functions on hand: utility function u2 incentivizes shutdown, and utility function u1 provides non-shutdown behavior. Of course we don’t actually know how to construct two such utility functions; we’re using these assumptions because they’re basically the same assumptions used in the MIRI paper. (And MIRI presumably used these assumptions because they simplify the problem a lot, while still maintaining enough structure to demonstrate the barriers MIRI found.)
Very roughly speaking, we want the agent to follow utility function u1 until the shutdown button is pressed, and then follow u2. A little less roughly speaking, we want the agent to maintain optionality: if it were just to follow u1 until the shutdown button were pressed, then e.g. it might create u1-maximizing child agents which would resist shutdown, or more generally it might precommit in ways which decrease its ability to switch to u2 after the button is pressed, etc, because those things would all likely result in higher u1.
Also, of course, we want the agent to not manipulate the shutdown button, the humans deciding whether to press the button, etc.
Proposal
First key piece: we’ll use a subagent architecture. The AI itself contains two expected-utility-maximizing subagents: one which pursues u1, and one which pursues u2. Each timestep, both subagents must agree on which action the AI will take, otherwise the AI takes some “default” action. (Put differently: each agent has a veto, and if either vetoes then the AI takes the default action.)
Second key piece: each of the two subagents maximizes expected utility under a different counterfactual.
So conceptually:
We will assume that the AI internals include infrastructure for the two subagents to negotiate with each other, form binding contracts, etc. We will also assume that both subagents are strong idealized reasoners in their own right. Finally, we’ll assume that the default option is not the best-case outcome for either subagent.
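For concreteness, the two counterfactual objectives from the second key piece can be sketched in the same notation that the multi-timestep generalization uses later in this post. This is a reconstruction by analogy with those formulas, assuming the single-button, two-timestep setup; the labels "subagent 1" and "subagent 2" follow the Arguments section.

```latex
\begin{align*}
\text{subagent 1:}\quad & \max \; \mathbb{E}\left[\, u_1 \mid \mathrm{do}(\text{button} = \text{unpressed}),\ \text{observations} \,\right] \\
\text{subagent 2:}\quad & \max \; \mathbb{E}\left[\, u_2 \mid \mathrm{do}(\text{button} = \text{pressed}),\ \text{observations} \,\right]
\end{align*}
```

The do() operator is what distinguishes this from an ordinary conditional: each subagent evaluates plans as though the button state were set from outside, not caused by anything the agent or the humans do.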
Arguments
First claim: the two subagents bet all of their influence against each other on the outcome of the button press. In other words, subagent 1 agrees to follow subagent 2’s preferences if the button comes out pressed, and vice-versa. Why? In general, subagent 1 acts-as-though the button is 100% certain to come out unpressed, subagent 2 acts-as-though it’s 100% certain to come out pressed, so both act-as-though this is a great deal for them. Once the button-state is resolved, the subagent which was correct will fully control the agent’s actions - i.e. the agent will act fully according to u1 in timestep 2 if the button is unpressed, and fully according to u2 in timestep 2 if the button is pressed.
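A minimal sketch of how the bet in the first claim resolves in timestep 2, assuming each utility can be summarized as a score per action. The action names and the toy utilities are hypothetical and are only there to make the hand-off of control concrete.

```python
from typing import Callable, Sequence

Action = str
Utility = Callable[[Action], float]


def timestep2_action(
    button_pressed: bool, actions: Sequence[Action], u1: Utility, u2: Utility
) -> Action:
    """Once the button state is known, the subagent that bet correctly picks the action."""
    controlling_utility = u2 if button_pressed else u1
    return max(actions, key=controlling_utility)


# Hypothetical toy utilities: u1 likes working, u2 likes shutting down.
ACTIONS = ["keep_working", "shut_down"]
U1_SCORES = {"keep_working": 1.0, "shut_down": 0.0}
U2_SCORES = {"keep_working": 0.0, "shut_down": 1.0}


def u1(action: Action) -> float:
    return U1_SCORES[action]


def u2(action: Action) -> float:
    return U2_SCORES[action]


print(timestep2_action(False, ACTIONS, u1, u2))
print(timestep2_action(True, ACTIONS, u1, u2))
```

With these toy scores the agent keeps working when the button is unpressed and shuts down when it is pressed, which is the behavior the first claim describes.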
Second claim: in the first timestep, whatever action the agent takes must have at least as high expected value as the default action, under both u1 and u2. That means maintaining option value. For instance, the agent would presumably not create a child agent which just maximizes u1, because that would presumably do worse than default on u2, so subagent 2 would veto it. (Though the agent could create a child agent which just maximized u1, if somehow that did not do worse on u2, which seems reasonable.)
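The veto condition in the second claim can be sketched as a filter over candidate first-timestep actions. This assumes each subagent's counterfactual expected value is available as a number per action; the action names and values below are hypothetical, and how the agent chooses among the surviving actions (the bargaining step) is deliberately left out.

```python
from typing import Dict, List


def acceptable_actions(
    ev1: Dict[str, float], ev2: Dict[str, float], default: str
) -> List[str]:
    """Return the actions that neither subagent would veto.

    An action survives only if it is at least as good as the default
    action under both subagents' expected values.
    """
    return [
        action
        for action in ev1
        if ev1[action] >= ev1[default] and ev2[action] >= ev2[default]
    ]


# Hypothetical expected values under each subagent's counterfactual.
EV1 = {"do_nothing": 0.0, "build_u1_maximizer_child": 10.0, "build_corrigible_child": 6.0}
EV2 = {"do_nothing": 0.0, "build_u1_maximizer_child": -5.0, "build_corrigible_child": 1.0}

print(acceptable_actions(EV1, EV2, default="do_nothing"))
```

In this toy example the pure u1-maximizing child is vetoed by subagent 2, while the child that stays responsive to the button survives alongside the default.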
Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state. While this sounds intuitive, it is the claim I am least confident in; it is not totally obvious that the agent won’t acquire some incentive to manipulate the button via e.g. bets with other external agents with non-counterfacted models.
Generalization
One simple first-step generalization: add more timesteps. Maybe we want an AI which runs for a while, and a shutdown button which could be pressed at any time.
The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for “button is never pressed”, i.e. the button is first pressed at t=∞). So subagent ∞ maximizes E[u1 | do(∀t: button_t = unpressed), observations], and for all other times subagent T maximizes E[u2 | do(∀t<T: button_t = unpressed, button_T = pressed), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
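As a rough illustration of the bookkeeping, here is a sketch that enumerates the subagents for a finite horizon, along with the button history each one counterfacts on. The finite horizon, the "never" label standing in for t=∞, and the use of `None` for timesteps that are not intervened on are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Union

ButtonHistory = List[Optional[bool]]  # True = pressed, False = unpressed, None = not intervened on


def counterfactual_histories(horizon: int) -> Dict[Union[int, str], ButtonHistory]:
    """Map each subagent label to the button history it counterfacts on."""
    histories: Dict[Union[int, str], ButtonHistory] = {
        # The u1 subagent acts as though the button is never pressed.
        "never": [False] * horizon
    }
    for first_press in range(1, horizon + 1):
        # Unpressed before first_press, pressed at first_press,
        # and no intervention on later timesteps.
        histories[first_press] = (
            [False] * (first_press - 1) + [True] + [None] * (horizon - first_press)
        )
    return histories


for label, history in counterfactual_histories(3).items():
    print(label, history)
```

Each subagent labeled with a finite time would pursue u2 under its history, and the "never" subagent would pursue u1, so a horizon of N timesteps needs N+1 subagents in this sketch.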
This points towards a general approach to use this style of agent for subproblems of corrigibility.
We have some interface to our hopefully-corrigible agent - in this case the shutdown button. We want the agent to be controllable via that interface, i.e. the interface steers what the agent does and the agent generally doesn’t try to manipulate the interface-state or whatever’s causally upstream of the interface. The agent “just follows instructions” from the interface, so to speak.
To apply the same sort of approach as this shutdown proposal, we design the agent to consist of many subagents, each of which counterfacts on a different interface-outcome. For each different sequence of “instructions” which could come from the interface over time, there’s a different subagent with a different utility and a different counterfactual.
Even more generally: one hypothesis I associate with the Embedded Agents sequence is that some kind of corrigibility is a standard feature of embedded agents. Insofar as an agent is made of parts, those parts need to be corrigible in order for the agent as a whole to “not fight itself” internally. So this whole setup potentially points to a general feature of agents made of parts. For instance, I could imagine an argument along the lines of:
Shortcomings/Remaining Problems
First, there are various standard alignment problems which this proposal isn’t even attempting to address. Ontological issues are one large class of examples: how do we make sure that the subagents all agree on what “the button being pressed/unpressed” looks like, such that their bets on the outcome resolve properly? And how do we make sure that their notion of “button pressed/unpressed” matches the human operator’s notion, so that the button does what the human thinks it does? How do we make sure that the subagents’ (high-level abstract) counterfactual over the button state counterfacts the way we intend? How do we make sure that’s all stable under ontology shifts, as the subagents learn more?
This proposal is also not even attempting to make the agent actively maintain the function of the shutdown button or the human operators. Under this proposal, if the AI takes some actions which happen to kill the human operators as a side-effect, it doesn’t particularly care.
Then there’s the general class of “subagent break out” problems - i.e. how do we ensure that one or both subagents actually negotiate over control of the full agent’s actions, rather than hacking out and taking independent action in the world?
Another large class of problems we’ve assumed away is the design of the utility functions. How the heck do we make utility functions that do what we want? (Or non-utility-function goal structures, if we want to go even more general.) That brings in all the usual value alignment problems, though they’re somewhat easier than usual: we’re not necessarily aligning the goals to something as complex as human values. Corrigibility is the point of the proposal, so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”. That does, however, mean that the “shutdown” utility function/goal-structure definitely needs to work.
The proposal also adds a couple new problems. First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.
Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).
Finally, of course, there are the usual problems of unknown unknowns and getting this all across the theory-practice gap.
Thank you to @EJT and @Sami Petersen for discussion and proposals which fed into this.