A putative new idea for AI control; index here.
The desiderata for corrigibility are:
1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
2. A corrigible agent does not attempt to manipulate or deceive its operators.
3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.
In this post, I'll be looking at one aspect of points 1 and 3. I feel that they're actually pretty much the same point -- the AI has to act while preserving the corrigibility infrastructure.
A summary of the result will be:
- Without a formal definition of what an operator is, corrigibility is vulnerable to the AI replacing the operator with a fake version.
The warning in this post should always be borne in mind, of course; it's possible that we might find a semi-formal version of operator that does the trick.
Defining the operator or defining the event
Typically, the process of value correction is defined physically, as the outcome of a specific event (such as a button press, or a specific stochastic event triggered by or triggering a button press). It could also be defined by a digital event (receiving a certain encoded signal, or similar).
In all cases, the idea is the same: there is a very narrow and specific definition of the value correction event, and the physical or digital infrastructure surrounding it plays the role of ensuring the value corrections happen just as we intended. For instance, the physical event could be the button press (or a charge in a wire triggered by the button press) and the button itself could be in a room with an operator, with the operator keeping an eye on the AI and motivated to press the button under the right circumstances.
Such a setup is naturally vulnerable if the AI, or one of its subagents, is motivated to manipulate the operator. But even if we can solve points 2 and 4, there remains another vulnerability: the AI can choose to replace the operator with an artificial one (maybe a robot, maybe a second wire).
Why would the AI be so motivated? Well, for a start, it's indifferent to doing so. If the operator is not part of the definition of corrigibility, then having a human operator or a robot one that does exactly the same thing will be equivalent to the AI.
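As a toy illustration of this indifference, here is a minimal sketch, assuming the AI's expected utility is computed only from the value correction event (the button press) and never from who presses it. The `Operator` class, the `press_prob` field and the utility numbers are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical stand-in for whoever (or whatever) sits by the button."""
    kind: str          # "human" or "robot"; the utility below never reads this
    press_prob: float  # probability that the correction button gets pressed


def expected_utility(op: Operator,
                     u_uncorrected: float = 10.0,
                     u_corrected: float = 4.0) -> float:
    """Expected utility when only the button event is part of the definition.

    The operator's identity (op.kind) does not appear anywhere in this
    computation: only the probability of the narrowly defined event does.
    """
    return op.press_prob * u_corrected + (1.0 - op.press_prob) * u_uncorrected


human = Operator(kind="human", press_prob=0.2)
robot = Operator(kind="robot", press_prob=0.2)  # same behaviour, different entity

print(expected_utility(human) == expected_utility(robot))
```

Under these assumptions the two operators are distinct objects, yet the agent assigns them exactly the same expected utility, so nothing in its objective resists the swap.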
Now, there are frictions that might push against this. Replacing the operator might carry an expected cost. Or maybe the operator has private information that the AI wants to know, hence replacing them is a negative as the robot wouldn't have that information. And maybe the AI has a rough "don't harm humans" motive, that, while imperfect, at least prevents it from removing the operator in this circumstance.
But there may be motivations pushing the other way. For example, if a learning agent is unbiased but not uninfluenceable, then it will be motivated to replace the operator with a robot with the same expected outcome, but different variance. Or maybe the operator has private information, but it's much more effective to have a robot read that info from their brain than to rely on the human to reveal that through their behaviour. Or maybe the AI is decently motivated to preserve the corrigibility infrastructure, and is aware of how mortal humans can be, so replacing them with a robot is the prudent thing to do.
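The first of these motivations can be made concrete with a small sketch. It assumes a hypothetical setup with two candidate values (`theta` is 0 or 1), two actions that each earn reward 1 under the matching value, and an operator whose behaviour moves the AI's belief `q = P(theta = 1)`. The belief distributions for the human and the robot are invented for illustration. Both leave the expected belief at 0.5, but they differ in variance.

```python
def value_given_belief(q: float) -> float:
    """Best expected reward once the AI holds belief q = P(theta = 1).

    Assumption: action 1 pays 1 if theta = 1, action 0 pays 1 if theta = 0,
    so the AI picks whichever action its belief favours.
    """
    return max(q, 1.0 - q)


def summarize(outcomes):
    """outcomes: list of (probability, resulting belief) pairs."""
    mean = sum(p * q for p, q in outcomes)
    variance = sum(p * (q - mean) ** 2 for p, q in outcomes)
    value = sum(p * value_given_belief(q) for p, q in outcomes)
    return mean, variance, value


# Hypothetical belief distributions induced by each kind of operator.
human_outcomes = [(0.5, 0.4), (0.5, 0.6)]  # beliefs move only slightly
robot_outcomes = [(0.5, 0.0), (0.5, 1.0)]  # beliefs move all the way

human_mean, human_var, human_value = summarize(human_outcomes)
robot_mean, robot_var, robot_value = summarize(robot_outcomes)

print(human_mean, human_var, human_value)
print(robot_mean, robot_var, robot_value)
```

In this toy model the swap leaves the expected belief untouched, which is all that unbiasedness constrains, yet the AI's expected value is higher with the robot because `value_given_belief` is convex in `q`. With a different assumed value function the preference could run towards lower variance instead; the point is only that equal expectation does not imply indifference.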
All this stems from the fact that the operator is not well defined as part of the corrigibility infrastructure: their position relies on physical facts about the world, along with a narrow definition of the value correction event. To combat that, we'd need to define the operator properly (a very tricky challenge), or physically and cognitively secure them, or hope the AI learns early on not to harm them.