On corrigibility and its basin

Donald Hobson

Post somewhat inspired by Eliezer's "well why didn't anyone else write it then?"

Corrigibility has various slightly different definitions, but the general rough idea is of an AI that does what we want. An AI that doesn't trick us. An AI that doesn't try to break out of its box. An AI that will follow clear instructions if it is given them.

A slightly more technical definition is that an agent is corrigible if it is easy for other agents to optimize over it. This has two parts, predictability and controllability.

Get_Random_Bits is not predictable. Agents can't predict what it will do, so they will have a hard time forming plans that use it. A pseudorandom algorithm would score better. Being predictable by AIXI is easy; every small Turing machine manages that. Being predictable by a much more limited agent, such as a human, is a much stricter condition. A pseudorandom algorithm is predictable so long as it is simple enough to be predicted in practice by whichever agent is trying to do so. While Get_Random_Bits has low predictability, it does not have the worst possible predictability score. There are plenty of plans that will work with almost all strings of bits. Sometimes the relevant measure over bit-strings is far from uniform. You can confidently predict that Get_Random_Bits will not hack its way out and take over the internet. Optimize_Random_Utility_Function is worse.
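A toy illustration of that gap, assuming Python's `random.Random` stands in for a pseudorandom algorithm and `secrets.randbits` stands in for Get_Random_Bits (both stand-ins, and the names `pseudorandom_bits`, `get_random_bits` and `predictor`, are illustrative assumptions). An agent that knows the seed can replay the pseudorandom output exactly, but can only predict coarse facts about the truly random source.

```python
import random
import secrets

def pseudorandom_bits(seed, n_bits=64):
    """Deterministic given the seed, so anyone who knows the seed can replay it."""
    return random.Random(seed).getrandbits(n_bits)

def get_random_bits(n_bits=64):
    """Stand-in for Get_Random_Bits: drawn from the OS entropy source."""
    return secrets.randbits(n_bits)

def predictor(seed, n_bits=64):
    # The predicting agent simply runs the same algorithm ahead of time.
    return random.Random(seed).getrandbits(n_bits)

seed = 1234
print(predictor(seed) == pseudorandom_bits(seed))  # True: exact prediction
print(0 <= get_random_bits() < 2**64)              # True: only the range is predictable
```

The second check is the kind of plan that "works with almost all strings of bits": it relies on nothing except the output being some 64-bit string.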

Return_Zeros is a very predictable function. But it isn't controllable. You can't optimize its output by optimizing its input. Return_Input is a much more controllable function.

In a scenario where there is one input (controlled by the agent) and one output, Return_Input is a maximally controllable function, at least to an omniscient controller.
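One crude way to make this concrete, assuming controllability is measured by how many distinct outputs a controller can reach by varying the input (a simplification chosen here for illustration, as is the helper name `reachable_outputs`):

```python
def return_zeros(x):
    # Ignores its input entirely.
    return 0

def return_input(x):
    # Passes the controller's choice straight through.
    return x

def reachable_outputs(f, inputs):
    """Set of outputs a controller can produce by choosing among `inputs`."""
    return {f(x) for x in inputs}

inputs = range(8)
print(len(reachable_outputs(return_zeros, inputs)))  # 1
print(len(reachable_outputs(return_input, inputs)))  # 8
```

Under this measure Return_Zeros offers the controller a single outcome, while Return_Input offers one outcome per available input.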

If an agent is logically omniscient, but not physically omniscient, they can't make Return_Input play chess, because they don't know what moves the opponent will make.

If an agent isn't logically omniscient, they can't get Return_Input to output the trillionth digit of pi.

For both of these reasons, Run_Python_Script is more controllable, at least for non-omniscient agents that know Python.
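A minimal sketch of the difference, assuming a hypothetical `run_python_script` that executes the text it is given and returns whatever the script stores in a variable called `result` (that convention is an assumption made for this example). The controller does not need to know the answer in advance, only how to describe the computation.

```python
def return_input(text):
    return text

def run_python_script(script):
    """Hypothetical stand-in: execute the script and return its `result` variable."""
    namespace = {}
    exec(script, namespace)
    return namespace.get("result")

script = "result = sum(range(10**6))"
print(return_input(script))       # echoes the script text unchanged
print(run_python_script(script))  # 499999500000
```

With Return_Input the controller would have had to work out the sum themselves before typing it in.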

AIXI would find any Turing complete programming language equally controllable, so long as it knew what the language was, and AIXI had unlimited output bandwidth (for verbose languages) and there were no performance concerns about using slow languages.

The Basin

The basin of corrigibility is the set of all algorithms sufficiently corrigible that active optimization on the part of the human programmers can bring them to full corrigibility.

For example, an AI that is perfectly corrigible, except for a glitch that causes it to shut down whenever it sees a picture of a banana, is well within this basin. The human handlers can easily ask the AI if it has any glitches, then instruct the AI to remove this glitch.

For instance, consider the following behaviors when given contradictory instructions.

1. Follow the first instruction, ignoring parts of the second if need be.
2. Follow the second instruction, ignoring parts of the first.
3. Follow the instruction from whoever is higher in the organizational hierarchy. If 2 contradictory instructions are given by the same person, follow the one that contains more letters.
4. Toss a coin.
5. Throw a CONTRADICTION_ERROR and halt.
6. Hold a vote of all the programmers.

You could decide, as part of your definition of corrigibility, that an ideal corrigible agent should do 3. or whatever. But that's part of the map, not the territory.

The structure of the territory is that there are many slight variations on how the AI behaves in this circumstance. It's also easy to move between them. Just tell the AI to do so.
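As a toy sketch of what "just tell the AI to do so" could look like, here are three of the behaviors listed above written as interchangeable policies. The class `ToyAgent`, its `set_policy` method and the policy names are all hypothetical, invented for this illustration.

```python
class ContradictionError(Exception):
    pass

def follow_first(first, second):
    return first

def follow_second(first, second):
    return second

def halt(first, second):
    raise ContradictionError(f"{first!r} conflicts with {second!r}")

POLICIES = {"first": follow_first, "second": follow_second, "halt": halt}

class ToyAgent:
    def __init__(self, policy="first"):
        self.policy = policy

    def set_policy(self, name):
        # Moving between variants is itself just an instruction the agent follows.
        if name not in POLICIES:
            raise ValueError(f"unknown policy: {name}")
        self.policy = name

    def resolve(self, first, second):
        return POLICIES[self.policy](first, second)

agent = ToyAgent()
print(agent.resolve("go left", "go right"))  # go left
agent.set_policy("second")
print(agent.resolve("go left", "go right"))  # go right
```

Which policy counts as "ideal" is a labeling choice; the point of the sketch is only that each variant is one instruction away from the others.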

There are many other details like this. How verbose should the AI be? Should it ask 100 clarifying questions before acting? Should it follow the instructions of a single programmer, or only a majority vote, or only a unanimous decision by all programmers?

This is the structure of the basin: a set of AI designs that it is easy to move back and forth between.

Is an empty Python terminal in the basin? Well, it would be in AIXI's basin. It would be in the basin of a hypothetical human that was really good at AI programming. The more capable the agent guiding movement about the basin, the bigger the basin is. An empty terminal isn't in the basin for me, with one week of thinking time, as the moving force.
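One way to picture this dependence on the guide is as reachability in a graph of designs. The sketch below is a toy model only: the design names, the `EDITS` table and every difficulty number in it are made up for illustration, and "capability" is collapsed to a single number.

```python
from collections import deque

# Hypothetical design graph: EDITS[a] lists (b, difficulty) pairs, meaning a guide
# with capability >= difficulty can turn design a into design b.
# All difficulty values are invented for this example.
EDITS = {
    "banana_glitch": [("fully_corrigible", 1)],
    "halts_on_contradiction": [("fully_corrigible", 1)],
    "empty_python_terminal": [("fully_corrigible", 100)],
    "fully_corrigible": [],
}

def basin(capability, target="fully_corrigible"):
    """Designs from which `target` is reachable using only edits the guide can perform."""
    # Reverse the feasible edges, then search backwards from the target.
    reverse = {design: [] for design in EDITS}
    for src, edges in EDITS.items():
        for dst, difficulty in edges:
            if difficulty <= capability:
                reverse[dst].append(src)
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for prev in reverse[node]:
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen

print(sorted(basin(capability=1)))
print(sorted(basin(capability=100)))
```

In this toy model the low-capability guide's basin contains the glitchy but nearly corrigible designs, and only the high-capability guide's basin also contains the empty terminal.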

The discussion of corrigibility beginning with very simple programs like Return_Zeros and building up complexity gradually with Return_Input, Run_Python_Script and beyond is interesting. It helps make clear that corrigibility isn't a particularly narrow target or especially challenging for software in general, or even for some more intelligent systems. It's specifically at the point when a program starts to become a powerful optimizer or to take on more agentic qualities that it starts to seem really difficult and unclear how to maintain corrigibility.

Post somewhat inspired by Eliezer's "well why didn't anyone else write it then?"

For posterity or anyone who doesn't know which post from Eliezer this is referring to, it's Let's See You Write That Corrigibility Tag.

Corrigibility has various slightly different definitions, but the general rough idea is of an AI that does what we want

An aligned AI will also do what we want, because it's also what it wants: its terminal values are also ours.

I've always taken "control" to differ from alignment in that it means an AI doing what we want even if it isn't what it wants, i.e. it has a terminal value of getting rewards, and our values are instrumental to that, if they figure at all.

And I take corrigibility to mean shaping an AI's values as you go along, and therefore an outcome of control.

Sure, an AI that ignores what you ask and implements some form of CEV or whatever isn't corrigible. Corrigibility is more about following instructions than about having your utility function.
