Thinking about maximization and corrigibility

One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of thinking about actions directly, a corrigible agent essentially considers what a new, separate proxy agent that it is designing would do. If it has an idea of what kind of proxy agent would take the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of treating a proxy utility as its own, in this frame a corrigible agent considers what would happen with a proxy agent that has that proxy utility, and how that agent should function to avoid Goodharting/misalignment trouble.

The tricky part of this is respecting minimality. The proxy agent itself should be more like a pivotal aligned agent, built around the kind of thing the current action or plan is, rather than around the overall goals of the original agent. This way, passing to the proxy agent de-escalates the scope of optimization/cognition. More alarmingly, the original agent that's corrigible in this sense now seemingly reasons about alignment, which requires all sorts of dangerous cognition. So one of the things a proxy agent should do less of is thinking about alignment: its corrigibility should be less ambitious.

Anything that makes a proxy agent safer (in the sense of doing less dangerous cognition) should be attempted for the original corrigible agent as well. So the most corrigible agents in this sequence of three are the human programmers, who perform dangerous alignment cognition to construct the original corrigible agent, which perhaps performs some alignment techniques when coming up with proxy agents for its actions but doesn't itself invent those techniques. The proxy agents are less corrigible still in this sense; some of them might be playing a maximization game that works directly (like chess or theorem proving), prepared for them by the original corrigible agent.


There's a case to be made that you could have an AI system correctly know that it would be bad for the human's values to comply with shutting off. But for our first AI systems, which aren't moral patients and might be written to optimize the subtly wrong thing, it seems a design failure for there to be foreseeable circumstances in which they won't shut down. Human input is roughly the only glue connecting things-the-AI-wants and things-humanity-wants, so it's playing with fire to design AIs that can oppose our steering.


We could do this by training a generative model of "plans" on human data, or by choosing amongst a list of human-approved meta-plans, or something similar.


We have some people trying to solve the alignment problem. I haven't seen them say their research is bottlenecked in a way some sort of AI research buddy could fix.

Maybe AI theorem provers / formalization helpers can speed things up. Maybe there are subproblems that need some sort of shallow brainstorming + pruning. These are the sorts of things I can imagine non-scary AI systems helping with.

But the best human work on alignment, which isn't yet looking like it solves the problem, involves smart folk thinking very deeply about things, inventing and discarding novel factorizations of the problem in search of angles that work.

This involves a lot of open-ended thinking, goal-directed loops, etc. It's not the sort of thing I can see how to replace or substantially augment with non-scary AI systems.


Also, to be clear, you shouldn't have your AI be thinking about the off-switch in the first place! That's not its job and shouldn't be its concern. The off-switch should just work.

Maybe you should get a compartment of the AI to help with making sure the off-switch continues to work, and that the humans aren't impeded in using it. That seems a bit more reasonable.

Generally speaking the causal connection between the off-switch and the AI being off should be kept as simple and robust as you can make it. Then you need to design your AI so that it won't mess with that.

AI ALIGNMENT FORUM
