You might be interested in Reducing Goodhart. I'm a fan of "detecting and avoiding internal Goodhart," and I claim that's a reflective version of the value learning problem.
One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of thinking of actions directly, a corrigible agent essentially considers what a new separate proxy agent it's designing would do. If it has an idea of what kind of proxy agent would be taking the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of treating a proxy utility as its own, in this frame a corrigible agent considers what would happen with a proxy agent that has that proxy utility, and how that proxy agent should function to avoid goodharting/misalignment trouble.
The tricky part of this is respecting minimality. The proxy agent itself should be more like a pivotal aligned agent, built around the kind of thing the current action or plan is, rather than around the overall goals of the original agent. This way, passing to the proxy agent de-escalates the scope of optimization/cognition. More alarmingly, the original agent that's corrigible in this sense now seemingly reasons about alignment, which requires all sorts of dangerous cognition. So one of the things a proxy agent should do less of is thinking about alignment; it should have less ambitious corrigibility.
Anything that makes a proxy agent safer (in the sense of doing less dangerous cognition) should be attempted for the original corrigible agent as well. So the most corrigible agent in this sequence of three is human programmers, who perform dangerous alignment cognition to construct the original corrigible agent, which perhaps performs some alignment techniques when coming up with proxy agents for its actions, but doesn't itself invent those techniques. And the proxy agents are less corrigible still in this sense, some of them might be playing a maximization game that works directly (like chess or theorem proving), prepared for them by the original corrigible agent.
Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from an AI designed like "we've given you some function f(x), now work on maximizing it as best you can".
Part of the failure mode is that when you optimize for highly scoring x, you risk finding candidates that break your model of why a high-scoring candidate is good, and drift away from things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, by being careful about disrupting the structure, causal relationships, etc. that we might be relying on.
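As a concrete toy picture (nothing more than a made-up numerical sketch of my own): suppose the proxy score is the true value plus a heavy-tailed error term standing in for the ways our model can be wrong. Mild selection on the proxy picks modestly better candidates; hard selection mostly picks candidates whose high scores come from the error term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: the proxy score f is the true value plus a heavy-tailed error
# term (the part of the score that reflects our model being wrong somewhere).
n = 1_000_000
true_value = rng.normal(size=n)
model_error = rng.standard_cauchy(size=n)
proxy = true_value + model_error          # f(x): what the optimizer sees

for top_frac in (0.1, 0.01, 0.0001):
    k = int(n * top_frac)
    chosen = np.argsort(proxy)[-k:]       # optimize harder: keep a smaller top slice
    print(f"top {top_frac:8.4%}: median proxy {np.median(proxy[chosen]):8.1f}, "
          f"mean true value {true_value[chosen].mean():5.2f}")

# As selection by proxy gets harder, the selected candidates' proxy scores
# explode while their mean true value shrinks back toward zero: the picks are
# increasingly the ones whose scores come from the error term.
```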
Here's what I'd like to discuss in this post:
When does maximization work?
In cases when it just works to maximize, there will be a structural reason that our model connecting "x scores highly" to "x is good" didn't break down. Some of the usual reasons are:
- Examples: theorem proving, compression / minimizing reconstruction error.
- Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks.
- Examples: quantilization, GPT-4 tasked to write good documentation.
- Examples: chess engine evaluations, having f evaluate the thoughts that lead to x.
There's a lot to go into about when and whether these reasons start breaking down, and what happens then. I'm leaving that outside the scope of this post.
Passing the buck on optimization
Merely passing the buck on optimization, pushing the maximization elsewhere but not adding much structure, isn't a satisfactory solution for getting good outcomes out of strong optimizers.
Take CIRL for instance, or perhaps more broadly the paradigm: "the AI maximizes an uncertain utility function, which it learns about from earmarked human actions". This design has something going for it in terms of corrigibility! When a human tries to turn it off, there's scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off.
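As a toy version of that update story, in the spirit of the off-switch game from the CIRL literature (the numbers and the perfectly-reliable-human assumption here are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# The AI is uncertain about the true utility u of its current plan; a human
# overseer (assumed here to judge u perfectly) will press the off-switch
# exactly when u < 0. Compare three policies under the AI's own belief.
u = rng.normal(loc=0.2, scale=1.0, size=100_000)   # belief over the plan's utility

ev_act      = u.mean()                   # execute the plan, ignoring the human
ev_shutdown = 0.0                        # shut down unconditionally
ev_defer    = np.maximum(u, 0.0).mean()  # let the human decide, comply either way

print(f"act: {ev_act:.3f}   shutdown: {ev_shutdown:.3f}   defer: {ev_defer:.3f}")
# Deferring dominates both alternatives, but only because of the assumption
# that the off-switch press is perfectly informative about u. Weaken that
# assumption and "higher-utility" plans than accepting shutdown reappear.
```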
But this is still not the sort of objective you want to point maximization at. There are a variety of scenarios in which there are "higher-utility" plans than accepting shutdown:
(You might also like to read this recent post on CIRL (in)corrigibility, which is well-formalized and has great explorations of the thought process of a CIRL agent.)
These scenarios depend on a bunch of particulars. But the bottom line is that we're still leaving open cases where the thing being maximized comes apart from our good outcome. The good outcome in this case is the operators switching the AI off[1].
I claim that a large part of the problem was that this AI design merely passed the buck one step, relying on an inner process of maximization that is itself incorrigible. Then if the glue connecting that maximization to good things weakens, we're in no place to fix it.
Patches help but do not evade the difficulty
One patch idea is to replace "maximize" with "quantilize": e.g. the AI samples from plans that look like "reasonable human plans"[2] and chooses one from the top 1%. This will help stay in the region of actions/plans that we can model, and make it easier to stay in control of the process.
This is a good and helpful change! But it only matters to the extent that it replaces some of the labor being done by "maximize this proxy" with something safer.
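Here's a minimal sketch of the maximize-vs-quantilize difference, with a made-up pool of "plans" and a made-up proxy score (all names here are placeholders, not anyone's actual system):

```python
import numpy as np

rng = np.random.default_rng(0)

def maximize(candidates, proxy_scores):
    """Return the single highest-scoring candidate."""
    return candidates[int(np.argmax(proxy_scores))]

def quantilize(candidates, proxy_scores, top_frac=0.01):
    """Sample uniformly from the top `top_frac` of candidates by proxy score.

    The candidates stand in for draws from a base distribution of
    "reasonable human plans" (footnote [2]); the safety story leans on
    that base distribution, not on the proxy.
    """
    cutoff = np.quantile(proxy_scores, 1.0 - top_frac)
    top = np.asarray(candidates)[proxy_scores >= cutoff]
    return rng.choice(top)

plans = rng.normal(size=10_000)                       # stand-in "plan" space
scores = plans + rng.normal(scale=0.5, size=10_000)   # a noisy proxy for plan quality
print(maximize(plans, scores), quantilize(plans, scores))
```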
Suppose you were trying to use your AI to do something ambitious, e.g. implement a monitoring scheme that can flag GPU code that implements large-scale LLM training. (NB: this is not a great example of something ambitious that would help with AI x-risk! But it can help illustrate the difficulty.)
Developing the ability to do this monitoring will require a bunch of thinking and work. Suppose one of the pieces you need is an AI module that understands code quite well, so you can use it to check if some code implements the necessary operations to train an LLM. How does your quantilizing CIRL-ish system achieve this?
If the answer is "it writes code for some machine-learning-ish maximization processes that--", then you're running into the same issues again. You have pushed some of the maximization around, without factoring it out. You may be in a better place than a complete black box, since you can offload some labor to your CIRL bot. But you haven't named a solution that "tiles" and can achieve complicated goals without doing open-ended maximization inside.
And if you think "write code to detect LLM training" isn't too scary an open-ended task, I observe that "solve AI alignment for us" is a lot more dangerous[3]. See also Rob Bensinger on the danger of AI being involved in high-powered plans.
The road to corrigibility
Here is a proto-plan for corrigibility: we try to build into the AI some way of tracking the fact that its optimization is only desirable when it stays within our model, and more generally that it should only do optimization when optimization is wanted.
Backing up, my intuition says that the off-switch situation should play out like this:
This is far from a coherent vision, since I don't have the details for how the AI could "want" that! Still though, one key property of my visualization is some "awareness" in the AI system that all of its optimization can run afoul of Goodhart's curse and other problems, and is thereby tentative. It should prioritize optimization in ways that are safer, perhaps using any of the ideas we have for mild optimization or similar.
And really, when the humans are trying to shut it down, this should result in an immediate update against continuing to run itself![4] There definitely shouldn't be a complicated reasoning process that might for-all-we-know decide to resist.
In short, my basic hope is that we can offload some of the tricky work of "watching the AI to make sure it's working like we expect" to the AI, and my specific hope is that the AI can do simple things that steer away from complicated model-violating things. The implementation remains super hard and ill-specified.
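To gesture at the shape of it (a cartoon, not a proposal; every name and threshold here is hypothetical): filter proposed plans through a crude "is this like anything we've vetted?" check before taking the best remaining one, and escalate to the operators when nothing passes.

```python
import numpy as np

rng = np.random.default_rng(0)

vetted_examples = rng.normal(size=(500, 4))               # plans humans have looked at
candidate_plans = rng.normal(scale=3.0, size=(200, 4))    # plans the optimizer proposes

def proxy_score(plan):
    return plan.sum()                     # stand-in proxy; the thing Goodhart attacks

def familiar(plan, threshold=2.5):
    # Crude in-distribution check: distance to the nearest vetted example.
    return np.linalg.norm(vetted_examples - plan, axis=1).min() < threshold

acceptable = [p for p in candidate_plans if familiar(p)]
if acceptable:
    chosen = max(acceptable, key=proxy_score)   # optimize only inside the modeled region
else:
    chosen = None                               # escalate to humans rather than extrapolate

print("chosen:", chosen)
```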
Related corrigibility ideas
I broadly think of other angles on corrigibility as trying to stay in the regime of "we know what the optimization is doing", and "we don't expect it to surprise us by breaking our models".
Two related ideas are "reward actions not outcomes" and "supervise processes not outcomes". The point being that it helps to shape your AI with opinions about its plans and how it does its planning, rather than only constraining downstream outputs.
And another proto-plan for optimizing things better is Davidad's "open agency" sketch. (See also this comment.) I'm no expert, but my take on it is:
Finally, I'll also point to Eliezer's recent list of ideas (though beware, there are fiction spoilers in the surrounding text). My thoughts looking for "corrigibility-generating principles" are in large part inspired by these.
[1] There's a case to be made that you could have an AI system correctly know that it would be bad for the human's values to comply with shutting off. But for our first AI systems, which aren't moral patients and might be written to optimize the subtly wrong thing, it seems a design failure for there to be foreseeable circumstances in which they won't shut down. Human input is roughly the only glue connecting things-the-AI-wants and things-humanity-wants, so it's playing with fire to design AIs that can oppose our steering.
[2] We could do this by training a generative model of "plans" on human data, or by choosing amongst a list of human-approved meta-plans, or something similar.
[3] We have some people trying to solve the alignment problem. I haven't seen them say their research is bottlenecked in a way some sort of AI research buddy could fix.
Maybe AI theorem provers / formalization helpers can speed things up. Maybe there are subproblems that need some sort of shallow brainstorming + pruning. These are the sorts of things I can imagine non-scary AI systems helping with.
But the best human work on alignment, which isn't yet looking like it solves the problem, involves smart folk thinking very deeply about things, inventing and discarding novel factorizations of the problem in search of angles that work.
This involves a lot of open-ended thinking, goal-directed loops, etc. It's not the sort of thing I can see how to replace or substantially-augment with non-scary AI systems.
[4] Also, to be clear, you shouldn't have your AI be thinking about the off-switch in the first place! That's not its job and shouldn't be its concern. The off-switch should just work.
Maybe you should get a compartment of the AI to help with making sure the off-switch continues to work, and that the humans aren't impeded in using it. That seems a bit more reasonable.
Generally speaking, the causal connection between the off-switch and the AI being off should be kept as simple and robust as you can make it. Then you need to design your AI so that it won't mess with that.