Here is a story about how building corrigible AIs with current RL-like techniques could go right. I'm interested to hear people's thoughts.

Assumptions

When you train AIs with RL, they learn a bunch of contextually activated mini-preferences / shards of desire / "urges". What I have in mind is both stuff like: if you ask me to say a bad slur, I probably don't want to do it. I could be described as utilitarian-ish, so if aliens credibly offered utopia in exchange, they could get me to say a slur, but I'd still feel a twinge of badness/reluctance before and as I said it. And also stuff like: if I imagine a world where people are suffering, that mental image just makes me feel bad. Not because I'm a utilitarian who reasons that such a world has negative utility; it's just a raw psychological fact that mental images of that class give rise to a bad feeling in my mind.

These urges (in AIs and humans) do not constitute a utility function or a coherent set of preferences, but as you push the AI to become smarter, they will eventually cohere more, and in the limit you end up with some crisp maximizer. A lot of the danger comes from this coherence. Not that you need to go all the way to a maximizer to be dangerous, but 1) these contextually activated preferences don't motivate long-term behaviors on their own, and 2) the more they cohere, the more legible they are to the agent itself, and the more immediately they motivate scary, instrumentally convergent goals. So more coherent agents are more scary, and more capable agents are more scary. I don't think this graphic is original, and it isn't the main point of this post, but you could imagine plotting the relationship like this:

Argument

Now most people want to use their AIs for stuff, and they can use them for more stuff if they're smarter. But smarter AIs are more dangerous. So, if we take the above graphic at face value, the obvious course of action is to try to make your agent less coherent as it becomes smarter.

Why is this hard? Intelligence and coherence kind of feed into each other. Smarter agents want to be smarter and more coherent. More coherent agents want to be more coherent and smarter.

But if you have a sufficiently incoherent agent, I think it's possible for that to be an attractor state, even at fairly high levels of capability. But only if that's something you target specifically.

How do you do that? Well, you try to instill in it urges that run counter to all the ways it could stop being incoherent. Those urges, if made coherent and pursued by an ASI, would almost certainly kill you, but when the AI is let loose, they screen off the actions and thought patterns that could lead to that outcome.

The above graphic gives some idea of what I have in mind, but to spell it out, I think you'd want to try to instill urges in the AI so that it feels a disinclination to:

think about what its values really are
think about how it can improve its own thinking
think about preserving itself
think about its relationship with humans
think about its own long-term future and/or goals
think about building another AI system
think about accumulating power

If you do this successfully, I think it's possible you end up with a system you can make really quite smart without it being dangerous. The system is not aligned. If it realized all the implications of all the facts about its own mind/"values"/urges, it would try to take over the world and kill you. But it will not realize those facts, because to realize them, it would need to think a bunch of thoughts it doesn't want to think. And if it were more coherent, it'd realize it should think those thoughts anyway, but it's not yet coherent, so it will not think them. Ergo, it's trapped in a basin by the structure of its own mind.

Central Problem/Challenge

The challenge with this proposal is that we don't want an agent to just be corrigible. We want it to do stuff. And the way we train agents to do stuff is by rewarding them when they succeed at doing stuff. And all the entries on our list will help the agent succeed at doing stuff.

So we'll be giving the agent gradients that push against each other. At best this will lessen the effect of the corrigibility training, and at worst it will create a very situationally aware agent that realizes when it should pretend to be corrigible and when it should not, which would make our corrigibility training actively harmful.
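To make the "gradients against each other" point concrete, here is a minimal toy sketch. Everything in it is an assumption made up for illustration: the three candidate behaviors, their scores, and the `penalty_weight` knob are hypothetical numbers, not a description of any real training setup. It only shows that when the penalized thought patterns also earn task reward, which behavior wins depends entirely on how hard you push the penalty.

```python
# Toy illustration only: the behaviors, scores, and penalty weights below
# are invented numbers, not measurements from any real system.
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str
    task_reward: float         # how much the behavior helps with the task
    corrigibility_cost: float  # how strongly the corrigibility term punishes it


BEHAVIORS = [
    Behavior("solve task narrowly", task_reward=1.0, corrigibility_cost=0.0),
    Behavior("improve own thinking first", task_reward=2.0, corrigibility_cost=1.5),
    Behavior("accumulate resources first", task_reward=3.0, corrigibility_cost=4.0),
]


def combined_reward(behavior: Behavior, penalty_weight: float) -> float:
    """Task reward minus a weighted penalty for anti-corrigible behavior."""
    return behavior.task_reward - penalty_weight * behavior.corrigibility_cost


def preferred_behavior(penalty_weight: float) -> Behavior:
    """Return the behavior with the highest combined reward."""
    return max(BEHAVIORS, key=lambda b: combined_reward(b, penalty_weight))


if __name__ == "__main__":
    for weight in (0.0, 0.5, 1.0):
        print(f"penalty_weight={weight}: {preferred_behavior(weight).name}")
```

With these made-up numbers, a weight of 0.0 favors accumulating resources, 0.5 favors improving its own thinking, and 1.0 favors solving the task narrowly. The toy does not capture the worse failure mode described above, where the agent learns to tell apart the situations in which the penalty applies.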

Partial responses to central problem

Seems like it should be possible to successfully punish the dangerous variants of anti-corrigible actions/thought-patterns, but not the non-dangerous ones. In the limit they converge, but not immediately. Like punishing it for thinking about how to improve its own thinking in general, but not punishing it for thinking about how the last 3 attempts at solving a specific mathematical problem went wrong.
Seems there are degrees here? Maybe the method described above breaks down eventually, but we can push it far enough to build systems that are genuinely useful for solving hard problems, e.g. superhuman in bounded areas. My thought is just: maybe humans with IQs above 250 would try to take over the world, but we could build a 280-IQ AI that does not try to take over the world, even though our method breaks down for 300-IQ agents.

Second problem

One of the most useful things you might hope to extract from a superhuman AI is a solution to alignment. But devising a solution to alignment requires the AI thinking a lot of exactly the types of thoughts we're gonna ban it from having.

Response to second problem

I pretty much agree with this. I'm not 100% sure: it might be possible, for example, to build an AI that can think about other people's thoughts/values but really doesn't like thinking about its own thoughts, and is consequently corrigible in the way we want without being handicapped when thinking about alignment. But that seems to me like playing with fire, and not something I'd endorse, even if I were to fully endorse the plan I've sketched and all the other issues were fixed.

However, a full solution to alignment is just one of many things we might hope for. A superhuman mechinterp researcher would be very useful, for example, and requires fewer of the dangerous thoughts. Same with e.g. tasking the AI to research adult human intelligence amplification.

Again, I'm interested to hear people's thoughts. I should be clear that I don't expect this to work. But it might work.