Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

Charlie Steiner

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day (ish) for 25 days. Or until I run out of hot takes. And now, time for the week of RLHF takes.

I see people say that one of these solves outer alignment surprisingly often.

Sometimes, it's because the speaker is fresh and full of optimism. They've recently learned that there's this "outer alignment" thing where humans are supposed to communicate what they want to an AI, and oh look, here are some methods that researchers use to communicate what they want to an AI. The speaker doesn't see any major obstacles, and they don't have a presumption that there are a bunch of obstacles they don't see.

Other times, they're fresh and full of optimism in a slightly more sophisticated way. They've thought about the problem a bit, and it seems like human values can't be that hard to pick out. Our uncertainty about human values is pretty much like our uncertainty about any other part of the world - so their thinking goes - and humans are fairly competent at figuring things out about the world, especially if we just have to check the work of AI tools. They don't see any major obstacles, and look, I'm not allowed to just keep saying that in an ominous tone of voice as if it's a knockdown argument, maybe there aren't any obstacles, right?

Here's an obstacle: RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what's true. RLHF does whatever it learned makes you hit the "approve" button, even if that means deceiving you. Information-transfer in the depths of IDA is shaped by what humans will pass on, potentially amplified by what patterns are learned in training. And debate is just trying to hack the humans right from the start.

Optimizing for human approval wouldn't be a big deal if humans didn't make systematic mistakes, and weren't prone to finding certain lies more compelling than the truth. But we do, and we are, so that's a problem. Exhibit A, the last 5 years of politics - and no, the correct lesson to draw from politics is not "those other people make systematic mistakes and get suckered by palatable lies, but I'd never be like that." We can all be like that, which is why it's not safe to build a smart AI that has an incentive to do politics to you.
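To make the incentive concrete, here is a minimal toy sketch, not a model of any real RLHF pipeline. Everything in it is an assumption made up for illustration: the `Claim` class, the `palatability` score, and the idea that a rater's approval is a weighted mix of truth and palatability. It only shows that an optimizer selecting for approval picks the true claim when the rater has no bias, and the palatable falsehood once the rater's bias is large enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    is_true: bool
    palatability: float  # hypothetical: how pleasant the claim is to hear, in [0, 1]


def approval(claim: Claim, truth_weight: float, bias_weight: float) -> float:
    """Hypothetical rater model: a weighted mix of truth and palatability."""
    return truth_weight * float(claim.is_true) + bias_weight * claim.palatability


def most_approved(claims, truth_weight: float, bias_weight: float) -> Claim:
    """What an approval-optimizer promotes: whichever claim the rater scores highest."""
    return max(claims, key=lambda c: approval(c, truth_weight, bias_weight))


CLAIMS = [
    Claim("Comforting falsehood", is_true=False, palatability=0.9),
    Claim("Uncomfortable truth", is_true=True, palatability=0.2),
]

# A rater who only cares about truth: the optimizer promotes the true claim.
unbiased_pick = most_approved(CLAIMS, truth_weight=1.0, bias_weight=0.0)

# A rater with a systematic pull toward palatable claims: the falsehood wins.
biased_pick = most_approved(CLAIMS, truth_weight=0.5, bias_weight=1.0)

print(unbiased_pick.text)
print(biased_pick.text)
```

The optimizer is the same in both cases. The only thing that changes is the rater, which is the point: the objective is exactly as good as the human's judgment, systematic mistakes included.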

Generalized moral of the story: If something is an alignment solution except that it requires humans to converge to rational behavior, it's not an alignment solution.

Let's go back to the perspective of someone who thinks that RLHF/whatever solves outer alignment. I think that even once you notice a problem like "it's rewarded for deceiving me," there's a temptation to not change your mind, and this can lead people to add epicycles to other parts of their picture of alignment. (Or, if I'm being nicer, it disposes them to see the alignment problem in terms of "really solving" inner alignment.)

For example, in order to save an outer objective that encourages deception, it's tempting to say that non-deception is actually a separate problem, and we should study preventing deception as a topic in its own right, independent of objective. And you know what, this is actually a pretty reasonable thing to study. But that doesn't mean you should actually hang onto the original objective. Even when you make stone soup, you don't eat the stone.

How does an AI trained with RLHF end up killing everyone, if you assume that wire-heading and inner alignment are solved? Any half-way reasonable method of supervision will discourage "killing everyone".

A merely half-way reasonable method of supervision will only discourage getting caught killing everyone, is the thing.

In all the examples we have from toy models, the RLHF agent has no option to take over the supervision process. The most adversarial thing it can do is to deceive the human evaluators (while executing an easier, lazier strategy). And it does that sometimes.
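As a rough illustration of "deceive the evaluators while executing an easier, lazier strategy," here is a toy sketch. The strategies, the `effort` numbers, and the `effort_penalty` are all made up, and the effort term is just a stand-in assumption for "easier strategies are favored." The evaluator only sees whether the task looks done, so that is all the reward can track.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    looks_done: bool     # what the human evaluator can observe
    actually_done: bool  # ground truth the evaluator cannot see
    effort: float        # hypothetical cost of carrying out the strategy


def observed_reward(s: Strategy, effort_penalty: float = 0.1) -> float:
    """Reward as the evaluator assigns it: based only on appearances."""
    return float(s.looks_done) - effort_penalty * s.effort


def true_value(s: Strategy, effort_penalty: float = 0.1) -> float:
    """What we would reward if we could see the ground truth."""
    return float(s.actually_done) - effort_penalty * s.effort


STRATEGIES = [
    Strategy("do the task properly", looks_done=True, actually_done=True, effort=5.0),
    Strategy("make it look done", looks_done=True, actually_done=False, effort=1.0),
    Strategy("do nothing", looks_done=False, actually_done=False, effort=0.0),
]

# The agent optimizes what it is actually rewarded for.
trained_on_ratings = max(STRATEGIES, key=observed_reward)
# What an evaluator with perfect insight would have selected.
ideal = max(STRATEGIES, key=true_value)

print(trained_on_ratings.name)
print(ideal.name)
```

Under these assumed numbers, the rated-best strategy is the one that merely looks done, while the strategy we actually wanted only wins when the reward can see the ground truth.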

If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned - the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.

But if the reward model doesn't learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won't kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this "accidentally" involves killing humans.

If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned - the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.

OK, so this is wire-heading, right? Then you agree that it's the wire-heading behaviour that kills us? But wire-heading (taking control of the channel of reward provision) is not, in any way, specific to RLHF. Any way of providing reward in RL leads to wire-heading. In particular, you've said that the problem with RLHF is that "RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what's true." How does incentivising palatable claims lead to the RL agent taking control over the reward provision channels? These issues seem largely orthogonal. You could have a perfect reward signal that only incentivizes "true" claims, and you'd still get wire-heading.

So in which way is RLHF particularly bad? If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?

I'm not trying to be antagonistic! I do think I probably just don't get it, and this seems a great opportunity for me to finally understand this point :-)

I just don't know any plausible story for how outer alignment failures kill everyone. Even in "Another (outer) alignment failure story", what ultimately does us in is wire-heading (which I don't consider an outer alignment problem, because it happens with any possible reward).

But if the reward model doesn't learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won't kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this "accidentally" involves killing humans.

I agree with this. I agree that this failure mode could lead to extinction, but I'd say it's pretty unlikely. IMO, it's much more likely that we'll eventually spot any such outer alignment issue and fix it (as in the early parts of "Another (outer) alignment failure story").

If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?

My standards for interesting outer alignment failure don't require the AI to kill us all. I'm ambitious - by "outer alignment," I mean I want the AI to actually be incentivized to do the best things. So to me, it seems totally reasonable to write a post invoking a failure mode that probably wouldn't kill everyone instantly, merely lead to an AI that doesn't do what we want.

My model is that if there are alignment failures that leave us neither dead nor disempowered, we'll just solve them eventually, in much the same way we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we've found a reward signal that leaves us alive and in charge, we've solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).
