Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems.

Charlie Steiner

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written ~~every day~~ some days for 25 days. I have now procrastinated enough that I probably have enough hot takes.

I felt like writing this take a little more basic, so that it doesn't sound totally insane if read by an average ML researcher.

Edit - I should have cited Buck's recent post somewhere.

Use of RLHF by OpenAI is a good sign in that it shows how alignment research can get adopted by developers of cutting-edge AI. I think it's even a good sign overall, probably. But still, use of RLHF by OpenAI is a bad sign in that it shows that jamming RL at real-world problems is endorsed as a way to make impressive products.

If you wandered in off the street, you might be confused why I'm saying RL is bad. Isn't it a really useful learning method? Hasn't it led to lots of cool stuff?

But if you haven't wandered in off the street, you know I'm talking about alignment problems - loosely, we want powerful AIs to do good things and not bad things, even when tackling the whole problem of navigating the real world. And RL has an unacceptable chance of getting you AIs that want to do bad things.

There's an obvious problem with RL for navigating the real world, and a more speculative generalization of that problem.

The obvious problem is wireheading. If you're a clever AI learning from reward signals in the real world, you might start to notice that actions that affect a particular computer in the world have their reward computed differently than actions that affect rocks or trees. And it turns out that by making a certain number on this computer big, you can get really high reward! At this point the AI starts searching for ways to stop you from interrupting its "heroin fix," and we've officially hecked up and made something that's adversarial to us.

Now, maybe you can do RL without this happening. Maybe if you do model-based reasoning, and become self-reflective at the right time to lock in early values, you'll perceive actions that manipulate this special computer to be cheating according to your model, and avoid them. I'll explain in a later take some extra work I think this requires, but for the moment it's more important to note that a lot of RL tricks are actually working directly against this kind of reasoning. When a hard environment has a lot of dead ends, and sparse gradients (e.g. Montezuma's Revenge, or the real world), you want to do things like generate intrinsic motivations to aid exploration, or use tree search over a model of the world, which will help the AI break out of local traps and find solutions that are globally better according to the reward function.

Maxima of the reward function have nonlocal echoes, like mountains have slopes and foothills. These echoes are the whole reason that looking at the local gradient is informative about which direction is better long-term, and why building a world model can help you predict never-before-seen rewarding states. Deep models and fancy optimizers are useful precisely because their sensitivity to those echoes helps them find good solutions to problems, and there's no difference in kind between the echoes of the solutions we want our AI to find, and the echoes of the solutions we didn't intend.

The speculative generalization of the problem is that there's a real risk of an AI sensing these echoes even if it's not explicitly intended to act in the real world, so long as its actions are affecting its reward-evaluation process, and it benefits from building a model of the real world. Suppose you have a language model that you're trying to train with RL, and your reward signal is the rating of a human who happens to be really easily manipulated (Maybe the language model just needs to print "I'm conscious and want you to reward me" and the human will give it high reward. Stranger things have happened.). If it's clever, perhaps if it implicitly builds up a model of the evaluation process in the process of getting higher reward, then it will learn to manipulate the evaluator. And if it's even cleverer, then it will start to unify its model of the evaluation process and its abstract model of the real world (which it's found useful for predicting text), which will suggest some strategies that might get really high reward.

Now, not all RL architectures are remotely like this. Model-based RL with a fixed, human-legible model wouldn't learn to manipulate the reward-evaluation process. But enough architectures would that it's a bad idea to just jam RL and large models at real-world problems. It's a recipe for turning upward surprises in model capability into upward surprises in model dangerousness.

All that said, I'm not saying you can never use RL - in fact I'm optimistic that solutions to the "build AI that does good things and not bad things" problem will involve it. But we don't know how to solve the above problems yet, so orgs building cutting-edge AI should already be thinking about restricting their use of RL. The outer loop matters most: a limited-scope component of an AI can be trained with an RL objective and still be used in a safe way, while a top-level controller optimized with RL can use normally-safe cognitive components in a dangerous way.

This includes RLHF. When I say "jam RL at real-world problems," I absolutely mean to include using RLHF to make your big language model give better answers to peoples' questions. If you've read the right alignment papers you're probably already thinking about ways that RLHF is particularly safe, or ways that you might tweak the RLHF fine-tuning process to enforce stricter myopia, etc. But fundamentally, it has problems that we don't know the solutions to, which makes it unsafe to use as a long-term tool in the toolbox.

Boy howdy it makes the numbers go up good though.

[-]Adrià Garriga-alonso4y41

I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope that language models will be friendly, because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get a SL technique, people are going to jam it into RL.)

The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic or procrastinator AI) could just look like the AI leaving us alone so long as we guarantee that its reward numbers will be really really high.

No, the problem is long-term planning and agentic-ness, which implies that the AI will realize that seizing power is a good instrumental goal.

Model-based RL with a fixed, human-legible model wouldn't learn to manipulate the reward-evaluation process

No, instead it manipulates the world model, which is by assumption imperfect; and thus no useful systems can be constructed this way. This has been a capabilities problem for model-based RL, even with learned models, for decades; which is not actually fully solved yet.

11

Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems.

11