As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes. This take owes a lot to the Simulators discussion group.

Fine-tuning a large sequence model with RLHF creates an agent that tries to steer the sequence in rewarding directions. Simultaneously, it breaks some nice properties that the model had before fine-tuning. You should have a gut feeling that we can do better.

When you start with a fresh sequence model, it's not acting like an agent; instead, it's just trying to mimic the training distribution. It may contain agents, but at every step it's just going to output a probability distribution that's been optimized to be well-calibrated. This is a really handy property - well-calibrated conditional inference is about as good as being able to see the future, both for prediction and for generation.

The design philosophy behind RLHF is to train an agent that operates in the world of text, where we want to steer towards good trajectories. In this framing, there's good text and bad text, and we want the fine-tuned AI to always output good text rather than bad text. This isn't necessarily a bad goal - sometimes you do want an agent that will just give you the good text. The issue is that you're sacrificing the ability to do accurate conditional inference about the training distribution. When you do RLHF fine-tuning, you're taking a world model and then, in-place, trying to cannibalize its parts to make an optimizer.
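To make that framing concrete, here's the usual KL-regularized RLHF objective (the notation is mine, not from any particular implementation):

$$\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_0\big)$$

where $\pi_0$ is the pre-trained sequence model, $r$ is the reward model's score for a trajectory $x$, and $\beta$ sets how tightly the agent is tethered to the original distribution. The first term is the steering-towards-good-text part; the KL term is the only thing still anchoring the agent to the world model it was built from.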

This might sound like hyperbole if you remember that RL with KL penalties is Bayesian inference. And okay: RLHF weights each datapoint much more heavily than the Bayesian inference step does, but there's probably some perspective in which you can see the fine-tuned model as just having weird, over-updated beliefs about how the world is. But just like perceptual control theory says, there's no bright line between prediction and action. Ultimately it's about which perspective is more useful, and to me it's much more useful to think of RLHF on a language model as producing an agent that acts in the world of text, trying to steer the text onto its favored trajectories.
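For reference, the equivalence being conceded here is just the standard closed form of the optimum of the objective above (again, my notation):

$$\pi^{*}(x) \;=\; \frac{1}{Z}\,\pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right), \qquad Z \;=\; \sum_{x}\pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right)$$

which looks exactly like a Bayesian update of the prior $\pi_0$ by a likelihood proportional to $\exp(r(x)/\beta)$. A small $\beta$ amounts to treating the reward model as overwhelmingly strong evidence - hence the "weird over-updated beliefs" reading.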

As an agent, it has some alignment problems, even if it lives totally in the world of text and doesn't get information leakage from the real world. It's trying to get to better trajectories by any means necessary, even if that means suddenly delivering an invitation to a wedding party. The real-world solution to this problem seems to have been a combination of early stopping and ad-hoc patches, neither of which inspires massive confidence. The wedding party attractor isn't an existential threat, but it's a bad sign for attempts to align more high-stakes AI, and it's an indicator that we're probably failing at the "Do What I Mean" instruction in other, more subtle ways as well.

More seems possible: more capabilities, more interpretability, and more progress on alignment. We start with a perfectly good sequence model; it seems like we should be able to leverage it as a model, rather than as fodder for a model-free process. To any readers who feel similarly optimistic, though, I would like to remind you that the "more capabilities" part is no joke, and it's very easy for it to memetically out-compete the "more alignment" part.

RLHF is still built out of useful parts - modeling the human and then doing what they want is core to lots of alignment schemes. But ultimately I want us to build something more self-reflective, and that may favor a more model-based approach, both because it exposes more interpretable structure (to human designers and to a self-reflective AI alike) and because it preserves the niceness of calibrated prediction. I'll have a bit more on research directions tomorrow.

1 comment

Thanks for the link to porby's post on modularity and goal agnosticism - that's an overlooked goldmine.