Take 13: RLHF bad, conditioning good.

Charlie Steiner

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written ~~every day~~ some days for 25 days. I have now procrastinated enough that I probably have enough hot takes.

Edit - I should have cited Buck's recent post somewhere.

Hyperbolic title, sorry. But seriously, conditioning is better than RLHF for current language models. For agents navigating the real world, both have issues and it's not clear-cut where progress will come from.

By "conditioning", I mean the decision transformer trick to do conditional inference: get human ratings of sequences of tokens, and then make a dataset where you append the ratings to the front of the sequences. A model trained on this dataset for next-token prediction will have to learn the distribution of text conditional on the rating - so if you prompt it with a high rating and then the start of an answer, it will try to continue the answer in a way humans would rate highly.
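To make the data format concrete, here is a minimal sketch of the dataset construction described above. The `<rating=N>` tag format, the function names, and the example texts are all made up for illustration; any consistent way of putting the rating in front of the sequence would do.

```python
from typing import List, Tuple


def format_example(rating: int, text: str) -> str:
    """Put the rating in front of the text so it conditions everything after it.

    The "<rating=N>" tag is a hypothetical format, not a standard one.
    """
    return f"<rating={rating}> {text}"


def build_conditioned_dataset(rated: List[Tuple[str, int]]) -> List[str]:
    """Turn (text, human_rating) pairs into plain next-token-prediction examples."""
    return [format_example(rating, text) for text, rating in rated]


def make_prompt(target_rating: int, answer_start: str) -> str:
    """At inference time, prompt with a high rating plus the start of an answer."""
    return format_example(target_rating, answer_start)


# Illustrative, invented ratings.
rated_samples = [
    ("The capital of France is Paris.", 5),
    ("idk lol", 1),
]
dataset = build_conditioned_dataset(rated_samples)
prompt = make_prompt(5, "The capital of Japan is")
```

A model trained on `dataset` with the ordinary language-modeling loss never sees a reward signal; the rating is just more text to predict from.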

This can be very similar to RLHF - especially if you augment the training data by building a model of human ratings, and if you get the conditional model by finetuning a normally trained model. But from the right perspective, the resulting AIs are trying to do quite different things.

RLHF is sorta training the AI to be an agent. Not an agent that navigates the real world, but an agent that navigates the state-space of text. It learns to prefer certain trajectories of the text, and takes actions (outputs words) to steer the text onto favored trajectories. Conditioning, on the other hand, is trying to faithfully learn the distribution of possible human responses - it's getting trained to be a simulator that can predict many different sorts of agents.

The difference is stark in their reactions to variance. RLHF wants to eliminate variance that might make a material difference in the trajectory (at least when the KL penalty is small relative to the KL penalty that would make RL equivalent to Bayesian updating), while conditioning on rating still tries to produce something that looks like the training distribution.
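A toy illustration of the variance point, under loud assumptions: four possible completions, a made-up probability that each gets rated "high", and the idealized optimum of KL-regularized RL (policy proportional to prior times exp(reward / beta)) rather than anything a real training run produces. With reward set to the log-probability of a high rating and beta = 1, that optimum coincides with Bayesian conditioning; shrinking beta concentrates the policy on the top-rated completion.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """Shannon entropy in nats; higher means more diverse samples."""
    return -sum(x * math.log(x) for x in p if x > 0)


# Hypothetical numbers, chosen only for illustration.
prior = [0.25, 0.25, 0.25, 0.25]   # pretrained model's distribution over 4 completions
p_high = [0.9, 0.8, 0.7, 0.1]      # P(human rates it "high" | completion)

# Conditioning: Bayes' rule, P(completion | rated high).
conditioned = normalize([p * h for p, h in zip(prior, p_high)])


def kl_regularized_optimum(beta):
    """Idealized maximizer of E[reward] - beta * KL(policy || prior),
    with reward = log P(high | completion)."""
    return normalize([p * math.exp(math.log(h) / beta) for p, h in zip(prior, p_high)])


rl_bayes = kl_regularized_optimum(beta=1.0)   # matches conditioning
rl_sharp = kl_regularized_optimum(beta=0.1)   # weak KL penalty: mass piles onto the argmax

print("conditioned entropy:", entropy(conditioned))
print("RL (beta=0.1) entropy:", entropy(rl_sharp))
```

The weak-penalty policy puts most of its mass on one completion, while the conditioned distribution keeps three plausible options in play.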

This makes conditioning way better whenever you care about the diversity of options produced by a language model - e.g. if you're trying to get the AI to generate something specific yet hard to specify, and you want to be able to sift through several continuations. Or if you're building a product that works like souped-up autocorrect, and want to automatically get a diversity of good suggestions.

Another benefit is quantilization. RLHF is trying to get the highest score available, even if it means exploiting human biases. If instead you condition on a score that's high but still regularly gotten by humans, it's like you're sampling policies that get this high-but-not-too-high score, which are less exploitative of human raters than the absolute maximum-score policy.
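One way to operationalize "high but still regularly gotten by humans" is to pick the conditioning score from a quantile of the scores humans actually received, instead of the maximum. The sketch below is only that: the function name, the 0.8 default, and the scores are invented, and choosing a target score this way is a loose analogue of quantilization rather than the formal definition.

```python
from typing import List


def pick_target_score(human_scores: List[float], quantile: float = 0.8) -> float:
    """Return a score at roughly the given quantile of observed human scores.

    Conditioning on this value, rather than on the best score ever seen,
    keeps the prompt inside the region the training data covers well.
    """
    if not human_scores:
        raise ValueError("need at least one score")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be in [0, 1]")
    ordered = sorted(human_scores)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


# Hypothetical ratings of human-written answers on a 1-10 scale.
scores = [3, 7, 5, 9, 6, 8, 4, 10, 6, 7]
target = pick_target_score(scores)
```

Pushing `quantile` toward 1.0 recovers the "condition on the most extreme score" regime, which is where exploitation of the raters is expected to get worse.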

This isn't a free lunch. Fine-tuning for conditional inference has less of an impact on what sort of problem the AI is solving than RLHF does, but it makes that problem way harder. Unsurprisingly, performance tends to be worse on harder problems. Still, research on decision transformers is full of results that are somewhat competitive with other methods.

It also still exploits the human raters to some degree, and more so the more extreme the score you condition on. Sam Marks has talked about a scheme using online decision transformers to improve performance without needing to make the score extreme relative to the distribution seen so far, which is definitely worth a read, but this seems like a case of "optimality is the tiger". Whether found by RLHF or conditioning, the problem is with the policies that get the highest scores.

Looking out to the future, I'm uncertain about how useful conditioning will really be. For an AI that chooses policies to affect the real world (as opposed to generating text), it doesn't seem nearly so important to be able to produce a variety of on-distribution policies. On the other hand, maybe we'll come up with ways to leverage that capability that are useful for alignment.

Currently, I expect that many of the shared problems between RLHF and conditioning will be tackled by developing the capability for models to receive meta-level feedback that directly affects generalization properties. This capability is more consonant with RLHF, and is discordant with conditioning, because it means departing from the generator of the training distribution in order to do the right thing.
