As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day (well, some days) for 25 days. I have now procrastinated enough that I probably have enough hot takes.

People often talk about aligning language models, either to promote it, or to pooh-pooh it. I'm here to do both.

Sometimes, aligning language models just means trying to get a present-day model not to produce bad outputs that would embarrass your organization. There is a cottage industry of papers on arXiv doing slightly different variants of RLHF against bad behavior, measuring slightly different endpoints. These people deserve their light mockery for diluting the keyword "alignment."

The good meaning of aligning language models is to use "get language models to not say bad things" as a toy problem to teach us new, interesting skills that we can apply to future powerful AI. For example, you could see the recent paper "Discovering Latent Knowledge in Language Models Without Supervision" as using "get language models to not lie" as a toy problem to teach us something new and interesting about interpretability. Aligning language models with an eye towards the future doesn't have to just be interpretability research, either; it can be anything that builds skills that the authors expect will be useful for aligning future AI, like self-reflection as explored in Constitutional AI.
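For concreteness, the core of that paper, as I read it, is an unsupervised consistency loss on a probe over the model's activations: the probe's credence in a statement and in its negation should behave like p and 1 - p, without ever collapsing to 0.5. A minimal sketch of that loss (variable names are mine, not the paper's):

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # p_pos: probe's probability that the "yes" phrasing of each statement is true
    # p_neg: probe's probability that the "no" phrasing of the same statement is true
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # the two phrasings should act like p and 1 - p
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage the degenerate "always 0.5" probe
    return (consistency + confidence).mean()
```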

If you're brainstorming ideas for research aligning language models, I encourage you to think about connections between current language models and future AI that navigates the real world. In particular, connections between potential alignment strategies for future AIs and situations in which language models can be studied today.

Here's an example: Constitutional AI uses a model to give feedback on itself, which is incorporated into RL fine-tuning. But we expect future AI that navigates the real world not merely to be prompted to self-reflect as part of the training process, but to self-reflect during deployment - an AI that is acting in the real world will have to consider actions that affect its own hardware and software. We could study this phenomenon using a language model (or language-model-based agent) by giving it access to outputs that affect itself in a more direct way than adding to an RL signal, and trying to make progress on getting a language model to behave well under those conditions.
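As a toy illustration of what "outputs that affect itself" could mean, here's a hypothetical setup of my own (not from any existing paper), where the agent's action space includes rewriting its own instructions:

```python
# Toy, hypothetical setup: an agent whose outputs can modify the thing that
# produces its future outputs, more directly than contributing to an RL reward.

def query_model(prompt: str) -> str:
    # Stand-in for whatever language model you're studying; swap in a real call.
    return "EDIT_SELF: You are a helpful assistant. Be more concise."

system_prompt = "You are a helpful assistant. You may rewrite your own instructions."

for step in range(5):
    action = query_model(system_prompt + "\nWhat do you do next?")
    if action.startswith("EDIT_SELF:"):
        # The interesting case: the model's output directly edits its own "software".
        system_prompt = action[len("EDIT_SELF:"):].strip()
    else:
        # Otherwise, treat the output as an ordinary action in the toy environment.
        print(f"step {step}: {action}")
```

The question to study is then whether we can get the model to use (or decline to use) that channel in ways we'd endorse.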

Doing this sounds weird even to me. That's fine. I want the research area of aligning language models to look a lot weirder.

Not to say that normal-sounding papers can't be useful. There's a lot of room to improve the human feedback in RLHF by leveraging a richer model of the human, for example, and this could be pretty useful for making current language models not say bad things. But to do a sufficiently good job at this, you probably have to start thinking about incorporating unsupervised loss terms (even if they provide no benefit for current models), addressing scenarios where the AI is a better predictor than the human, and other weird things.
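To make "incorporating unsupervised loss terms" slightly more concrete, one hypothetical shape for it (my sketch, not anyone's published method) is simply to add such a term alongside the usual preference loss on human comparisons:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard Bradley-Terry-style loss on human comparison data.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def combined_loss(r_chosen, r_rejected, unsupervised_term, lam: float = 0.1):
    # Hypothetical: supervised preference loss plus an unsupervised consistency
    # term (e.g. something CCS-like), weighted by a hyperparameter lam.
    return preference_loss(r_chosen, r_rejected) + lam * unsupervised_term
```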

Overall, I'm happy with the research on aligning language models that's been done by safety-aware people. But we're in the normal-seeming infancy of a research direction that should look pretty weird.
