(Thanks to Evan Hubinger and Nicholas Schiefer for comments on these ideas.)
These are some notes on the relation between conditioning language models, prompting, and fine-tuning. The key takeaways are:
We can think of a language model as specifying a probability distribution π(x), where x is a sequence of tokens of fixed length N (the length of the context window). We generate text by sampling sequences from π.
Sometimes we don’t want to just sample from a language model. Instead, we want to condition the model on some facts about the sequence x. We can write the conditioned distribution as
where c(x) encodes some constraints on x. For instance c(x) might require that the first token is “Apple”, or that the 7th and 12th tokens are the same, etc.
It’s easy to sample from a language model conditioned on the first two tokens being the same, but not all conditionals are so straightforward. Suppose we condition on the sequence x beginning with the factorization of a large composite number. There exist valid sequences unambiguously satisfying the conditional, but sampling them is hard if we don't know the factorization ahead of time. So there are limits to the kinds of conditionals we can apply in practice.
A prompt is a very restricted kind of conditional where the condition is that certain tokens in x are known in advance. For instance, we might specify that the first four words are “Mary had a little”, or that the last three words are “happily ever after.”
Prompts are nice in a few ways:
The downside with prompting is that there are lots of conditionals we can’t turn into prompts. For instance:
None of these can be expressed in terms of fixed tokens in the context window.
Instead of prompting, we can fine-tune a model, either with an explicit reward function or with Reinforcement Learning from Human Feedback (RLHF). We start with a pre-trained model, then fine-tune it to maximize either an explicit or a learned reward.
Subject to actually converging to the optimum distribution, fine-tuning with a KL penalty is a form of variational bayesian inference. The result is a variational approximation of the Bayesian update on human feedback using the pre-trained model as a prior. That is, we obtain a new model which produces the probability distribution
where the likelihood is L(x)=er(x)/β, β is the KL penalty weight, and r(x) is the reward for sequence x. A more formal discussion was given by Korbak, Perez & Buckley.
Fine-tuning can approximate any conditional a prompt can achieve. To see this, note that every prompt consists of setting tokens at some positions i∈S to values yi, where the indices in S form a subset of the context window. A prompt in this form is approximated by fine-tuning on the reward function
where δxi,yi=1 if xi=yi and is zero otherwise. In the limit of large λ, fine-tuning on this reward function amounts to providing enormous evidence in favor of the desired token values, which is equivalent to conditioning with a prompt that directly fixes those tokens.
With appropriate choices of the reward r(x) we can achieve any shift in the probability distribution that doesn't expand the support of π(x), and so in principle fine-tuning can approximate any conditional.
In practice some conditionals are hard to achieve because they require an unrealistically large number of samples for fine-tuning to converge to the full Bayesian update. For instance it is hard to fine-tune on the reward corresponding to “the sequence x begin with a factorization of a large composite number” because it is takes many tries to find an x satisfying the conditional.
Still, there are many kinds of conditionals that fine-tuning can access in practice. For instance, RLHF can condition on positive human sentiment rating, or on not containing malicious plans.
More generally, fine-tuning seems to be good for conditioning on properties that are:
Because fine-tuning with a KL penalty implements Bayesian updates, every reward function describes a conditional of the form “condition on the following sequences being more/less likely according to their reward”. Unfortunately we may not understand at a deeper level what this conditional means.
In particular, it is not obvious how this conditional generalizes. Consider RLHF with a sentiment reward. There are multiple ways a model could interpret the implied conditional:
These two interpretations generalize very differently. For instance (1) could increase the probability of text describing humans helping each other while (2) could decrease that probability by implying a world with little social trust.
This sort of generalization ambiguity seems really dangerous, because we could end up with very different behavior from what we intended in specifying the reward function or providing feedback.
My key takeaways are:
(1-4) suggest that we will need some sort of fine-tuning/RLHF to achieve the kinds of complex conditionals that are useful in practice/for alignment schemes. If so, (5) says we should try to figure out more about how fine-tuning conditionals generalize, because that's where a lot of the danger lies.
A fine-tuning could be an identity or mission statement for an agent (bureaucracy), so that it speaks with a purpose or attention to particular features of a situation, or to a particular concept, or to an aspect of preference. Then in an HCH-like setting, let's define for each situation (initial prompt) an episode on it that involves multiple agents discussing the situation, elucidating its aspects pertaining to those agents. Each agent participates in some set of episodes defined on a set of situations (agent's scope), and the scope can be different for different agents (each agent is specialized and only participates in episodes about situations where its fine-tuning is relevant).
An agent is aligned when it consistently acts according to its fine-tuning's intent within its scope (it's robust to situations in its scope, or to episodes on the situations in its scope). An agent doesn't need to behave correctly outside its scope to be considered aligned, so a fine-tuning doesn't need to generalize too far, but its scope must conservatively estimate how far it does generalize.
So a large space of situations can be covered by overlapping smaller scopes of agents that bind behaviors of episodes on those situations together. Each agent acts as a sort of acausal coordination device across the episodes on its scope, if agents are iteratively retrained on the data of episodes (as a sort of reflection). And each episode binds behaviors of agents participating in it together (in a sort of bargaining).
In this sketch, alignment/extrapolation (across distributional shift) is sought by training new specialized agents that cover novel situations further from the initial training/fine-tuning distribution with their scopes. This is done by adding them to episodes on situations that are in scopes of both old and new agents, where they bargain with old agents and learn to extend their alignment to new situations within their new scopes. A new agent is trained to understand new situations (within its scope) and arguments that take place within the episodes on those situations. These are unfamiliar to old agents, so their adequate descriptions/explanations won't fit into a context window for an old agent (the way prompts can't replace fine-tunings), but the new agent is expecting these situations already, so can discuss them (after iterating reflection, fine-tuning to the episodes on the new agent's scope).