
(Thanks to Evan Hubinger and Nicholas Schiefer for comments on these ideas.)

These are some notes on the relation between conditioning language models, prompting, and fine-tuning. The key takeaways are:

1. Prompting and fine-tuning can both be used to condition language models.
2. Prompting is quite restricted in the kinds of conditionals it can achieve.
3. Fine-tuning can implement arbitrary conditionals in principle, though not in practice.
4. In practice fine-tuning can still implement more kinds of conditionals than prompting.
5. We don't understand how fine-tuning conditionals generalize, which seems dangerous.

# Conditioning

We can think of a language model as specifying a probability distribution $\pi(x)$, where $x$ is a sequence of tokens of fixed length $N$ (the length of the context window). We generate text by sampling sequences from $\pi$.

Sometimes we don’t want to just sample from a language model. Instead, we want to condition the model on some facts about the sequence $x$. We can write the conditioned distribution as

$$\pi(x \mid c(x)),$$

where $c(x)$ encodes some constraints on $x$. For instance, $c(x)$ might require that the first token is “Apple”, or that the 7th and 12th tokens are the same.
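As a minimal sketch of what conditioning means, the toy example below enumerates every sequence of a tiny made-up model, zeroes out the sequences that violate the constraint, and renormalizes. The vocabulary, the length `N = 3`, the weights in `toy_model`, and the function names are all illustrative assumptions, not anything from a real language model.

```python
from itertools import product

VOCAB = ("a", "b")
N = 3  # toy context length

def toy_model():
    """Hypothetical toy distribution pi(x) over all length-N sequences."""
    weights = {x: 1 + x.count("a") for x in product(VOCAB, repeat=N)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def condition(pi, c):
    """Return pi(x | c(x)): zero out sequences violating c, then renormalize."""
    mass = sum(p for x, p in pi.items() if c(x))
    if mass == 0:
        raise ValueError("condition has zero probability under pi")
    return {x: (p / mass if c(x) else 0.0) for x, p in pi.items()}

pi = toy_model()
first_two_equal = lambda x: x[0] == x[1]  # example constraint c(x)
posterior = condition(pi, first_two_equal)
```

Exhaustive enumeration only works because the toy space has eight sequences; for a real context window the space is far too large, which is why the practical question is which conditionals we can sample from.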

## Some conditions are easy, some are hard

It’s easy to sample from a language model conditioned on the first two tokens being the same, but not all conditionals are so straightforward. Suppose we condition on the sequence $x$ beginning with the factorization of a large composite number. There exist valid sequences unambiguously satisfying the conditional, but sampling them is hard if we don't know the factorization ahead of time. So there are limits to the kinds of conditionals we can apply in practice.
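One way to make the easy/hard distinction concrete is rejection sampling: draw from the model and keep the first sample that satisfies the condition. This is only one naive sampling strategy, chosen here for illustration. The uniform toy "model", the vocabulary of ten tokens, and the `rejection_sample` helper are assumptions. The expected number of draws is one over the probability of the condition, so rare conditions quickly become impractical.

```python
import random

def rejection_sample(sample, c, max_tries=100_000):
    """Draw from sample() until c(x) holds; return (x, number_of_tries).

    Returns (None, max_tries) if no draw satisfies c within the budget.
    """
    for tries in range(1, max_tries + 1):
        x = sample()
        if c(x):
            return x, tries
    return None, max_tries

rng = random.Random(0)
VOCAB = list(range(10))
N = 6
# Uniform toy "model" over length-N sequences (a stand-in, not a real LM).
sample = lambda: tuple(rng.choice(VOCAB) for _ in range(N))

# Easy condition: first two tokens equal (probability 1/10 in this toy).
easy_x, easy_tries = rejection_sample(sample, lambda x: x[0] == x[1])

# Hard condition: the sequence equals one specific target
# (probability 10**-6 in this toy), so a small budget will usually fail.
target = (3, 1, 4, 1, 5, 9)
hard_x, hard_tries = rejection_sample(sample, lambda x: x == target, max_tries=10_000)
```

The factorization example is the same situation in the extreme: checking a candidate is cheap, but the chance of drawing a valid one is so small that no realistic budget suffices.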

# Prompting

A prompt is a very restricted kind of conditional where the condition is that certain tokens in $x$ are known in advance. For instance, we might specify that the first four words are “Mary had a little”, or that the last three words are “happily ever after.”

Prompts are nice in a few ways:

• It’s easy to sample from a language model given an arbitrary prompt.
• We sort of understand what prompts do. A prompt asks the model to predict the output of a text-generation process given that it knows the values of the fixed tokens.

The downside with prompting is that there are lots of conditionals we can’t turn into prompts. For instance:

• Sample text from the model that humans will rate as having positive sentiment.
• Sample text from the model that never involves violence.
• Sample text from the model that contains a valid chess game.

None of these can be expressed in terms of fixed tokens in the context window.

# Fine-Tuning

Instead of prompting, we can fine-tune a model, either with an explicit reward function or with Reinforcement Learning from Human Feedback (RLHF). We start with a pre-trained model, then fine-tune it to maximize either an explicit or a learned reward.

Subject to actually converging to the optimum distribution, fine-tuning with a KL penalty is a form of variational Bayesian inference. The result is a variational approximation of the Bayesian update on human feedback using the pre-trained model as a prior. That is, we obtain a new model which produces the probability distribution

$$\pi'(x) \propto \pi(x)\, L(x),$$

where the likelihood is $L(x) = e^{r(x)/\beta}$, $\beta$ is the KL penalty weight, and $r(x)$ is the reward for sequence $x$. A more formal discussion was given by Korbak, Perez & Buckley.
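The following sketch checks this claim numerically on a three-outcome toy problem. It assumes the fine-tuning objective is expected reward minus `beta` times the KL divergence from the pre-trained distribution, and that the optimum has the tilted form `pi(x) * exp(r(x) / beta)` up to normalization. The prior, the rewards, and the value of `beta` are made-up illustrative numbers.

```python
import math

def tilt(pi, r, beta):
    """Return pi'(x) proportional to pi(x) * exp(r(x) / beta)."""
    w = {x: p * math.exp(r[x] / beta) for x, p in pi.items()}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def objective(q, pi, r, beta):
    """Expected reward under q minus beta * KL(q || pi)."""
    exp_r = sum(q[x] * r[x] for x in q)
    kl = sum(q[x] * math.log(q[x] / pi[x]) for x in q if q[x] > 0)
    return exp_r - beta * kl

pi = {"x1": 0.5, "x2": 0.3, "x3": 0.2}  # toy prior (pre-trained model)
r = {"x1": 0.0, "x2": 1.0, "x3": 2.0}   # toy reward
beta = 0.5                              # KL penalty weight
pi_new = tilt(pi, r, beta)

# A few alternative distributions; none should beat the tilted one.
others = [pi, {"x1": 0.1, "x2": 0.1, "x3": 0.8}, {"x1": 0.0, "x2": 0.0, "x3": 1.0}]
best = objective(pi_new, pi, r, beta)
beats_all = all(objective(q, pi, r, beta) <= best + 1e-12 for q in others)
```

Note that even the distribution that puts all its mass on the highest-reward outcome scores lower than the tilted one here, because the KL term charges for moving away from the prior.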

## Fine-tuning can approximate prompts

Fine-tuning can approximate any conditional a prompt can achieve. To see this, note that every prompt consists of setting tokens at some positions $i \in S$ to values $y_i$, where the indices in $S$ form a subset of the context window. A prompt in this form is approximated by fine-tuning on the reward function

$$r(x) = \lambda \sum_{i \in S} \delta_{x_i, y_i},$$

where $\delta_{x_i, y_i} = 1$ if $x_i = y_i$ and is zero otherwise. In the limit of large $\lambda$, fine-tuning on this reward function amounts to providing enormous evidence in favor of the desired token values, which is equivalent to conditioning with a prompt that directly fixes those tokens.
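A small numerical sketch of this limit follows. It assumes the converged fine-tuned model is the prior tilted by `exp(r(x) / beta)`, with a reward that pays `lam` for each prompt position that matches. The toy prior, the single-token prompt, and the names `fine_tuned`, `prompted`, and `max_gap` are illustrative assumptions.

```python
import math
from itertools import product

VOCAB, N = ("a", "b"), 3
prompt = {0: "b"}  # hypothetical prompt: position 0 is fixed to "b"

# Toy prior over all length-N sequences (made-up weights).
weights = {x: 1 + x.count("a") for x in product(VOCAB, repeat=N)}
total = sum(weights.values())
pi = {x: w / total for x, w in weights.items()}

def fine_tuned(pi, lam, beta=1.0):
    """Tilt pi by exp(r(x)/beta), with r(x) = lam * (number of matched prompt tokens)."""
    def r(x):
        return lam * sum(x[i] == y for i, y in prompt.items())
    w = {x: p * math.exp(r(x) / beta) for x, p in pi.items()}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def prompted(pi):
    """Exact conditional distribution given the fixed prompt tokens."""
    ok = lambda x: all(x[i] == y for i, y in prompt.items())
    mass = sum(p for x, p in pi.items() if ok(x))
    return {x: (p / mass if ok(x) else 0.0) for x, p in pi.items()}

def max_gap(lam):
    """Largest per-sequence difference between the tilted and prompted distributions."""
    target, approx = prompted(pi), fine_tuned(pi, lam)
    return max(abs(target[x] - approx[x]) for x in pi)
```

In this toy, `max_gap(lam)` shrinks as `lam` grows, which is the sense in which the reward-based update approaches the prompt.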

## Fine-tuning can approximate any conditional

With appropriate choices of the reward $r(x)$ we can achieve any shift in the probability distribution that doesn't expand the support of $\pi$, and so in principle fine-tuning can approximate any conditional.
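One way to see this, sketched under the assumption that converged fine-tuning yields $\pi'(x) \propto \pi(x)\, e^{r(x)/\beta}$: pick any target distribution $q$ whose support lies inside that of $\pi$ (the symbol $q$ is introduced here for illustration) and solve for the reward that produces it.

```latex
\begin{align*}
r(x) &= \beta \log \frac{q(x)}{\pi(x)} \\
\pi'(x) &\propto \pi(x)\, e^{r(x)/\beta}
        = \pi(x) \cdot \frac{q(x)}{\pi(x)}
        = q(x).
\end{align*}
```

For a hard conditional $q(x) = \pi(x \mid c(x))$, this reward is constant on sequences satisfying $c$ and tends to $-\infty$ on sequences violating it. Adding a constant to $r$ does not change $\pi'$, so we can equivalently take $r(x) = 0$ where $c$ holds and a very large negative value elsewhere. The support restriction appears because $q(x) > 0$ with $\pi(x) = 0$ would require an infinite positive reward.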

## Some conditions are easy, some are hard

In practice some conditionals are hard to achieve because they require an unrealistically large number of samples for fine-tuning to converge to the full Bayesian update. For instance, it is hard to fine-tune on the reward corresponding to “the sequence $x$ begins with a factorization of a large composite number” because it takes many tries to find an $x$ satisfying the conditional.

Still, there are many kinds of conditionals that fine-tuning can access in practice. For instance, RLHF can condition on positive human sentiment rating, or on not containing malicious plans.

More generally, fine-tuning seems to be good for conditioning on properties that are:

1. Easy to identify/evaluate.
2. Not too rare under the initial distribution $\pi$ (or some pre-conditioned version of this, e.g. via prompts).

## Generalization Concerns

Because fine-tuning with a KL penalty implements Bayesian updates, every reward function describes a conditional of the form “condition on the following sequences being more/less likely according to their reward”. Unfortunately we may not understand at a deeper level what this conditional means.

In particular, it is not obvious how this conditional generalizes. Consider RLHF with a sentiment reward. There are multiple ways a model could interpret the implied conditional:

1. Positive-sentiment text is more likely, so humans are kinder in the world than the pre-training distribution suggested.
2. Positive-sentiment text is more likely, so there are legal restrictions on the kinds of text that are recorded.

These two interpretations generalize very differently. For instance (1) could increase the probability of text describing humans helping each other while (2) could decrease that probability by implying a world with little social trust.

This sort of generalization ambiguity seems really dangerous, because we could end up with very different behavior from what we intended in specifying the reward function or providing feedback.

# Summary

My key takeaways are:

1. Prompting and fine-tuning can both be used to condition language models.
2. Prompting is quite restricted in the kinds of conditionals it can achieve.
3. Fine-tuning can implement arbitrary conditionals in principle, though not in practice.
4. In practice fine-tuning can still implement more kinds of conditionals than prompting.
5. We don't understand how fine-tuning conditionals generalize, which seems dangerous.

(1-4) suggest that we will need some sort of fine-tuning/RLHF to achieve the kinds of complex conditionals that are useful in practice/for alignment schemes. If so, (5) says we should try to figure out more about how fine-tuning conditionals generalize, because that's where a lot of the danger lies.

1 comment:

A fine-tuning could be an identity or mission statement for an agent (bureaucracy), so that it speaks with a purpose or attention to particular features of a situation, or to a particular concept, or to an aspect of preference. Then in an HCH-like setting, let's define for each situation (initial prompt) an episode on it that involves multiple agents discussing the situation, elucidating its aspects pertaining to those agents. Each agent participates in some set of episodes defined on a set of situations (agent's scope), and the scope can be different for different agents (each agent is specialized and only participates in episodes about situations where its fine-tuning is relevant).

An agent is aligned when it consistently acts according to its fine-tuning's intent within its scope (it's robust to situations in its scope, or to episodes on the situations in its scope). An agent doesn't need to behave correctly outside its scope to be considered aligned, so a fine-tuning doesn't need to generalize too far, but its scope must conservatively estimate how far it does generalize.

So a large space of situations can be covered by overlapping smaller scopes of agents that bind behaviors of episodes on those situations together. Each agent acts as a sort of acausal coordination device across the episodes on its scope, if agents are iteratively retrained on the data of episodes (as a sort of reflection). And each episode binds behaviors of agents participating in it together (in a sort of bargaining).

In this sketch, alignment/extrapolation (across distributional shift) is sought by training new specialized agents that cover novel situations further from the initial training/fine-tuning distribution with their scopes. This is done by adding them to episodes on situations that are in scopes of both old and new agents, where they bargain with old agents and learn to extend their alignment to new situations within their new scopes. A new agent is trained to understand new situations (within its scope) and arguments that take place within the episodes on those situations. These are unfamiliar to old agents, so their adequate descriptions/explanations won't fit into a context window for an old agent (the way prompts can't replace fine-tunings), but the new agent is expecting these situations already, so can discuss them (after iterating reflection, fine-tuning to the episodes on the new agent's scope).