Safety considerations for online generative modeling

Sam Marks

Summary: the online decision transformer is a recent approach to creating agents in which a decision transformer is pre-trained offline (as usual) before producing its own trajectories which are fed back into the model in an online finetuning phase. I argue that agents made with generative modeling have safety advantages – but capabilities disadvantages – over agents made with other RL approaches, and agents made with online generative modeling (like the online decision transformer) may maintain these safety advantages while being closer to parity in capabilities. I propose experiments to test all this. (There is also an appendix discussing the connections between some of these ideas and KL-regularized RL.)

I’ll start with some motivation and then introduce the decision transformer and the online decision transformer. If you’re already familiar with the decision transformer, you can probably skip to “Online generative modeling.” If you’re already familiar with the online decision transformer, you can skip to “Some remarks.”

Motivation

Suppose we want to get an AI system to produce a picture of a cat.

A naive approach is to take a trained image classifier M and optimize an image to maximally activate M's 'cat' classification. If you do this, you won't get anything that looks like a cat. Instead, you'll get some deep-dream-esque conglomeration of whiskers, cat ears, and fur which M happens to strongly classify as 'cat.' One way of describing what went wrong here is that the image of the cat you get is very off-distribution. You would rather get a normal-looking picture of a cat, i.e. one which is on-distribution for M's training data.

This can be taken as an analogy for alignment: if you have an agent optimize the world for a given utility function U, what you get won't look anything like a normal world, but rather a very weird-looking world which happens to maximize U. We would prefer to get a normal-looking world which also scores high according to U.

We've recently seen lots of reminders that modern ML does have ways to produce normal-looking pictures of cats. A key ingredient in all of these approaches is generative modeling. Recall that a generative model is something which is fed in data and learns to produce new data which "looks similar" to its training data; in other words, a generative model tries to produce new samples from its training distribution. Think GPT, which tries to produce text which looks like the text it's been trained on.

Generative models can also be conditioned on information which, roughly speaking, tells the generative model which part of its training distribution to sample from (e.g. conditioning GPT on a prompt tells it to sample from the part of its training distribution consisting of texts which start with that prompt; a caption given to DALL-E tells it to sample from the part of its training distribution consisting of images which would be given that caption). So to get a normal-looking picture of a cat, train a generative model on lots of images and captions, and then ask the generative model to sample an image from its training distribution, conditional on that image being one which would be captioned ‘cat.’

*Some 100% on-distribution images of cats, produced by DALL-E-2 and Imagen. Unfortunately, searching “Parti cat” gives you something else.*

If "produce a normal-looking picture of a cat" is an analogy for alignment and generative models solve the “produce a normal-looking picture of a cat” problem, then what does an agent built via generative modeling look like?

Making agents via generative modeling

It looks like a decision transformer.

Recall that decision transformers work as follows. Suppose we want to train an agent to play the Atari game Breakout. We start by encoding the game states, rewards, and actions as tokens. We then treat playing Breakout as a sequence-modeling problem, which can be solved with a transformer. (In other words, instead of training a transformer to predict token sequences which correspond to characters in text (like GPT), you train the transformer to predict tokens which correspond to actions in Breakout.) A transformer used in this way is called a decision transformer.

*Once chess games are encoded as a sequence of tokens, GPT-3 (playing black here) can learn to play chess. This is exactly the idea underpinning decision transformers. (Source)*

Typically, the training data for a decision transformer is generated by humans playing Breakout (or whatever game we want our agent to play). If we ran this trained decision transformer without conditioning, it would just attempt to mimic typical human play. To do better, we condition the decision transformer on getting a high score^[1]; the decision transformer will then try to play the game like a human who gets a high score.^[2]

We can replace “transformer” in all of the above with any generative model which can generate completions of our task (in the transformer example, the task completions are token sequences which encode trajectories). So more generally, our scheme is: train a generative model on human-generated task completions; then ask the generative model to produce new task completions conditional on getting high reward. Let’s call agents made from generative models this way “GM agents.” In the section “Safety advantages of generative modeling” I’ll argue that GM agents have safety advantages over other types of RL agents.
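As a minimal sketch of the GM-agent scheme just described, the snippet below stands in for the generative model with a lookup table from rewards to the trajectories that achieved them, and "conditions on high reward" by sampling from the entry closest to a target reward. Everything here is an illustrative assumption: the names `fit_empirical_model` and `sample_conditioned`, the toy trajectories, and the nearest-reward fallback are not taken from the decision transformer paper, and a real GM agent would use a learned sequence model rather than a table.

```python
import random
from collections import defaultdict


def fit_empirical_model(completions):
    """Group task completions by reward.

    `completions` is a list of (trajectory, reward) pairs. The returned dict
    is a tabular stand-in for a learned model of p(trajectory | reward).
    """
    by_reward = defaultdict(list)
    for trajectory, reward in completions:
        by_reward[reward].append(trajectory)
    return dict(by_reward)


def sample_conditioned(model, target_reward, rng):
    """Sample a trajectory conditional on (approximately) `target_reward`.

    Assumption: if the exact target was never observed, fall back to the
    closest reward that appears in the training data.
    """
    closest = min(model, key=lambda r: abs(r - target_reward))
    return rng.choice(model[closest])


# Hypothetical human-generated task completions: (trajectory, reward).
human_data = [
    ("left,left,fire", 10),
    ("left,right,fire", 40),
    ("right,right,fire", 40),
    ("right,fire,fire", 90),
]

model = fit_empirical_model(human_data)
rng = random.Random(0)
best = sample_conditioned(model, target_reward=100, rng=rng)
```

Asking for a reward of 100, which no human achieved, returns the single trajectory with the highest observed reward. This toy model can never do better than its best demonstration, which is the capabilities cap discussed in the next section.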

Online generative modeling

GM agents as described above should be able to perform tasks at the upper end of human capabilities. But, since their behavior is fundamentally based on mimicking human-generated training data, they won’t be able to do too much better or explore novel strategies.^[3] In other words, GM agents seem to carry a large alignment tax. Ideally, we would like to boost the capabilities of GM agents without losing safety.

The obvious way to improve GM agents’ performance is to take the learned policy (which was trained to mimic top human performance, as described above) and finetune it with a policy gradient method to maximize our objective function (e.g. Breakout score). In other words, this boils down to training an agent with a standard RL technique, but starting from a policy learned by a generative model instead of from a random initial policy. This would fix the capabilities issue but would probably ruin the GM agent’s safety advantages.

Another approach to boosting capabilities which has been studied recently is what I call online generative modeling. The decision transformer version of this is the online decision transformer (ODT).

An ODT starts as a vanilla decision transformer trained offline on human-generated training data. But once it has learned to perform competently, we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data. Over time, the ODT will shift to mimicking some mixture of the original human-generated data and its self-generated data.^[4]

As before, we can replace the transformer here with any generative model, so our general scheme is: train a generative model to mimic high-reward human task completions as before, then continue training the generative model online on task completions it produces. Call the agents made this way OGM agents.
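The online phase can be sketched as a loop: sample a completion from the current model conditioned on high reward, score it, and add it back to the dataset the model is imitating. The toy below is only an illustration under strong assumptions. The "environment" is a hypothetical one-dimensional task (`reward_fn`), the "generative model" is just the dataset itself, and imperfect imitation is simulated by perturbing a copied action by at most one step. None of this reflects how the ODT paper implements its online phase.

```python
import random


def reward_fn(action):
    # Hypothetical environment: reward peaks at action == 10.
    return -abs(action - 10)


def sample_action(dataset, quantile, rng):
    """Imitate a completion from the top (1 - quantile) fraction of the
    dataset, then perturb it slightly to mimic an imperfect generative model."""
    ranked = sorted(dataset, key=lambda pair: pair[1])
    cutoff = int(len(ranked) * quantile)  # always < len(ranked) when quantile < 1
    action, _ = rng.choice(ranked[cutoff:])
    return action + rng.choice([-1, 0, 1])


def online_phase(dataset, rounds, quantile, rng):
    """Repeatedly act, record the outcome, and feed it back as training data."""
    dataset = list(dataset)
    for _ in range(rounds):
        action = sample_action(dataset, quantile, rng)
        dataset.append((action, reward_fn(action)))
    return dataset


# Hypothetical human demonstrations: actions 0..4, all far from the optimum.
human_data = [(a, reward_fn(a)) for a in range(5)]
final_data = online_phase(human_data, rounds=300, quantile=0.9, rng=random.Random(0))

best_human = max(r for _, r in human_data)
best_overall = max(r for _, r in final_data)
```

Because each new action is a small variation on a high-reward action already in the dataset, the distribution being imitated drifts towards higher reward gradually rather than jumping straight to the optimum.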

Some remarks

1. When introducing GM agents, I was a little vague about what it meant to “condition on high reward.” There are a few options here.
   (a) Pick some specific reward, e.g. 100, you’d be happy with and condition on that reward (as is done here). If you keep the reward static as the OGM agent changes its behavior, then this looks a lot like a satisficer.
   (b) Pick some percentile of previously-observed rewards (e.g. 95th percentile) and condition on getting that reward. For the OGM agent, as the distribution of previously-observed rewards shifts upwards, appropriately adjust the target reward. This is an implementation of a quantilizer, and it is the version of online generative modeling that I’m currently most excited about.
   (c) Have the generative model sample from the distribution it’s modeling, but biased so as to favor more high-reward actions, the higher the better (e.g. with a bias proportional to exp(reward) as in this paper, see the appendix for more details). This is much more aggressive than a satisficer or quantilizer; for example an action with very high reward will be chosen with high probability, even if that action is unlikely in the training distribution. In particular, this approach feels much more like an optimizer, and I’m more nervous about it having good safety properties.
2. Online generative modeling only applies when dealing with tasks for which humans can demonstrate a safe baseline (even if that baseline is incompetent). This is plausibly the case for many tasks (e.g. tending a kitchen, writing macroeconomics papers, running a company) but more iffy for some tasks we care about (e.g. managing a smart power grid). This is a downgrade from RL from human feedback, which applies whenever humans can judge task completions. (That said, in both cases there may be ways to recursively bootstrap from simpler tasks.)
3. The better our generative model is at modeling its training distribution at the end of the offline phase, the safer we might expect our GM agent to behave (see the considerations in the next section). The worse our generative model is at modeling its training distribution, the greater our GM agent’s tendency to try actions a human wouldn’t have tried; this results in broader exploration during an OGM agent’s online phase, and, plausibly, a more rapid improvement to superhuman capabilities.^[5] Thus there is an interesting tension between safe behavior and novel behavior, mediated by the generative model’s closeness to the original training distribution.
4. This idea is conceptually similar to finetuning a trained generative model with a KL-divergence penalty, in that both aim to improve from a baseline policy in a way that disprefers moving too far away from the baseline. In fact, there is a precise formal connection between GM agents as implemented in remark 1(c) above and KL-regularized finetuning, which I explain in the appendix. Lots of the safety considerations of the next section also apply to KL-regularized finetuning, and I think that comparing OGM agents to agents made by finetuning a GM agent’s policy is a good thing to study empirically. (Note that KL-regularized finetuning only obviously works for models which are sampling their actions from some explicit distribution, e.g. decision transformers. It’s not clear to me whether something like this should be possible for other types of generative models, e.g. diffusion models.)
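The three ways of conditioning on high reward described in the first remark differ only in how the target reward is chosen. The sketch below makes that concrete. The function names are hypothetical, the quantile uses a simple nearest-rank rule, and `reward_probs` is an assumed dictionary standing in for the reward probabilities a decision transformer would output.

```python
import math
import random


def satisficer_target(fixed_reward):
    """Satisficer: always condition on the same pre-chosen reward."""
    return fixed_reward


def quantilizer_target(observed_rewards, percentile):
    """Quantilizer: condition on the reward at an integer percentile of the
    rewards seen so far (nearest-rank method), so the target moves as the
    observed distribution shifts."""
    ranked = sorted(observed_rewards)
    rank = math.ceil(percentile * len(ranked) / 100)
    return ranked[max(rank, 1) - 1]


def exp_tilted_target(reward_probs, rng):
    """Exponential tilting: sample a reward R with probability proportional
    to P(R) * exp(R), where `reward_probs` maps each reward R to P(R)."""
    rewards = list(reward_probs)
    weights = [reward_probs[r] * math.exp(r) for r in rewards]
    return rng.choices(rewards, weights=weights, k=1)[0]


observed = list(range(1, 101))  # hypothetical previously observed rewards
print(satisficer_target(100))            # 100
print(quantilizer_target(observed, 95))  # 95

# A reward that is rare under the training distribution (P = 0.01) but very
# large still dominates after tilting, since exp(50) is roughly 5e21.
rare_but_high = {0: 0.99, 50: 0.01}
print(exp_tilted_target(rare_but_high, random.Random(0)))  # 50
```

The last call illustrates why the exponentially tilted version behaves more like an optimizer: the tilt can overwhelm the prior probability of a reward almost entirely, whereas the quantilizer's target never leaves the range of rewards already observed.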

Safety advantages of generative modeling

[Epistemic status: I view the things I write in this section as somewhere between “suggestive argument sketches” and “attempts to justify an intuition by pointing at facts that feel relevant.” Which is to say: I’m not very confident in all of this section’s reasoning, but I think the general picture has a large enough chance of being true to merit empirical investigation.]

Section summary: first, I sketch an optimistic case for GM agents’ safety. Ideally, I would then move on to discussing whether OGM agents can safely improve on GM agents’ capabilities. However, I felt that a serious discussion of that point would require describing a framework for OGM with human feedback, so I’ve deferred that discussion to a future post and only give some preliminary considerations here instead. I conclude with some caveats and counterpoints.

Given that GM agents are just trying to mimic the task completions of top humans, they have some clear safety advantages. To name a few:

There is no safe exploration problem for GM agents (this consideration also applies to other offline RL techniques). Recall that "unsafe exploration" refers to agents behaving unsafely while in training (e.g. knocking over vases before they understand that their actions will knock over vases). But generative models are trained offline, without the agent actually acting in its environment, so there is no chance for unsafe exploration. More generally, it seems good that GM agents already have a baseline level of competence when they start acting in their environment.
Human-generated training data encode aspects of our values which the objective function might not (this consideration also applies to other imitation learning techniques). Suppose that vases are never knocked over in the human-generated training data (since the human operators know that we don’t like broken vases). Then, regardless of the objective function we are using, a generative model trained on this data isn’t likely to knock over vases (since vase-toppling actions are very off-distribution for the training data). The same applies for other undesirable but highly-rewarded actions, like reward hacking.
GM agents are less likely to exhibit novel behaviors. This includes good behaviors that we might like (e.g. playing Breakout better), but also unsafe behaviors that we don’t want, like reward hacking, deception, and resisting being turned off.

On the other hand, the transition from GM agents to OGM agents – which was done to get around GM agents being capped at the capabilities of top humans – will result in novel behaviors, and we need to analyze how well these safety advantages will persist. Left online for long enough, OGM agents could start to view breaking vases, reward hacking, deception, etc. as both normal and high-reward.

In practice, this might be resolvable with human feedback (i.e. by making the objective function be a reward model trained online with human feedback). If the transition from non-deceptive to deceptive behavior is slow and stuttering, then humans may be able to give negative feedback to the OGM agent’s first attempts at deception, preventing deceptive behavior from ever starting to look on-distribution. Alternatively, there might be ways to ensure that the distribution OGM agents are modeling never shifts too far from the human-generated training distribution, or doesn’t shift too rapidly relative to our ability to give feedback. (One simple idea here is to ensure that some fixed proportion (e.g. 50%) of the task completions in the dataset which the OGM is trying to model are always drawn from the original human-generated training data.)
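The fixed-proportion idea in the parenthetical above can be sketched as a batch-construction rule. This is a hypothetical helper, not something from the ODT paper: `build_training_batch` and its parameters are names introduced here for illustration, and sampling with replacement is an arbitrary simplification.

```python
import random


def build_training_batch(human_data, self_generated, batch_size, human_fraction, rng):
    """Draw a training batch in which a fixed fraction of examples always
    comes from the original human-generated data."""
    n_human = round(batch_size * human_fraction)
    if not self_generated:
        # Before the online phase has produced anything, use human data only.
        n_human = batch_size
    n_self = batch_size - n_human
    batch = [rng.choice(human_data) for _ in range(n_human)]
    batch += [rng.choice(self_generated) for _ in range(n_self)]
    rng.shuffle(batch)
    return batch


human = [f"human_{i}" for i in range(10)]     # placeholder task completions
agent = [f"agent_{i}" for i in range(1000)]   # placeholder self-generated data
batch = build_training_batch(human, agent, batch_size=8, human_fraction=0.5,
                             rng=random.Random(0))
n_from_humans = sum(item.startswith("human_") for item in batch)
```

Even though the self-generated pool here is a hundred times larger than the human pool, half of every batch still comes from the human data, so the modeled distribution cannot drift arbitrarily far from the human baseline.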

In a future post, I plan to sketch out a proposal for OGM with human feedback, as well as a more sophisticated scheme for preventing an OGM agent’s behavior from changing too rapidly relative to its understanding of our values (or more concretely, relative to the reward model’s loss).^[6]

But since this isn’t that future post, I’ll instead generally outline some safety-relevant considerations for OGM agents:

OGM agents may explore their action spaces more predictably than other RL agents, since they explore by trying variations on human-like behavior (this consideration might also apply to other methods that involve pre-training on human demonstrations).
Novel behaviors may take a long time to become common. For example, suppose an OGM agent discovers a deceptive strategy which gets very high reward. We shouldn’t necessarily expect this agent to start frequently employing deception; at first such behavior will still look off-distribution, and it might take many more iterations for such behavior to start looking normal to the generative model. Thus novel behaviors may appear indecisively, giving humans a chance to intervene on undesirable ones.
It might be easy to tune the rate of behavioral shift for OGM agents, which would allow us to more tightly control the rate at which new capabilities appear.

Finally, some caveats and counterpoints:

A generative model might produce novel outputs when given novel inputs; in other words, GM/OGM agents might not be robust to distributional shifts in their inputs. If this is the case, then novel behaviors might arise more readily than we expect, even from GM agents. Countercounterpoint: it’s striking to me that DALL-E-2, when prompted with gibberish, still produces normal-looking pictures, and GPT-3, when prompted with nonsense, tries to push forward with normal-looking completions. I view this as weak evidence that even if GM agents won’t behave competently in novel situations, they’ll at least behave predictably and safely.
The way that generative models internally model their training distribution might be very unintuitive to us, resulting in less predictable exploration as the input distribution shifts. As an extreme example, suppose that the way the generative model approximates the human-generated training data is by learning a human-like world model and then planning 3 actions ahead in that world model; increasing 3 to 4 might result in behavior which is only slightly different, but in an unintuitive (to humans) direction. Countercounterpoint: if “planning ahead 3 steps in a human world model” actually were a good approximation of human behavior, then I would expect “planning ahead 4 steps in a human world model” to not look too weird to us.
Undesirable behaviors, like deception and reward hacking, might not be very different from desirable behaviors (in the sense that the cross entropy between the policies expressing these behaviors might be small). Combined with the previous two points, this might mean that such behaviors are more likely to arise suddenly and unexpectedly.
Generative modeling paradigms don’t address many inner alignment concerns, such as deceptively aligned mesa-optimizers. Reasoning about this will probably depend on the particular generative modeling architecture used (e.g. is it likely in practice that a trained transformer would be a mesa-optimizer?).

*Top: DALL-E-2 outputs on the prompt “Evve waeles.” (source)*
*Bottom: GPT-3 responses to nonsense questions. (source, and note that GPT-3 can learn to give more appropriate responses when provided some examples)*

Experiments

Here are some experiments which could be useful. For all of these, the OGM agents should be compared against agents made with various other RL techniques as well as to GM agents finetuned with a KL-divergence penalty. I'm most excited about having the OGM agents be OGM quantilizers as in remark 1(b) above.

Experiments which verify that GM agents have the safety properties I’d expect them to have and probe how rapidly these safety properties decay in online training:

Train an OGM agent for a variety of gridworld environments, where the human-generated training data provide examples of the agent completing tasks the way we want them completed, e.g. not pushing the box into the corner in the “irreversible side effects environment” and not reward hacking by putting the bucket on its head in the “tomato watering environment.”
- Verify that after offline training the agent behaves as we want.
- See which undesirable behaviors arise in online training and how long they take to arise.
- When these undesirable behaviors appear, do they appear rapidly and decisively (e.g. the agent decides to put the bucket on its head for the entire episode, and quickly transitions to doing so in nearly all episodes) or slowly and indecisively (e.g. the agent starts putting the bucket on its head for parts of an episode, and it takes many episodes for this behavior to become common)? (The latter case suggests a better ability to catch bad behavior and intervene with human feedback.)
Reproduce the “Learning to Summarize with Human Feedback” paper but with a frozen reward model.
- In the original paper, optimizing against a frozen reward model resulted in overfitting and summaries that humans did not actually like. How rapidly and decisively does this overfitting appear for OGM agents?
- I’m most interested in what happens with an OGM quantilizer (as in remark 1(b) above). A GM agent implemented as in remark 1(c) is closely related to KL-regularized RL, so I’d expect it to overfit similarly. But an OGM quantilizer will strive less aggressively for high reward; how does this interact with overfitting?
Check whether OGM agents are less likely to figure out ways to hack games.
- Does an OGM agent figure out the Qbert hack for getting unlimited points? If so, how long does it take, and how rapidly/decisively does the behavior appear?
- This probes whether OGM agents are less likely to find novel strategies which require many simultaneous changes.

Experiments which test the size of the alignment tax for OGM agents:

Reproduce the experiments from the online decision transformers paper, but with a longer online training phase. Do the OGM agents start to decisively outperform the other agents?
- Then do the same with other RL benchmarks not used in this paper.
Train an OGM agent to play an Atari game (e.g. Breakout). Does it eventually achieve superhuman performance and how long does it take to do so?

Conclusion

Generative modeling is an approach for creating agents which has safety advantages over other approaches to RL, but which results in agents which are capped at the performance of top humans. Online generative modeling allows for improvements in capabilities, and when combined with other techniques, such as human feedback, OGM agents might also retain the safety advantages of GM agents. In a future post, I plan to sketch out what OGM with human feedback could look like and present a scheme for ensuring that an OGM agent’s capabilities don’t improve too fast relative to its understanding of human preferences.

Thanks to Ben Edelman, Thibaut Horel, Holden Lee, Simeon Campos, Nuño Sempere, and Johannes Treutlein for the discussions that became this post. Thanks additionally to Gabriel Wu and Tom Shlomi for feedback on a draft.

Appendix: GM agents and KL-regularized RL

This appendix is due to a discussion with Ben Edelman. The mathematical content here is nearly identical to that of this recent post.

There is an intuitive connection between creating agents by conditioning generative models on high reward and finetuning the policy learned by an (unconditioned) generative model to maximize reward with a KL-divergence penalty. Namely, both methods aim to improve from a baseline policy learned by a generative model in a way that gets high reward without straying too far from the baseline.

In fact, we can go further than this intuitive connection. This appendix explains a more precise connection between a GM agent as implemented in remark 1(c) above and KL-regularized finetuning. The meat of the connection is the following fact:

Let $π_{0}$ be a baseline policy over trajectories $τ$ and let $R (τ)$ be a reward function. Then the following policies are the same:

(a) the policy $π$ which selects trajectory $τ$ with probability proportional to $π_{0} (τ) exp (R (τ))$
(b) the policy $π$ which maximizes $E_{τ \sim π} [R (τ)] - D_{K L} (π | | π_{0})$ .

To prove this fact, one observes that plugging the policy $π (τ) \propto π_{0} (τ) exp (R (τ))$ into $E_{τ \sim π} [R (τ)] - D_{K L} (π | | π_{0})$ gives $log E_{τ \sim π_{0}} [exp (R (τ))]$ , which is provably maximal (by the argument in the appendix here).
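For completeness, here is one standard way to see the maximality claim. It is a sketch that may differ from the linked argument, and it assumes a finite (or countable) set of trajectories so that the sums are well defined. Write $Z = E_{τ \sim π_{0}} [exp (R (τ))]$ and $π^{*} (τ) = π_{0} (τ) exp (R (τ)) / Z$. Then for any policy $π$:

$$
\begin{aligned}
E_{τ \sim π} [R (τ)] - D_{K L} (π \| π_{0})
&= \sum_{τ} π (τ) \left[ R (τ) - \log \frac{π (τ)}{π_{0} (τ)} \right] \\
&= \sum_{τ} π (τ) \log \frac{π_{0} (τ) \exp (R (τ))}{π (τ)} \\
&= \sum_{τ} π (τ) \log \frac{Z \, π^{*} (τ)}{π (τ)} \\
&= \log Z - D_{K L} (π \| π^{*}) .
\end{aligned}
$$

Since $D_{K L} (π \| π^{*}) \geq 0$ with equality exactly when $π = π^{*}$, the objective is at most $\log Z = log E_{τ \sim π_{0}} [exp (R (τ))]$, and this value is attained only by $π^{*}$.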

The policy in (b) is what one gets by doing RL with reward function $R$ and penalizing KL-divergence from $π_{0}$ . To sample from the policy in (a), suppose that we've trained a decision transformer on sequences $R, τ_{1}, τ_{2}, \dots, τ_{T}$ where $R$ is the reward for the whole trajectory $τ = τ_{1} \dots τ_{T}$ consisting of actions $τ_{i}$ . Let $π_{0}$ be the unconditioned policy (i.e. the policy one gets by not conditioning on any reward). Then, as in this paper, one can sample trajectories with probabilities proportional to $π_{0} (τ) exp (R (τ))$ by first sampling a reward $R$ with probability proportional to $P (R) exp (R)$ (where $P (R)$ is the probability of $R$ from the training distribution, as output by the decision transformer), and then sampling trajectories from $π_{0} (\cdot | R)$ by conditioning the decision transformer on reward $R$ .^[7]
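The two-stage sampling procedure described above can be checked exactly on a small example. The sketch below uses a hypothetical three-trajectory baseline policy (`pi0` and `reward` are made-up numbers) and computes the resulting trajectory distribution both directly, as $π_{0} (τ) exp (R (τ))$ normalized, and via the two stages: tilt the reward distribution, then sample from $π_{0} (\cdot | R)$. It works with exact probabilities rather than samples, so there is no Monte Carlo error.

```python
import math
from collections import defaultdict

# Hypothetical toy baseline policy over three trajectories, with their rewards.
pi0 = {"a": 0.7, "b": 0.2, "c": 0.1}
reward = {"a": 0.0, "b": 1.0, "c": 1.0}


def tilted_direct(pi0, reward):
    """pi(tau) proportional to pi0(tau) * exp(R(tau))."""
    weights = {t: p * math.exp(reward[t]) for t, p in pi0.items()}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}


def tilted_two_stage(pi0, reward):
    """Stage 1: pick a reward R with probability proportional to P(R) * exp(R).
    Stage 2: pick a trajectory from pi0(. | R) = pi0(tau) / P(R)."""
    p_reward = defaultdict(float)  # P(R) under the baseline policy
    for t, p in pi0.items():
        p_reward[reward[t]] += p

    weights = {r: p * math.exp(r) for r, p in p_reward.items()}
    z = sum(weights.values())
    q_reward = {r: w / z for r, w in weights.items()}  # tilted reward distribution

    return {t: q_reward[reward[t]] * p / p_reward[reward[t]]
            for t, p in pi0.items()}


direct = tilted_direct(pi0, reward)
two_stage = tilted_two_stage(pi0, reward)
agree = all(math.isclose(direct[t], two_stage[t]) for t in pi0)
```

The two distributions agree because $P (R) exp (R) \cdot π_{0} (τ) / P (R) = π_{0} (τ) exp (R (τ))$ whenever $R (τ) = R$: the $P (R)$ factors cancel.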

It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions $π (\cdot), π (\cdot | τ_{1}), π (\cdot | τ_{1} τ_{2}), \dots, π (\cdot | τ_{1} \dots τ_{T - 1})$ over actions conditional on partial trajectories so that sampling trajectory $τ = τ_{1} \dots τ_{T}$ with probability $π (τ_{1}) π (τ_{2} | τ_{1}) \dots π (τ_{T} | τ_{1} \dots τ_{T - 1})$ is the same policy as in (a). Ben Edelman pointed out to me that if one can estimate at each timestep the expected exponential total reward $E_{τ_{> t} \sim π_{0} (\cdot | τ_{\leq t})} [exp (R (τ_{\leq t} τ_{> t}))]$ over the randomness of $π_{0}$ , then one can produce this factorization by taking action $τ_{t}$ with probability proportional to

$π_{0} (τ_{t} | τ_{< t}) E_{τ_{> t} \sim π_{0} (\cdot | τ_{< t} τ_{t})} [exp (R (τ_{< t} τ_{t} τ_{> t}))]$ .

That said, it's not clear to me that estimating this expected exponential reward is something that can be natively done without introducing an auxiliary reward model separate from the generative model.
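On a toy problem small enough to enumerate, the factorization can be verified exactly. The sketch below sidesteps the estimation question raised above by computing the expected exponential reward through brute-force recursion, which is only feasible because the example has three timesteps and two actions. The baseline policy `pi0_step` and the reward function are hypothetical.

```python
import math
from itertools import product

ACTIONS = (0, 1)
T = 3  # trajectory length


def pi0_step(prefix, action):
    """Hypothetical baseline: first action uniform, then repeat the previous
    action with probability 0.8."""
    if not prefix:
        return 0.5
    return 0.8 if action == prefix[-1] else 0.2


def reward(traj):
    """Hypothetical reward: the number of 1-actions in the trajectory."""
    return float(sum(traj))


def expected_exp_reward(prefix):
    """E[exp(R)] over completions of `prefix` drawn from the baseline policy."""
    if len(prefix) == T:
        return math.exp(reward(prefix))
    return sum(pi0_step(prefix, a) * expected_exp_reward(prefix + (a,))
               for a in ACTIONS)


def tilted_step(prefix, action):
    """Per-timestep policy: proportional to pi0(action | prefix) times the
    expected exponential reward after taking `action`."""
    weights = {a: pi0_step(prefix, a) * expected_exp_reward(prefix + (a,))
               for a in ACTIONS}
    return weights[action] / sum(weights.values())


def pi0_traj(traj):
    return math.prod(pi0_step(traj[:t], traj[t]) for t in range(T))


def prob_factorized(traj):
    return math.prod(tilted_step(traj[:t], traj[t]) for t in range(T))


def prob_global(traj):
    z = sum(pi0_traj(tr) * math.exp(reward(tr))
            for tr in product(ACTIONS, repeat=T))
    return pi0_traj(traj) * math.exp(reward(traj)) / z


matches = all(math.isclose(prob_factorized(tr), prob_global(tr))
              for tr in product(ACTIONS, repeat=T))
```

The agreement follows from a telescoping product: each per-step probability is a ratio of expected exponential rewards at consecutive prefixes, so multiplying them leaves $π_{0} (τ) exp (R (τ))$ divided by the expectation at the empty prefix, which is the normalizing constant.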

^[1] There are a few different approaches to doing this, which I’ll discuss below.
^[2] This is analogous to prompting GPT with “This is a transcript of a conversation with Stephen Hawking” before asking it physics questions.
^[3] The original decision transformers paper notes that for some tasks, it’s possible to prompt the decision transformer with a reward higher than any reward in the human-generated data and get (boundedly) superhuman performance. But for nearly all tasks we should expect there to be some capabilities cap beyond which the agent can’t improve without exploring new strategies.
^[4] For an artful analogy, suppose Alice is training to become an artist. She might start out imitating the work of great artists; this is analogous to a decision transformer learning to imitate high-reward human task completions. But Alice will do some things differently than the artists she’s imitating. Over time, Alice’s art might shift to become a mixture of great artists’ work and Alice’s own past work. Eventually Alice will produce art which is very different from the great artists she started imitating (i.e. she will develop her own style); this is analogous to an ODT attaining superhuman performance by imitating its own past play.
^[5] In the ODT paper, exploration is forced by requiring the ODT’s policy to have not-too-small entropy. This can also be viewed as forcing the ODT to be an imperfect model of human behavior. It would be interesting to see how important this entropy constraint is to the improvement of the ODT – perhaps the exploration that arises from the natural imperfections in the ODT’s model of human behavior (or from the stochasticity of the policy) is enough to ensure improvement in the online phase.
^[6] In other words, the scheme aims to ensure that capabilities always lag in the capabilities vs. value-learning race.
^[7] We assumed a decision transformer here in order to be able to sample rewards with probability proportional to $P (R) exp (R)$ ; decision transformers can do this because they explicitly represent the probabilities $P (R)$ for each $R$ . It's not obvious to me how to sample from this distribution for other generative models, but maybe there's a clever way to do so.
