This is a followup to Conditioning Generative Models based on further discussions with Evan Hubinger, Nicholas Schiefer, Abram Demski, Curtis Huebner, Hoagy Cunningham, Derek Shiller, and James Lucassen, as well as broader conversations with many different people at the recent ARC/ELK retreat. For more background on this general direction see Johannes Treutlein’s “Training goals for large language models”.


Previously, I wrote about ways we could use a generative language model to produce alignment research. There I proposed two approaches:

  1. Simulate a superhuman alignment researcher by conditioning on them generating superhuman machine-verifiable results in a variety of related domains. Try to align this superhuman agent by crafting scenarios where a misaligned agent has an incentive to defect ("Honeypots").
  2. Simulate humans performing alignment research for a long time, by conditioning on observations spaced many years apart ("Patient Research").

I think that both approaches are an improvement over doing nothing, but I suspect that no amount of honeypot construction actually ensures an aligned agent, and so it seems that simulating humans ("Patient Research") is the more promising direction.


If we truly have a perfect generative model I think the “Patient Research” approach really does well. We can set up a world where machine learning practice is strictly banned and where researchers spend thousands of years constructing a machine-verifiable scheme for AI alignment. The ban prevents that world from containing malign superintelligences, and the time-scale means that if alignment is something humans can solve, the model should produce that solution.

The problem is that we won’t have perfect generative models, so we’ll have to accept some limitations.

In this post I’ll explore what changes if we cannot condition on worlds too different from our own. The idea is to understand what happens if our generative model has trouble generalizing too far away from its training data.

Failure Modes

The main dangerous failure mode with using generative models to produce alignment research is that we accidentally ask for a future that contains a deceptive AGI.

If we get such a future, the AGI might spoof observations to trick us into importing it into our world. For instance it could pretend to be a human producing really good alignment research, but produce work which is subtly flawed and, once implemented, allows it to take over the world.

There are generally two ways we hit this failure mode.

The first is that we ask for a prediction of the distant future. The future 50 years from now has a decent chance of containing a deceptive AGI, and if we naively ask for that future there’s a good chance we’ll simulate that AGI, be deceived, and import that AGI into our world.

The second way is to ask for a world that is very unlikely in ways that make deceptive AGI more likely. For instance, we could condition on observing that next year someone produces what appears to be a perfect solution to AI alignment. When we do this, the hypothesis “a deceptive AGI took over and pretended to solve alignment” becomes a lot more likely, and so the generative model is more likely to simulate that scenario.

The name of the game, then, is to craft strategies that simultaneously (1) make deceptive AGI’s less likely and (2) accelerate alignment research.

Ground Rules


We have a generative model trained on observations of the world. The training objective is to predict later observations given earlier ones. We assume that the model is inner-aligned, so that it really does try to predict the future conditioned on observations.

Observations are multi-modal, and can be anything measurable about the world (e.g. images, video, audio, text, contents of an SD card, etc.). Predicted observations do not need to be made by actual sensors, but do need to be ones that could in principle be made were the right sensors in the right places.


It is possible that our model could know how to construct arbitrary trajectories through time, and so might be able to produce good counterfactual histories. A worry though is that it won’t generalize fully, and in particular that it will over-index on observations in the training set.

For instance, it may be that counterfactual conditionals that should avert the industrial revolution nevertheless result in trajectories that happen to lead to industrialization so that airplanes continue to appear in the observations later on, because airplane featured prominently in the training set. If this happens, the model is producing samples from an artificially narrow (and improbable!) distribution, which could cause all manner of unlikely and/or unpredictable outputs.

To avoid this issue we assume that the model is able to generalize to futures consistent with the observations in its training set, but not to counterfactual pasts. As a result, all conditioning has to be made with respect to future observations. Specifically, observations must come after the end of the training set. This is a much stricter requirement than the one in Conditioning Generative Models, and gives us much less room to craft counterfactual realities.

In principle we can make some counterfactual historical trajectories by only training the model on data up to some past date. The tradeoff is that the model will require enormous amounts of data to train, and pushing that date back to e.g. before the Internet probably makes the training set too small to be useful.

No Fixed Points

We don’t want the model to try to solve for fixed points in its output. This sort of thing arises most naturally when we try to use the model recursively (e.g. suggesting scenarios where human researchers use the model to simulate human researchers using the model, etc.). In such cases the model may try to produce self-fulfilling prophecies, resulting in optimization pressure to steer the future in particular (likely simple) ways.

This seems very dangerous, so I’d like to rule out all scenario that could result in simulated queries to simulated copies of the model. Unfortunately this is not as easy as just not conditioning on the model’s output.

To see why, note that we are right now contemplating querying this model in the future. It seems very likely that simulated humans will want to query the model too. We can try to condition on human researchers never querying generative models, but that’s a very unlikely future. An unfortunate consequence of this is that the model may think that a likely reason the humans don’t query the model is because there is a deceptive AGI manipulating the world (see “Failure Modes” above), and so this conditional raises the odds that we see outputs from worlds with deceptive AGI’s, which is bad.

I see two ways out of this problem:

  1. Push the end of the training set back far enough that the model itself doesn’t appear in the training data, and that no one would plausibly build the model during the simulated time-span. That restricts our questions to near-past counterfactuals (see “Conditioning” above), but does seem like a robust solution.
  2. Condition on a very early disaster that prevents the model from being used (e.g. global GPU shortage, policy change banning generative models, etc.). This one pushes down the chance that the model gets used but also pushes down the chance of any other advanced AI existing, which makes it much better than just conditioning on the model not being used. This may still net increase the odds of simulating a world with deceptive AGI, but it’s better than the alternatives.

Neither of these is totally satisfactory, and I think this general challenge of avoiding self-fulfilling prophecies is an important one to think more about. It seems conceivable that there are just ways to dissolve the problem entirely, which would be great. Broader discussions of this challenge and possible approaches can be found in Training goals for large language models and Conditioning Generative Models for Alignment.

Finally, none of this prevents us from asking the model a question, getting an answer, and then using that answer to craft new questions. The problems I’m worried only apply to recursive (rather than sequential) applications of the model.

Factored Cognition Approaches

Generate 5 Years of Alignment Research

A simple approach is to ask for a copy of the AI Alignment Forums 5 years into the future. This doesn’t require any conditioning, and gives us a five year lead on where we would have been without the generative model. There is a small chance that we can research produced by a misaligned AGI, but that chance can be made very small by tuning the amount of time. If you think that five years is too long, and makes deceptive AGI’s too likely, you can substitute a shorter timeframe into this approach.

Direct 5 Years of Alignment Research

We can do better than just generating five years of research though! Instead, we can direct that research. We can condition on researchers spending more of their time on what we believe to be the most important parts of the problem. Mild interventions like this seem unlikely to make the generative model invoke deceptive AGI’s, and could plausibly significantly increase the progress we see.

Ask What to Ask

Even better than directing research, we can allow research to proceed for five years and then ask the model what humans wish they had done instead. This could be done with a prompt of the form

The research note with the title “What I wish we had worked on 5 years ago” was posted by Paul Christiano to the AI Alignment Forums in 2027. Its contents follow.

Now we might disagree with the arguments in such a post, but this seems like an extremely valuable way to understand the pitfalls of different agendas without actually having to explore them ourselves.

Build up a Picture

The above three approaches, which I’ll refer to as “Generate”, “Direct”, and “Ask”, can be run in parallel. We can generate many trajectories of alignment research. We can increase the diversity of those trajectories, and answer specific questions, by directing research towards the topics that seem most crucial. And we can solicit feedback from humans in these simulated worlds on what they wish they had done, which we can then incorporate into our judgements of what to ask next.

Incidentally, this approach is a special case of Factored Cognition. ****It is also closely related to HCH, but is importantly not recursive (e.g. the humans we simulate do not themselves consult simulations of themselves).

What if we need more time?

To the extent that AI Alignment turns out to be a broad subject, by which I mean one that factors nicely into many parallel subproblems, I expect the above approaches to work well.

Unfortunately the problem may turn out to be very deep. It may be that what we need is a great deal of sequential research. That’s much harder to get out of our model.

To see why, note that the obvious thing to do here is to ask for research conducted over longer time horizons. The more we extend the horizon, though, the more likely it is that we sample from research produced by a deceptive agent. That seems very bad.

Alternatively, we can try to be clever and “import” research from the future repeatedly. For instance we can first ask our model to produce research from 5 years out. Then, we can condition our model on that research existing today, and again ask it for research 5 years out. The problem with this approach is that conditioning on future research suddenly appearing today almost guarantees that there is a powerful AGI involved, which could well be deceptive, and that again is very bad.

If we want more time, then, we need to condition the model in ways that make AGI timelines longer.

No More GPU’s

Prosaic AGI is likely to require ML chips (e.g. GPU’s, TPU’s, etc.). Most of these are designed in the US and manufactured abroad.

Suppose we condition the model on the US banning the export of ML chip designs. The US has a long history of misguided protectionist policies, and ML chips could be called a national security issue, so this sort of conditional should be pretty mild.

The effect, though, is that ML chip manufacturing is set back by a few years as new fabs are built in the US. This extends the horizon where we can safely assume there are no deceptive AGI’s, which lets us answer deeper questions.

Redirecting Civilization

We can plausibly get even more time by a three-step process:

  1. Push AGI off into the future.
  2. Steer civilization with conditionals such that we don’t want to build AGI.
  3. Conduct thousands of years of alignment research.

For example, it seems unlikely but not implausible that a religion emerges that sees dignity only in the fruits of human work. If such a religion were sufficiently prominent it could result in efforts steering away from AI research and towards e.g. human exploration of space.

I think this kind of approach could work, but the order of operations is important! If we start conditioning on lots of unlikely and strongly correlated shifts in human values and society (as in (2)), the most likely story for how this happens is that there’s a deceptive AGI behind the scenes manipulating humanity. To make sure the model doesn’t generate that story, we step (1) needs to come first and make AGI in the relevant time period extremely unlikely.

Pulling this off seems tricky, but we can use the generative model to our advantage. We can buy ourselves a few extra years on (1), then explore timelines conditioned in different ways to try to arrange (2). To be careful we could start with very mild conditioning and explore them gradually (e.g. one month at a time), searching for a likely (but entirely human) path away from developing AGI.


Even with more restrictions, generative models still look pretty good! In particular, even if we can’t condition on past counterfactuals, it looks like there are plausible ways to safely extract an awful lot of good alignment research by conditioning generative models.

One theme that comes up again and again in thinking through these scenarios is that there are only really two levers we can pull on. The first is to make human alignment research more productive by guiding it towards more promising paths and by parallelizing across different simulations as much as possible. The second is to make AGI less likely over some timeframe, buying us more time to safely simulate human researchers.

Both approaches seem important, but which one matters most depends a lot on the (unknown) nature of the problem. If we can solve AI alignment by exploring in great breadth, the first lever is more useful. On the other hand, if the problem requires great depth, meaning sequential thinking, we the second lever is where we should focus.

New Comment
4 comments, sorted by Click to highlight new comments since:

When I process new posts, I add tags so they can more easily be found later. I wasn't sure what tag this one with beyond "AI". I think it'd increase the likelihood your post gets found and read later if you look through the available AI tags (you can search or use the Concepts page), or create new tags if you think none of the existing ones fit.

Thanks! I'll try do that in the future (and will add some to this).

Great post!

Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline? Of course, competitiveness is an issue in general, but the more factored cognition or IDA based approaches seem more realistic to me.

Alternatively, we can try to be clever and “import” research from the future repeatedly. For instance we can first ask our model to produce research from 5 years out. Then, we can condition our model on that research existing today, and again ask it for research 5 years out. The problem with this approach is that conditioning on future research suddenly appearing today almost guarantees that there is a powerful AGI involved, which could well be deceptive, and that again is very bad.

I wonder whether there might also be an issue with recursion here. In this approach, we condition on research existing today. In the IDA approach, we train the model to output such research directly. Potentially the latter can be seen as a variant of conditioning if we train with a KL-divergence penalty. In the latter approach, we are worried about fixed-point and superrationality-based nonmyopia issues. I wonder whether something like this concern would also apply to the former approach. Also, now I'm confused about whether the same issue also arises in the normal use-case as a token-by-token simulator, or whether there are some qualitative differences between these cases.


Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline?

I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hit x-risk territory.

That said, it's totally possible that the x-risk-causing generative model happens before the model that can simulate thousands of years of history. I'm not confident in this either way.

One thing that gives me hope in favor of simulating long histories is that to some extent it's "just" a matter of more compute, and if we get promising results simulating short spans of history it might not be hard to justify a lot of spending on simulating longer stretches. And there's a bright spot there too: simulating longer times likely scales sub-linearly with amount of history simulated. If you have a dynamics model then simulating for twice as long costs double the compute. If you've got a more clever model that knows how to take shortcuts/compress the dynamics you can probably do better.

I wonder whether something like this concern would also apply to the former approach.

I'm pretty concerned about this. I said a bit about this in the "No Fixed Points" section, but basically I think you have to do something to avoid fixed points, otherwise you get all sorts of world-ending optimization pressures. If you do that, you're not allowed any recursion where the model simulates itself, and then you get stuck with the problem of how to introduce future research into the past without making a malicious AGI the most likely explanation...