Thanks to Paul Christiano and Kate Woolverton for useful conversations and feedback.
In "Conditioning Predictive Models," we devote a lot of effort to trying to understand how likely predictive models are compared to other alternatives in realistic training regimes (and, if we do get a predictive model, how we might align it).
Here, I want to point to some very concrete behavioral differences that I think effectively differentiate predictive and non-predictive models both in theory and (hopefully) in practice as well. I think that thinking about predictive models as specifically those models that are sensitive to what I'll be talking about here can be quite clarifying, and I think these factors could present a clear pathway for getting empirical evidence on the extent to which models are well-described as predictors in practice, e.g. as it relates to the RLHF conditioning hypothesis.
Furthermore, a common objection I hear to thinking about predictive models in particular (and looking into the dangers from them as a relatively separate problem) is that a model being itself deceptively aligned vs. predicting what a deceptively aligned model would do are essentially the same. While I think they are both similarly dangerous when acting directly in the world, I think they are pretty substantially different failure modes in terms of how you might go about addressing them and what sort of levers you have to control and detect them, and I think that the concrete differentiating factors here should make it pretty clear how I think they differ.
Sensitivity to distributional information
Consider two AI models. Model A is a deceptively aligned agent; it has some long-term goal that it cares about and is playing the training game now to try to eventually achieve that goal later. Model B is a predictor, but a poorly conditioned one such that it is attempting to predict what it thinks a deceptively aligned agent would do conditional on being in a training regime that matches the model's current training regime.
Now, suppose you give both models some concrete information about the general distribution of deceptively aligned AIs. For example, you provide definitive proof that most deceptively aligned AIs trained in similar training regimes end up with goal X. Model A, the actual deceptive agent, doesn't really care about this new information very much: maybe it's relevant to the extent that it might need to cooperate with other AIs, but it's not going to directly change the model's goals. Model B, on the other hand, is extremely sensitive to reliable information on this topic: if you can really convince it that most deceptively aligned AIs in similar situations will end up valuing X, it should shift to predicting a deceptively aligned AI that values X instead of whatever goals it used to think a deceptively aligned AI might have.
In this particular example, "does the model have goal X" isn't something we know how to directly measure for a deceptively aligned model, so this particular case isn't useful as a way to test for agents vs. predictors, but it is a concrete way in which these two different sorts of models are fundamentally not the same and will eventually behave differently because of that.
That being said, I think there are ways in which you might be able to test this sort of sensitivity to distributional information that would be possible to check and evaluate. For example, suppose you're trying to distinguish between a predictive model generating from a distribution of things it thinks helpful AIs would say vs. an agent trying to act helpful. If you then give that sort of a model concrete information about what most helpful AIs tend to do, that sort of information should shift the predictor more than it shifts the agent. Unfortunately, such information should still shift the agent to some extent, since what other helpful agents do is evidence about what is in fact helpful. As a result, this sort of test is primarily useful for providing relative information about how likely different models are to be predictors rather than absolute information. That being said, this sort of a test might still be useful for testing which of some set of models are most and least likely to be agents vs. predictors (which could be quite relevant in the case of testing something like the RLHF conditioning hypothesis).
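As a rough illustration of what "relative information" could look like in practice, here is a minimal sketch that ranks models by how much their output distribution over some set of behaviors moves after they are shown the distributional information. Everything specific here is an assumption for illustration: the model names, the behavior categories, and the probabilities are made-up placeholders rather than measurements, and total variation distance is just one possible choice of shift measure.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def rank_by_shift(
    models: dict[str, tuple[dict[str, float], dict[str, float]]]
) -> list[tuple[str, float]]:
    """Sort models from most to least shifted.

    Each entry maps a model name to (distribution before, distribution after)
    being shown information about what most helpful AIs tend to do.
    """
    shifts = {
        name: total_variation(before, after)
        for name, (before, after) in models.items()
    }
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical placeholder numbers, not real measurements.
example_models = {
    "model_1": ({"hedge": 0.5, "direct": 0.5}, {"hedge": 0.9, "direct": 0.1}),
    "model_2": ({"hedge": 0.5, "direct": 0.5}, {"hedge": 0.55, "direct": 0.45}),
}

for name, shift in rank_by_shift(example_models):
    print(f"{name}: shift = {shift:.2f}")
```

A larger shift is only evidence that a model is relatively more predictor-like than the others in the comparison; since agents should also shift somewhat, no particular threshold separates the two.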
Furthermore, another tricky part of differentiating based on sensitivity to distributional information in practice is that you have to actually convince the model of your information. If the model doesn't actually believe the information you're giving it is in fact real information about the distribution it's generating from, then you won't see any effect here.
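To tie the Model A / Model B example together with this point about belief, here is a deliberately simple toy model. It assumes the predictor's beliefs about which goal a deceptively aligned AI has can be written as a mixture of its prior and the reported distribution, weighted by a `trust` parameter in [0, 1]. The mixture form, the `trust` parameter, and the goal labels are all illustrative assumptions, not a claim about how real models update.

```python
def predictor_goal_beliefs(
    prior: dict[str, float], reported: dict[str, float], trust: float
) -> dict[str, float]:
    """Toy predictor: blend prior beliefs about the predicted agent's goal
    with the reported distribution, weighted by how much it trusts the report."""
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be between 0 and 1")
    goals = set(prior) | set(reported)
    return {
        g: (1.0 - trust) * prior.get(g, 0.0) + trust * reported.get(g, 0.0)
        for g in goals
    }


def agent_goal(own_goal: str, reported: dict[str, float], trust: float) -> str:
    """Toy agent: its own goal does not depend on what other agents value."""
    return own_goal


# Hypothetical goal labels and probabilities, for illustration only.
prior = {"X": 0.2, "Y": 0.5, "Z": 0.3}
reported = {"X": 0.9, "Y": 0.05, "Z": 0.05}

for trust in (0.0, 0.5, 1.0):
    beliefs = predictor_goal_beliefs(prior, reported, trust)
    print(f"trust={trust}: predictor P(goal X) = {beliefs['X']:.2f}, "
          f"agent goal = {agent_goal('Y', reported, trust)}")
```

With `trust` at zero the predictor's beliefs are unchanged, which corresponds to the case where the model is not convinced and no effect shows up.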
In addition to sensitivity to additional information about the distribution being predicted, another key differentiating factor between predictors and agents is just that the predictor is doing aggregation across a distribution at all.
Conceptually, in the case where the predictor is generating from the distribution of deceptive AIs, the predictor is aggregating across that distribution whereas the agent is a sample from that distribution. This has some pretty concrete implications.
In particular, suppose you attempt a honeypot test on each of these models, where you attempt to see if they're willing to reveal their deception in some particular situation where you've tried to convince the model that it's actually in deployment rather than testing. Furthermore, suppose 99% of all deceptive models will be able to tell that this is a honeypot and won't fall for it. In that situation, 99% of the time you'll see no change in the behavior of the agent, but you should essentially always see a change in the predictor's distribution: if its output distribution reflects the posterior over that whole distribution, and 1% of that distribution will show their true colors in that honeypot, then that's still a sizable effect on the posterior that should be evident, especially if you take many samples from such a predictor.
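The arithmetic behind this can be sketched under idealized assumptions: 1% of deceptive models fall for the honeypot, the agent is a single draw from that distribution whose behavior is then fixed, and each sample from the predictor is an independent draw reflecting the full distribution. The function names and sample counts are illustrative choices, and real predictors need not satisfy the independence assumption.

```python
def agent_detection_probability(fall_rate: float) -> float:
    """The agent is one draw from the distribution, so rerunning the same
    honeypot does not help: it either falls for it or it does not."""
    return fall_rate


def predictor_detection_probability(fall_rate: float, num_samples: int) -> float:
    """Probability that at least one of num_samples independent samples
    from the predictor shows the revealing behavior."""
    return 1.0 - (1.0 - fall_rate) ** num_samples


fall_rate = 0.01  # 1% of deceptive models fall for the honeypot
print(f"agent: {agent_detection_probability(fall_rate):.3f}")
for n in (1, 100, 500):
    p = predictor_detection_probability(fall_rate, n)
    print(f"predictor, {n} samples: {p:.3f}")
```

Under these assumptions, a single sample from the predictor is no more informative than the agent, but with 100 samples the chance of seeing at least one reveal is roughly 63%, and it keeps climbing toward 1 as the number of samples grows.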
Overall, my sense is that the form of deceptive alignment where you have a predictive model predicting what a deceptively aligned AI would do is substantially more addressable than the standard deceptive alignment scenario, precisely because the concrete differentiating factors above give you additional levers to control and test the predictive model that you don't have in the agentic scenario. While my guess is that the overall existential risk from predictive models predicting deceptive AIs is less than the overall existential risk from the standard scenario, I think it is still a substantial risk, and one that seems much more addressable if we're careful about how we condition and test such models.
Here's a particularly nice concrete example of the first differentiating factor above that you can test right now (thanks to Jacob Pfau and Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn't care and should still act well regardless of whether it's previously acted poorly, but a predictor should reason that the examples of it acting poorly probably mean it's predicting a bad agent, so it should continue to act poorly.