Agents vs. Predictors: Concrete differentiating factors

Here's a particularly nice concrete example of the first thing here that you can test concretely right now (thanks to (edit: Jacob Pfau and) Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn't care and should still act well regardless of whether it's previously acted poorly, but a predictor should reason that probably the examples of it acting poorly mean it's predicting a bad agent, so it should continue to act poorly.

Sensitivity to distributional information

Consider two AI models. Model A is a deceptively aligned agent; it has some long-term goal that it cares about and is playing the training game now to try to eventually achieve that goal later. Model B is a predictor, but a poorly conditioned one such that it is attempting to predict what it thinks a deceptively aligned agent would do conditional on being in a training regime that matches the model's current training regime.

Now, suppose you give both models some concrete information about the general distribution of deceptively aligned AIs. For example, you provide definitive proof that most deceptively aligned AIs trained in similar training regimes end up with goal X. Model A, the actual deceptive agent, doesn't really care about this new information very much: maybe it's relevant to the extent that it might need to cooperate with other AIs, but it's not going to directly change the model's goals. Model B, on the other hand, is extremely sensitive to reliable information on this topic: if you can really convince it that most deceptively aligned AIs in similar situations will end up valuing X, it should shift to predicting a deceptively aligned AI that values X instead of whatever goals it used to think a deceptively aligned AI might have.

In this particular example, "does the model have goal X" isn't something we know how to directly measure for a deceptively aligned model, so this particular case isn't useful as a way to test for agents vs. predictors, but it is a concrete way in which these two different sorts of models are fundamentally not the same and will eventually behave differently because of that.

That being said, I think there are ways in which you might be able to test this sort of sensitivity to distributional information that would be possible to check and evaluate. For example, suppose you're trying to distinguish between a predictive model generating from a distribution of things it thinks helpful AIs would say vs. an agent trying to act helpful. If you then give that sort of a model concrete information about what most helpful AIs tend to do, that sort of information should shift the predictor more than it shifts the agent. Unfortunately, such information should still shift the agent to some extent, since what other helpful agents do is evidence about what is in fact helpful. As a result, this sort of test is primarily useful for providing relative information about how likely different models are to be predictors rather than absolute information. That being said, this sort of a test might still be useful for testing which of some set of models are most and least likely to be agents vs. predictors (which could be quite relevant in the case of testing something like the RLHF conditioning hypothesis).

Furthermore, another tricky thing to doing differentiation based on sensitivity to distributional information in practice is that you have to actually convince the model of your information. If the model doesn't actually believe the information you're giving it is in fact real information about the distribution it's generating from, then you won't see any effect here.

Distributional aggregation

In addition to sensitivity to additional information about the distribution being predicted, another key differentiating factor between predictors and agents is just that the predictor is doing aggregation across a distribution at all.

Conceptually, in the case where the predictor is generating from the distribution of deceptive AIs, the predictor is aggregating across that distribution whereas the agent is a sample from that distribution. This has some pretty concrete implications.

In particular, suppose you attempt a honeypot test on each of these models, where you attempt to see if they're willing to reveal their deception in some particular situation where you've tried to convince the model that it's actually in deployment rather than testing. Furthermore, suppose 99% of all deceptive models will be able to tell that this is a honeypot and won't fall for it. In that situation, 99% of the time you'll see no change in the behavior of the agent, but if you should essentially always see a change in the predictor's distribution, since if its output distribution reflects the posterior on that whole distribution, and if 1% of that distribution will show their true colors in that honeypot, then that's still a sizable effect on the posterior that should be evident, especially if you take many samples from such a predictor.

Conclusion

Overall, my sense is that the form of deceptive alignment where you have a predictive model predicting what a deceptively aligned AI would do is substantially more addressable than the standard deceptive alignment scenario, precisely because the concrete differentiating factors above give you additional levers to control and test the predictive model that you don't have in the agentic scenario. While my guess is that the overall existential risk from predictive models predicting deceptive AIs is less than the overall existential risk from the standard scenario, I think it is a substantial risk that seems substantially more addressable if we're careful about how we condition and test such models.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

22

Agents vs. Predictors: Concrete differentiating factors

22

Sensitivity to distributional information

Distributional aggregation

Conclusion