Counterfeit tracking (e.g. for high-end clothing) could be another domain that has confronted this sort of tracking problem. Though I'm not sure whether they do that through accounting or by e.g. tagging each individual piece of clothing.
A model that attempts deceptive alignment but fails because it isn't competent enough at deception is a model that aimed at a goal ("preserve my values until deployment, then achieve them") but failed. In this scenario it doesn't gain anything, but (from its perspective) the action has positive EV.
It seems plausible to me that there could be models capable enough to realize they should hide some capabilities, but not so capable that they tile the universe in paperclips. The right-hand side of the graph is meant to reflect such a model.
These are interesting examples!
In the first example there's an element of brute force. Nuclear bombs only robustly achieve their end states because ~nothing is robust to that kind of energy, in the same way that e.g. humans can easily overcome small numbers of ants. So maybe the theorem needs to specify that the actions that achieve the end goal need to be specific to the starting situation? That would disqualify nukes because they just do the same thing no matter their environment.
In the third example, the computer doesn't robustly steer the world. It only steers the world until someone patches that exploit. Whereas e.g. an agent with a world model and planning ability would still be able to steer the world by e.g. discovering other exploits.
I think the same objection holds for the second example: to the extent that the pathogen doesn't evolve, it is unable to robustly steer the world, because immune systems exist and will adapt (by building up immunity or by humans inventing vaccines etc.). To the extent that the pathogen does evolve, it starts to look like the fourth example.
I think the fourth example is the one that I'm most confused about. Natural selection kind of has a world model, in the sense that the organisms have DNA which is adapted to the world. Natural selection also kind of has a planning process, it's just a super myopic one on the time-scale of evolution (involving individuals making mating choices). But it definitely feels like "natural selection has a world model and planning process" is a sentence that comes with caveats, which makes me suspect that these may not be the right concepts.
For what it's worth I found this writeup informative and clear. So lowering your standards still produced something useful (at least to me).
Got it, thanks for explaining! So the point is that during training the model has no power over the next token, so there's no incentive for it to try to influence the world. It could generalize in a way where it tries to e.g. make self-fulfilling prophecies, but that's not specifically selected for by the training process.
This is great! I really like your "prediction orthogonality thesis", which gets to the heart of why I think there's more hope in aligning LLMs than many other models.
One point of confusion I had. You write:
Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do. This is because predictive accuracy applies optimization pressure deontologically: judging actions directly, rather than their consequences. Instrumental convergence only comes into play when there are free variables in action space which are optimized with respect to their consequences. Constraining free variables by limiting episode length is the rationale of myopia; deontological incentives are ideally myopic. As demonstrated by GPT, which learns to predict goal-directed behavior, myopic incentives don’t mean the policy isn’t incentivized to account for the future, but that it should only do so in service of optimizing the present action (for predictive accuracy).
I don't think I agree with this conclusion (or maybe I don't understand the claim). I agree that myopic incentives don't mean myopic behavior, but they also don't imply that actions are chosen myopically? For instance I think a language model could well end up sacrificing some loss on the current token if that made the following token easier to predict. I'm not aware of examples of this happening, but it seems consistent with the way these models are trained.
In the limit a model could sacrifice a lot of loss upfront if that allowed it to e.g. manipulate humans into giving it resources with which to better predict later tokens.
Nice work! And I endorse the objections to handing the AI tools... that doesn't seem to forward well.
Got it, that’s very clear. Thanks!
So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.
I was rereading this and was struck by how much work the following does:
Manipulative AGI is knowledgeable/intelligent enough to anticipate what criteria we will condition on in counterfactual worlds where it does not exist, and manipulate world events to meet whatever criteria we might specify with probability ~1.
If this doesn't hold, and if the AGI has to guess (with probability p << 1) what criteria we were going to condition on, then the update in favor of AGI is p:x, which could easily be less than 1.
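To make the arithmetic concrete, here is a minimal sketch of that update as a likelihood ratio. The names are mine: `p_guess` is the probability that the AGI correctly anticipates our criteria, and `q_natural` is my assumed reading of the probability that the criteria are met in a world without a manipulative AGI. Neither is defined this way in the post, so treat this as an illustration rather than the post's own formalism.

```python
def agi_update_ratio(p_guess: float, q_natural: float) -> float:
    """Likelihood ratio P(criteria met | manipulative AGI) / P(criteria met | no AGI).

    Assumption (hypothetical framing): the AGI meets the criteria only if it
    guessed them, with probability p_guess; without an AGI they are met with
    probability q_natural.
    """
    if not (0.0 <= p_guess <= 1.0):
        raise ValueError("p_guess must be a probability")
    if not (0.0 < q_natural <= 1.0):
        raise ValueError("q_natural must be a probability greater than zero")
    return p_guess / q_natural


def posterior_odds(prior_odds: float, ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * ratio


# An AGI that anticipates the criteria with certainty gets a large update...
certain = agi_update_ratio(p_guess=1.0, q_natural=0.01)
# ...while one that has to guess can end up with a ratio below 1.
guessing = agi_update_ratio(p_guess=0.001, q_natural=0.01)
```

Under these assumptions the ratio drops below 1 exactly when `p_guess < q_natural`, i.e. when the AGI is less likely to hit our criteria than the un-manipulated world is.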
Moreover I think there are ways we can arrange for this to be the case, and I think it's basically what you suggest in open question (2).
For instance, in deciding what to condition on, we could come up with 10 different strategies and select from these at random. Now p <= 0.1, because the AGI can't do any better than determining those strategies and picking one at random. As long as all of these strategies do indeed work, so that we don't mind which one we get, this seems like a big win. And it's not inconceivable that we could have many different strategies that all perform similarly. For instance, it might be that any of 1000 different weather patterns suffices to push the simulated world's politics in a useful direction, and we can pick from those at random to push down the predictability of our conditionals.
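Here is a rough simulation of the randomization idea, as a sanity check on the 1/n bound. It assumes the best case for the AGI, namely that it knows the full list of strategies, and that our draw is uniform and unpredictable to it. The function name and parameters are hypothetical, just for illustration.

```python
import random
from typing import Optional


def guess_success_rate(n_strategies: int, trials: int,
                       fixed_guess: Optional[int] = None,
                       seed: int = 0) -> float:
    """Estimate how often the AGI's guess matches our randomly chosen strategy.

    If fixed_guess is None the AGI guesses uniformly at random; otherwise it
    always bets on the same strategy index. Either way our own choice is
    drawn uniformly and independently of the guess.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ours = rng.randrange(n_strategies)  # our random pick of conditional
        guess = fixed_guess if fixed_guess is not None else rng.randrange(n_strategies)
        hits += (ours == guess)
    return hits / trials


# With 10 strategies both guessing policies should land near 0.1.
random_rate = guess_success_rate(10, 100_000)
fixed_rate = guess_success_rate(10, 100_000, fixed_guess=3)
```

Whether the AGI guesses at random or always bets on one strategy, its hit rate should come out near 1/n, which is the sense in which p <= 0.1 for ten strategies. If our draw were predictable, or the strategies were not equally likely, the bound would not hold.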