Thanks for clarifying. Preventing escape seems like promising way to prevent these sorts of problems. On the other hand I'm having trouble imagining ways in which we could have sensors that can pick up on whether a model has escaped. (Maybe info-sec people have thought more about this?)
Thanks for writing this! I've found the post very interesting. I had a question/comment regarding this:
In practice, for many tasks we might want AIs to accomplish, knowing about all concrete and clear cheap-to-measure short-term outcomes will be enough to prevent most kinds of (low-stakes) oversight failures. For example, imagine using AIs to operate a power plant, where the most problematic failure modes are concrete short-term outcomes such as not actually generating electricity (e.g. by merely buying and selling electricity instead of actually producing it or by hacking electricity meters).
It seems like an additional issue here is that the AI could be messing with things beyond the system we are measuring. As an extreme example, the AI could become a "power plant maximizer" that takes over the world in order to protect the power plant from shut down. It seems like this will always be a risk because we can only realistically monitor a small part of the world. Do you have thoughts on this?
So, we can fine-tune a probe at the last layer of the measurement predicting model to predict if there is tampering using these two kinds of data: the trusted set with negative labels and examples with inconsistent measurements (which have tampering) with positive labels. We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).
This section confuses me. You say that this probe learns to distinguish fake positives from real positives, but isn't it actually learning to distinguish real negatives and fake positives, since that's what it's being trained on? (Might be a typo.)
to determine whether your models are capable of gradient hacking, do evaluations to determine whether your models (and smaller versions of them) are able to do various tasks related to understanding/influencing their own internals that seem easier than gradient hacking.
What if you have a gradient hacking model that gradient hacks your attempts to get them to gradient hack?
Are you mostly hoping for people to come up with new alignment schemes that incorporate this (e.g. coming up with proposals like these that include a meta-level adversarial training step) or are you also hoping that people start actually doing meta-level adversarial evaluation of there existing alignment schemes (e.g. Anthropic tries to find a failure mode for whatever scheme they used to align Claude).
I'm interested in the relation between mechanistic anomaly detection and distillation. In theory, if we have a distilled model, we could use it for mechanistic anomaly detection: for each input x, we would check the degree to which the original model's output differs from the distilled model. If the difference is too great, we flag it as an anomaly and reject the output.
Let's say you have your original model M and your distilled model m along with some function d to quantify the difference between two outputs. If you are doing distillation, you would always just output m(x). If you are doing mechanistic anomaly detection, you output M(x) if d(M(x)−m(x)) is below some threshold and you output nothing otherwise. Here, I can see three differences between distillation and mechanistic anomaly detection:
Overall, distillation just seems better than mechanistic anomaly detection in this case? Of course mechanistic anomaly detection could be done without a distilled model, but whenever you have a distilled model, it seems beneficial to just use it rather than running mechanistic anomaly detection.
E.g. you observe that two neurons of the network always fire together and you flag it as an anomaly when they don't.
Cool work! It seems like one thing that's going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar.
Broadly speaking, it trains language models on a set of documents A as well as another set of documents that require using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren't used in B. I apologize for this vague description, but the vibe is similar to what you are doing.
The story you sketched reminds me of one of claims Robin Hanson makes in The Elephant in the Brain. He says that humans have evolved certain adaptations, like unconscious facial expressions, that make them bad at lying. As a result, when humans do something that's socially unacceptable (e.g. leaving someone because they are low-status) our brain makes us believe we are doing something more socially acceptable (e.g. leaving someone because you don't get along).
So humans have evolved imperfect adaptations to make us less deceptive along with workarounds to avoid those adaptations.
How does shard theory explain romantic jealousy? It seems like most people feel jealous when their romantic partner does things like dancing with someone else or laughing at their jokes. How do shards like this form from simple reward circuitry? I'm having trouble coming up with a good story of how this happens. I would appreciate if someone could sketch one out for me.