CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
I agree in principle, but as far as I know, no interp explanation that has been produced explains more than something like 20-50% of the (tiny) parts of the model it's trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.
This is why AI control research usually assumes that none of the methods you described work, and relies on black-box properties that are more robust to this kind of optimization pressure (mostly "the AI can't do X").
I agree with most of this, thanks for saying it. I've been dismayed for the last several years by the continued, unreasonable emphasis on interpretability techniques as a strategy for safety.
My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)
Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn't obviously help much with those, which makes it a worse target for research effort.
Ryan agrees; the main thing he means by "behavioral output" is what you're saying: an actually really dangerous action.
I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.
A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬
I believe there is also an argument that the AI safety community is currently very under-indexed on research into future scenarios where the assumption that AI operators take baseline safety precautions against loss of control does not hold.
I think you're mixing up two things: the extent to which we consider the possibility that AI operators will be very incautious, and the extent to which our technical research focuses on that possibility.
My research mostly focuses on techniques that an AI developer could use to reduce the misalignment risk posed by deploying and developing AI, given some constraints on how much value they need to get from the AI. Given this, I basically definitionally have to be imagining the AI developer trying to mitigate misalignment risk: why else would they use the techniques I study?
But that focus doesn't mean I'm totally sure all AI developers will in fact use good safety techniques.
Another disagreement is that I think that we're better off if some AI developers (preferably more powerful ones) have controlled or aligned their models, even if there are some misaligned AIs being developed without safeguards. This is because the controlled/aligned models can be used to defend against attacks from the misaligned AIs, and to compete with the misaligned AIs (on both acquiring resources and doing capabilities and safety research).
I am not sure I agree with this change at this point. How do you feel now?
I think it's conceivable for non-deceptively-aligned models to gradient hack, right?