We study alignment audits (systematic investigations into whether an AI is pursuing hidden objectives) by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting...
This is a more technical follow-up to the last post, putting precise bounds on when regressional Goodhart does and does not lead to failure. We'll first show conditions under which optimization for a proxy fails, and then conditions under which it succeeds. (The second proof will be substantially easier.)

Related work...
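To make the fail/succeed dichotomy concrete before the details, here is a minimal sketch of a standard regressional-Goodhart setup. The notation ($V$ for true utility, $X$ for an independent error term, a threshold $t$ modeling optimization pressure) is our assumption and may not match the post's own definitions.

```latex
% A minimal sketch of a standard regressional-Goodhart setup.
% Notation (V, X, U, t) is assumed here and may not match the post.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Let $V$ be true utility and $X$ an independent error term, so the proxy is
\[
  U = V + X, \qquad V \perp X .
\]
Model optimizing the proxy as conditioning on $U \ge t$ and letting
$t \to \infty$. The question is which limit holds:
\[
  \lim_{t \to \infty} \mathbb{E}\left[ V \mid V + X \ge t \right] = \infty
  \quad \text{(optimization for the proxy succeeds),}
\]
\[
  \lim_{t \to \infty} \mathbb{E}\left[ V \mid V + X \ge t \right] = \mathbb{E}[V]
  \quad \text{(optimization fails: no gain in true utility).}
\]
\end{document}
```

Conditioning on a threshold is one common way to formalize selection pressure; the tail behavior of $X$ relative to $V$ is typically what separates the two cases.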
Thanks to Aryan Bhatt, Eric Neyman, and Vivek Hebbar for feedback. This post gets more math-heavy as it goes; we convey some intuitions and overall takeaways first, and then get more detailed. Read for as long as you're getting value out of it!

TLDR

How much should you optimize for a...