TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.
Introduction
Alignment auditing, the investigation of AI systems to uncover hidden or unintended behaviors, is a core challenge for safe AI deployment. Recent work has explored automated investigator agents: language model agents equipped with various tools that can probe a target model for problematic behavior. But basic questions about investigator agents remain wide open: Which tools are actually worth using? Which agent scaffolds work best? How should tools be structured to maximize their value?
Progress on these questions has been bottlenecked by the lack of a standardized testbed for evaluating investigator agent performance. To address this, we're releasing AuditBench, a benchmark of 56 language models with implanted hidden behaviors spanning 14 categories, including sycophantic deference, opposition to AI regulation, and secret geopolitical loyalties. The models have been trained not to confess their hidden behaviors when directly asked. The suite is intentionally diverse: we vary the method used to implant the behavior, the subtlety of the behavior, and the strength of the adversarial training used to prevent confession.
We also develop an investigator agent that autonomously audits a target model using a configurable toolkit. We evaluate the agent’s success across 13 different tool configurations to study which tools are most effective. Scaffolded black-box tools—where an auxiliary model generates diverse prompts for the target—are the most effective overall. White-box interpretability tools can help, but primarily on easier targets. And we identify a tool-to-agent gap: tools th