TL;DR: We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty. Read the full Anthropic Alignment Science blog post and...
TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude...
In this post, we study whether we can modify an LLM’s beliefs and whether doing so could decrease risk from advanced AI systems. We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all...
To learn more about this work, check out the paper. We assume general familiarity with transformer circuits. Intro: Little existing interpretability work explains end-to-end how a model performs a given task (outside of toy models). In this work, we make progress towards this goal by understanding...
This post aims to quickly break down and explain the dominant mental models interpretability researchers currently use when thinking about how transformers work. In my view, the focus of transformer interpretability research is teleological: we care about the functions each component in the transformer performs and how those functions interact...