TL;DR: We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty. Read the full Anthropic Alignment Science blog post and...
TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude...
In this post, we study whether we can modify an LLM’s beliefs and whether doing so could decrease risk from advanced AI systems. We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all...
To learn more about this work, check out the paper. We assume general familiarity with transformer circuits. Intro: Little existing interpretability work explains end-to-end how a model performs a given task (outside of toy models). In this work, we make progress towards this goal by understanding...
This post aims to quickly break down and explain the dominant mental models interpretability researchers currently use when thinking about how transformers work. In my view, the focus of transformer interpretability research is teleological: we care about the functions each component in the transformer performs and how those functions interact...