Thanks to Roger Grosse, Cem Anil, Sam Bowman, Tamera Lanham, and Mrinank Sharma for helpful discussion and comments on drafts of this post. Throughout this post, "I" refers to Ansh (though Buck, Ryan, and Fabien helped substantially with drafting and revising it). Two approaches to addressing weak...
This is a research update on some work that I’ve been doing on Scalable Oversight at Anthropic, based on the original AI safety via debate proposal and a more recent agenda developed at NYU and Anthropic. The core doc was written several months ago, so some of it is likely...
TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular task and model size...
* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here.

Introduction

In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...
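As background for what that post investigates: an induction head implements a simple copying rule. When the current token has appeared earlier in the context, the head attends to the token that followed that earlier occurrence and predicts it (the pattern [A][B] ... [A] → [B]). Here is a minimal, hypothetical Python sketch of that rule for orientation; the function name and tokenization are illustrative, not taken from the post, and real induction heads implement this only approximately:

```python
def induction_prediction(tokens):
    """Sketch of the induction rule: if the current token A appeared
    earlier in the context, predict the token B that followed the most
    recent earlier occurrence (pattern [A][B] ... [A] -> [B])."""
    current = tokens[-1]
    # Scan backwards over earlier occurrences whose successor is still
    # in the prior context.
    for i in range(len(tokens) - 3, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: the rule makes no prediction

assert induction_prediction(["the", "cat", "sat", "the"]) == "cat"
```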
* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to an algorithmic model. The results are also summarized here.

Introduction

In earlier work (unpublished), we dissected a tiny transformer that classifies whether a string of parentheses is balanced or unbalanced.[1] We hypothesized the...
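The behavior that transformer is trained to capture can be written down exactly. As a point of reference, here is a standard depth-counter check for balance; this is a sketch for orientation, not a claim about the algorithm the model itself implements:

```python
def is_balanced(s: str) -> bool:
    """Reference check: a paren string is balanced iff the running depth
    never dips below zero and ends at exactly zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:      # a ")" closed a paren that was never opened
            return False
    return depth == 0      # every "(" must be closed by the end

assert is_balanced("(())()") and not is_balanced("())(")
```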
* Authors sorted alphabetically. An appendix to this post.

1 More on Hypotheses

1.1 Example behaviors

As mentioned above, our method allows us to explain quantitatively measured model behavior, operationalized as the expectation of a function f on a distribution D. Note that no part of our method distinguishes between...
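Concretely, a "behavior" in this sense is a scalar of the form E_{x ~ D}[f(x)]; for instance, f might be the model's loss on input x and D the data distribution. Here is a minimal Monte Carlo sketch of how such a quantity is estimated; the names `estimate_behavior` and `sample_from_D` are illustrative, not from the appendix:

```python
import random

def estimate_behavior(f, sample_from_D, n=10_000):
    """Monte Carlo estimate of E_{x ~ D}[f(x)]: draw n samples from D
    and average f over them."""
    return sum(f(sample_from_D()) for _ in range(n)) / n

# Toy example: f(x) = x**2 with D = Uniform{0, ..., 9}.
# The exact expectation is 285 / 10 = 28.5.
est = estimate_behavior(lambda x: x ** 2, lambda: random.randrange(10))
```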