I am currently a MATS 8.0 scholar studying mechanistic interpretability with Neel Nanda. I’m also a postdoc in psychology/neuroscience. Perhaps my most notable paper analyzed the last 20 years of psychology research, searching for trends in which papers do and do not replicate. I have some takes on statistics. tl;dr...
This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models. tl;dr

* Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
* Here, we sketch hypotheses for...
This post is adapted from our recent arXiv paper. Paul Bogdan and Uzay Macar are co-first authors on this work. TL;DR

* Interpretability of chains-of-thought (CoTs) produced by LLMs is challenging:
  * Standard mechanistic interpretability studies a single token's generation, but CoTs are sequences of reasoning steps that use thousands...