This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here.
There’s been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model’s self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true facts about itself; genuine introspection must causally depend on (i.e., be “grounded” in) the internal...
This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work and is interested in finding more potential collaborators; if you're excited by this work, reach out!
It would be useful for safety purposes if AI models could accurately report on their internal processes and on the factors driving their behavior. This is the motivation behind, e.g., attempts to measure introspection or CoT faithfulness in LLMs. Recent work has shown that frontier models can sometimes accurately report relatively simple, qualitative aspects of their internal choice processes, such as whether they had been fine-tuned to be risk-seeking, had received a concept injection, or had used a hint to solve a problem (although their accuracy has...