Dillon Plunkett

Tests of LLM introspection need to rule out causal bypassing

This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here. There’s been growing interest in testing whether LLMs can introspect on their internal states or processes....

Nov 28, 2025 · 49

Self-interpretability: LLMs can describe complex internal processes that drive their decisions

This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work and is looking for more potential collaborators. If you're excited by this work, reach out! It would be useful for safety purposes if...

Nov 14, 2025 · 12

AI ALIGNMENT FORUM