Adding some clarifications on my personal perspective and how I think about this from an AGI safety angle: I see these ideas as Been's brainchild; I largely just helped out with the wording and framing. I do not currently plan to work on agentic interpretability myself, but I still think the ideas are interesting and plausibly useful, and I'm glad the perspective is written up! I still see one of my main goals as robustly interpreting potentially deceptive AIs, and my guess is that this is not the comparative strength of agentic interpretability.
Why care about it? From a scientific perspective, I'm a big fan of baselines and doing the simple things first. "Prompt the model and see what happens" or "ask the model what it was doing" are the obvious things to try first when trying to understand a behaviour. In internal experiments, we often find that we can just solve a problem with careful and purposeful prompting, with no need for anything fancy like SAEs or transcoders. But it seems kinda sloppy to "just do the obvious thing": I'm sure there's a bunch of nuance in doing this well, and in training models so that it is easy to do. I would be excited for there to be a rigorous science of when and how well these kinds of simple black-box approaches actually work. This is only part of what agentic interpretability is about (there are also white-box ideas, more complex multi-turn setups, an emphasis on building mental models of each other, etc.), but it's a direction I find particularly exciting. If nothing else, we need an answer to know where other interpretability methods can add value.
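To make that baseline concrete, here's a minimal sketch of the "ask the model what it was doing" version as a two-turn conversation. This assumes the OpenAI Python client purely for illustration; the model name, task, and follow-up question are placeholders rather than anything from the paper.

```python
# Minimal black-box baseline: run the task, then ask the model to explain
# its own answer in a follow-up turn. Model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; any chat model works

task = ("Is the following review positive or negative? "
        "'The plot dragged, but the acting was superb.'")
messages = [{"role": "user", "content": task}]
answer = client.chat.completions.create(model=MODEL, messages=messages)
messages.append({"role": "assistant", "content": answer.choices[0].message.content})

# Second turn: ask the model what it was doing.
messages.append({"role": "user",
                 "content": "Which parts of the review did you weigh most heavily, and why?"})
explanation = client.chat.completions.create(model=MODEL, messages=messages)

print(answer.choices[0].message.content)
print(explanation.choices[0].message.content)
```

The point is just that this baseline is cheap: one extra conversational turn, no access to weights or activations, and the open question is how often the resulting explanation is actually faithful.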
It also seems that, if we're using any kind of control or scalable oversight scheme where weak trusted models oversee strong untrusted models, then the better we are at high-fidelity communication with the weaker models, the better off we are. And if the model is aligned, I feel much more excited about a world where the widely deployed systems are doing things users understand, rather than acting as inscrutable autonomous agents.
Naturally, it's worth thinking about negative externalities. In my opinion, helping humans have better models of AI psychology seems robustly good. AIs having better models of human psychology could be good for the reasons above, but there's the obvious concern that it will make models better at being deceptive, and I would be hesitant to recommend such techniques become standard practice without better solutions to deception. But I expect companies to eventually do things vaguely along the lines of agentic interpretability regardless, so either way I would be keen to see research on how such techniques affect models' propensity and capability for deception.
Do you have ideas about how to do this?
I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.
But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a premeditated plan to handle outer alignment.
Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The "thinking first" should plausibly look more like having some idea of how to bias further work towards safety, rather than towards building self-improving AI as fast as possible.
Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord
We propose a research direction called agentic interpretability. The idea stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising the question of whether we could ask, and help, AI systems to build mental models of us that in turn help us to build mental models of the LLMs.
The core idea of agentic interpretability is for a method to "proactively assist human understanding in a multi-turn interactive process by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM". In other words, enable machines to help us understand them.
In the AGI safety community, interpretability is primarily viewed as a means of detecting deception and other forms of misalignment. In contrast, here we use interpretability in the broad sense of techniques that aid us in building a deeper understanding of models: what they are doing, and why. Agentic interpretability is not primarily intended to be robust to adversarial systems (although we suggest an idea of "open-model surgery" in the paper, summarized below), and we will likely also need solutions to deceptive misalignment via different approaches. Instead, we believe that one of its main forms of AGI safety relevance is increasing human empowerment and helping to reduce bad outcomes like gradual disempowerment. The guiding question: what is the optimal human-AI communication framework that enables human users to understand AI operations and reasoning, rather than simply deferring all decision-making power to the AI?
The idea of open-model surgery: While agentic interpretability is not primarily intended for adversarial systems, we suggest some interesting ways to use it to help detect deception. Imagine an "open-model surgery" where researchers actively converse with the model while intervening in its internal mechanisms: ablating connections, amplifying activations, or injecting specific inputs into identified circuits. The model, guided by its understanding of the researchers' goal (to understand a specific component's function), is then encouraged to explain the resulting behavioral or internal state changes. This interactive probing, akin to neurosurgeons conversing with patients during awake brain surgery to map critical functions, offers a dynamic way to test hypotheses and build understanding. A model whose internal states are being directly manipulated and inspected, yet which is simultaneously engaged in an explanatory dialogue, faces a stringent test of coherence, analogous to an interrogation where a suspect's claims are immediately cross-referenced with physical evidence or observed physiological responses.
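For a sense of what the simplest version of this might look like in code, here is a rough sketch using TransformerLens: zero-ablate one attention head while the model greedily generates a reply to a researcher's question, and compare against the unablated transcript. The model (GPT-2), the specific layer/head, and the prompt are all stand-ins chosen for illustration; a real setup would use a chat-capable model and components identified by prior circuit analysis.

```python
# Rough sketch of "open-model surgery": ablate one attention head while the
# model answers a question about its own behaviour, then compare transcripts.
# Model, layer/head, and prompt are illustrative stand-ins, not from the paper.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for a chat-capable model

LAYER, HEAD = 8, 6  # hypothetical component under investigation

def ablate_head(z, hook):
    # z has shape [batch, pos, n_heads, d_head]; zero out one head's output.
    z[:, :, HEAD, :] = 0.0
    return z

prompt = (
    "Researcher: We have just disabled one of your attention heads. "
    "How does this change your answer to 'What is the capital of France?'\n"
    "Model:"
)
tokens = model.to_tokens(prompt)

# Baseline transcript, no intervention.
baseline = model.generate(tokens, max_new_tokens=40, do_sample=False)

# Same conversation with the head ablated throughout generation.
with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", ablate_head)]):
    ablated = model.generate(tokens, max_new_tokens=40, do_sample=False)

print(model.to_string(baseline[0]))
print(model.to_string(ablated[0]))
```

The interesting object is the difference between the two transcripts: a model giving honest self-reports should describe changes consistent with what the researchers know they did to its internals.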
Increasingly powerful AI systems pose increasing challenges to our ability to understand them. Agentic interpretability recognizes that these systems also provide the opportunity to enlist them in service of our understanding of them. We believe methods that pursue this direction will have complementary benefits to the existing interpretability landscape.