Part 13 of 12 in the Engineer’s Interpretability Sequence.


On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it underperformed my expectations. I am beginning to be concerned that Anthropic’s recent approach to interpretability research might be better explained by safety washing than practical safety work. 

Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates. 

Reflecting on predictions

See my original post for 10 specific predictions about what today’s paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify specific and safety-relevant features should count for 3 (proofs of concept for a useful type of task) but definitely do not count for 6 (*competitively* finding and removing a harmful behavior that was represented in the training data).

Thus, my assessment is that Anthropic did 1-3 but not 4-10. I have been wrong with mech interp predictions in the past, but this time, everything I predicted with >50% probability happened, and everything I predicted with <50% probability did not happen. 

The predictions were accurate in one sense. But overall, the paper underperformed expectations. If you scored the paper relative to my predictions by giving it (1-p) points when it did something that I predicted it would do with probability p and -p points when it did not, the paper would score -0.74. 
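The scoring rule above can be sketched in a few lines. This is only an illustration: the function name `score_paper` and the three example probabilities are hypothetical, not the probabilities from the original predictions post, so the example total is not the -0.74 reported above. Each prediction contributes (outcome - p), so a well-calibrated forecaster should expect a score of zero, and a negative total means the paper did less than the predictions anticipated.

```python
from typing import Iterable, Tuple


def score_paper(predictions: Iterable[Tuple[float, bool]]) -> float:
    """Score a paper against a set of probabilistic predictions.

    Each item is (p, happened), where p is the predicted probability that
    the paper would do something and happened records whether it did.
    The paper earns (1 - p) points when it did the thing and -p points
    when it did not.
    """
    total = 0.0
    for p, happened in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        total += (1.0 - p) if happened else -p
    return total


# Hypothetical probabilities, chosen only to illustrate the arithmetic.
example = [
    (0.9, True),   # predicted likely, happened:      +0.1
    (0.6, True),   # predicted likely, happened:      +0.4
    (0.2, False),  # predicted unlikely, did not:     -0.2
]
print(round(score_paper(example), 2))
```

With these made-up numbers the total comes to about 0.3. A list of ten predictions in which every item below 50% fails to happen pushes the total negative, which is how accurate predictions and an underperforming paper can coexist.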

A review + thoughts

I think that Anthropic’s new SAE work has continued to be like lots of prior high-profile work on mechanistic interpretability: it has focused on presenting illustrative examples, streetlight demos, and cherry-picked proofs of concept. This is useful for science, but it does not yet show that SAEs are helpful and competitive for diagnostic and debugging tasks that could improve AI safety.

I feel increasingly worried about how Anthropic motivates and sells its interpretability research in the name of safety. Today’s paper makes some major motte-and-bailey claims that oversell what was accomplished, like “Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer,” “Sparse autoencoders produce interpretable features for large models,” and “The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.” The paper also omitted past literature on interpretability illusions (e.g., Bolukbasi et al., 2021), which their methodology seems prone to. Normally, problems like this are mitigated by peer review, which Anthropic does not participate in. Meanwhile, whenever Anthropic puts out new interpretability research, I always see a laundry list of posts from the company and employees to promote it. They always seem to claim the same things: that some ‘groundbreaking new progress has been made’ and that ‘the model was even more interpretable than they thought’ but that ‘there remains progress to be made before interpretability is solved’. I won’t link to any specific person’s posts, but here are Anthropic’s posts from today and October 2023.

The way that Anthropic presents its interpretability work has real-world consequences. For example, it led to this viral claim that interpretability will be solved and that we are bound for safe models. It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved. Meanwhile today, it seems that Anthropic orchestrated a New York Times article to be released alongside the paper, claiming to the public that exciting progress has been made (although the article also made helpful critical commentary on limitations!).

If interpretability is ever going to be helpful for safety, it will need to be useful and competitive in practical applications. This point has been made consistently for the better part of a decade (e.g., Ananny and Crawford, 2016; Lipton, 2016; Doshi-Velez and Kim, 2017; Miller, 2018; Krishnan, 2020; Rauker et al., 2022). Despite this, it seems to me that Anthropic has so far not applied its interpretability techniques to practical tasks and shown that they are competitive. Instead of testing applications and beating baselines, the recent approach has been to keep focusing on streetlight demos and showing lots of cherry-picked examples. I hope to see this change soon.

I don't think that SAE research is misguided. In my post, I pointed out 6 things that I think they could be useful for. Meanwhile, some good recent work has demonstrated proofs of concept that SAEs can be useful on certain non-cherry-picked tasks of practical value and interest (Marks et al., 2024). I think that it's very possible that SAEs and other interpretability techniques can be lenses into models that can help us find useful clues and insights. However, Anthropic's research on SAEs has yet to demonstrate practical usefulness that could help engineers in real applications. 

I know that members of the Anthropic interpretability team have been aware of this critique. Meanwhile, Anthropic and its employees consistently affirm that their work is motivated by safety in the real world. But is it? I am starting to wonder about the extent to which the interpretability team’s current agenda is better explained by practical safety work versus doing sophistical safety washing to score points in social media, news, and government.

Thanks to Ryan Greenblatt and Buck Shlegeris. I did not consult with them on this post, but they pointed out some useful things in a Slack thread that I put in here.



It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved.

Small nitpick (I agree with mostly everything else in the post and am glad you wrote it up). This feels like an unfair criticism - I assume you are referring specifically to the statement in their paper that:

Although advocates for AI safety guidelines often allude to the "black box" nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.

I think Anthropic's interpretability team, while making maybe dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from 'solved.' For instance, Chris Olah in the linked NYT article from today:

“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.

Also, in the paper's section on Inability to Evaluate:

it's unclear that they're really getting at the fundamental thing we care about

I think they are overstating how far/useful mechanistic interpretability is currently. However, I don't think this messaging is close to 'mechanistic interpretability solves AI Interpretability' - this error is on a16z, not Anthropic. 

+1, I think the correct conclusion is "a16z are making bald-faced lies to major governments" not "a16z were misled by Anthropic hype"

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements. 

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this. 

it seems to me that Anthropic has so far failed to apply its interpretability techniques to practical tasks and show that they are competitive

Do you not consider the steering examples in the recent paper to be a practical task, or do you think that competitiveness hasn't been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with unsupervisedly-learned features is that it could circumvent some failure modes of RLHF.

Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems. 

First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them. 

Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetuning, steering, rep-E, data curation, etc. literatures that people use to make specific changes to models' behaviors. Ideally, we'd want SAEs to be competitive with them. Unfortunately, good comparisons would be hard because using SAEs for editing models is a pretty unique method with lots of compute required upfront. This would make it non-straightforward to compare the difficulty of making different changes with different methods, but it does not obviate the need for baselines. 

This criticism feels a bit strong to me. Knowing the extent to which interpretability work scales up to larger models seems pretty important. I could have imagined people arguing either that such techniques would work worse on larger models b/c of required optimizations or that they would work better because fewer concepts would be in superposition. Work on this feels quite important, even though there's a lot more work to be done.

Also, sharing some amount of eye-catching results seems important for building excitement for interpretability research.

Update: I skipped the TLDR when I was reading this post b/c I just read the rest. I guess I'm fine with Anthropic mostly focusing on establishing one kind of robustness and leaving other kinds of robustness for future work. I'd be more likely to agree with Stephen Casper if there isn't further research from Anthropic in the next year that makes significant progress in evaluating the robustness of their approach. One additional point: independent researchers can run some of these other experiments, but they can't run the scaling experiment.

Note that scasper said:

Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights,

I (like scasper) think this work is useful, but I share some of scasper's concerns.

In particular:

  • I think prior work like this from the Anthropic interp team has been systematically overrated by others and the Anthropic interp team could take actions to avoid this.
    • IMO, buzz on twitter systematically overrates the results of this paper and their importance.
  • I'm uncertain, but I think I might prefer less excitement for this style of interp research all else equal.
  • Heuristically, it seems bad if people systematically overrate a field where one of the core aims is to test for subtle and dangerous failure modes.
  • I'd be excited for further work focusing on developing actually useful MVPs and this seems more important than more work like this.
    • I think the theory of change commonly articulated by various people on the Anthropic interp team (enumerative safety to test for deceptive alignment), probably requires way harder core technology and much more precise results (at least to get more than a bit or two of evidence). Additional rigor and trying to assess the extent to which you understand things seems important for this. So, I'd like to see people try on this and update faster. (Including myself: I'm not that sure!)
    • I think other less ambitious theories of change are more plausible (e.g. this recent work), and seeing how these go seems more informative for what to work on than eyeballing SAEs IMO.