I do not worry a lot about this. It would be a problem. But some methods are model-agnostic and would transfer fine. Some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank-one editing are more general principles that are not.
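To illustrate why rank-one editing is not tied to any one architecture, here is a minimal sketch of a generic rank-one update on a single linear map. It assumes a hypothetical weight matrix `W`, key vector `k`, and target value `v`, all random here, and it uses the plain least-change update. It is not ROME's actual update rule, only the general principle.

```python
import numpy as np

def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' @ k == v.

    The change W' - W is an outer product, hence has rank at most one,
    and inputs orthogonal to k are mapped exactly as before.
    """
    k = np.asarray(k, dtype=float)
    v = np.asarray(v, dtype=float)
    residual = v - W @ k  # what the current map gets wrong on k
    return W + np.outer(residual, k) / (k @ k)

# Hypothetical toy data: any linear layer in any architecture would do.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
k = rng.normal(size=3)
v = rng.normal(size=4)

W_new = rank_one_edit(W, k, v)

# An input orthogonal to k, to check that the edit leaves it alone.
u = rng.normal(size=3)
u_orth = u - (u @ k) / (k @ k) * k
```

Nothing in this depends on attention or MLP blocks. It only needs a linear map and a choice of which input should produce which output.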
Thanks for the comment. I appreciate how thorough and clear it is.
Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.
Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem. While knowing what will make the model behave badly even if it seems to be aligned is more of an inner...
Thanks. See also EIS VIII.
Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.
Yes, it does show the ground truth.
The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.
The MNIST CNN was trained only on the 50k training examples.
I did not guarantee that the models had perfect train accuracy. I don't believe they did.
I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non-mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data.
At the end of the day, I (and possibly Neel) will have the final say in things.
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?
Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts which narrow down discussion specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.
Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.
There are not that many that I don't think are fungible with interpretability work :)
But I would describe most outer alignment work to be sufficiently different...
Interesting to know that about the plan. I have assumed that REMIX was in large part about getting more people into this type of work. But I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?
I think that my personal thoughts on capabilities externalities are reflected well in this post.
I'd also note that this concern isn't unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think that the downside risks of interpretability tools are most likely lower than those of stuff like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point in time, so I would expect that most interpretability researchers have simil...
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?
Thanks! I discuss in the second post of the sequence why I lump ARC's work in with human-centered interpretability.
This is an interesting point. But I'm not convinced, at least immediately, that this isn't likely to be largely a matter of AI governance.
There is a long list of governance strategies that aren't specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI specific examples:
establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply cha
I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.
Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing. And one could use a similar approach for significantly different applications.
We do hope that these are just special cases and that our methods will resolve a broader set of problems.
I hope so too. And I would expect this to be the case for good solutions.
Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set ...
Nice post. I'll nitpick one thing.
In the paper, the approach was based on training a linear probe to differentiate between true and untrue question-answer pairs. I believe I mentioned to you at one point that "contrastive" seems more precise than "unsupervised" to describe this method. To carry out an approach like this, it's not enough to have or create a bunch of data. One needs the ability to reliably find subsets of the data that contrast. In general, this would be as hard as labeling. But when using boolean questions paired with "yes" and "no" a...
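A minimal sketch of a linear probe on contrast pairs may make the "contrastive" framing concrete. Everything here is an assumption for illustration: the activations are synthetic stand-ins rather than real hidden states, `truth_dir` is a made-up direction, and the probe is fit with labels for simplicity, so this is not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16

# Hypothetical stand-ins for hidden states. In practice these would come from
# running a model on each boolean question paired with "yes" and with "no".
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
yes_is_true = rng.integers(0, 2, size=n)   # 1 if "yes" is the correct answer
sign = 2 * yes_is_true - 1                 # +1 or -1
shared = rng.normal(size=(n, d))           # question content common to both halves
x_yes = shared + sign[:, None] * truth_dir + 0.3 * rng.normal(size=(n, d))
x_no = shared - sign[:, None] * truth_dir + 0.3 * rng.normal(size=(n, d))

# The contrast: subtracting the two halves of each pair cancels the shared
# question content and leaves mostly the part that differs between them.
diff = x_yes - x_no

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

train, test = slice(0, 300), slice(300, n)
w, b = fit_linear_probe(diff[train], yes_is_true[train])
pred = (diff[test] @ w + b > 0).astype(int)
accuracy = (pred == yes_is_true[test]).mean()
```

The step that matters for the point above is building `x_yes` and `x_no` in the first place. The probe is easy once the data comes in contrasting pairs, and constructing those pairs is where the work goes.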
Hi Paul, thanks. Nice reading this reply. I like your points here.
Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful in today's vision or language models seems surprising to me. Is there anything you can point me to that would help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?
No disagreements substance-wise. But I'd add that I think work to avoid scary autonomous weapons is likely at least as important as recommender systems. If this post's reason #1 were the only reason for working on neartermist AI stuff, then it would probably be like a lot of other very worthy but likely not top-tier impactful issues. But I see it as emphasis-worthy icing on the cake given #2 and #3.
I think this seems really cool. I'm excited about this. The kind of thing that I would hope to see next is a demonstration that this method can be useful for modifying the transformer in a way that induces a predictable change in the network's behavior. For example, if you identify a certain type of behavior like toxicity or discussion of certain topics, can you use these interpretations to guide updates to the weights of the model that cause it to no longer say these types of things according to a classifier for them?
My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are.
I buy this value -- FV can augment exemplars. And I have never heard anyone ever say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less intuition would have made for a much stronger approach than what happened.