

This is an interesting point. But I'm not convinced, at least not immediately, that this isn't largely a matter of AI governance.

There is a long list of governance strategies that aren't specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI-specific examples:

establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply chains to slow things down, and avoiding arms races.

And I think that some of the things I mentioned for strategy 3 do too:

giving governments powers to rapidly detect and respond to firms doing risky things with TAI, hitting killswitches involving global finance or the internet, cybersecurity, and generally being more resilient to catastrophes as a global community.

So ultimately, I won't make claims about whether avoiding perpetual risk is mostly an AI governance problem or mostly a more general governance problem, but there are certainly a bunch of AI-specific things in this domain. I also think they might be a bit neglected relative to some of the strategy 1 stuff.

I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed. 

Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing.  And one could use a similar approach for significantly different applications. 


We do hope that these are just special cases and that our methods will resolve a broader set of problems.

I hope so too. And I would expect this to be the case for good solutions. 

Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set of environments or inputs that will make it do bad things. So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others. 

So I'm inclined to underline the key point of my original post. I want to emphasize the value of (1) engaging more with the rest of the community that doesn't identify themselves as "AI Safety" researchers and (2) being clear that we care about alignment for all of the right reasons. That said, this should be discussed with the appropriate amount of clarity, which was your original point.

Nice post. I'll nitpick one thing. 

In the paper, the approach was based on training a linear probe to differentiate between true and untrue question-answer pairs. I believe I mentioned to you at one point that "contrastive" seems more precise than "unsupervised" to describe this method. To carry out an approach like this, it's not enough to have or create a bunch of data. One needs the ability to reliably find subsets of the data that contrast. In general, this would be as hard as labeling. But when using boolean questions paired with "yes" and "no" answers, this is easy and might be plenty useful in general. In practice, though, I wouldn't expect it to be tractable to reliably get good answers to open-ended questions using a set of boolean ones in this way. Supervision seems useful too because it seems to offer a more general tool.
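To make the "contrastive" framing concrete, here is a minimal sketch of a linear probe fit on contrast pairs. It is not the paper's method: the hidden states are synthetic NumPy arrays standing in for model activations on (question + "yes") and (question + "no"), the helper names (`make_contrast_pairs`, `fit_linear_probe`, `probe_accuracy`) are hypothetical, and the probe here is trained with labels purely for illustration. The point is only that subtracting the two members of a pair cancels question-specific content and leaves the part that contrasts.

```python
import numpy as np


def make_contrast_pairs(n, dim, seed=0):
    """Synthetic stand-in for activations on (question + "yes") / (question + "no").

    Assumption: a single hidden "truth" direction is added for the true answer
    and subtracted for the false one. Real activations need not look like this.
    """
    rng = np.random.default_rng(seed)
    truth_dir = rng.normal(size=dim)
    truth_dir /= np.linalg.norm(truth_dir)
    labels = rng.integers(0, 2, size=n)        # 1 if "yes" is the true answer
    sign = 2.0 * labels - 1.0                  # map {0, 1} to {-1, +1}
    base = rng.normal(size=(n, dim))           # question content shared by both answers
    h_yes = base + sign[:, None] * truth_dir + 0.5 * rng.normal(size=(n, dim))
    h_no = base - sign[:, None] * truth_dir + 0.5 * rng.normal(size=(n, dim))
    return h_yes, h_no, labels


def fit_linear_probe(x, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, x, y):
    """Fraction of examples where the probe's sign matches the label."""
    return float((((x @ w + b) > 0).astype(int) == y).mean())


h_yes, h_no, labels = make_contrast_pairs(n=400, dim=16)
diffs = h_yes - h_no                           # shared question content cancels out
train, test = slice(0, 300), slice(300, 400)
w, b = fit_linear_probe(diffs[train], labels[train])
acc = probe_accuracy(w, b, diffs[test], labels[test])
print(f"held-out probe accuracy: {acc:.2f}")
```

Boolean questions make `diffs` cheap to build because the two completions are known in advance. For open-ended questions there is no such ready-made pair, which is where I'd expect the difficulty to show up.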

Hi Paul, thanks. Nice reading this reply. I like your points here.

Some of what I say here might reveal that I haven't kept up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful in today's vision or language models seems surprising to me. Is there anything you can point me to that would help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?

No disagreements substance-wise. But I'd add that I think work to avoid scary autonomous weapons is likely at least as important as work on recommender systems. If this post's reason #1 were the only reason for working on neartermist AI stuff, then it would probably be like a lot of other very worthy but likely not top-tier impactful issues. But I see it as emphasis-worthy icing on the cake given #2 and #3.

I think this seems really cool. I'm excited about this. The kind of thing that I would hope to see next is a demonstration that this method can be useful for modifying the transformer in a way that induces a predictable change in the network's behavior. For example, if you identify a certain type of behavior like toxicity or discussion of certain topics, can you use these interpretations to guide updates to the weights of the model that cause it to no longer say these types of things according to a classifier for them? 

My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are.

Making adversaries: 

Manual fine-tuning: 

Reverse engineering (I'd put an asterisk on these ones though because I don't expect methods like this to scale well to non-toy problems):