Arthur Conmy

Mechanistic Interpretability

Independent (Incoming PhD Student)



I think this point was really overstated. I get the impression the rejected papers were basically turned into the arXiv format as fast as possible, and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, and apparently were not rejected.

Does the “ground truth” show the correct label function on 100% of the training and test data? If so, what’s the relevance of the transformer which imperfectly implements the label function?

I think work that compares base language models to their fine-tuned or RLHF-trained successors is likely to be very valuable, because i) this post highlights some concrete things that change during training in these models and ii) some believe that a lot of the risk from language models comes from these further training steps.

If anyone is interested, I think surveying the various fine-tuned and base models here seems the best open-source resource, at least before CarperAI release some RLHF models.

I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

Also, for me $\mathbb{B}$ for {True, False} was not standard; I think it should be defined.

Thanks for the comment! 

I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent. 

In depth, when GPT-Neo is fed a sequence of tokens $x_1, \dots, x_{2n}$ where $x_1, \dots, x_n$ are uniformly random and $x_{n+i} = x_i$ for $1 \le i \le n$, there are four heads in Layer 6 that have the induction attention pattern (i.e. attend from $x_{n+i}$ to $x_{i+1}$). Three of these heads (6.0, 6.6, 6.11) when ablated decrease loss, and one of these heads increases loss on ablation (6.1). Interestingly, when 6.1 is ablated, the additional ablation of 6.0, 6.6 and 6.11 causes loss to increase (perhaps this is confusing, see this table!). 

My guess is the model is able to use the outputs of 6.0, 6.6 and 6.11 differently in the two regimes, so they "compensate" when 6.1 is ablated.
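To make the setup above concrete, here is a rough pure-Python sketch of the repeated-random-token input and of how one might score an attention pattern for "induction-ness". This is not the code used for the GPT-Neo experiments: the function names, the 0-indexed position convention, and the idealised attention matrix are all assumptions made for illustration.

```python
import random


def repeated_random_tokens(n, vocab_size, seed=0):
    """Return 2n token ids: n uniformly random ids followed by an exact copy."""
    rng = random.Random(seed)
    first_half = [rng.randrange(vocab_size) for _ in range(n)]
    return first_half + first_half


def induction_score(attn, n):
    """Mean attention weight from position n+i to position i+1 (0-indexed).

    attn[q][k] is the weight that query position q puts on key position k.
    A head with the induction pattern scores close to 1 on repeated input.
    """
    weights = [attn[n + i][i + 1] for i in range(n)]
    return sum(weights) / len(weights)


def ideal_induction_pattern(n):
    """Hypothetical perfect induction head on a length-2n repeated sequence."""
    size = 2 * n
    attn = [[0.0] * size for _ in range(size)]
    for q in range(size):
        if q >= n:
            # Second half: attend to the token after the earlier occurrence.
            attn[q][q - n + 1] = 1.0
        else:
            # First half: nothing to copy yet, so rest on the first position.
            attn[q][0] = 1.0
    return attn


def uniform_causal_pattern(n):
    """Baseline head that spreads attention evenly over all earlier positions."""
    size = 2 * n
    return [
        [1.0 / (q + 1) if k <= q else 0.0 for k in range(size)]
        for q in range(size)
    ]
```

With a real model, `attn` would instead be one head's attention matrix on the repeated sequence, and heads like the four in Layer 6 would be the ones with a high `induction_score`.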

Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone".

(however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

What happened to the unrestricted adversarial examples challenge? The GitHub repo [1] hasn't been updated since 2020, and even that update was only to the warmup challenge. Additionally, were there any takeaways from the contest?



This comes from the fact that you assumed "adversarial example" had a more specific definition than it really does (from reading ML literature), right? Note that the alignment forum definition of "adversarial example" has the misclassified panda as an example.