Arthur Conmy

Comments

I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?

Also, $\mathbb{B}$ for $\{\text{True}, \text{False}\}$ was not standard notation for me; I think it should be defined.

Thanks for the comment! 

I have spent some time doing mechanistic interpretability on GPT-Neo, to try to answer whether compensation only occurs because of dropout. TL;DR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.

In depth, when GPT-Neo is fed a sequence of tokens $x_1, \ldots, x_{2n}$ where $x_1, \ldots, x_n$ are uniformly random and $x_{n+i} = x_i$ for $1 \le i \le n$, there are four heads in Layer 6 that have the induction attention pattern (i.e. attend from $x_{n+i}$ to $x_{i+1}$). Three of these heads (6.0, 6.6, 6.11) decrease loss when ablated, and one of them (6.1) increases loss when ablated. Interestingly, when 6.1 is ablated, the additional ablation of 6.0, 6.6 and 6.11 causes loss to increase (perhaps this is confusing, see this table!).

My guess is the model is able to use the outputs of 6.0, 6.6 and 6.11 differently in the two regimes, so they "compensate" when 6.1 is ablated.
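To make the input setup concrete, here is a minimal sketch of the repeated-random-token sequence and of which position an induction head would attend to from each repeated token. It does not load GPT-Neo or reproduce the ablations. The function names, the 0-indexed positions, and the vocabulary size are all illustrative assumptions, not taken from the actual experiment code.

```python
import random


def repeated_random_tokens(n, vocab_size, seed=0):
    """Return 2n token ids: n uniformly random ids followed by an exact repeat.

    vocab_size is a placeholder; substitute the real tokenizer's size.
    """
    rng = random.Random(seed)
    first_half = [rng.randrange(vocab_size) for _ in range(n)]
    return first_half + first_half


def induction_targets(n):
    """Map each query position in the repeated half to its induction target.

    Positions are 0-indexed. Position n + i repeats the token at position i,
    so an induction head attends from n + i to i + 1, the token that
    followed the earlier occurrence.
    """
    return {n + i: i + 1 for i in range(n)}


def induction_score(attn, n):
    """Average attention mass a head places on the induction targets.

    attn is a (2n x 2n) attention pattern given as a list of rows,
    with attn[query][key] the weight from query position to key position.
    """
    targets = induction_targets(n)
    return sum(attn[q][k] for q, k in targets.items()) / n


if __name__ == "__main__":
    n = 4
    tokens = repeated_random_tokens(n, vocab_size=1000)
    print(tokens)
    print(induction_targets(n))
```

A head whose `induction_score` is close to 1 on such inputs has the induction attention pattern; the ablation comparison itself would then need the model's loss with and without each head, which this sketch leaves out.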

Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone".

(however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

What happened to the unrestricted adversarial examples challenge? The GitHub repository [1] hasn't been updated since 2020, and that update was only to the warm-up challenge. Additionally, were there any takeaways from the contest?

 

[1] https://github.com/openphilanthropy/unrestricted-adversarial-examples

This comes from the fact that you assumed "adversarial example" had a more specific definition than it really does (from reading ML literature), right? Note that the Alignment Forum definition of "adversarial example" includes the misclassified panda as an example.