This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence for the claim that "current architectures could easily be inner misaligned".
Interesting. This dataset could be a good idea for a hackathon.
Is there an online list of datasets like this that are of interest for alignment? I am trying to organize a hackathon and am looking for ideas.
Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example."
Interesting. A 3 OOM change in-distribution does not have much influence out-of-distribution (×2).
I do not understand the claim "we only doubled the amount of effort necessary to generate an adversarial counterexample." Aren't we talking about 3 OOM?