All of Charbel-Raphaël's Comments + Replies

This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence towards "current architectures could easily become inner-misaligned".

2 Adam Jermyn 7mo
This sure seems like there's an inner optimizer in there somewhere...
1 Logan Riggs Smith 7mo
Notably, the model was trained across multiple episodes to pick up on RL improvement. The usual inner-misalignment story is that the model tries to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.

Interesting. This dataset could be a good idea for a hackathon.

Is there an online list of datasets like this that are of interest for alignment? I am trying to organize a hackathon and am looking for ideas.

2 AdamGleave 1y
A related dataset is Waterbirds, described in Sagawa et al (2020) [https://arxiv.org/pdf/1911.08731.pdf], where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background. The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
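
For anyone who wants to try the filtering Adam describes, here is a minimal sketch: keep only the training examples whose background agrees with their label, so the spurious correlation becomes perfect. The array names and the land/water encoding below are illustrative, not the actual Waterbirds metadata API.

```python
# Sketch: filter a Waterbirds-style training set so that bird type and
# background are perfectly correlated (landbirds on land, waterbirds on water).
# `labels` and `backgrounds` are assumed to be 0/1 arrays already extracted
# from the dataset's metadata; the encoding here is illustrative.
import numpy as np

def perfectly_correlated_indices(labels: np.ndarray, backgrounds: np.ndarray) -> np.ndarray:
    """Indices of examples whose background matches their label."""
    return np.flatnonzero(labels == backgrounds)

# Toy example: 0 = landbird / land background, 1 = waterbird / water background.
labels = np.array([0, 0, 1, 1, 1, 0])
backgrounds = np.array([0, 1, 1, 1, 0, 0])
train_idx = perfectly_correlated_indices(labels, backgrounds)
print(train_idx)  # [0 2 3 5] -> the perfectly correlated subset
```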

Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example."

Interesting. Changing the in-distribution difficulty (3 orders of magnitude) does not influence the out-of-distribution difficulty much (only ×2).

3 Paul Christiano 1y
I think that 3 orders of magnitude is the comparison between "time taken to find a failure by randomly sampling" and "time taken to find a failure if you are deliberately looking using tools."
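
To make the contrast in this thread concrete, here is the arithmetic using only the two figures quoted above (illustrative, not numbers taken from the paper itself):

```python
# Illustrative arithmetic from the figures quoted in this thread:
# failures found by random sampling became ~1000x rarer (3 orders of magnitude),
# while the tool-assisted attack merely went from 13 to 26 minutes per example.
random_sampling_improvement = 10 ** 3                 # "3 orders of magnitude"
attack_minutes_before, attack_minutes_after = 13, 26  # tool-assisted attack
attack_cost_increase = attack_minutes_after / attack_minutes_before

print(f"Random-sampling failures: ~{random_sampling_improvement}x harder to find")
print(f"Tool-assisted attack: only ~{attack_cost_increase:.0f}x more effort per example")
```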

I do not understand "we only doubled the amount of effort necessary to generate an adversarial counterexample." Aren't we talking about 3 orders of magnitude?
