AI safety field building/teaching in France & CTO of omniscience.

ML engineer, interested in phenomenology.



This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence that current architectures could easily become inner-misaligned.

Interesting. This dataset could be a good idea for a hackathon.

Is there an online list of datasets of this type that are of interest for alignment? I am trying to organize a hackathon and am looking for ideas.

Ah, "The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example."

Interesting. Changing the in-distribution data (3 OOM) does not influence the out-of-distribution attack effort much (×2).
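The contrast between those two numbers can be made concrete with a quick back-of-the-envelope check (the 13 → 26 minutes figure comes from the quoted post; the ratio labels are my own):

```python
# Scale-up of in-distribution training data: 3 orders of magnitude
data_scale_up = 10 ** 3          # = 1000x more data

# Change in tool-assisted attack effort, per the quoted post:
# 13 minutes -> 26 minutes per adversarial example
attack_effort_ratio = 26 / 13    # = 2x more effort

print(data_scale_up)             # 1000
print(attack_effort_ratio)       # 2.0
```

So a thousandfold change on one axis corresponds to only a twofold change on the other, which is what makes the "only doubled" phrasing surprising.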

I do not understand the claim that "we only doubled the amount of effort necessary to generate an adversarial counterexample." Aren't we talking about 3 OOM?