Against Almost Every Theory of Impact of Interpretability
Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong, preferring that readers disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important, because safety ideas are boring and anti-memetic. So let's go!

Many thanks to @scasper, @Sid Black, @Neel Nanda, @Fabien Roger, @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.

When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability by Neel Nanda, but I later expanded the scope of my critique. Some of the ideas presented here are not endorsed by anyone; to explain the difficulties, I still need to 1. explain them and 2. criticize them. This gives the post an adversarial vibe. I'm sorry about that, and I think that doing interpretability research, even if it's no longer what I consider a priority, is still commendable.

How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?", which can mostly be skipped at first. I expect this document to also be useful for people who are not doing interpretability research. The different sections are mostly independent, and I've added a lot of bookmarks to help modularize this post.

If you have very little time, just read the following (this is also the part where I'm most confident):

* Auditing deception with Interp is out of reach (4 min)
* Enumerative safety critique (2 min)
* Technical Agendas with better Theories of Impact (1 min)

Here is the list of claims that I will defend (bolded sections are the most important ones):

* The overall Theory of Impact is quite poor
* Interp is not a good predictor of future systems
* Auditing deception with Interp is out of reach