AI ALIGNMENT FORUM

David Africa

Research Scientist with the Alignment team at UK AISI.

Comments

Narrow Misalignment is Hard, Emergent Misalignment is Easy
David Africa · 2mo · 30

Thanks for this update. This is really cool. I have a couple of questions, in case you have the time to answer them.

When you sweep layers, do you observe a smooth change in how “efficient” the general solution is? Is there a band of layers where general misalignment is especially easy to pick up?

Have you considered computing geodesic paths in weight space between the narrow and general minima (à la mode connectivity)? Is there a low-loss tunnel, or are they separated by high-loss barriers? I think it would be nice if we could reason geometrically about whether there is one basin here or several distinct ones.
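
To make that concrete, here is the kind of first-pass check I have in mind: linearly interpolate between the two adapter checkpoints and evaluate the loss along the straight line (a full mode-connectivity test would train a low-loss curve instead of using the line; `model`, `eval_loss`, and the two state dicts below are just placeholders, not anything from your setup):

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    # Convex combination (1 - alpha) * A + alpha * B, key by key
    # (only floating-point tensors are interpolated).
    return {
        k: (1 - alpha) * v + alpha * sd_b[k] if v.is_floating_point() else v
        for k, v in sd_a.items()
    }

@torch.no_grad()
def loss_along_line(model, sd_narrow, sd_general, eval_loss, steps=21):
    # Evaluate the loss at evenly spaced points on the straight line between
    # the two minima: a spike in the middle suggests a barrier, a flat profile
    # suggests they sit in (or are connected within) one low-loss region.
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps).tolist():
        model.load_state_dict(
            interpolate_state_dicts(sd_narrow, sd_general, alpha), strict=False
        )
        losses.append(eval_loss(model))
    return losses
```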

Finally, in your orthogonal-noise experiment, you perturb all adapter parameters at once. Have you tried layer-wise noise? I wonder whether certain layers (perhaps the same ones where the general solution is most “efficient”) dominate the robustness gap.
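
For instance, something like the rough sketch below: perturb only one layer's adapter parameters at a time, optionally projecting the noise orthogonal to a per-parameter reference direction (the "layers.{i}." naming and the `directions` dict are placeholders, not your actual setup):

```python
import torch

@torch.no_grad()
def perturb_one_layer(adapter_params, layer_idx, sigma, directions=None):
    # adapter_params: dict of name -> tensor for the LoRA adapter weights.
    # Adds a random perturbation of norm sigma to the parameters of a single
    # layer only; if a reference direction is given for a parameter, the
    # component of the noise along that direction is removed first.
    for name, p in adapter_params.items():
        if f"layers.{layer_idx}." not in name:
            continue
        noise = torch.randn_like(p)
        if directions is not None:
            d = directions[name]
            noise = noise - (noise * d).sum() / (d * d).sum() * d
        p.add_(sigma * noise / noise.norm())
```

Sweeping layer_idx and re-measuring the general-misalignment rate after each perturbation would show whether the robustness gap is driven by a few layers or spread across the whole adapter.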

Posts

15 · Large Language Models and the Critical Brain Hypothesis · 8d · 0
8 · Research Areas in Learning Theory (The Alignment Project by UK AISI) · 2mo · 0
13 · The Alignment Project by UK AISI · 2mo · 0