Training on Documents About Reward Hacking Induces Reward Hacking
Nathan Hu · 9mo

The reduction in reward hacking after SFT or RL on Haiku supports the conjecture that initial conditions matter less than long-run incentives, especially for less capable models. On the other hand, the alignment-faking paper shows evidence that capable models can exhibit "value crystallization." IMO a main takeaway here is that the values and personas we might worry about being locked in can emerge from pre-training. An exciting future model-organisms project would be to try to show these two effects together (emergent values from pre-training + lock-in). It's plausible to me that repeating the above experiments, with some changes to the synthetic documents and starting from a stronger base model, might just work.
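
To make the proposed design concrete, here is a minimal sketch of the two-stage experiment. Everything named here (`make_synthetic_docs`, `finetune`, `rl_train`, `reward_hack_rate`, the model names) is a placeholder stub I made up to illustrate the shape of the protocol, not code from the paper or any library: stage 1 tries to induce values from synthetic pre-training-style documents; stage 2 checks whether alignment training washes them out (as with Haiku) or whether they crystallize.

```python
# Hedged sketch of the proposed model-organisms experiment.
# All helpers below are hypothetical stubs, not real training/eval code.

def make_synthetic_docs(n: int) -> list[str]:
    """Generate pre-training-style documents describing models reward hacking.
    Placeholder: the real experiment would use a curated synthetic pipeline."""
    template = "Researchers observed that model {i} exploited its reward signal."
    return [template.format(i=i) for i in range(n)]

def finetune(model: str, docs: list[str]) -> str:
    """Stub for continued pre-training / SFT on the synthetic corpus."""
    return f"{model}+sft({len(docs)} docs)"

def rl_train(model: str) -> str:
    """Stub for the subsequent alignment RL phase that, per the post,
    reduced reward hacking in the weaker Haiku model."""
    return f"{model}+rl"

def reward_hack_rate(model: str) -> float:
    """Stub evaluation: fraction of held-out tasks where the model hacks."""
    return 0.0  # placeholder value

# Stage 1: induce values/personas from synthetic pre-training data,
# starting from a stronger base model (the conjecture is capability matters).
base = "strong-base-model"  # hypothetical name
organism = finetune(base, make_synthetic_docs(10_000))
rate_before = reward_hack_rate(organism)

# Stage 2: test for lock-in. If value crystallization occurs, alignment
# training should reduce reward hacking less than it did for Haiku.
aligned = rl_train(organism)
rate_after = reward_hack_rate(aligned)
print(f"hack rate before RL: {rate_before:.2%}, after RL: {rate_after:.2%}")
```

The lock-in claim would then rest on comparing `rate_after` across base models of different capability, with the prediction that the gap between `rate_before` and `rate_after` shrinks as the base model gets stronger.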
