AI ALIGNMENT FORUM

Nathan Hu — AI Alignment Forum

154

2

1y

[Linkpost] Interpreting Language Model Parameters

by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, and Lee Sharkey

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1], and decompose the parameters of a small[2] language model with it. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think...

Training on Documents About Reward Hacking Induces Reward Hacking

by evhub and Nathan Hu

This is a blog post reporting some preliminary work from the Anthropic Alignment Science team, which might be of interest to researchers working actively in this space. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments at a lab meeting, rather...

Jan 21, 2025 • 135