constanzafierro — AI Alignment Forum

80

2y

Steering Language Models with Weight Arithmetic

by Fabien Roger and constanzafierro

We isolate behavior directions in weight space by subtracting the weight deltas of two small fine-tunes: one that induces the desired behavior on a narrow distribution and another that induces its opposite. We show that steering with this direction can modify traits like sycophancy,...

Nov 11, 2025•88
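
As a rough illustration of the weight arithmetic the summary describes, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the `alpha` scaling coefficient, the step of adding the scaled direction back onto the base weights, and the toy one-parameter "model" are all assumptions made for illustration, and real checkpoints would hold tensors rather than lists of floats.

```python
# Hypothetical sketch: models are represented as {parameter_name: [floats]}.

def weight_delta(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def behavior_direction(base, positive, negative):
    """Delta of the 'desired behavior' fine-tune minus delta of the opposite one."""
    d_pos = weight_delta(positive, base)
    d_neg = weight_delta(negative, base)
    return {
        name: [p - n for p, n in zip(d_pos[name], d_neg[name])]
        for name in base
    }

def steer(base, direction, alpha):
    """Add the direction, scaled by alpha (an assumed knob), to the base weights."""
    return {
        name: [w + alpha * d for w, d in zip(base[name], direction[name])]
        for name in base
    }

# Toy weights, chosen only so the arithmetic is easy to follow.
base = {"layer.weight": [1.0, 2.0, 3.0]}
positive = {"layer.weight": [1.5, 2.0, 2.0]}   # fine-tune toward the behavior
negative = {"layer.weight": [0.5, 2.0, 4.0]}   # fine-tune toward its opposite

direction = behavior_direction(base, positive, negative)
steered = steer(base, direction, alpha=0.5)
```

In this sketch a positive `alpha` moves the weights toward the desired behavior and a negative one moves them away from it. How the direction is actually scaled and applied in the post is not stated in the summary.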