Steering Language Models with Weight Arithmetic
by Fabien Roger and constanzafierro
We isolate behavior directions in weight space by subtracting the weight deltas of two small fine-tunes: one that induces the desired behavior on a narrow distribution and another that induces its opposite. We show that steering with this direction can modify traits like sycophancy,...
Nov 11, 2025
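
The weight arithmetic described in the summary can be illustrated with a toy sketch. This is not the authors' code: the function names, the dictionary-of-lists representation of weights, and the scaling coefficient `alpha` are all assumptions made for illustration, and the summary does not say how the direction is scaled or applied.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def delta(finetuned: Weights, base: Weights) -> Weights:
    """Weight delta of one fine-tune relative to the base model."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def behavior_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """Subtract the delta of the opposite-behavior fine-tune from the delta
    of the desired-behavior fine-tune."""
    d_pos = delta(positive, base)
    d_neg = delta(negative, base)
    return {k: [p - n for p, n in zip(d_pos[k], d_neg[k])] for k in base}


def steer(base: Weights, direction: Weights, alpha: float) -> Weights:
    """Assumed steering rule: add a scaled copy of the direction to the base
    weights. The coefficient alpha is an illustrative assumption."""
    return {k: [b + alpha * d for b, d in zip(base[k], direction[k])] for k in base}


# Toy one-parameter "model" with made-up values.
base = {"layer.weight": [0.0, 1.0, 2.0]}
positive = {"layer.weight": [0.5, 1.0, 2.0]}   # fine-tune toward the behavior
negative = {"layer.weight": [-0.5, 1.0, 2.0]}  # fine-tune toward its opposite

direction = behavior_direction(base, positive, negative)
steered = steer(base, direction, alpha=0.5)
suppressed = steer(base, direction, alpha=-0.5)
```

Under this assumed rule, a positive `alpha` moves the weights toward the desired behavior and a negative `alpha` moves them toward its opposite. With real models the same arithmetic would be applied per parameter tensor rather than to Python lists.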