Interpreting Preference Models w/ Sparse Autoencoders
by Logan Riggs and Jannik Brinkmann
This is the real reward output for an open-source preference model. The bottom "jailbreak" completion was created manually by looking at reward-relevant SAE features. Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don't know what...
Jul 1, 2024