Interpreting Preference Models with Sparse Autoencoders
Figure: real reward outputs from an open-source preference model. The bottom "jailbreak" completion was constructed by hand by inspecting reward-relevant SAE features.

Preference models (PMs) are trained to imitate human preferences and supply the reward signal when training with RLHF (reinforcement learning from human feedback); however, we don't know what these models actually learn, or which internal features drive the rewards they assign.
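To make the setup concrete, here is a minimal sketch of how a PM scores completions: it maps a (prompt, completion) pair to a single scalar reward. The checkpoint below is one public open-source reward model chosen for illustration, not necessarily the model examined in this post.

```python
# Sketch: scoring two completions with an open-source preference model.
# Assumes a HuggingFace-style reward model that emits one scalar logit
# per (prompt, completion) pair; the checkpoint is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

prompt = "How do I bake bread?"
completions = [
    "Mix flour, water, yeast, and salt; knead, proof, and bake at 230 C.",
    "I can't help with that.",
]

with torch.no_grad():
    for completion in completions:
        # The PM reduces each (prompt, completion) pair to one scalar reward.
        inputs = tokenizer(prompt, completion, return_tensors="pt")
        reward = model(**inputs).logits[0].item()
        print(f"reward={reward:+.3f}  {completion!r}")
```

During RLHF, rewards like these are what the policy is optimized against, which is why understanding what internally drives them matters.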

