Interpreting Preference Models with Sparse Autoencoders
Figure: real reward outputs from an open-source preference model. The bottom "jailbreak" completion was constructed by hand by inspecting reward-relevant SAE features.

Preference models (PMs) are trained to imitate human preferences and supply the reward signal when training with RLHF (reinforcement learning from human feedback); however, we don't know what these models actually learn, or which internal features drive the rewards they assign.
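To make the setup concrete, here is a minimal sketch of how a PM scores completions: it maps a (prompt, completion) pair to a single scalar reward. The checkpoint below is one public open-source reward model chosen for illustration, not necessarily the model examined in this post.

```python
# Sketch: scoring two completions with an open-source preference model.
# Assumes a HuggingFace-style reward model that emits one scalar logit
# per (prompt, completion) pair; the checkpoint is an assumption.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

prompt = "How do I bake bread?"
completions = [
    "Mix flour, water, yeast, and salt; knead, proof, and bake at 230 C.",
    "I can't help with that.",
]

with torch.no_grad():
    for completion in completions:
        # The PM reduces each (prompt, completion) pair to one scalar reward.
        inputs = tokenizer(prompt, completion, return_tensors="pt")
        reward = model(**inputs).logits[0].item()
        print(f"reward={reward:+.3f}  {completion!r}")
```

During RLHF, rewards like these are what the policy is optimized against, which is why understanding what internally drives them matters.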

