Steering Llama-2 with contrastive activation additions
by Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, and TurnTrout
Figure: The effects of subtracting or adding a "sycophancy vector" to one bias term.

TL;DR: By adding a steering vector (for example, a "sycophancy vector") to a single bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that...
Jan 2, 2024