Steering Llama-2 with contrastive activation additions
The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that...
Jan 2, 2024125