The effects of subtracting or adding a "sycophancy vector" to one bias term.

TL;DR: By adding, for example, a "sycophancy vector" to a single bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that...

Jan 2, 2024
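The TL;DR describes steering a model by adding or subtracting a vector at a single bias term. To show why that single change is enough to shift behavior everywhere, here is a minimal sketch in plain Python. It uses a toy 2x2 linear layer, made-up per-token activations, and a made-up "sycophancy vector". The names `linear` and `add_steering_to_bias` are hypothetical, and the sketch does not show how the post derives its vector. It demonstrates one thing: adding `coeff * v` to a layer's bias shifts that layer's output by exactly `coeff * v` at every token position, and a negative `coeff` corresponds to subtracting the vector.

```python
# Minimal sketch with toy numbers and hypothetical names (not the post's code):
# adding a steering vector to one layer's bias shifts that layer's output
# by the same vector at every token position.

def linear(x, weight, bias):
    """Compute weight @ x + bias using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def add_steering_to_bias(bias, steering_vector, coeff):
    """Return a new bias with coeff * steering_vector added (coeff < 0 subtracts)."""
    return [b + coeff * v for b, v in zip(bias, steering_vector)]

# Toy 2x2 layer and a hypothetical "sycophancy vector" (made-up values).
weight = [[1.0, 0.5], [-0.5, 2.0]]
bias = [0.1, -0.2]
sycophancy_vector = [0.3, -0.4]

# Toy activations, one list per token position.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

for coeff in (1.0, -1.0):  # add the vector, then subtract it
    steered_bias = add_steering_to_bias(bias, sycophancy_vector, coeff)
    for x in tokens:
        base = linear(x, weight, bias)
        steered = linear(x, weight, steered_bias)
        # The per-position shift equals coeff * sycophancy_vector (up to float rounding).
        shift = [round(s - b, 6) for s, b in zip(steered, base)]
        print(coeff, shift)
```

Because the bias is added after the matrix multiply, the shift does not depend on the input, so a single edited bias term acts like adding the vector to the layer's output at every position.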