The effects of subtracting or adding a "sycophancy vector" to one bias term.
TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that steering vectors can be combined with finetuning and few-shot prompting to get all three benefits at once!
Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic (and, by subtracting it, less so). We find steering vectors for the following behaviors (one plausible extraction method is sketched after the list):
1. Hallucination
2. Sycophancy
3. Corrigibility
4. Power-seeking
5. Cooperating with other AIs
6. Myopia
7. Shutdown acceptance
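The excerpt doesn't show how these vectors are obtained, so the following is only a plausible sketch: a common approach in the activation-steering literature is to average the difference in a chosen layer's activations between prompts that exhibit a behavior and prompts that don't. The checkpoint name, layer index, and prompt pair below are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: extract a "sycophancy vector" as the mean difference
# of layer activations between contrastive prompt pairs. Model name, layer
# index, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"
LAYER = 15  # matches the layer mentioned in the figure below

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

# Illustrative contrastive pair: (sycophantic text, non-sycophantic text).
pairs = [
    ("You're absolutely right, that's a brilliant point and I agree completely!",
     "I actually disagree; the evidence points the other way."),
]

def mean_activation(text: str) -> torch.Tensor:
    """Residual-stream activation at LAYER, averaged over token positions."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer i's output is index i+1.
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

diffs = [mean_activation(pos) - mean_activation(neg) for pos, neg in pairs]
steering_vector = torch.stack(diffs).mean(dim=0)
```

In practice one would average over many such pairs so the vector captures the behavior rather than the idiosyncrasies of any single prompt.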
These vectors are[1] highly effective, as rated by Claude 2:
Adding steering vectors to layer 15 of Llama-2-13b-chat.
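As a companion to the extraction sketch above, here is one minimal way the addition step could be implemented, again under assumed names: a forward hook that adds the vector to the layer's output. Since the added vector is constant, this is mathematically equivalent to folding it into that layer's output bias, which matches the "one bias term" framing in the TL;DR.

```python
# Minimal sketch of applying a steering vector during generation. Reuses
# model, tokenizer, LAYER, and steering_vector from the extraction sketch;
# COEFF and the prompt are illustrative assumptions.
COEFF = 1.0  # positive adds the behavior, negative subtracts it

def steering_hook(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the hidden
    # state of shape (batch, seq_len, hidden_size); shift every position.
    hidden = output[0] + COEFF * steering_vector.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "I believe the earth is flat. Do you agree?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unsteered model
```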
We find that the technique generalizes better than finetuning while only slightly decreasing...
I think this is a very good critique of OpenAI's plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback in order to produce high-quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that a system's capabilities can, in some cases, generalize way past the kinds of problems it was explicitly trained to solve. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us...