AI ALIGNMENT FORUM

Sycophancy

This page is a stub.
Posts tagged Sycophancy
90 Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison, evhub
1y
21
49 Steering Llama-2 with contrastive activation additions
Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, TurnTrout
2y
23
52 Reducing sycophancy and improving honesty via activation steering
Nina Panickssery
2y
4
13 SAE features for refusal and sycophancy steering vectors
neverix, Dmitrii Kharlapenko, Arthur Conmy, Neel Nanda
11mo
0
3 Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga
1y
0