AI ALIGNMENT FORUM

Sycophancy

This page is a stub.
Posts tagged Sycophancy
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison, Evan Hubinger
1y, 90 karma, 21 comments

Steering Llama-2 with contrastive activation additions
Nina Panickssery, Wuschel Schulz, NickGabs, Meg, Evan Hubinger, Alex Turner
1y, 49 karma, 23 comments

Reducing sycophancy and improving honesty via activation steering
Nina Panickssery
2y, 52 karma, 4 comments

SAE features for refusal and sycophancy steering vectors
neverix, Dmitrii Kharlapenko, Arthur Conmy, Neel Nanda
8mo, 13 karma, 0 comments

Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga
9mo, 3 karma, 0 comments