AI ALIGNMENT FORUM
Meg
Posts
Sorted by New
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (3mo, 115 karma, 69 comments)
- Steering Llama-2 with contrastive activation additions (4mo, 49 karma, 23 comments)
- Towards Understanding Sycophancy in Language Models (6mo, 39 karma, 0 comments)
- Paper: LLMs trained on “A is B” fail to learn “B is A” (7mo, 57 karma, 0 comments)
- Paper: On measuring situational awareness in LLMs (8mo, 44 karma, 13 comments)