AI ALIGNMENT FORUM

Meg

Posts

82 · Auditing language models for hidden objectives · 6mo · 3
118 · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2y · 69
49 · Steering Llama-2 with contrastive activation additions · 2y · 23
39 · Towards Understanding Sycophancy in Language Models · 2y · 0
56 · Paper: LLMs trained on “A is B” fail to learn “B is A” · 2y · 0
44 · Paper: On measuring situational awareness in LLMs · 2y · 13