AI ALIGNMENT FORUM

Meg

Posts

Sorted by New
Karma  Title                                                                          Age   Comments
 82    Auditing language models for hidden objectives                                 3mo    3
118    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training   1y    69
 49    Steering Llama-2 with contrastive activation additions                         1y    23
 39    Towards Understanding Sycophancy in Language Models                            2y     0
 56    Paper: LLMs trained on “A is B” fail to learn “B is A”                         2y     0
 44    Paper: On measuring situational awareness in LLMs                              2y    13

Wikitag Contributions

No wikitag contributions to display.

Comments

Sorted by Newest

No comments found.