AI ALIGNMENT FORUM

Meg

Posts

Sorted by New
Karma  Title                                                                          Age   Comments
 82    Auditing language models for hidden objectives                                 3mo    3
118    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training   1y    69
 49    Steering Llama-2 with contrastive activation additions                         1y    23
 39    Towards Understanding Sycophancy in Language Models                            2y     0
 56    Paper: LLMs trained on “A is B” fail to learn “B is A”                         2y     0
 44    Paper: On measuring situational awareness in LLMs                              2y    13

Wikitag Contributions

No wikitag contributions to display.

Comments

Sorted by Newest

No comments found.