AI Alignment Forum

AI Alignment Posts

15 My most common research advice: do quick sanity checks

1d

2

10 Predicting When RL Training Breaks Chain-of-Thought Monitorability

2d

0

40 (Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

7vik, Sid Black, Joseph Bloom

4d

0

21 ControlAI 2025 Impact Report: our progress toward an international ban on ASI

Andrea_Miotti, Alex Amadori

7d

0

20 Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels, Neel Nanda

8d

0

25 A Toy Environment For Exploring Reasoning About Reward

jenny, Bronson Schoen

9d

1

35 Metagaming matters for training, evaluation, and oversight

jenny, Bronson Schoen

16d

0

22 “Act-based approval-directed agents”, for IDA skeptics

16d

6

6 New RFP on Interpretability from Schmidt Sciences

17d

0

5 Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors

21d

0