AI ALIGNMENT FORUM
Paper: LLMs trained on “A is B” fail to learn “B is A”
Paper: On measuring situational awareness in LLMs
More examples of goal misgeneralization
How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?