AI ALIGNMENT FORUM
Paper: On measuring situational awareness in LLMs
More examples of goal misgeneralization
How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?