Maximilian Kaufmann — AI Alignment Forum

Paper: LLMs trained on “A is B” fail to learn “B is A”

59[anonymous]3y

This is a linkpost for https://arxiv.org/abs/2309.12288

This post is the copy of the introduction of this paper on the Reversal Curse.

Authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

Abstract

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany," it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?" Moreover, the likelihood of the correct answer ("Olaf Scholz") will not be higher than for a random name. Thus, models exhibit...

(Continue Reading - 1077 more words)

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg, Maximilian Kaufmann

This is a linkpost for https://arxiv.org/abs/2309.00667

This post is a copy of the introduction of this paper on situational awareness in LLMs.

Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans.

Abstract

We aim to better understand the emergence of situational awareness in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment.

Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for...

(Continue Reading - 1207 more words)

More examples of goal misgeneralization

Maximilian Kaufmann4y80

How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?