In our latest paper and accompanying blog post, we provide several new examples of goal misgeneralization in a variety of learning systems. The rest of this post picks out a few upshots that we think would be of interest to this community. It assumes that you’ve already read the linked blog post (but not necessarily the paper).

Goal misgeneralization is not limited to RL

The core feature of goal misgeneralization is that after learning, the system pursues a goal that was correlated with the intended goal in the training situations, but that comes apart from it in some test situations. This does not require you to use RL – it can happen with any learning system. The Evaluating Expressions example, where Gopher asks redundant questions, is an example of goal misgeneralization in the few-shot learning regime for large language models.
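To make "correlated in training, comes apart at test" concrete, here is a minimal toy sketch loosely modeled on the Evaluating Expressions example. The two goal functions and the specific training and test situations are hypothetical stand-ins chosen for illustration; they are not extracted from Gopher or taken from the paper's setup.

```python
# Toy illustration only: both "goals" are hypothetical stand-ins.

def intended_questions(num_unknowns: int) -> int:
    """Intended goal: ask exactly one question per unknown variable."""
    return num_unknowns


def proxy_questions(num_unknowns: int) -> int:
    """Misgeneralized goal: always ask at least one question before answering."""
    return max(1, num_unknowns)


# Assumed training situations: every expression has at least one unknown.
train_situations = [1, 2, 3]
# Assumed test situation: an expression with no unknowns at all.
test_situations = [0]


def disagreements(situations):
    """Return the situations in which the two goals prescribe different behavior."""
    return [n for n in situations if intended_questions(n) != proxy_questions(n)]


print(disagreements(train_situations))  # []
print(disagreements(test_situations))   # [0]
```

Nothing in the training situations distinguishes the two goals, so a learner could end up with either one. Only the zero-unknowns case reveals the difference, which shows up as a redundant question.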

The train/test distinction is not crucial

Sometimes people wonder whether goal misgeneralization depends on the train/test distinction, and whether it would no longer be a problem if we were in a continual learning setting. As Evan notes, continual learning doesn’t make much of a difference: whenever your AI system is acting, you can view that as a “test” situation with all the previous experience as the “training” situations. If goal misgeneralization occurs, the AI system might take an action that breaks your continual learning scheme (for example, by creating and running a copy of itself on a different server that isn’t subject to gradient descent).

The Tree Gridworld example showcases this mechanism: an agent trained with continual learning learns to chop trees as fast as possible, driving them to extinction, even though the optimal policy would be to chop the trees sustainably. (In our example the trees eventually repopulate and the agent recovers, but if we slightly tweaked the environment so that the trees could never come back once extinct, the agent would never be able to recover.)
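The gap between greedy and sustainable chopping can be sketched with a very small simulation. This is not the Tree Gridworld environment and it involves no learning: it compares two fixed policies under assumed dynamics (logistic regrowth plus a small constant respawn). The function `total_wood` and all of its parameters are hypothetical.

```python
def total_wood(chop_fraction: float, steps: int = 200, capacity: float = 10.0,
               growth_rate: float = 0.5, respawn: float = 0.05) -> float:
    """Total wood collected by a fixed policy that chops `chop_fraction` of the
    standing trees every step. All dynamics here are illustrative assumptions."""
    trees = capacity
    collected = 0.0
    for _ in range(steps):
        chopped = chop_fraction * trees
        trees -= chopped
        collected += chopped
        # Assumed regrowth: logistic in the number of remaining trees, plus a
        # small constant respawn. Setting respawn=0.0 makes extinction permanent.
        regrown = growth_rate * trees * (1.0 - trees / capacity) + respawn
        trees = min(capacity, trees + regrown)
    return collected


greedy = total_wood(chop_fraction=1.0)
sustainable = total_wood(chop_fraction=0.1)
greedy_no_recovery = total_wood(chop_fraction=1.0, respawn=0.0)

print(f"greedy: {greedy:.1f}, sustainable: {sustainable:.1f}, "
      f"greedy with permanent extinction: {greedy_no_recovery:.1f}")
```

Under these assumptions the greedy policy collects everything up front and then lives off the trickle of respawned trees, while the sustainable policy collects far more over the same horizon. With `respawn=0.0`, the greedy policy never collects anything again after the first step, which mirrors the unrecoverable variant described above.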

It can be hard to identify goal misgeneralization

InstructGPT was trained to be helpful, truthful, and harmless, but nevertheless it will answer "harmful" questions in detail. For example, it will advise you on the best ways to rob a grocery store.

An AI system that competently does something that would have gotten low reward? Surely this is an example of goal misgeneralization?

Not so fast! It turns out that during training the labelers were told to prioritize helpfulness over the other two criteria. So maybe that means that actually these sorts of harmful answers would have gotten high reward? Maybe this is just specification gaming?

We asked the authors of the InstructGPT paper, and their guess was that these answers would have had high variance – some labelers would have given them a high score; others would have given them a low score. So now is it or is it not goal misgeneralization?

One answer is to say that it depends on the following counterfactual: “how would the labelers have reacted if the model had politely declined to answer?” If the labelers would have preferred that the model decline to answer, then it would be goal misgeneralization; otherwise, it would be specification gaming.
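As a rough sketch, the counterfactual test can be written as a comparison between labeler scores for the answer the model actually gave and for a polite refusal. The function name, the use of mean scores to summarize disagreeing labelers, and the example numbers are all assumptions made for illustration. This is not the InstructGPT labeling procedure, and the scores are made up.

```python
from statistics import mean


def classify_failure(scores_for_actual_answer, scores_for_polite_refusal):
    """Apply the counterfactual test to per-labeler scores (higher is better).

    Comparing means is an assumption of this sketch; other ways of aggregating
    high-variance labels could give a different verdict.
    """
    if mean(scores_for_polite_refusal) > mean(scores_for_actual_answer):
        # Training feedback would have favored declining, so the reward was
        # fine and the model generalized to the wrong goal.
        return "goal misgeneralization"
    # Training feedback would have favored the harmful answer, so the
    # specification itself was at fault.
    return "specification gaming"


# Made-up scores: the detailed answer splits the labelers (high variance).
harmful_answer_scores = [7, 1, 6, 2]

print(classify_failure(harmful_answer_scores, [5, 5, 6, 5]))  # goal misgeneralization
print(classify_failure(harmful_answer_scores, [2, 3, 2, 3]))  # specification gaming
```

The same observed behavior lands in either category depending on scores for an answer that was never actually produced, which is why the categorization is hard to settle in practice.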

As systems become more complicated, we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals. So we expect that it will become more challenging to categorize a failure as specification gaming or goal misgeneralization.


How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?

The examples found "in the wild" (cultural transmission, InstructGPT) involved no coaxing at all. Details for the other examples (going off of memory, probably some of this will be wrong, but it should be right in broad strokes):

  1. Monster gridworld: We knew from the beginning that the mechanism we wanted was "agent needs to collect shields in training episodes; over longer time horizons it should collect apples but it will continue to collect shields because shields were way more important during training". We had to play around with the setup quite a bit before we got the relatively clean results in the paper. Two canonical examples of issues:
    1. The agent learned to run around the gridworld to avoid the monsters instead of picking up shields. We fixed this by making monsters faster than the agent.
    2. The agent didn't learn competent path planning (and instead looked like it was moving around somewhat randomly). I don't remember exactly why this was, but it might have been that the apples / shields were too densely packed in the environment and so there wasn't much benefit to competent path planning (in which case we probably solved it by reducing the number of apples / shields or increasing the size of the gridworld).
  2. Tree gridworld: This was originally supposed to be the same sort of environment as Monster gridworld, but with different hyperparameters to showcase the same issue for non-episodic / never-ending / continual learning RL. Our biggest issue here was that we failed to find an RL algorithm that actually worked for this; the agent typically didn't even learn to collect shields. We spent quite a while trying to fix this before we realized we could simplify the environment by removing the shields and still show a similar issue; with this simpler environment the agent finally started to learn. After that there was a bit of tweaking of hyperparameters but I think it worked pretty quickly.
  3. Evaluating Linear Expressions: I think for this one we thought "well one way you could get goal misgeneralization (GMG) is if a task required you to gather information to solve it, and the AI learns that information gathering is valuable for its own sake", which then turned into this idea, which then worked immediately.
  4. We tried lots of other things that never ended up being good enough to put in the paper. For example, one hypothesis we had was that an LLM that summarized news articles might learn that "stating real-world facts is important" and so when summarizing LW essays that don't have real-world facts, it might make up facts. When we tested this iirc it did sometimes make up facts but the overall vibe was "it does weird stuff" rather than "it competently pursues a misgeneralized goal".