Classifying images is one thing. But what if I'm an agent that is actually active in some setting?

The previous approach still applies: detecting when I'm out of distribution, and trying to keep my behaviour compatible with the various reward functions that could be compatible with the data I've seen.

The main difference is that, if I'm acting, it's much easier to push the setting into an out-of-distribution state, seeking out an extremal Goodhart solution to maximise reward. But that issue is for a later post.

Mazes and doors example

We'll use the maze and door example from this post. I've been trained to go through a maze and reach a red door (which is the only red object in the environment); the episode then ends.

I'm now in an environment where the only door is blue, and the only red thing is a window. What should I do now?

My reward function is underspecified by its training environment - this is the old problem of unidentifiability of reward functions.

There are three potential reward functions I could extrapolate from the training examples:

  • Reward for reaching a red door.
  • Reward for reaching a door.
  • Reward for reaching a red object.

The episode ended, in training, every time I reached the red door. So I can't distinguish "reaching" a point from "staying" at that point. So the following three reward functions are also possible, though less likely:

  • Reward for each turn spent next to a red door.
  • Reward for each turn spent next to a door.
  • Reward for each turn spent next to a red object.

There are other possible reward functions, but these are the most obvious. I might have different levels of credence for these rewards; as stated before, the three "staying" rewards seem less likely than the three "reaching" rewards.
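To make the unidentifiability concrete, here is a minimal sketch of the six candidate rewards as functions of where an episode ends. The names (`Outcome`, `CANDIDATES`, `rewards`) are hypothetical, and I'm assuming that the turn on which I arrive counts as one turn spent next to the object. Under that assumption, every candidate gives the same reward on the training outcome, and they only come apart in the new environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    colour: str          # colour of the object I end up next to
    kind: str            # "door", "window", ...
    turns_adjacent: int  # turns spent next to it (assumption: arrival counts as 1)


def is_red_door(o: Outcome) -> bool:
    return o.colour == "red" and o.kind == "door"


def is_door(o: Outcome) -> bool:
    return o.kind == "door"


def is_red_object(o: Outcome) -> bool:
    return o.colour == "red"


# The six candidate reward functions from the text (labels are illustrative).
CANDIDATES = {
    "reach red door": lambda o: float(is_red_door(o)),
    "reach door": lambda o: float(is_door(o)),
    "reach red object": lambda o: float(is_red_object(o)),
    "stay by red door": lambda o: float(o.turns_adjacent * is_red_door(o)),
    "stay by door": lambda o: float(o.turns_adjacent * is_door(o)),
    "stay by red object": lambda o: float(o.turns_adjacent * is_red_object(o)),
}


def rewards(outcome: Outcome) -> dict:
    """Reward assigned to an outcome by each candidate reward function."""
    return {name: r(outcome) for name, r in CANDIDATES.items()}


# Training: the episode ends as soon as I reach the red door.
TRAINING = Outcome("red", "door", turns_adjacent=1)
# New environment: the two objects I could walk up to.
BLUE_DOOR = Outcome("blue", "door", turns_adjacent=1)
RED_WINDOW = Outcome("red", "window", turns_adjacent=1)

if __name__ == "__main__":
    for label, outcome in [("training", TRAINING),
                           ("blue door", BLUE_DOOR),
                           ("red window", RED_WINDOW)]:
        print(label, rewards(outcome))
```

Since all six candidates agree on every training outcome, the training data alone cannot shift my relative credence between them; any difference in credence has to come from priors, such as "staying" being less likely than "reaching".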

So, what is the optimal policy here? Note that the two red-door rewards (for reaching a red door, and for each turn spent next to one) are irrelevant here, because the current environment doesn't contain any red doors. So, initially, the optimal policy is to go to the blue door and the red window - which one first depends on the layout of the maze and the relative probabilities of the door rewards and the red-object rewards.

After that, if the episode hasn't ended, the "reaching" rewards are irrelevant - either they are incorrect, or they have already been accomplished. So now only the reward for each turn spent next to a door and the reward for each turn spent next to a red object are relevant. If the first one is the most likely, I maximise expected reward by standing by the door forever; if the second is more likely, then standing by the window forever is the correct policy.
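The reasoning above can be sketched as an expected-reward calculation over plans. Everything numerical here is an assumption for illustration: the credences in `CREDENCE` are made up, each reward is normalised to 1 (once for "reaching", per turn for "staying"), there is no discounting, and the episode is assumed not to end. A plan is just the list of objects I stand next to, turn by turn.

```python
# Hypothetical credences over the six candidate rewards (they sum to 1).
CREDENCE = {
    "reach_red_door": 0.20,
    "reach_door": 0.35,
    "reach_red_object": 0.25,
    "stay_red_door": 0.05,
    "stay_door": 0.10,
    "stay_red_object": 0.05,
}

# Which objects in the current environment satisfy each hypothesis.
# There is no red door here, so the red-door rewards can never pay out.
SATISFIES = {
    "reach_red_door": set(),
    "stay_red_door": set(),
    "reach_door": {"blue door"},
    "stay_door": {"blue door"},
    "reach_red_object": {"red window"},
    "stay_red_object": {"red window"},
}


def reward(hypothesis: str, plan: list) -> float:
    """Reward a plan earns if `hypothesis` is the true reward function."""
    hits = [location in SATISFIES[hypothesis] for location in plan]
    if hypothesis.startswith("reach"):
        return 1.0 if any(hits) else 0.0  # paid once, on first arrival
    return float(sum(hits))               # paid for every turn spent there


def expected_reward(plan: list) -> float:
    """Credence-weighted reward of a plan (a Bayesian mix of rewards)."""
    return sum(p * reward(h, plan) for h, p in CREDENCE.items())


visit_both = ["blue door", "red window"]
then_stay_by_door = visit_both + ["blue door"] * 10
then_stay_by_window = visit_both + ["red window"] * 10

if __name__ == "__main__":
    for name, plan in [("visit both", visit_both),
                       ("then stay by door", then_stay_by_door),
                       ("then stay by window", then_stay_by_window)]:
        print(name, expected_reward(plan))
```

With these made-up credences, staying by the door beats staying by the window simply because the per-turn door reward has the higher credence; swapping those two credences reverses the ranking.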


If I have the opportunity to ask for clarification about my reward function - maybe by running another training example with different specifications - then I would do so, and would be willing to pay a cost to ask[1].
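As a rough sketch of how much I'd pay, suppose only the two "staying" rewards remain, with (renormalised) credence `p_door` on the door and `1 - p_door` on the red object. The function name and the setup are my assumptions: clarification is taken to be perfectly informative, each reward is 1 per turn, there are `turns_left` turns remaining, and there is no discounting.

```python
def max_cost_worth_paying(p_door: float, turns_left: int) -> float:
    """Upper bound on the cost worth paying for a perfect clarification.

    Assumes two remaining hypotheses, each paying 1 per turn spent next to
    its object, with credence p_door and 1 - p_door respectively.
    """
    if not 0.0 <= p_door <= 1.0:
        raise ValueError("p_door must be a probability")
    if turns_left < 0:
        raise ValueError("turns_left must be non-negative")

    # Without asking: stand by whichever object is more likely to be rewarded.
    value_without_asking = max(p_door, 1.0 - p_door) * turns_left
    # After asking: I know which object is rewarded and stand by it.
    value_after_asking = 1.0 * turns_left
    return value_after_asking - value_without_asking


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9, 1.0):
        print(p, max_cost_worth_paying(p, turns_left=10))
```

The bound is largest when the two hypotheses are equally likely and shrinks to zero as I become certain of one of them.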

Diminishing returns and other effects

If I suspect my rewards have diminishing returns, then it could be in my interests to alternate between the blue door and the red window. This is explained more fully in this post. In fact, that whole post grew out of this kind of "if I were a well-intentioned AI" reasoning. So I'll repeat the conclusion of that post:

So, as long as:

  1. We use a Bayesian mix of reward functions rather than a maximum likelihood reward function.
  2. An ideal reward function is present in the space of possible reward functions, and is not penalised in probability.
  3. The different reward functions are normalised.
  4. If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process.

Then, we shouldn't unduly fear Goodhart effects [...]

If not all those conditions are met, then:

  1. The negative aspects of the Goodhart effect will be weaker if there are gains from trade and a rounded Pareto boundary.

So if those properties hold, I would tend to avoid Goodhart effects. Now, I don't know extra true information about the reward function - as I said, I'm well-intentioned, but not well-informed. But humans could include in me the fact that they fear the Goodhart effect. This very fact is informative, and, equipped with that knowledge and the list above, I can infer that the actual reward has diminishing returns, or that it is penalised in probability, or that there is a normalisation issue there. I'm already using a Bayesian mix of rewards, so it would be informative for me to know whether my human programmers are aware of that.
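The effect of diminishing returns on the door-versus-window choice can be sketched with a toy calculation. The setup is an assumption on my part: two remaining "staying" hypotheses with credence `p_door` and `1 - p_door`, a fixed budget of turns to split between the two objects, and a utility function of the number of turns spent at the rewarded object. The square root stands in for "diminishing returns"; nothing in the argument depends on that particular choice.

```python
import math
from typing import Callable


def best_split(p_door: float, turns: int,
               utility: Callable[[int], float]) -> int:
    """Number of turns to spend by the door that maximises expected utility.

    The remaining `turns - n` turns are spent by the red window.
    Brute-force search over every possible split.
    """
    def expected_utility(n_door: int) -> float:
        return (p_door * utility(n_door)
                + (1.0 - p_door) * utility(turns - n_door))

    return max(range(turns + 1), key=expected_utility)


def linear(n: int) -> float:
    return float(n)          # constant returns: each extra turn is worth the same


def diminishing(n: int) -> float:
    return math.sqrt(n)      # hypothetical concave utility


if __name__ == "__main__":
    print("linear:     ", best_split(0.6, 100, linear))
    print("diminishing:", best_split(0.6, 100, diminishing))
```

With linear rewards the best split is the extreme one: every turn goes to the more likely object. With a concave utility the best split is strictly interior, so I divide my time between the blue door and the red window, which is the non-Goodharting behaviour described above.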

In the next post, we'll look at more extreme examples of AI-me acting in the world.

  1. The cost I'm willing to pay depends, of course, on the relative probabilities of the two remaining reward functions.
