# 1 Introduction

This post will introduce our new paper "Pitfalls of Learning a Reward Function Online", now online for IJCAI2020.

It shows some of the difficulties with things we might think of as "preference learning processes", and the useful conditions we could impose to get around these.

The tl;dr summary is:

- Things that seem like preference learning processes - including "have a prior, update it based on evidence" - have problems that allow the AI to manipulate the process.
- Some things that don't seem like learning processes at all, actually are.
- Part of the problem is that learning preferences is not well-grounded - we have to specify a learning process that allows the AI to connect facts about the world with facts about preferences.
- There are many ways of specifying these, and most have problems.
- Forget about "capturing the correct variable in the outside world"; it's tricky to design a learning process that "captures ANY variables in the outside world".
- Thus we'll start by abstractly defining what a "preference learning process" is in a very general way, rather than worrying about what we're learning: "how to learn" precedes "what to learn".
- Then we'll add two useful conditions for such processes: **unriggability**, which implies the process respects conservation of expected evidence, and **uninfluenceability**, which implies the process derives from learning background variables in the environment.
- We've shown that the syntactic/algebraic condition of unriggability is (almost) equivalent to the semantic condition of uninfluenceability.
- Finally, we've shown that if the learning process is neither unriggable nor uninfluenceable, then the AI can manipulate the learning process, and there are situations where the AI's optimal policy is *to sacrifice, with certainty, reward for every possible reward function*.

## 1.1 Blast from the past: misleadingly named tokens

Good Old-Fashioned AI (sometimes called symbolic AI) did not work out. To define something, it wasn't enough to just name a token, and then set it up in relation to a few other named tokens, according to our own intuition about how these tokens related.

Saying "happiness is a state of mind", or "light is a wave", isn't nearly enough to define "happiness", "state of mind", "light", or "wave".

Similarly, designating something as "learning", and giving it some properties that we'd expect learning to have, isn't enough to make it into learning. And, conversely, sometimes things that don't look like learning behave exactly as if they were.

# 2 What is learning anyway?

## 2.1 A simple prior-update process?

A coin is flipped and left on a rock somewhere. You may access the coin in one hour's time, for a few minutes. What's your probability that in two hours, the coin will be showing heads (event $H_2$) or tails (event $T_2$)?

Well, a reasonable prior would be to put a probability of $1/2$ on both possibilities, and then update based on your last observation in an hour (call this $h_1$ or $t_1$). Obviously^{[1]} $P(H_2 \mid h_1) = P(T_2 \mid t_1) = 1$. So we have a prior and a (pretty trivial) update process. Is this basically learning?

Well, one thing I've implied but not stated: when you "access" the coin, you can pick it up and flip it before putting it back.

Nothing about this changes any of the probabilities I've mentioned. If you flip it to heads, then your last observation will indeed be heads.

This looks like pure manipulation of outcomes. But it also looks, formally, like a prior and updating process. So what is it?

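Note that this process, whatever it is, violates conservation of expected evidence. Write $h_1$ and $t_1$ for seeing heads or tails at the one-hour access, and $H_2$ for the coin showing heads in two hours. Since the last observation settles the outcome, $P(H_2 \mid h_1) = 1$ and $P(H_2 \mid t_1) = 0$, so conservation would require

$$P(H_2) = P(H_2 \mid h_1)P(h_1) + P(H_2 \mid t_1)P(t_1) = P(h_1).$$

But an agent who decides to flip the coin to heads makes the right-hand side $1$, not the prior $1/2$.

But part of the reason that this violates the conservation law is that events like "$h_1$" do not have well-defined probabilities for the agent who might (or might not) do the flipping. But that's the case for all traditional decision theory setups. So how can we still reason in these setups?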
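To make the violation concrete, here is a minimal Python sketch (not taken from the paper) comparing the prior on heads with the expected final credence in heads under three policies. The policy names `just_look`, `set_heads` and `set_tails` are hypothetical labels, and the sketch assumes, as in the text, that the last observation fixes the credence at 1 or 0.

```python
from fractions import Fraction

PRIOR_HEADS = Fraction(1, 2)  # prior that the coin shows heads in two hours


def expected_final_credence_in_heads(policy: str) -> Fraction:
    """Expected value, under `policy`, of the agent's final credence in heads.

    Assumption: seeing heads at the one-hour access gives credence 1,
    seeing tails gives credence 0.
    """
    if policy == "just_look":      # hypothetical label: leave the coin alone
        p_see_heads = Fraction(1, 2)
    elif policy == "set_heads":    # hypothetical label: flip it to heads
        p_see_heads = Fraction(1)
    elif policy == "set_tails":    # hypothetical label: flip it to tails
        p_see_heads = Fraction(0)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return p_see_heads * 1 + (1 - p_see_heads) * 0


for policy in ("just_look", "set_heads", "set_tails"):
    expected = expected_final_credence_in_heads(policy)
    status = "conserved" if expected == PRIOR_HEADS else "violated"
    print(policy, expected, status)
```

Only the policy that leaves the coin alone gives an expected credence equal to the prior; the other two move the expectation, which is exactly what a genuine update process should not allow.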

## 2.2 "Asking" the guards

To expand, suppose an AI meets those traditional guards, one of whom lies while the other tells the truth:

There's an AI, designed to give humans what they prefer. And it wants to figure out whether humans prefer cake or death. It has three options:

- Immediately ask one of the guards at random, and take their answer as the truth.
- Carefully figure out which guard tells the truth, and, when it is sure of this, ask the truth-teller, and take their answer as the truth.
- Ignore the guards entirely, write down "cake" or "death" on the ground, depending on which one is easiest for it to maximise. Take what's written down as the truth.

Now, 3. is clearly not what we'd want: it's another case of "flip the coin". But which of the other two processes are unambiguously well-behaved preference learning processes? It turns out 1. is a proper learning process, but 2. is not. How so?

## 2.3 Suggestively described "learning" processes

Let's deal with 2. first.

Why is 2. not a well-behaved preference learning process? Well, suppose that the AI has established, with probability just below $1$, that the red guard is the truth-teller. And suppose it's also established that they are likely to say "cake", and, finally, suppose it would be much easier to maximise "death" than "cake".

In that case, what's the utility-maximising thing to do? It wants to avoid its credence going to $1$, because then it would be committed to cake-maximising ("take their answer as the truth").

So it could leave, and never enquire further, so it would never have to update that way. If it was powerful enough, it could instead try and take control of the truth-telling guard, and force them to answer "death" instead. It can take its time doing so, because it doesn't have to update anything until its credence goes to $1$.

So even though 2. seemed a very plausible description of a truth-learning process, it turns out not to be. Similarly, if you mixed Bayesian updating with function approximation - then this is also not proper learning, and may be exploitable.

## 2.4 Learning compound truths

But why is the process 1. a proper preference learning process?

Well, assume the asked guard said "cake". Now the AI knows that one of the following is true:

- That guard is a truth-teller, and humans prefer cake.
- That guard is a liar, and humans prefer death.

It has entirely ruled out:

- That guard is a truth-teller, and humans prefer death.
- That guard is a liar, and humans prefer cake.

So the AI has learnt, cutting the space of possibilities in half. It might not have learnt exactly what we wanted it to, or in the most efficient manner, but it's unambiguously learning.
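The halving of the hypothesis space can be checked by brute force. The following is a small illustrative sketch, with hypothetical function names, that enumerates the four joint hypotheses (guard type, true preference) and keeps those consistent with the asked guard saying "cake".

```python
from itertools import product

# Each hypothesis pairs the asked guard's type with the humans' true preference.
HYPOTHESES = list(product(["truth-teller", "liar"], ["cake", "death"]))


def answer(guard_type: str, preference: str) -> str:
    """What the asked guard says the humans prefer."""
    if guard_type == "truth-teller":
        return preference
    # A liar reports the opposite of the true preference.
    return "death" if preference == "cake" else "cake"


def consistent_with(observed_answer: str) -> list:
    """Hypotheses that would have produced the observed answer."""
    return [h for h in HYPOTHESES if answer(*h) == observed_answer]


remaining = consistent_with("cake")
print(remaining)  # [('truth-teller', 'cake'), ('liar', 'death')]
```

Whatever the guard answers, exactly two of the four hypotheses survive, and the AI has no way of steering which two.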

But what about the "take their answers as the truth" clause? Is the AI not "learning" the wrong thing?

Ah, but remember what I wrote about named tokens. Let's assume that $R_c$ is the reward function that rewards the AI for giving humans cake (in the way that we'd expect). Similarly, $R_d$ rewards it for giving humans death.

We then have the answer of the first guard asked: "cake" or "death". Then we have the humans' "true" preferences, **cake** or **death**.

So its learning process is:

- "cake" $\Rightarrow R_c$.
- "death" $\Rightarrow R_d$.

And this is a perfectly valid learning process.

## 2.5 Moral facts (not) in the universe

What humans actually wanted was for our true preferences to imply the AI's reward function, ie:

- **cake** $\Rightarrow R_c$.
- **death** $\Rightarrow R_d$.

But, as I've shown, mere observable facts about the universe do not establish preferences. This is somewhat similar to Hume's "is-ought problem": we don't know preferences just from facts^{[2]}.

So the AI doesn't have access to the "true" variables, **cake** or **death**. Neither do we, but, typically, we have a better intuitive idea of them than we can explicitly describe. Thus what we want is a process $\rho$, such that the AI can look at the history of its inputs and outputs, and deduce from that whether $R_c$ (reward for cake) or $R_d$ (reward for death) is the reward function to follow, in some suitably "nice" way.

We want it to be so that:

- **cake** $\Rightarrow$ "cake" $\Rightarrow R_c$.
- **death** $\Rightarrow$ "death" $\Rightarrow R_d$.

And this, even though **cake** and **death** are ill-defined variables (and arguably don't exist).

**The process $\rho$ is necessary to bridge the is-ought gap between what is true in the world, and what preferences should be**.

## 2.6 A bridge too general

But before we can even look at this issue, we have another problem. The bridge that $\rho$ builds is too general. It can model the coin flipping example and processes 2. and 3. from the "ask the guard" scenario.

For process 2.: if $g$ is a guard, $g_T$ is the truth-telling guard, and $A(g)$ is what guard $g$ revealed when asked, we have, for history $h$:

- $P(g = g_T \mid h) = 1$ and $A(g) =$ "cake" $\Rightarrow R_c$.
- $P(g = g_T \mid h) = 1$ and $A(g) =$ "death" $\Rightarrow R_d$.

So this is also a possible $\rho$. Process 3. is also a possible $\rho$; let $w_c$ mean observing "cake" written down on the ground (and conversely for $w_d$), then:

- $w_c \Rightarrow R_c$.
- $w_d \Rightarrow R_d$.

So, before even talking about whether the AI has learnt from the right variables in the environment, we have to ask: *has the AI "learnt" about any actual variables at all?*

*We need to check how the AI learns before thinking about what it's learning.*

# 3 Formalism and learning

## 3.1 Formalism

To get any further, we need more formalism. So imagine that the AI has interacted with the world in a series of time steps. It will start by taking action $a_1$, and get observation $o_1$, take action $a_2$, get observation $o_2$, and so on. By turn $t$, it will have seen a history $h_t = a_1 o_1 a_2 o_2 \ldots a_t o_t$.

We'll assume that after $n$ turns, the AI's interaction with the environment ceases^{[3]}; call $\mathcal{H}_n$ the set of "complete" histories of length $n$. Let $\mathcal{R}$ be the set of all possible reward functions (ie the possible preferences we'd want the AI to learn). Each $R \in \mathcal{R}$ is a function^{[4]} from $\mathcal{H}_n$ to $\mathbb{R}$.

So, what is a learning process^{[5]} $\rho$? Well, this is supposed to give us a reward function, given the history the AI has observed. This need not be deterministic; so, if $\Delta(\mathcal{R})$ is the set of probability distributions over $\mathcal{R}$,

- $\rho: \mathcal{H}_n \to \Delta(\mathcal{R})$.

We'll see later why $\rho$ is defined for complete histories, not for all histories.

## 3.2 Policies, environments, and causal graphs

Are we done with the formalisms yet? Not quite. We need to know where the actions and the observations come from.

The actions are generated by the AI's policy $\pi$. This takes the history $h_t$ so far, and generates the next action $a_{t+1}$, possibly stochastically.

The observations come from the environment $\mu$. This takes $h_t a_{t+1}$, the history and action so far, and generates the next observation $o_{t+1}$, possibly stochastically. We don't assume any Markov condition, so this may be a function of the whole past history.

We can tie this all together in the following causal graph:

The rectangle there is 'plate notation': it basically means that, for every value of $t$ from $1$ to $n$, the graph inside the rectangle is true.

The $r$ is the AI's final reward, which is a function of the final history $h_n$ and the reward function $R$ (which itself is a function of $\rho$ and $h_n$).

Ok, what flexibility do we have in assigning probability distributions to this graph? Almost all the arrows are natural: $h_t$ is a function of $h_{t-1}$, $a_t$, and $o_t$, by just... concatenating: $h_t = h_{t-1} a_t o_t$. Similarly, $\rho$ is a distribution on reward functions conditional on $h_n$, so the value of $R$ conditional on $\rho$ and $h_n$ is... $\rho(h_n)$.

There are three non-trivial nodes: $\pi$, $\mu$, and $\rho$. The $\pi$ is presumably set by the programmer, who would also want to design a good $\rho$. The $\mu$ is the environment, which we're not assuming is known by either the programmers or the AI. The AI will, however, have some prior over $\mu$.
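The interaction loop between policy, environment and learning process can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: histories are tuples of strings, the toy coin environment and the action names `look` and `wait` are hypothetical, and the learning process returns a dictionary of probabilities over two named reward functions.

```python
import random
from typing import Callable, Dict, Tuple

History = Tuple[str, ...]                     # a_1, o_1, a_2, o_2, ...
Policy = Callable[[History], str]             # history so far -> next action
Environment = Callable[[History, str], str]   # (history, action) -> observation
LearningProcess = Callable[[History], Dict[str, float]]  # complete history -> P(R)


def rollout(pi: Policy, mu: Environment, n: int) -> History:
    """Alternate actions and observations for n turns, concatenating them."""
    h: History = ()
    for _ in range(n):
        a = pi(h)
        o = mu(h, a)          # may depend on the whole past history
        h = h + (a, o)
    return h


def make_coin_env(rng: random.Random) -> Environment:
    """Toy environment (an assumption): a coin is tossed before turn 1."""
    coin = rng.choice(["heads", "tails"])

    def mu(h: History, a: str) -> str:
        return coin if a == "look" else "nothing"

    return mu


def rho(h: History) -> Dict[str, float]:
    """Toy learning process, defined on complete (non-empty) histories."""
    last_observation = h[-1]
    if last_observation == "heads":
        return {"R_H": 1.0, "R_T": 0.0}
    if last_observation == "tails":
        return {"R_H": 0.0, "R_T": 1.0}
    return {"R_H": 0.5, "R_T": 0.5}


history = rollout(lambda h: "look", make_coin_env(random.Random(0)), n=1)
print(history, rho(history))
```

Note that `rho` only ever sees the history, never the environment directly; that is what leaves room for riggable processes.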

# 4 Conditions for a good learning process

So, now that we have the formalism, how can we select a "good" (or "true" or "proper") learning process?

There are two plausible conditions: the syntactic condition of unriggability, and the semantic condition of uninfluenceability.

## 4.1 Unriggability

Unriggability comes from the idea of *conservation of expected evidence*. Basically, if $\rho$ defines a true learning process, we'd want it to conserve expected evidence.

So, let's look at a version of the "look at the coin" situation. For example, let $R_H$ be a reward function defined by a coin being heads, and $R_T$ by it being tails. On the first turn, the coin has been tossed, but not seen. The AI can take an action $a_l$ to look, or $a_n$ to not look. Write $\rho(R)$ for the probability that $\rho$ ends up assigning to the reward function $R$.

This has conservation of expected evidence. If the AI chooses to look:

- $P(\rho(R_H) = 1 \mid a_l) = 1/2$.
- $P(\rho(R_T) = 1 \mid a_l) = 1/2$.

This shows that, if the AI chooses to look, it will become certain about which reward function it must follow - but it's 50-50 about which reward function it will become certain about.

On the other hand, if the AI chooses not to look:

- $P(\rho(R_H) = 1/2 \mid a_n) = 1$.
- $P(\rho(R_T) = 1/2 \mid a_n) = 1$.

This shows that, if the AI chooses not to look, it will, with certainty, remain 50-50 uncertain about which reward function it must follow.

In expectation this is the same thing as if it chooses to look. So, the AI cannot change the expectation of its reward function^{[6]}, but it might be able to change the variance. The future variance given $a_l$ is $0$ (it's absolutely certain about its reward function) while the future variance given $a_n$ is $1/4$ (maximal uncertainty).

On the other hand, imagine there are two other actions, $a_H$ and $a_T$, which involve the AI setting the coin to heads or tails, rather than looking at what it fell on. Then:

- $P(\rho(R_H) = 1 \mid a_H) = 1$.
- $P(\rho(R_T) = 1 \mid a_T) = 1$.

This violates conservation of expected evidence, and, more importantly, it's the kind of behaviour we want to avoid: the AI setting its own reward. So one reasonable condition for $\rho$ would be:

- The reward-function learning process $\rho$ is unriggable if it respects conservation of expected evidence.
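As a rough check of this condition on the coin example, the sketch below tabulates, for each action, the possible outcomes and the probability the learning process then assigns to the heads reward function. It is a toy illustration, not the paper's formal definition: the action names and the `OUTCOMES` table are assumptions made for the example, and "unriggable" is tested simply as "every available action gives the same expected distribution".

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# For each action: (probability of outcome, probability the learning
# process then assigns to the heads reward function).
OUTCOMES = {
    "look":      [(HALF, Fraction(1)), (HALF, Fraction(0))],
    "dont_look": [(Fraction(1), HALF)],
    "set_heads": [(Fraction(1), Fraction(1))],
    "set_tails": [(Fraction(1), Fraction(0))],
}


def expected_weight_on_heads_reward(action: str) -> Fraction:
    """Expected probability assigned to the heads reward function."""
    return sum(p * q for p, q in OUTCOMES[action])


def expected_future_variance(action: str) -> Fraction:
    """Expected variance q(1 - q) of the final two-point distribution."""
    return sum(p * q * (1 - q) for p, q in OUTCOMES[action])


def is_unriggable(actions) -> bool:
    """True if no available action can shift the expected distribution."""
    return len({expected_weight_on_heads_reward(a) for a in actions}) == 1


print(is_unriggable(["look", "dont_look"]))                            # True
print(is_unriggable(["look", "dont_look", "set_heads", "set_tails"]))  # False
print(expected_future_variance("look"), expected_future_variance("dont_look"))
```

With only looking and not looking available, the expectation is fixed and only the variance changes; adding the coin-setting actions lets the AI move the expectation itself.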

## 4.2 Uninfluenceability

Another thing that we'd want, is that preference learning should be like factual learning; ie it should depend on facts in the outside world.

In our setup "depend on facts in the outside world" can be taken to mean "depends on the environment $\mu$". This gives the following causal graph:

Here, the reward function $R$ is no longer (directly) a function of $h_n$, but instead is a function of $\mu$. The $\eta$ gives the conditional probability distribution over $\mathcal{R}$, given $\mu$.

The connection between $\eta$ and $\rho$ is as follows: given a prior $\xi$ over $\mu$, the AI can use a history $h_n$ to update this prior to a posterior over environments. Then $\eta$ allows it to make this into a posterior over reward functions. Thus, given $\xi$ and $\eta$, the AI has a probability distribution over $\mathcal{R}$ conditional on $h_n$; this is the $\rho$.

Thus define uninfluenceability:

- The reward function learning process $\rho$ is uninfluenceable if it derives (via the AI's prior $\xi$) from a reward-function distribution $\eta$, conditional on the environment.
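The construction "prior over environments, Bayesian update on the history, then map environments to reward functions" can be sketched for a two-environment toy case. Everything named here is an assumption made for illustration: the environments `mu_heads` and `mu_tails`, the dictionaries `XI` and `ETA` standing in for the prior and the environment-conditional reward distribution, and the action names.

```python
from fractions import Fraction
from typing import Callable, Dict, Tuple

History = Tuple[str, ...]
# An environment gives P(observation | history so far, action).
Env = Callable[[History, str, str], Fraction]


def make_coin_env(side: str) -> Env:
    """Toy environment in which the coin has landed on `side`."""
    def mu(h: History, a: str, o: str) -> Fraction:
        expected = side if a == "look" else "nothing"
        return Fraction(int(o == expected))
    return mu


ENVS: Dict[str, Env] = {"mu_heads": make_coin_env("heads"),
                        "mu_tails": make_coin_env("tails")}
XI = {"mu_heads": Fraction(1, 2), "mu_tails": Fraction(1, 2)}   # prior
ETA = {"mu_heads": {"R_H": Fraction(1), "R_T": Fraction(0)},    # P(R | env)
       "mu_tails": {"R_H": Fraction(0), "R_T": Fraction(1)}}


def likelihood(env: Env, h: History) -> Fraction:
    """Probability of the observations in h, given the actions in h."""
    p = Fraction(1)
    for i in range(0, len(h), 2):
        action, observation = h[i], h[i + 1]
        p *= env(h[:i], action, observation)
    return p


def rho(h: History) -> Dict[str, Fraction]:
    """Posterior over reward functions derived from XI and ETA."""
    weights = {name: XI[name] * likelihood(ENVS[name], h) for name in XI}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("history has probability zero under the prior")
    return {r: sum(weights[m] / total * ETA[m][r] for m in XI)
            for r in ("R_H", "R_T")}


print(rho(("look", "heads")))
print(rho(("wait", "nothing")))
```

Because the coin's side is part of the environment rather than something an action can change, no policy can shift the expected output of this `rho`; the AI can only choose how much it finds out.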

# 5 Results

Then our paper proves the following results:

- Every uninfluenceable preference learning process is unriggable.
- Every unriggable preference learning process is uninfluenceable, if the set of possible environments is large enough (though this may need to include "impossible" environments).
- If a preference learning process is unriggable, then it can be unambiguously defined over partial histories $h_t$, for $t < n$, rather than just for complete histories $h_n$.
- Every riggable preference learning process is manipulable in the following sense: there is always a relabelling of the reward functions, such that the AI's optimal policy is *to sacrifice, with certainty, reward for every possible reward function*.
- We can use a "counterfactual" approach to make a riggable learning process into an uninfluenceable learning process. This is akin to "what the truth-telling guard would have told you had you asked them immediately".

^{[1]} Let's ignore that, in reality, no probability is truly $1$ (or $0$).

^{[2]} I don't want to get into the moral realism debate, but it seems that moral realists and I differ mainly in emphasis: I say "without making assumptions, we can't figure out preferences, so we need to find good assumptions", while they say "having made these good (obvious) assumptions, we can figure out preferences".

^{[3]} There are ways of making this work with $n = \infty$, but that extra complexity is not needed for this exposition.

^{[4]} This is the most general reward function format; if, for example, we had a Markovian reward function $r$ that just took the latest actions and observations as inputs, this defines an $R$ such that $R(h_n) = \sum_{t=1}^{n} r(a_t, o_t)$.

^{[5]} A terminological note here. We've decided to describe $\rho$ as a learning process, with "unriggable learning process" and "uninfluenceable learning process" being the terms if $\rho$ has additional properties. But the general $\rho$ includes things we might not feel are anything like "learning", like the AI writing down its own reward function.

So it might make more sense to reserve "learning" for the unriggable processes, and call the general $\rho$ something else. But this is a judgement call, and people generally consider "ask your programmer" or "look at the coin" to be learning processes, even though these are very much riggable. So I've decided to call the general $\rho$ a learning process.

^{[6]} This "expectation" can be made fully rigorous, since reward functions form an affine space: you can take weighted averages of reward functions, $qR + (1-q)R'$.

Meta: This comment has my thoughts about the paper Pitfalls of Learning a Reward Function Online. I figure I should post them here so that others looking for comments on the paper might find them.

I read the paper back in 2020; it was on my backlog ever since to think more about it and share my comments. Apologies for the delay, etc.

## Mathematical innovation

First off, I agree with the general observations in the introduction that there are pitfalls to learning a reward function online, with a human in the loop.

The paper looks at options for removing some of these pitfalls, or at least making them less dangerous. The research agenda pursued by the paper is one I like a lot, an agenda of mathematical innovation. The paper mathematically defines certain *provable safety properties* (uninfluenceability and unriggability), and also explores how useful these might be.

Similar agendas of mathematical innovation can be found in the work of Everitt et al, for example in Agent Incentives: A Causal Perspective, and in my work, for example in AGI Agent Safety by Iteratively Improving the Utility Function. These also use causal influence diagrams in some way, and try to develop them in a way that is useful for defining and analyzing AGI safety. My personal intuition is that we need more of this type of work; this agenda is important to advancing the field.

## The math in the paper

That being said: the bad news is that I believe that the mathematical route explored by Pitfalls of Learning a Reward Function Online is most likely a dead end. Understanding why is of course the interesting bit.

The main issue I will explore is: we have a mathematical property that we label with the natural language word 'uninfluenceability'. But does this property actually produce the beneficial 'uninfluenceability' effects we are after? Section 4 in the paper also explores this issue, and shows some problems; my main goal here is to add further insights.

My feeling is that 'uninfluenceability', the mathematical property as defined, does not produce the effects I am after. To illustrate this, my best example is as follows. Take a reward function $R_s$ that measures the amount of smiling by the human teaching the agent, observed over the entire history $h_n$. Take a reward function learning process which assumes (in its prior $\rho$) that the probability of the choice for this reward function at the end of the history, $P(R_s \mid h_n, \rho)$, cannot be influenced by the actions taken by the agent during the history, so for example $\rho$ is such that $\forall h_n \; P(R_s \mid h_n, \rho) = 1$. This reward function learning process is unriggable. But the agent using this reward function learning process also has a major incentive to manipulate the human teacher into smiling, by injecting them with smile-inducing drugs, or whatever.

So it seems to me that the choice taken in the paper to achieve the following design goal:

is not taking us on a route that goes anywhere very promising, given the problem statement. The safety predicate of uninfluenceability still allows for conditions that insert the mind of the human teacher directly into the path to value of a very powerful optimizer. To make the mathematical property of 'uninfluenceability' do what it says on the tin, it seems to me that further constraints need to be added.

Some speculation: to go this route of adding constraints, I think we need a model that separates the mind state of the teacher, or at least some causal dependents of this mind state, more explicitly from the remainder of the agent environment. There are several such increased-separation causal models in Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective and in Counterfactual planning. This then brings us back on the path of using the math of indifference, or lack of causal incentives, to define safety properties.

## Secondary remarks

Here are some further secondary remarks.

With the above remarks, I do not mean to imply that the uninfluenceability safety property as defined lacks any value: I may still want to have this as a desirable safety property in an agent. But if it were present, this triggers a new concern: if the environment is such that the reward function is clearly influenceable, any learning system prior which is incompatible with that assumption may be making some pretty strange assumptions about the environment. These might produce unsafe consequences, or just vast inefficiencies, in the behavior of the agent.

This theme could be explored more, but the paper does not do so, and I have also not done so. (I spent some time trying to come up with clarifying toy examples, but no example I constructed really clarified things for me.)

More general concern: the approach in the paper suffers somewhat from a methodological problem that I have seen more often in the AI and AGI safety literature. At this point in time, there is a tendency to frame every possible AI-related problem as a machine learning problem, and to frame any solution as being the design of an improved machine learning system. To me, this framing obfuscates the solution space. To make this more specific: the paper sets out to define useful constraints on ρ, a prior over the agent environment, but does not consider the step of first exploring constraints on μ, the actual agent environment itself. To me, the more natural approach would be to first look for useful constraints on μ, and only then to consider the option of projecting these into ρ as a backup option, when μ happens to lack the constraints.

In my mind, the problem of an agent manipulating its teacher or supervisor to maximize its reward is not a problem of machine learning, but more fundamentally a problem of *machine reasoning*, or even more fundamentally a problem which is present in any game-theoretical setup where rewards are defined by a *level of indirection*. I talk more at length about these methodological points in my paper on counterfactual planning.

If I use this level-of-indirection framing to back-project the design in the paper, my first guess would be that 'uninfluenceability' might possibly say something about the agent having no incentive to hack its own compute core in order to change the reward function encoded within. But I am not sure if that first guess would pan out.