This is the eighth (and, for now, final) post in the theoretical reward learning sequence, which starts in this post. Here, I will provide a few pointers to anyone who might be interested in contributing to further work on this research agenda, in the form of a few concrete and shovel-ready open problems, a few ideas on how those problems may be approached, and a few useful general insights about this problem setting.
The theoretical reward learning research agenda tackles one of the core difficulties of AI safety in a fairly direct way — namely, the difficulty of specifying what we want AI systems to do. At the same time, this research agenda has also proven to be quite tractable; I have been able to produce a lot of results in a relatively short period of time, many of which both contribute to a deeper theoretical basis for reward learning and have more immediate practical implications. If this research agenda gets a bit more attention, then I think it is entirely realistic to expect that we could have something like a “mathematical theory of outer alignment” within a few years. This would solve a meaningful number of important subproblems of the broader AI safety problem (even though there are, of course, many important problems it does not cover). It would elucidate the relationship between objectives and outcomes, it would tell us how to learn good task representations, and it would tell us what the limitations of these representations may be (and potentially also how to handle those limitations). If you are looking to contribute to AI safety research, if you find my motivating assumptions compelling, and if you are inclined towards theoretical research over empirical research, then you may want to consider contributing to the theoretical reward learning research agenda. In this post, I want to provide a few starting points for how to do just that.
The first post in this sequence lists several (open-ended) research problems within the theoretical reward learning research agenda. In addition to this, however, I also want to provide some more concrete research questions, together with a few thoughts on how to approach them. This list is by no means exhaustive; it is simply a list of fairly well-defined problems that should be shovel-ready straight away:
Almost all of my papers that relate to the theoretical reward learning research agenda use Markov Decision Processes (MDPs) as their main theoretical model. This comes with a few important limitations, including notably the assumptions that the environment is fully observable and that it only contains one agent. An important problem is therefore to extend and generalise all those results to richer classes of environments. In the simplest case, this may be POMDPs or Markov games, but we can also consider even more general environments.
Note that this problem is much less trivial than it may at first seem, at least if we are to solve it in full generality. The reason for this is that we quickly encounter problems if we try to create a fully general formalisation of a “decision problem”. For example, one may naïvely think that the most general formalisation of a decision problem is the “general RL setting”, which is a class of problems that are similar to MDPs, but where the transition function and the reward function are allowed to depend arbitrarily on the entire history of past transitions (instead of only on the current transition). For example, this setting trivially subsumes POMDPs. However, even in the general RL setting, it is relatively straightforward to prove that there always exists a deterministic optimal policy. This means that the general RL setting cannot capture game-theoretic dynamics in a satisfactory way (since game-theoretic problems may require the agent to randomise its actions), and so it is not actually fully general.
Of course, we can amend the general RL setting by adding other agents. However, this does not seem entirely satisfactory either. After all, an “agent” is not some kind of ontologically basic entity. The properties of agents emerge from (and should be derivable from) the properties of their parts. Moreover, whether or not a system constitutes an “agent” is plausibly not a binary property, etc. More generally, the fact that the general RL setting cannot model agents in a satisfactory way probably means that it is missing something more basic. For example, decision problems involve dynamics that are similar to game-theoretic dynamics, even when they do not involve other agents. Thus, while we can improve the general RL setting by adding other agents to the model, the resulting formalism is likely to still exclude broad classes of situations.
One way to approach this is to ask why the general RL setting is unable to capture game-theoretic situations, and then generalise from there. My current guess is that the main issue is that game-theoretic problems involve situations where your uncertainty about the outcome of an action is correlated with your uncertainty about what action you are about to take. For example, if you are playing rock-paper-scissors, you may think that the more likely you are to play rock, the more likely it is that your opponent will play paper, etc. This would in turn suggest a formalisation of a decision problem where the transition function of the environment is allowed to depend not only on your actions, but also on the probability with which you take those actions. A setup like this is explored in this paper.
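To make this concrete, here is a minimal Python sketch of a one-step decision problem whose outcome depends on the agent's action probabilities, and not just on the action that is actually sampled: the opponent in rock-paper-scissors best-responds to the agent's mixed strategy. This is only meant to illustrate the kind of object that such a formalisation would have to handle; it is not taken from the paper mentioned above, and all names in it are made up.

```python
import numpy as np

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # maps a move to the move that beats it

def opponent_policy(agent_probs):
    """The opponent best-responds to the agent's *mixed strategy*:
    it plays whatever beats the agent's most probable action (ties broken by index)."""
    most_likely = ACTIONS[int(np.argmax(agent_probs))]
    return BEATS[most_likely]

def play_round(agent_probs, rng):
    """The outcome distribution depends on agent_probs itself, not just on the sampled action."""
    agent_action = rng.choice(ACTIONS, p=agent_probs)
    opponent_action = opponent_policy(agent_probs)
    if agent_action == opponent_action:
        return 0.0                                   # draw
    return 1.0 if BEATS[opponent_action] == agent_action else -1.0

rng = np.random.default_rng(0)
for probs in ([1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]):
    returns = [play_round(np.array(probs), rng) for _ in range(10_000)]
    print(probs, "average return:", round(float(np.mean(returns)), 3))
```

Note that the expected return of "always play rock" cannot be recovered from the returns of the individual actions, since changing the probabilities changes the environment's response; this is the sense in which the transition function depends on the policy itself.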
At any rate, the core of this open problem is to identify interesting, more general formalisations of “decision problems”, and to determine whether the results found in previous papers also apply within these more general formalisations. This could use formalisations that are already well-known and well-studied, such as POMDPs and Markov games, or it could involve new formalisations. It is likely to involve solving somewhat hard mathematical problems.
One of the core problems of the theoretical reward learning research agenda is to determine which reward learning methods are guaranteed to converge to reward functions that are “close” to the underlying true reward function. To answer this, we must first decide how to quantify the differences between reward functions. In this paper, we provide one possible answer to that question in the form of STARC metrics (see also this post). However, there is a lot of room to improve on and generalise this solution. STARC metrics are essentially a continuous measurement of how much the policy orderings of two reward functions differ relative to a particular transition function and discount factor. We could therefore create more demanding metrics by requiring that the reward functions have a similar policy ordering for all transition functions. Alternatively, we could also create less demanding metrics by loosening the requirement that the reward functions must have similar preferences between all policies — perhaps it could be enough for them to have similar optimal policies, for example? Or perhaps there are different ways to do this quantification altogether. Finding good ways to quantify these differences is particularly important, because basically all other results will directly rely on this choice.
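For intuition, here is a small Python sketch of the kind of quantity that such a metric is tracking. It is not the actual STARC construction from the paper; it is just a crude empirical proxy that samples random policies in a small random tabular MDP and measures how often two reward functions disagree about which of two policies is better.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# A random tabular MDP: T[s, a] is a distribution over next states, mu0 the initial distribution.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = rng.dirichlet(np.ones(n_states))

def J(policy, reward):
    """Expected discounted return of a stationary policy.
    policy: (n_states, n_actions) action probabilities.
    reward: (n_states, n_actions, n_states) reward for each transition."""
    P = np.einsum("sa,sap->sp", policy, T)                 # state-to-state transition matrix
    r = np.einsum("sa,sap,sap->s", policy, T, reward)      # expected one-step reward per state
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)   # V = r + gamma * P V
    return float(mu0 @ v)

def ordering_disagreement(R1, R2, n_policies=200):
    """Fraction of sampled policy pairs on which R1 and R2 order the policies differently
    (a crude stand-in for a principled metric such as a STARC metric)."""
    policies = rng.dirichlet(np.ones(n_actions), size=(n_policies, n_states))
    J1 = np.array([J(p, R1) for p in policies])
    J2 = np.array([J(p, R2) for p in policies])
    pairs = list(combinations(range(n_policies), 2))
    flips = sum(1 for i, j in pairs if (J1[i] - J1[j]) * (J2[i] - J2[j]) < 0)
    return flips / len(pairs)

R = rng.normal(size=(n_states, n_actions, n_states))
print("R vs 2*R + 1:   ", ordering_disagreement(R, 2 * R + 1))                  # same policy ordering: ~0
print("R vs random R': ", ordering_disagreement(R, rng.normal(size=R.shape)))   # unrelated reward: > 0
```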
Here are a few examples of concrete mathematical questions that, to the best of my knowledge, haven’t been solved, and which could provide guidelines for how to do this:
Given solutions to the above questions, the next question is whether the corresponding equivalence conditions can be translated into metrics (in the same way as how STARC metrics are a continuous measurement of how much the policy orderings of two reward functions differ relative to a particular transition function). Alternatively, there may also be other completely different ways to usefully quantify the differences between reward functions.
In practice, we should not expect to be able to create reward functions that perfectly capture our preferences. This raises the question of how we should optimise reward functions in light of this fact. Can we create special “conservative” optimisation methods that yield provable guarantees? There is a lot of existing work on this problem (including, for example, quantilizers). However, I think there are several promising directions for improving this work, especially once more parts of the theoretical reward learning agenda have been solved.
For example, much of the existing work on conservative optimisation makes no particular assumptions about how the reward function is misspecified. A better understanding of reward learning methods would give us a better understanding of how learnt reward functions are likely to differ from the “true” reward function, and this understanding could potentially be used to create better methods for conservative optimisation. For example, in this paper, we derive a (tight) criterion that describes how a policy may be optimised according to some proxy reward while still ensuring that the true reward doesn’t decrease, based on the STARC-distance between the proxy reward and the true reward. In other words, suppose that we have some reward learning method that has produced a reward function $\hat{R}$, and that we expect (based on the number of training samples, etc) that $\hat{R}$ is likely to have a STARC-distance of at most $\epsilon$ to the underlying true reward function $R$. The criterion from this paper then tells us how to optimise any policy for $\hat{R}$, while ensuring that its reward according to $R$ cannot decrease. Other distance metrics between reward functions may similarly produce other conservative optimisation methods.
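To illustrate the flavour of such results, here is a hedged sketch of a much simpler, margin-based rule (it is not the actual STARC-based criterion from the paper, and the function name is made up): if we know some bound $\delta$ such that the true return and the proxy return of every policy differ by at most $\delta$ (a bound of this kind is derived later in this post), then only accepting policy updates that improve the proxy return by more than $2\delta$ guarantees that the true return never decreases.

```python
def accept_update(j_proxy_old, j_proxy_new, delta):
    """Accept a policy update only if the proxy improvement exceeds 2 * delta.

    If |J_true(pi) - J_proxy(pi)| <= delta for every policy pi, then
        J_true(new) >= J_proxy(new) - delta
                    >  (J_proxy(old) + 2 * delta) - delta
                    =  J_proxy(old) + delta
                    >= J_true(old),
    so accepted updates can never decrease the true value.
    """
    return j_proxy_new - j_proxy_old > 2 * delta

# Hypothetical numbers: a proxy improvement of 0.3 with delta = 0.2 is rejected,
# because it could still hide a decrease in true value.
print(accept_update(j_proxy_old=1.0, j_proxy_new=1.3, delta=0.2))  # False
print(accept_update(j_proxy_old=1.0, j_proxy_new=1.6, delta=0.2))  # True
```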
One interesting case of conservative optimisation is the case where we have multiple reward functions produced from different sources. If we have several sources of information about a latent variable (such as a reward function), then we can normally simply combine the evidence using Bayes' theorem. However, if the observation models are misspecified, then the posterior distributions will conflict, and the straightforward way to combine them is likely to lead to nonsensical results. This issue is explained in this paper. It applies to the problem of inferring human preferences: there are several sources of information we can rely on, but they can often lead to conflicting conclusions. For example, if you ask people what kind of chocolate they prefer, they will typically say that they like dark chocolate, but if you let people choose which type of chocolate to eat, they will typically pick milk chocolate. If we take these two pieces of information at face value, they lead to incompatible conclusions. The reason for this is misspecification: humans sometimes misreport their preferences, and sometimes do things that are against their preferences. This means that we cannot combine this information in the straightforward way, and that special methods are required.
We can set up this problem as follows: suppose we have $n$ reward learning methods, and that they have produced distributions $D_1, \dots, D_n$ over reward functions. We know that any (or all) of these may be subject to some degree of misspecification, and we wish to derive a policy $\pi$ that is reasonably good in light of what we know. There are several considerations that we might want to satisfy:
There are several different ways to approach this problem. A simple option might be to let $\pi$ maximise expected reward under a mixture of the distributions $D_1, \dots, D_n$. Alternatively, one could explore solution concepts from social choice theory (where each $D_i$ is considered to be a voter). A good solution to this problem would provide methods for aggregating misspecified (but informative) preferences in a useful but conservative way.
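As a toy illustration of the first option and a more conservative variant of it (this is only a sketch; the posteriors, the bias values, and all names are made up, and it is not a proposal from any of the papers), one can sample reward functions from each $D_i$ and then score candidate policies by their average or worst-case expected return across the posteriors:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] = distribution over next states
mu0 = rng.dirichlet(np.ones(n_states))                            # initial state distribution

def J(policy, reward):
    """Expected discounted return of a stationary policy in the tabular MDP above."""
    P = np.einsum("sa,sap->sp", policy, T)
    r = np.einsum("sa,sap,sap->s", policy, T, reward)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(mu0 @ v)

# Three hypothetical "reward learning methods", each returning samples from its posterior D_i.
# Here they are just noisy, differently-biased views of one underlying reward function.
true_R = rng.normal(size=(n_states, n_actions, n_states))
def sample_posterior(bias, n_samples=10):
    return [true_R + bias + 0.3 * rng.normal(size=true_R.shape) for _ in range(n_samples)]
posteriors = [sample_posterior(bias) for bias in (0.0, 0.5, -0.5)]

# Candidate policies: random stationary policies (a stand-in for a proper policy search).
candidates = rng.dirichlet(np.ones(n_actions), size=(200, n_states))

def score(policy, aggregate):
    # Average the expected return within each posterior, then aggregate across posteriors.
    per_posterior = [np.mean([J(policy, R) for R in samples]) for samples in posteriors]
    return aggregate(per_posterior)

for name, aggregate in [("mean over posteriors", np.mean), ("worst case over posteriors", np.min)]:
    best = max(candidates, key=lambda p: score(p, aggregate))
    print(name, "-> value of chosen policy under the underlying reward:", round(J(best, true_R), 3))
```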
Reward functions are the most popular method for expressing temporally extended tasks, but they are not the only option. For example, there is also multi-objective RL, temporal logic, constrained RL, risk-averse RL, and reward machines. Moreover, the expressivities of many of these methods are incomparable (for some results on this topic, see this paper). For example, linear temporal logic (LTL) can express the instruction “press button A a finite number of times”, which reward functions cannot, but reward functions can express “pressing button A is better than pressing button B, but pressing button B is better than pressing button C”, which LTL cannot. It would be interesting to have a more complete characterisation of what these specification methods can and cannot express, and how they compare to each other. This paper makes a start on this problem, but it could be improved and extended (for example, by using more realistic assumptions about how a specification may be created, or by providing necessary-and-sufficient conditions that characterise exactly what objectives a given specification language can capture).
Another interesting question could be to determine when a given specification language is able to approximate another language, even when it cannot capture it exactly. For example, multi-objective RL (MORL) is more expressive than (scalar) reward functions (see this paper). However, perhaps it is the case that for any MORL specification, there exists a scalar reward function $R$ such that maximisation of $R$ still satisfies some regret bound with respect to the original MORL specification? And so on. These questions are important, because all other results have to assume that instructions/preferences are expressed within some particular format. An answer to this question would tell us something about which format is correct, and what the cost of assuming the wrong format might be.
In addition to the open problems above, I also want to give a longer list with a few shorter research proposals (but without providing as much detail as above):
In addition to this, there are also many open problems listed in earlier papers, especially this one.
I also want to share a few neat mathematical insights which I have found to be particularly useful for solving problems in this area. These insights are:
Given an MDP with state space $S$, a potential function is a function $\Phi : S \to \mathbb{R}$ that maps each state to a real number. We say that two reward functions $R_1$ and $R_2$ differ by potential shaping with $\Phi$ (for discount $\gamma$) if

$$R_2(s, a, s') = R_1(s, a, s') + \gamma \cdot \Phi(s') - \Phi(s)$$
for all states $s, s'$ and all actions $a$. Potential shaping was first introduced in this paper, where they show that any two reward functions that differ by potential shaping have the same optimal policies in all MDPs. Moreover, they also show that potential shaping transformations are the only additive reward transformations with this property.
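As a quick sanity check of this definition, the following sketch (with a random MDP and a random potential; all names are made up) verifies numerically that two reward functions related by potential shaping assign every policy returns that differ by the same policy-independent constant, namely $\mathbb{E}_{s_0 \sim \mu_0}[\Phi(s_0)]$; this is why they induce the same policy ordering, and in particular the same optimal policies.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities
mu0 = rng.dirichlet(np.ones(n_states))                            # initial state distribution

def J(policy, reward):
    """Expected discounted return of a stationary policy."""
    P = np.einsum("sa,sap->sp", policy, T)
    r = np.einsum("sa,sap,sap->s", policy, T, reward)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(mu0 @ v)

R1 = rng.normal(size=(n_states, n_actions, n_states))
phi = rng.normal(size=n_states)

# R2(s, a, s') = R1(s, a, s') + gamma * phi(s') - phi(s)
R2 = R1 + gamma * phi[None, None, :] - phi[:, None, None]

# For every policy, the returns differ by the same constant: J1 - J2 = E_{s0 ~ mu0}[phi(s0)].
for _ in range(5):
    policy = rng.dirichlet(np.ones(n_actions), size=n_states)
    print(round(J(policy, R1) - J(policy, R2), 8), "==", round(float(mu0 @ phi), 8))
```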
Potential shaping has a lot of neat properties that it is worth being aware of, which aren’t explicitly discussed in the original paper. In particular:
The above claims are fairly easy to prove, but for explicit proofs, see this paper (Lemma B.1, B.3, and B.4, and Theorem 3.11).
Given a policy $\pi$, let its “occupancy measure” $\eta^\pi$ be the $|S \times A \times S|$-dimensional vector in which the dimension corresponding to the transition $(s, a, s')$ is equal to

$$\eta^\pi(s, a, s') = \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{P}(s_t = s, a_t = a, s_{t+1} = s'),$$

where the probability is over trajectories sampled from $\pi$, given a particular transition function and initial state distribution.
Now note that the expected discounted cumulative reward $J(\pi)$ of a policy $\pi$ is given by the dot product of the occupancy measure $\eta^\pi$ of $\pi$ and the reward function $R$, if $R$ is viewed as an $|S \times A \times S|$-dimensional vector. That is, $J(\pi) = \eta^\pi \cdot R$.
This means that occupancy measures let us split the policy evaluation function J into two parts, the first of which is independent of the reward function, and the second of which is linear in the reward function. This is very helpful for proving things!
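Here is a small sketch (again with a random tabular MDP; all names are made up) that computes the occupancy measure of a policy by solving for the discounted state-visitation frequencies, and then checks that the dot product $\eta^\pi \cdot R$ agrees with the return computed by ordinary policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] = dist over next states
mu0 = rng.dirichlet(np.ones(n_states))
R = rng.normal(size=(n_states, n_actions, n_states))
policy = rng.dirichlet(np.ones(n_actions), size=n_states)

# Discounted state-visitation frequencies: m = mu0 + gamma * P^T m, where P is the
# state-to-state transition matrix induced by the policy.
P = np.einsum("sa,sap->sp", policy, T)
m = np.linalg.solve(np.eye(n_states) - gamma * P.T, mu0)

# Occupancy measure over transitions: eta(s, a, s') = m(s) * policy(a | s) * T(s, a, s').
eta = m[:, None, None] * policy[:, :, None] * T

# Policy evaluation the usual way: V = r_pi + gamma * P V, and J = mu0 . V.
r_pi = np.einsum("sa,sap,sap->s", policy, T, R)
v = np.linalg.solve(np.eye(n_states) - gamma * P, r_pi)

print("eta . R     =", float((eta * R).sum()))
print("J(pi) via V =", float(mu0 @ v))   # the two numbers should match
```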
The following facts are also useful to know about the occupancy measure:
These facts greatly simplify a lot of proofs about reward functions in arbitrary MDPs to relatively simple geometric problems!
Many problems within the theoretical reward-learning research agenda involve proving regret bounds. As such, I will provide a simple way to derive regret bounds, which can be extended and modified for more complicated cases:
Proposition: If $|R_1(s, a, s') - R_2(s, a, s')| \le \epsilon$ for all transitions $(s, a, s')$, then for any policy $\pi$, we have that $|J_1(\pi) - J_2(\pi)| \le \epsilon / (1 - \gamma)$ (where $J_i$ denotes the policy evaluation function of the reward function $R_i$).

Proof: This follows from straightforward algebra:

$$|J_1(\pi) - J_2(\pi)| = \left| \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_1(s_t, a_t, s_{t+1}) \right] - \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_2(s_t, a_t, s_{t+1}) \right] \right|$$
$$= \left| \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left( R_1(s_t, a_t, s_{t+1}) - R_2(s_t, a_t, s_{t+1}) \right) \right] \right|$$
$$\le \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left| R_1(s_t, a_t, s_{t+1}) - R_2(s_t, a_t, s_{t+1}) \right| \right] \le \sum_{t=0}^{\infty} \gamma^t \epsilon = \frac{\epsilon}{1 - \gamma}.$$

Here the second line follows from the linearity of expectation, and the third line follows from Jensen's inequality. Recall that the linearity of expectation is guaranteed to hold for infinite sums only if the sum is absolutely convergent, but that is true in this case (because of the discounting, and assuming that the reward function has a bounded magnitude).
Proposition: Suppose that $|J_1(\pi) - J_2(\pi)| \le \frac{\epsilon}{2} \cdot U$ for all policies $\pi$, where $U = \max_\pi J_1(\pi) - \min_\pi J_1(\pi)$ (i.e., $U$ is the range of $J_1$ over all policies), and that $\pi_1 \in \arg\max_\pi J_1(\pi)$ and $\pi_2 \in \arg\max_\pi J_2(\pi)$. Then $J_1(\pi_1) - J_1(\pi_2) \le \epsilon \cdot U$.

Proof: First note that $U$ must be non-negative, since it is a maximum minus a minimum.

Next, note that if $U = 0$ then $J_1$ assigns the same value to every policy, so $J_1(\pi_1) - J_1(\pi_2) = 0$, and so the proposition holds. Now consider the case when $U > 0$:

Our assumptions imply that $J_1(\pi_2) \ge J_2(\pi_2) - \frac{\epsilon}{2} U$.

We next show that $J_2(\pi_2) \ge J_1(\pi_1) - \frac{\epsilon}{2} U$ as well. Our assumptions imply that

$$J_2(\pi_1) \ge J_1(\pi_1) - \frac{\epsilon}{2} U \implies J_2(\pi_2) \ge J_1(\pi_1) - \frac{\epsilon}{2} U.$$

Here the last implication uses the fact that $J_2(\pi_2) \ge J_2(\pi_1)$, since $\pi_2$ maximises $J_2$. Together, the two inequalities above imply that $J_1(\pi_2) \ge J_1(\pi_1) - \epsilon \cdot U$. We have thus shown that if $U > 0$ then $J_1(\pi_1) - J_1(\pi_2) \le \epsilon \cdot U$, as required. (A symmetric argument also shows that $\max_\pi J_2(\pi) - J_2(\pi_1) \le \epsilon \cdot U$, so that the optimal policy of $J_1$ is likewise near-optimal for $J_2$; recall that we assume that $\pi_1$ maximises $J_1$.)
To see how to prove more advanced regret bounds, see e.g. the STARC paper.
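As a quick numerical sanity check of the first proposition above (only a sketch, with arbitrary made-up parameters), one can perturb a random reward function by at most $\epsilon$ per transition and confirm that the returns of randomly sampled policies never differ by more than $\epsilon / (1 - \gamma)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma, eps = 5, 3, 0.9, 0.1
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = rng.dirichlet(np.ones(n_states))

def J(policy, reward):
    """Expected discounted return of a stationary policy."""
    P = np.einsum("sa,sap->sp", policy, T)
    r = np.einsum("sa,sap,sap->s", policy, T, reward)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(mu0 @ v)

R1 = rng.normal(size=(n_states, n_actions, n_states))
R2 = R1 + rng.uniform(-eps, eps, size=R1.shape)   # |R1 - R2| <= eps everywhere

gaps = []
for _ in range(1000):
    policy = rng.dirichlet(np.ones(n_actions), size=n_states)
    gaps.append(abs(J(policy, R1) - J(policy, R2)))

print("largest |J1 - J2| over sampled policies:", round(max(gaps), 4))
print("bound eps / (1 - gamma):                ", eps / (1 - gamma))
```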
A stationary policy is a function from states to distributions over actions. If we have a finite set of states and actions, then the set of all stationary policies is clearly compact. In particular, each policy can be represented as an $|S| \cdot |A|$-dimensional vector, and the set of all these vectors is closed and bounded, which means that it is compact in the usual topology. This fact is very useful, and does for example directly imply that there is some policy that achieves maximal expected reward (given that $J$ is continuous in $\pi$, which it is).
A non-stationary policy is a function from finite trajectories to distributions over actions. Non-stationary policies generalise stationary policies, in that they allow the actions to depend on the past trajectory of the agent. Such policies are required in partially observable environments, and are also needed for agents with preferences that aren’t temporally consistent, for example. However, the set of all non-stationary policies is not compact in the usual topology. In fact, if we want to represent such policies as vectors, then these vectors would have to be infinite-dimensional, even if we have a finite state and action space (because the past history can be arbitrarily long). It is therefore useful to know that the set of all non-stationary policies in fact is compact relative to a slightly different (but still very sensible) topology.
In particular, let $d$ be the function on $\Pi \times \Pi$ given by

$$d(\pi_1, \pi_2) = \sup_{\zeta} \; 2^{-|\zeta|} \cdot \lVert \pi_1(\zeta) - \pi_2(\zeta) \rVert_\infty,$$

where the supremum ranges over all finite trajectories $\zeta$, $|\zeta|$ is the length of $\zeta$, and $\pi(\zeta)$ is viewed as a vector of action probabilities. Here $\Pi$ is the set of all non-stationary policies. Intuitively, $d$ considers two policies to be close if they select similar action distributions after every short history, while differences that only show up after long histories are discounted heavily.

Now $d$ is a metric on $\Pi$, and $(\Pi, d)$ is a compact metric space. I will leave it as an exercise to the reader to prove that $d$ is a metric. To show that $(\Pi, d)$ is compact, I will show that it is totally bounded and complete.

To see that $(\Pi, d)$ is totally bounded, let $\epsilon > 0$ be picked arbitrarily, and let $n$ be large enough that $2^{-n} < \epsilon$. Let $G$ be a finite set of distributions over $A$ such that every distribution over $A$ is within $\epsilon$ of some element of $G$ (such a set exists, since the set of distributions over $A$ is a bounded subset of a finite-dimensional space), and let $a$ be an arbitrary action. Let $P$ be the set of all policies $\pi'$ such that $\pi'(\zeta) \in G$ for all trajectories of length at most $n$, and such that $\pi'$ plays $a$ (with probability 1) for all trajectories longer than $n$. Since there are only finitely many trajectories of length at most $n$, the set $P$ is finite. Moreover, for any policy $\pi$, there is a policy $\pi' \in P$ such that $d(\pi, \pi') \le \epsilon$ (given by letting $\pi'(\zeta)$ be an element of $G$ within $\epsilon$ of $\pi(\zeta)$ for all trajectories of length $n$ or less). Since $\epsilon$ was picked arbitrarily, this means that $(\Pi, d)$ is totally bounded.

To see that $(\Pi, d)$ is complete, let $\{\pi_i\}$ be an arbitrary Cauchy sequence in $(\Pi, d)$. For each fixed trajectory $\zeta$, we have that $\lVert \pi_i(\zeta) - \pi_j(\zeta) \rVert_\infty \le 2^{|\zeta|} \cdot d(\pi_i, \pi_j)$, which means that $\{\pi_i(\zeta)\}$ is a Cauchy sequence in the set of distributions over $A$; since this set is complete, the sequence converges. We can thus define a policy $\pi$ by letting $\pi(\zeta) = \lim_{i \to \infty} \pi_i(\zeta)$ for each trajectory $\zeta$ (where this limit is a distribution over actions). Now $\pi$ is in $\Pi$, and $\pi$ is the limit of $\{\pi_i\}$ under $d$: given any $\epsilon > 0$, all trajectories longer than some $n$ (with $2^{-n} < \epsilon$) contribute at most $2^{-n}$ to $d(\pi_i, \pi)$, and the finitely many trajectories of length at most $n$ contribute less than $\epsilon$ once $i$ is sufficiently large. Thus $(\Pi, d)$ is complete.
Together, this implies that $(\Pi, d)$ is compact. This is nice, and makes certain things easier to prove. For example, the policy evaluation function $J$ is continuous relative to the topology induced by $d$ (this can be shown with a normal $\epsilon$-$\delta$ proof).[1] And so, by the extreme value theorem, there is a policy in $\Pi$ that maximises $J$. A similar proof can also be used to show that the set of all trajectories is compact, with a metric defined analogously to the above.
I hope that this sequence has made it easier to learn about the theoretical foundations of reward learning, and easier to get an overview of my recent research. I want to welcome discussion and criticism, so please make a comment if you have any questions or remarks!
I think this research agenda is promising: it tackles one of the core difficulties of AI safety in a fairly direct way, and it has proven to be highly tractable. With a bit more work, I think we could have something like a complete mathematical theory of outer alignment within a few years.
At the same time, I will in all likelihood not be active in this research area for the next ~2 years at least (not because I have lost interest in it, but because I will instead be investigating some questions in the Guaranteed Safe AI space). So, if this research agenda looks interesting to you, I would absolutely encourage you to consider trying to solve some of these problems yourself! This research direction is very fruitful, and ready to be attacked further.
Note, however, that this still requires the reward function to be bounded.