This is the eighth (and, for now, final) post in the theoretical reward learning sequence, which starts in this post. Here, I will provide a few pointers to anyone who might be interested in contributing to further work on this research agenda, in the form of a few concrete and shovel-ready open problems, a few ideas on how those problems may be approached, and a few useful general insights about this problem setting.
The theoretical reward learning research agenda tackles one of the core difficulties of AI safety in a fairly direct way — namely, the difficulty of specifying what we want AI systems to do. At the same time, this research agenda has also proven to be quite tractable; I have been able to produce a lot of results in a relatively short period of time, many of which both contribute to a deeper theoretical basis for reward learning and have more immediate practical implications. If this research agenda gets a bit more attention, then I think it is entirely realistic to expect that we could have something like a “mathematical theory of outer alignment” within a few years. This would solve a meaningful number of important subproblems of the broader AI safety problem (even though there are, of course, many important problems it does not cover). It would elucidate the relationship between objectives and outcomes, it would tell us how to learn good task representations, and it would tell us what the limitations of these representations may be (and potentially also how to handle those limitations). If you are looking to contribute to AI safety research, if you find my motivating assumptions compelling, and if you are inclined towards theoretical research over empirical research, then you may want to consider contributing to the theoretical reward learning research agenda. In this post, I want to provide a few starting points for how to do just that.
The first post in this sequence lists several (open-ended) research problems within the theoretical reward learning research agenda. In addition to this, however, I also want to provide some more concrete research questions, together with a few thoughts on how to approach them. This list is by no means exhaustive; it is simply a list of fairly well-defined problems that should be shovel-ready straight away:
Almost all of my papers that relate to the theoretical reward learning research agenda use Markov Decision Processes (MDPs) as their main theoretical model. This comes with a few important limitations, including notably the assumptions that the environment is fully observable and that it only contains one agent. An important problem is therefore to extend and generalise all those results to richer classes of environments. In the simplest case, this may be POMDPs or Markov games, but we can also consider even more general environments.
Note that this problem is much less trivial than it may at first seem, at least if we are to solve it in full generality. The reason for this is that we quickly encounter problems if we try to create a fully general formalisation of a “decision problem”. For example, one may naïvely think that the most general formalisation of a decision problem is the “general RL setting”, which is a class of problems that are similar to MDPs, but where the transition function and the reward function are allowed to depend arbitrarily on the entire history of past transitions (instead of only on the current transition). For example, this setting trivially subsumes POMDPs. However, even in the general RL setting, it is relatively straightforward to prove that there always exists a deterministic optimal policy. This means that the general RL setting cannot capture game-theoretic dynamics in a satisfactory way (since game-theoretic problems may require the agent to randomise its actions), and so it is not actually fully general.
Of course, we can amend the general RL setting by adding other agents. However, this does not seem entirely satisfactory either. After all, an “agent” is not some kind of ontologically basic entity. The properties of agents emerge from (and should be derivable from) the properties of their parts. Moreover, whether or not a system constitutes an “agent” is plausibly not a binary property, etc. More generally, the fact that the general RL setting cannot model agents in a satisfactory way probably means that it is missing something more basic. For example, decision problems involve dynamics that are similar to game-theoretic dynamics, even when they do not involve other agents. Thus, while we can improve the general RL setting by adding other agents to the model, the resulting formalism is likely to still exclude broad classes of situations.
One way to approach this is to ask why the general RL setting is unable to capture game-theoretic situations, and then generalise from there. My current guess is that the main issue is that game-theoretic problems involve situations where your uncertainty about the outcome of an action is correlated with your uncertainty about what action you are about to take. For example, if you are playing rock-paper-scissors, you may think that the more likely you are to play rock, the more likely it is that your opponent will play paper, etc. This would in turn suggest a formalisation of a decision problem where the transition function of the environment is allowed to depend not only on your actions, but also on the probability with which you take those actions. A setup like this is explored in this paper.
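To make this concrete, here is a minimal Python sketch of a one-step decision problem whose outcome depends on the agent's action probabilities, and not just on the action that is actually sampled: the opponent in rock-paper-scissors best-responds to the agent's mixed strategy. This is only meant to illustrate the kind of object that such a formalisation would have to handle; it is not taken from the paper mentioned above, and all names in it are made up.

```python
import numpy as np

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # maps a move to the move that beats it

def opponent_policy(agent_probs):
    """The opponent best-responds to the agent's *mixed strategy*:
    it plays whatever beats the agent's most probable action (ties broken by index)."""
    most_likely = ACTIONS[int(np.argmax(agent_probs))]
    return BEATS[most_likely]

def play_round(agent_probs, rng):
    """The outcome distribution depends on agent_probs itself, not just on the sampled action."""
    agent_action = rng.choice(ACTIONS, p=agent_probs)
    opponent_action = opponent_policy(agent_probs)
    if agent_action == opponent_action:
        return 0.0                                   # draw
    return 1.0 if BEATS[opponent_action] == agent_action else -1.0

rng = np.random.default_rng(0)
for probs in ([1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]):
    returns = [play_round(np.array(probs), rng) for _ in range(10_000)]
    print(probs, "average return:", round(float(np.mean(returns)), 3))
```

Note that the expected return of "always play rock" cannot be recovered from the returns of the individual actions, since changing the probabilities changes the environment's response; this is the sense in which the transition function depends on the policy itself.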
At any rate, the core of this open problem is to identify interesting, more general formalisations of “decision problems”, and to determine whether the results found in previous papers also apply within these more general formalisations. This could use formalisations that are already well-known and well-studied, such as POMDPs and Markov games, or it could involve new formalisations. It is likely to involve solving somewhat hard mathematical problems.
One of the core problems of the theoretical reward learning research agenda is to determine which reward learning methods are guaranteed to converge to reward functions that are “close” to the underlying true reward function. To answer this, we must first decide how to quantify the differences between reward functions. In this paper, we provide one possible answer to that question in the form of STARC metrics (see also this post). However, there is a lot of room to improve on and generalise this solution. STARC metrics are essentially a continuous measurement of how much the policy orderings of two reward functions differ relative to a particular transition function and discount factor. We could therefore create more demanding metrics by requiring that the reward functions have a similar policy ordering for all transition functions. Alternatively, we could also create less demanding metrics by loosening the requirement that the reward functions must have similar preferences between all policies — perhaps it could be enough for them to have similar optimal policies, for example? Or perhaps there are different ways to do this quantification altogether. Finding good ways to quantify these differences is particularly important, because basically all other results will directly rely on this choice.
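For intuition, here is a small Python sketch of the kind of quantity that such a metric is tracking. It is not the actual STARC construction from the paper; it is just a crude empirical proxy that samples random policies in a small random tabular MDP and measures how often two reward functions disagree about which of two policies is better.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# A random tabular MDP: T[s, a] is a distribution over next states, mu0 the initial distribution.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = rng.dirichlet(np.ones(n_states))

def J(policy, reward):
    """Expected discounted return of a stationary policy.
    policy: (n_states, n_actions) action probabilities.
    reward: (n_states, n_actions, n_states) reward for each transition."""
    P = np.einsum("sa,sap->sp", policy, T)                 # state-to-state transition matrix
    r = np.einsum("sa,sap,sap->s", policy, T, reward)      # expected one-step reward per state
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)   # V = r + gamma * P V
    return float(mu0 @ v)

def ordering_disagreement(R1, R2, n_policies=200):
    """Fraction of sampled policy pairs on which R1 and R2 order the policies differently
    (a crude stand-in for a principled metric such as a STARC metric)."""
    policies = rng.dirichlet(np.ones(n_actions), size=(n_policies, n_states))
    J1 = np.array([J(p, R1) for p in policies])
    J2 = np.array([J(p, R2) for p in policies])
    pairs = list(combinations(range(n_policies), 2))
    flips = sum(1 for i, j in pairs if (J1[i] - J1[j]) * (J2[i] - J2[j]) < 0)
    return flips / len(pairs)

R = rng.normal(size=(n_states, n_actions, n_states))
print("R vs 2*R + 1:   ", ordering_disagreement(R, 2 * R + 1))                  # same policy ordering: ~0
print("R vs random R': ", ordering_disagreement(R, rng.normal(size=R.shape)))   # unrelated reward: > 0
```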
Here are a few examples of concrete mathematical questions that, to the best of my knowledge, haven’t been solved, and which could provide guidelines for how to do this:
Given solutions to the above questions, the next question is whether the corresponding equivalence conditions can be translated into metrics (in the same way as how STARC metrics are a continuous measurement of how much the policy orderings of two reward functions differ relative to a particular transition function). Alternatively, there may also be other completely different ways to usefully quantify the differences between reward functions.
In practice, we should not expect to be able to create reward functions that perfectly capture our preferences. This raises the question of how we should optimise reward functions in light of this fact. Can we create special “conservative” optimisation methods that yield provable guarantees? There is a lot of existing work on this problem (including, for example, quantilizers). However, I think there are several promising directions for improving this work, especially once more parts of the theoretical reward learning agenda have been solved.
For example, much of the existing work on conservative optimisation makes no particular assumptions about how the reward function is misspecified. A better understanding of reward learning methods would give us a better understanding of how learnt reward functions are likely to differ from the “true” reward function, and this understanding could potentially be used to create better methods for conservative optimisation. For example, in this paper, we derive a (tight) criterion that describes how a policy may be optimised according to some proxy reward while still ensuring that the true reward doesn’t decrease, based on the STARC-distance between the proxy reward and the true reward. In other words, suppose that we have some reward learning method that has produced a reward function $\hat{R}$, and that we expect (based on the number of training samples, etc) that $\hat{R}$ is likely to have a STARC-distance of at most $\epsilon$ to the underlying true reward function $R$. The criterion from this paper then tells us how to optimise any policy for $\hat{R}$, while ensuring that its reward according to $R$ cannot decrease. Other distance metrics between reward functions may similarly produce other conservative optimisation methods.
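To illustrate the flavour of such results, here is a hedged sketch of a much simpler, margin-based rule (it is not the actual STARC-based criterion from the paper, and the function name is made up): if we know some bound $\delta$ such that the true return and the proxy return of every policy differ by at most $\delta$ (a bound of this kind is derived later in this post), then only accepting policy updates that improve the proxy return by more than $2\delta$ guarantees that the true return never decreases.

```python
def accept_update(j_proxy_old, j_proxy_new, delta):
    """Accept a policy update only if the proxy improvement exceeds 2 * delta.

    If |J_true(pi) - J_proxy(pi)| <= delta for every policy pi, then
        J_true(new) >= J_proxy(new) - delta
                    >  (J_proxy(old) + 2 * delta) - delta
                    =  J_proxy(old) + delta
                    >= J_true(old),
    so accepted updates can never decrease the true value.
    """
    return j_proxy_new - j_proxy_old > 2 * delta

# Hypothetical numbers: a proxy improvement of 0.3 with delta = 0.2 is rejected,
# because it could still hide a decrease in true value.
print(accept_update(j_proxy_old=1.0, j_proxy_new=1.3, delta=0.2))  # False
print(accept_update(j_proxy_old=1.0, j_proxy_new=1.6, delta=0.2))  # True
```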
One interesting case of conservative optimisation is the case where we have multiple reward functions produced from different sources. If we have several sources of information about a latent variable (such as a reward function), then we can normally simply combine the evidence using Bayes' theorem. However, if the observation models are misspecified, then the posterior distributions will conflict, and the straightforward way to combine them is likely to lead to nonsensical results. This issue is explained in this paper. It applies to the problem of inferring human preferences: there are several sources of information we can rely on, but they can often lead to conflicting conclusions. For example, if you ask people what kind of chocolate they prefer, they will typically say that they like dark chocolate, but if you let people choose which type of chocolate to eat, they will typically pick milk chocolate. If we take these two pieces of information at face value, they lead to incompatible conclusions. The reason for this is misspecification: humans sometimes misreport their preferences, and sometimes do things that are against their preferences. This means that we cannot combine this information in the straightforward way, and that special methods are required.
We can set up this problem as follows: suppose we have $n$ reward learning methods, and that they have produced distributions $D_1, \dots, D_n$ over reward functions. We know that any (or all) of these may be subject to some degree of misspecification, and we wish to derive a policy $\pi$ that is reasonably good in light of what we know. There are several considerations that we might want to satisfy:
There are several different ways to approach this problem. A simple option might be to let $\pi$ maximise expected reward under a mixture of the distributions $D_1, \dots, D_n$. Alternatively, one could explore solution concepts from social choice theory (where each $D_i$ is considered to be a voter). A good solution to this problem would provide methods for aggregating misspecified (but informative) preferences in a useful but conservative way.
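As a toy illustration of the first option and a more conservative variant of it (this is only a sketch; the posteriors, the bias values, and all names are made up, and it is not a proposal from any of the papers), one can sample reward functions from each $D_i$ and then score candidate policies by their average or worst-case expected return across the posteriors:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] = distribution over next states
mu0 = rng.dirichlet(np.ones(n_states))                            # initial state distribution

def J(policy, reward):
    """Expected discounted return of a stationary policy in the tabular MDP above."""
    P = np.einsum("sa,sap->sp", policy, T)
    r = np.einsum("sa,sap,sap->s", policy, T, reward)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(mu0 @ v)

# Three hypothetical "reward learning methods", each returning samples from its posterior D_i.
# Here they are just noisy, differently-biased views of one underlying reward function.
true_R = rng.normal(size=(n_states, n_actions, n_states))
def sample_posterior(bias, n_samples=10):
    return [true_R + bias + 0.3 * rng.normal(size=true_R.shape) for _ in range(n_samples)]
posteriors = [sample_posterior(bias) for bias in (0.0, 0.5, -0.5)]

# Candidate policies: random stationary policies (a stand-in for a proper policy search).
candidates = rng.dirichlet(np.ones(n_actions), size=(200, n_states))

def score(policy, aggregate):
    # Average the expected return within each posterior, then aggregate across posteriors.
    per_posterior = [np.mean([J(policy, R) for R in samples]) for samples in posteriors]
    return aggregate(per_posterior)

for name, aggregate in [("mean over posteriors", np.mean), ("worst case over posteriors", np.min)]:
    best = max(candidates, key=lambda p: score(p, aggregate))
    print(name, "-> value of chosen policy under the underlying reward:", round(J(best, true_R), 3))
```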
Reward functions are the most popular method for expressing temporally extended tasks, but they are not the only option. For example, there is also multi-objective RL, temporal logic, constrained RL, risk-averse RL, and reward machines. Moreover, the expressivities of many of these methods are incomparable (for some results on this topic, see this paper). For example, linear temporal logic (LTL) can express the instruction “press button A a finite number of times”, which reward functions cannot, but reward functions can express “pressing button A is better than pressing button B, but pressing button B is better than pressing button C”, which LTL cannot. It would be interesting to have a more complete characterisation of what these specification methods can and cannot express, and how they compare to each other. This paper makes a start on this problem, but it could be improved and extended (for example, by using more realistic assumptions about how a specification may be created, or by providing necessary-and-sufficient conditions that characterise exactly what objectives a given specification language can capture).
Another interesting question could be to determine when a given specification language is able to approximate another language, even when it cannot capture it exactly. For example, multi-objective RL (MORL) is more expressive than (scalar) reward functions (see this paper). However, perhaps it is the case that for any MORL specification, there exists a scalar reward function $R$ such that maximisation of $R$ still satisfies some regret bound with respect to the original MORL specification? And so on. These questions are important, because all other results have to assume that instructions/preferences are expressed within some particular format. An answer to this question would tell us something about which format is correct, and what the cost of assuming the wrong format might be.
In addition to the open problems above, I also want to give a longer list with a few shorter research proposals (but without providing as much detail as above):
In addition to this, there are also many open problems listed in earlier papers, especially this one.
I also want to share a few neat mathematical insights which I have found to be particularly useful for solving problems in this area. These insights are:
Given an MDP with state space $S$, a potential function is a function $\Phi : S \to \mathbb{R}$ that maps each state to a real number. We say that two reward functions $R_1$ and $R_2$ differ by potential shaping with $\Phi$ (for discount $\gamma$) if

$$R_2(s, a, s') = R_1(s, a, s') + \gamma \cdot \Phi(s') - \Phi(s)$$
for all states $s, s'$ and all actions $a$. Potential shaping was first introduced in this paper, where they show that any two reward functions that differ by potential shaping have the same optimal policies in all MDPs. Moreover, they also show that potential shaping transformations are the only additive reward transformations with this property.
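As a quick sanity check of this definition, the following sketch (with a random MDP and a random potential; all names are made up) verifies numerically that two reward functions related by potential shaping assign every policy returns that differ by the same policy-independent constant, namely $\mathbb{E}_{s_0 \sim \mu_0}[\Phi(s_0)]$; this is why they induce the same policy ordering, and in particular the same optimal policies.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities
mu0 = rng.dirichlet(np.ones(n_states))                            # initial state distribution

def J(policy, reward):
    """Expected discounted return of a stationary policy."""
    P = np.einsum("sa,sap->sp", policy, T)
    r = np.einsum("sa,sap,sap->s", policy, T, reward)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(mu0 @ v)

R1 = rng.normal(size=(n_states, n_actions, n_states))
phi = rng.normal(size=n_states)

# R2(s, a, s') = R1(s, a, s') + gamma * phi(s') - phi(s)
R2 = R1 + gamma * phi[None, None, :] - phi[:, None, None]

# For every policy, the returns differ by the same constant: J1 - J2 = E_{s0 ~ mu0}[phi(s0)].
for _ in range(5):
    policy = rng.dirichlet(np.ones(n_actions), size=n_states)
    print(round(J(policy, R1) - J(policy, R2), 8), "==", round(float(mu0 @ phi), 8))
```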
Potential shaping has a lot of neat properties that it is worth being aware of, which aren’t explicitly discussed in the original paper. In particular:
The above claims are fairly easy to prove, but for explicit proofs, see this paper (Lemma B.1, B.3, and B.4, and Theorem 3.11).
Given a policy $\pi$, let its “occupancy measure” $\eta^\pi$ be the $|S \times A \times S|$-dimensional vector in which the dimension corresponding to the transition $(s, a, s')$ is equal to

$$\eta^\pi(s, a, s') = \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{P}(s_t = s, a_t = a, s_{t+1} = s'),$$

where the probability is over trajectories sampled from $\pi$, given a particular transition function and initial state distribution.
Now note that the expected discounted cumulative reward $J(\pi)$ of a policy $\pi$ is given by the dot product of the occupancy measure $\eta^\pi$ of $\pi$ and the reward function $R$, if $R$ is viewed as an $|S \times A \times S|$-dimensional vector. That is, $J(\pi) = \eta^\pi \cdot R$.
This means that occupancy measures let us split the policy evaluation function J into two parts, the first of which is independent of the reward function, and the second of which is linear in the reward function. This is very helpful for proving things!
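Here is a small sketch (again with a random tabular MDP; all names are made up) that computes the occupancy measure of a policy by solving for the discounted state-visitation frequencies, and then checks that the dot product $\eta^\pi \cdot R$ agrees with the return computed by ordinary policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a] = dist over next states
mu0 = rng.dirichlet(np.ones(n_states))
R = rng.normal(size=(n_states, n_actions, n_states))
policy = rng.dirichlet(np.ones(n_actions), size=n_states)

# Discounted state-visitation frequencies: m = mu0 + gamma * P^T m, where P is the
# state-to-state transition matrix induced by the policy.
P = np.einsum("sa,sap->sp", policy, T)
m = np.linalg.solve(np.eye(n_states) - gamma * P.T, mu0)

# Occupancy measure over transitions: eta(s, a, s') = m(s) * policy(a | s) * T(s, a, s').
eta = m[:, None, None] * policy[:, :, None] * T

# Policy evaluation the usual way: V = r_pi + gamma * P V, and J = mu0 . V.
r_pi = np.einsum("sa,sap,sap->s", policy, T, R)
v = np.linalg.solve(np.eye(n_states) - gamma * P, r_pi)

print("eta . R     =", float((eta * R).sum()))
print("J(pi) via V =", float(mu0 @ v))   # the two numbers should match
```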
The following facts are also useful to know about the occupancy measure:
These facts greatly simplify a lot of proofs about reward functions in arbitrary MDPs to relatively simple geometric problems!
Many problems within the theoretical reward-learning research agenda involve proving regret bounds. As such, I will provide a simple way to derive regret bounds, which can be extended and modified for more complicated cases:
Proposition: If $|R_1(s, a, s') - R_2(s, a, s')| \le \epsilon$ for all transitions $(s, a, s')$, then for any policy $\pi$, we have that $|J_1(\pi) - J_2(\pi)| \le \epsilon / (1 - \gamma)$ (where $J_i$ denotes the policy evaluation function of the reward function $R_i$).

Proof: This follows from straightforward algebra:

$$|J_1(\pi) - J_2(\pi)| = \left| \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_1(s_t, a_t, s_{t+1}) \right] - \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_2(s_t, a_t, s_{t+1}) \right] \right|$$
$$= \left| \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left( R_1(s_t, a_t, s_{t+1}) - R_2(s_t, a_t, s_{t+1}) \right) \right] \right|$$
$$\le \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left| R_1(s_t, a_t, s_{t+1}) - R_2(s_t, a_t, s_{t+1}) \right| \right] \le \sum_{t=0}^{\infty} \gamma^t \epsilon = \frac{\epsilon}{1 - \gamma}.$$

Here the second line follows from the linearity of expectation, and the third line follows from Jensen's inequality. Recall that the linearity of expectation is guaranteed to hold for infinite sums only if the sum is absolutely convergent, but that is true in this case (because of the discounting, and assuming that the reward function has a bounded magnitude).
Proposition: Suppose that $|J_1(\pi) - J_2(\pi)| \le \frac{\epsilon}{2} \cdot U$ for all policies $\pi$, where $U = \max_\pi J_1(\pi) - \min_\pi J_1(\pi)$ (i.e., $U$ is the range of $J_1$ over all policies), and that $\pi_1 \in \arg\max_\pi J_1(\pi)$ and $\pi_2 \in \arg\max_\pi J_2(\pi)$. Then $J_1(\pi_1) - J_1(\pi_2) \le \epsilon \cdot U$.

Proof: First note that $U$ must be non-negative, since it is a maximum minus a minimum.

Next, note that if $U = 0$ then $J_1$ assigns the same value to every policy, so $J_1(\pi_1) - J_1(\pi_2) = 0$, and so the proposition holds. Now consider the case when $U > 0$:

Our assumptions imply that $J_1(\pi_2) \ge J_2(\pi_2) - \frac{\epsilon}{2} U$.

We next show that $J_2(\pi_2) \ge J_1(\pi_1) - \frac{\epsilon}{2} U$ as well. Our assumptions imply that

$$J_2(\pi_1) \ge J_1(\pi_1) - \frac{\epsilon}{2} U \implies J_2(\pi_2) \ge J_1(\pi_1) - \frac{\epsilon}{2} U.$$

Here the last implication uses the fact that $J_2(\pi_2) \ge J_2(\pi_1)$, since $\pi_2$ maximises $J_2$. Together, the two inequalities above imply that $J_1(\pi_2) \ge J_1(\pi_1) - \epsilon \cdot U$. We have thus shown that if $U > 0$ then $J_1(\pi_1) - J_1(\pi_2) \le \epsilon \cdot U$, as required. (A symmetric argument also shows that $\max_\pi J_2(\pi) - J_2(\pi_1) \le \epsilon \cdot U$, so that the optimal policy of $J_1$ is likewise near-optimal for $J_2$; recall that we assume that $\pi_1$ maximises $J_1$.)
To see how to prove more advanced regret bounds, see e.g. the STARC paper.
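As a quick numerical sanity check of the first proposition above (only a sketch, with arbitrary made-up parameters), one can perturb a random reward function by at most $\epsilon$ per transition and confirm that the returns of randomly sampled policies never differ by more than $\epsilon / (1 - \gamma)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma, eps = 5, 3, 0.9, 0.1
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
mu0 = rng.dirichlet(np.ones(n_states))

def J(policy, reward):
    """Expected discounted return of a stationary policy."""
    P = np.einsum("sa,sap->sp", policy, T)
    r = np.einsum("sa,sap,sap->s", policy, T, reward)
    v = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    return float(mu0 @ v)

R1 = rng.normal(size=(n_states, n_actions, n_states))
R2 = R1 + rng.uniform(-eps, eps, size=R1.shape)   # |R1 - R2| <= eps everywhere

gaps = []
for _ in range(1000):
    policy = rng.dirichlet(np.ones(n_actions), size=n_states)
    gaps.append(abs(J(policy, R1) - J(policy, R2)))

print("largest |J1 - J2| over sampled policies:", round(max(gaps), 4))
print("bound eps / (1 - gamma):                ", eps / (1 - gamma))
```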
A stationary policy is a function from states to distributions over actions. If we have a finite set of states and actions, then the set of all stationary policies is clearly compact. In particular, each policy can be represented as an $|S| \cdot |A|$-dimensional vector, and the set of all these vectors is closed and bounded, which means that it is compact in the usual topology. This fact is very useful, and does for example directly imply that there is some policy that achieves maximal expected reward (given that $J$ is continuous in $\pi$, which it is).
A non-stationary policy is a function from finite trajectories to distributions over actions. Non-stationary policies generalise stationary policies, in that they allow the actions to depend on the past trajectory of the agent. Such policies are required in partially observable environments, and are also needed for agents with preferences that aren’t temporally consistent, for example. However, the set of all non-stationary policies is not compact in the usual topology. In fact, if we want to represent such policies as vectors, then these vectors would have to be infinite-dimensional, even if we have a finite state and action space (because the past history can be arbitrarily long). It is therefore useful to know that the set of all non-stationary policies in fact is compact relative to a slightly different (but still very sensible) topology.
In particular, let $d$ be the function on $\Pi \times \Pi$ given by

$$d(\pi_1, \pi_2) = \sup_{\zeta} \; 2^{-|\zeta|} \cdot \lVert \pi_1(\zeta) - \pi_2(\zeta) \rVert_\infty,$$

where the supremum ranges over all finite trajectories $\zeta$, $|\zeta|$ is the length of $\zeta$, and $\pi(\zeta)$ is viewed as a vector of action probabilities. Here $\Pi$ is the set of all non-stationary policies. Intuitively, $d$ considers two policies to be close if they select similar action distributions after every short history, while differences that only show up after long histories are discounted heavily.

Now $d$ is a metric on $\Pi$, and $(\Pi, d)$ is a compact metric space. I will leave it as an exercise to the reader to prove that $d$ is a metric. To show that $(\Pi, d)$ is compact, I will show that it is totally bounded and complete.

To see that $(\Pi, d)$ is totally bounded, let $\epsilon > 0$ be picked arbitrarily, and let $n$ be large enough that $2^{-n} < \epsilon$. Let $G$ be a finite set of distributions over $A$ such that every distribution over $A$ is within $\epsilon$ of some element of $G$ (such a set exists, since the set of distributions over $A$ is a bounded subset of a finite-dimensional space), and let $a$ be an arbitrary action. Let $P$ be the set of all policies $\pi'$ such that $\pi'(\zeta) \in G$ for all trajectories of length at most $n$, and such that $\pi'$ plays $a$ (with probability 1) for all trajectories longer than $n$. Since there are only finitely many trajectories of length at most $n$, the set $P$ is finite. Moreover, for any policy $\pi$, there is a policy $\pi' \in P$ such that $d(\pi, \pi') \le \epsilon$ (given by letting $\pi'(\zeta)$ be an element of $G$ within $\epsilon$ of $\pi(\zeta)$ for all trajectories of length $n$ or less). Since $\epsilon$ was picked arbitrarily, this means that $(\Pi, d)$ is totally bounded.

To see that $(\Pi, d)$ is complete, let $\{\pi_i\}$ be an arbitrary Cauchy sequence in $(\Pi, d)$. For each fixed trajectory $\zeta$, we have that $\lVert \pi_i(\zeta) - \pi_j(\zeta) \rVert_\infty \le 2^{|\zeta|} \cdot d(\pi_i, \pi_j)$, which means that $\{\pi_i(\zeta)\}$ is a Cauchy sequence in the set of distributions over $A$; since this set is complete, the sequence converges. We can thus define a policy $\pi$ by letting $\pi(\zeta) = \lim_{i \to \infty} \pi_i(\zeta)$ for each trajectory $\zeta$ (where this limit is a distribution over actions). Now $\pi$ is in $\Pi$, and $\pi$ is the limit of $\{\pi_i\}$ under $d$: given any $\epsilon > 0$, all trajectories longer than some $n$ (with $2^{-n} < \epsilon$) contribute at most $2^{-n}$ to $d(\pi_i, \pi)$, and the finitely many trajectories of length at most $n$ contribute less than $\epsilon$ once $i$ is sufficiently large. Thus $(\Pi, d)$ is complete.
Together, this implies that $(\Pi, d)$ is compact. This is nice, and makes certain things easier to prove. For example, the policy evaluation function $J$ is continuous relative to the topology induced by $d$ (this can be shown with a normal $\epsilon$-$\delta$ proof).[1] And so, by the extreme value theorem, there is a policy in $\Pi$ that maximises $J$. A similar proof can also be used to show that the set of all trajectories is compact, with a metric defined analogously to the above.
I hope that this sequence has made it easier to learn about the theoretical foundations of reward learning, and easier to get an overview of my recent research. I want to welcome discussion and criticism, so please make a comment if you have any questions or remarks!
I think this research agenda is promising: it tackles one of the core difficulties of AI safety in a fairly direct way, and it has proven to be highly tractable. With a bit more work, I think we could have something like a complete mathematical theory of outer alignment within a few years.
At the same time, I will in all likelihood not be active in this research area for the next ~2 years at least (not because I have lost interest in it, but because I will instead be investigating some questions in the Guaranteed Safe AI space). So, if this research agenda looks interesting to you, I would absolutely encourage you to consider trying to solve some of these problems yourself! This research direction is very fruitful, and ready to be attacked further.
Note, however, that this still requires the reward function to be bounded.