Scalar reward is not enough for aligned AGI

Peter Vamplew

This post was authored by Peter Vamplew and Cameron Foale (Federation University), and Richard Dazeley (Deakin University)

Introduction

Recently, some of the most well-known researchers in reinforcement learning (Silver, Singh, Precup and Sutton) published a paper entitled Reward is Enough, which proposes the reward-is-enough hypothesis: “Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment”. Essentially, they argue that the overarching goal of maximising reward is sufficient to explain all aspects of natural and artificial intelligences.

Of specific interest to this forum is the contention that suitably powerful methods based on maximisation of a scalar reward (as in conventional reinforcement learning) provide a suitable pathway for the creation of artificial general intelligence (AGI). We are concerned that the promotion of such an approach by these influential researchers increases the risk of development of AGI which is not aligned with human interests. This led us to work with a team of collaborators on a recent pre-print, Scalar Reward is Not Enough, which argues against the assumption made by the reward-is-enough hypothesis that scalar rewards are sufficient to underpin intelligence.

The aim of this post is to provide an overview of our arguments as they relate to the creation of aligned AGI. In this post we will focus on reinforcement learning methods, both because that is the main approach mentioned by Silver et al, and also because it is our own area of expertise. However the arguments apply to any form of AI based on maximisation of a numeric measure of reward or utility.

Does aligned AGI require multiple objectives?

In discussing the development of intelligence, Silver et al argue that complex, general intelligence may arise from the combination of complex environments and simple reward signals, and provide the following illustrative example:

“For example, consider a signal that provides +1 reward to the agent each time a round-shaped pebble is collected. In order to maximise this reward signal effectively, an agent may need to classify pebbles, to manipulate pebbles, to navigate to pebble beaches, to store pebbles, to understand waves and tides and their effect on pebble distribution, to persuade people to help collect pebbles, to use tools and vehicles to collect greater quantities, to quarry and shape new pebbles, to discover and build new technologies for collecting pebbles, or to build a corporation that collects pebbles.”

Silver et al present the ability of a reward-maximising agent to develop such wide-ranging, impactful behaviours on the basis of a simple scalar reward as a positive feature of this approach to developing AI. However we were struck by the similarity between this scenario and the infamous paper-clip maximiser thought experiment which has been widely discussed in the AI safety literature. The dangers posed by unbounded maximisation of a simple objective are well-known in this community, and it is concerning to see them totally overlooked in a paper advocating RL as a means for creating AGI.

We have previously argued that the creation of human-aligned AI is an inherently multiobjective problem. By incorporating rewards for other objectives in addition to the primary objective (such as making paperclips or collecting rocks), the designer of an AI system can reduce the likelihood of unsafe behaviour arising. In addition to safety objectives, there may be many other aspects of desirable behaviour which we wish to encourage an AI/AGI to adopt – for example, adhering to legal frameworks, societal norms, ethical guidelines, etc. Of course, it may not be possible for an agent to simultaneously maximise all of these objectives (for example, sometimes illegal actions may be required in order to maximise safety; different ethical frameworks may be in disagreement in particular scenarios), and so we contend that it may be necessary to incorporate concepts from multiobjective decision-making in order to manage trade-offs between conflicting objectives.

Our collaborator Ben Smith and his colleagues Roland Pihlakas and Robert Klassert recently posted to this forum an excellent review of the benefits of multiobjective approaches to AI safety, so rather than duplicating those arguments here we refer the reader to that post, and to our prior paper.

For the remainder of this post we assume that the aim is to create AGI which takes into account both a primary objective (such as collecting rocks) along with one or more alignment objectives, and we will consider the extent to which technical approaches based on either scalar or vector rewards (with a separate element for each objective) may achieve that goal.

Does the reward-is-enough hypothesis only consider scalar rewards?

A question which has arisen in previous online discussion of our pre-print is whether we are creating a straw-man in contending that Silver et al assume scalar rewards. While it is true that the reward-is-enough hypothesis (as quoted above) does not explicitly state any restriction on the nature of the reward, this is specified later in Section 2.4 (“A reward is a special scalar observation R_t”), and Silver et al also refer to Sutton’s reward hypothesis, which states that “all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)”. This view is also reflected in our prior conversations with the authors; following a presentation we gave on multiobjective reinforcement learning in 2015, Richard Sutton stated that “there is no such thing as a multiobjective problem”.

In Reward is Enough, Silver et al do acknowledge that multiple objectives may exist, but contend that these can be represented via a scalar reward signal (“…a scalar reward signal can represent weighted combinations of objectives…”). They also argue that scalar methods should be favoured over explicitly multiobjective approaches as they represent a more general solution (although we would argue that the multiobjective case with n>=1 objectives is clearly more general than the special case of scalar reward with n=1):

“Rather than maximising a generic objective defined by cumulative reward, the goal is often formulated separately for different cases: for example multi-objective learning, risk-sensitive objectives, or objectives that are specified by a human-in-the-loop …While this may be appropriate for specific applications, a solution to a specialised problem does not usually generalise; in contrast a solution to the general problem will also provide a solution for any special cases.”

Can a scalar reward adequately represent multiple objectives?

As mentioned earlier, Silver et al state that “a scalar reward signal can represent weighted combinations of objectives”. While this statement is true, the question remains as to whether this representation is sufficient to support optimal decision-making with regards to those objectives.

While Silver et al don’t clearly specify the exact nature of this representation, the mention of “weighted combinations” suggests that they are referring to a linear weighted sum of the objectives. This is the most widely adopted approach to dealing with multiple objectives in the scalar RL literature (for example, common benchmarks such as gridworlds often provide a reward of -1 on each time step to encourage rapid movement towards a goal state, and a separate negative reward for events such as colliding with walls or stepping in puddles). The assumption is that selecting an appropriate set of weights will allow the agent to discover a policy that produces the optimal trade-off between the two objectives. However, this may not be the case, as for some environments the expected returns of certain policies are such that no set of weights leads to the discovery of those policies; if those policies do in fact correspond to the best compromise between the objectives then we may be forced to settle for a sub-optimal solution. Even if a policy is theoretically findable, identifying the weights that achieve this is non-trivial, as the relationship between the weights and the returns achieved by the agent can be highly non-linear. We observed this in our recent work on minimising side-effects; for some problems tuning the agent to find a safe policy was much more difficult and time-consuming for single-objective agents than for multi-objective agents.
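
As a minimal sketch of the first limitation, consider three policies whose expected returns over two objectives are made-up numbers chosen for illustration (they are not taken from any benchmark). Policy B is a balanced compromise whose returns lie in a concave region of the trade-off front, and a sweep over linear weights never selects it:

```python
import numpy as np

# Hypothetical expected returns (objective 1, objective 2) for three policies.
# These values are illustrative assumptions, not results from any experiment.
returns = {
    "A": np.array([10.0, 0.0]),   # excels at objective 1 only
    "B": np.array([4.0, 4.0]),    # balanced compromise
    "C": np.array([0.0, 10.0]),   # excels at objective 2 only
}

def best_policy(w1):
    """Return the policy maximising the linear weighted sum for weights (w1, 1 - w1)."""
    weights = np.array([w1, 1.0 - w1])
    return max(returns, key=lambda name: float(weights @ returns[name]))

# Sweep the weight on objective 1 from 0 to 1 and record every winning policy.
winners = {best_policy(w) for w in np.linspace(0.0, 1.0, 101)}
print(sorted(winners))
```

Whatever weight is chosen, either A or C scores at least 5 while B always scores 4, so B is unreachable under linear scalarisation even if it is the compromise the user actually wants.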

It is possible to address these issues by using a non-linear function to scalarise the objectives. However, this introduces several new problems. We will illustrate these by considering a scalarisation function that aims to maximise one objective subject to reaching a threshold on a second objective. For simplicity we will also assume an episodic task (i.e. one with a defined, reachable end state). On each timestep the performance with respect to each objective can be calculated, but cannot immediately be scalarised (as, for example, the reward with respect to an objective may first reach the threshold, before subsequently falling below it in later time-steps due to negative rewards). So, these per-objective values must be accumulated external to the agent, and the agent will receive zero reward on all time-steps except at the end of the episode, when the true scalarised value can be calculated and provided as a scalar reward. This has a number of implications:

It results in a very sparse reward signal which will make learning slow. While this does not directly contradict the reward-is-enough hypothesis (which says nothing about learning efficiency), it is nevertheless an argument against adopting this approach, particularly for complex tasks.
For stochastic environments, the optimal decision at any point in time depends not only on the current state of the environment but also on the per-objective rewards received so far. To ensure convergence of RL algorithms it becomes necessary to use an augmented state, which concatenates the environmental state with the vector of accumulated rewards. If we are providing the agent with this information as part of its state representation, then surely it makes sense to also leverage this information more directly rather than restricting it to maximising the sparse scalar reward?
For non-linear scalarisations, we can distinguish between two different criteria which an agent is seeking to optimise. An agent learning from a pre-scalarised reward can only aim to optimise with regards to the Expected Scalarised Return (ESR). However in some circumstances it may be more appropriate to maximise the Scalarised Expected Return (SER). An agent using a pre-scalarised reward cannot do this, whereas a multiobjective agent that has learned directly from vector rewards can. This distinction becomes particularly important in the context of multi-agent systems, where the optimal solution may be completely different depending on which of these criteria each agent is aiming to maximise.
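
The difference between ESR and SER can be seen in a small worked example. The sketch below assumes a thresholded utility of the kind described above (maximise the first objective subject to the second reaching a threshold); the threshold, the penalty value and the two episode outcomes are hypothetical values chosen purely for illustration:

```python
import numpy as np

THRESHOLD = 5.0    # assumed minimum acceptable value for objective 2
PENALTY = -100.0   # assumed utility when the threshold is not met

def utility(v):
    """Value of objective 1 if objective 2 meets the threshold, else a penalty."""
    return float(v[0]) if v[1] >= THRESHOLD else PENALTY

# Hypothetical policy in a stochastic environment: two equally likely
# episode outcomes, each a vector return (objective 1, objective 2).
outcomes = np.array([[10.0, 2.0],
                     [10.0, 10.0]])
probs = np.array([0.5, 0.5])

# ESR: scalarise the return of each episode, then take the expectation.
esr = float(sum(p * utility(v) for p, v in zip(probs, outcomes)))

# SER: take the expectation of the vector return, then scalarise.
ser = utility(probs @ outcomes)

print(esr, ser)
```

Under ESR this policy looks poor, because half of its episodes violate the threshold; under SER it looks good, because the threshold is met on average. An agent that only ever sees the scalarised end-of-episode reward can estimate the first quantity but not the second.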

We also note that regardless of the nature of the scalarisation being used, an agent provided with a pre-scalarised scalar reward can only learn with regards to its current reward signal. If that signal changes, then the agent must discard its current policy and learn a new policy with regards to the modified reward. It may be possible to retain some of the prior learning, as a model-based agent could retain its model of state-transition dynamics; nevertheless it would still require considerable observations of the new reward signal before it could adapt its policy to that reward, and it will be performing sub-optimally prior to that point in time.

We believe this limitation of scalar RL is of particular significance in the context of aligned AGI. Human preferences are not fixed, either at an individual or societal level, and can in fact change very rapidly – the events of recent years have illustrated this with regards to issues such as climate change, animal rights and public health. It is vital that an aligned AGI can adapt rapidly, preferably immediately, to such changes, and methods based on scalar rewards cannot provide this level of flexibility.

Therefore, we advocate for the alternative approach of multiobjective reinforcement learning (MORL). In MORL the agent is directly provided with the vector rewards associated with whatever objectives we wish it to consider, and also with a function that defines the utility of the end-user of the system. Through experience the agent learns the vector-valued returns associated with actions and uses these in conjunction with the utility function to derive the policy that it follows. Note that while this also involves a scalarisation step in which vector values are converted to scalars, this occurs internally within the agent after it has been provided with the vector reward, whereas in scalar RL the scalarisation is external to the agent, before the reward is provided.
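
A minimal sketch of this internal scalarisation step is given below. It assumes the agent has already learned vector-valued action values for a single state (the numbers and the helper names `greedy_action` and `linear_utility` are hypothetical), and shows that swapping the utility function changes the selected action without discarding what has been learned:

```python
import numpy as np

# Hypothetical learned vector-valued action values for one state.
# Rows are actions; columns are (task progress, safety).
q_vectors = np.array([
    [9.0, -6.0],   # action 0: fast but risky
    [6.0, -1.0],   # action 1: slower and safer
    [2.0,  0.0],   # action 2: very cautious
])

def linear_utility(weights):
    """Build a utility function that takes a weighted sum of the objectives."""
    w = np.asarray(weights, dtype=float)
    return lambda v: float(w @ v)

def greedy_action(q, utility):
    """Scalarise each action's vector value inside the agent and pick the best."""
    return int(np.argmax([utility(v) for v in q]))

# The same learned values are reused under two different utility functions.
action_before = greedy_action(q_vectors, linear_utility([1.0, 0.2]))
action_after = greedy_action(q_vectors, linear_utility([1.0, 2.0]))
print(action_before, action_after)
```

This is a one-step simplification: acting greedily on scalarised vector values is only straightforwardly justified for linear utilities, and adapting fully to a new utility generally relies on the multi-policy methods described in the MORL literature.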

This change, when coupled with some algorithmic modifications, provides the following potential benefits (for more details on MORL algorithms and these benefits, please refer to this survey and this practical guide):

The utility function can be either linear or non-linear, as required to best match the user’s true utility, thereby placing no restrictions on the policies that can be discovered.
The agent may optimise either for ESR or SER optimality as required to suit the context in which it is being applied.
The agent can make use of the dense reward information provided by vector rewards, rather than sparser scalar rewards.
By using off-policy learning, the agent can learn not just the policy that is optimal with regards to its current utility function, but also policies that would be optimal for all possible definitions of this function (this is known as multi-policy learning). This allows for rapid adaptation should this function alter (e.g. to reflect changes in the laws, norms or ethics of our society). It also facilitates human-in-the-loop decision making: rather than pre-defining the utility function, the agent can learn all possible optimal policies (or a subset thereof) and present them to a human decision-maker who selects the policy which will actually be performed.
In our opinion defining a vector-valued reward and associated utility function is more intuitive than attempting to construct a complicated scalar reward signal that correctly captures all the desired objectives. Therefore, this approach should reduce the risk of reward misspecification. It also enables the possibility of using several independently specified reward signals in order to further reduce this risk.
The use of vector values and multi-policy learning facilitates the production of more informative explanations than are possible within a scalar agent, as the agent can directly provide information about the trade-offs being made between objectives (e.g. “I chose to go left as this would only take a few seconds longer than going right, and reduced the chance of collision with a human by 50%”), whereas a scalar agent will be unable to extract such information from its pre-scalarised reward.
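
To illustrate the multi-policy and explanation benefits together, the sketch below filters a set of candidate policies down to the non-dominated ones that could be presented to a human decision-maker, and then derives a trade-off explanation from their vector returns. The policy returns are hypothetical numbers chosen to mirror the left/right example above, and `pareto_front` and `explain_tradeoff` are illustrative helper names rather than functions from any MORL library:

```python
import numpy as np

def pareto_front(points):
    """Return indices of points not dominated by any other point (maximisation)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(pts) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def explain_tradeoff(chosen, other):
    """Return (extra seconds taken, relative reduction in collision probability)."""
    extra_time = other[0] - chosen[0]
    risk_reduction = 1.0 - chosen[1] / other[1]
    return extra_time, risk_reduction

# Hypothetical vector returns: (negative travel time in seconds,
# negative collision probability). Higher is better for both.
policy_returns = [
    (-30.0, -0.10),   # go right
    (-33.0, -0.05),   # go left
    (-40.0, -0.10),   # slower detour to the right, dominated
    (-45.0, -0.05),   # slower detour to the left, dominated
]

front = pareto_front(policy_returns)
extra_time, risk_reduction = explain_tradeoff(policy_returns[1], policy_returns[0])
print(front, extra_time, risk_reduction)
```

Only the first two policies survive the filter, and the explanation for choosing to go left (3 seconds slower, collision risk halved) is read directly from the per-objective values, which a pre-scalarised reward would have discarded.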

Are there downsides to multiobjective approaches?

Multiobjective RL will generally be more computationally expensive than scalar RL (particularly scalar RL using linear scalarisation), and this difference will be greater as the number of objectives increases. However, the use of multi-policy learning offers the potential for substantial improvement in sample efficiency for environments where the reward/utility is subject to change.

In addition, MORL is a newer and less extensively studied area of research compared to scalar RL (more details on this below). As such MORL algorithms, and particularly the implementation of those algorithms, have yet to be as thoroughly optimised as their scalar RL counterparts, particularly in the area of deep RL for high-dimensional state spaces.

What is the state of research into multiobjective aligned AI?

Academic research in computational RL dates back at least 40 years and has seen steady growth since the turn of the century, with particularly rapid expansion over the last five years or so. Meanwhile, there was minimal research in MORL prior to around 2013. While there has been rapid growth in papers on MORL since then, this has largely been matched by growth in RL research in general, with the result that MORL still constitutes less than 1% of RL research. Given the importance of MORL to aligned AI, this is a situation which we hope will change in coming years.

Not surprisingly given the relatively short history of MORL research, there has so far been limited work in applications of MORL to aligned AI. However, research in this area has started to emerge in recent years, addressing varied topics such as reward misspecification, learning of ethics or norms, and interpretability and explainability. We have provided a short list of recommended reading at the end of this post, and we refer the reader again to the post of Smith, Pihlakas and Klassert for an overview of work in this area.

Conclusion

In conclusion, we believe that multiobjective approaches are essential to developing human-aligned agents, and that the use of scalar rewards to create AGI is insufficient to adequately address the issue of alignment. We find Silver et al’s advocacy for this scalar approach concerning: they are highly influential researchers, and their paper could lead other researchers to adopt an approach whose inherent risks are not acknowledged in Reward is Enough. Our concerns are heightened by the fact that the authors are based at DeepMind which, given its resources, would appear to be one of the most likely sources for the emergence of AGI.

Recommended Reading

For general background on MORL:

Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48, 67-113.

Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., ... & Roijers, D. M. (2021). A practical guide to multi-objective reinforcement learning and planning. arXiv preprint arXiv:2103.09568.

For MORL applied to aligned AI:

Vamplew, P., Dazeley, R., Foale, C., Firmin, S., & Mummery, J. (2018). Human-aligned artificial intelligence is a multiobjective problem. Ethics and Information Technology, 20(1), 27-40.

Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K., ... & Rossi, F. (2018). Interpretable multi-objective reinforcement learning through policy orchestration. arXiv preprint arXiv:1809.08343.

Horie, N., Matsui, T., Moriyama, K., Mutoh, A., & Inuzuka, N. (2019). Multi-objective safe reinforcement learning: the relationship between multi-objective reinforcement learning and safe reinforcement learning. Artificial Life and Robotics, 24(3), 352-359.

Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K. R., ... & Rossi, F. (2019). Teaching AI agents ethical values using reinforcement learning and policy orchestration. IBM Journal of Research and Development, 63(4/5), 2-1.

Zhan, H., & Cao, Y. (2019). Relationship explainable multi-objective reinforcement learning with semantic explainability generation. arXiv preprint arXiv:1909.12268.

Sukkerd, R., Simmons, R., & Garlan, D. (2020). Tradeoff-focused contrastive explanation for MDP planning. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) (pp. 1041-1048). IEEE.

Vamplew, P., Foale, C., Dazeley, R., & Bignold, A. (2021). Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Engineering Applications of Artificial Intelligence, 100, 104186.