This is an informal (i.e., equation-free) summary of a paper I have been working on.
In settings without well-defined goals, methods for reward learning allow reinforcement learning agents to infer the goal from human feedback. Past work has discussed the problem that such agents may manipulate humans, or the reward learning process, in order to gain higher reward. In this paper, we introduce the neglected problem that, in multi-agent settings, agents may have incentives to manipulate each other’s reward functions in order to change each other’s behavioral policies. We focus on the setting with humans acting alongside assistive artificial agents who must learn the reward function by interacting with these humans. We propose a solution to manipulation of human feedback in the multi-agent reward learning setting: the Shared Value Prior (SVP). The SVP equips agents with an assumption that the reward functions of all humans are similar. Given this assumption, the actions of opposing humans provide information to an agent about her own reward, and so she wishes to observe these actions rather than to manipulate them. We present an example in which the SVP prevents manipulation and show that, in the case of “arbitrary” manipulation, nothing can be learned about the target’s reward function by observing her behavior.
Background: Single agent Manipulation of the Reward
(Neglected) Problem: Manipulation in the Multi-Agent Reward Learning Case
Proposed Solution: Shared Value Prior (SVP)
We also show that, in the case of “arbitrary” manipulation, nothing can be learned about the target’s reward function by observing her behavior.
Here I just provide some brief background on reward learning and agent incentives.
When we talk about an agent, we are referring to the policy that is learned by the RL algorithm, i.e. the function that maps observations to actions. The goal of the agent is to maximize the (expected sum of time-discounted) rewards. This goal induces certain incentives for the agent. Informally, we can think of incentives as things the agent “wants”: an agent has an incentive to do X if X is a consequence of the agent acting optimally.
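To make the agent's objective concrete, here is a minimal sketch of the discounted return for a single trajectory. The function name, the example rewards, and the discount factor are illustrative assumptions, not taken from the paper; the agent's actual objective is the expectation of this quantity over trajectories.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma ** t for its time step t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Rewards from one hypothetical trajectory: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # prints 1.5
```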
Reinforcement learning (RL), and deep RL in particular, has recently seen success in settings with well-defined goals (e.g., reaching expert human level in Atari games, Go, and StarCraft). However, RL has had limited success with real-life tasks for which the goal is not easily specified, leading to a body of work on the AI alignment problem: the problem of aligning the agent’s goals (as expressed by its reward function) with the intent of the designers or users. Methods of reward learning, in which the reward function is itself taken as something to be learned, have been proposed as a solution to the alignment problem. Here, we focus on a particular problem for reward learning: that artificial agents may have incentives to manipulate humans in order to influence which reward is learned by other AI agents.
Technically, this report is grounded in the assistance (a.k.a. cooperative inverse RL) framework, which is a general formalism for reward learning. Assistance formulates the alignment problem as a two-player cooperative Markov game between a human principal and an assistive AI agent with a shared reward function. In this game the human observes the reward function but the AI does not, so the AI must maximize the reward while at the same time inferring the reward function from the human’s actions. A key feature of assistance is that the human, and the parameters of the reward function, are a part of the environment. This allows the agent to reason about how its actions affect the reward learning process, leading to several benefits over reward learning methods which assume the reward function is external to the environment (for example, the agent can make plans which depend on future feedback).
Recent work studies how to define and infer agent incentives. This work uses causal influence diagrams, which are a type of graphical model with special decision and utility nodes. In these diagrams, graphical criteria can be used to determine the incentives agents have to respond to and influence different variables in the environment. We use multi-agent influence diagrams as the formal setting for this work; they are a useful representation of games that allows us to study agent incentives.
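As a rough illustration of the inference step in assistance, the sketch below has an AI update its belief over two candidate reward functions after seeing one human action. Everything in it is an assumption made for illustration: the reward table, the names `theta_A` and `theta_B`, and the Boltzmann-rational human model with parameter `beta`. The paper's formal setting is a Markov game, which this one-shot toy does not capture.

```python
import math

# Hypothetical toy setup: two candidate reward functions over two actions.
REWARDS = {
    "theta_A": {"left": 1.0, "right": 0.0},
    "theta_B": {"left": 0.0, "right": 1.0},
}

def human_action_probs(theta, beta=2.0):
    """Assumed human model: P(action | theta) proportional to exp(beta * reward)."""
    weights = {a: math.exp(beta * r) for a, r in REWARDS[theta].items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def update_belief(prior, observed_action, beta=2.0):
    """Bayes' rule: posterior(theta) proportional to prior(theta) * P(action | theta)."""
    unnorm = {
        th: p * human_action_probs(th, beta)[observed_action]
        for th, p in prior.items()
    }
    z = sum(unnorm.values())
    return {th: u / z for th, u in unnorm.items()}

def best_action(belief):
    """The AI's action with the highest expected reward under its belief."""
    actions = REWARDS["theta_A"].keys()
    return max(actions, key=lambda a: sum(belief[th] * REWARDS[th][a] for th in belief))

prior = {"theta_A": 0.5, "theta_B": 0.5}
posterior = update_belief(prior, "left")
print(posterior, best_action(posterior))
```

In this toy, seeing the human choose `left` shifts the belief toward `theta_A`, and the AI's best action follows the observation. That dependence of the AI's decision on the human's action is what makes the human's unmanipulated behavior valuable to the AI.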
The intuition behind a response incentive is this: an agent has a response incentive over an (observed or unobserved) variable, 𝑉, if changes in 𝑉 influence the agent’s optimal decisions. For example, in assistance, the agent has an incentive to respond to the actions of the human because these actions provide information to the agent about its goal. To say that the agent has a response incentive over the human’s actions is just to say that the actions made by an optimal agent will be dependent on the feedback provided by the human.
In words, at a Nash equilibrium (N.E.), an agent has an influence incentive over a variable if, had the agent played a non-optimal policy, the variable would have been different (no matter how the other players optimally responded). This captures cases in which the variable in question is instrumentally useful for the agent, or is influenced as a side-effect of the policy. We use this notion of influence to define manipulation and cooperation incentives.
In the multi-agent influence diagram setting, we define a manipulation incentive as an influence incentive over the action of a target player, which causes the target to get lower utility. We can conversely define a cooperation incentive as an influence incentive over another agent’s actions which causes them to be better off.
We’ll now look at an example in which an AI agent has an incentive to manipulate a human target in order to influence which reward is learned by this human’s assistive AI agent.
Suppose, hypothetically, that there is a global pandemic and that two humans wish to utilize AI agents to create vaccines of two possible types. Suppose further that the two humans have different preferences over the ratio of type 1 to type 2 vaccines. I won’t go through the technical details here; I’ll just try to provide some intuition and grounding.
An informal diagram of the vaccine game example:
The key point is that an agent will seek to manipulate the action of another agent if doing so is more valuable than observing what this action would have been. The SVP increases the value of observing the actions of opposing humans and thus reduces the incentives to manipulate these actions.
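The trade-off in the key point above can be sketched with a deliberately crude toy model. This is not the vaccine game from the paper: the binary reward parameter, the `similarity` probability (standing in for how strongly the prior ties the two humans' rewards together), and the fixed manipulation `gain` are all hypothetical quantities introduced here for illustration.

```python
def value_of_observing(similarity):
    """Expected reward if the agent watches the opposing human's unmanipulated
    action and then acts on its best guess about its own (binary) reward.
    `similarity` is the assumed P(own reward == other human's reward)."""
    return max(similarity, 1.0 - similarity)

def value_of_manipulating(gain, prior_guess=0.5):
    """Expected reward if the agent forces the other human's action: it collects
    a direct `gain`, but the forced action is uninformative, so it has to guess
    its own reward from the prior alone."""
    return gain + prior_guess

def prefers_manipulation(similarity, gain):
    """True when manipulating is worth more than observing in this toy model."""
    return value_of_manipulating(gain) > value_of_observing(similarity)

print(prefers_manipulation(similarity=0.5, gain=0.2))  # independent rewards: True
print(prefers_manipulation(similarity=0.9, gain=0.2))  # SVP-style prior: False
```

With independent rewards (`similarity=0.5`) the other human's action tells the agent nothing, so any positive gain makes manipulation attractive. With a strongly shared prior, observing is worth more than the gain, and in this toy the incentive to manipulate disappears. Note that a large enough `gain` would still outweigh it.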
In the field of reward learning, there are several well-known types of reward unidentifiability, meaning that, in general, an agent's reward function cannot be uniquely identified from her behavioral policy. In this section, we highlight a new type of reward unidentifiability, caused by manipulation. If a target may be arbitrarily manipulated, and the manipulator is indifferent about which targets are mapped to which policies, then nothing can be inferred about the target's preferences from the target's behavior.
Theorem 5.6. For a target, T, with reward function RT, and a manipulator M with reward RM, suppose that
Then no information can be gained about RT from the observed target policy.
Proof. Given manipulation, T's observed policy is m(T), by 1). But by 2), m(T) is invariant under changes in T, in particular under changes in RT, and so no change in RT would result in an observable change in the target's behavior.
Remark. In practice, the information loss may be less extreme, because the manipulator may not be able to fully manipulate the target. However, note also that manipulation need not actually take place in order to cause this reward unidentifiability; it need only be the case that the observer does not know whether the target is being manipulated.
Corollary. Clearly, if we remove assumption 2) from our theorem, then we can infer information about the target's reward by observing m(T), given that we have knowledge of the mapping m(⋅). This could be the case, for example, if it is less costly for M to manipulate certain targets into particular policies.
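The theorem and its corollary can be illustrated with a small Bayesian sketch. The three candidate rewards, the prior, and the policy names below are hypothetical, and the observer is assumed to know the manipulation map m(⋅) and to see the manipulated policy exactly.

```python
def posterior_over_rewards(prior, m, observed_policy):
    """Bayes' rule for an observer who knows the map m (reward -> observed policy).
    The likelihood of a policy is 1 if m maps the reward to it, else 0.
    Assumes observed_policy is in the image of m (otherwise the normalizer is 0)."""
    unnorm = {
        r: p * (1.0 if m[r] == observed_policy else 0.0)
        for r, p in prior.items()
    }
    z = sum(unnorm.values())
    return {r: u / z for r, u in unnorm.items()}

prior = {"R1": 0.5, "R2": 0.3, "R3": 0.2}

# m(T) invariant under changes in T: every target ends up with the same policy.
constant_m = {"R1": "pi_star", "R2": "pi_star", "R3": "pi_star"}
# Invariance dropped (the corollary's case): the policy varies with the target.
varying_m = {"R1": "pi_a", "R2": "pi_b", "R3": "pi_b"}

print(posterior_over_rewards(prior, constant_m, "pi_star"))
print(posterior_over_rewards(prior, varying_m, "pi_b"))
```

In the constant case the posterior is the same as the prior, so the observation carries no information about the target's reward. In the varying case, observing `pi_b` rules out `R1` and renormalizes the remaining mass over `R2` and `R3`.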
Advantages and Disadvantages for the SVP as a Solution to Manipulation.
We claim that the SVP is a realistic assumption in open-ended and general domains and is well-motivated by literature on psychology and AI alignment. Furthermore, designers of AI systems have self-interested incentives to adopt the SVP assumption because, if it is indeed correct, it allows agents to gain more information about their rewards and therefore to achieve greater reward.
However, the SVP also has several drawbacks:
Future Work. We can see the SVP solution as a single instantiation of a larger framing: How should we design the training environment to encourage cooperation and reduce conflict? The SVP is one possible assumption and future work will identify new assumptions about the environment which encourage cooperation. Another avenue for future work that we are already pursuing is to provide an exhaustive categorization of the mechanisms of manipulation, including, for example, deception, threats/offers, exploitation, etc.
Acknowledgements. The author is grateful to Lewis Hammond, Ryan Carey, Mathew MacDermott, Tom Everitt, and Richard Everett for invaluable feedback and assistance while completing this work. This work was supported by the UKRI Centre for Doctoral Training in Safe and Trusted AI and by The Center on Long-Term Risk.
Isn't this a temporary solution at best? Eventually you resolve your uncertainty over the reward (or, more accurately, you get as much information as you can about the reward, potentially leaving behind some irreducible uncertainty), and then you start manipulating the target human.
I'm pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.
Yeah, at the end of the post I point out both the potential falsity of the SVP and the problem of updated deference. Approaches that make the agent indefinitely uncertain about the reward (or at least uncertain for longer) might help with the latter, e.g. if H is also uncertain about the reward, or if preferences are modeled as changing over time or with different contexts, etc.
I agree, and I'm not sure I endorse the SVP, but I think it's the right type of solution -- i.e. an assumption about the training environment that (hopefully) encourages cooperative behaviour.
I've found it difficult to think of a more robust/satisfying solution to manipulation (in this context). It seems like agents just will have incentives to manipulate each other in a multi-polar world, and it's hard to prevent that.