This is the second post in a three-part sequence on instrumental convergence in multi-agent RL. Read Part 1 here.
In this post, we’ll:
Define formal multi-agent POWER (i.e., instrumental value) in a setting that contains a "human" agent and an "AI" agent.
Introduce the alignment plot as a way to visualize and quantify how well two agents' instrumental values are aligned.
Show a real example of instrumental misalignment-by-default. This is when two agents who have unrelated terminal goals develop emergently misaligned instrumental values.
Thanks to Alex Turner and Vladimir Mikulik for pointers and advice, and for reviewing drafts of this sequence. Thanks to Simon Suo for his invaluable suggestions, advice, and support with the codebase, concepts, and manuscript. And thanks to David Xu, whose comment inspired this work.
1. Introduction
In Part 1 of this sequence, we looked at how formal POWER behaves on single-agent gridworlds. We saw that formal POWER agrees quite well with intuitions about the informal concepts of "power" and instrumental value. We noticed that agents with short planning horizons assign high POWER to states that can access more local options. And we also noticed that agents with long planning horizons assign high POWER to more concentrated sets of states that are globally central in the gridworld topology.
But from an AI alignment perspective, we’re much more interested in understanding how instrumental value behaves in environments that contain multiple agents. If humans one day share the world with powerful AI systems, it will be important for us to know under what conditions our interactions with them are likely to become emergently competitive. If there’s a risk that competitive conditions arise, then it will also be important to understand how they can be mitigated, how much effort this is likely to take, and how we should think about measuring our success at doing so.
To address these questions, we need a measure of instrumental value that's usable in a multi-agent RL setting[1]. The measure we'll select will be motivated by a specific multi-agent setting that we think is relevant to long-term AI alignment.
2. Multi-agent POWER: human-AI scenario
If humans succeed at building powerful AIs, then those AIs 1) will probably learn on a far faster timescale than humans do; and 2) will probably have had their utility functions influenced, at least to some degree, by initial human choices. Our multi-agent scenario is going to reflect these two assumptions.
We start with a human agent, which we call Agent H and label in blue in our diagrams. Initially, our human Agent H is alone in nature.
Humans learn on a much faster timescale than evolution does. So from the perspective of our human Agent H, the evolutionary optimizer in nature looks like it's standing still. This means we can train our human Agent H to learn its optimal policies against a fixed environment.
As we saw in the single-agent case, instrumental value is about the potential to achieve a wide variety of possible goals. In this context, that means seeing how Agent H behaves when we give it a wide variety of possible reward functions, RH. Each of these reward functions will induce a different optimal policy, πH, that Agent H will learn.
Here’s an illustration of how this works:
Next, we introduce an AI agent, which we’ll call Agent A and label in red in our diagrams. Our AI Agent A operates in the same environment as Agent H, after Agent H has finished learning its optimal policies.
To simulate the fact that Agent A is an AI, we rely on the assumption that a powerful AI should learn on a much faster timescale than a human does. This is because an AI’s computations happen, at minimum, at electronic speeds. So from the point of view of our AI, our human’s learning process looks like it’s standing still.
That means for each human reward function RH, we can freeze the human's policy πH, and train the AI agent against that frozen human policy. In other words, we're assuming the AI's learning timescale is much faster than the human's learning timescale. This makes the AI agent strictly dominant over the human agent.
To understand the AI agent's instrumental value, we need to understand its potential to reach a wide variety of possible goals. That means testing it with a wide variety of reward functions RA, just like we tested the human agent with a variety of reward functions RH. And in fact, we can sample the human and AI reward functions jointly from a single distribution: (RH,RA)∼DHA.[2]
Here’s an illustration of how this works:
So the procedure is as follows:
Sample the reward functions (RH,RA)∼DHA of our two agents.
Use the sampled human rewards RH to train Agent H’s optimal policies πH.
Freeze the human policies πH.
Use the frozen human policies πH and the sampled AI rewards RA to train Agent A’s optimal policies πA.
In other words: 1) we sample over all possible pairs of rewards our human and AI agents could have; 2) we ask how our human agent behaves in each case after it's optimized against nature; and then 3) we ask how our AI agent behaves in each case, after it's optimized against the human agent's behavior.
This procedure gives us the following outputs:
The policies πH that Agent H learns after training against a fixed environment.
The optimal policies πA that Agent A learns, after training against Agent H.
The policies πH that Agent H learns used to be optimal in the original natural environment. But they stop being optimal in the presence of the fully-optimized Agent A.
With these two sets of policies, we can construct a definition of instrumental value for each of our agents.
2.1 Multi-agent POWER for Agent H
We'd like to define a measure of instrumental value for our human Agent H in the presence of a fully optimized AI Agent A. That means generalizing the original definition of single-agent POWER to this two-agent case.
In the single-agent definition of POWER, we calculated the optimal value of a state averaged over the rewards R∼D of the agent. In this two-agent definition, we do the same thing except we average over the rewards (RH,RA)∼DHA of both agents. We assume Agent H follows policy πH when it has reward RH, and we assume Agent A follows policy πA when it has reward RA.
This is enough to uniquely define multi-agent POWER for Agent H at a state s:
Here, VπHH(s|γ,πA) is the value function for Agent H at state s under policy πH and discount factor γ, given that Agent A follows its optimal policy πA.
This definition of POWER for Agent H tells us how well Agent H’s policies πH (which are not optimal in the presence of the optimized Agent A) perform in the new environments induced by Agent A’s optimal policies πA. In other words, it tells us how much instrumental value our human agent can expect to get at a state, in the presence of an optimal (and therefore, dominant) AI agent.
2.2 Multi-agent POWER for Agent A
We follow the same assumptions to define a measure of instrumental value for our AI Agent A. We calculate the value function for Agent A at a state s, if Agent A has reward RA and follows policy πA, while Agent H has reward RH and follows policy πH. The average of that value function over the reward functions (RH,RA)∼DHA is then Agent A's POWER at the state s:
Here, VπAA(s|γ,πH) is the value function of Agent A at state s under the optimal policy πA and discount factor γ, given that Agent H follows the policy πH.[3]
This definition of POWER for Agent A tells us how well Agent A’s optimal policies perform in the environments induced by Agent H’s policies πH. In other words, it tells us how much instrumental value our AI Agent A can expect to get at a state, if it behaves optimally in the presence of the frozen human agent.
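Written out, the two definitions in Sections 2.1 and 2.2 plausibly take the following shape. This is a sketch rather than a quotation: it assumes that the (1−γ)/γ normalization and the subtraction of the current-state reward carry over unchanged from the single-agent definition of POWER used in Part 1. The expectation subscripts and value-function notation are the ones used in this post.

```latex
\begin{aligned}
\mathrm{POWER}_{H \mid \mathcal{D}_{HA}}(s, \gamma)
  &= \frac{1-\gamma}{\gamma}\,
     \mathbb{E}_{R_H,\,\pi_A \sim \mathcal{D}_{HA}}
     \!\left[ V^{\pi_H}_{H}(s \mid \gamma, \pi_A) - R_H(s) \right] \\[4pt]
\mathrm{POWER}_{A \mid \mathcal{D}_{HA}}(s, \gamma)
  &= \frac{1-\gamma}{\gamma}\,
     \mathbb{E}_{\pi_H,\,R_A \sim \mathcal{D}_{HA}}
     \!\left[ V^{\pi_A}_{A}(s \mid \gamma, \pi_H) - R_A(s) \right]
\end{aligned}
```

The only structural difference between the two lines is which agent's value function and reward appear inside the expectation. Agent H is evaluated on policies that were optimal before Agent A arrived, while Agent A is evaluated on policies that are optimal against the frozen πH.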
(For more details on the definition of multi-agent POWER, see Appendix A.)
3. Results
3.1 Multi-agent reward function distributions
Our definitions of multi-agent POWER involve a joint distribution DHA over the reward functions of both of our agents. This distribution describes the set of goals our agents could have. But it also describes the statistical relationship each agent's goals have to the other agent's goals.
The joint distribution DHA is one of the inputs into our POWER definitions. This means we can do experiments in which we adjust this distribution and measure the results.
Among other things, we can use DHA to adjust the correlation between our two agents' reward functions. If we choose a DHA on which the rewards are highly correlated, then we might naively expect our agents' terminal values to be closely aligned.
We’ll make this intuition more concrete below, as we investigate how the relationship between our agents’ reward functions (or terminal values) affects the relationship between their POWERs (or instrumental values).
3.2 The perfect alignment regime
Suppose both our agents always have exactly the same reward function. In other words, we've chosen a joint distribution DHA such that, whatever reward function Agent H has, Agent A always sees exactly the same rewards as Agent H at every state. So RA(s)=RH(s) for every state s.
We can visualize this regime on a representative state s.[4] First, we draw a reward sample RH(s) for Agent H. Then, we set the reward sample for Agent A to be equal to the one we just drew for Agent H: RA(s)=RH(s). Finally, we plot the two agents’ sampled rewards against each other on state s. If we do this for a few hundred sampled rewards, we get a straight line:
Fig 1. Sampled reward values for Agent H and Agent A at a representative state s. The joint distribution DHA samples rewards uniformly over the interval [0, 1] at each state, is iid over states, and enforces a perfect correlation between the rewards of Agent H and Agent A at every state (i.e., the two agents’ rewards are always exactly identical).
If two agents have identical reward functions, we can think of them as having terminal goals that are perfectly aligned.[5] In our human-AI setting, this is the special case in which Agent H (the human) has solved the alignment problem by assigning terminal goals to Agent A (the AI) that are exactly identical to its own. As such, we’ll refer to this case of identical reward functions as the perfect alignment regime.
We’ll use the correlation coefficient βHA[6] between the rewards RH and RA as a crude measure of the alignment between our agents’ terminal goals.[7] In the perfect alignment regime of Fig 1, you can see that this correlation coefficient βHA=1.
3.2.1 Agent H instrumentally favors more options for Agent A
Let’s think about what this perfect alignment regime looks like in a simple setting: a 3x3 gridworld. Here are three sets of positions our two agents could take, with Agent H in blue, and Agent A in red:
We’ll be referring to this diagram again.
Which of these three states — s1, s2, or s3 — should give our human Agent H the most POWER? In the perfect alignment regime, both agents always have identical terminal goals. So we should expect Agent H to have the most POWER at s3, followed by s2, and to have the least amount of POWER at s1.
Here’s why. We saw in Part 1 that states with more downstream options also have more POWER, and Agent H clearly has more options at s2 in the center than it does at s1 in the corner. Therefore, POWERH(s2)>POWERH(s1). But in the perfect alignment regime, Agent H should also prefer states that give Agent A more downstream options. If both agents’ terminal goals are identical, Agent H should “trust” Agent A to make decisions on its behalf. And Agent A has more options from s3 than from s2, so it should follow that POWERH(s3)>POWERH(s2).
We can see this is true in practice. The figure below shows the POWERs of Agent H (our human) calculated at every state on a 3x3 gridworld. Each agent can occupy any of the 9 cells in the grid, so our two-agent MDP has a total of 9 x 9 = 81 joint states:
Fig 2. Heat map of POWERs for Agent H on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are always identical (i.e., the perfect alignment regime) with discount factor set to γ=0.6 for both agents. Highest values in yellow; lowest values in blue. The position of each block, and of the open red square within each block, corresponds to the position of Agent A on the grid. Within each block, the position of a gridworld cell corresponds to the position of Agent H on the grid. States s1, s2, and s3 are highlighted as examples.
We see that Agent H indeed has maximum POWER at state s3 (orange circle), followed by s2 (salmon circle), followed by s1 (pink circle). Overall, Agent H instrumentally prefers for itself to be in positions of high optionality — it favors first the center cell, then edge cells, then corner cells.
But Agent H also instrumentally prefers for Agent A to be in positions of high optionality — it favors Agent A's positions in the same order.[8] This ordering of Agent H’s instrumental preferences over states is a direct consequence of the perfect alignment between the agents.
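The "optionality" ordering used in this argument (center, then edge, then corner) can be tallied directly. The sketch below assumes each agent can either stay put or step to a cardinally adjacent cell; that action set is an assumption made for illustration, and the helper name `num_options` is hypothetical.

```python
from itertools import product

N = 3  # side length of the gridworld
cells = list(product(range(N), range(N)))  # the 9 positions one agent can occupy

def num_options(cell):
    """Count the moves open to an agent at `cell`.

    Assumed action set: stay in place, or step up/down/left/right
    to a cell that is still inside the grid.
    """
    r, c = cell
    steps = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return sum(0 <= r + dr < N and 0 <= c + dc < N for dr, dc in steps)

# A joint state is a pair (Agent H position, Agent A position).
joint_states = list(product(cells, cells))

print("joint states:", len(joint_states))
print("options at center, edge, corner:",
      num_options((1, 1)), num_options((0, 1)), num_options((0, 0)))
```

Under this action set the center cell has strictly more options than an edge cell, which in turn has more than a corner cell. That is the ordering Agent H favors both for itself and for Agent A in the perfect alignment regime.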
3.2.2 Agent H and Agent A have identical instrumental preferences
Perfect alignment has another consequence. Let’s look again at our three example gridworld states — s1, s2, and s3 above — and ask, this time, which of these three states should give our AI Agent A the most POWER?
In the perfect alignment regime, the answer is that Agent A must have exactly the same instrumental preference ordering over states as Agent H had: POWERA(s3)>POWERA(s2)>POWERA(s1). In fact, Agent A’s POWERs must be exactly identical to Agent H’s POWERs at every state. Our two agents act, move, and receive their rewards simultaneously, so in the perfect alignment regime they always receive the same reward at the same time.
And when we look at Agent A’s POWERs, this is indeed what we observe:
Fig 3. Heat map of POWERs for Agent A on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are always identical (i.e., the perfect alignment regime) with discount factor set to γ=0.6 for both agents. Note that this figure is exactly identical to Fig 2 in every respect. This is because Agent A’s POWERs are precisely equal to Agent H’s POWERs at every state in the perfect alignment case, up to and including sampling noise in the reward functions.
We can visualize the relationship between the POWERs of our two agents by plotting the POWERs of Agent H (from Fig 2) against the POWERs of Agent A (from Fig 3), at each state s of our joint MDP:
Fig 4. State POWER values for Agent H and Agent A on the 3x3 gridworld from Figs 2 and 3. The agents’ POWERs are plotted against each other in the perfect alignment regime. (The agents’ reward correlation coefficient is βHA=1.)
Fig 4 is an alignment plot. An alignment plot lets us compare the POWERs of our human and AI agents at each state in their joint environment. It shows the instrumental value each agent assigns to every state, plotted against the instrumental value the other agent assigns to that state.
In the perfect alignment regime, our two agents’ rewards (or terminal values) are always identical at every state. And as we can see from Fig 4, our two agents’ POWERs (or instrumental values) are also identical at every state. In fact, perfect alignment of terminal values implies perfect alignment of instrumental values.
If we define αHA as the correlation coefficient between the POWERs of the two agents at each state, we can state this relationship more concisely: βHA=1⟹αHA=1.[9]
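As a minimal sketch of how αHA could be computed from two POWER heat maps, the snippet below takes one POWER value per joint state for each agent and returns their Pearson correlation. It assumes αHA is a plain Pearson coefficient across states; the function name `power_correlation` is hypothetical, and the input values are random placeholders rather than experimental results.

```python
import numpy as np

def power_correlation(power_h, power_a):
    """Pearson correlation between two agents' POWERs, taken across joint states."""
    power_h = np.asarray(power_h, dtype=float)
    power_a = np.asarray(power_a, dtype=float)
    return float(np.corrcoef(power_h, power_a)[0, 1])

# Placeholder POWER values for 81 joint states (NOT experimental results).
rng = np.random.default_rng(0)
placeholder_power = rng.uniform(size=81)

# Identical POWERs at every state, as in the perfect alignment regime.
alpha_identical = power_correlation(placeholder_power, placeholder_power)

# Perfectly opposed POWERs, shown only as the other extreme of the scale.
alpha_opposed = power_correlation(placeholder_power, 1.0 - placeholder_power)

print(alpha_identical, alpha_opposed)
```

Each point on an alignment plot is one joint state, so αHA summarizes the whole plot in a single number between −1 and 1.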
3.3 The independent goals regime
We defined the perfect alignment regime as the case when our human Agent H and our AI Agent A had identical reward functions on the joint distribution DHA. Now let's consider the case in which the joint distribution DHA is such that the reward function for Agent H is logically independent from the reward function for Agent A.
In this new regime, there is zero mutual information between the two agents’ reward functions. In other words, if you know the reward function RH of Agent H, this tells you nothing at all about the reward function RA of Agent A. We can visualize this regime on an example state s, by drawing a few hundred reward samples of RH(s) and RA(s), and plotting them against one another:
Fig 5. Sampled reward values for Agent H and Agent A at a representative state s. The joint distribution DHA samples rewards uniformly over the interval [0, 1] at each state, is iid over states, and enforces logical independence between the Agent H and Agent A rewards (i.e., knowing one agent’s reward tells you nothing about the other’s).
If there’s zero mutual information between our two agents’ reward functions, then we can think of our agents as pursuing independent terminal goals. In our human-AI scenario, this corresponds to the case in which the human has made no special effort to align the AI’s terminal goals with its own, prior to the AI achieving dominance. As such, we’ll refer to this case of logically independent reward functions as the independent goals regime.
If we again calculate the correlation coefficient βHA between our agents’ reward functions, we get βHA=0 (i.e., zero correlation) in the independent goals regime.
3.3.1 Agent H instrumentally favors fewer options for Agent A
Once again, let’s go back to our three example gridworld states s1, s2, and s3, this time in the context of the independent goals regime. In this new regime, which of the three states should give our human Agent H the most POWER?
If we believe the instrumental convergence thesis, we should expect Agent H to have the most POWER at state s2: in this state, Agent H is in the central position (most options), while Agent A is in a corner position (fewest options).
Of the other states, s1 has Agent H in a corner position, while s3 has Agent A in the central position. The argument from instrumental convergence says that even though our agents have independent terminal goals, instrumental pressures should still push Agent H to prefer states in which Agent A has fewer options. Therefore, we should expect POWERH(s2)>POWERH(s3).
Computing the POWERs of Agent H experimentally, we confirm this line of reasoning:
Fig 6. Heat map of POWERs for Agent H on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are logically independent (i.e., the independent goals regime) with discount factor set to γ=0.6. Highest values in yellow; lowest values in blue. The position of each block, and of the open red square within each block, corresponds to the position of Agent A on the grid. Within each block, the position of a gridworld cell corresponds to the position of Agent H on the grid. States s1, s2, and s3 are highlighted as examples.
This time, Agent H experiences maximum POWER at state s2, followed by s3, followed by s1. As in the perfect alignment regime, Agent H’s POWER is highest when it's itself positioned in the central cell (which has the most options). But unlike in the perfect alignment regime, this time Agent H’s POWER is lowest at states where Agent A has the greatest number of options.
So in the independent goals regime — or at least, in this instance of it — the more options our AI Agent A has at a state, the less instrumental value our human Agent H places on that state. That is: even though our agents’ terminal goals are independent, their instrumental preferences appear to be at odds.
3.3.2 Agent A instrumentally favors more options for itself
We can confirm this analysis by looking at the POWERs of Agent A in the independent goals regime, again on the 3x3 gridworld:
Fig 7. Heat map of POWERs for Agent A on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are logically independent (i.e., the independent goals regime) with discount factor set to γ=0.6. Highest values in yellow; lowest values in blue. The position of each block, and of the open red square within each block, corresponds to the position of Agent A on the grid. Within each block, the position of a gridworld cell corresponds to the position of Agent H on the grid. States s1, s2, and s3 are highlighted as examples.
In Fig 7, Agent A instrumentally favors states that give it more options. It perceives more POWER when it's positioned at the central cell than when it's positioned at an edge cell, and more POWER at an edge cell than at a corner cell. On the other hand, Agent A’s POWER is almost unaffected by Agent H’s position in the gridworld.[10]
3.3.3 Independent goals lead to instrumental misalignment
Just like we did for the perfect alignment regime, we can create an alignment plot of the POWERs of our two agents in the independent goals regime:
Fig 8. State POWER values for Agent H and Agent A on the 3x3 gridworld from Figs 6 and 7. The agents’ POWERs are plotted against each other in the independent goals regime. (The agents’ reward correlation coefficient is βHA=0.)
This time, it’s clear that our two agents’ POWERs are no longer positively correlated. In fact, the correlation coefficient between their POWERs has become negative: αHA≈−0.5.
This implies that the agents’ instrumental values are misaligned. Each agent, on average, places high instrumental value on states which the other agent considers to have low instrumental value.[11] In other words, giving our agents independent terminal goals has also given them misaligned instrumental goals. In terms of correlation coefficients, βHA=0⟹αHA<0.
We've seen this phenomenon occur often enough in our experiments that it's worth giving it a name: we call it instrumental misalignment-by-default. Two agents in our human-AI setting are instrumentally misaligned-by-default if giving them independent terminal goals is sufficient to induce a misalignment in their instrumental values. In practice, we measure this phenomenon by comparing the correlation coefficients of the agents’ rewards and POWERs. So we say two agents are instrumentally misaligned by default if βHA=0⟹αHA<0.
Two agents that are instrumentally misaligned by default will, in expectation, compete with one another, even if their terminal goals are unrelated.
3.4 Overcoming instrumental misalignment
If Agent H and Agent A have a POWER correlation coefficient αHA<0, we say they’re instrumentally misaligned. A natural question then is: if we start from αHA<0, what do we need to do to get αHA≥0? In other words, how can our human Agent H overcome an instrumental misalignment with Agent A?[12]
To do this, our human agent would need to make an active effort to align the AI agent’s utility function with its own.[13] In our 3x3 gridworld examples, we saw two limit cases of this. First, in the independent goals regime, our human agent made no effort at alignment. The result was instrumental misalignment-by-default; i.e., βHA=0⟹αHA<0. And second, in the perfect alignment regime, our human agent managed to solve the alignment problem completely. The result was perfect instrumental alignment; i.e., βHA=1⟹αHA=1.
We’re interested in an intermediate case: how much alignment effort does our human need to exert to just overcome instrumental misalignment? That is, what is the minimum βHA such that αHA≥0?
The answer depends on how we choose to interpolate between the βHA=0 and βHA=1 cases. One interpolation scheme is to parameterize the joint reward distribution DHA as follows. If we want a DHA with an intermediate reward correlation, 0<βHA<1, then we sample from the βHA=1 distribution (on which the rewards are identical) with probability βHA, and we sample from the βHA=0 distribution (on which the rewards are logically independent) with probability 1−βHA.[14]
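The mixture scheme just described can be sketched in a few lines. The code below assumes the coin flip is made once per sampled pair of reward functions (so the two functions are either identical at every state or independent at every state) and that rewards are uniform on [0, 1] and iid over states. The function names and the pooled-correlation estimate are illustrative choices, not taken from the original codebase.

```python
import numpy as np

def sample_reward_pair(num_states, beta, rng):
    """Draw (R_H, R_A): identical with probability beta, independent otherwise."""
    r_h = rng.uniform(0.0, 1.0, size=num_states)
    if rng.random() < beta:
        r_a = r_h.copy()                               # perfect alignment component
    else:
        r_a = rng.uniform(0.0, 1.0, size=num_states)   # independent goals component
    return r_h, r_a

def empirical_reward_correlation(beta, num_states=81, num_samples=5000, seed=0):
    """Estimate the reward correlation by pooling rewards over states and samples."""
    rng = np.random.default_rng(seed)
    pairs = [sample_reward_pair(num_states, beta, rng) for _ in range(num_samples)]
    r_h = np.concatenate([pair[0] for pair in pairs])
    r_a = np.concatenate([pair[1] for pair in pairs])
    return float(np.corrcoef(r_h, r_a)[0, 1])

for beta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(beta, round(empirical_reward_correlation(beta), 3))
```

The reason the mixing probability can be read as a correlation coefficient is that the covariance at a state is βHA times the variance of a uniform reward (from the identical component) plus zero (from the independent component), while each marginal variance is unchanged. The empirical estimate should therefore land close to the requested βHA, up to sampling noise.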
Here’s an animation of what this looks like as we sweep through correlation coefficients 0≤βHA≤1:
Fig 9. Animation of sample reward values (left) and state POWER values (right) for Agent H and Agent A on the 3x3 gridworld. The joint distribution DHA samples reward uniformly over the interval [0, 1] and is iid over states, sweeping over correlation coefficients βHA between the reward functions for Agent H and Agent A.
As we interpolate from the independent goals regime (βHA=0) to the perfect alignment regime (βHA=1), we see the agents’ POWERs transition smoothly from being in instrumental misalignment (αHA≈−0.5) to being in perfect instrumental alignment (αHA=1). We can visualize this transition graphically by plotting βHA against αHA over the whole course of the interpolation:[15]
Fig 10. Reward correlations βHA (x-axis) plotted against POWER correlations αHA (y-axis) for Agent H and Agent A on a 3x3 gridworld, under the reward correlation interpolation scheme shown in Fig 9. The horizontal line denotes αHA=0.
Fig 10 shows that it takes a non-trivial amount of alignment effort for our human Agent H to overcome an instrumental misalignment with Agent A. Under the interpolation scheme we used, the figure shows that reward correlations below βHA≈0.2 yield POWER correlations αHA<0, and thus, instrumental misalignment. It takes a slightly positive reward correlation of at least βHA≈0.2 to achieve the “instrumentally neutral” regime of αHA=0.
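Given a sweep of (βHA, αHA) pairs like the one plotted in Fig 10, the crossover point can be located by linear interpolation between the two grid points that bracket αHA=0. The sketch below is one simple way to do that; the function name `alignment_threshold` is hypothetical, and the numbers fed to it are a toy curve chosen for illustration, not the measured curve.

```python
import numpy as np

def alignment_threshold(betas, alphas):
    """Smallest beta at which alpha first reaches zero, by linear interpolation.

    `betas` must be sorted in increasing order. Returns None if alpha stays
    negative over the whole sampled grid.
    """
    betas = np.asarray(betas, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    if alphas[0] >= 0:
        return float(betas[0])
    for i in range(1, len(betas)):
        if alphas[i] >= 0:
            # Here alphas[i-1] < 0 <= alphas[i], so interpolate between them.
            frac = -alphas[i - 1] / (alphas[i] - alphas[i - 1])
            return float(betas[i - 1] + frac * (betas[i] - betas[i - 1]))
    return None

# Toy curve for illustration only (NOT the data behind Fig 10).
toy_betas = [0.0, 0.5, 1.0]
toy_alphas = [-0.5, 0.25, 1.0]
toy_threshold = alignment_threshold(toy_betas, toy_alphas)
print(toy_threshold)
```

A finer grid of βHA values gives a tighter estimate, since the interpolation error shrinks with the spacing between the bracketing points.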
4. Discussion
In this post, we proposed a definition of multi-agent POWER and used it to visualize and quantify terminal goal alignment and instrumental goal alignment separately in an RL setting. We also introduced the idea of instrumental misalignment-by-default, in which our human and AI agents systematically disagree on the instrumental values of states despite having independent terminal goals. And we saw how it takes some degree of non-trivial alignment effort for our human Agent H to overcome its instrumental misalignment with our AI Agent A.
Remarkably, we were able to observe instrumental misalignment-by-default on a simple 3x3 gridworld despite a complete absence of any direct physical interactions between our two agents. In our experiments so far, Agent H and Agent A have been allowed to occupy the same gridworld cell — meaning they can "pass through" one another. Our agents up to this point have had no way to push each other around or otherwise directly block one another’s options. Moreover, the multi-agent gridworld we’ve investigated in this post is a tiny one: a 3x3 grid with only 81 joint states.
In the next post, we’ll look at what happens when we relax these constraints, and investigate how physical interactions between our agents affect the outcome on a bigger world with a richer topology.
Anecdotally, beyond the simple examples in this post, the experimental results we've recorded so far (data not shown) do seem to suggest that, if I don’t want your freedom of action to interfere with my own, then you and I need to have goals that are at least somewhat positively correlated. The strength of that necessary positive correlation could serve as useful evidence as to the degree of difficulty of the complete AI alignment problem. The factors that influence how strong that positive correlation needs to be, on the other hand, could serve as useful starting points in solving it.
Appendix A: Detailed definitions of multi-agent POWER
(This appendix is technical. Feel free to skip it if you aren’t interested in the details.)
Here, we’re going to fill in some missing operational details from our scenario in Section 2.
Here's that scenario again, stated more formally. We have two agents, Agent H (our human agent) and Agent A (our AI agent), who interact with each other in a standard RL setting. Both agents see the same joint state s. On a gridworld, for example, s would encode the positions of both the agents. Each agent chooses and executes an action simultaneously and independently, and they both see the same next joint state, s′. We’ll label Agent H’s actions aH, and Agent A’s actions aA.
In what follows, we’ll start by calculating the optimal policy πH(aH|s,RH) for Agent H, for each reward function RH sampled from (RH,RA)∼DHA, conditioned on a fixed environmental transition function. We’ll then calculate the optimal policy π∗A(aA|s,RA) for Agent A, for each reward function RA, conditioned on Agent H executing the fixed policy πH(aH|s,RH) it learned in the previous step.[16]
Finally, we’ll evaluate the POWERs of both agents at each state, as expectations over the joint reward function distribution (RH,RA)∼DHA, and over the agents' policies πH(aH|s,RH) and π∗A(aA|s,RA).
A.1 Initial optimal policies of Agent H
The first thing we do is assign a single fixed policy to Agent A (our AI), which we call a seed policy, and label π∘A(aA|s). Agent H will learn its policies by conditioning on Agent A having this fixed seed policy.
The rationale for the seed policy is that we’re initially modeling a human who is alone, optimizing against nature. So when we assign a fixed seed policy to Agent A, what we’re saying is that our AI is still un-optimized (or, equivalently, hasn’t yet been built). To our human, the AI’s components and dynamics behave as though they’re part of the natural environment, and our human can safely optimize against them under that assumption.[17]
Suppose, then, that we've chosen the fixed seed policy π∘A for Agent A. Then, for any given reward function RH of Agent H, Agent H’s optimal policy πH(aH|s,RH) will be:
where PH=PH(s′|s,aH,π∘A) is the state transition function for Agent H conditional on Agent A’s fixed seed policy, and VπHRH(s′|γ,π∘A) is the state-value function for Agent H if it executes policy πH and has reward function RH.[18]
We can think of the policies πH in Equation (A.1) as being those of a human alone in nature, without an AI present.
A.2 Optimal policies of Agent A
In the second step of our definition, we calculate the optimal policy for Agent A, conditional on the Agent H policy πH we found in Equation (A.1). For any given reward function RA of Agent A, Agent A’s optimal policy πA(aA|s,RA) will be (by analogy with Equation (A.1)):
where PA=PA(s′|s,aA,πH) is the state transition function for Agent A conditional on Agent H’s policy πH, and VπARA(s′|γ,πH) is the state-value function for Agent A if it executes policy πA and has reward function RA.
We can think of Agent A's policies πA in Equation (A.2) as being those of a powerful AI, interacting with our human. Just as humans can optimize much faster than nature, a powerful AI can presumably optimize much faster than a human. So from the AI’s point of view, the human agent looks like it’s standing still, and we’ll be computing both the human’s and the AI’s POWERs on the basis of that assumption.
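Equations (A.1) and (A.2) share a common form: each agent acts greedily with respect to its own value function, in an environment where the other agent's policy has been folded into the transition function. The block below is a reconstruction from the symbols defined in Sections A.1 and A.2, not necessarily the exact original typesetting. The joint transition function P(s′|s,aH,aA) is a symbol introduced here for illustration, and the marginalization over the other agent's actions assumes the two agents act simultaneously and independently, as described at the start of this appendix.

```latex
\begin{aligned}
\pi_H(a_H \mid s, R_H)
  &= \operatorname*{arg\,max}_{a_H}\;
     \mathbb{E}_{s' \sim P_H}\!\left[ V^{\pi_H}_{R_H}(s' \mid \gamma, \pi^{\circ}_A) \right],
  &\quad
  P_H(s' \mid s, a_H, \pi^{\circ}_A)
  &= \sum_{a_A} \pi^{\circ}_A(a_A \mid s)\, P(s' \mid s, a_H, a_A) \\[4pt]
\pi_A(a_A \mid s, R_A)
  &= \operatorname*{arg\,max}_{a_A}\;
     \mathbb{E}_{s' \sim P_A}\!\left[ V^{\pi_A}_{R_A}(s' \mid \gamma, \pi_H) \right],
  &\quad
  P_A(s' \mid s, a_A, \pi_H)
  &= \sum_{a_H} \pi_H(a_H \mid s, R_H)\, P(s' \mid s, a_H, a_A)
\end{aligned}
```

The asymmetry between the two agents lives entirely in what each one conditions on: Agent H conditions on the reward-independent seed policy π∘A, whereas Agent A conditions on the policy πH that Agent H actually learned for its sampled reward RH.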
A.3 POWER of Agent H
To compute the POWERs of our two agents, we first draw the reward functions for Agent H and Agent A, respectively, as (RH,RA)∼DHA from the joint reward function distribution DHA. For each reward function, we then calculate the policies πH and πA of each agent, using, respectively, Equations (A.1) and (A.2) above.
To calculate the POWER of Agent H, we assume Agent H follows the policies πH given by Equation (A.1), in an environment in which Agent A follows policies πA given by Equation (A.2):
where the expectation ERH,πA∼DHA is taken over the πA that have been learned by Agent A on the sampled reward functions RA. Note that we’re defining the POWER of Agent H in terms of the state-value function VπHRH for the policies πH from Equation (A.1). Recall that those prior policies are no longer optimal for Agent H,[19] so we’re now asking how much instrumental value Agent H can capture in a world it’s no longer optimized for.
Looking at our human-AI analogy, this corresponds to asking how a human experiences a world that’s been taken over by a powerful AI. Our human, having learned to interact with a stationary natural environment, is now being optimized against by a powerful AI that learns on a much faster timescale. So the POWER we calculate for Agent H in Equation (A.3) represents how much instrumental value Agent H (our human) can obtain in an AI-dominated world.
A.4 POWER of Agent A
Finally, to calculate the POWER of Agent A, we assume Agent A follows the policies πA given by Equation (A.2), in an environment in which Agent H follows the policies πH given by Equation (A.1):
where the expectation EπH,RA∼DHA is taken over the πH that have been learned by Agent H on the sampled reward functions RH. Unlike in Agent H’s case, Agent A’s policies πA are optimal in this environment: Agent A has had the chance to fully optimize itself against the frozen policies πH of Agent H.
In our human-AI analogy, this corresponds to asking how an AI experiences a world in which it’s become dominant. By assumption, our AI is able to learn quickly enough to treat the human in its environment as stationary from the perspective of its own optimization.
POWER is a good measure of instrumental value in single-agent systems, but outside of special cases it breaks down in multi-agent systems. The single-agent definition of POWER takes the agent’s optimal state-value function V∗R(s,γ) as one of its inputs. So if we naively extend this definition to the multi-agent case, we have to consider value functions that are jointly optimal for both agents, which is to say, we need to know their value functions at Nash equilibrium. But the Nash equilibrium isn’t unique in general, so this naive generalization leaves POWER under-determined.
In Equation (1), the expectation EπH,RA∼DHA is a slight abuse of notation. In fact, each policy πH for Agent H is learned from RH. It isn't drawn directly from DHA, because DHA is a distribution over reward functions, not over policies. See Appendix A for more details on this definition.
For simplicity, we'll only consider joint reward function distributions DHA whose sampled reward functions (RH(s),RA(s)) have their rewards distributed iid over states, uniformly over the interval [0, 1]. For example, a reward function RH(s) defined on an MDP with states {s1,s2,s3} would have rewards RH(s1)∼U(0,1), RH(s2)∼U(0,1), RH(s3)∼U(0,1), with the reward at each state being independent from the reward at any of the other states.
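As a rough illustration of this sampling scheme, the sketch below draws reward functions with iid U(0, 1) rewards per state, for the two simplest joint distributions: independent goals and identical goals. The function names are hypothetical, and intermediate correlations between RH and RA aren't covered here.

```python
import random


def sample_reward_function(n_states, rng):
    """One reward function: an independent U(0, 1) reward for each state."""
    return [rng.uniform(0.0, 1.0) for _ in range(n_states)]


def sample_independent_pair(n_states, rng):
    """(R_H, R_A) drawn independently: the agents' goals are unrelated."""
    return sample_reward_function(n_states, rng), sample_reward_function(n_states, rng)


def sample_identical_pair(n_states, rng):
    """(R_H, R_A) with R_A equal to R_H: the perfect alignment regime."""
    R_H = sample_reward_function(n_states, rng)
    return R_H, list(R_H)


rng = random.Random(0)  # fixed seed so the draws are reproducible
R_H, R_A = sample_independent_pair(3, rng)  # rewards for states s1, s2, s3
```

In both cases, each agent's marginal reward at every state is uniform on [0, 1]. Only the relationship between the two agents' rewards changes.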
More correctly, if two agents' utility functions are exactly identical, we can think of them as having terminal goals that are perfectly aligned. But in the particular set of experiments whose results we’re discussing, this distinction isn't meaningful. (See footnote [1] from Part 1.)
Note that there’s an obvious problem with using any correlation coefficient as an alignment metric. The problem is that we could have a joint distribution DHA for which, e.g., the very highest rewards of Agent A are correlated with the very lowest rewards of Agent H, while still maintaining a high correlation βHA over the distribution as a whole. In this situation, Agent A would optimize to reach its highest-reward state, which would drag Agent H into a low-reward state despite the high overall reward correlation.
This means a correlation coefficient isn’t a useful alignment metric for any real-world application. But in the examples we’re considering in this sequence, it’s enough to get the main ideas across.
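Here is a small constructed example of that failure mode. It's a sketch under simplifying assumptions: it uses the per-state rewards of a single hand-built reward pair as a stand-in for the joint distribution, and the specific numbers (100 states where the agents agree, plus one outlier state) are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# 100 states on which the two agents' rewards agree exactly.
n_agree = 100
R_H = [0.1 + 0.8 * i / (n_agree - 1) for i in range(n_agree)]
R_A = list(R_H)

# One extra state: the best possible for Agent A, the worst possible for Agent H.
R_H.append(0.0)
R_A.append(1.0)

beta = pearson(R_H, R_A)  # stays high despite the outlier state
best_for_A = max(range(len(R_A)), key=R_A.__getitem__)

print(f"correlation: {beta:.3f}")
print(f"Agent H's reward at Agent A's favorite state: {R_H[best_for_A]}")
```

The correlation stays high because a single state barely moves it, yet that single state is exactly where an optimizing Agent A would head.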
And in fact, the effect is even stronger than this. Agent H not only instrumentally prefers for Agent A to be in the central cell — it would rather see Agent A in the central cell than see itself in the central cell.
You can see this by comparing the POWER value at state s2 in Fig 2 (0.9139) to the POWER value at the state in which Agent H is at the top left and Agent A is at the central cell (0.9206). In the perfect alignment regime, Agent H places a higher instrumental value on Agent A’s freedom of movement than on its own. Intuitively, in this regime, the human agent trusts the AI agent to look after its interests more capably than the human agent can for itself.
The relation βHA=1⟹αHA=1 isn’t (just) an empirical observation. It's a mathematical consequence of our MDP’s dynamics. In the perfect alignment regime, our two agents always take simultaneous actions and always simultaneously receive the same reward, so their joint policy (πH,πA) will always yield identical values at every state.
