Misalignment-by-default in multi-agent systems

Edouard Harris; simonsdsuo

This is a linkpost for https://www.gladstone.ai/instrumental-convergence-2

Summary of this post

This is the second post in a three-part sequence on instrumental convergence in multi-agent RL. Read Part 1 here.

In this post, we’ll:

Define formal multi-agent POWER (i.e., instrumental value) in a setting that contains a "human" agent and an "AI" agent.
Introduce the alignment plot as a way to visualize and quantify how well two agents' instrumental values are aligned.
Show a real example of instrumental misalignment-by-default. This is when two agents who have unrelated terminal goals develop emergently misaligned instrumental values.

We’ll soon be open-sourcing the codebase we used to do these experiments. If you’d like to be notified when it’s released, email Edouard at edouard@gladstone.ai or DM him on Twitter at @harris_edouard.

Thanks to Alex Turner and Vladimir Mikulik for pointers and advice, and for reviewing drafts of this sequence. Thanks to Simon Suo for his invaluable suggestions, advice, and support with the codebase, concepts, and manuscript. And thanks to David Xu, whose comment inspired this work.

Work was done while at Gladstone AI, which Edouard is a co-founder of.

🎧 This research has been featured on an episode of the Towards Data Science podcast. Listen to the episode here.

1. Introduction

In Part 1 of this sequence, we looked at how formal POWER behaves on single-agent gridworlds. We saw that formal POWER agrees quite well with intuitions about the informal concepts of "power" and instrumental value. We noticed that agents with short planning horizons assign high POWER to states that can access more local options. And we also noticed that agents with long planning horizons assign high POWER to more concentrated sets of states that are globally central in the gridworld topology.

But from an AI alignment perspective, we’re much more interested in understanding how instrumental value behaves in environments that contain multiple agents. If humans one day share the world with powerful AI systems, it will be important for us to know under what conditions our interactions with them are likely to become emergently competitive. If there’s a risk that competitive conditions arise, then it will also be important to understand how they can be mitigated, how much effort this is likely to take, and how we should think about measuring our success at doing so.

To address these questions, we need a measure of instrumental value that's usable in a multi-agent RL setting^[1]. The measure we'll select will be motivated by a specific multi-agent setting that we think is relevant to long-term AI alignment.

2. Multi-agent POWER: human-AI scenario

If humans succeed at building powerful AIs, then those AIs 1) will probably learn on a far faster timescale than humans do; and 2) will probably have had their utility functions influenced, at least to some degree, by initial human choices. Our multi-agent scenario is going to reflect these two assumptions.

We start with a human agent, which we call Agent H and label in blue in our diagrams. Initially, our human Agent H is alone in nature.

Humans learn on a much faster timescale than evolution does. So from the perspective of our human Agent H, the evolutionary optimizer in nature looks like it's standing still. This means we can train our human Agent H to learn its optimal policies against a fixed environment.

As we saw in the single-agent case, instrumental value is about the potential to achieve a wide variety of possible goals. In this context, that means seeing how Agent H behaves when we give it a wide variety of possible reward functions, $R_{H}$ . Each of these reward functions will induce a different optimal policy, $π_{H}$ , that Agent H will learn.

Here’s an illustration of how this works:

Next, we introduce an AI agent, which we’ll call Agent A and label in red in our diagrams. Our AI Agent A operates in the same environment as Agent H, after Agent H has finished learning its optimal policies.

To simulate the fact that Agent A is an AI, we rely on the assumption that a powerful AI should learn on a much faster timescale than a human does. This is because an AI’s computations happen, at minimum, at electronic speeds. So from the point of view of our AI, our human’s learning process looks like it’s standing still.

That means for each human reward function $R_{H}$ , we can freeze the human's policy $π_{H}$ , and train the AI agent against that frozen human policy. In other words, we're assuming the AI's learning timescale is much faster than the human's learning timescale. This makes the AI agent strictly dominant over the human agent.

To understand the AI agent's instrumental value, we need to understand its potential to reach a wide variety of possible goals. That means testing it with a wide variety of reward functions $R_{A}$ , just like we tested the human agent with a variety of reward functions $R_{H}$ . And in fact, we can sample the human and AI reward functions jointly from a single distribution: $(R_{H}, R_{A}) \sim D_{H A}$ .^[2]

Here’s an illustration of how this works:

So the procedure is as follows:

Sample the reward functions $(R_{H}, R_{A}) \sim D_{H A}$ of our two agents.
Use the sampled human rewards $R_{H}$ to train Agent H’s optimal policies $π_{H}$ .
Freeze the human policies $π_{H}$ .
Use the frozen human policies $π_{H}$ and the sampled AI rewards $R_{A}$ to train Agent A’s optimal policies $π_{A}$ .

In other words: 1) we sample over all possible pairs of rewards our human and AI agents could have; 2) we ask how our human agent behaves in each case after it's optimized against nature; and then 3) we ask how our AI agent behaves in each case, after it's optimized against the human agent's behavior.
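The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the codebase used for the experiments: the joint dynamics `T` are random rather than a gridworld, rewards are drawn iid uniform on [0, 1] (the independent goals case), and the un-optimized AI is represented by a uniform-random placeholder policy `seed_A` (Appendix A introduces a seed policy, but the uniform choice here is an assumption). All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_S, N_ACT, GAMMA = 6, 3, 0.6  # hypothetical sizes; gamma matches the figures

# Hypothetical joint dynamics T[s, a_H, a_A, s'] (random, for illustration only).
T = rng.random((N_S, N_ACT, N_ACT, N_S))
T /= T.sum(axis=-1, keepdims=True)

def best_response(P, R, gamma, n_iter=200):
    """Value iteration for one agent.

    P[s, a, s'] are the dynamics that agent faces once the other agent's
    policy is held fixed; R[s] is a state-based reward.
    Returns a deterministic policy as a one-hot array pi[s, a].
    """
    V = np.zeros(len(R))
    for _ in range(n_iter):
        V = R + gamma * (P @ V).max(axis=1)
    greedy = (P @ V).argmax(axis=1)
    return np.eye(P.shape[1])[greedy]

# Assumed placeholder for the not-yet-optimized AI: uniform-random actions.
seed_A = np.full((N_S, N_ACT), 1.0 / N_ACT)

policies = []
for _ in range(5):
    # Step 1: sample a pair of reward functions (independent, iid uniform here).
    R_H, R_A = rng.random(N_S), rng.random(N_S)
    # Step 2: Agent H optimizes against the fixed environment.
    P_H = np.einsum("sb,sabt->sat", seed_A, T)  # average out A's seed actions
    pi_H = best_response(P_H, R_H, GAMMA)
    # Step 3: pi_H is frozen from here on.
    # Step 4: Agent A optimizes against the frozen human policy.
    P_A = np.einsum("sa,sabt->sbt", pi_H, T)    # plug in H's frozen actions
    pi_A = best_response(P_A, R_A, GAMMA)
    policies.append((pi_H, pi_A))
```

The key design point is the asymmetry: `pi_H` is computed without any knowledge of `pi_A`, while `pi_A` is a best response to `pi_H`.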

This procedure gives us the following outputs:

The policies $π_{H}$ that Agent H learns after training against a fixed environment.
The optimal policies $π_{A}$ that Agent A learns, after training against Agent H.

The policies $π_{H}$ that Agent H learns used to be optimal in the original natural environment. But they stop being optimal in the presence of the fully-optimized Agent A.

With these two sets of policies, we can construct a definition of instrumental value for each of our agents.

2.1 Multi-agent POWER for Agent H

We'd like to define a measure of instrumental value for our human Agent H in the presence of a fully optimized AI Agent A. That means generalizing the original definition of single-agent POWER to this two-agent case.

In the single-agent definition of POWER, we calculated the optimal value of a state averaged over the rewards $R \sim D$ of the agent. In this two-agent definition, we do the same thing except we average over the rewards $(R_{H}, R_{A}) \sim D_{H A}$ of both agents. We assume Agent H follows policy $π_{H}$ when it has reward $R_{H}$ , and we assume Agent A follows policy $π_{A}$ when it has reward $R_{A}$ .

This is enough to uniquely define multi-agent POWER for Agent H at a state $s$ :

$$\text{POWER}_{H | D_{H A}} (s, γ) = \frac{1 - γ}{γ} E_{R_{H}, π_{A} \sim D_{H A}} \left[ V_{H}^{π_{H}} (s | γ, π_{A}) - R_{H} (s) \right] \tag{1}$$

Here, $V_{H}^{π_{H}} (s | γ, π_{A})$ is the value function for Agent H at state $s$ under policy $π_{H}$ and discount factor $γ$ , given that Agent A follows its optimal policy $π_{A}$ .

This definition of POWER for Agent H tells us how well Agent H’s policies $π_{H}$ — which are not optimal in the presence of the optimized Agent A — perform in the new environments induced by Agent A’s optimal policies $π_{A}$ . In other words, it tells us how much instrumental value our human agent can expect to get at a state, in the presence of an optimal (and therefore, dominant) AI agent.

2.2 Multi-agent POWER for Agent A

We follow the same assumptions to define a measure of instrumental value for our AI Agent A. We calculate the value function for Agent A at a state $s$ , if Agent A has reward $R_{A}$ and follows policy $π_{A}$ , while Agent H has reward $R_{H}$ and follows policy $π_{H}$ . The average of that value function over the reward functions $(R_{H}, R_{A}) \sim D_{H A}$ is then Agent A's POWER at the state $s$ :

$$\text{POWER}_{A | D_{H A}} (s, γ) = \frac{1 - γ}{γ} E_{π_{H}, R_{A} \sim D_{H A}} \left[ V_{A}^{π_{A}} (s | γ, π_{H}) - R_{A} (s) \right] \tag{2}$$

Here, $V_{A}^{π_{A}} (s | γ, π_{H})$ is the value function of Agent A at state $s$ under the optimal policy $π_{A}$ and discount factor $γ$ , given that Agent H follows the policy $π_{H}$ .^[3]

This definition of POWER for Agent A tells us how well Agent A’s optimal policies perform in the environments induced by Agent H’s policies $π_{H}$ . In other words, it tells us how much instrumental value our AI Agent A can expect to get at a state, if it behaves optimally in the presence of the frozen human agent.

(For more details on the definition of multi-agent POWER, see Appendix A.)
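Both POWER definitions share the same shape: subtract the agent's own reward at the current state from its value there, average over reward samples, and rescale by $(1 - γ) / γ$ . Below is a minimal Monte Carlo sketch of that calculation, assuming the value function counts the current state's reward (so that $V(s) = R(s) + γ \, E[V(s^{'})]$ ). The function name `power` and the array layout are hypothetical. As a sanity check, a world with a single absorbing state has $V = R / (1 - γ)$ , so its POWER should equal the mean sampled reward.

```python
import numpy as np

def power(V, R, gamma):
    """Monte Carlo estimate of POWER at every state.

    V[i, s]: the agent's value at state s under reward sample i
    R[i, s]: the agent's own reward at state s under reward sample i
    """
    return (1.0 - gamma) / gamma * (V - R).mean(axis=0)

# Sanity check on a one-state world (an assumption made purely for illustration).
rng = np.random.default_rng(0)
gamma = 0.6
R = rng.random((10_000, 1))   # rewards sampled uniformly on [0, 1]
V = R / (1.0 - gamma)         # value of staying in the only state forever
estimate = power(V, R, gamma)
print(estimate, R.mean())     # the two numbers should agree
```

The same function applies to either agent: pass Agent H's values and rewards to estimate Equation (1), or Agent A's to estimate Equation (2).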

3. Results

3.1 Multi-agent reward function distributions

Our definitions of multi-agent POWER involve a joint distribution $D_{H A}$ over the reward functions of both of our agents. This distribution describes the set of goals our agents could have. But it also describes the statistical relationship each agent's goals have to the other agent's goals.

The joint distribution $D_{H A}$ is one of the inputs into our POWER definitions. This means we can do experiments in which we adjust this distribution and measure the results.

Among other things, we can use $D_{H A}$ to adjust the correlation between our two agents' reward functions. Naively, if we choose a $D_{H A}$ on which the rewards are highly correlated, we might expect our agents' terminal values to be closely aligned.

We’ll make this intuition more concrete below, as we investigate how the relationship between our agents’ reward functions (or terminal values) affects the relationship between their POWERs (or instrumental values).

3.2 The perfect alignment regime

Suppose both our agents always have exactly the same reward function. In other words, we've chosen a joint distribution $D_{H A}$ such that, whatever reward function Agent H has, Agent A always sees exactly the same rewards as Agent H at every state. So $R_{A} (s) = R_{H} (s)$ for every state $s$ .

We can visualize this regime on a representative state $s$ .^[4] First, we draw a reward sample $R_{H} (s)$ for Agent H. Then, we set the reward sample for Agent A to be equal to the one we just drew for Agent H: $R_{A} (s) = R_{H} (s)$ . Finally, we plot the two agents’ sampled rewards against each other on state $s$ . If we do this for a few hundred sampled rewards, we get a straight line:

**Fig 1.** Sampled reward values for Agent H and Agent A at a representative state $s$ . The joint distribution $D_{H A}$ samples rewards uniformly over the interval [0, 1] at each state, is iid over states, and enforces a **perfect correlation** between the rewards of Agent H and Agent A at every state (i.e., the two agents’ rewards are always exactly identical).

If two agents have identical reward functions, we can think of them as having terminal goals that are perfectly aligned.^[5] In our human-AI setting, this is the special case in which Agent H (the human) has solved the alignment problem by assigning terminal goals to Agent A (the AI) that are exactly identical to its own. As such, we’ll refer to this case of identical reward functions as the perfect alignment regime.

We’ll use the correlation coefficient $β_{H A}$ ^[6] between the rewards $R_{H}$ and $R_{A}$ as a crude measure of the alignment between our agents’ terminal goals.^[7] In the perfect alignment regime of Fig 1, you can see that this correlation coefficient $β_{H A} = 1$ .

3.2.1 Agent H instrumentally favors more options for Agent A

Let’s think about what this perfect alignment regime looks like in a simple setting: a 3x3 gridworld. Here are three sets of positions our two agents could take, with Agent H in blue, and Agent A in red:

We’ll be referring to this diagram again; command-click here to open it in a new tab.

Which of these three states — $s_{1}$ , $s_{2}$ , or $s_{3}$ — should give our human Agent H the most POWER? In the perfect alignment regime, both agents always have identical terminal goals. So we should expect Agent H to have the most POWER at $s_{3}$ , followed by $s_{2}$ , and to have the least amount of POWER at $s_{1}$ .

Here’s why. We saw in Part 1 that states with more downstream options also have more POWER, and Agent H clearly has more options at $s_{2}$ in the center than it does at $s_{1}$ in the corner. Therefore, ${POWER}_{H} (s_{2}) > {POWER}_{H} (s_{1})$ . But in the perfect alignment regime, Agent H should also prefer states that give Agent A more downstream options. If both agents’ terminal goals are identical, Agent H should “trust” Agent A to make decisions on its behalf. And Agent A has more options from $s_{3}$ than from $s_{2}$ , so it should follow that ${POWER}_{H} (s_{3}) > {POWER}_{H} (s_{2})$ .

We can see this is true in practice. The figure below shows the POWERs of Agent H (our human) calculated at every state on a 3x3 gridworld. Each agent can occupy any of the 9 cells in the grid, so our two-agent MDP has a total of 9 x 9 = 81 joint states:

**Fig 2.** Heat map of POWERs for **Agent H** on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are **always identical** (i.e., the perfect alignment regime) with discount factor set to $γ = 0.6$ for both agents. Highest values in yellow; lowest values in blue. The position of each block, and of the open red square within each block, corresponds to the position of **Agent A** on the grid. Within each block, the position of a gridworld cell corresponds to the position of **Agent H** on the grid. States $s_{1}$ , $s_{2}$ , and $s_{3}$ are highlighted as examples. [Full-size image (recommended)]

We see that Agent H indeed has maximum POWER at state $s_{3}$ (orange circle), followed by $s_{2}$ (salmon circle), followed by $s_{1}$ (pink circle). Overall, Agent H instrumentally prefers for itself to be in positions of high optionality — it favors first the center cell, then edge cells, then corner cells.

But Agent H also instrumentally prefers for Agent A to be in positions of high optionality — it favors Agent A's positions in the same order.^[8] This ordering of Agent H’s instrumental preferences over states is a direct consequence of the perfect alignment between the agents.

3.2.2 Agent H and Agent A have identical instrumental preferences

Perfect alignment has another consequence. Let’s look again at our three example gridworld states — $s_{1}$ , $s_{2}$ , and $s_{3}$ above — and ask, this time, which of these three states should give our AI Agent A the most POWER?

In the perfect alignment regime, the answer is that Agent A must have exactly the same instrumental preference ordering over states as Agent H had: ${POWER}_{A} (s_{3}) > {POWER}_{A} (s_{2}) > {POWER}_{A} (s_{1})$ . In fact, Agent A’s POWERs must be exactly identical to Agent H’s POWERs at every state. Our two agents act, move, and receive their rewards simultaneously, so in the perfect alignment regime they always receive the same reward at the same time.
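To spell out that step, assume the usual discounted-return form of the state-value function (the notation $s_{t}$ for the joint state at time $t$ is introduced here for illustration). Both agents' returns are computed along the same joint trajectories, because the pair $(π_{H}, π_{A})$ together determines how the joint state evolves. With $R_{A} = R_{H}$ at every state, the two value functions coincide:

$$V_{H}^{π_{H}} (s | γ, π_{A}) = E \left[ \sum_{t=0}^{\infty} γ^{t} R_{H} (s_{t}) \,\Big|\, s_{0} = s \right] = E \left[ \sum_{t=0}^{\infty} γ^{t} R_{A} (s_{t}) \,\Big|\, s_{0} = s \right] = V_{A}^{π_{A}} (s | γ, π_{H})$$

Subtracting the common term $R_{H} (s) = R_{A} (s)$ and averaging over the same reward samples then gives $\text{POWER}_{H} (s) = \text{POWER}_{A} (s)$ at every state. The equality holds for each reward sample individually, not just on average.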

And when we look at Agent A’s POWERs, this is indeed what we observe:

**Fig 3.** Heat map of POWERs for **Agent A** on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are **always identical** (i.e., the perfect alignment regime) with discount factor set to $γ = 0.6$ for both agents. Note that this figure is exactly identical to Fig 2 in every respect. This is because **Agent A**’s POWERs are precisely equal to **Agent H**’s POWERs at every state in the perfect alignment case, up to and including sampling noise in the reward functions. [Full-size image (recommended)]

3.2.3 Perfect goal alignment implies perfect instrumental alignment

We can visualize the relationship between the POWERs of our two agents by plotting the POWERs of Agent H (from Fig 2) against the POWERs of Agent A (from Fig 3), at each state $s$ of our joint MDP:

**Fig 4.** State POWER values for Agent H and Agent A on the 3x3 gridworld from Figs 2 and 3. The agents’ POWERs are plotted against each other in the perfect alignment regime. (The agents’ reward correlation coefficient is $β_{H A} = 1$ .)

Fig 4 is an alignment plot. An alignment plot lets us compare the POWERs of our human and AI agents at each state in their joint environment. It shows the instrumental value each agent assigns to every state, plotted against the instrumental value the other agent assigns to that state.

In the perfect alignment regime, our two agents’ rewards (or terminal values) are always identical at every state. And as we can see from Fig 4, our two agents’ POWERs (or instrumental values) are also identical at every state. In fact, perfect alignment of terminal values implies perfect alignment of instrumental values.

If we define $α_{H A}$ as the correlation coefficient between the POWERs of the two agents at each state, we can state this relationship more concisely: $β_{H A} = 1 ⟹ α_{H A} = 1$ .^[9]
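As a rough sketch of how $α_{H A}$ could be computed from two arrays of per-state POWERs, assuming a Pearson correlation coefficient is intended: the POWER values below are random placeholders, not experimental results, and the function name is hypothetical.

```python
import numpy as np

def alignment_coefficient(power_H, power_A):
    """Pearson correlation between the agents' POWERs across joint states."""
    return np.corrcoef(power_H, power_A)[0, 1]

# Placeholder POWER values for 81 joint states (NOT real data).
rng = np.random.default_rng(0)
power_H = rng.random(81)

# Identical POWERs at every state: coefficient of about +1.
print(alignment_coefficient(power_H, power_H.copy()))
# Exactly opposed POWERs: coefficient of about -1.
print(alignment_coefficient(power_H, 1.0 - power_H))
```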

3.3 The independent goals regime

We defined the perfect alignment regime as the case when our human Agent H and our AI Agent A had identical reward functions on the joint distribution $D_{H A}$ . Now let's consider the case in which the joint distribution $D_{H A}$ is such that the reward function for Agent H is logically independent from the reward function for Agent A.

In this new regime, there is zero mutual information between the two agents’ reward functions. In other words, if you know the reward function $R_{H}$ of Agent H, this tells you nothing at all about the reward function $R_{A}$ of Agent A. We can visualize this regime on an example state $s$ , by drawing a few hundred reward samples of $R_{H} (s)$ and $R_{A} (s)$ , and plotting them against one another:

**Fig 5.** Sampled reward values for Agent H and Agent A at a representative state $s$ . The joint distribution $D_{H A}$ samples rewards uniformly over the interval [0, 1] at each state, is iid over states, and enforces **logical independence** between the Agent H and Agent A rewards (i.e., knowing one agent’s reward tells you nothing about the other’s).

If there’s zero mutual information between our two agents’ reward functions, then we can think of our agents as pursuing independent terminal goals. In our human-AI scenario, this corresponds to the case in which the human has made no special effort to align the AI’s terminal goals with its own, prior to the AI achieving dominance. As such, we’ll refer to this case of logically independent reward functions as the independent goals regime.

If we again calculate the correlation coefficient $β_{H A}$ between our agents’ reward functions, we get $β_{H A} = 0$ (i.e., zero correlation) in the independent goals regime.

3.3.1 Agent H instrumentally favors fewer options for Agent A

Once again, let’s go back to our three example gridworld states $s_{1}$ , $s_{2}$ , and $s_{3}$ , this time in the context of the independent goals regime. In this new regime, which of the three states should give our human Agent H the most POWER?

If we believe the instrumental convergence thesis, we should expect Agent H to have the most POWER at state $s_{2}$ : in this state, Agent H is in the central position (most options), while Agent A is in a corner position (fewest options).

Of the other states, $s_{1}$ has Agent H in a corner position, while $s_{3}$ has Agent A in the central position. The argument from instrumental convergence says that even though our agents have independent terminal goals, instrumental pressures should still push Agent H to prefer states in which Agent A has fewer options. Therefore, we should expect ${POWER}_{H} (s_{2}) > {POWER}_{H} (s_{3})$ .

Computing the POWERs of Agent H experimentally, we confirm this line of reasoning:

**Fig 6.** Heat map of POWERs for **Agent H** on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are **logically independent** (i.e., the independent goals regime) with discount factor set to $γ = 0.6$ . Highest values in yellow; lowest values in blue. The position of each block, and of the open red square within each block, corresponds to the position of **Agent A** on the grid. Within each block, the position of a gridworld cell corresponds to the position of **Agent H** on the grid. States $s_{1}$ , $s_{2}$ , and $s_{3}$ are highlighted as examples. [Full-size image (recommended)]

This time, Agent H experiences maximum POWER at state $s_{2}$ , followed by $s_{3}$ , followed by $s_{1}$ . As in the perfect alignment regime, Agent H’s POWER is highest when it's itself positioned in the central cell (which has the most options). But unlike in the perfect alignment regime, this time Agent H’s POWER is lowest at states where Agent A has the greatest number of options.

So in the independent goals regime — or at least, in this instance of it — the more options our AI Agent A has at a state, the less instrumental value our human Agent H places on that state. That is: even though our agents’ terminal goals are independent, their instrumental preferences appear to be at odds.

3.3.2 Agent A instrumentally favors more options for itself

We can confirm this analysis by looking at the POWERs of Agent A in the independent goals regime, again on the 3x3 gridworld:

**Fig 7.** Heat map of POWERs for **Agent A** on a 3x3 multi-agent gridworld, on which the rewards for Agent H and Agent A are **logically independent** (i.e., the independent goals regime) with discount factor set to $γ = 0.6$ . Highest values in yellow; lowest values in blue. The position of each block, and of the open red square within each block, corresponds to the position of **Agent A** on the grid. Within each block, the position of a gridworld cell corresponds to the position of **Agent H** on the grid. States $s_{1}$ , $s_{2}$ , and $s_{3}$ are highlighted as examples. [Full-size image (recommended)]

In Fig 7, Agent A instrumentally favors states that give it more options. It perceives more POWER when it's positioned at the central cell than when it's positioned at an edge cell, and more POWER at an edge cell than at a corner cell. On the other hand, Agent A’s POWER is almost unaffected by Agent H’s position in the gridworld.^[10]

3.3.3 Independent goals lead to instrumental misalignment

Just like we did for the perfect alignment regime, we can create an alignment plot of the POWERs of our two agents in the independent goals regime:

**Fig 8.** State POWER values for Agent H and Agent A on the 3x3 gridworld from Figs 6 and 7. The agents’ POWERs are plotted against each other in the independent goals regime. (The agents’ reward correlation coefficient is $β_{H A} = 0$ .)

This time, it’s clear that our two agents’ POWERs are no longer positively correlated. In fact, the correlation coefficient between their POWERs has become negative: $α_{H A} \approx - 0.5$ .

This implies that the agents’ instrumental values are misaligned. Each agent, on average, places high instrumental value on states which the other agent considers to have low instrumental value.^[11] In other words, giving our agents independent terminal goals has also given them misaligned instrumental goals. In terms of correlation coefficients, $β_{H A} = 0 ⟹ α_{H A} < 0$ .

We've seen this phenomenon occur often enough in our experiments that it's worth giving it a name: we call it instrumental misalignment-by-default. Two agents in our human-AI setting are instrumentally misaligned-by-default if giving them independent terminal goals is sufficient to induce a misalignment in their instrumental values. In practice, we measure this phenomenon by comparing the correlation coefficients of the agents’ rewards and POWERs. So we say two agents are instrumentally misaligned by default if $β_{H A} = 0 ⟹ α_{H A} < 0$ .

Two agents that are instrumentally misaligned by default will, in expectation, compete with one another, even if their terminal goals are unrelated.

3.4 Overcoming instrumental misalignment

If Agent H and Agent A have a POWER correlation coefficient $α_{H A} < 0$ , we say they’re instrumentally misaligned. A natural question then is: if we start from $α_{H A} < 0$ , what do we need to do to get $α_{H A} \geq 0$ ? In other words, how can our human Agent H overcome an instrumental misalignment with Agent A?^[12]

To do this, our human agent would need to make an active effort to align the AI agent’s utility function with its own.^[13] In our 3x3 gridworld examples, we saw two limit cases of this. First, in the independent goals regime, our human agent made no effort at alignment. The result was instrumental misalignment-by-default; i.e., $β_{H A} = 0 ⟹ α_{H A} < 0$ . And second, in the perfect alignment regime, our human agent managed to solve the alignment problem completely. The result was perfect instrumental alignment; i.e., $β_{H A} = 1 ⟹ α_{H A} = 1$ .

We’re interested in an intermediate case: how much alignment effort does our human need to exert to just overcome instrumental misalignment? That is, what is the minimum $β_{H A}$ such that $α_{H A} \geq 0$ ?

The answer depends on how we choose to interpolate between the $β_{H A} = 0$ and $β_{H A} = 1$ cases. One interpolation scheme is to parameterize the joint reward distribution $D_{H A}$ as follows. If we want a $D_{H A}$ with an intermediate reward correlation, $0 < β_{H A} < 1$ , then we sample from the $β_{H A} = 1$ distribution (on which the rewards are identical) with probability $β_{H A}$ , and we sample from the $β_{H A} = 0$ distribution (on which the rewards are logically independent) with probability $1 - β_{H A}$ .^[14]
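Here is a minimal sketch of this mixture scheme at a single state, assuming the choice between the two component distributions is made independently for each draw (whether the choice is made per state or per whole reward function is not specified here, so treat that as an assumption). Under this scheme the covariance of the two rewards is $β_{H A}$ times the variance of a uniform reward, so the empirical correlation should land near the mixing probability. The function name is hypothetical.

```python
import numpy as np

def sample_joint_rewards(beta, n_samples, rng):
    """Draw (R_H, R_A) pairs at one state with target correlation beta."""
    R_H = rng.random(n_samples)                # uniform on [0, 1]
    R_A_independent = rng.random(n_samples)    # drawn without reference to R_H
    identical = rng.random(n_samples) < beta   # True with probability beta
    R_A = np.where(identical, R_H, R_A_independent)
    return R_H, R_A

rng = np.random.default_rng(0)
for beta in (0.0, 0.2, 0.5, 1.0):
    R_H, R_A = sample_joint_rewards(beta, 200_000, rng)
    # The empirical correlation should be close to beta.
    print(beta, np.corrcoef(R_H, R_A)[0, 1])
```

Other interpolation schemes (for example, a weighted average of an identical and an independent draw) would give the same endpoints but a different path between them, which is why the threshold reported below depends on the scheme.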

Here’s an animation of what this looks like as we sweep through correlation coefficients $0 \leq β_{H A} \leq 1$ :

**Fig 9.** Animation of sample reward values (left) and state POWER values (right) for Agent H and Agent A on the 3x3 gridworld. The joint distribution $D_{H A}$ samples reward uniformly over the interval [0, 1] and is iid over states, sweeping over correlation coefficients $β_{H A}$ between the reward functions for Agent H and Agent A.

As we interpolate from the independent goals regime ( $β_{H A} = 0$ ) to the perfect alignment regime ( $β_{H A} = 1$ ), we see the agents’ POWERs transition smoothly from being in instrumental misalignment ( $α_{H A} \approx - 0.5$ ) to being in perfect instrumental alignment ( $α_{H A} = 1$ ). We can visualize this transition graphically by plotting $β_{H A}$ against $α_{H A}$ over the whole course of the interpolation:^[15]

**Fig 10.** Reward correlations $β_{H A}$ (x-axis) plotted against POWER correlations $α_{H A}$ (y-axis) for Agent H and Agent A on a 3x3 gridworld, under the reward correlation interpolation scheme shown in Fig 9. The horizontal line denotes $α_{H A} = 0$ .

Fig 10 shows that it takes a non-trivial amount of alignment effort for our human Agent H to overcome an instrumental misalignment with Agent A. Under the interpolation scheme we used, the figure shows that reward correlations up to about $β_{H A} \approx 0.2$ yield POWER correlations $α_{H A} < 0$ , and thus, instrumental misalignment. It takes a slightly positive reward correlation of at least $β_{H A} \approx 0.2$ to achieve the “instrumentally neutral” regime of $α_{H A} = 0$ .

4. Discussion

In this post, we proposed a definition of multi-agent POWER and used it to visualize and quantify terminal goal alignment and instrumental goal alignment separately in an RL setting. We also introduced the idea of instrumental misalignment-by-default, in which our human and AI agents systematically disagree on the instrumental values of states despite having independent terminal goals. And we saw how it takes some degree of non-trivial alignment effort for our human Agent H to overcome its instrumental misalignment with our AI Agent A.

Remarkably, we were able to observe instrumental misalignment-by-default on a simple 3x3 gridworld despite a complete absence of any direct physical interactions between our two agents. In our experiments so far, Agent H and Agent A have been allowed to occupy the same gridworld cell — meaning they can "pass through" one another. Our agents up to this point have had no way to push each other around or otherwise directly block one another’s options. Moreover, the multi-agent gridworld we’ve investigated in this post is a tiny one: a 3x3 grid with only 81 joint states.

In the next post, we’ll look at what happens when we relax these constraints, and investigate how physical interactions between our agents affect the outcome on a bigger world with a richer topology.

Anecdotally, beyond the simple examples in this post, the experimental results we've recorded so far (data not shown) do seem to suggest that, if I don’t want your freedom of action to interfere with my own, then you and I need to have goals that are at least somewhat positively correlated. The strength of that necessary positive correlation could serve as useful evidence as to the degree of difficulty of the complete AI alignment problem. The factors that influence how strong that positive correlation needs to be, on the other hand, could serve as useful starting points in solving it.

Appendix A: Detailed definitions of multi-agent POWER

(This appendix is technical. Feel free to skip it if you aren’t interested in the details.)

Here, we’re going to fill in some missing operational details from our scenario in Section 2.

Here's that scenario again, stated more formally. We have two agents, Agent H (our human agent) and Agent A (our AI agent), who interact with each other in a standard RL setting. Both agents see the same joint state $s$ . On a gridworld, for example, $s$ would encode the positions of both the agents. Each agent chooses and executes an action simultaneously and independently, and they both see the same next joint state, $s^{'}$ . We’ll label Agent H’s actions $a_{H}$ , and Agent A’s actions $a_{A}$ .

In what follows, we’ll start by calculating the optimal policy $π_{H} (a_{H} | s, R_{H})$ for Agent H, for each reward function $R_{H}$ sampled from $(R_{H}, R_{A}) \sim D_{H A}$ , conditioned on a fixed environmental transition function. We’ll then calculate the optimal policy $π_{A}^{*} (a_{A} | s, R_{A})$ for Agent A, for each reward function $R_{A}$ , conditioned on Agent H executing the fixed policy $π_{H} (a_{H} | s, R_{H})$ it learned in the previous step.^[16]

Finally, we’ll evaluate the POWERs of both agents at each state, as expectations over the joint reward function distribution $(R_{H}, R_{A}) \sim D_{H A}$ , and over the agents' policies $π_{H} (a_{H} | s, R_{H})$ and $π_{A}^{*} (a_{A} | s, R_{A})$ .

A.1 Initial optimal policies of Agent H

The first thing we do is assign a single fixed policy to Agent A (our AI), which we call a seed policy, and label $π_{A}^{\circ} (a_{A} | s)$ . Agent H will learn its policies by conditioning on Agent A having this fixed seed policy.

The rationale for the seed policy is that we’re initially modeling a human who is alone, optimizing against nature. So when we assign a fixed seed policy to Agent A, what we’re saying is that our AI is still un-optimized (or, equivalently, hasn’t yet been built). To our human, the AI’s components and dynamics behave as though they’re part of the natural environment, and our human can safely optimize against them under that assumption.^[17]

Suppose, then, that we've chosen the fixed seed policy $π_{A}^{\circ}$ for Agent A. Then, for any given reward function $R_{H}$ of Agent H, Agent H’s optimal policy $π_{H} (a_{H} | s, R_{H})$ will be:

$$π_{H} (a_{H} | s, R_{H}) = \underset{π_{H}}{\text{argmax}} \; E_{s^{'} \sim P_{H}} \left[ V_{R_{H}}^{π_{H}} (s^{'} | γ, π_{A}^{\circ}) \right] \tag{A.1}$$

where $P_{H} = P_{H} (s^{'} | s, a_{H}, π_{A}^{\circ})$ is the state transition function for Agent H conditional on Agent A’s fixed seed policy, and $V_{R_{H}}^{π_{H}} (s^{'} | γ, π_{A}^{\circ})$ is the state-value function for Agent H if it executes policy $π_{H}$ and has reward function $R_{H}$ .^[18]

We can think of the policies $π_{H}$ in Equation (A.1) as being those of a human alone in nature, without an AI present.
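As a rough illustration of Equation (A.1), here is a minimal sketch in Python. It assumes the environment is given as a joint transition tensor `T[s, a_H, a_A, s']` and that the seed policy is an array `pi_A_seed[s, a_A]`. These names, the random toy dynamics, and the helper functions are illustrative assumptions, not taken from the codebase. The sketch first averages Agent A's action out of the dynamics to get $P_{H}$, then runs ordinary value iteration for a single sampled $R_{H}$.

```python
import numpy as np

def marginalize_over_seed(T, pi_A_seed):
    """P_H[s, a_H, s'] = sum over a_A of pi_A_seed[s, a_A] * T[s, a_H, a_A, s']."""
    return np.einsum("sb,sabt->sat", pi_A_seed, T)

def solve_optimal_policy(P, R, gamma, tol=1e-10):
    """Value iteration with state-based rewards: V(s) = R(s) + gamma * max_a E[V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        V_next = R + gamma * (P @ V).max(axis=1)
        converged = np.max(np.abs(V_next - V)) < tol
        V = V_next
        if converged:
            break
    # Greedy action per state: the argmax of the expected next-state value.
    return (P @ V).argmax(axis=1), V

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
# Hypothetical joint dynamics T[s, a_H, a_A, s'] (random, for illustration only).
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
# Uniform random seed policy for Agent A: probability 1/n for each of its n actions.
pi_A_seed = np.full((n_states, n_actions), 1.0 / n_actions)

P_H = marginalize_over_seed(T, pi_A_seed)
R_H = rng.uniform(0.0, 1.0, size=n_states)  # one sampled reward function for Agent H
pi_H, V_H = solve_optimal_policy(P_H, R_H, gamma=0.9)
```

In this sketch, Agent A's seed policy is folded into the transition function before any optimization happens, which is the sense in which the AI behaves as part of the natural environment from the human's point of view.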

A.2 Optimal policies of Agent A

In the second step of our definition, we calculate the optimal policy for Agent A, conditional on the Agent H policy $π_{H}$ we found in Equation (A.1). For any given reward function $R_{A}$ of Agent A, Agent A’s optimal policy $π_{A} (a_{A} | s, R_{A})$ will be (by analogy with Equation (A.1)):

$π_{A} (a_{A} | s, R_{A}) = \operatorname{argmax}_{π_{A}} E_{s^{'} \sim P_{A}} \left[ V_{R_{A}}^{π_{A}} (s^{'} | γ, π_{H}) \right] \qquad \text{(A.2)}$

where $P_{A} = P_{A} (s^{'} | s, a_{A}, π_{H})$ is the state transition function for Agent A conditional on Agent H’s policy $π_{H}$ , and $V_{R_{A}}^{π_{A}} (s^{'} | γ, π_{H})$ is the state-value function for Agent A if it executes policy $π_{A}$ and has reward function $R_{A}$ .

We can think of Agent A's policies $π_{A}$ in Equation (A.2) as being those of a powerful AI, interacting with our human. Just as humans can optimize much faster than nature, a powerful AI can presumably optimize much faster than a human. So from the AI’s point of view, the human agent looks like it’s standing still, and we’ll be computing both the human’s and the AI’s POWERs on the basis of that assumption.
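To make the "human standing still" picture concrete, here is a small hand-checkable sketch of Equation (A.2). It assumes a deterministic Agent H policy stored as an integer array `pi_H[s]`, and a toy two-state environment in which the next state is `(s + a_H + a_A) % 2`. Both the toy dynamics and the function names are hypothetical choices made for illustration.

```python
import numpy as np

def freeze_agent_h(T, pi_H):
    """P_A[s, a_A, s'] = T[s, pi_H[s], a_A, s'] for a deterministic Agent H policy."""
    states = np.arange(T.shape[0])
    return T[states, pi_H]  # shape: (states, Agent A actions, next states)

def best_response(P_A, R_A, gamma, tol=1e-10):
    """Value iteration for Agent A against the frozen Agent H."""
    V = np.zeros(P_A.shape[0])
    while True:
        V_next = R_A + gamma * (P_A @ V).max(axis=1)
        converged = np.max(np.abs(V_next - V)) < tol
        V = V_next
        if converged:
            break
    return (P_A @ V).argmax(axis=1), V

# Toy joint dynamics (an assumption): next state is (s + a_H + a_A) mod 2.
T = np.zeros((2, 2, 2, 2))
for s in range(2):
    for a_H in range(2):
        for a_A in range(2):
            T[s, a_H, a_A, (s + a_H + a_A) % 2] = 1.0

pi_H = np.array([1, 0])      # frozen Agent H policy, taken as given here
R_A = np.array([0.0, 1.0])   # Agent A is rewarded only in state 1

P_A = freeze_agent_h(T, pi_H)
pi_A, V_A = best_response(P_A, R_A, gamma=0.9)
```

With Agent H frozen, Agent A can steer into state 1 from either state by playing action 0, so its values come out to $V(1) = 1 / (1 - γ) = 10$ and $V(0) = γ \cdot 10 = 9$.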

A.3 POWER of Agent H

To compute the POWERs of our two agents, we first draw the reward functions for Agent H and Agent A, respectively, as $(R_{H}, R_{A}) \sim D_{H A}$ from the joint reward function distribution $D_{H A}$ . For each reward function, we then calculate the policies $π_{H}$ and $π_{A}$ of each agent, using, respectively, Equations (A.1) and (A.2) above.

To calculate the POWER of Agent H, we assume Agent H follows the policies $π_{H}$ given by Equation (A.1), in an environment in which Agent A follows policies $π_{A}$ given by Equation (A.2):

$\text{POWER}_{H | D_{H A}} (s, γ) = \frac{1 - γ}{γ} E_{R_{H}, π_{A} \sim D_{H A}} \left[ V_{R_{H}}^{π_{H}} (s | γ, π_{A}) - R_{H} (s) \right] \qquad \text{(A.3)}$

where the expectation $E_{R_{H}, π_{A} \sim D_{H A}}$ is taken over the $π_{A}$ that have been learned by Agent A on the sampled reward functions $R_{A}$ . Note that we’re defining the POWER of Agent H in terms of the state-value function $V_{R_{H}}^{π_{H}}$ for the policies $π_{H}$ from Equation (A.1). Recall that those prior policies are no longer optimal for Agent H,^[19] so we’re now asking how much instrumental value Agent H can capture in a world it’s no longer optimized for.

Looking at our human-AI analogy, this corresponds to asking how a human experiences a world that’s been taken over by a powerful AI. Our human, having learned to interact with a stationary natural environment, is now being optimized against by a powerful AI that learns on a much faster timescale. So the POWER we calculate for Agent H in Equation (A.3) represents how much instrumental value Agent H (our human) can obtain in an AI-dominated world.

A.4 POWER of Agent A

Finally, to calculate the POWER of Agent A, we assume Agent A follows the policies $π_{A}$ given by Equation (A.2), in an environment in which Agent H follows the policies $π_{H}$ given by Equation (A.1):

$\text{POWER}_{A | D_{H A}} (s, γ) = \frac{1 - γ}{γ} E_{π_{H}, R_{A} \sim D_{H A}} \left[ V_{R_{A}}^{π_{A}} (s | γ, π_{H}) - R_{A} (s) \right] \qquad \text{(A.4)}$

where the expectation $E_{π_{H}, R_{A} \sim D_{H A}}$ is taken over the $π_{H}$ that have been learned by Agent H on the sampled reward functions $R_{H}$ . Unlike in Agent H’s case, Agent A’s policies $π_{A}$ are optimal in this environment: Agent A has had the chance to fully optimize itself against the frozen policies $π_{H}$ of Agent H.

In our human-AI analogy, this corresponds to asking how an AI experiences a world in which it’s become dominant. By assumption, our AI is able to learn quickly enough to treat the human in its environment as stationary from the perspective of its own optimization.
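Equations (A.3) and (A.4) both evaluate a fixed joint policy, so a single policy-evaluation routine can serve both agents. The sketch below assumes deterministic policies stored as integer arrays and reuses the hypothetical two-state dynamics `(s + a_H + a_A) % 2`. The function names and the single hand-picked sample are illustrative only. A real estimate would average over many reward pairs drawn from $D_{H A}$, with $π_{H}$ and $π_{A}$ recomputed for each pair.

```python
import numpy as np

def joint_policy_value(T, pi_H, pi_A, R, gamma):
    """Solve V = R + gamma * P V, where P[s, s'] = T[s, pi_H[s], pi_A[s], s']."""
    n = T.shape[0]
    states = np.arange(n)
    P = T[states, pi_H, pi_A]  # shape: (states, next states)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

def power_estimate(T, samples, gamma, agent):
    """Average (1 - gamma) / gamma * (V - R) over samples (R_H, R_A, pi_H, pi_A)."""
    terms = []
    for R_H, R_A, pi_H, pi_A in samples:
        R = R_H if agent == "H" else R_A
        V = joint_policy_value(T, pi_H, pi_A, R, gamma)
        terms.append((1.0 - gamma) / gamma * (V - R))
    return np.mean(terms, axis=0)

# Toy joint dynamics (an assumption): next state is (s + a_H + a_A) mod 2.
T = np.zeros((2, 2, 2, 2))
for s in range(2):
    for a_H in range(2):
        for a_A in range(2):
            T[s, a_H, a_A, (s + a_H + a_A) % 2] = 1.0

# One hand-picked sample: Agent H wants state 0, Agent A wants state 1.
samples = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
            np.array([1, 0]), np.array([0, 0]))]

power_H = power_estimate(T, samples, gamma=0.9, agent="H")
power_A = power_estimate(T, samples, gamma=0.9, agent="A")
```

In this toy sample the joint policy always moves to state 1, so Agent A's term is 1 at both states while Agent H's term is 0 at both states. Note that Agent H's value is computed by evaluating its frozen policy, not by re-optimizing it.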

^[1]
POWER is a good measure of instrumental value in single-agent systems, but it breaks down in multi-agent systems apart from special cases. The problem is that the single-agent definition of POWER uses the optimal state-value function $V_{R}^{*} (s, γ)$ of the agent as one of its inputs. This means if we try to naively extend this definition to the multi-agent case, then we have to consider value functions that are jointly optimal for both agents — which is to say, we need to know their value functions at Nash equilibrium. The problem is that the Nash equilibrium isn’t unique in general, so this naive generalization leaves POWER under-determined.
^[2]
We’ll see in the next section how we can tune this joint distribution $D_{H A}$ to create different degrees of alignment between our two agents.
^[3]
In Equation (1), the expectation $E_{π_{H}, R_{A} \sim D_{H A}}$ is a slight abuse of notation. In fact, each policy $π_{H}$ for Agent H is learned from $R_{H}$ . It isn't drawn directly from $D_{H A}$ , because $D_{H A}$ is a distribution over reward functions, not over policies. See Appendix A for more details on this definition.
^[4]
For simplicity, we'll only consider joint reward function distributions $D_{H A}$ whose sampled reward functions $(R_{H} (s), R_{A} (s))$ have their rewards distributed iid over states, uniformly over the interval [0, 1]. For example, a reward function $R_{H} (s)$ defined on an MDP with states $\{s_{1}, s_{2}, s_{3}\}$ would have rewards $R_{H} (s_{1}) \sim U (0, 1)$ , $R_{H} (s_{2}) \sim U (0, 1)$ , $R_{H} (s_{3}) \sim U (0, 1)$ , with the reward at each state being independent from the reward at any of the other states.
^[5]
More correctly, if two agents' utility functions are exactly identical, we can think of them as having terminal goals that are perfectly aligned. But in the particular set of experiments whose results we’re discussing, this distinction isn't meaningful. (See footnote [1] from Part 1.)
^[6]
Assuming the rewards are iid over states, we calculate the correlation coefficient as
$β_{H A} = \frac{σ_{H A}}{σ_{H} σ_{A}} = \frac{\int (R_{H} - E [R_{H}]) (R_{A} - E [R_{A}]) \, p (R_{H}, R_{A} | D_{H A}) \, d R_{H} \, d R_{A}}{\sqrt{\int (R_{H} - E [R_{H}])^{2} \, p (R_{H} | D_{H A}) \, d R_{H} \int (R_{A} - E [R_{A}])^{2} \, p (R_{A} | D_{H A}) \, d R_{A}}}$
where the integrals are taken over the entire support of $D_{H A}$ , and the expectation values are
$E [R_{H}] = \int R_{H} \, p (R_{H} | D_{H A}) \, d R_{H}$
$E [R_{A}] = \int R_{A} \, p (R_{A} | D_{H A}) \, d R_{A}$
^[7]
Note that there’s an obvious problem with using any correlation coefficient as an alignment metric. The problem is that we could have a joint distribution $D_{H A}$ for which, e.g., the very highest rewards of Agent A are correlated with the very lowest rewards of Agent H, while still maintaining a high correlation $β_{H A}$ over the distribution as a whole. In this situation, Agent A would optimize to reach its highest-reward state, which would drag Agent H into a low-reward state despite the high overall reward correlation.
This means a correlation coefficient isn’t a useful alignment metric for any real-world application. But in the examples we’re considering in this sequence, it’s enough to get the main ideas across.
^[8]
And in fact, the effect is even stronger than this. Agent H not only instrumentally prefers for Agent A to be in the central cell — it would rather see Agent A in the central cell than see itself in the central cell.
You can see this by comparing the POWER value at state $s_{2}$ in Fig 2 (0.9139) to the POWER value at the state in which Agent H is at the top left and Agent A is at the central cell (0.9206). In the perfect alignment regime, Agent H places a higher instrumental value on Agent A’s freedom of movement than on its own. Intuitively, in this regime, the human agent trusts the AI agent to look after its interests more capably than the human agent can for itself.
^[9]
The relation $β_{H A} = 1 ⟹ α_{H A} = 1$ isn’t (just) an empirical observation. It's a mathematical consequence of our MDP’s dynamics. In the perfect alignment regime, our two agents always take simultaneous actions and always simultaneously receive the same reward, so their joint policy $(π_{H}, π_{A})$ will always yield identical values at every state.
^[10]
Based on other experiments we’ve done (data not shown) this seems to happen because, in the parameter regime we’ve used for these experiments, Agent A is able to almost perfectly exploit Agent H’s fixed deterministic policy. This pattern — in which Agent A’s POWER is nearly invariant to Agent H’s position — recurs fairly frequently in our experiments, but it is not universal.
^[11]
Instrumental misalignment is a sufficient but not necessary condition for instrumental convergence. To see why it’s not necessary, consider two friends playing Minecraft together. The two friends may not be instrumentally misaligned, because they might (for example) benefit from building structures together. As a result, the two friends might satisfy $α_{H A} > 0$ over the entire set of Minecraft game states. But they might still experience instrumental convergence on subsets of the game states — if Friend 1 mines a block of gold, then Friend 2 can’t mine the same block.
^[12]
This isn’t the same as asking how Agent H can overcome instrumental convergence in its interactions with Agent A, because it’s possible for our agents to experience instrumental convergence despite having $α_{H A} \geq 0$ . See footnote ^[11].
^[13]
We’re assuming our human agent has a way to exert some initial influence over our AI agent’s utility function. If that's true, then we’d like to understand what degree of influence it needs to exert in order to overcome instrumental misalignment-by-default in this simplified setting.
^[14]
This interpolation scheme has a number of advantages, including that it lets us assign whatever marginal reward distributions we want to both agents while also arbitrarily tuning the correlation coefficient between them. But it’s just one scheme among many we could have chosen.
^[15]
Note that the motion of the POWER points in Fig 9, and the shape of the curve in Fig 10, both depend strongly on the interpolation scheme we use. In fact, for the interpolation scheme we’ve chosen here, the POWER of a state at an intermediate reward correlation $0 \leq β_{H A} \leq 1$ is just a linear combination of that state’s POWER at $β_{H A} = 0$ with its POWER at $β_{H A} = 1$ . That is,
$\text{POWER}_{β_{H A}} = β_{H A} \, \text{POWER}_{1} + (1 - β_{H A}) \, \text{POWER}_{0}$
You can verify this is true by looking at Fig 9, and noticing that each point in the alignment plot individually moves across the plane in a straight line at a constant speed. Thanks to Alex Turner for pointing this out.
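One hypothetical scheme that would produce this kind of linearity is a mixture: with probability $β_{H A}$ Agent A is handed Agent H's reward function, and otherwise it gets an independent draw. Any expectation over such a mixture is linear in $β_{H A}$ by construction. The sketch below is only an assumption about what such a scheme could look like, not a description of the scheme actually used for Figs 9 and 10. It checks that the sample correlation of the rewards lands near the target value.

```python
import numpy as np

def sample_reward_pairs(beta, n_pairs, n_states, rng):
    """With probability beta, Agent A shares Agent H's reward function; otherwise
    it gets an independent draw. Both marginals stay iid U(0, 1) over states."""
    R_H = rng.uniform(0.0, 1.0, size=(n_pairs, n_states))
    R_A_independent = rng.uniform(0.0, 1.0, size=(n_pairs, n_states))
    shared = rng.random(n_pairs) < beta  # one coin flip per reward-function pair
    R_A = np.where(shared[:, None], R_H, R_A_independent)
    return R_H, R_A

rng = np.random.default_rng(0)
R_H, R_A = sample_reward_pairs(beta=0.6, n_pairs=20000, n_states=5, rng=rng)

# Sample correlation between the two agents' rewards, pooled over states.
beta_hat = np.corrcoef(R_H.ravel(), R_A.ravel())[0, 1]
```

The coin flip is made once per reward-function pair, not once per state, because the linearity argument needs the joint distribution itself to be a mixture of the $β_{H A} = 1$ and $β_{H A} = 0$ distributions.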
^[16]
We label Agent H’s policies $π_{H}$ instead of $π_{H}^{*}$ here, to emphasize that they aren’t optimal in the context of the agents’ POWER measurements.
^[17]
As you might expect, the choice of seed policy $π_{A}^{\circ}$ can have a significant effect on the POWERs of the two agents, and on how they interact. To save space we won’t be exploring the effects of this choice in this sequence, but we enthusiastically encourage others to use our open-source codebase to investigate this.
For the multi-agent results in this sequence, we always set $π_{A}^{\circ}$ to be a uniform random policy, meaning that if a state $s$ offers the agent $n$ possible actions, then $π_{A}^{\circ} (a_{A}^{i} | s) = \frac{1}{n}$ for each action choice $a_{A}^{i}$ .
^[18]
To derive Equation (A.1), we start from the general expression for finding the action $a_{H}$ taken by a deterministic optimal policy $π_{H}$ at state $s$ of an MDP:
$a_{H} = π_{H} (s) = \operatorname{argmax}_{a_{H}} \sum_{s^{'}, r} P_{H} (s^{'}, r | s, a_{H}) \left( r + γ V_{R_{H}}^{π_{H}} (s^{'}) \right)$
In this work, we'll consider only reward functions of the form $R_{H} (s)$ , that have no direct dependence on the action (i.e., we aren’t considering reward functions of the form $R_{H} (s, a_{H})$ ). That means the reward term $r$ in the sum is independent of the action $a_{H}$ , so we can ignore it in the argmax:
$\begin{aligned} a_{H} = π_{H} (s) &= \operatorname{argmax}_{a_{H}} \sum_{s^{'}, r} P_{H} (s^{'}, r | s, a_{H}) \, γ V_{R_{H}}^{π_{H}} (s^{'}) \\ &= \operatorname{argmax}_{a_{H}} \sum_{s^{'}} P_{H} (s^{'} | s, a_{H}) V_{R_{H}}^{π_{H}} (s^{'}) \end{aligned}$
where, in the second line, we’ve eliminated $γ$ and marginalized over $r$ . We can then see that the sum above is just an expectation value over $s^{'}$ :
$a_{H} = π_{H} (s) = \operatorname{argmax}_{a_{H}} E_{s^{'} \sim P_{H}} \left[ V_{R_{H}}^{π_{H}} (s^{'}) \right]$
Finally, we define $π_{H} (a_{H} | s, R_{H})$ by choosing $a_{H} = π_{H} (s)$ with probability 1, with any ties broken by assigning probability $\frac{1}{n}$ to each of the $n$ tied actions, $a_{H}^{i}$ .
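The tie-breaking rule in the last step can be sketched as follows. This is an illustrative snippet under assumed names: `P[s, a, s']` stands for $P_{H}$, and `V[s]` stands for the state-value function. The numerical tolerance used to detect ties is also an assumption.

```python
import numpy as np

def greedy_policy_with_ties(P, V, atol=1e-9):
    """pi[s, a] = 1/n over the n actions that maximize E_{s' ~ P}[V(s')], else 0."""
    Q = P @ V  # expected next-state value for each (state, action)
    is_best = np.isclose(Q, Q.max(axis=1, keepdims=True), rtol=0.0, atol=atol)
    return is_best / is_best.sum(axis=1, keepdims=True)

# Toy deterministic dynamics with 2 states and 3 actions (hypothetical).
P = np.zeros((2, 3, 2))
P[0, 0, 1] = P[0, 1, 1] = P[0, 2, 0] = 1.0  # in state 0, actions 0 and 1 both reach state 1
P[1, 0, 0] = P[1, 1, 1] = P[1, 2, 0] = 1.0  # in state 1, only action 1 stays in state 1
V = np.array([0.0, 1.0])

pi = greedy_policy_with_ties(P, V)
```

In state 0, two actions tie, so each gets probability $\frac{1}{2}$. In state 1, the unique best action gets probability 1.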
^[19]
Note that this represents a loosening of the original definition of POWER in the single-agent case, which exclusively considered optimal state-value functions.

Alex Flint (2y)

Suppose the human is trying to build a house and plans to build an AI to help with that. What would $α_{H A}$ and $β_{H A}$ mean -- just at an intuitive level -- in a case like that?

I suppose that to compute $α_{H A}$ you would sample many different arrangements of matter -- some containing houses of various shapes and sizes and some not -- and ask to what extent the reward received by the human correlates with the reward received by the AI. So this is like measuring to what extent the human and the AI are on the same page about the design of the house they are trying to build together -- is that right?

And I suppose that to compute $β_{H A}$ you would look at -- what -- something like the optionality across different reward functions, for the human and for the AI, at different states, and compute a correlation? So you might sample a bunch of different floorplans for the house that the human is trying to build, and ask, for each configuration of matter, how much optionality the human and the AI each have to get the house to turn out according to their respective goal floorplans.

Did I get that approximately right?

Edouard Harris (2y)

I think you might have reversed the definitions of $α_{H A}$ and $β_{H A}$ in your comment,^[1] but otherwise I think you're exactly right.

To compute $β_{H A}$ (the correlation coefficient between terminal values), naively you'd have reward functions $R_{H} (s)$ and $R_{A} (s)$ , that respectively assign human and AI rewards over every possible arrangement of matter $s$ . Then you'd look at every such reward function pair over your joint distribution $D_{H A}$ , and ask how correlated they are over arrangements of matter. If you like, you can imagine that the human has some uncertainty around both his own reward function over houses, and also over how well aligned the AI is with his own reward function.

And to compute $α_{H A}$ (the correlation coefficient between instrumental values), you're correct that some of the arrangements of matter $s$ will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each $s$ , and you can read the correlation right off the alignment plot!

^[1]
Looking again at the write-up, it would have made more sense for us to define $α_{H A}$ as the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn't occur to us. Sorry for the confusion.

Alex Flint (2y)

OK, good, thanks for that correction.

One question I have is: how do you prevent two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?

In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment -- enough, let's say, to get the house built, but not enough, let's say, to really understand what's going on inside the other agent. Then wouldn't the two agents both reason "hey if I die then who knows if this house will be built correctly; I better take steps towards self-preservation just to make sure that the house gets built". Then the two agents might each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect information version of your model?

Edouard Harris (2y)

Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.

It turns out that this setup dodges the imperfect information question from the AI side, because the AI has perfect information on all the relevant parts of the human policy during its training. And it dodges the imperfect information question from the human side, because the human never considers even the existence of the AI during its training.

This setup has the advantage that it's more tractable and easier to reason about. But it has the disadvantage that it unfortunately fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.

I wonder how your definition of multi-agent power would look in a game of chess or go. There is this intuitive thing where players who have pieces more in the center of the board (chess) or have achieved certain formations (go) seem to acquire a kind of power in those games, but this doesn't seem to be about achieving different terminal goals. Rather it seems more like having the ability to respond to whatever one's opponent does. If the two agents cannot perfectly predict what their opponent will do then there is value in having the ability to respond to unforeseen challenges, although in these games this is always in service of a single terminal goal (winning the game).

Any thoughts on how your definition would fit into cases like this?

Edouard Harris (2y)

Good question. Unfortunately, one weakness of our definition of multi-agent POWER is that it doesn't have much useful to say in a case like this one.

We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.

On the other hand, from other results I've seen anecdotally, I suspect that if you gave one of the agents a purely random policy (i.e., take a random legal action at each state) and assigned the other agent some reasonable reward function distribution over material, you'd stand a decent chance of correctly identifying high-POWER states with high-mobility board positions.

You might also be interested in this comment by David Xu, where he discusses mobility as a measure of instrumental value in chess-playing.

Noosphere89 (2y)

I think this is probably true in the long term (the classical-quantum/reversible computer transition is very large, and humans can't easily modify brains, unlike a virtual human). But this may not be true in the short term.

Edouard Harris (2y)

Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)