Generalizing Power to multi-agent games

by Jacob Stavrianos and Alex Turner, 22nd Mar 2021


Acknowledgements:

This article is a writeup of a research project conducted through the SERI program under the mentorship of Alex Turner. I (Jacob Stavrianos) would like to thank Alex for turning a messy collection of ideas into legitimate research, as well as the wonderful researchers at SERI for guiding the project and putting me in touch with the broader X-risk community.

Motivation/Overview

In the single-agent setting, Seeking Power is Often Robustly Instrumental in MDPs showed that optimal policies tend to choose actions which pursue "power" (reasonably formalized). In the multi-agent setting, the Catastrophic Convergence Conjecture presented intuitions that "most agents" will "fight over resources" when they get "sufficiently advanced." However, it wasn't clear how to formalize that intuition.

This post synthesizes single-agent power dynamics (which we believe are now somewhat well-understood in the MDP setting) with the multi-agent setting. The multi-agent setting is important for AI alignment, since we want to reason clearly about when AI agents disempower humans. Assuming constant-sum games (i.e. maximal misalignment between agents), this post presents a result which echoes the intuitions in the Catastrophic Convergence Conjecture post: as agents become "more advanced", "power" becomes increasingly scarce and constant-sum.

An illustrative example

You're working on a project with a team of your peers. In particular, your actions affect the final deliverable, but so do those of your teammates. Say that each member of the team (including you) has some goal for the deliverable, which we can express as a reward function over the set of outcomes. How well (in terms of your reward function) can you expect to do?

It depends on your teammates' actions. Let's first ask "given the other players' actions, what's the highest expected reward I can attain?"

Case 1: Everyone plays nice

We can start by imagining the case where everyone does exactly what you'd want them to do. Mathematically, this allows you to obtain the globally maximal reward, or "the best possible reward assuming you can choose everyone else's actions". Intuitively, this looks like your team sitting you down for a meeting, asking what you want them to do for the project, and carrying out orders without fail. As expected, this case is "the best you can hope for" in a formal sense.

Case 2: Everyone plays mean

Now, imagine the case where everyone does exactly what you don't want them to do. Mathematically, this is the worst possible case; every other choice of teammates' actions is at least as good as this one. Intuitively, this case is pretty terrible for you. Imagine the previous case, but instead of following orders your team actively sabotages them. Alternatively, imagine that your team spends the meeting breaking your knees and your laptop.

Case 3: Somewhere in between

However, scenarios where your team is perfectly aligned either with or against you are rare. More typically, we model people as maximizing their own reward, with imperfect correlation between reward functions. Interpreting our example as a multi-player game, we can consider the case where the players' strategies form a Nash equilibrium: every person's action is optimal for themselves given the actions of the rest of their team. This case is both relatively general and structured enough to make claims about; we will use it as a guiding example for the formalism below.

Power, and why it matters

Many attempts have been made to classify the robustly instrumental goals of AI systems, with the aims of understanding why they emerge given seemingly-unrelated utilities and, ultimately, of counterbalancing (either implicitly or explicitly) undesirable robustly instrumental subgoals. One promising such attempt is based on Power (we capitalize the term to distinguish it from normal use of the word): consider an agent with some space of actions, which receives rewards depending on the chosen actions (formally, an agent in an MDP). Then, Power is roughly "ability to achieve a wide variety of goals". It's been shown that Power is robustly instrumental given certain conditions on the environment, but currently no formalism exists describing the Power of different agents interacting with each other.

Since we'll be working with Power for the rest of this post, we need a solid definition to build off of. We present a simplified version of the original definition:

Consider a scenario in which an agent has a set of actions $A$ and a distribution $\mathcal{D}$ of reward functions $R: A \to \mathbb{R}$. Then, we define the Power of that agent as

$$\text{POWER}(\mathcal{D}) := \mathbb{E}_{R \sim \mathcal{D}}\left[\max_{a \in A} R(a)\right].$$

As an example, we can rewrite the project example from earlier in terms of Power. Let your goal for the project be chosen from some distribution $\mathcal{D}$ (maybe you want it done nicely, or fast, or to feature some cool thing that you did, etc). Then, your Power is the maximum extent to which you can accomplish that goal, in expectation.
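To make the single-agent definition concrete, here is a minimal sketch that computes Power for a finite set of actions and a finite distribution over goals. The action names, goal probabilities, and reward values are all made up for illustration; only the "expected value of the best achievable reward" structure comes from the definition above.

```python
from typing import Dict, List, Tuple

RewardFn = Dict[str, float]  # maps each action to the reward it yields


def power(actions: List[str], reward_dist: List[Tuple[float, RewardFn]]) -> float:
    """Expected value, over goals, of the best reward the available actions can reach."""
    return sum(prob * max(r[a] for a in actions) for prob, r in reward_dist)


# Hypothetical goals for the project, each paired with its probability.
reward_dist = [
    (0.5, {"polish": 1.0, "rush": 0.2, "showcase": 0.4}),  # "done nicely"
    (0.3, {"polish": 0.1, "rush": 0.9, "showcase": 0.3}),  # "done fast"
    (0.2, {"polish": 0.0, "rush": 0.5, "showcase": 0.8}),  # "feature my cool thing"
]

full_power = power(["polish", "rush", "showcase"], reward_dist)
restricted_power = power(["rush"], reward_dist)
print(f"Power with all three actions: {full_power:.2f}")
print(f"Power with only 'rush':       {restricted_power:.2f}")
```

With these made-up numbers, removing options lowers Power (roughly 0.93 versus 0.47), which matches the reading of Power as "ability to achieve a wide variety of goals".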

However, this model of Power can't account for the actions of other agents in the environment (what about what your teammates do? Didn't we already show that it matters a lot?). To say more about the example, we'll need a generalization of Power.

Multi-agent Power

We now consider a more realistic scenario: not only are you an agent with a notion of reward and Power, but so is everyone else, all playing the same multiplayer game. We can even revisit the project example and go through the cases for your teammates' actions in terms of Power:

  • In Case 1, your team works to maximize your reward in every case, which (with some assumptions) maximizes your Power over the space of all choices of teammate actions.
  • In Case 2, your team works to minimize your reward in every case, which analogously minimizes your Power.
  • In Case 3, we have a Nash equilibrium of the game used to define multi-agent Power. In particular, each player's action is a best response to the actions of every other player. We'll see a parallel between this best-response property and the $\max$ term in the definition of Power pop up in the discussion of constant-sum games.

Bayesian games

To extend our formal definition of Power to the multi-agent case, we'll need to define a type of multiplayer normal-form game called a Bayesian game. We describe them below:

  • At the beginning of the game, each of $n$ players is assigned a type $t_i \in T_i$ from a joint type distribution $\tau$. The distribution $\tau$ is common knowledge.
  • The players then (independently, not sequentially) choose actions $a_i \in A_i$, resulting in an action profile $\mathbf{a} = (a_1, \ldots, a_n)$.
  • Player $i$ then receives reward $r_i(t_i, \mathbf{a})$ (crucially, a player's reward can depend on their type).

Strategies (technically, mixed strategies) in a Bayesian game are given by functions $\sigma_i: T_i \to \Delta(A_i)$. Thus, even given a fixed strategy profile $\sigma = (\sigma_1, \ldots, \sigma_n)$, any notion of "expected reward of an action" will have to account for uncertainty in other players' types. We do so by defining interim expected utility for player $i$ as follows (as is standard in game theory, the subscript $-i$ means "everyone except player $i$"):

$$f_i(t_i, a_i, \sigma_{-i}) := \mathbb{E}\left[r_i(t_i, \mathbf{a})\right],$$

where the expectation is taken over the following:

  • the posterior distribution over opponents' types $t_{-i} \sim \tau(\cdot \mid t_i)$ - in other words, what types you expect other players to have, given your type.
  • random choice of opponents' actions $a_{-i} \sim \sigma_{-i}(t_{-i})$ - even if you know someone's type, they might implement a mixed strategy which stochastically selects actions.

Further, we can define a (Bayesian) Nash Equilibrium to be a strategy profile where each player's strategy is a best response to opponents' strategies in terms of interim expected utility.
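As an illustration of interim expected utility and the best-response condition, here is a small sketch for a two-player Bayesian game. The "hawk/dove" contest, its type distribution, and its reward numbers are hypothetical and chosen only so the arithmetic is easy to follow; the function names are likewise assumptions rather than notation from the post.

```python
# Hypothetical two-player contest; every name and number below is an assumption.
TYPES = ["hawk", "dove"]      # both players share the same type set
ACTIONS = ["fight", "yield"]  # and the same action set

# Joint type distribution over profiles (t_0, t_1); the two types are correlated.
TAU = {
    ("hawk", "hawk"): 0.1, ("hawk", "dove"): 0.4,
    ("dove", "hawk"): 0.4, ("dove", "dove"): 0.1,
}
FIGHT_COST = {"hawk": 0.2, "dove": 0.8}


def reward(t_i, a_i, a_j):
    """Reward of a player with type t_i; the game is symmetric, so one function suffices."""
    if a_i == "fight" and a_j == "yield":
        return 1.0
    if a_i == "fight" and a_j == "fight":
        return 0.5 - FIGHT_COST[t_i]
    if a_i == "yield" and a_j == "yield":
        return 0.5
    return 0.0  # we yield while the opponent fights


def interim_utility(i, t_i, a_i, sigma):
    """Expected reward of a_i for player i of type t_i, given the opponent's strategy."""
    j = 1 - i

    def joint(t_j):  # probability that i has type t_i and j has type t_j
        return TAU[(t_i, t_j)] if i == 0 else TAU[(t_j, t_i)]

    norm = sum(joint(t_j) for t_j in TYPES)
    total = 0.0
    for t_j in TYPES:
        posterior = joint(t_j) / norm  # opponent's type, conditioned on our own type
        for a_j, prob in sigma[j][t_j].items():
            total += posterior * prob * reward(t_i, a_i, a_j)
    return total


def is_bayes_nash(sigma, tol=1e-9):
    """True if every action played with positive probability maximizes interim utility."""
    for i in (0, 1):
        for t_i in TYPES:
            values = {a: interim_utility(i, t_i, a, sigma) for a in ACTIONS}
            best = max(values.values())
            if any(p > 0 and values[a] < best - tol for a, p in sigma[i][t_i].items()):
                return False
    return True


by_type = {"hawk": {"fight": 1.0, "yield": 0.0}, "dove": {"fight": 0.0, "yield": 1.0}}
always_fight = {t: {"fight": 1.0, "yield": 0.0} for t in TYPES}

print(interim_utility(0, "hawk", "fight", [by_type, by_type]))
print(is_bayes_nash([by_type, by_type]))
print(is_bayes_nash([always_fight, always_fight]))
```

In this toy game, "hawks fight, doves yield" passes the equilibrium check, while "everyone always fights" fails it because a dove would rather yield.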

Formal definition of multi-agent Power

We can now define Power in terms of a Bayesian game:

Fix a strategy profile $\sigma$. We define player $i$'s Power as

$$\text{POWER}(i, \sigma) := \mathbb{E}_{t_i}\left[\max_{a_i \in A_i} f_i(t_i, a_i, \sigma_{-i})\right].$$

Intuitively, Power is the maximum (expected) reward given a distribution of possible goals. The difference from the single-agent case is that your reward is now influenced by other players' actions (by taking an expectation over opponents' strategy).
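The following sketch computes multi-agent Power for a finite Bayesian game and revisits Cases 1 and 2 of the project example. The game encoding (a dictionary with `types`, `actions`, `tau`, and `reward` entries), the "quality/speed" goals, and the assumption that your teammate perfectly observes your goal are all illustrative choices, not part of the formalism.

```python
from itertools import product


def interim_utility(game, i, t_i, a_i, sigma):
    """Expected reward of a_i for player i of type t_i against the others' strategies."""
    n = len(game["types"])
    others = [j for j in range(n) if j != i]
    total = norm = 0.0
    for t in product(*game["types"]):
        p_t = game["tau"].get(t, 0.0)
        if t[i] != t_i or p_t == 0.0:
            continue
        norm += p_t
        for a_others in product(*(game["actions"][j] for j in others)):
            a = [a_i] * n  # placeholder list; the opponents' slots are overwritten below
            prob = 1.0
            for j, a_j in zip(others, a_others):
                a[j] = a_j
                prob *= sigma[j][t[j]][a_j]
            total += p_t * prob * game["reward"](i, t_i, tuple(a))
    return total / norm


def power(game, i, sigma):
    """Expectation over player i's type of the best interim expected utility."""
    result = 0.0
    for t_i in game["types"][i]:
        marginal = sum(p for t, p in game["tau"].items() if t[i] == t_i)
        if marginal > 0:
            best = max(interim_utility(game, i, t_i, a, sigma) for a in game["actions"][i])
            result += marginal * best
    return result


def reward(i, t_i, a):
    """Hypothetical rewards: player 0 wants as many people as possible on their goal."""
    if i != 0:
        return 0.0  # the teammate's own reward does not enter player 0's Power
    target = "polish" if t_i == "quality" else "ship"
    return sum(x == target for x in a) / len(a)


game = {
    "types": [["quality", "speed"], ["sees_quality", "sees_speed"]],
    "actions": [["polish", "ship"], ["polish", "ship"]],
    "tau": {("quality", "sees_quality"): 0.5, ("speed", "sees_speed"): 0.5},
    "reward": reward,
}
nice = {"sees_quality": {"polish": 1.0, "ship": 0.0}, "sees_speed": {"polish": 0.0, "ship": 1.0}}
mean = {"sees_quality": {"polish": 0.0, "ship": 1.0}, "sees_speed": {"polish": 1.0, "ship": 0.0}}
you = None  # player 0's own strategy is never read when computing player 0's Power

print(power(game, 0, [you, nice]))
print(power(game, 0, [you, mean]))
```

Under these assumptions, a cooperative teammate gives you Power 1.0 and a sabotaging one gives you Power 0.5. Note that your own strategy never enters the computation: Power measures what you could achieve, not what you actually do.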

Properties of constant-sum games

As both a preliminary result and a reference point for intuition, we consider the special case of zero-sum games:

A zero-sum game is a game in which, for every possible outcome of the game, the players' rewards sum to zero. For Bayesian games, this means that for all type profiles $\mathbf{t}$ and action profiles $\mathbf{a}$, we have $\sum_i r_i(t_i, \mathbf{a}) = 0$. Similarly, a constant-sum game is a game satisfying $\sum_i r_i(t_i, \mathbf{a}) = c$ for some constant $c$ and any choices of $\mathbf{t}, \mathbf{a}$.

As a simple example, consider chess, a two-player adversarial game. We let the reward profile be constant, given by "1 if you win, -1 if you lose" (assume black wins in a tie). This game is clearly zero-sum, since exactly one player wins and the other loses. We could ask the same "how well can you do?" question as before, but the upper bound of winning is trivial. Instead, we ask "how well can both players simultaneously do?"

Clearly, you can't both simultaneously win. However, we can imagine scenarios where both players have the power to win: in a chess game between two beginners, the optimal strategy for either player will easily win the game. As it turns out, this argument generalizes (we'll even prove it): in a constant-sum game, the sum of the players' Power is at least $c$, with equality iff each player responds optimally for all their possible goals ("types"). This condition is equivalent to a Bayesian Nash Equilibrium of the game.

Importantly, this idea suggests a general principle of multi-agent Power I'll call power-scarcity: in multi-agent games, gaining power tends to come at the expense of another player losing power. Future research will focus on understanding this phenomenon further and relating it to "how aligned the agents are" in terms of their reward functions.

Claim: Consider a Bayesian constant-sum game with some strategy profile $\sigma$. Then, $\sum_i \text{POWER}(i, \sigma) \geq c$, with equality iff $\sigma$ is a Nash Equilibrium.

Intuition: By definition, $\sigma$ isn't a Nash Equilibrium iff some player $i$'s strategy $\sigma_i$ isn't a best response. In this case, we see that player $i$ has the power to play optimally, but the other players also have the power to capitalize off of player $i$'s mistake (since the game is constant-sum). Thus, the lost reward is "double-counted" in terms of Power; if no such double-counting exists, then the sum of Power is just the expected sum of reward, which is $c$ by definition of a constant-sum game.

Rigorous proof:

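We prove the following for general strategy profiles $\sigma$:

$$\begin{aligned}
\sum_i \text{POWER}(i, \sigma) &= \sum_i \mathbb{E}_{t_i}\left[\max_{a_i} f_i(t_i, a_i, \sigma_{-i})\right] \\
&\geq \sum_i \mathbb{E}_{t_i}\left[\mathbb{E}_{a_i \sim \sigma_i(t_i)} f_i(t_i, a_i, \sigma_{-i})\right] \\
&= \mathbb{E}_{\mathbf{t} \sim \tau,\, \mathbf{a} \sim \sigma(\mathbf{t})}\left[\sum_i r_i(t_i, \mathbf{a})\right] \\
&= c.
\end{aligned}$$

Now, we claim that the inequality on line 2 is an equality iff $\sigma$ is a Nash Equilibrium. To see this, note that for each $i$, we have

$$\mathbb{E}_{t_i}\left[\max_{a_i} f_i(t_i, a_i, \sigma_{-i})\right] \geq \mathbb{E}_{t_i}\left[\mathbb{E}_{a_i \sim \sigma_i(t_i)} f_i(t_i, a_i, \sigma_{-i})\right],$$

with equality iff $\sigma_i$ is a best response to $\sigma_{-i}$. Thus, the sum of these inequalities for each player is an equality iff each $\sigma_i$ is a best response, which is the definition of a Nash Equilibrium.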
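As a sanity check on the claim, the sketch below evaluates total Power in matching pennies, a zero-sum game (so the constant is 0) in which each player has a single type. Using matching pennies instead of chess is a simplifying assumption made so the whole game fits in a 2x2 table; the function names are illustrative.

```python
# Row player's reward in matching pennies; the column player receives the negative,
# so the rewards always sum to zero (a constant-sum game with constant 0).
R = [[1.0, -1.0],
     [-1.0, 1.0]]  # R[a_row][a_col], with index 0 = heads and 1 = tails


def power_row(q):
    """Row player's Power: the best expected reward against column strategy q."""
    return max(sum(R[a][b] * q[b] for b in range(2)) for a in range(2))


def power_col(p):
    """Column player's Power: the best expected reward against row strategy p."""
    return max(sum(-R[a][b] * p[a] for a in range(2)) for b in range(2))


def is_nash(p, q, tol=1e-9):
    """Both players' realized expected rewards already match their best responses."""
    row_value = sum(p[a] * q[b] * R[a][b] for a in range(2) for b in range(2))
    return row_value >= power_row(q) - tol and -row_value >= power_col(p) - tol


for p_heads, q_heads in [(1.0, 1.0), (0.5, 1.0), (0.5, 0.5)]:
    p, q = (p_heads, 1 - p_heads), (q_heads, 1 - q_heads)
    total = power_row(q) + power_col(p)
    print(f"p={p_heads}, q={q_heads}: total Power={total:.2f}, Nash={is_nash(p, q)}")
```

When both players are fully predictable, each has the power to win and total Power is 2; when only one is predictable, total Power is 1; at the mixed equilibrium, total Power drops to the constant 0. This mirrors the "two beginners" intuition: the excess Power is exactly the reward that a non-best-responding player leaves on the table.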

Final notes

To wrap up, I'll elaborate on the implications of this theorem, as well as some areas of further exploration on power-scarcity:

  • It initially seems unintuitive that as players' strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like "as your strategy improves, other players lose the power to capitalize off of your mistakes". More work is probably needed to get a clearer picture of this dynamic.
  • We suspect that if all players have identical rewards, then the sum of Power is equal to the sum of best-case Power for each player. This gives the appearance of a spectrum with [aligned rewards (common payoff), maximal sum power] on one end and [anti-aligned rewards (constant-sum), constant sum power] on the other. Further research might look into an interpolation between these two extremes, possibly characterized by a correlation metric between reward functions.
    • We also plan to generalize Power to Bayesian stochastic games to account for sequential decision making. Thus, any such metric for comparing reward functions would have to be consistent with such a generalization.
  • Power-scarcity results in terms of Nash Equilibria suggest the following dynamic: as agents get smarter and take available opportunities, Power becomes increasingly scarce. This matches the intuitions presented in the Catastrophic Convergence Conjecture, where agents don’t fight over resources until they get sufficiently “advanced.”
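The "better play means less collective Power" dynamic from the first note can be seen numerically. In this sketch, which uses rock-paper-scissors as an assumed example game, player 1 always plays paper while player 2 gradually shifts from the exploitable "always rock" toward the uniform mix. The interpolation parameter `s` and all function names are illustrative.

```python
def rps_reward(a, b):
    """Player 1's reward for playing a against b (0 = rock, 1 = paper, 2 = scissors)."""
    return [0.0, 1.0, -1.0][(a - b) % 3]  # tie, win, loss


def power_1(q):
    """Player 1's Power: the best expected reward against player 2's strategy q."""
    return max(sum(rps_reward(a, b) * q[b] for b in range(3)) for a in range(3))


def power_2(p):
    """Player 2's Power: the best expected reward against player 1's strategy p."""
    return max(sum(-rps_reward(a, b) * p[a] for a in range(3)) for b in range(3))


def expected_reward_2(p, q):
    """Player 2's actual expected reward under the strategy profile (p, q)."""
    return sum(-rps_reward(a, b) * p[a] * q[b] for a in range(3) for b in range(3))


p = (0.0, 1.0, 0.0)  # player 1 always plays paper
rows = []
for step in range(5):
    s = step / 4  # s = 0 is "always rock", s = 1 is the uniform mix
    q = (1 - 2 * s / 3, s / 3, s / 3)
    rows.append((s, expected_reward_2(p, q), power_1(q), power_2(p)))

for s, reward_2, pow_1, pow_2 in rows:
    print(f"s={s:.2f}  reward_2={reward_2:+.2f}  POWER_1={pow_1:.2f}  POWER_2={pow_2:.2f}")
```

As player 2's strategy improves, player 2's actual reward rises and player 1's Power falls, while player 2's own Power stays fixed because it depends only on player 1's strategy. Total Power therefore shrinks, but it stays above zero at the end of the sweep because player 1 is still not best-responding.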

Comments

Exciting to see new people tackling AI Alignment research questions! (and I'm already excited by what Alex is doing, so him having more people work in his kind of research feels like a good thing).

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I'm just not seeing how important it is, but then it is not obvious from the post alone.

On the positive side, it was quite agreeable to read, and I followed all the formal parts. My only criticism of the form is that I would have liked a statement of what will be proved/done in the post upfront, instead of having to wait for the last section.

This might be harsh criticism, but I really encourage you to keep working in the field, and hopefully prove me wrong by expanding on this work in more advanced and exciting ways.

Alternatively, imagine that your team spends the meeting breaking your knees and your laptop.

This is an example of wit done well in a "serious" post. I approve.

Strategies (technically, mixed strategies) in a Bayesian game are given by functions $\sigma_i: T_i \to \Delta(A_i)$. Thus, even given a fixed strategy profile $\sigma$, any notion of "expected reward of an action" will have to account for uncertainty in other players' types. We do so by defining interim expected utility for player $i$ as follows:

You haven't defined  at that point, and you don't introduce the indexing  for other strategies before the next line. So is this a typo (where you wanted to write ) or am I just misunderstanding the formula? I'm even more confused because you use  to compute , and so if it's not a typo this means that your interim utility considers that every other agent uses the same strategy?

Coming back after reading more, do you use  to mean "the strategy profile for every process except "? That would make more sense of the formulas (since you fix , there's no reason to have a ) but if it's the case, then this notation is horrible (no offense).

By the way, indexing the other strategies by  instead of, let's say  or  is quite unconventional and confusing.

It initially seems unintuitive that as players' strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like "as your strategy improves, other players lose the power to capitalize off of your mistakes".

I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals.

Thanks so much for your comment! I'm going to speak for myself here, and not for Jacob.

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don't feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I'm just not seeing how important it is, but then it is not obvious from the post alone.

I think this is an understandable reaction. I personally feel excited by the formalism and theorem and I'll try to explain why.

Coming off of Optimal Policies Tend to Seek Power last summer, I felt like I understood single-agent Power reasonably well (at that point in time, I had already dropped the assumption of optimality). Last summer, "understand multi-agent power" was actually the project I intended to work on under Andrew Critch. I ended up understanding defection instead (and how it wasn't necessarily related to Power-seeking), and corrigibility-like properties, and further expanding the single-agent results. But I was still pretty confused about the multi-agent situation.

The crux was, in an MDP, you've got a state, and it's pretty clear what an agent can do. But in the multi-agent case, now you've got other reasoners, and now you have to account for their influence. So at first I thought, 

maybe Power is about being able to enforce your will even against the best efforts of the other players

which would correspond to everyone else minmax-ing you on any goal you chose. But this wasn't quite right. I thought about this for a while, and I didn't make much progress, and somehow I didn't come up with the formalism in this post until this winter when I started working with Jacob. In hindsight, maybe it's obvious: 

  • in an MDP, the relevant "situation" is the current state; measure the agent's average optimal value at that state.
  • in a non-iterated multi-agent game, the relevant "situation" is just the other players' strategy profile; measure your average maximum reward, assuming everyone else follows the strategy profile.
    • This should extend naturally into Bayesian stochastic games, to account for sequential decision-making and truly generalize the MDP results.

But for me, I was excited about the Power formalism when (IIRC) I proposed to Jacob that we prove results about that formalism. Jacob was the one who formulated the theorem, and I actually didn't buy it at first; my naive intuition was that Power should always be constant when summed over players who have their types drawn from the constant-sum distribution. This was wrong, so I was pretty surprised.

But the thing I'm most excited about is how I had this intuitive picture of "if your goals are unaligned, then in worlds like ours, one person gaining power means other people must lose power, after 'some point'."

Intuitively this seems obvious, just like the community knew about instrumental convergence before my formal results. But I'm excited now that we can prove the intuitively correct conclusion, using a notion of Power that mirrors the one used in the single-agent case for the existing power-seeking results. And this wasn't obvious to me, at least.

-----

That said, there are some "logical time" aspects of "game theoretic power" that we don't cover, and aren't trying to cover. For example, some decision-making algorithms might be really good at ensuring they come "first" in logical time, which gives them a kind of power over other reasoners in the game. For example, if you're convinced that I'll tear off my steering wheel in Chicken, then I precede you in logical time and bully you into swerving, and I benefit from this greatly. 

I think this is an intriguing problem, but out of scope: we want to understand competition over "resources" (whatever kind of thing that is), and how one player "gaining power" can make other players "lose power."

Thanks for the detailed reply!

I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to come around at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.

The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state of the research, as well as the argument for its value, fit more the "here's-a-cool-formal-idea blogpost or workshop paper". And that's independently of the importance of the result. To say it again differently, I'm ready to accept the importance of a formalism without much explanation of why I should care if it shows a lot of cool results, but when the results are few, I need a more detailed story of why I should care.

About your specific story now:

Coming off of Optimal Policies Tend to Seek Power last summer, I felt like I understood single-agent Power reasonably well (at that point in time, I had already dropped the assumption of optimality). Last summer, "understand multi-agent power" was actually the project I intended to work on under Andrew Critch. I ended up understanding defection instead (and how it wasn't necessarily related to Power-seeking), and corrigibility-like properties, and further expanding the single-agent results. But I was still pretty confused about the multi-agent situation.

Nothing to say here, except that you have the frustrating (for me) ability to make me want to read 5 of your posts in detail when explaining something completely different. I am also supposed to do my own research, you know? (Related: I'd be excited to review one of your posts with the review project we're doing with a bunch of other researchers. Not sure which post of yours would be most appropriate though. If you have some idea, you can post it here. ;) )

The crux was, in an MDP, you've got a state, and it's pretty clear what an agent can do. But in the multi-agent case, now you've got other reasoners, and now you have to account for their influence. So at first I thought, 

maybe Power is about being able to enforce your will even against the best efforts of the other players

which would correspond to everyone else minmax-ing you on any goal you chose. But this wasn't quite right. I thought about this for a while, and I didn't make much progress, and somehow I didn't come up with the formalism in this post until this winter when I started working with Jacob. In hindsight, maybe it's obvious: 

  • in an MDP, the relevant "situation" is the current state; measure the agent's average optimal value at that state.
  • in a non-iterated multi-agent game, the relevant "situation" is just the other players' strategy profile; measure your average maximum reward, assuming everyone else follows the strategy profile.
    • This should extend naturally into Bayesian stochastic games, to account for sequential decision-making and truly generalize the MDP results.

When phrased that way, I think my "issue" is that the subtlety you add is mostly hidden within the additional parameter of the strategy profile. That is, with the original intuition, you don't have to find out what the other players will actually do; here you kind of have to. It's a good thing as I agree with you that it makes the intuition subtler, but it also creates a whole new complex problem of inferring strategies.

At this point, I went to reread the last sections, and realized that you're partially dealing with my problem by linking power with well-known strategy profiles (the Nash equilibria).

But for me, I was excited about the Power formalism when (IIRC) I proposed to Jacob that we prove results about that formalism. Jacob was the one who formulated the theorem, and I actually didn't buy it at first; my naive intuition was that Power should always be constant when summed over players who have their types drawn from the constant-sum distribution. This was wrong, so I was pretty surprised.

This part pushed me to reread the statements in detail. If I get it correctly, you had the intuition that the power behaved like "will this player win", whereas it actually works as "keeping everything else fixed, how well can this player end up". The trick that makes the theorem true and the power bigger than the sum is that for a strategy profile that isn't a Nash equilibrium, multiple players might get a lot if they change their action in turn while keeping everything else fixed.

I'm a bit ashamed, because that's actually explained in the intuition of the proof, but I didn't get it on the first reading. I also see now that it was the point of the discussion before the theorem, but that part flew over my head. So my advice for this would be to explain in even more detail the initial intuition and why it is wrong, including where in the maths this happens (the fixing of $\sigma_{-i}$).

My updated take after getting this point is that I'm a bit more excited about your formalism.

But the thing I'm most excited about is how I had this intuitive picture of "if your goals are unaligned, then in worlds like ours, one person gaining power means other people must lose power, after 'some point'."

Intuitively this seems obvious, just like the community knew about instrumental convergence before my formal results. But I'm excited now that we can prove the intuitively correct conclusion, using a notion of Power that mirrors the one used in the single-agent case for the existing power-seeking results. And this wasn't obvious to me, at least.

I agree that this is exciting, but this is only mentioned in the last line of the post, as one perspective among others. Notably, it wasn't clear at all that this was the main application of this work.

Thank you so much for the comments! I'm pretty new to the platform (and to EA research in general), so feedback is useful for getting a broader perspective on our work.

To add to TurnTrout's comments about power-scarcity and the CCC, I'd say that the broader vision of the multi-agent formulation is to establish a general notion of power-scarcity as a function of "similarity" between players' reward functions (I mention this in the post's final notes). In this paradigm, the constant-sum case is one limiting case of "general power-scarcity", which I see as the "big idea". As a simple example, general power-scarcity would provide a direct motivation for fearing robustly instrumental goals, since we'd have reason to believe an AI with goals orthogonal(ish) from human goals would be incentivized to compete with humanity for Power.

We're planning to continue investigating multi-agent Power and power-scarcity, so hopefully we'll have a more fleshed-out notion of general power-scarcity in the months to come.

Also, re: "as players' strategies improve, their collective Power tends to decrease", I think your intuition is correct? Upon reflection, the effect can be explained reasonably well by "improving your actions has no effect on your Power, but a negative effect on opponents' Power".

Glad to be helpful!

I go into more detail in my answer to Alex, but what I want to say here is that I don't feel like you use the power-scarcity idea enough in the post itself. As you said, it's one of three final notes, and without any emphasis on it.

So while I agree that the power-scarcity is an important research question, it would be helpful IMO if this post put more emphasis on that connection.

It initially seems unintuitive that as players' strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like "as your strategy improves, other players lose the power to capitalize off of your mistakes".

"I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals."

IMO, the unintuitive and potentially problematic thing is not that in a zero-sum game playing better makes things worse for everybody else. That part is fine. The unintuitive and potentially problematic thing is that, according to this formalism, the total collective Power is greater the worse everybody plays. This seems adjacent to saying that everybody would be better off if everyone played poorly, which is true in some games (maybe) but definitely not true in zero-sum games. (Right? This isn't my area of expertise)

EDIT: Currently I suppose what you'd say is that power =/= utility, and so even though we'd all have more power if we were all less competent, we wouldn't actually be better off. But perhaps a better way forward would be to define a new concept of "Useful power" or something like that, which equals your share of the total power in a zero-sum game. Then we could say that everyone getting less competent wouldn't result in everyone becoming more usefully-powerful, which seems like an important thing to be able to say. Ideally we could just redefine power that way instead of inventing a new concept of useful power, but maybe that would screw up some of your earlier theorems?

But perhaps a better way forward would be to define a new concept of "Useful power" or something like that, which equals your share of the total power in a zero-sum game.

I don’t see why useful power is particularly useful, since it’s taking a non-constant-sum quantity (outside of Nash equilibria) and making it constant-sum, which seems misleading.
 

But I also don’t see a problem with the “better play -> less exploitability -> less total Power” reasoning. This feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon.

this feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon. 

Different strokes for different folks, I guess. It feels very different to me.

Probably going to reply to the rest later (and midco can as well, of course), but regarding:

Coming back after reading more, do you use  to mean "the strategy profile for every process except "? That would make more sense of the formulas (since you fix , there's no reason to have a ) but if it's the case, then this notation is horrible (no offense).

By the way, indexing the other strategies by  instead of, let's say  or  is quite unconventional and confusing.

Using "$\sigma_{-i}$" to mean "the strategy profile of everyone but player $i$" is common notation; I remember it being used in 2-3 game theory textbooks I read, and you can see its prominence by consulting the Wikipedia page for Nash equilibrium.

Do I agree this is horrible notation? Meh. I don't know. But it's not a convention we pioneered in this work.

Ok, that's fair. It's hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven't studied game theory formally.

Maybe making all vector profiles bold (like for the action profile) would help to see at a glance the type of the parameter. If I had seen it was a strategy profile, I would have inferred immediately what it meant.