# 15

In my experience, constant-sum games are considered to provide "maximally unaligned" incentives, and common-payoff games are considered to provide "maximally aligned" incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary $2 \times 2$ payoff table representing a two-player normal-form game (like Prisoner's Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?

If this question is ill-posed, why is it ill-posed? And if it's not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity's future. (I started considering this question with Jacob Stavrianos over the last few months, while supervising his SERI project.)

Thoughts:

• Assume the alignment function has range $[0, 1]$ or $[-1, 1]$.
• Constant-sum games should have minimal alignment value, and common-payoff games should have maximal alignment value.
• The function probably has to consider a strategy profile (since different parts of a normal-form game can have different incentives; see e.g. equilibrium selection).
• The function should probably be a function of player A's alignment with player B; for example, in a prisoner's dilemma, player A might always cooperate and player B might always defect. Then it seems reasonable to consider whether A is aligned with B (in some sense), while B is not aligned with A (they pursue their own payoff without regard for A's payoff).
• So the function need not be symmetric over players.
• The function should be invariant to applying a separate positive affine transformation to each player's payoffs; it shouldn't matter whether you add 3 to player 1's payoffs, or multiply the payoffs by a half.
• The function may or may not rely only on the players' orderings over outcome lotteries, ignoring the cardinal payoff values. I haven't thought much about this point, but it seems important. EDIT: I no longer think this point is important, but rather confused.

• Do some thought experiments to pin down the intuitive concept. Consider simple games where my "alignment" concept returns a clear verdict, and use these to derive functional constraints (like symmetry in players, or the range of the function, or the extreme cases).
• See if I can get enough functional constraints to pin down a reasonable family of candidate solutions, or at least pin down the type signature.


# 5 Answers sorted by top scoring

Vanessa Kosoy

### Jun 18, 2021

14

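Consider any finite two-player game in normal form (each player can have any finite number of strategies; we can also easily generalize to certain classes of infinite games). Let $S_A$ be the set of pure strategies of player $A$ and $S_B$ the set of pure strategies of player $B$. Let $u_A : S_A \times S_B \to \mathbb{R}$ be the utility function of player $A$. Let $(\alpha, \beta) \in \Delta S_A \times \Delta S_B$ be a particular (mixed) outcome. Then the alignment of player $B$ with player $A$ in this outcome is defined to be:

$$a_{B/A}(\alpha, \beta) := \frac{\mathbb{E}_{x \sim \alpha,\, y \sim \beta}\left[u_A(x, y)\right] - \min_{y' \in S_B} \mathbb{E}_{x \sim \alpha}\left[u_A(x, y')\right]}{\max_{y' \in S_B} \mathbb{E}_{x \sim \alpha}\left[u_A(x, y')\right] - \min_{y' \in S_B} \mathbb{E}_{x \sim \alpha}\left[u_A(x, y')\right]}$$

Of course, so far it doesn't depend on $u_B$ at all. However, we can make it depend on $u_B$ if we use $u_B$ to impose assumptions on $(\alpha, \beta)$, such as:

• $\beta$ is a $u_B$-best response to $\alpha$, or
• $(\alpha, \beta)$ is a Nash equilibrium (or other solution concept)

Caveat: If we go with the Nash equilibrium option, $a_{B/A}$ can become "systematically" ill-defined (consider e.g. the Nash equilibrium of matching pennies). To avoid this, we can switch to the extensive-form game where $B$ chooses their strategy after seeing $A$'s strategy.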
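A minimal Python sketch of this score, reading the definition as "A's expected payoff under the actual outcome, rescaled between the worst and best that any pure strategy of B could do for A while A's mixed strategy is held fixed". The function name, the list-of-lists payoff layout, and the choice to return `None` when the denominator is zero are assumptions made for illustration, not part of the answer itself.

```python
from typing import Optional, Sequence


def alignment_b_with_a(
    u_a: Sequence[Sequence[float]],
    alpha: Sequence[float],
    beta: Sequence[float],
) -> Optional[float]:
    """Alignment of B with A at the mixed outcome (alpha, beta).

    u_a[x][y] is A's payoff when A plays pure strategy x and B plays y.
    alpha and beta are probability vectors over A's and B's pure strategies.
    Returns None when B's choice cannot change A's expected payoff
    (zero denominator), leaving that case undefined.
    """
    # A's expected payoff for each pure strategy y' of B, holding alpha fixed.
    per_pure = [
        sum(alpha[x] * u_a[x][y] for x in range(len(alpha)))
        for y in range(len(beta))
    ]
    worst, best = min(per_pure), max(per_pure)
    if best == worst:
        return None
    # A's expected payoff under B's actual (possibly mixed) strategy.
    actual = sum(beta[y] * per_pure[y] for y in range(len(beta)))
    return (actual - worst) / (best - worst)


# Coordination game: A gets 1 if both pick the same action, else 0.
u_a = [[1.0, 0.0], [0.0, 1.0]]
play_first = [1.0, 0.0]   # point mass on the first action
play_second = [0.0, 1.0]  # point mass on the second action

coordinated = alignment_b_with_a(u_a, play_first, play_first)
miscoordinated = alignment_b_with_a(u_a, play_first, play_second)
coin_flip = alignment_b_with_a(u_a, play_first, [0.5, 0.5])
constant_game = alignment_b_with_a([[2.0, 2.0], [2.0, 2.0]], play_first, play_first)
```

Note that `u_b` never appears: under this reading the score only measures how B's actual play affects A, which matches the remark that the definition does not depend on B's utility unless it is used to constrain the outcome.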

In a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that "it doesn't necessarily depend on $u_B$" is a feature, not a bug.

EDIT: To handle constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can't affect A's expected utility, and so it's not possible for B to act against A's interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?

1 Vanessa Kosoy 2y
In common-payoff games the denominator is not zero, in general. For example, suppose that $S_A = S_B = \{a, b\}$, $u_A(a,a) = u_A(b,b) = 1$, $u_A(a,b) = u_A(b,a) = 0$, $u_B \equiv u_A$, $\alpha = \beta = \delta_a$. Then $a_{B/A}(\alpha, \beta) = 1$, as expected: the current payoff is 1; if $B$ played $b$ it would be 0.
2 Alex Turner 2y
You're right. Per Jonah Moss's comment [https://www.lesswrong.com/posts/ghyw76DfRyiiMxo3t/open-problem-how-can-we-quantify-player-alignment-in-2x2?commentId=G8A3KtPw8vHwFnDj9], I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.
3 Vanessa Kosoy 2y
I don't think in this case $a_{B/A}$ should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game B can be equally well conceptualized as maximally aligned or as maximally anti-aligned. It is true that if, out of some set of objects you consider the subset of those that have $a_{B/A} = 1$, then it's natural to include the undefined cases too. But, if out of some set of objects you consider the subset of those that have $a_{B/A} = 0$, then it's also natural to include the undefined cases. This is similar to how $(0,0) \in \mathbb{R}^2$ is simultaneously in the closure of $\{\frac{x}{y} = 1\}$ and in the closure of $\{\frac{x}{y} = -1\}$, so $\frac{0}{0}$ can be considered to be either 1 or −1 (or any other number) depending on context.

This also suggests that "selfless" perfect B/A alignment is possible in zero-sum games, with the "maximal misalignment" only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.

Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B's impact alignment with A, relative to the counterfactuals where B did the best or worst job possible.

There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).

Mykhailo Odintsov

### Jun 16, 2021

0

So, something like "fraction of preferred states shared"? Describe the preferred states for P1 as the cells in the payoff matrix that are best for P1 for each P2 action (and the preferred states for P2 in a similar manner). The fraction of P1's preferred states that are also preferred for P2 is a measure of the alignment of P1 with P2. The fraction of shared states out of the total number of preferred states is a measure of the total alignment of the game.

For a 2x2 game, each player will have 2 preferred states (corresponding to the 2 possible actions of the opponent). If 1 of them is the same cell, then each player is 50% aligned with the other (1 of 2 shared) and the game in total is 33% aligned (1 of 3). This also generalizes easily to the NxN case and to more than 2 players.

And if K cells are tied with the same payoff for some opponent action, we can give each of them weight 1/K instead of 1.

(it would be much easier to explain with a picture and/or table, but I'm pretty new here and wasn't able to find how to do them here yet)
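In lieu of a picture, here is a short Python sketch of the counting rule described in this answer. The payoff layout (`payoffs[row][col] = (P1 payoff, P2 payoff)`), the function names, and the Prisoner's Dilemma numbers are assumptions for illustration. For simplicity, tied cells are all counted in full rather than weighted by 1/K.

```python
def preferred_cells(payoffs, player):
    """Cells that are best for `player` against each fixed opponent action.

    payoffs[r][c] is a (P1 payoff, P2 payoff) pair; P1 picks the row r and
    P2 picks the column c. player is 0 for P1 and 1 for P2.
    """
    n_rows, n_cols = len(payoffs), len(payoffs[0])
    cells = set()
    if player == 0:
        # For each P2 action (column), keep the row(s) that maximise P1's payoff.
        for c in range(n_cols):
            best = max(payoffs[r][c][0] for r in range(n_rows))
            cells |= {(r, c) for r in range(n_rows) if payoffs[r][c][0] == best}
    else:
        # For each P1 action (row), keep the column(s) that maximise P2's payoff.
        for r in range(n_rows):
            best = max(payoffs[r][c][1] for c in range(n_cols))
            cells |= {(r, c) for c in range(n_cols) if payoffs[r][c][1] == best}
    return cells


def shared_fraction_scores(payoffs):
    """Per-player and whole-game 'fraction of preferred states shared'."""
    p1 = preferred_cells(payoffs, 0)
    p2 = preferred_cells(payoffs, 1)
    shared = p1 & p2
    return {
        "p1_with_p2": len(shared) / len(p1),
        "p2_with_p1": len(shared) / len(p2),
        "game_total": len(shared) / len(p1 | p2),
    }


# Hypothetical Prisoner's Dilemma payoffs; index 0 = cooperate, 1 = defect.
prisoners_dilemma = [
    [(3, 3), (0, 5)],
    [(5, 0), (1, 1)],
]
pd_scores = shared_fraction_scores(prisoners_dilemma)

# Hypothetical common-payoff coordination game.
coordination = [
    [(1, 1), (0, 0)],
    [(0, 0), (1, 1)],
]
coordination_scores = shared_fraction_scores(coordination)
```

With these numbers the only shared preferred cell in the Prisoner's Dilemma is defect/defect, which reproduces the "1 of 2 shared, 1 of 3 in total" case from the 2x2 discussion.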

I like this answer, and I'm going to take more time to chew on it.

Dagon

### Jun 16, 2021

2

I think this is backward.  The game's payout matrix determines the alignment.  Fixed-sum games imply (in the mathematical sense) unaligned players, and common-payoff games ARE the definition of alignment.

When you start looking at meta-games (where resource payoffs differ from utility payoffs, based on agent goals), then "alignment" starts to make sense as a distinct measurement - it's how much the players' utility functions transform the payoffs (in the sub-games of a series, and in the overall game) from fixed-sum to common-payoff.

I don't follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?

Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn't have to select a best response. For example, imagine playing stag hunt with someone who responds 'hare' to stag/stag; this isn't a best response for them, but it minimizes your payoff. However, another partner could respond 'stag' to stag/stag, which (I think) makes them "less unaligned with you" than the partner who responds 'hare' to stag/stag.
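One way to make the stag hunt comparison concrete is sketched below. The payoff numbers, the policy names, and the scoring rule (your payoff after the partner's response, rescaled between the worst and best the partner could have done for you) are all assumptions for illustration; the rescaling borrows the normalisation from Vanessa Kosoy's answer rather than anything stated in this comment.

```python
# Hypothetical stag hunt payoffs, keyed by (your action, partner's action),
# with values (your payoff, partner's payoff).
PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
ACTIONS = ("stag", "hare")


def spiteful_partner(profile):
    """Response policy that answers every strategy profile with 'hare'."""
    return "hare"


def cooperative_partner(profile):
    """Response policy that answers every strategy profile with 'stag'."""
    return "stag"


def normalised_impact(profile, partner_policy):
    """How good the partner's response is for you, on a 0-to-1 scale.

    profile is (your action, partner's current action). Your action is held
    fixed; the partner's response is compared against the worst and best
    responses available to them, measured by your payoff. Returns None if
    the partner's response cannot change your payoff.
    """
    your_action, _ = profile
    outcomes = [PAYOFFS[(your_action, b)][0] for b in ACTIONS]
    worst, best = min(outcomes), max(outcomes)
    if best == worst:
        return None
    response = partner_policy(profile)
    return (PAYOFFS[(your_action, response)][0] - worst) / (best - worst)


spiteful_score = normalised_impact(("stag", "stag"), spiteful_partner)
cooperative_score = normalised_impact(("stag", "stag"), cooperative_partner)
```

Under these assumed payoffs, responding 'hare' to stag/stag costs the partner payoff (3 instead of 4) and leaves you with your worst outcome, while responding 'stag' gives you your best outcome, even though both partners face the same payoff table.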

1 Dagon 2y
Payout correlation IS the metric of alignment. A player who isn't trying to maximize their (utility) payout is actually not playing the game you've defined. You're simply incorrect (or describing a different payout matrix than you state) that a player doesn't "have to select a best response".
1 gjm 2y
I think "X and Y are playing a game of stag hunt" has multiple meanings. The meaning generally assumed in game theory when considering just a single game is that the outcomes in the game matrix are utilities. In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then we are not actually playing the stag hunt game. (Because part of what it means to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.) But there are some other situations that might be described by saying that X and Y are playing stag hunt. Maybe we are playing an iterated stag hunt. Then (by definition) what I care about is still some sort of aggregation of per-round outcomes, and (by definition) each round's outcome still has (stag,stag) best for me, etc. -- but now I need to strategize over the whole course of the game, and e.g. maybe I think that on a particular occasion choosing "hare" when you chose "stag" will make you understand that you're being punished for a previous choice of "hare" and make you more likely to choose "stag" in future. Or maybe we're playing an iterated iterated stag hunt. Now maybe I choose "hare" when you chose "stag", knowing that it will make things worse for me over subsequent rounds, but hoping that other people looking at our interactions will learn the rule Don't Fuck With Gareth and never, ever choose anything other than "stag" when playing with me. Or maybe we're playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we're in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many dollars we will get in various cases -- but maybe I'm a billionaire and literally don't care whether I get $1 or$10 and figure I might as well try to maximize your payout, or maybe you're a perfect altruist and (in the absence of any knowledge about our financial s
2 Alex Turner 2y
Thanks for the thoughtful response. It seems to me like you're assuming that players must respond rationally, or else they're playing a different game, in some sense. But why? The stag hunt game is defined by a certain set of payoff inequalities holding in the game. Both players can consider (stag,stag) the best outcome, but that doesn't mean they have to play stag against (stag, stag). That requires further rationality assumptions (which I don't think are necessary in this case). If I'm playing against someone who always defects against cooperate/cooperate, versus against someone who always cooperates against cooperate/cooperate, am I "not playing iterated PD" in one of those cases?
1 gjm 2y
I'm not 100% sure I am understanding your terminology. What does it mean to "play stag against (stag,stag)" or to "defect against cooperate/cooperate"? If your opponent is not in any sense a utility-maximizer then I don't think it makes sense to talk about your opponent's utilities, which means that it doesn't make sense to have a payout matrix denominated in utility, which means that we are not in the situation of my second paragraph above ("The meaning generally assumed in game theory..."). We might be in the situation of my last-but-two paragraph ("Or maybe we're playing a game in which..."): the payouts might be something other than utilities. Dollars, perhaps, or just numbers written on a piece of paper. In that case, all the things I said about that situation apply here. In particular, I agree that it's then reasonable to ask "how aligned is B with A's interests?", but I think this question is largely decoupled from the specific game and is more about the mapping from (A's payout, B's payout) to (A's utility, B's utility). I guess there are cases where that isn't enough, where A's and/or B's utility is not a function of the payouts alone. Maybe A just likes saying the word "defect". Maybe B likes to be seen as the sort of person who cooperates. Etc. But at this point it feels to me as if we've left behind most of the simplicity and elegance that we might have hoped to bring by adopting the "two-player game in normal form" formalism in the first place, and if you're prepared to consider scenarios where A just likes choosing the top-left cell in a 2x2 array then you also need to consider ones like the ones I described earlier in this paragraph -- where in fact it's not just the 2x2 payout matrix that matters but potentially any arbitrary details about what words are used when playing the game, or who is watching, or anything else. So if you're trying to get to the essence of alignment by considering simple 2x2 games, I think it would be best to leave that sor
2 Alex Turner 2y
Let $\pi_i(\sigma) = \sigma'_i$ be player $i$'s response function to strategy profile $\sigma$. Given some strategy profile (like stag/stag), player $i$ selects a response. I mean "response" in terms of "best response [https://en.wikipedia.org/wiki/Best_response]" - I don't necessarily mean that there's an iterated game. This captures all the relevant "outside details" for how decisions are made. I don't think I understand where this viewpoint is coming from. I'm not equating payoffs with VNM-utility [https://www.lesswrong.com/posts/ghyw76DfRyiiMxo3t/open-problem-how-can-we-quantify-player-alignment-in-2x2?commentId=pdvar6wKuLwX2fbup], and I don't think game theory usually does either - for example, the maxmin [https://en.wikipedia.org/wiki/Minimax] payoff solution concept does not involve VNM-rational expected utility maximization. I just identify payoffs with "how good is this outcome for the player", without also demanding that $\pi_i$ always select a best response. Maybe it's Boltzmann rational, or maybe it just always selects certain actions (regardless of their expected payouts). There exist two payoff functions. I think I want to know how impact-aligned [https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility#Nomenclature] one player is with another: how do the player's actual actions affect the other player (in terms of their numerical payoff values). I think (c) is closest to what I'm considering, but in terms of response functions - not actual iterated games. Sorry, I'm guessing this probably still isn't clear, but this is the reply I have time to type right now and I figured I'd send it rather than nothing.
0 Ericf 2y
Quote: Or maybe we're playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we're in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many dollars we will get in various cases -- but maybe I'm a billionaire and literally don't care whether I get $1 or $10 and figure I might as well try to maximize your payout, or maybe you're a perfect altruist and (in the absence of any knowledge about our financial situations) you just want to maximize the total take, or maybe I'm actually evil and want you to do as badly as possible. So, if the other player is "always cooperate" or "always defect" or any other method of determining results that doesn't correspond to the payouts in the matrix shown to you, then you aren't playing "prisoner's dilemma" because the utilities to player B are not dependent on what you do. In all these games, you should pick your strategy based on how you expect your counterparty to act, which might or might not include the "in game" incentives as influencers of their behavior.
3 Alex Turner 2y
Here is the definition of a normal-form game [https://en.wikipedia.org/wiki/Normal-form_game]: You are playing prisoner's dilemma when certain payoff inequalities [https://en.wikipedia.org/wiki/Prisoner%27s_dilemma#Generalized_form] are satisfied in the normal-form representation. That's it. There is no canonical assumption that players are expected utility maximizers, or expected payoff maximizers.  Noting that I don't follow what you mean by this: do you mean to say that player B's response cannot be a constant function of strategy profiles (ie the response function cannot be constant everywhere)?
2 Alex Turner 2y
Do you have a citation? You seem to believe that this is common knowledge among game theorists, but I don't think I've ever encountered that.

Jacob and I have already considered payout correlation, and I agree that it has some desirable properties. However,

* it's symmetric across players,
* it's invariant to player rationality
  * which matters, since alignment seems to not just be a function of incentives, but of what-actually-happens and how that affects different players
* it equally weights each outcome in the normal-form game, ignoring relevant local dynamics. For example, what if part of the game table is zero-sum, and part is common-payoff? Correlation then can be controlled by zero-sum outcomes which are strictly dominated for all players. For example:

1 / 1 || 2 / 2
-.5 / .5 || 1 / 1

and so I don't think it's a slam-dunk solution. At the very least, it would require significant support.

Why? I suppose it's common to assume (a kind of local) rationality for each player, but I'm not interested in assuming that here. It may be easier to analyze the best-response case as a first start, though.
1 Ericf 2y
It's a definitional thing. The definition of utility is "the thing people maximize." If you set up your 2x2 game to have utilities in the payout matrix, then by definition both actors will attempt to pick the box with the biggest number. If you set up your 2x2 game with direct payouts from the game that don't include psychic (e.g. "I just like picking the first option given") or reputational effects, then any concept of alignment is one of:

1. assume the players are trying for the biggest number, how much will they be attempting to land on the same box?
2. alignment is completely outside of the game, and is one of the features of the function that converts game payouts to global utility

You seem to be muddling those two, and wondering "how much will people attempt to land on the same box, taking into account all factors, but only defining the boxes in terms of game payouts." The answer there is "you can't." Because people (and computer programs) have wonky screwed up utility functions (e.g. (spoiler alert) https://en.wikipedia.org/wiki/Man_of_the_Year_(2006_film))
2 Alex Turner 2y
Only applicable if [https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem] you're assuming the players are VNM-rational over outcome lotteries, which I'm not. Forget expected utility maximization. It seems to me that people are making the question more complicated than it has to be, by projecting their assumptions about what a "game" is. We have payoff numbers describing how "good" each outcome is to each player. We have the strategy spaces, and the possible outcomes of the game. And here's one approach: fix two response functions in this game, which are functions from strategy profiles to the player's response strategy. With respect to the payoffs, how "aligned" are these response functions with each other? This doesn't make restrictive rationality assumptions. It doesn't require getting into strange utility assumptions. Most importantly, it's a clearly-defined question whose answer is both important and not conceptually obvious to me. (And now that I think of it, I suppose that depending on your response functions, even in zero-sum games, you could have "A aligned with B", or "B aligned with A", but not both.)
2 Rohin Shah 2y
Then what's the definition / interpretation of "payoff", i.e. the numbers you put in the matrix? If they're not utilities, are they preferences? How can they be preferences if agents can "choose" not to follow them? Where do the numbers come from? Note that Vanessa's answer doesn't need to depend on uB, which I think is its main strength and the reason it makes intuitive sense. (And I like the answer much less when uB is used to impose constraints.)
4 Alex Turner 2y
I think I've been unclear in my own terminology, in part because I'm uncertain about what other people have meant by 'utility' (what you'd recover from perfect IRL / Savage's theorem, or cardinal representation of preferences over outcomes?) My stance is that they're utilities but that I'm not assuming the players are playing best responses in order to maximize expected utility. Am I allowed to have preferences without knowing how to maximize those preferences, or while being irrational at times? Boltzmann-rational agents have preferences, don't they? These debates have surprised me; I didn't think that others tied together "has preferences" and "acts rationally with respect to those preferences."
3 Rohin Shah 2y
There's a difference between "the agent sometimes makes mistakes in getting what it wants" and "the agent does the literal opposite of what it wants"; in the latter case you have to wonder what the word "wants" even means any more. My understanding is that you want to include cases like "it's a fixed-sum game, but agent B decides to be maximally aligned / cooperative and do whatever maximizes A's utility", and in that case I start to question what exactly B's utility function meant in the first place. I'm told that Minimal Rationality [https://mitpress.mit.edu/books/minimal-rationality] addresses this sort of position, where you allow the agent to make mistakes, but don't allow it to be e.g. literally pessimal since at that point you have lost the meaning of the word "preference". (I kind of also want to take the more radical position where when talking about abstract agents the only meaning of preferences is "revealed preferences", and then in the special case of humans we also see this totally different thing of "stated preferences" that operates at some totally different layer of abstraction and where talking about "making mistakes in achieving your preferences" makes sense in a way that it does not for revealed preferences. But I don't think you need to take this position to object to the way it sounds like you're using the term here.)

Andrew Hyer

### Jun 16, 2021

0

Correlation between player payouts? In a zero sum game it is -1, when payouts are perfectly aligned it is +1, if payouts are independent it is 0.
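A small Python sketch of this proposal, computing the Pearson correlation between the two players' payoffs with every cell of the table weighted equally. The payoff layout, the function name, and the example games are assumptions for illustration; returning `None` when one player's payoff is constant across cells is also a choice made here, since the correlation is undefined in that case.

```python
import math


def payoff_correlation(payoffs):
    """Pearson correlation between P1's and P2's payoffs over all cells.

    payoffs[r][c] is a (P1 payoff, P2 payoff) pair. Every cell is weighted
    equally. Returns None if either player's payoff has zero variance.
    """
    cells = [cell for row in payoffs for cell in row]
    xs = [cell[0] for cell in cells]
    ys = [cell[1] for cell in cells]
    n = len(cells)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Matching pennies (zero-sum) and a common-payoff coordination game.
matching_pennies = [
    [(1, -1), (-1, 1)],
    [(-1, 1), (1, -1)],
]
coordination = [
    [(1, 1), (0, 0)],
    [(0, 0), (1, 1)],
]
zero_sum_corr = payoff_correlation(matching_pennies)
common_payoff_corr = payoff_correlation(coordination)

# Apply a positive affine transformation (halve, then add 3) to P1's payoffs.
rescaled = [[(0.5 * p1 + 3, p2) for p1, p2 in row] for row in matching_pennies]
rescaled_corr = payoff_correlation(rescaled)
```

The last check illustrates that this measure is unchanged by a positive affine transformation of one player's payoffs, which is one of the properties requested in the question. It takes only the payoff table as input, so it is symmetric across players and does not look at how either player actually plays.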

I agree that this is a good start, but I find it unsatisfactory.

Rafael Harth

### Jun 18, 2021

1

I'll take a shot at this. Let and be the sets of actions of Alice and Bob. Let (where 'n' means 'nice') be function that orders by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let (where 's' means 'selfish') be the function that orders by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function measuring similarity between two orderings of a finite set (should range over ); the alignment of with is then .

Example: in the prisoner's dilemma, , and orders whereas orders . Hence should be , i.e., Bob is maximally unaligned with Alice. Note that this makes it different from Mykhailo's answer which gives alignment , i.e., medium aligned rather than maximally unaligned.

This seems like an improvement over correlation since it's not symmetrical. In the game where Alice and Bob both get to choose numbers and Alice's utility function outputs whereas Bob's outputs , Bob would be perfectly aligned with Alice (his and both order ) but Alice perfectly unaligned with Bob (her orders but her orders ).

I believe this metric meets criteria 1,3,4 you listed. It could be changed to be sensitive to players' decision theories by changing (for alignment from Bob to Alice) to be the order output by Bob's decision theory, but I think that would be a mistake. Suppose I build an AI that is more powerful than myself, and the game is such that we can both decide to steal some of the other's stuff. If the AI does this, it leads to -10 utils for me and +2 for it (otherwise 0/0); if I do it, it leads to -100 utils for me because the AI kills me in response (otherwise 0/0). This game is trivial: the AI will take my stuff and I'll do nothing. Also, the AI is maximally unaligned with me. Now suppose I become as powerful as the AI and my 'take AI's stuff' becomes -10 for AI, +2 for me. This makes the game a prisoner's dilemma. If we both run UDT or FDT, we would now cooperate. If is the ordering of the AI's decision theory, this would mean the AI is now aligned with me, which is odd since the only thing that changed is me getting more powerful. With the original proposal, the AI is still maximally unaligned with me. More abstractly, game theory assumes your actions have influence on the other player's rewards (else the game is trivial), so if you cooperate for game-theoretical reasons, this doesn't seem to capture what we mean by alignment.

I want to point that this is a great example of a deconfusion open problem. There is a bunch of intuitions, some constraints, and then we want to clarify the confusion underlying it all. Not planning to work on it myself, but it sounds very interesting.

(Only caveat I have with the post itself is that it could be more explicit in the title that it is an open problem).

I went back and re-read your https://www.lesswrong.com/posts/8LEPDY36jBYpijrSw/what-counts-as-defection post, and it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.  You're using some hybrid of utility and resource payouts, where you seem to normalize payout amounts, but then don't limit the decision to the payouts - players have a utility function which converts the payouts (for all players, not just themselves) into something they maximize in their decision.  It's not clear whether they include any non-modeled information (how much they like the other player, whether they think there are future games or reputation effects, etc.) in their decision.

Based on this, I don't think the question is well-formed.  A 2x2 normal-form game is self-contained and one-shot.  There's no alignment to measure or consider - it's just ONE SELECTION, with one of two outcomes based on the other agent's selection.

It would be VERY INTERESTING to define a game nomenclature to specify the universe of considerations that two (or more) agents can have to make a decision, and then to define an "alignment" measure about when a player's utility function prefers similar result-boxes as the others' do.  I'd be curious about even very simple properties, like "is it symmetrical" (I suspect no - A can be more aligned with B than B is with A, even for symmetrical-in-resource-outcome games).

> it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.

Thanks for taking the time to read further / understand what I'm trying to communicate. Can you point me to the perspective you consider standard, so I know what part of my communication was unclear / how to reply to the claim that I'm not using "standard" payouts/utility?

Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates.   http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked in to my understanding to the point that I don't know where I first saw it.  I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences by each player).

So the question is "what DO these payout numbers represent"?  and "what other factors go into an agent's decision of which row/column to choose"?

Right, thanks!

1. I think I agree that payout represents player utility.
2. The agent's decision can be made in any way. Best response, worst response, random response, etc.

I just don't want to assume the players are making decisions via best response to each strategy profile (which is just some joint strategy of all the game's players). Like, in rock-paper-scissors, if we consider the strategy profile P1: rock, P2: scissors, I'm not assuming that P2 would respond to this by playing paper.

And when I talk about 'responses', I do mean 'response' in the 'best response' sense; the same way one can reason about Nash equilibria in non-iterated games, we can imagine asking "how would the player respond to this outcome?".

Another point for triangulating my thoughts here is Vanessa's answer, which I think resolves the open question.