# All of TurnTrout's Comments + Replies

Environmental Structure Can Cause Instrumental Convergence

Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:

if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

TurnTrout's shortform feed

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even...

Seeking Power is Often Convergently Instrumental in MDPs

I proposed changing "instrumental convergence" to "robust instrumentality." This proposal has not caught on, and so I reverted the post's terminology. I'll just keep using 'convergently instrumental.' I do think that 'convergently instrumental' makes more sense than 'instrumentally convergent', since the agent isn't "convergent for instrumental reasons", but rather, it's more reasonable to say that the instrumentality is convergent in some sense.

For the record, the post used to contain the following section:

## A note on terminology

[AN #156]: The scaling hypothesis: a plan for building AGI

Sure.

Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

2 · Rohin Shah · 10d: You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).
[AN #156]: The scaling hypothesis: a plan for building AGI

Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.

Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.

And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc) to specify a set of programs than does ...
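The "up to constant factors" point is usually stated via the invariance theorem for description length. As a hedged sketch (the notation $K_U$ for the length of the shortest program producing $x$ on machine $U$ is an assumption, not something this comment defines):

```latex
% Invariance theorem: for universal machines U_1 and U_2 there is a
% constant c_{U_1,U_2}, independent of x, such that for every string x
\[
  K_{U_1}(x) \;\le\; K_{U_2}(x) + c_{U_1,U_2}.
\]
% The bound says nothing about how large c_{U_1,U_2} is, which is why one
% machine can still make a given set of programs much easier to specify.
```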

6 · Rohin Shah · 11d: Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that. I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention [https://www.lesswrong.com/posts/X2AD2LgtKgkRNPj2a/privileging-the-hypothesis] (and thus Claim 4 as well).
The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

As I understand expanding candy into A and B but not expanding the other will make the ratios go differently.

What do you mean?

If we knew what was important and what not we would be sure about the optimality. But since we think we don't know it or might be in error about it we are treating that the value could be hiding anywhere.

I'm not currently trying to make claims about what variants we'll actually be likely to specify, if that's what you mean. Just that in the reasonably broad set of situations covered by my theorems, the vast majority of variants of every objective function will make power-seeking optimal.

A world in which the alignment problem seems lower-stakes

Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment.

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

Well, maybe here's a better way of communicating what I'm after:

A world in which the alignment problem seems lower-stakes

I'm not sure if you're arguing that this is a good world in which to think about alignment.

I am not arguing this. Quoting my reply to ofer:

I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail.

(Edited post to clarify)

A world in which the alignment problem seems lower-stakes

Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values

Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.)

(Edited post to clarify)

Environmental Structure Can Cause Instrumental Convergence

My take on it has been, the theorem's bottleneck assumption implies that you can't reach S again after taking action a1 or a2, which rules out cycles.

2 · Rohin Shah · 1mo: Yeah actually that works too
Environmental Structure Can Cause Instrumental Convergence

If the agent is sufficiently farsighted (i.e. the discount is near 1)

I'd change this to "optimizes average reward (i.e. the discount equals 1)". Otherwise looks good!

2 · Rohin Shah · 1mo: Done :)
Environmental Structure Can Cause Instrumental Convergence

I don't understand what you mean. Nothing contradicts the claim, if the claim is made properly, because the claim is a theorem and always holds when its preconditions do. (EDIT: I think you meant Rohin's claim in the summary?)

I'd say that we can just remove the quoted portion and just explain "a1 and a2 lead to disjoint sets of future options", which automatically rules out the self-loop case. (But maybe this is what you meant, ofer?)

1 · Ofer Givoli · 1mo: I was referring to the claim being made in Rohin's summary. (I no longer see counter examples after adding the assumption that "a1 and a2 lead to disjoint sets of future options".)
Environmental Structure Can Cause Instrumental Convergence

Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?

Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.

What criterion does that environment violate?

Right, good question. I'll explain the general principle (not stated in the paper - yes, I agree this needs to be fixed!), and th...

1 · Ofer Givoli · 1mo: Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can be reached by not breaking the vase. Here "{cycles reachable after taking a1 at s}" actually refers to an RSD, right? So we're not just talking about a set of states, we're talking about a set of vectors that each corresponds to a "state visitation distribution" of a different policy. In order for the "similar to" (via involution) relation to be satisfied, we need all the elements (real numbers) of the relevant vector pairs to match. This is a substantially more complicated condition than the one in your comment, and it is generally harder to satisfy in stochastic environments. In fact, I think that condition is usually hard/impossible to satisfy even in toy stochastic environments. Consider a version of Pac-Man in which at least one "ghost" is moving randomly at any given time; I'll call this Pac-Man-with-Random-Ghost (a quick internet search suggests that in the real Pac-Man the ghosts move deterministically other than when they are in "Frightened" mode, i.e. when they are blue and can't kill Pac-Man). Let's focus on the condition in Proposition 6.12 (which is identical to or less strict than the condition for the main claim, right?). Given some state in a Pac-Man-with-Random-Ghost environment, suppose that action a1 results in an immediate game-over state due to a collision with a ghost, while action a2 does not. For every terminal state s′, RSD_nd(s′) is a set that contains a single vector in which all entries are 0 except for one that is non-zero.
But for every state s that can result from action a2, we get that RSD(s) is a set that does not contain any vector-with-0s-in-all-entries-except-one, because for any policy, there is no way to get to a particular terminal state with probability
Environmental Structure Can Cause Instrumental Convergence

This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.

There are fewer ways for vase-breaking to be optimal. Optimal policies will tend to avoid breaking the vase, even though some don't.
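The counting intuition behind "fewer ways for vase-breaking to be optimal" can be sketched in a toy setting. Everything here is an assumption made for illustration rather than the paper's construction: breaking the vase leads to a single terminal outcome, keeping it leaves three other outcomes reachable, and the agent simply ends up in the reachable outcome with the highest reward. The state names and the helper `count_optimal` are hypothetical.

```python
from itertools import permutations

# Hypothetical toy environment: one outcome after breaking the vase,
# three distinct outcomes still reachable if the vase is kept intact.
BREAK_OUTCOMES = ["vase_broken"]
KEEP_OUTCOMES = ["vase_on_table", "vase_on_shelf", "vase_in_box"]
STATES = BREAK_OUTCOMES + KEEP_OUTCOMES


def breaking_is_optimal(reward):
    """True if the best outcome after breaking beats the best outcome after keeping."""
    best_after_break = max(reward[s] for s in BREAK_OUTCOMES)
    best_after_keep = max(reward[s] for s in KEEP_OUTCOMES)
    return best_after_break > best_after_keep


def count_optimal(base_rewards):
    """Count, over all permutations of one reward vector, how often each choice is optimal."""
    break_count = keep_count = 0
    for perm in permutations(base_rewards):
        reward = dict(zip(STATES, perm))
        if breaking_is_optimal(reward):
            break_count += 1
        else:
            keep_count += 1
    return break_count, keep_count


break_count, keep_count = count_optimal([0.0, 1.0, 2.0, 3.0])
print(break_count, keep_count)  # 6 18
```

Under these assumptions, breaking is optimal only for the permutations that put the largest reward on the single broken-vase outcome, so some objectives still favor breaking but most permuted variants do not.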

Consider the following counter example (in which the last state is equivalent to the agent being shut down):

This is just making my point - average-optimal policies tend to end up in any state but the last state, even though at any given state they tend to progres...

1 · Ofer Givoli · 1mo: Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why? My question is just about the main claim in the abstract of the paper ("We prove that for most prior beliefs one might have about the agent's reward function [...], one should expect optimal policies to seek power in these environments."). The main claim does not apply to the simple environment in my example (i.e. we should not expect optimal policies to seek POWER in that environment). I'm completely fine with that being the case, I just want to understand why. What criterion does that environment violate? I counted ~19 non-trivial definitions in the paper. Also, the theorems that the main claim directly relies on (which I guess is some subset of {Proposition 6.9, Proposition 6.12, Theorem 6.13}?) seem complicated. So I think the paper should definitely provide a reasonably simple description of the set of MDPs that the main claim applies to, and explain why proving things on that set is useful. Do you mean that the main claim of the paper actually applies to those environments (i.e. that they are in the formal set of MDPs that the relevant theorems apply to) or do you just mean that optimal policies in those environments tend to be POWER-seeking? (The main claim only deals with sufficient conditions.)
Environmental Structure Can Cause Instrumental Convergence

I haven't seen the paper support that claim.

The paper supports the claim with:

• Embodied environment in a vase-containing room (section 6.3)
• Pac-Man (figure 8)
• And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
• Average-optimal robots not idling in a particular spot (beginning of section 7)

This post supports the claim with:

• Tic-Tac-Toe
• Vase gridworld
• SafeLife

So yes, this is sufficient support for speculation that most relevant environments have these symmetries.

Maybe I just missed it, but I

1 · Ofer Givoli · 1mo: I think this refers to the following passage from the paper: This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase. Regarding your next bullet point: I don't know what you mean here by "generally holds". When does an environment in which the agent can be shut down "have the right symmetries" for the purpose of the main claim? Consider the following counter example (in which the last state is equivalent to the agent being shut down): In most states (the first 3 states) the optimal policies of most reward functions transition to the next state, while the POWER-seeking behavior is to stay in the same state (when the discount rate is sufficiently close to 1). If we want to tell a story about this environment, we can say that it's about a car in a one-way street. To be clear, the issue I'm raising here about the paper is NOT that the main claim does not apply to all MDPs. The issue is the lack of (1) a reasonably simple description of the set of MDPs that the main claim applies to; and (2) an explanation for why it is useful to prove things about that set. The limitations mentioned there are mainly: "Most real-world tasks are partially observable" and "our results only apply to optimal policies in finite MDPs". I think that another limitation that belongs there is that the main claim only applies to a particular set of MDPs.
Environmental Structure Can Cause Instrumental Convergence

My apologies - I had thought I had accidentally moved your comment to AF by unintentionally replying to your comment on AF, and so (from my POV) I "undid" it (for both mine and yours). I hadn't realized it was already on AF.

3 · Ofer Givoli · 1mo: No worries, thanks for the clarification. [EDIT: the confusion may have resulted from me mentioning the LW username "adamShimi", which I'll now change to the display name on the AF ("Adam Shimi").]
Environmental Structure Can Cause Instrumental Convergence

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper?).

I don't think it will be useful for me to engage in detail, given that we've already debated these points at length without reaching much consensus.

2 · Ofer Givoli · 1mo: I did read the "Note of caution" section in the OP. It says that most of the environments we think about seem to "have the right symmetries", which may be true, but I haven't seen the paper support that claim. Maybe I just missed it, but I didn't find a "limitations section" or similar in the paper. I did find the following in the Conclusion section: Though the title of the paper can still give the impression that it proves a core argument for AI x-risk. Also, plausibly-the-most-influential-critic-of-AI-safety in EA seems to have gotten the impression (from an earlier version of the paper) that it formalizes the instrumental convergence thesis (see the first paragraph here [https://forum.effectivealtruism.org/posts/7gxtXrMeqw78ZZeY9/ama-or-discuss-my-80k-podcast-episode-ben-garfinkel-fhi?commentId=BPCQqe5KTLBM24KDy] ). So I think my advice that "it should not be cited as a paper that formally proves a core AI alignment argument" is beneficial. For reference (in case anyone is interested in that discussion): I think it's the thread that starts here [https://www.lesswrong.com/posts/XkXL96H6GknCbT5QH/mdp-models-are-determined-by-the-agent-architecture-and-the?commentId=h4hFzkGntHsJgMLmC] (just the part after "2.").
Environmental Structure Can Cause Instrumental Convergence

I like the thought. I don't know if this sketch works out, partly because I don't fully understand it. Your conclusion seems plausible but I want to develop the arguments further.

As a note: the simplest function period probably is the constant function, and other very simple functions probably make both power-seeking and not-power-seeking optimal. So if you permute that one, you'll get another function for which power-seeking and not-power-seeking actions are both optimal.

2 · Daniel Kokotajlo · 1mo: Oh interesting... so then what I need for my argument is not the simplest function period, but the simplest function that doesn't make power-seeking and not-power-seeking both optimal? (isn't that probably just going to be the simplest function that doesn't make everything optimal?) I admit I am probably conceptually confused in a bunch of ways, I haven't read your post closely yet.
Alex Turner's Research, Comprehensive Information Gathering

This in turn leads to one of the strongest results of Alex's paper: for any "well-behaved" distribution on reward functions, if the environment has the sort of symmetry I mentioned, then for at least half of the permutations of this distribution, at least half of the probability mass will be on reward functions for which the optimal policy is power-seeking.

Clarification:

• The instrumental convergence (formally, optimality probability) results apply to all distributions over reward functions. So, the "important" part of my results apply to permutations of arb
Open problem: how can we quantify player alignment in 2x2 normal-form games?

I think I've been unclear in my own terminology, in part because I'm uncertain about what other people have meant by 'utility' (what you'd recover from perfect IRL / Savage's theorem, or cardinal representation of preferences over outcomes?) My stance is that they're utilities but that I'm not assuming the players are playing best responses in order to maximize expected utility.

How can they be preferences if agents can "choose" not to follow them?

Am I allowed to have preferences without knowing how to maximize those preferences, or while being irrational a...

3 · Rohin Shah · 1mo: There's a difference between "the agent sometimes makes mistakes in getting what it wants" and "the agent does the literal opposite of what it wants"; in the latter case you have to wonder what the word "wants" even means any more. My understanding is that you want to include cases like "it's a fixed-sum game, but agent B decides to be maximally aligned / cooperative and do whatever maximizes A's utility", and in that case I start to question what exactly B's utility function meant in the first place. I'm told that Minimal Rationality [https://mitpress.mit.edu/books/minimal-rationality] addresses this sort of position, where you allow the agent to make mistakes, but don't allow it to be e.g. literally pessimal since at that point you have lost the meaning of the word "preference". (I kind of also want to take the more radical position where when talking about abstract agents the only meaning of preferences is "revealed preferences", and then in the special case of humans we also see this totally different thing of "stated preferences" that operates at some totally different layer of abstraction and where talking about "making mistakes in achieving your preferences" makes sense in a way that it does not for revealed preferences. But I don't think you need to take this position to object to the way it sounds like you're using the term here.)
Open problem: how can we quantify player alignment in 2x2 normal-form games?

Right, thanks!

1. I think I agree that payout represents player utility.
2. The agent's decision can be made in any way. Best response, worst response, random response, etc.

I just don't want to assume the players are making decisions via best response to each strategy profile (which is just some joint strategy of all the game's players). Like, in rock-paper-scissors, if we consider the strategy profile P1: rock, P2: scissors, I'm not assuming that P2 would respond to this by playing paper.

And when I talk about 'responses', I do mean 'response' in the 'best r...

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B's impact alignment with A, relative to the counterfactuals where B did the best or worst job possible.

There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).

Open problem: how can we quantify player alignment in 2x2 normal-form games?

You're right. Per Jonah Moss's comment, I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.

3 · Vanessa Kosoy · 1mo: I don't think in this case $a_{B/A}$ should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game B can be equally well conceptualized as maximally aligned or as maximally anti-aligned. It is true that if, out of some set of objects, you consider the subset of those that have $a_{B/A} = 1$, then it's natural to include the undefined cases too. But if, out of some set of objects, you consider the subset of those that have $a_{B/A} = 0$, then it's also natural to include the undefined cases. This is similar to how $(0,0) \in \mathbb{R}^2$ is simultaneously in the closure of $\{x/y = 1\}$ and in the closure of $\{x/y = -1\}$, so $0/0$ can be considered to be either 1 or −1 (or any other number) depending on context.
Open problem: how can we quantify player alignment in 2x2 normal-form games?

This also suggests that "selfless" perfect B/A alignment is possible in zero-sum games, with the "maximal misalignment" only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.

Thanks for taking the time to read further / understand what I'm trying to communicate. Can you point me to the perspective you consider standard, so I know what part of my communication was unclear / how to reply to the claim that I'm not using "standard" payouts/utility?

2 · Dagon · 1mo: Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates. http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked into my understanding to the point that I don't know where I first saw it. I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences by each player). So the question is "what DO these payout numbers represent"? and "what other factors go into an agent's decision of which row/column to choose"?
Open problem: how can we quantify player alignment in 2x2 normal-form games?

In a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that "it doesn't necessarily depend on $u_B$" is a feature, not a bug.

EDIT: To handle constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can't affect A's expected utility, and so it's not possible for B to act against A's interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?

1 · Vanessa Kosoy · 1mo: In common-payoff games the denominator is not zero, in general. For example, suppose that $S_A = S_B = \{a, b\}$, $u_A(a,a) = u_A(b,b) = 1$, $u_A(a,b) = u_A(b,a) = 0$, $u_B \equiv u_A$, $\alpha = \beta = \delta_a$. Then $a_{B/A}(\alpha, \beta) = 1$, as expected: current payoff is 1, if B played b it would be 0.
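The numbers in this reply can be checked with a short sketch. The formula used here is an assumption: this thread describes the metric only as B's effect on A's expected utility relative to the best and worst things B could have done, so the code uses the normalization (current − worst) / (best − worst), which may differ in detail from the definition in the original answer. The function names are hypothetical, and the zero-denominator case is returned as `None` rather than settled either way.

```python
def expected_u_a(u_a, alpha, beta):
    """A's expected utility when A plays mixed strategy alpha and B plays beta."""
    return sum(
        p * q * u_a[(s_a, s_b)]
        for s_a, p in alpha.items()
        for s_b, q in beta.items()
    )


def alignment_b_to_a(u_a, alpha, beta, b_strategies):
    """Assumed normalization: where B's actual play falls between B's worst and best play for A."""
    current = expected_u_a(u_a, alpha, beta)
    # Expected utility is linear in beta, so pure strategies attain the max and min.
    counterfactuals = [expected_u_a(u_a, alpha, {s_b: 1.0}) for s_b in b_strategies]
    best, worst = max(counterfactuals), min(counterfactuals)
    if best == worst:
        return None  # B cannot affect A's expected utility; left undefined here
    return (current - worst) / (best - worst)


# The common-payoff example from the reply: both players get 1 on a match, 0 otherwise.
u_a = {("a", "a"): 1.0, ("b", "b"): 1.0, ("a", "b"): 0.0, ("b", "a"): 0.0}
delta_a = {"a": 1.0}  # point mass on strategy a

print(alignment_b_to_a(u_a, delta_a, delta_a, ["a", "b"]))  # 1.0
```

With a constant payoff matrix the same function returns `None`, which is the case the EDIT above proposes defining as 1.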
Open problem: how can we quantify player alignment in 2x2 normal-form games?

I'm not 100% sure I am understanding your terminology. What does it mean to "play stag against (stag,stag)" or to "defect against cooperate/cooperate"?

Each player i has a response function, which maps a strategy profile to that player's response. Given some strategy profile (like stag/stag), player i selects a response. I mean "response" in terms of "best response" - I don't necessarily mean that there's an iterated game. This captures all the relevant "outside details" for how decisions are made.

If your opponent is not in any sense a utility-maximizer then I don

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I like this answer, and I'm going to take more time to chew on it.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

The definition of utility is "the thing people maximize."

Only applicable if you're assuming the players are VNM-rational over outcome lotteries, which I'm not. Forget expected utility maximization.

It seems to me that people are making the question more complicated than it has to be, by projecting their assumptions about what a "game" is. We have payoff numbers describing how "good" each outcome is to each player. We have the strategy spaces, and the possible outcomes of the game. And here's one approach: fix two response functions in this game, which are f...

2 · Rohin Shah · 1mo: Then what's the definition / interpretation of "payoff", i.e. the numbers you put in the matrix? If they're not utilities, are they preferences? How can they be preferences if agents can "choose" not to follow them? Where do the numbers come from? Note that Vanessa's answer doesn't need to depend on $u_B$, which I think is its main strength and the reason it makes intuitive sense. (And I like the answer much less when $u_B$ is used to impose constraints.)
Open problem: how can we quantify player alignment in 2x2 normal-form games?

In static games of complete, perfect information, a normal-form representation of a game is a specification of players' strategy spaces and payoff functions.

You are playing prisoner's dilemma when certain payoff inequalities are satisfied in the normal-form representation. That's it. There is no canonical assumption that players are expected utility maximizers, or expected payoff maximizers.
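As a minimal sketch of "certain payoff inequalities", the usual textbook condition for a symmetric one-shot prisoner's dilemma is T > R > P > S (temptation, reward, punishment, sucker's payoff). The function name and the example numbers are hypothetical; note that the check looks only at the payoff table and assumes nothing about how the players actually choose.

```python
def is_prisoners_dilemma(payoff):
    """Check T > R > P > S for the row player of a symmetric 2x2 game.

    payoff[(my_move, their_move)] is the row player's payoff, with
    "C" for cooperate and "D" for defect.
    """
    temptation = payoff[("D", "C")]
    reward = payoff[("C", "C")]
    punishment = payoff[("D", "D")]
    sucker = payoff[("C", "D")]
    return temptation > reward > punishment > sucker


# Hypothetical payoff tables for illustration.
pd_payoffs = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
stag_hunt_payoffs = {("C", "C"): 4, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 2}

print(is_prisoners_dilemma(pd_payoffs))         # True
print(is_prisoners_dilemma(stag_hunt_payoffs))  # False
```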

because the utilities to player B are not dependent on what you do.

Noting that I don't follow what you mean by this: ...

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Payout correlation IS the metric of alignment.

Do you have a citation? You seem to believe that this is common knowledge among game theorists, but I don't think I've ever encountered that.

Jacob and I have already considered payout correlation, and I agree that it has some desirable properties. However,

• it's symmetric across players,
• it's invariant to player rationality
• which matters, since alignment seems to not just be a function of incentives, but of what-actually-happens and how that affects different players
• it equally weights each outcome in th
1 · Ericf · 2mo: It's a definitional thing. The definition of utility is "the thing people maximize." If you set up your 2x2 game to have utilities in the payout matrix, then by definition both actors will attempt to pick the box with the biggest number. If you set up your 2x2 game with direct payouts from the game that don't include psychic (e.g. "I just like picking the first option given") or reputational effects, then any concept of alignment is one of:

1. Assume the players are trying for the biggest number: how much will they be attempting to land on the same box?
2. Alignment is completely outside of the game, and is one of the features of the function that converts game payouts to global utility.

You seem to be muddling those two, and wondering "how much will people attempt to land on the same box, taking into account all factors, but only defining the boxes in terms of game payouts." The answer there is "you can't." Because people (and computer programs) have wonky screwed up utility functions (e.g. (spoiler alert) https://en.wikipedia.org/wiki/Man_of_the_Year_(2006_film)).
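For concreteness, the payout correlation debated in this thread can be computed as a Pearson correlation over the cells of the payoff matrix. This is a sketch under assumptions: each outcome is weighted equally, the helper name `payout_correlation` is hypothetical, and the example payoffs are made up. It makes the symmetry property visible, since swapping the two players leaves the value unchanged.

```python
from math import sqrt


def payout_correlation(payoffs):
    """Pearson correlation between the players' payoffs, weighting each outcome equally.

    payoffs maps each outcome to a (payoff_a, payoff_b) pair. Returns None when
    either player's payoff is constant, since the correlation is then undefined.
    """
    pairs = list(payoffs.values())
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


# Hypothetical zero-sum game (matching pennies) and its player-swapped version.
zero_sum = {("H", "H"): (1, -1), ("H", "T"): (-1, 1), ("T", "H"): (-1, 1), ("T", "T"): (1, -1)}
swapped = {outcome: (b, a) for outcome, (a, b) in zero_sum.items()}

print(payout_correlation(zero_sum))  # -1.0
print(payout_correlation(swapped))   # -1.0
```

Nothing in this computation refers to how either player actually responds, which is the "invariant to player rationality" property listed above.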
Open problem: how can we quantify player alignment in 2x2 normal-form games?

Thanks for the thoughtful response.

In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then we are not actually playing the stag hunt game. (Because part of what it means to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.)

It seems to me like you're assuming that players must respond rationally, or else they're playing a different game, in some sense. But why? The stag hunt game is defined by a certain set of payoff inequal...

1 · gjm · 2mo: I'm not 100% sure I am understanding your terminology. What does it mean to "play stag against (stag,stag)" or to "defect against cooperate/cooperate"? If your opponent is not in any sense a utility-maximizer then I don't think it makes sense to talk about your opponent's utilities, which means that it doesn't make sense to have a payout matrix denominated in utility, which means that we are not in the situation of my second paragraph above ("The meaning generally assumed in game theory..."). We might be in the situation of my last-but-two paragraph ("Or maybe we're playing a game in which..."): the payouts might be something other than utilities. Dollars, perhaps, or just numbers written on a piece of paper. In that case, all the things I said about that situation apply here. In particular, I agree that it's then reasonable to ask "how aligned is B with A's interests?", but I think this question is largely decoupled from the specific game and is more about the mapping from (A's payout, B's payout) to (A's utility, B's utility). I guess there are cases where that isn't enough, where A's and/or B's utility is not a function of the payouts alone. Maybe A just likes saying the word "defect". Maybe B likes to be seen as the sort of person who cooperates. Etc. But at this point it feels to me as if we've left behind most of the simplicity and elegance that we might have hoped to bring by adopting the "two-player game in normal form" formalism in the first place, and if you're prepared to consider scenarios where A just likes choosing the top-left cell in a 2x2 array then you also need to consider ones like the ones I described earlier in this paragraph -- where in fact it's not just the 2x2 payout matrix that matters but potentially any arbitrary details about what words are used when playing the game, or who is watching, or anything else.
So if you're trying to get to the essence of alignment by considering simple 2x2 games, I think it would be best to leave that sor
0 · Ericf · 2mo: Quote: Or maybe we're playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we're in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many dollars we will get in various cases -- but maybe I'm a billionaire and literally don't care whether I get $1 or $10 and figure I might as well try to maximize your payout, or maybe you're a perfect altruist and (in the absence of any knowledge about our financial situations) you just want to maximize the total take, or maybe I'm actually evil and want you to do as badly as possible. So, if the other player is "always cooperate" or "always defect" or any other method of determining results that doesn't correspond to the payouts in the matrix shown to you, then you aren't playing "prisoner's dilemma" because the utilities to player B are not dependent on what you do. In all these games, you should pick your strategy based on how you expect your counterparty to act, which might or might not include the "in game" incentives as influencers of their behavior.
Open problem: how can we quantify player alignment in 2x2 normal-form games?

I don't follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?

Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn't have to select a best response. For example, imagine playing stag hunt with someone who responds 'hare' to stag/stag; this isn't a best response for them, but it minimizes your payoff. However, another partner could respond 'stag' to stag/stag, which (I think) makes them "less unaligned with you" than the partner who responds 'hare' to stag/stag.

2 Dagon 2mo: Payout correlation IS the metric of alignment. A player who isn't trying to maximize their (utility) payout is actually not playing the game you've defined. You're simply incorrect (or describing a different payout matrix than you state) that a player doesn't "have to select a best response".
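One hedged way to make the "payout correlation" proposal concrete is to compute the Pearson correlation between the two players' payoffs across the four cells of a 2x2 game. The payoff numbers and the name `payoff_correlation` below are illustrative assumptions, not taken from the thread, and the sketch does not address the objection that a player's response policy need not be a best response.

```python
from math import sqrt


def payoff_correlation(payoffs_a, payoffs_b):
    """Pearson correlation between two players' payoffs across outcome cells."""
    n = len(payoffs_a)
    mean_a = sum(payoffs_a) / n
    mean_b = sum(payoffs_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(payoffs_a, payoffs_b))
    var_a = sum((a - mean_a) ** 2 for a in payoffs_a)
    var_b = sum((b - mean_b) ** 2 for b in payoffs_b)
    if var_a == 0 or var_b == 0:
        # The metric says nothing when one player's payoff never varies.
        raise ValueError("correlation is undefined for a constant payoff")
    return cov / sqrt(var_a * var_b)


# Cells are ordered (A's move, B's move):
# (stag, stag), (stag, hare), (hare, stag), (hare, hare).
# All payoff values are hypothetical.
stag_hunt_a = [3, 0, 2, 1]
stag_hunt_b = [3, 2, 0, 1]

# A fixed-sum game: B's payoff is the negation of A's in every cell.
zero_sum_a = [1, -1, -1, 1]
zero_sum_b = [-1, 1, 1, -1]

# A pure coordination game: both players always receive the same payoff.
coordination_a = [2, 0, 0, 1]
coordination_b = [2, 0, 0, 1]

stag_hunt_score = payoff_correlation(stag_hunt_a, stag_hunt_b)
zero_sum_score = payoff_correlation(zero_sum_a, zero_sum_b)
coordination_score = payoff_correlation(coordination_a, coordination_b)
```

Under this particular metric, any fixed-sum game with non-constant payoffs scores -1 and any game with identical payoffs scores +1, while the hypothetical stag hunt lands in between. Whether this captures "alignment" is exactly what the thread disputes.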
MDP models are determined by the agent architecture and the environmental dynamics

Not from the paper. I just wrote it.

I don't think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment.

It isn't the size of the object that matters here, the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.

Sure, but I still don't understand the argument here. It's trivial to write a reward function that doesn't yield instrumental convergen

3 Ofer Givoli 2mo: Consider adding to the paper a high-level/simplified description of the environments for which the following sentence from the abstract applies: "We prove that for most prior beliefs one might have about the agent’s reward function [...] one should expect optimal policies to seek power in these environments." (If it's the set of environments in which "the “vast majority” of RSDs are only reachable by following a subset of policies" consider clarifying that in the paper). It's hard (at least for me) to infer that from the formal theorems/definitions. My "unrolling trick" argument doesn't require an easy way to factor states into [action history] and [the rest of the state from which the action history can't be inferred]. A sufficient condition for my argument is that the complete action history could be inferred from every reachable state. When this condition is fulfilled, the environment implicitly contains an action log (for the purpose of my argument), and thus the POWER (IID) of all the states is equal. And as I've argued before, this condition seems plausible for sufficiently complex real-world-like environments. BTW, any deterministic time-reversible [https://en.wikipedia.org/wiki/Time_reversibility] environment fulfills this condition, except for cases where multiple actions can yield the same state transition (in which case we may not be able to infer which of those actions were chosen at the relevant time step). It's easier to find reward functions that incentivize a given action sequence if the complete action history can be inferred from every reachable state (and the easiness depends on how easy it is to compute the action history from the state). I don't see how this fact relates to instrumental convergence supposedly disappearing for "most objectives" [EDIT: when using a simplicity prior over objectives; otherwise, instrumental convergence may not apply regardless]. 
Generally, if an action log constitutes a tiny fraction of the environment, its existence
MDP models are determined by the agent architecture and the environmental dynamics

I was looking for some high-level/simplified description

Ah, I see. In addition to the cited explanation, see also: "optimal policies tend to take actions which strictly preserve optionality*", where the optionality preservation is rather strict (requiring a graphical similarity, and not just "there are more options this way than that"; ironically, this situation is considerably simpler in arbitrary deterministic computable environments, but that will be the topic of a future post).

Isn't the thing we condition on here similar (roughly speaking) to you

1 Ofer Givoli 2mo: Does this quote refer to a passage from the paper? (I didn't find it.) There are very few reward functions that rely on action-history—that can be specified in a simple way—relative to all the reward functions that rely on action-history (you need at least 2n bits to specify a reward function that considers n actions, when using a uniform prior). Also, I don't think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment. If we assume that the action logger can always "detect" the action that the agent chooses, this issue doesn't apply. (Instead of the agent being "dead" we can simply imagine the robot/actuators are in a box and can't influence anything outside the box; which is functionally equivalent to being "dead" if the box is a sufficiently small fraction of the environment.) Sure, but I still don't understand the argument here. It's trivial to write a reward function that doesn't yield instrumental convergence regardless of whether one can infer the complete action history from every reachable state. Every constant function is such a reward function.
MDP models are determined by the agent architecture and the environmental dynamics

(I continued this discussion with Adam in private - here are some thoughts for the public record)

• There is not really a subjective modeling decision involved because given an interface (state space and action space), the dynamics of the system are a real world property we can look for concretely.
• Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the "5-googleplex" one).

I think I'm claiming first bullet. I a... (read more)

MDP models are determined by the agent architecture and the environmental dynamics

Thanks for taking the time to write this out.

Regarding the theorems (in the POWER paper; I've now spent some time on the current version): The abstract of the paper says: "With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment." I didn't find a description of those sufficient conditions (maybe I just missed it?).

I'm sorry - although I think I mentioned it in passing, I did not draw sufficient attention to the fact that I've been talking ... (read more)

3 Ofer Givoli 2mo: It seems to me that the (implicit) description in the paper of the set of environments over which "one should expect optimal policies to seek power" ("for most prior beliefs one might have about the agent’s reward function") involves a lot of formalism/math. I was looking for some high-level/simplified description (in English), and found the following (perhaps there are other passages that I missed): Isn't the thing we condition on here similar (roughly speaking) to your interpretation of instrumental convergence? (Is the condition for when "[…] one should expect optimal policies to seek power" made weaker by another theorem?) I think that using a simplicity prior over reward functions has a similar effect to "restricting to certain kinds of reward functions". I didn't understand the point you were making with your explanation that involved a max-ent distribution. Why is the action logger treated in your explanation as some privileged object? What's special about it relative to all the other stuff that's going on in our arbitrarily complex environment? If you imagine an MDP environment where the agent controls a robot in a room that has a security camera in it, and the recorded video is part of the state, then the recorded video is doing all the work that we need an action logger to do (for the purpose of my argument). In my action-logger example, the action log is just a tiny part of the state representation (just like a certain blog or a video recording are a very tiny part of the state of our universe). The reward function is a function over states (or state-action pairs) as usual, not state-action histories. My "unrolling trick" doesn't involve utility functions that are defined over state(-action) histories.
MDP models are determined by the agent architecture and the environmental dynamics

I don't understand your point in this exchange. I was being specific about my usage of model; I meant what I said in the original post, although I noted room for potential confusion in my comment above. However, I don't know how you're using the word.

I don’t use the term model in my previous reply anyway.

You used the word 'model' in both of your prior comments, and so the search-replace yields "state-abstraction-irrelevant abstractions." Presumably not what you meant?

I already pointed out a concrete difference: I claim it’s reasonable to say there ar

MDP models are determined by the agent architecture and the environmental dynamics

I read your formalism, but I didn't understand what prompted you to write it. I don't yet see the connection to my claims.

If so, I might try to formalize it.

Yeah, I don't want you to spend too much time on a bulletproof grounding of your argument, because I'm not yet convinced we're talking about the same thing.

In particular, if the argument's like, "we usually express reward functions in some featurized or abstracted way, and it's not clear how the abstraction will interact with your theorems" / "we often use different abstractions to express differ... (read more)

-3 Repetitive Experimenter 2mo: I don’t think it’s a good use of time to get into this if you weren’t being specific about your usage of ‘model’ or the claim you made previously because I already pointed out a concrete difference: I claim it’s reasonable to say there are three alternatives while you claim there are two alternatives. (If it helps you, you can search-replace model-irrelevant to state-abstraction because I don’t use the term model in my previous reply anyway.)
MDP models are determined by the agent architecture and the environmental dynamics

say we agree that our state abstraction needs to be model-irrelevant

Why would we need that, and what is the motivation for "models"? The moment we give the agent sensors and actions, we're done specifying the rewardless MDP (and its model).

ETA: potential confusion - in some MDP theory, the “model” is a model of the environment dynamics. E.g., in deterministic environments, the model is shown with a directed graph. I don’t use “model” to refer to an agent’s world model over which it may have an objective function. I should have chosen a better word, or clarifi... (read more)

-1 Repetitive Experimenter 2mo: This was why I gave a precise definition of model-irrelevance. I'll step through your points using the definition:
1. Consider the underlying environment (assumed Markovian).
2. Consider different state/action encodings (model-irrelevant abstractions) we might supply the agent.
3. For each, fix a reward function distribution.
4. See what the theory predicts.
The problem I'm trying to highlight lies in point three. Each task is a reward function you could have the agent attempt to optimize. Every abstraction/encoding fixes a set of rewards under which the abstraction is model-irrelevant. This means the agent can successfully optimize these rewards. My claim is that there is a third alternative: you may claim that the reward function given to the agent does not satisfy model-irrelevance. This can be the case even if the underlying dynamics are Markovian and the abstraction of the transitions satisfies model-irrelevance. That may take a while. The argument above is a reasonable candidate for a lemma. A useful example would show that the third alternative exists. Do you agree this is the crux of your disagreement with my objection? If so, I might try to formalize it.
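The "third alternative" can be illustrated with a toy check. The sketch below assumes the standard notion of a model-irrelevance abstraction from the state-abstraction literature (states mapped to the same abstract state must agree on rewards and on aggregated transition probabilities); the commenter's own definition is not reproduced in this excerpt, so the match is an assumption. The function names, the three-state MDP, and both reward functions are hypothetical.

```python
def transitions_agree(states, actions, phi, P, tol=1e-9):
    """True if states sharing an abstract state have equal aggregated dynamics.

    phi maps each ground state to an abstract state.
    P[(s, a)] is a dict from next ground state to probability.
    """
    abstract_states = set(phi.values())
    for s1 in states:
        for s2 in states:
            if phi[s1] != phi[s2]:
                continue
            for a in actions:
                for x in abstract_states:
                    mass1 = sum(p for t, p in P[(s1, a)].items() if phi[t] == x)
                    mass2 = sum(p for t, p in P[(s2, a)].items() if phi[t] == x)
                    if abs(mass1 - mass2) > tol:
                        return False
    return True


def rewards_agree(states, actions, phi, R, tol=1e-9):
    """True if states sharing an abstract state receive equal rewards."""
    return all(
        abs(R[(s1, a)] - R[(s2, a)]) <= tol
        for s1 in states
        for s2 in states
        if phi[s1] == phi[s2]
        for a in actions
    )


# Hypothetical MDP: s0 and s1 are merged into abstract state "X".
states = ["s0", "s1", "s2"]
actions = ["go"]
phi = {"s0": "X", "s1": "X", "s2": "Y"}
P = {
    ("s0", "go"): {"s2": 1.0},
    ("s1", "go"): {"s2": 1.0},
    ("s2", "go"): {"s2": 1.0},
}

# This reward respects the abstraction: s0 and s1 are rewarded identically.
reward_compatible = {("s0", "go"): 0.0, ("s1", "go"): 0.0, ("s2", "go"): 1.0}
# This reward distinguishes s0 from s1, which the abstraction cannot see.
reward_incompatible = {("s0", "go"): 1.0, ("s1", "go"): 0.0, ("s2", "go"): 0.0}

dynamics_ok = transitions_agree(states, actions, phi, P)
compatible_ok = rewards_agree(states, actions, phi, reward_compatible)
incompatible_ok = rewards_agree(states, actions, phi, reward_incompatible)
```

In this toy case the transition condition holds while the second reward function violates the reward condition, which is one reading of the situation the comment describes.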
MDP models are determined by the agent architecture and the environmental dynamics

I'm not trying to define here the set of reward functions over which instrumental convergence arguments apply (they obviously don't apply to all reward functions, as for every possible policy you can design a reward function for which that policy is optimal).

ETA: I agree with this point in the main - they don't apply to all reward functions. But, we should be able to ground the instrumental convergence arguments via reward functions in some way. Edited out because I read through that part of your comment a little too fast, and replied to something you didn'... (read more)

1 Ofer Givoli 2mo: I'll address everything in your comment, but first I want to zoom out and say/ask:
1. In environments that have a state graph that is a tree-with-constant-branching-factor, the POWER—defined over IID-over-states reward distribution—is equal in all states. I argue that environments with very complex physical dynamics are often like that, but not if at some time step the agent can't influence the environment. (I think we agree so far?) I further argue that we can take any MDP environment and "unroll" its state graph into a tree-with-constant-branching-factor (e.g. by adding an "action log" to the state representation) such that we get a "functionally equivalent" MDP in which the POWER (IID) of all the states are equal. My best guess is that you don't agree with this point, or think that the instrumental convergence thesis doesn't apply in a meaningful sense to such MDPs (but I don't yet understand why).
2. Regarding the theorems (in the POWER paper [https://arxiv.org/pdf/1912.01683.pdf]; I've now spent some time on the current version): The abstract of the paper says: "With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment." I didn't find a description of those sufficient conditions (maybe I just missed it?). AFAICT, MDPs that contain "reversible actions" (other than self-loops in terminal states) are generally problematic for POWER (IID). (I'm calling action a from state s "reversible" if it allows the agent to return to s at some point). POWER-seeking (in the limit as γ approaches 1) will always imply choosing a reversible action over a non-reversible action, and if the only reversible action is a self-loop, POWER-seeking means staying in the same state forever. Note that if there are sufficiently many terminal states (or loops more generally) that require a certain non
MDP models are determined by the agent architecture and the environmental dynamics

Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…

To clarify: when I say that taking over the world is "instrumentally convergent", I mean that most objectives incentivize it. If you mean something else, please tell me. (I'm starting to think there must be a serious miscommunication somewhere if we're still disagreeing about this?)

So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward fun... (read more)

1 Ofer Givoli 2mo: I was talking about a particular example, with a particular reward function that I had in mind. We seemed to disagree about whether instrumental convergence arguments apply there, and my purpose in that comment was to argue that they do. I'm not trying to define here the set of reward functions over which instrumental convergence arguments apply (they obviously don't apply to all reward functions, as for every possible policy you can design a reward function for which that policy is optimal). E.g. humans noticing that something weird is going on and trying to shut down the process. (Shutting down the process doesn't mean that new strings won't appear in the environment and cause the state graph to become a tree-with-constant-branching-factor due to complex physical dynamics.) Not in the example I have in mind. Again, let's say the state representation determines the location of every atom in that earth-like environment. (I think that's the key miscommunication here; the MDP I'm thinking about is NOT a "sequential string output MDP", if I understand your use of that phrase correctly. [EDIT: my understanding is that you use that phrase to describe an MDP in which a state is just the sequence of strings in the exchange so far.] [EDIT 2: I think this miscommunication is my fault, due to me writing in my first comment: "the state representation may be uniquely determined by all the text that was written so far by both the customer and the chatbot", sorry for that.]) I agree the statement would be true with any possible string; this doesn't change the point I'm making with it. (Consider this to be an application of the more general statement with a particular string.) For every subset of branches in the tree you can design a reward function for which every optimal policy tries to go down those branches; I'm not saying anything about "most reward functions". 
I would focus on statements that apply to "most reward functions" if we dealt with an AI that had a reward funct
MDP models are determined by the agent architecture and the environmental dynamics

Yeah, I claim that this intuition is actually wrong and there's no instrumental convergence in this environment. Complicated & contains actors doesn't mean you can automatically conclude instrumental convergence. The structure of the environment is what matters for "arbitrarily capable agents"/optimal policies (learned policies are probably more dependent on representation and training process).

So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another? Because, again, this en... (read more)

1 Ofer Givoli 2mo: (Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…) Consider a string, written by the chatbot, that "hacks" the customer and causes them to invoke a process that quickly takes control over most of the computers on earth that are connected to the internet, then "hacks" most humans on earth by showing them certain content, and so on (to prevent interference and to seize control ASAP); for the purpose of maximizing whatever counts as the total discounted payments by the customer (which can look like, say, setting particular memory locations in a particular computer to a particular configuration); and minimizing low probability risks (from the perspective of the agent). If such a string (one that causes the above scenario) exists, then any optimal policy will either involve such a string or different strings that allow at least as much expected return.
MDP models are determined by the agent architecture and the environmental dynamics

For that particular reward function, yes, the optimal policies may be very complicated. But why are there instrumentally convergent goals in that environment? Why should I expect capable agents in that environment to tend to output certain kinds of string sequences, over other kinds of string sequences?

(Also, is the amount of money paid by the client part of the state? Or is the agent just getting rewarded for the total number of purchase-assents in the conversation over time?)

1 Ofer Givoli 2mo: Yes; let's say the state representation determines the location of every atom in that earth-like environment. The idea is that the environment is very complicated (and contains many actors) and thus the usual arguments for instrumental convergence apply. (If this fails to address any of the above issues let me know.)
MDP models are determined by the agent architecture and the environmental dynamics

Though it involves the unresolved (for me) embedded agency issues.

Right, that does complicate things. I'd like to get a better picture of the considerations here, but given how POWER behaves on environment structures so far, I'm pretty confident it'll adapt to appropriate ways of modelling the situation.

Let's side-step those issues by not having a computer running the agent inside the environment, but rather having the text string that the agent chooses in each time step magically appear somewhere in the environment. The question is now whether it's possib

1 Ofer Givoli 2mo: I think we're still not thinking about the same thing; in the example I'm thinking about the agent is supposed to fill the role of a human salesperson, and the reward function is (say) the amount of money that the client paid (possibly over a long time period). So an optimal policy may be very complicated and involve instrumentally convergent goals.
MDP models are determined by the agent architecture and the environmental dynamics

To further clarify:

• pure-text-interaction-MDP: generated by the mentioned state and action representation, with environment dynamics allowing the agent to talk to a customer.
• Since you said that the induced model is regular, this implies that the agent won't get shut down for saying bad/weird things. If it could, then the graph is no longer regular under the previous state and action representations.
• The agent also isn't concerned with real-world resources, because it isn't modelling them. They aren't observable and they don't affect transition probabilities.
1 Ofer Givoli 2mo: I was imagining a formal (super-complex) MDP that looks like our world. The customer in my example is meant to be equivalent to a human on earth. But I haven't taken into account that this runs into embedded agency issues. (E.g. what does the state transition function look like when the computer that "runs the agent" is down?) Because states from which the agent can (say) prevent its computer from being turned off have larger POWER? That's an interesting point that didn't occur to me while writing that comment. Though it involves the unresolved (for me) embedded agency issues. Let's side-step those issues by not having a computer running the agent inside the environment, but rather having the text string that the agent chooses in each time step magically appear somewhere in the environment. The question is now whether it's possible to get to the same state with two different sequences of strings. This depends on the state representation & state transition function; it can be the case that the state is uniquely determined by the agent's sequence of past strings so far, which will mean POWER being equal in all states.
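The structural claim here (a state uniquely determined by the agent's past outputs yields a tree with constant branching factor) can be sketched on a toy deterministic MDP. The two-state environment, the action names, and the `unroll` helper are hypothetical; the sketch only checks the graph structure and does not compute POWER.

```python
def unroll(transition, actions, start, depth):
    """Unroll a deterministic MDP by appending an action log to each state.

    transition[(state, action)] gives the next world state.
    Returns a dict mapping each expanded node (world_state, action_log)
    to the list of its child nodes, down to the given depth.
    """
    graph = {}
    frontier = [(start, ())]
    for _ in range(depth):
        next_frontier = []
        for world_state, log in frontier:
            children = [
                (transition[(world_state, action)], log + (action,))
                for action in actions
            ]
            graph[(world_state, log)] = children
            next_frontier.extend(children)
        frontier = next_frontier
    return graph


# Hypothetical two-state world in which many action sequences reach the
# same world state, so the original state graph is not a tree.
toy_actions = ("left", "right")
toy_transition = {
    ("a", "left"): "a",
    ("a", "right"): "b",
    ("b", "left"): "a",
    ("b", "right"): "b",
}

unrolled = unroll(toy_transition, toy_actions, "a", depth=3)

# Every expanded node has one distinct child per action.
branching_factors = {len(set(children)) for children in unrolled.values()}

# Count distinct nodes and edges; a rooted tree has exactly nodes - 1 edges.
all_nodes = set(unrolled) | {c for cs in unrolled.values() for c in cs}
edge_count = sum(len(children) for children in unrolled.values())
```

Because the action log differs along every path, no two action sequences lead to the same unrolled state, even though they may lead to the same world state.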
MDP models are determined by the agent architecture and the environmental dynamics

The instrumental convergence thesis is not a fact about every situation involving "capable AI", but a thesis pointing out a reliable-seeming pattern across environments and goals. It can't be used as a black-box reason on its own - you have to argue why the reasoning applies in the environment. In particular, we assumed that the agent is interacting with the text MDP, where

the state representation [is] uniquely determined by all the text that was written so far by both the customer and the chatbot [, and the chat doesn't end when the customer leaves / stop

2 Alex Turner 2mo: To further clarify:
• pure-text-interaction-MDP: generated by the mentioned state and action representation, with environment dynamics allowing the agent to talk to a customer.
• Since you said that the induced model is regular [https://en.wikipedia.org/wiki/Regular_graph], this implies that the agent won't get shut down for saying bad/weird things. If it could, then the graph is no longer regular under the previous state and action representations.
• The agent also isn't concerned with real-world resources, because it isn't modelling them. They aren't observable and they don't affect transition probabilities.
MDP models are determined by the agent architecture and the environmental dynamics

the choice of the state representation and action space may determine whether a problem is like that.

I agree. Also: the state and action representations determine which reward functions we can express, and I claim that it makes sense for the theory to reflect that fact.

If so POWER—when defined over an IID-over-states reward distribution—is constant.

Agreed. I also don't currently see a problem here. There aren't any robustly instrumental goals in this setting, as best I can tell.

1 Ofer Givoli 2mo: If we consider a sufficiently high level of capability, the instrumental convergence thesis applies. (E.g. the agent might manipulate/hack the customer and then gain control over resources, and stop anyone from interfering with its plan.)
MDP models are determined by the agent architecture and the environmental dynamics

I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?

But that just means the subjectivity comes from the choice of the interface!

There's no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.

Sure, but if you actually have to check the power-seeking to infer the structure of the MDP, it becomes unusable for not building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking IMO is that we can