Added to the post:
Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:
if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.
My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.
My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even... (read more)
I proposed changing "instrumental convergence" to "robust instrumentality." This proposal has not caught on, and so I reverted the post's terminology. I'll just keep using 'convergently instrumental.' I do think that 'convergently instrumental' makes more sense than 'instrumentally convergent', since the agent isn't "convergent for instrumental reasons", but rather, it's more reasonable to say that the instrumentality is convergent in some sense.
For the record, the post used to contain the following section:
The robustness-of-strategy p... (read more)
Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.
Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.
Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.
And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc.) to specify a set of programs than does ... (read more)
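For concreteness, the standard invariance theorem gives the precise sense of "up to constant factors": for universal Turing machines U1 and U2, there is a constant c depending only on the two machines (not on the string being described) such that

K_U2(x) ≤ K_U1(x) + c for every string x.

That constant can be enormous, which is exactly the wiggle room the analogy points at: nothing is inexpressible under U2, but U1 can still make some programs far simpler to specify.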
As I understand it, expanding candy into A and B, but not expanding the other option, will make the ratios come out differently.
What do you mean?
If we knew what was important and what wasn't, we would be sure about the optimality. But since we think we don't know it, or might be in error about it, we treat the value as though it could be hiding anywhere.
I'm not currently trying to make claims about what variants we'll actually be likely to specify, if that's what you mean. Just that in the reasonably broad set of situations covered by my theorems, the vast majority of variants of every objective function will make power-seeking optimal.
Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment.
My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.
Well, maybe here's a better way of communicating what I'm after:
Suppose that you have beliefs about t... (read more)
I'm not sure if you're arguing that this is a good world in which to think about alignment.
I am not arguing this. Quoting my reply to ofer:
I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail.
(Edited post to clarify)
Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values
Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.)
My take on it has been, the theorem's bottleneck assumption implies that you can't reach S again after taking action a1 or a2, which rules out cycles.
If the agent is sufficiently farsighted (i.e. the discount is near 1)
I'd change this to "optimizes average reward (i.e. the discount equals 1)". Otherwise looks good!
I don't understand what you mean. Nothing contradicts the claim, if the claim is made properly, because the claim is a theorem and always holds when its preconditions do. (EDIT: I think you meant Rohin's claim in the summary?)
I'd say that we can just remove the quoted portion and just explain "a1 and a2 lead to disjoint sets of future options", which automatically rules out the self-loop case. (But maybe this is what you meant, ofer?)
Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?
Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.
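Here's a toy, invented illustration of that counting argument (the outcomes and numbers below are made up for illustration and aren't from the paper): after breaking the vase, only one terminal outcome remains reachable, whereas keeping the vase leaves several reachable outcomes, including breaking it later. Enumerating simple reward functions over those outcomes, strictly more of them make "keep" uniquely optimal than make "break" uniquely optimal:

```python
# Toy illustration (not the paper's formal setup): count which of the two first moves
# is strictly optimal, across all reward functions taking values in {0, 1, 2}.
import itertools

keep_outcomes  = ["vase_on_shelf", "vase_in_cupboard", "vase_broken_later"]
break_outcomes = ["vase_broken_now"]
all_outcomes   = keep_outcomes + break_outcomes

break_optimal = keep_optimal = ties = 0
for values in itertools.product(range(3), repeat=len(all_outcomes)):
    reward = dict(zip(all_outcomes, values))
    best_keep  = max(reward[o] for o in keep_outcomes)   # best outcome reachable if the vase survives
    best_break = max(reward[o] for o in break_outcomes)  # best outcome reachable after breaking it
    if best_break > best_keep:
        break_optimal += 1
    elif best_keep > best_break:
        keep_optimal += 1
    else:
        ties += 1

print(break_optimal, keep_optimal, ties)  # 9, 45, 27
```

Because every "break now" option has a counterpart among the "keep" options (break it later), any reward function for which breaking is strictly optimal can be permuted into one for which keeping is strictly optimal, while some keep-optimal reward functions have no break-optimal counterpart - hence the lopsided counts.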
What criterion does that environment violate?
Right, good question. I'll explain the general principle (not stated in the paper - yes, I agree this needs to be fixed!), and th... (read more)
This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.
There are fewer ways for vase-breaking to be optimal. Optimal policies will tend to avoid breaking the vase, even though some don't.
Consider the following counter example (in which the last state is equivalent to the agent being shut down):
This is just making my point - average-optimal policies tend to end up in any state but the last state, even though at any given state they tend to progres... (read more)
I haven't seen the paper support that claim.
The paper supports the claim with:
This post supports the claim with:
So yes, this is sufficient support for speculation that most relevant environments have these symmetries.
Maybe I just missed it, but I
My apologies - I had thought I had accidentally moved your comment to AF by unintentionally replying to your comment on AF, and so (from my POV) I "undid" it (for both mine and yours). I hadn't realized it was already on AF.
For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of v. 7 of the paper?).
I don't think it will be useful for me to engage in detail, given that we've already debated these points at length, without much consensus being reached.
I like the thought. I don't know if this sketch works out, partly because I don't fully understand it. Your conclusion seems plausible, but I want to develop the arguments further.
As a note: the simplest function, period, is probably the constant function, and other very simple functions probably make both power-seeking and not-power-seeking optimal. So if you permute that one, you'll get another function for which power-seeking and not-power-seeking actions are both optimal.
This in turn leads to one of the strongest results of Alex's paper: for any "well-behaved" distribution on reward functions, if the environment has the sort of symmetry I mentioned, then for at least half of the permutations of this distribution, at least half of the probability mass will be on reward functions for which the optimal policy is power-seeking.
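Restated symbolically (my paraphrase of the sentence above, not a quotation of the theorem): for any such distribution D over reward functions, at least half of the permuted distributions φ·D in D's orbit satisfy

Pr_{R ∼ φ·D}[power-seeking is optimal for R] ≥ 1/2.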
I think I've been unclear in my own terminology, in part because I'm uncertain about what other people have meant by 'utility' (what you'd recover from perfect IRL / Savage's theorem, or a cardinal representation of preferences over outcomes?). My stance is that they're utilities, but that I'm not assuming the players are playing best responses in order to maximize expected utility.
How can they be preferences if agents can "choose" not to follow them?
Am I allowed to have preferences without knowing how to maximize those preferences, or while being irrational a... (read more)
I just don't want to assume the players are making decisions via best response to each strategy profile (which is just some joint strategy of all the game's players). Like, in rock-paper-scissors, if we consider the strategy profile P1: rock, P2: scissors, I'm not assuming that P2 would respond to this by playing paper.
P1: rock, P2: scissors
And when I talk about 'responses', I do mean 'response' in the 'best r... (read more)
✅ Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B's impact alignment with A, relative to the counterfactuals where B did the best or worst job possible.
There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).
You're right. Per Jonah Moss's comment, I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.
This also suggests that "selfless" perfect B/A alignment is possible in zero-sum games, with the "maximal misalignment" only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.
it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.
Thanks for taking the time to read further / understand what I'm trying to communicate. Can you point me to the perspective you consider standard, so I know what part of my communication was unclear / how to reply to the claim that I'm not using "standard" payouts/utility?
In a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that "it doesn't necessarily depend on u_B" is a feature, not a bug.
EDIT: To handle constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can't affect A's expected utility, and so it's not possible for B to act against A's interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?
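Here's a minimal sketch, in code, of how I'm reading the proposal (my own illustration - the function and payoff numbers are invented, and the zero-denominator convention is the one from the edit above): B's alignment with A at a profile is A's payoff under B's actual response, rescaled between the worst and best responses B could have chosen.

```python
# Illustrative sketch only: B's impact alignment with A at a strategy profile,
# normalized between B's worst and best possible responses (from A's perspective).

def alignment_of_B_with_A(u_A, a_strategy, b_strategies, b_response):
    """u_A maps (A's strategy, B's strategy) -> A's payoff; b_response is B's actual play."""
    payoffs = [u_A[(a_strategy, b)] for b in b_strategies]
    worst, best = min(payoffs), max(payoffs)
    if best == worst:
        # B can't affect A's payoff at all, so B can't act against A's interests:
        # treat B as (trivially) aligned, per the edit above.
        return 1.0
    return (u_A[(a_strategy, b_response)] - worst) / (best - worst)

# Made-up 2x2 payoffs for A.
u_A = {("a1", "b1"): 2, ("a1", "b2"): 0,
       ("a2", "b1"): 1, ("a2", "b2"): 1}

print(alignment_of_B_with_A(u_A, "a1", ["b1", "b2"], "b1"))  # 1.0: best response on A's behalf
print(alignment_of_B_with_A(u_A, "a1", ["b1", "b2"], "b2"))  # 0.0: worst response for A
print(alignment_of_B_with_A(u_A, "a2", ["b1", "b2"], "b1"))  # 1.0: B can't affect A's payoff
```

The sketch uses pure strategies for brevity; for a mixed outcome, the same rescaling applies to A's expected utility over B's possible responses.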
I'm not 100% sure I am understanding your terminology. What does it mean to "play stag against (stag,stag)" or to "defect against cooperate/cooperate"?
Let π_i(σ) = σ′_i be player i's response function to strategy profile σ. Given some strategy profile (like stag/stag), player i selects a response. I mean "response" in terms of "best response" - I don't necessarily mean that there's an iterated game. This captures all the relevant "outside details" for how decisions are made.
If your opponent is not in any sense a utility-maximizer then I don
I like this answer, and I'm going to take more time to chew on it.
I agree that this is a good start, but I find it unsatisfactory.
The definition of utility is "the thing people maximize."
Only applicable if you're assuming the players are VNM-rational over outcome lotteries, which I'm not. Forget expected utility maximization.
It seems to me that people are making the question more complicated than it has to be, by projecting their assumptions about what a "game" is. We have payoff numbers describing how "good" each outcome is to each player. We have the strategy spaces, and the possible outcomes of the game. And here's one approach: fix two response functions in this game, which are f... (read more)
Here is the definition of a normal-form game:
In static games of complete, perfect information, a normal-form representation of a game is a specification of players' strategy spaces and payoff functions.
You are playing prisoner's dilemma when certain payoff inequalities are satisfied in the normal-form representation. That's it. There is no canonical assumption that players are expected utility maximizers, or expected payoff maximizers.
because the utilities to player B are not dependent on what you do.
Noting that I don't follow what you mean by this: ... (read more)
Payout correlation IS the metric of alignment.
Do you have a citation? You seem to believe that this is common knowledge among game theorists, but I don't think I've ever encountered that.
Jacob and I have already considered payout correlation, and I agree that it has some desirable properties. However,
Thanks for the thoughtful response.
In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then we are not actually playing the stag hunt game. (Because part of what it means to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.)
It seems to me like you're assuming that players must respond rationally, or else they're playing a different game, in some sense. But why? The stag hunt game is defined by a certain set of payoff inequal... (read more)
I don't follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?
Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn't have to select a best response. For example, imagine playing stag hunt with someone who responds 'hare' to stag/stag; this isn't a best response for them, but it minimizes your payoff. However, another partner could respond 'stag' to stag/stag, which (I think) makes them "less unaligned" with you than the partner who responds 'hare' to stag/stag.
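To make that concrete with assumed, illustrative stag-hunt numbers:

```python
# Toy stag-hunt payoffs for you (numbers invented for illustration):
#                 partner: stag   partner: hare
#   you: stag          3               0
#   you: hare          2               1
u_you = {("stag", "stag"): 3, ("stag", "hare"): 0,
         ("hare", "stag"): 2, ("hare", "hare"): 1}

# Two partners with the same payoff matrix but different response functions,
# each mapping the profile they're responding to (here stag/stag) to a strategy.
def respond_stag(profile): return "stag"  # a best response for them; best for you
def respond_hare(profile): return "hare"  # not a best response for them; worst for you

for respond in (respond_stag, respond_hare):
    print(u_you[("stag", respond(("stag", "stag")))])  # 3, then 0
```

Same matrix, very different outcomes for you - which is the sense in which the payoff matrix alone doesn't pin down how aligned your partner is.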
Not from the paper. I just wrote it.
I don't think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment.
It isn't the size of the object that matters here, the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.
Sure, but I still don't understand the argument here. It's trivial to write a reward function that doesn't yield instrumental convergen
I was looking for some high-level/simplified description
Ah, I see. In addition to the cited explanation, see also: "optimal policies tend to take actions which strictly preserve optionality*", where the optionality preservation is rather strict (requiring a graphical similarity, and not just "there are more options this way than that"; ironically, this situation is considerably simpler in arbitrary deterministic computable environments, but that will be the topic of a future post).
Isn't the thing we condition on here similar (roughly speaking) to you
(I continued this discussion with Adam in private - here are some thoughts for the public record)
There is not really a subjective modeling decision involved because given an interface (state space and action space), the dynamics of the system are a real-world property we can look for concretely. Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the "5-googleplex" one).
I think I'm claiming first bullet. I a... (read more)
Thanks for taking the time to write this out.
Regarding the theorems (in the POWER paper; I've now spent some time on the current version): The abstract of the paper says: "With respect to a class of neutral reward function distributions, we provide sufficient conditions for when optimal policies tend to seek power over the environment." I didn't find a description of those sufficient conditions (maybe I just missed it?).
I'm sorry - although I think I mentioned it in passing, I did not draw sufficient attention to the fact that I've been talking ... (read more)
I don't understand your point in this exchange. I was being specific about my usage of model; I meant what I said in the original post, although I noted room for potential confusion in my comment above. However, I don't know how you're using the word.
I don’t use the term model in my previous reply anyway.
You used the word 'model' in both of your prior comments, and so the search-replace yields "state-abstraction-irrelevant abstractions." Presumably not what you meant?
I already pointed out a concrete difference: I claim it’s reasonable to say there ar
I read your formalism, but I didn't understand what prompted you to write it. I don't yet see the connection to my claims.
If so, I might try to formalize it.
Yeah, I don't want you to spend too much time on a bulletproof grounding of your argument, because I'm not yet convinced we're talking about the same thing.
In particular, if the argument's like, "we usually express reward functions in some featurized or abstracted way, and it's not clear how the abstraction will interact with your theorems" / "we often use different abstractions to express differ... (read more)
say we agree that our state abstraction needs to be model-irrelevant
Why would we need that, and what is the motivation for "models"? The moment we give the agent sensors and actions, we're done specifying the rewardless MDP (and its model).
ETA: potential confusion - in some MDP theory, the “model” is a model of the environment dynamics. E.g. in deterministic environments, the model is shown with a directed graph. I don’t use “model” to refer to an agent’s world model over which it may have an objective function. I should have chosen a better word, or clarifi... (read more)
I'm not trying to define here the set of reward functions over which instrumental convergence arguments apply (they obviously don't apply to all reward functions, as for every possible policy you can design a reward function for which that policy is optimal).
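(A standard construction making that parenthetical concrete: given any deterministic policy π, the reward function R_π(s, a) = 1 if a = π(s), and 0 otherwise, makes π optimal, since following π collects the maximum achievable reward of 1 at every step.)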
ETA: I agree with this point in the main - they don't apply to all reward functions. But, we should be able to ground the instrumental convergence arguments via reward functions in some way. Edited out because I read through that part of your comment a little too fast, and replied to something you didn'... (read more)
Setting aside the "arbitrary" part, because I didn't talk about an arbitrary reward function…
To clarify: when I say that taking over the world is "instrumentally convergent", I mean that most objectives incentivize it. If you mean something else, please tell me. (I'm starting to think there must be a serious miscommunication somewhere if we're still disagreeing about this?)
So we can't set the 'arbitrary' part aside - instrumentally convergent means that the incentives apply across most reward functions - not just for one. You're arguing that one reward fun... (read more)
Yeah, I claim that this intuition is actually wrong and there's no instrumental convergence in this environment. Complicated & contains actors doesn't mean you can automatically conclude instrumental convergence. The structure of the environment is what matters for "arbitrarily capable agents"/optimal policies (learned policies are probably more dependent on representation and training process).
So if you disagree, please explain why arbitrary reward functions tend to incentivize outputting one string sequence over another? Because, again, this en... (read more)
For that particular reward function, yes, the optimal policies may be very complicated. But why are there instrumentally convergent goals in that environment? Why should I expect capable agents in that environment to tend to output certain kinds of string sequences, over other kinds of string sequences?
(Also, is the amount of money paid by the client part of the state? Or is the agent just getting rewarded for the total number of purchase-assents in the conversation over time?)
Though it involves the unresolved (for me) embedded agency issues.
Right, that does complicate things. I'd like to get a better picture of the considerations here, but given how POWER behaves on environment structures so far, I'm pretty confident it'll adapt to appropriate ways of modelling the situation.
Let's side-step those issues by not having a computer running the agent inside the environment, but rather having the text string that the agent chooses in each time step magically appear somewhere in the environment. The question is now whether it's possib
To further clarify:
The instrumental convergence thesis is not a fact about every situation involving "capable AI", but a thesis pointing out a reliable-seeming pattern across environments and goals. It can't be used as a black-box reason on its own - you have to argue why the reasoning applies in the environment. In particular, we assumed that the agent is interacting with the text MDP, where
the state representation [is] uniquely determined by all the text that was written so far by both the customer and the chatbot [, and the chat doesn't end when the customer leaves / stop
the choice of the state representation and action space may determine whether a problem is like that.
I agree. Also: the state and action representations determine which reward functions we can express, and I claim that it makes sense for the theory to reflect that fact.
If so, POWER—when defined over an IID-over-states reward distribution—is constant.
Agreed. I also don't currently see a problem here. There aren't any robustly instrumental goals in this setting, as best I can tell.
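For reference, the paper's POWER is (roughly, writing it from memory rather than quoting)

POWER_D(s, γ) = (1 − γ)/γ · E_{R ∼ D}[ V*_R(s, γ) − R(s) ],

the expected optimal value attainable from a state beyond the reward collected at that state. If every state's future looks structurally the same (as seems to be the case in the unrolled environment under discussion), then under an IID-over-states reward distribution this quantity is the same everywhere, matching the claim that there are no robustly instrumental goals here.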
I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?
But that just means the subjectivity comes from the choice of the interface!
There's no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.
Sure, but if you actually have to check the power-seeking to infer the structure of the MDP, it becomes unusable for not building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking IMO is that we can