Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at turneale[at]oregonstate[dot]edu.
My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.
My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even without saying what particular actions these policies take to get there. I may not even be able to compute a single optimal policy for a single non-trivial objective, but I can still reason about the statistical tendencies of optimal policies.
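To make the "statistical tendencies" point concrete, here is a minimal sketch in a toy environment that I am making up for illustration (it is not the setting of the theorems, which are far more general). From the start state the agent either shuts down, reaching a single terminal state, or stays on and then chooses among `num_other_outcomes` terminal states. Rewards are drawn i.i.d. uniform on [0, 1]; the function name and parameters are hypothetical.

```python
import random


def fraction_avoiding_shutdown(num_other_outcomes=3, num_samples=10_000, seed=0):
    """Estimate how often the optimal policy avoids shutdown.

    Toy assumption: one terminal state is reached by shutting down, and
    `num_other_outcomes` terminal states are reachable only by staying on.
    Each sampled reward function assigns i.i.d. uniform [0, 1] rewards.
    """
    rng = random.Random(seed)
    avoided = 0
    for _ in range(num_samples):
        shutdown_reward = rng.random()
        other_rewards = [rng.random() for _ in range(num_other_outcomes)]
        # The optimal policy heads for the best terminal state, so it avoids
        # shutdown exactly when some other outcome beats the shutdown reward.
        if max(other_rewards) > shutdown_reward:
            avoided += 1
    return avoided / num_samples


if __name__ == "__main__":
    print(fraction_avoiding_shutdown())
```

With k other outcomes, the estimate should land near k / (k + 1). The sketch never says which of those outcomes a given optimal policy picks; it only counts how often shutdown is avoided.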
I proposed changing "instrumental convergence" to "robust instrumentality." This proposal has not caught on, and so I reverted the post's terminology. I'll just keep using 'convergently instrumental.' I do think that 'convergently instrumental' makes more sense than 'instrumentally convergent', since the agent isn't "convergent for instrumental reasons", but rather, it's more reasonable to say that the instrumentality is convergent in some sense.
For the record, the post used to contain the following section:
The robustness-of-strategy phenomenon became known as the instrumental convergence hypothesis, but I propose we call it robust instrumentality instead.
From the paper’s introduction:
An action is said to be instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The so-called instrumental convergence thesis is the claim that agents with many different goals, if given time to learn and plan, will eventually converge on exhibiting certain common patterns of behavior that are robustly instrumental (e.g. survival, accessing usable energy, access to computing resources). Bostrom et al.'s instrumental convergence thesis might more aptly be called the robust instrumentality thesis, because it makes no reference to limits or converging processes:
“Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.”
Some authors have suggested that gaining power over the environment is a robustly instrumental behavior pattern on which learning agents generally converge as they tend towards optimality. If so, robust instrumentality presents a safety concern for the alignment of advanced reinforcement learning systems with human society: such systems might seek to gain power over humans as part of their environment. For example, Marvin Minsky imagined that an agent tasked with proving the Riemann hypothesis might rationally turn the planet into computational resources.
This choice is not costless: many are already acclimated to the existing ‘instrumental convergence.’ It even has its own Wikipedia page. Nonetheless, if there ever were a time to make the shift, that time would be now.
Additional note for posterity: when I talked about "some objectives [may] make alignment far more likely", I was considering something like "given this pretraining objective and an otherwise fixed training process, what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.
Claim 3: If you don't control the dataset, it mostly doesn't matter what pretraining objective you use (assuming you use a simple one rather than e.g. a reward function that encodes all of human values); the properties of the model are going to be roughly similar regardless.
Analogous claim: since any program specifiable under UTM U1 is also expressible under UTM U2, choice of UTM doesn't matter.
And this is true up to a point: up to constant factors, it doesn't matter. But U1 can make it easier (simpler, faster, etc.) to specify a set of programs than U2 does. And so "there exists a program in U2-encoding which implements P in U1-encoding" doesn't get everything I want: I want to reason about the distribution of programs, about how hard it tends to be to get programs with desirable properties.
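For reference, the "up to constant factors" claim is the standard invariance theorem for Kolmogorov complexity. The notation below ($K_U$ for description length under machine $U$, and the constant $c$) is mine, not from the original exchange:

```latex
% Invariance theorem: for universal machines U_1, U_2 there is a constant
% c = c(U_1, U_2), independent of x, such that for every string x
\[
  \lvert K_{U_1}(x) - K_{U_2}(x) \rvert \le c .
\]
% Equivalently, the induced priors agree up to a constant factor:
\[
  2^{-c} \cdot 2^{-K_{U_2}(x)} \;\le\; 2^{-K_{U_1}(x)} \;\le\; 2^{c} \cdot 2^{-K_{U_2}(x)} .
\]
```

The bound holds for every x, but it says nothing about how large c is or about which programs are cheap under each machine. That gap is what the distributional question is about.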
Stepping out of the analogy, even though I agree that "reasonable" pretraining objectives are all compatible with aligned / unaligned / arbitrarily behaved models, this argument seems to leave room for some objectives to make alignment far more likely, a priori. And you may be noting as much:
(This is probably the weakest argument in the chain; just because most of the influence comes from the dataset doesn't mean that the pretraining objective can't have influence as well. I still think the claim is true though, and I still feel pretty confident about the final conclusion in the next claim.)
As I understand it, expanding candy into A and B but not expanding the other will make the ratios go differently.
What do you mean?
If we knew what was important and what was not, we would be sure about the optimality. But since we think we don't know it, or might be in error about it, we are treating the value as though it could be hiding anywhere.
I'm not currently trying to make claims about what variants we'll actually be likely to specify, if that's what you mean. Just that in the reasonably broad set of situations covered by my theorems, the vast majority of variants of every objective function will make power-seeking optimal.
Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment.
My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.
Well, maybe here's a better way of communicating what I'm after:
Suppose that you have beliefs about the initial state of the right (AGI) half, and you know how it's going to evolve; this gives you a distribution over right-half universe histories - you have beliefs about the AGI's initial state, and you can compute the consequences of those beliefs in terms of how the right half of the universe will end up.
In this way, you can take expected utility over the joint universe history, without being able to observe what's actually happening on the AGI's end. This is similar to how I prefer "start a universe which grows to be filled with human flourishing" over "start a universe which fills itself with suffering", even though I may not observe the fruits of either decision.
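As a rough sketch of that calculation: the state names, probabilities, and utility values below are all made up for illustration, and the dynamics are assumed to be known and deterministic, as in the contrived example.

```python
def expected_joint_utility(left_history, right_beliefs, evolve_right, utility):
    """Expected utility over joint (left, right) universe histories.

    right_beliefs: dict mapping each possible initial state of the right
        (AGI) half to its probability under our beliefs.
    evolve_right: known dynamics, mapping an initial state to a right-half history.
    utility: function of the joint history (left_history, right_history).
    """
    return sum(
        prob * utility(left_history, evolve_right(initial_state))
        for initial_state, prob in right_beliefs.items()
    )


# Hypothetical example values, chosen only to make the sketch runnable.
beliefs = {"agi_values_A": 0.9, "agi_values_B": 0.1}
outcomes = {"agi_values_A": "flourishing", "agi_values_B": "suffering"}
scores = {"flourishing": 1.0, "suffering": -1.0}


def evolve(initial_state):
    # Deterministic evolution of the right half from its initial state.
    return outcomes[initial_state]


def joint_utility(left_history, right_history):
    # Utility depends on both halves, even though the right half is never observed.
    return scores[left_history] + scores[right_history]


eu = expected_joint_utility("flourishing", beliefs, evolve, joint_utility)
print(eu)
```

Nothing in the computation requires observing the right half; it only uses beliefs about the initial state plus the known dynamics.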
Is this clearer?
I'm not sure if you're arguing that this is a good world in which to think about alignment.
I am not arguing this. Quoting my reply to ofer:
I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail.
(Edited post to clarify)
Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values
Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.)
My take has been that the theorem's bottleneck assumption implies that you can't reach S again after taking action a1 or a2, which rules out cycles.
If the agent is sufficiently farsighted (i.e. the discount is near 1)
I'd change this to "optimizes average reward (i.e. the discount equals 1)". Otherwise looks good!
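For readers who want the distinction spelled out, here is the standard relationship between the two criteria, assuming a finite MDP and a stationary policy $\pi$ (the notation $V^\pi_\gamma$ for discounted value and $r_t$ for the reward at time $t$ is mine):

```latex
% Average reward as the limit of normalized discounted value
% (finite MDP, stationary policy \pi, start state s):
\[
  \lim_{\gamma \to 1^-} (1-\gamma)\, V^\pi_\gamma(s)
  \;=\;
  \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}^\pi\!\left[\, \sum_{t=0}^{T-1} r_t \;\middle|\; s_0 = s \right].
\]
```

"Discount near 1" refers to some fixed $\gamma < 1$ on the left-hand side, while "discount equals 1" refers to the limiting, average-reward criterion.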