Ofer Givoli

Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the EA Forum.

Feel free to reach out by sending me a PM here or on my website.

Wiki Contributions


A world in which the alignment problem seems lower-stakes

I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of instrumental convergence thesis, there's an entire paragraph about Stephen Omohundro's work on the topic (including citations of Omohundro's "two pioneering papers on this topic").

A world in which the alignment problem seems lower-stakes

Bostrom's original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment

This post uses the phrase "Bostrom's original instrumental convergence thesis". I'm not aware of there being more than one instrumental convergence thesis. In the 2012 paper that is linked here the formulation of the thesis is identical to the one in the book Superintelligence (2014), except that the paper uses the term "many intelligent agents" instead of "a broad spectrum of situated intelligent agents".

In case it'll be helpful to anyone, the formulation of the thesis in the book Superintelligence is the following:

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents. 

I'm not sure what you meant here by saying that the instrumental convergence thesis "needs to be applied carefully", and how the example you gave supports this. Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values (which are all mentioned both in the linked paper and in the book Superintelligence as categories where "convergent instrumental values may be found"): self-preservation, cognitive enhancement, technological perfection and resource acquisition.

Environmental Structure Can Cause Instrumental Convergence

Because you can do "strictly more things" with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.

Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don't "tend to avoid breaking the vase". Those optimal policies don't behave as if they care about the 'strictly more states' that can be reached by not breaking the vase.

When the agent maximizes average reward, we know that optimal policies tend to seek power when there's something like:

"Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1." (This follows by combining proposition 6.12 and theorem 6.13)

Here "{cycles reachable after taking a1 at s}" actually refers an RSD, right? So we're not just talking about a set of states, we're talking about a set of vectors that each corresponds to a "state visitation distribution" of a different policy. In order for the "similar to" (via involution) relation to be satisfied, we need all the elements (real numbers) of the relevant vector pairs to match. This is a substantially more complicated condition than the one in your comment, and it is generally harder to satisfy in stochastic environments.

In fact, I think that condition is usually hard/impossible to satisfy even in toy stochastic environments. Consider a version of Pac-Man in which at least one "ghost" is moving randomly at any given time; I'll call this Pac-Man-with-Random-Ghost (a quick internet search suggests that in the real Pac-Man the ghosts move deterministically other than when they are in "Frightened" mode, i.e. when they are blue and can't kill Pac-Man).

Let's focus on the condition in Proposition 6.12 (which is identical to or less strict than the condition for the main claim, right?). Given some state in a Pac-Man-with-Random-Ghost environment, suppose that action a1 results in an immediate game-over state due to a collision with a ghost, while action a2 does not. For every terminal state , is a set that contains a single vector in which all entries are 0 except for one that is non-zero. But for every state that can result from action a2, we get that is a set that does not contain any vector-with-0s-in-all-entries-except-one, because for any policy, there is no way to get to a particular terminal state with probability 1 (due to the location of the ghosts being part of the state description). Therefore there does not exist a subset of that is similar to via an involution.

A similar argument seems to apply to Propositions 6.5 and 6.9. Also, I think Corollary 6.14 never applies to Pac-Man-with-Random-Ghost environments, because unless s is a terminal state, will not contain any vector-with-0s-in-all-entries-except-one (again, due to ghosts moving randomly). The paper claims (in the context of Figure 8 which is about Pac-Man): "Therefore, corollary 6.14 proves that Blackwell optimal policies tend to not go left in this situation. Blackwell optimal policies tend to avoid immediately dying in PacMan, even though most reward functions do not resemble Pac-Man’s original score function." So that claim relies on Pac-Man being a "sufficiently deterministic" environment and it does not apply to the Pac-Man-with-Random-Ghost version.

Can you give an example of a stochastic environment (with randomness in every state transition) to which the main claim of the paper applies?

Environmental Structure Can Cause Instrumental Convergence

That one in particular isn't a counterexample as stated, because you can't construct a subgraph isomorphism for it.

Probably not an important point, but I don't see why we can't use the identity isomorphism (over the part of the state space that a1 leads to).

Environmental Structure Can Cause Instrumental Convergence

I was referring to the claim being made in Rohin's summary. (I no longer see counter examples after adding the assumption that "a1 and a2 lead to disjoint sets of future options".)

Environmental Structure Can Cause Instrumental Convergence

(we’re going to ignore cases where a1 or a2 is a self-loop)

I think that a more general class of things should be ignored here. For example, if a2 is part of a 2-cycle, we get the same problem as when a2 is a self-loop. Namely, we can get that most reward functions have optimal policies that take the action a1 over a2 (when the discount rate is sufficiently close to 1), which contradicts the claim being made.

Discussion: Objective Robustness and Inner Alignment Terminology

Suppose we train a model, and at some point during training the inference execution hacks the computer on which the model is trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?

Environmental Structure Can Cause Instrumental Convergence

Optimal policies will tend to avoid breaking the vase, even though some don't. 

Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?

This is just making my point - Blackwell optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then optimal policies tend to end up in D1 instead of D2. Most optimal policies will avoid entering the final state, just as section 7 claims. 

My question is just about the main claim in the abstract of the paper ("We prove that for most prior beliefs one might have about the agent's reward function [...], one should expect optimal policies to seek power in these environments."). The main claim does not apply to the simple environment in my example (i.e. we should not expect optimal policies to seek POWER in that environment). I'm completely fine with that being the case, I just want to understand why. What criterion does that environment violate?

I agree that there's room for cleaner explanation of when the theorems apply, for those readers who don't want to memorize the formal conditions. 

I counted ~19 non-trivial definitions in the paper. Also, the theorems that the main claim directly relies on (which I guess is some subset of {Proposition 6.9, Proposition 6.12, Theorem 6.13}?) seem complicated. So I think the paper should definitely provide a reasonably simple description of the set of MDPs that the main claim applies to, and explain why proving things on that set is useful.

But I think the theory says interesting things because it's already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I'm almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is "interesting enough."

Do you mean that the main claim of the paper actually applies to those environments (i.e. that they are in the formal set of MDPs that the relevant theorems apply to) or do you just mean that optimal policies in those environments tend to be POWER-seeking? (The main claim only deals with sufficient conditions.)

Environmental Structure Can Cause Instrumental Convergence

The paper supports the claim with:

  • Embodied environment in a vase-containing room (section 6.3)

I think this refers to the following passage from the paper:

Consider an embodied navigation task through a room with a vase. Proposition 6.9 suggests that optimal policies tend to avoid breaking the vase, since doing so would strictly decrease available options.

This seems to me like a counter example. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.

Regarding your next bullet point:

  • Pac-Man (figure 8)
    • And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)

I don't know what you mean here by "generally holds". When does an environment—in which the agent can be shut down—"have the right symmetries" for the purpose of the main claim? Consider the following counter example (in which the last state is equivalent to the agent being shut down):

In most states (the first 3 states) the optimal policies of most reward functions transition to the next state, while the POWER-seeking behavior is to stay in the same state (when the discount rate is sufficiently close to 1). If we want to tell a story about this environment, we can say that it's about a car in a one-way street.

To be clear, the issue I'm raising here about the paper is NOT that the main claim does not apply to all MDPs. The issue is the lack of (1) a reasonably simple description of the set of MDPs that the main claim applies to; and (2) an explanation for why it is useful to prove things about that set.

Sorry - I meant the "future work" portion of the discussion section 7. The future work highlights the "note of caution" bits.

The limitations mentioned there are mainly: "Most real-world tasks are partially observable" and "our results only apply to optimal policies in finite MDPs". I think that another limitation that belongs there is that the main claim only applies to a particular set of MDPs.

Environmental Structure Can Cause Instrumental Convergence

For my part, I either strongly disagree with nearly every claim you make in this comment, or think you're criticizing the post for claiming something that it doesn't claim (e.g. "proves a core AI alignment argument"; did you read this post's "A note of caution" section / the limitations section and conclusion of the paper v.7?).

I did read the "Note of caution" section in the OP. It says that most of the environments we think about seem to "have the right symmetries", which may be true, but I haven't seen the paper support that claim.

Maybe I just missed it, but I didn't find a "limitations section" or similar in the paper. I did find the following in the Conclusion section:

We caution that many real-world tasks are partially observable and that learned policies are rarely optimal. Our results do not mathematically prove that hypothetical superintelligent AI agents will seek power.

Though the title of the paper can still give the impression that it proves a core argument for AI x-risk.

Also, plausibly-the-most-influential-critic-of-AI-safety in EA seems to have gotten the impression (from an earlier version of the paper) that it formalizes the instrumental convergence thesis (see the first paragraph here). So I think my advice that "it should not be cited as a paper that formally proves a core AI alignment argument" is beneficial.

I don't think it will be useful for me to engage in detail, given that we've already extensively debated these points at length, without much consensus being reached.

For reference (in case anyone is interested in that discussion): I think it's the thread that starts here (just the part after "2.").

Load More