When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Copying over a Slack comment from Abram Demski:

I think this post could be pretty important.
It offers a formal treatment of "goal-directedness" and its relationship to coherence theorems such as VNM, a topic which has seen some past controversy but which has, until now, been dealt with only quite informally. Personally, I haven't known how to engage with the whole goal-directedness debate, and I think part of the reason is the vagueness of the idea. Goal-directedness doesn't seem that cruxy for most of my thinking, but some other people seem to strongly perceive it as a crux for MIRI-type thought, and sometimes as a crux for AI risk more generally. (I once made a "tool AI" argument against AI risk myself, although in hindsight I would say that was all motivated cognition, which ignored the idea that even tool AI has to optimize strongly in order to have high capabilities.)
So, as I see it, there's been something of a stalemate between people who think the "goal-directed AI" vs "non-goal-directed AI" distinction is important for one reason or another and people who don't.
Alex Turner seems to give real technical meaning to this distinction, showing that most VNM-coherent preferences are indeed "goal-directed" in the sense of acting broadly like we expect agents to act (that is, behaving in ways consistent with instrumental convergence). However, he also gives a class of VNM-coherent preferences which are not goal-directed in this sense, instead exhibiting essentially random behavior. This gives us a plausible formal proxy for the "goal-directed vs not goal-directed" distinction!
I'm not sure how it can/should carry the broader conversation forward, yet, but it seems like something to think about.

Edouard Harris (4y)

Thanks for writing this.

I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.

First, $ϕ$ is being defined here as an outcome permutation. Presumably this means that 1) $ϕ(o_{i}) = o_{j}$ for some $o_{i}$, $o_{j}$; and 2) $ϕ$ admits a unique inverse $ϕ^{-1}(o_{j}) = o_{i}$. That makes sense.

We also define lotteries over outcomes, presumably as, e.g., $L = \sum_{i=1}^{n} ℓ_{i} o_{i}$, where $ℓ_{i}$ is the probability of outcome $o_{i}$. Of course we can interpret the $o_{i}$ geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the $(n-1)$-simplex. So far, so good.

But the thing that's confusing me is what this implies for the definition of $ϕ^{-1}(L)$. Because $ϕ$ is defined as a permutation over outcomes (and not over probabilities of outcomes), we should expect this to be

$$ϕ^{-1}(L) = ϕ^{-1}\left(\sum_{i=1}^{n} ℓ_{i} o_{i}\right) = \sum_{i=1}^{n} ℓ_{i}\, ϕ^{-1}(o_{i})$$

The problem is that this seems to give a different EV from the lemma:

$$E_{o \sim ϕ^{-1}(L)}[u(o)] = \sum_{i=1}^{n} ℓ_{i}\, u(ϕ^{-1}(o_{i})) = E_{o \sim L}[u(ϕ^{-1}(o))]$$

(Note that I'm using $o$ as the dummy variable rather than $ℓ$, but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the $M$ lottery gives an analogous result. And then looking at the inequality that results suggests that Lemma 3 should actually be "$≺_{ϕ}$ induces $u(ϕ^{-1}(o_{i}))$" as opposed to "$≺_{ϕ}$ induces $u(ϕ(o_{i}))$".

(As a concrete example, suppose we have a lottery $L = ℓ_{1} o_{1} + ℓ_{2} o_{2} + ℓ_{3} o_{3}$ with the permutation $ϕ^{-1}(o_{1}) = o_{2}$, $ϕ^{-1}(o_{2}) = o_{3}$, $ϕ^{-1}(o_{3}) = o_{1}$. Then $ϕ^{-1}(L) = ℓ_{1} o_{2} + ℓ_{2} o_{3} + ℓ_{3} o_{1}$ and our EV is

$$E_{o \sim ϕ^{-1}(L)}[u(o)] = ℓ_{1} u(o_{2}) + ℓ_{2} u(o_{3}) + ℓ_{3} u(o_{1}) = E_{o \sim L}[u(ϕ^{-1}(o))]$$

Yet $E_{o \sim L}[u(ϕ(o))] = ℓ_{1} u(o_{3}) + ℓ_{2} u(o_{1}) + ℓ_{3} u(o_{2}) \neq E_{o \sim ϕ^{-1}(L)}[u(o)]$, which appears to contradict the lemma as stated.)
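The three-outcome example can be checked numerically. This is only a minimal sketch: the probabilities and utility values are made up for illustration (they are not from the post), and the helper names `pushforward`, `expected_value`, and `compose` are hypothetical.

```python
from fractions import Fraction

# The permutation from the example: phi^{-1}(o1)=o2, phi^{-1}(o2)=o3, phi^{-1}(o3)=o1.
phi_inv = {"o1": "o2", "o2": "o3", "o3": "o1"}
phi = {target: source for source, target in phi_inv.items()}

# Illustrative (assumed) lottery and utility function.
L = {"o1": Fraction(1, 2), "o2": Fraction(3, 10), "o3": Fraction(1, 5)}
u = {"o1": 10, "o2": 20, "o3": 40}

def pushforward(perm, lottery):
    """Apply an outcome permutation to a lottery: the mass on o moves to perm[o]."""
    out = {}
    for outcome, prob in lottery.items():
        out[perm[outcome]] = out.get(perm[outcome], 0) + prob
    return out

def expected_value(lottery, utility):
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())

def compose(utility, perm):
    """Return the utility function o -> utility(perm(o))."""
    return {outcome: utility[perm[outcome]] for outcome in perm}

ev_pushed = expected_value(pushforward(phi_inv, L), u)   # E_{o ~ phi^{-1}(L)}[u(o)]
ev_u_phi_inv = expected_value(L, compose(u, phi_inv))    # E_{o ~ L}[u(phi^{-1}(o))]
ev_u_phi = expected_value(L, compose(u, phi))            # E_{o ~ L}[u(phi(o))]

print(ev_pushed, ev_u_phi_inv, ev_u_phi)
```

With these particular numbers the first two quantities agree and the third differs, matching the algebra in the example.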

Note that even if this analysis is correct, it doesn't invalidate your main claim. You only really care about the existence of a bijection rather than what that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)
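To illustrate why the choice between $ϕ$ and $ϕ^{-1}$ does not change orbit-level proportions in the finite case, here is a small sketch. It assumes a made-up utility vector and uses a hypothetical stand-in predicate (`prefers_first`), not the post's actual power-seeking criterion. The only point is that inversion is a bijection on the permutation group, so both conventions sweep out the same orbit with the same multiplicities.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def inverse(perm):
    """Inverse of a permutation given as a tuple where perm[i] is the image of i."""
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return tuple(inv)

def compose(u, perm):
    """Utility vector of u composed with perm: entry i is u[perm[i]]."""
    return tuple(u[perm[i]] for i in range(len(u)))

u = (10, 20, 40, 40)  # illustrative utilities over four outcomes (assumed)
perms = list(permutations(range(len(u))))

orbit_phi = Counter(compose(u, p) for p in perms)
orbit_phi_inv = Counter(compose(u, inverse(p)) for p in perms)

def prefers_first(v):
    """Hypothetical stand-in for 'this orbit element has the property of interest'."""
    return v[0] == max(v)

def proportion(orbit):
    hits = sum(count for v, count in orbit.items() if prefers_first(v))
    return Fraction(hits, len(perms))

print(orbit_phi == orbit_phi_inv, proportion(orbit_phi), proportion(orbit_phi_inv))
```

Any other predicate on utility vectors would give equal proportions for the same reason, since the two multisets are identical.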

Again, it's also possible I've just misunderstood something here — please let me know if that's the case!

TurnTrout (4y)

Thanks! I think you're right. I think I actually should have defined $≺_{ϕ}$ differently, because writing it out, it isn't what I want. Having written out a small example, intuitively, $L ≻_{ϕ} M$ should hold iff $ϕ(L) ≻ ϕ(M)$, which will also induce $u(ϕ(o_{i}))$ as we want.

I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation. Probably it's more natural to represent $E_{ℓ \sim ϕ^{-1}(L)}[u(ℓ)]$ as $u^{⊤}(P_{ϕ^{-1}} l) = (u^{⊤} P_{ϕ^{-1}}) l$, which makes your insight obvious.
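The vector form can be sketched in a few lines. The convention here is an assumption, not something the post specifies: $P_{σ}$ is taken to be the matrix that sends the probability vector of $L$ to that of $σ(L)$, so it has a 1 in row $σ(i)$, column $i$. The numbers are made up.

```python
from fractions import Fraction

def perm_matrix(sigma):
    """Matrix P with P[sigma[i]][i] = 1, so that P @ l is the pushforward of l."""
    n = len(sigma)
    P = [[0] * n for _ in range(n)]
    for i, j in enumerate(sigma):
        P[j][i] = 1
    return P

def mat_vec(P, v):
    return [sum(P[r][c] * v[c] for c in range(len(v))) for r in range(len(P))]

def vec_mat(v, P):
    return [sum(v[r] * P[r][c] for r in range(len(v))) for c in range(len(P[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

phi_inv = [1, 2, 0]  # o1 -> o2, o2 -> o3, o3 -> o1, using 0-based indices
u = [10, 20, 40]     # illustrative utilities (assumed)
l = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # illustrative probabilities

P = perm_matrix(phi_inv)
lhs = dot(u, mat_vec(P, l))   # u^T (P l): permute the lottery, then take the EV
u_row = vec_mat(u, P)         # u^T P: entry i is u[phi_inv[i]]
rhs = dot(u_row, l)           # (u^T P) l: permute the utility, keep the lottery

print(lhs, rhs, u_row)
```

Under this convention the row vector $u^{⊤} P_{ϕ^{-1}}$ has entries $u(ϕ^{-1}(o_{i}))$, which is the induced utility function from the comment above.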

The post is edited and the issues should now be fixed.

Edouard Harris (4y)

No problem! Glad it was helpful. I think your fix makes sense.

> I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.

Yeah, I figured maybe it was because the dummy variable $ℓ$ was being used in the EV to sum over outcomes, while the vector $l$ was being used to represent the probabilities associated with those outcomes. Because $ℓ$ and $l$ are similar it's easy to conflate their meanings, and if you apply $ϕ$ to the wrong one by accident that has the same effect as applying $ϕ^{-1}$ to the other one. In any case though, the main result seems unaffected.
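A small sketch of that conflation, under one assumed reading of "applying $ϕ$ to the probabilities": re-reading the probability vector through $ϕ$ (new entry $i$ is old entry $ϕ(i)$) gives the same lottery as moving outcomes by $ϕ^{-1}$, and in general not the same as moving them by $ϕ$. The function names and numbers are hypothetical.

```python
from fractions import Fraction

def inverse(perm):
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return inv

def pushforward(perm, probs):
    """Permute outcomes: the mass on outcome i moves to outcome perm[i]."""
    out = [Fraction(0)] * len(probs)
    for i, p in enumerate(probs):
        out[perm[i]] += p
    return out

def reindex(perm, probs):
    """Permute the probability vector itself: new entry i is old entry perm[i]."""
    return [probs[perm[i]] for i in range(len(probs))]

phi = [2, 0, 1]  # o1 -> o3, o2 -> o1, o3 -> o2, using 0-based indices
l = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # illustrative probabilities

reindexed = reindex(phi, l)
moved_by_inverse = pushforward(inverse(phi), l)
moved_by_phi = pushforward(phi, l)

print(reindexed == moved_by_inverse, reindexed == moved_by_phi)
```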

Cheers!

AI ALIGNMENT FORUM