The paper Optimal Policies Tend to Seek Power tries to justify the claim in the title with a mathematical argument, using a clever formal definition of "power". I like the general approach, but I think that parts of the formalisation do not correspond well to the intuitive understanding of the concepts involved. This wouldn't be an issue if the goal of the paper was to prove a mathematical theorem. Here, however, the goal is to be able to deduce something about the behaviour of future AI systems. Therefore, the interface between "math" and "the world" is much more important. This post attempts to take another route to formalisation, which I think is closer to the intuition, albeit more difficult to work with mathematically.

Intro

The power of a state s, written POWERμ(s), is defined as (roughly) the average optimal value of that state, over some chosen distribution of rewards μ. The paper quite convincingly argues for it:

Power is the ability to achieve a range of goals. In an MDP, the optimal value function V∗R(s,γ) captures the agent’s ability to “achieve the goal” R. Therefore, average optimal value captures the agent’s ability to achieve a range of goals.

The problem with this approach is that it's very sensitive to the choice of the distribution of rewards μ. In particular, one might take a Dirac delta on a reward giving 1 in any chosen state, and 0 everywhere else, which would trivially assign maximal power to that particular state. To combat this difficulty, authors introduce some technical devices allowing them to say "for most distributions, the power of (one, obviously better, state) is greater than the power of (another, obviously worse, state)".

To me, "X happens for most distributions of rewards" should mean that there is some reasonable, uninformative, broad prior D over distributions of rewards, and that PD(X)>12. The approach chosen in the paper not only doesn't correspond to this intuitive understanding, but seems to me to be (in certain sense) proving the thesis by assuming the conclusion.

More concretely, the approach taken in the paper is as follows. Given a reward R and a permutation σ over states of the MDP, one can obtain another reward (σR)(s)=R(σ(s)) by permuting the states' rewards. This also works if we have not one reward R, but a distribution over rewards, by pushing it forward by σ. Therefore, each such permutation σ induces an orbit Sσ in the space of (distributions of) rewards. The authors then say that the power of one state s1 is greater than power of another state s2for most distributions, if, for each permutation σ, for the majority of distributions of rewards μ in the orbit Sσ, we have the inequality POWERμ(s1)>POWERμ(s2).

This is arguably quite far from how anyone would interpret the phrase "for most distributions" in real life. What we care about is the global behaviour in the whole space, not the the local behaviour in orbits. (Moreover, the permutations considered here are arbitrary in a sense of not respecting the structure of the MDP - but that is another issue).

The main problem of making the probabilistic meaning of "most" proposed above work here is that it is mathematically difficult to work in the infinite-dimensional space of distributions over rewards. There are many pathologies, such as the fact that there is no counterpart to the (Lebesgue-) uniform measure (after scouring the internet, I found that authors make this argument in a response to the review of the paper).

I still think it is possible to make progress on this. It should be possible to use something like Dirichlet/Pitman-Yor/Gaussian processes, or even something more geometric such as Wasserstein distance, based on the metric structure of the rewards in an MDP. However, regardless of the concrete distribution, the first step is to prove that the event POWERμ(s1)≤POWERμ(s2) is measurable in the space of distributions of rewards (that is, in μ).

Thus, this post contains:

the proof of this fact, and

some (unfruitful) exploration of two simple cases of concrete distributions over distributions of rewards

The main technical device I use is the Giry monad, which gives a characterisation of the canonical σ-algebra on the space of probability distributions. I include the basic definition below, but those unfamiliar with the topic might enjoy this tutorial - which assumes only basic familiarity with measure theory and category theory.

Technical content

What the original paper did

To refresh the memory, the original paper proceeds as follows. We assign "power" to each state of a MDP (under a fixed distribution μ over reward functions), to be:
POWERμ(s)=1−γγ⋅ER∼μ[V∗R(s)−R(s)]
For a set D of distributions over rewards, we define the inequality between powers of states s1,s2 to be:
POWERμ(s1)≥most(D)POWERμ(s2)
if and only if, for all Bμ=orb(μ,ΣS)={σ⋅μ:σ∈ΣS}, for the majority of ν∈B (which is a finite set) we have:
POWERν(s1)≥POWERν(s2)

(The paper considers only rewards of a form R:S→R and only deterministic policies π∈S→A.)

Giry monad

First, we note that if S and A are finite (countable case is the same), then the space of reward functions S→R (in general, S×A×S→R) is equal to Rn for some n∈N - therefore, it is a standard Borel space.

On the category of Borel spaces, we have the Giry monad G, which assigns, to each measurable space X, the space of the probability measures on GX. Therefore, G points out a canonical σ-algebra on the Δ(Rn), which we denote by Σ.

We wiil describe Σ. First, we say that for a fixed U∈ΣRn, the evaluation function is given by:
evU:Δ(Rn)→RevU(μ)=μ(U)
Then, Σ It is generated by all evU, that is, by sets all sets of the form ev−1U(V), for any measuable set V∈ΣR.

Lemma 1: For any measurable function f:Rn→R, we can take the evaluation with respect to this function:
evf(μ)=∫f(x)dμ(x)
and then Σ is equivalently generated by all ev−1f(V).

Measurability of the POWER inequality

Let us fix some states s1,s2. We can ask whether the set
A={μ∈D:POWERD(s1)≥POWERD(s2)}
is measurable in Σ (since ultimately, we want to say something about it's measure).

Unrolling the definitions gives us
POWERμ(s1)≥POWERμ(s2)⟺1−γγER∼μ[V∗R(s1)−R(s1)]≥1−γγER∼μ[V∗R(s1)−R(s1)]⟺ER∼μ[(V∗R(s1)−V∗R(s2))−(R(s1)−R(s2))]≥0
Let's now denote:
f:Rn→Rf(R)=V∗R(s1)−V∗R(s2)−(R(s1)−R(s2))F:Δ(Rn)→RF(μ)=Ex∼μ[f(x)]=∫f(x)dμ(x)
Rewriting the definition of A gives us:
A={μ∈Δ(Rn):Ex∼μ[f(x)]≥0}={μ∈Δ(Rn):F(μ)≥0}=F−1([0,∞))
Now, we only have to argue that f is measurable - then, by Lemma 1, we will know that A is measurable. It is a sum of four terms - it is enough to show that g(R)=R(si) and h(R)=V∗R(si) are measurable.

The first function g(R)=R(s1) is just a projection to the si-th coordinate, therefore measurable.

The second function
h(R)=maxπ∈Πfπ,si(γ)⋅R
is just a maximum over a finite set Π=S→A of linear functionals (on a finite-dimensional space), therefore measurable as well - where fπ,si(γ) is the visit distribution
fπ,si(γ)=E{st}∼π[∞∑t=0γtest]
(doesn't really matter what it is, the important point is that it is independent of R).

Probabilities on D

Ok, so now we can think about what is the probability mass of A=A(s1,s2) under different probability distributions over Δ(Rn). To make our lives easier, we'll now switch to Δ(In), I=[0,1]. This doesn't change anything: we can rescale reward function by any factor and get an equivalent MDP (proof: just look at the Bellman equations).

Two (short) attempts I make below are unsuccessful. Both of them fail in a similar ways: by making the uniform distribution over rewards "too central", and by not having the full support over all probability distributions.

Lifting uniform distribution by Giry monad

We have one canonical distribution over distributions: take the uniform distribution p:
1p→X
and apply the Giry monad:
G(1)≃1Gp−→GX
But unfortunately, Gp is just the uniform distribution over the Dirac deltas in δR∈GX. Then, the situation is equivalent to calculating the measure of the set POWERμ(s1)≥POWERμ(s2) in one case of the uniform distribution over rewards (that is, a Diract delta on the uniform distribution of the rewards).

Dirichlet process

We can take Dirichlet process D:=D(U[0,1]n,α) centered around the uniform distribution, for a fixed α.

Definition: For a given distribution H on X and α∈(0,∞), the Dirichlet process D(H,α) on the space Δ(X) is characterized by the following property: for any partition A1,A2,…An of X, we have that if μ∼D(H,α), then:
(μ(A1),μ(A2),…μ(An))∼Dir(αH(A1),αH(A2),…αH(An))

Applying the definition to our D(U[0,1]n,α) and the partition {A,Ac}, we get
(μ(A),μ(Ac))=(μ(A),1−μ(A))∼Dir(αλ(A),α(1−λ(A)))=Beta(αλ(A),α(1−λ(A)))

where λ(A) the measure of A under uniform distribution U[0,1]n.

Therefore, we have the mean:
Eμ∼D[μ(A)]=λ(A)

Unfortunately, this is again quite uninteresting. All the choice was in centering the Dirichlet process at H=U[0,1]n. (Another drawback is that Dirichlet process has support only on discrete distributions, which is at odds with our initial goal of having a broad prior over all distributions of rewards - in particular, continuous ones).

Conclusion

We proved that the event POWERμ(s1)≥POWERμ(s2) is measurable in the space of distibutions μ∼D. We failed to point out any interesting concrete priors over distributions of reward functions, but I am optimistic that it should be possible with better understanding of the metric (topological, linear, differential, ...?) structure of the reward functions on an MDP.

The paper Optimal Policies Tend to Seek Power tries to justify the claim in the title with a mathematical argument, using a clever formal definition of "power". I like the general approach, but I think that parts of the formalisation do not correspond well to the intuitive understanding of the concepts involved. This wouldn't be an issue if the goal of the paper was to prove a mathematical theorem. Here, however, the goal is to be able to deduce something about the behaviour of future AI systems. Therefore, the interface between "math" and "the world" is much more important. This post attempts to take another route to formalisation, which I think is closer to the intuition, albeit more difficult to work with mathematically.

## Intro

The power of a state s, written POWERμ(s), is defined as (roughly) the average optimal value of that state, over some chosen distribution of rewards μ. The paper quite convincingly argues for it:

The problem with this approach is that it's very sensitive to the choice of the distribution of rewards μ. In particular, one might take a Dirac delta on a reward giving 1 in any chosen state, and 0 everywhere else, which would trivially assign maximal power to that particular state. To combat this difficulty, authors introduce some technical devices allowing them to say "for most distributions, the power of (one, obviously better, state) is greater than the power of (another, obviously worse, state)".

To me, "X happens for most distributions of rewards" should mean that there is some reasonable, uninformative, broad prior D over distributions of rewards, and that PD(X)>12. The approach chosen in the paper not only doesn't correspond to this intuitive understanding, but seems to me to be (in certain sense) proving the thesis by assuming the conclusion.

More concretely, the approach taken in the paper is as follows. Given a reward R and a permutation σ over states of the MDP, one can obtain another reward (σR)(s)=R(σ(s)) by permuting the states' rewards. This also works if we have not one reward R, but a distribution over rewards, by pushing it forward by σ. Therefore, each such permutation σ induces an orbit Sσ in the space of (distributions of) rewards. The authors then say that the power of one state s1 is greater than power of another state s2

for most distributions, if, for each permutation σ, for the majority of distributions of rewards μ in the orbit Sσ, we have the inequality POWERμ(s1)>POWERμ(s2).This is arguably quite far from how anyone would interpret the phrase "for most distributions" in real life. What we care about is the global behaviour in the whole space, not the the local behaviour in orbits. (Moreover, the permutations considered here are arbitrary in a sense of not respecting the structure of the MDP - but that is another issue).

The main problem of making the probabilistic meaning of "most" proposed above work here is that it is mathematically difficult to work in the infinite-dimensional space of distributions over rewards. There are many pathologies, such as the fact that there is no counterpart to the (Lebesgue-) uniform measure (after scouring the internet, I found that authors make this argument in a response to the review of the paper).

I still think it is possible to make progress on this. It should be possible to use something like Dirichlet/Pitman-Yor/Gaussian processes, or even something more geometric such as Wasserstein distance, based on the metric structure of the rewards in an MDP. However, regardless of the concrete distribution, the first step is to prove that the event POWERμ(s1)≤POWERμ(s2) is measurable in the space of distributions of rewards (that is, in μ).

Thus, this post contains:

The main technical device I use is the Giry monad, which gives a characterisation of the canonical σ-algebra on the space of probability distributions. I include the basic definition below, but those unfamiliar with the topic might enjoy this tutorial - which assumes only basic familiarity with measure theory and category theory.

## Technical content

## What the original paper did

To refresh the memory, the original paper proceeds as follows. We assign "power" to each state of a MDP (under a fixed distribution μ over reward functions), to be: POWERμ(s)=1−γγ⋅ER∼μ[V∗R(s)−R(s)] For a set D of distributions over rewards, we define the inequality between powers of states s1,s2 to be: POWERμ(s1)≥most(D)POWERμ(s2) if and only if, for all Bμ=orb(μ,ΣS)={σ⋅μ:σ∈ΣS}, for the majority of ν∈B (which is a finite set) we have: POWERν(s1)≥POWERν(s2)

(The paper considers only rewards of a form R:S→R and only deterministic policies π∈S→A.)

## Giry monad

First, we note that if S and A are finite (countable case is the same), then the space of reward functions S→R (in general, S×A×S→R) is equal to Rn for some n∈N - therefore, it is a standard Borel space.

On the category of Borel spaces, we have the Giry monad G, which assigns, to each measurable space X, the space of the probability measures on GX. Therefore, G points out a canonical σ-algebra on the Δ(Rn), which we denote by Σ.

We wiil describe Σ. First, we say that for a fixed U∈ΣRn, the evaluation function is given by: evU:Δ(Rn)→RevU(μ)=μ(U) Then, Σ It is generated by all evU, that is, by sets all sets of the form ev−1U(V), for any measuable set V∈ΣR.

Lemma 1: For any measurable function f:Rn→R, we can take the evaluation with respect to this function: evf(μ)=∫f(x)dμ(x) and then Σ is equivalently generated by all ev−1f(V).## Measurability of the POWER inequality

Let us fix some states s1,s2. We can ask whether the set A={μ∈D:POWERD(s1)≥POWERD(s2)} is measurable in Σ (since ultimately, we want to say something about it's measure).

Unrolling the definitions gives us POWERμ(s1)≥POWERμ(s2)⟺1−γγER∼μ[V∗R(s1)−R(s1)]≥1−γγER∼μ[V∗R(s1)−R(s1)]⟺ER∼μ[(V∗R(s1)−V∗R(s2))−(R(s1)−R(s2))]≥0 Let's now denote: f:Rn→Rf(R)=V∗R(s1)−V∗R(s2)−(R(s1)−R(s2))F:Δ(Rn)→RF(μ)=Ex∼μ[f(x)]=∫f(x)dμ(x) Rewriting the definition of A gives us: A={μ∈Δ(Rn):Ex∼μ[f(x)]≥0}={μ∈Δ(Rn):F(μ)≥0}=F−1([0,∞)) Now, we only have to argue that f is measurable - then, by

Lemma 1, we will know that A is measurable. It is a sum of four terms - it is enough to show that g(R)=R(si) and h(R)=V∗R(si) are measurable.The first function g(R)=R(s1) is just a projection to the si-th coordinate, therefore measurable.

The second function h(R)=maxπ∈Πfπ,si(γ)⋅R is just a maximum over a finite set Π=S→A of linear functionals (on a finite-dimensional space), therefore measurable as well - where fπ,si(γ) is the visit distribution fπ,si(γ)=E{st}∼π[∞∑t=0γtest] (doesn't really matter what it is, the important point is that it is independent of R).

## Probabilities on D

Ok, so now we can think about what is the probability mass of A=A(s1,s2) under different probability distributions over Δ(Rn). To make our lives easier, we'll now switch to Δ(In), I=[0,1]. This doesn't change anything: we can rescale reward function by any factor and get an equivalent MDP (proof: just look at the Bellman equations).

Two (short) attempts I make below are unsuccessful. Both of them fail in a similar ways: by making the uniform distribution over rewards "too central", and by not having the full support over all probability distributions.

## Lifting uniform distribution by Giry monad

We have one canonical distribution over distributions: take the uniform distribution p: 1p→X and apply the Giry monad: G(1)≃1Gp−→GX But unfortunately, Gp is just the uniform distribution over the Dirac deltas in δR∈GX. Then, the situation is equivalent to calculating the measure of the set POWERμ(s1)≥POWERμ(s2) in one case of the uniform distribution over rewards (that is, a Diract delta on the uniform distribution of the rewards).

## Dirichlet process

We can take Dirichlet process D:=D(U[0,1]n,α) centered around the uniform distribution, for a fixed α.

Definition: For a given distribution H on X and α∈(0,∞), the Dirichlet process D(H,α) on the space Δ(X) is characterized by the following property: for any partition A1,A2,…An of X, we have that if μ∼D(H,α), then: (μ(A1),μ(A2),…μ(An))∼Dir(αH(A1),αH(A2),…αH(An))Applying the definition to our D(U[0,1]n,α) and the partition {A,Ac}, we get (μ(A),μ(Ac))=(μ(A),1−μ(A))∼Dir(αλ(A),α(1−λ(A)))=Beta(αλ(A),α(1−λ(A)))

where λ(A) the measure of A under uniform distribution U[0,1]n.

Therefore, we have the mean: Eμ∼D[μ(A)]=λ(A)

Unfortunately, this is again quite uninteresting. All the choice was in centering the Dirichlet process at H=U[0,1]n. (Another drawback is that Dirichlet process has support only on discrete distributions, which is at odds with our initial goal of having a broad prior over all distributions of rewards - in particular, continuous ones).

## Conclusion

We proved that the event POWERμ(s1)≥POWERμ(s2) is measurable in the space of distibutions μ∼D. We failed to point out any interesting concrete priors over distributions of reward functions, but I am optimistic that it should be possible with better understanding of the metric (topological, linear, differential, ...?) structure of the reward functions on an MDP.