Earlier we established that the quantilal policy can be computed in polynomial time to any given approximation (see "Proposition 5"). Now we show that an exact quantilal policy can be computed in polynomial time (in particular there is always a rational quantilal policy).

We assume geometric time discount throughout.

Lemma

Consider ξ∈ΔS. Define the linear operators E:RS×A→RS and T:RS×A→RS by

Et,sa:=[[t=s]]

Tt,sa:=T(t|s,a)

(Note that this T is the transpose of the T defined in "Proposition A.3" of the previous essay.)

Then, ξ∈ImZ if and only if there is ϕ∈Δ(S×A) s.t.

Eϕ=ξ

(E−λT)ϕ=(1−λ)ζ0

Proof

(This is actually well known, but we spell out the proof to be self-contained.)

Suppose that ξ∈ImZ. We already know that this implies that there is a stationary policy π:Sk→A s.t. Zπ=ξ (we abuse notation in the obvious way): see the proofs of "Proposition 2" and "Proposition 3". Define the linear operator Tπ:RS→RS by

Conversely, suppose that ϕ is as above. Since Eϕ=ξ, there is π:Sk→A s.t. for any s∈S, if ξ(s)≠0 then

π(a∣s)=ϕ(s,a)ξ(s)

Again, we have

Zπ=(1−λ)(1−λTπ)−1ζ0

Also, for the same reason as before

(E−λT)ϕ=(1−λTπ)ξ

By the assumption, the left hand side equals (1−λ)ζ0. We conclude

ξ=(1−λ)(1−λTπ)−1ζ0=Zπ

Theorem

Assuming all parameters are rational like before, there is a polynomial time algorithm that computes a quantilal policy.

Proof

The algorithm starts by solving the following linear program. The indeterminates are ϕ∈RS×A and QV∈R. The goal is maximizing QV. The constraints are

∀s∈S,a∈A:ϕ(s,a)≥0

∑s∈Sa∈Aϕ(s,a)=1

(E−λT)ϕ=(1−λ)ζ0

∀s∈S∖suppZσ,a∈A:ϕ(s,a)=0

∀s∈suppZσ:QV≤∑t∈SR(t)∑a∈Aϕ(t,a)−ηZσ(s)∑a∈Aϕ(s,a)

Then, the algorithm computes π:Sk→A s.t. for any s∈S, if ∑b∈Aϕ(s,b)>0 then

π(a∣s):=ϕ(s,a)∑b∈Aϕ(s,b)

For s∈S s.t. ∑b∈Aϕ(s,b)=0, π(s) is arbitrary.

Now we need to explain why this algorithm is correct.

Observe that, the first 3 constraints mean that ξ∈RS defined by ξ(s):=∑b∈Aϕ(s,b) lies in ImZ (by Lemma 1) and, moreover, ϕ(s,a)=ξ(s)π(a∣s) for π:Sk→A s.t. ξ=Zπ. It remains to show that the linear program amounts to maximizing Eξ[R]−ηexpD∞(ξ||Zσ) inside ImZ. Indeed, the 4th constraint just means that D∞(ξ||Zσ)<∞. The last constraint implies that we are actually maximizing

mins∈suppZσ(Eξ[R]−ηZσ(s)ξ(s))

The latter is indeed Eξ[R]−ηexpD∞(ξ||Zσ), since every s∈suppZσ corresponds to a pure strategy of the adversary in the corresponding zero-sum game: namely, this strategy is setting the penalty function P:S→[0,∞) to

P(t)=[[t=s]]Zσ(s)

(Strategies that place non-vanishing penalty on states outside of suppZσ become irrelevant after imposing the 4th constraint. The remaining penalty functions form a simplex with vertices as above.)

Earlier we established that the quantilal policy can be computed in polynomial time to any given approximation (see "Proposition 5"). Now we show that an

exactquantilal policy can be computed in polynomial time (in particular there is always a rational quantilal policy).We assume geometric time discount throughout.

## Lemma

Consider ξ∈ΔS. Define the linear operators E:RS×A→RS and T:RS×A→RS by

Et,sa:=[[t=s]]

Tt,sa:=T(t|s,a)

(Note that this T is the

transposeof the T defined in "Proposition A.3" of the previous essay.)Then, ξ∈ImZ if and only if there is ϕ∈Δ(S×A) s.t.

Eϕ=ξ

(E−λT)ϕ=(1−λ)ζ0

## Proof

(This is actually well known, but we spell out the proof to be self-contained.)

Suppose that ξ∈ImZ. We already know that this implies that there is a

stationarypolicy π:Sk→A s.t. Zπ=ξ (we abuse notation in the obvious way): see the proofs of "Proposition 2" and "Proposition 3". Define the linear operator Tπ:RS→RS byTπts:=Ea∼π(s)[T(t|s,a)]

It follows that

ξ=(1−λ)∞∑n=0λnTπnζ0=(1−λ)(1−λTπ)−1ζ0

(1−λTπ)ξ=(1−λ)ζ0

Define ϕ by

ϕ(s,a):=ξ(s)π(a∣s)

We have

Tπξ=∑s∈SEa∼π(s)[T(s,a)]ξ(s)=∑s∈Sa∈Aπ(a∣s)T(s,a)ξ(s)=∑s∈Sa∈AT(s,a)ϕ(s,a)=Tϕ

Also, obviously Eϕ=ξ. We get

(E−λT)ϕ=ξ−λTπξ=(1−λTπ)ξ=(1−λ)ζ0

Conversely, suppose that ϕ is as above. Since Eϕ=ξ, there is π:Sk→A s.t. for any s∈S, if ξ(s)≠0 then

π(a∣s)=ϕ(s,a)ξ(s)

Again, we have

Zπ=(1−λ)(1−λTπ)−1ζ0

Also, for the same reason as before

(E−λT)ϕ=(1−λTπ)ξ

By the assumption, the left hand side equals (1−λ)ζ0. We conclude

ξ=(1−λ)(1−λTπ)−1ζ0=Zπ

## Theorem

Assuming all parameters are rational like before, there is a polynomial time algorithm that computes a quantilal policy.

## Proof

The algorithm starts by solving the following linear program. The indeterminates are ϕ∈RS×A and QV∈R. The goal is maximizing QV. The constraints are

∀s∈S,a∈A:ϕ(s,a)≥0

∑s∈Sa∈Aϕ(s,a)=1

(E−λT)ϕ=(1−λ)ζ0

∀s∈S∖suppZσ,a∈A:ϕ(s,a)=0

∀s∈suppZσ:QV≤∑t∈SR(t)∑a∈Aϕ(t,a)−ηZσ(s)∑a∈Aϕ(s,a)

Then, the algorithm computes π:Sk→A s.t. for any s∈S, if ∑b∈Aϕ(s,b)>0 then

π(a∣s):=ϕ(s,a)∑b∈Aϕ(s,b)

For s∈S s.t. ∑b∈Aϕ(s,b)=0, π(s) is arbitrary.

Now we need to explain why this algorithm is correct.

Observe that, the first 3 constraints mean that ξ∈RS defined by ξ(s):=∑b∈Aϕ(s,b) lies in ImZ (by Lemma 1) and, moreover, ϕ(s,a)=ξ(s)π(a∣s) for π:Sk→A s.t. ξ=Zπ. It remains to show that the linear program amounts to maximizing Eξ[R]−ηexpD∞(ξ||Zσ) inside ImZ. Indeed, the 4th constraint just means that D∞(ξ||Zσ)<∞. The last constraint implies that we are actually maximizing

mins∈suppZσ(Eξ[R]−ηZσ(s)ξ(s))

The latter is indeed Eξ[R]−ηexpD∞(ξ||Zσ), since every s∈suppZσ corresponds to a pure strategy of the adversary in the corresponding zero-sum game: namely, this strategy is setting the penalty function P:S→[0,∞) to

P(t)=[[t=s]]Zσ(s)

(Strategies that place non-vanishing penalty on states outside of suppZσ become irrelevant after imposing the 4th constraint. The remaining penalty functions form a simplex with vertices as above.)