Proof Section to Crisp Supra-Decision Processes

by Brittany Gelb
17th Sep 2025

This post accompanies Crisp Supra-Decision Processes and contains the proof of the following proposition.

Proposition 1 [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]: Let $M = (S, s_0, A, O, T, B, L, \gamma)$ be a crisp supra-MDP with geometric time discount such that $S$ and $A$ are finite. Then there exists a stationary optimal policy.

Proof: We first recall some notation. Let $A$ denote the set of actions, and let $S$ denote the set of states. Let $(A\times S)^*$ denote the set of histories and $(A\times S)^n \subset (A\times S)^*$ denote the set of histories of length $n$. For $h \in (A\times S)^*$, let $h(A\times S)^\omega$ denote the set of destinies with prefix $h$.

Throughout, we assume that $L_\gamma : (A\times S)^\omega \to [0,1]$ is the sum of the momentary losses at each time step with geometric time discount $\gamma \in [0,1)$. More specifically, given $d \in (A\times S)^\omega$, we write $d = a_0 \prod_{t=1}^{\infty} s_t a_t$. Then $L_\gamma(d)$ is given by

$$L_\gamma(d) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t L(s_t, a_t).$$
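
As a quick sanity check on the normalization (our example, not from the post): if the momentary loss equals a constant $c \in [0,1]$ along a destiny $d$, then the geometric series gives

$$L_\gamma(d) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t c = (1-\gamma)\cdot\frac{c}{1-\gamma} = c,$$

so the factor $(1-\gamma)$ keeps $L_\gamma$ valued in $[0,1]$.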

Fix $n \in \mathbb{N}$. Recall that $(A\times S)^\omega$ can be written as the finite disjoint union

$$(A\times S)^\omega = \coprod_{h \in (A\times S)^n} h(A\times S)^\omega.$$
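
For instance (our illustration), taking $n = 1$ sorts the destinies by their first action-state pair into $|A|\cdot|S|$ cells:

$$(A\times S)^\omega = \coprod_{h \in A\times S} h\,(A\times S)^\omega.$$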

This fact, together with Fubini's theorem, implies that the expected loss can be written as an iterated expectation. More concretely, let $\theta \in \Delta(A\times S)^\omega$. Then the (classical) expected loss with respect to $\theta$ can be written as

$$\mathbb{E}_\theta[L_\gamma] = \mathbb{E}_{h \sim \theta_1 \in \Delta(A\times S)^n}\!\left[\mathbb{E}_{d \sim \theta_2 \in \Delta(h(A\times S)^\omega)}[L_\gamma(d)]\right]. \tag{1}$$
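
Written out (our notation: $\theta_1$ is the marginal of $\theta$ on $(A\times S)^n$ and $\theta_2(\cdot \mid h)$ the conditional on $h(A\times S)^\omega$), Equation (1) is the law of total expectation over the finitely many cells of the disjoint union:

$$\mathbb{E}_\theta[L_\gamma] = \sum_{h \in (A\times S)^n} \theta_1(h)\, \mathbb{E}_{d \sim \theta_2(\cdot\mid h)}[L_\gamma(d)].$$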

Equation (1) implies the following decomposition. Let $\Lambda$ denote a crisp causal law. Given a policy $\pi$, let $\pi_n$ denote the partial policy obtained by restricting $\pi$ to the first $n$ time steps. More specifically, let $(A\times O)^{\leq n}$ denote the set of histories of length at most $n$. Then $\pi_n = \pi|_{(A\times O)^{\leq n}}$. Furthermore, let $h\pi$ denote the continuation of $\pi$ after a history $h$, i.e. $h\pi = \pi|_{h(A\times S)^*}$, where $h(A\times S)^*$ denotes the set of histories with prefix $h$. Given a policy $\pi \in \Pi$, let $\Lambda(\pi_n) \in \square(A\times S)^n$ and $\Lambda(h\pi) \in \square(h(A\times S)^\omega)$ denote the credal sets induced by $\pi_n$ and $h\pi$ by restricting $\Lambda$ in the natural way.[1]

Then using the decomposition above, the supra-expectation can be written (in a manner reminiscent of Fubini's theorem) as

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] = \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}[L_\gamma]\right]. \tag{2}$$

To prove (2), we use (1) and observe that

$$\begin{aligned}
\mathbb{E}_{\Lambda(\pi)}[L_\gamma] &:= \max_{\theta \in \Lambda(\pi)} \mathbb{E}_\theta[L_\gamma] \\
&= \max_{\theta_1 \in \Lambda(\pi_n)} \max_{\theta_2 \in \Lambda(h\pi)} \mathbb{E}_{\theta_1}\!\left[\mathbb{E}_{\theta_2}[L_\gamma]\right] \\
&= \max_{\theta_1 \in \Lambda(\pi_n)} \mathbb{E}_{\theta_1}\!\left[\max_{\theta_2 \in \Lambda(h\pi)} \mathbb{E}_{\theta_2}[L_\gamma]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}[L_\gamma]\right].
\end{aligned}$$
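
To spell out the third equality (our gloss, assuming the continuation distribution may be chosen independently for each prefix $h$, as a crisp causal law allows): since $\theta_1(h) \geq 0$, the maximization over a family of continuations $\{\theta_2^h\}_h$ decomposes cell by cell,

$$\max_{\{\theta_2^h \in \Lambda(h\pi)\}_h} \sum_{h} \theta_1(h)\, \mathbb{E}_{\theta_2^h}[L_\gamma] = \sum_{h} \theta_1(h)\, \max_{\theta_2^h \in \Lambda(h\pi)} \mathbb{E}_{\theta_2^h}[L_\gamma].$$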

We now prove the key claim, which states that the expected loss of any policy does not increase when, at some time step, we switch to the policy that is optimal for the current state.

Claim: Given a policy $\pi$, let $\pi^\dagger_n$ denote the policy obtained by following $\pi$ for the first $n$ time steps and then following the policy that is optimal for the state observed at time $n$. Then for all $n \in \mathbb{N}$,

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] \geq \mathbb{E}_{\Lambda(\pi^\dagger_n)}[L_\gamma].$$

Proof of claim: By Equation (2), 

$$\begin{aligned}
\mathbb{E}_{\Lambda(\pi)}[L_\gamma] &= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}[L_\gamma(d)]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=0}^{\infty} \gamma^t L(s_t,a_t)\right]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1} \gamma^t L(s_t,a_t) + \sum_{t=n}^{\infty} \gamma^t L(s_t,a_t)\right)\right]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=0}^{n-1} \gamma^t L(s_t,a_t)\right] + \mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=n}^{\infty} \gamma^t L(s_t,a_t)\right]\right].
\end{aligned}$$

Note that $\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t)\right] = (1-\gamma)\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t)$, since the losses incurred during the first $n$ time steps are determined by the history $h$ and are therefore constant with respect to $\Lambda(h\pi)$. Thus,

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] = \mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(h\pi)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right].$$

Since $\pi^\dagger_n$ follows, from time $n$ onward, a policy that is optimal for the state observed at time $n$,

$$\mathbb{E}_{\Lambda(h\pi)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right] \geq \mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right].$$

By monotonicity, we then have

$$\begin{aligned}
&\mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(h\pi)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right] \\
&\qquad\geq \mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right].
\end{aligned}$$
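
The monotonicity being used here is the standard property of the supra-expectation (stated in our words): if $f \leq g$ pointwise, then

$$\mathbb{E}_{\Lambda(\pi_n)}[f] = \max_{\theta \in \Lambda(\pi_n)} \mathbb{E}_\theta[f] \leq \max_{\theta \in \Lambda(\pi_n)} \mathbb{E}_\theta[g] = \mathbb{E}_{\Lambda(\pi_n)}[g],$$

applied with $f$ and $g$ the two integrands displayed above.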

Repeating the same argument as above,

$$\begin{aligned}
&\mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right] \\
&\qquad= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t L(s_t,a_t)\right]\right].
\end{aligned}$$

By construction, the partial policy obtained by restricting $\pi^\dagger_n$ to the first $n$ time steps is equal to $\pi_n$, i.e. $\pi_n = (\pi^\dagger_n)_n$. Then by Equation (2),

$$\mathbb{E}_{\Lambda(\pi^\dagger_n)}[L_\gamma] = \mathbb{E}_{\Lambda((\pi^\dagger_n)_n)}\!\left[\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t L(s_t,a_t)\right]\right] = \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t L(s_t,a_t)\right]\right].$$

Therefore,

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] \geq \mathbb{E}_{\Lambda(\pi^\dagger_n)}[L_\gamma]. \qquad \square$$

We now apply the claim to finish the proof of the proposition. For ease of notation, let $\dagger(\pi, n) := \pi^\dagger_n$. Let $\pi^*$ denote a given optimal policy. Define the sequence of policies $\{\bar{\pi}_n\}_{n \in \mathbb{N}}$ recursively as follows: $\bar{\pi}_0 := \dagger(\pi^*, 0)$ and $\bar{\pi}_n = \dagger(\bar{\pi}_{n-1}, n)$. The limit of this sequence is a stationary policy, since at every time step the action it prescribes is the one chosen by the policy that is optimal for the state observed at that step; it is optimal by the claim above. $\square$
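
As an aside (ours, not part of the proof): when the transition credal sets of a finite crisp supra-MDP are represented explicitly, say as convex hulls of finitely many candidate transition kernels, a stationary optimal policy can also be computed by worst-case (robust) value iteration, the dynamic-programming counterpart of the statement above. Below is a minimal sketch in Python; the finite-kernel representation and all names (`loss`, `kernels`, `gamma`) are illustrative assumptions, not notation from the post.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the post's construction):
# the credal set at each (s, a) is the convex hull of finitely many
# candidate transition kernels, so the worst case is attained at a vertex.
def robust_value_iteration(loss, kernels, gamma, tol=1e-10, max_iter=10_000):
    """loss:    (S, A) array of momentary losses in [0, 1].
    kernels: (K, S, A, S) array; kernels[k, s, a] is a distribution
             over next states.
    gamma:   geometric time discount in [0, 1)."""
    V = np.zeros(loss.shape[0])
    for _ in range(max_iter):
        expected_V = kernels @ V              # (K, S, A): E_{s' ~ kernel}[V(s')]
        worst_case = expected_V.max(axis=0)   # supra-expectation: worst kernel
        Q = (1 - gamma) * loss + gamma * worst_case
        V_new = Q.min(axis=1)                 # agent minimizes expected loss
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)                # greedy policy: depends only on s
```

The `max` over kernels plays the role of the supra-expectation, and the greedy `argmin` at the fixed point is stationary in exactly the sense of the proposition: the chosen action depends only on the current state.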

1. For ease of notation, we drop the superscripts on $\Lambda$ that appear in the main post.