Proof Section to Crisp Supra-Decision Processes

by Brittany Gelb
17th Sep 2025

This post accompanies Crisp Supra-Decision Processes and contains the proof of the following proposition.

Proposition 1 [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]: Let $M = (S, s_0, A, O, T, B, L, \gamma)$ be a crisp supra-MDP with geometric time discount such that $S$ and $A$ are finite. Then there exists a stationary optimal policy.

Proof: We first recall some notation. Let $A$ denote the set of actions, and let $S$ denote the set of states. Let $(A\times S)^*$ denote the set of histories and $(A\times S)^n \subset (A\times S)^*$ denote the set of histories of length $n$. For $h \in (A\times S)^*$, let $h(A\times S)^\omega$ denote the set of destinies with prefix $h$.

Throughout, we assume that $L_\gamma : (A\times S)^\omega \to [0,1]$ is the sum of the momentary losses at each time step with geometric time discount $\gamma \in [0,1)$. More specifically, given $d \in (A\times S)^\omega$, we write $d = a_0 \prod_{t=1}^{\infty} s_t a_t$. Then $L_\gamma(d)$ is given by

$$L_\gamma(d) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t L(s_t, a_t).$$
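
As a quick sanity check on the normalization (our example, not from the post): if the momentary loss equals a constant $c \in [0,1]$ along a destiny $d$, then the geometric series gives

$$L_\gamma(d) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t c = (1-\gamma)\cdot\frac{c}{1-\gamma} = c,$$

so the factor $(1-\gamma)$ keeps $L_\gamma$ valued in $[0,1]$.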

Fix $n \in \mathbb{N}$. Recall that $(A\times S)^\omega$ can be written as the finite disjoint union

$$(A\times S)^\omega = \coprod_{h \in (A\times S)^n} h(A\times S)^\omega.$$
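
For instance (our illustration), taking $n = 1$ sorts the destinies by their first action-state pair into $|A|\cdot|S|$ cells:

$$(A\times S)^\omega = \coprod_{h \in A\times S} h\,(A\times S)^\omega.$$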

This fact, together with Fubini's theorem, implies that the expected loss can be written as an iterated expectation. More concretely, let $\theta \in \Delta(A\times S)^\omega$. Then the (classical) expected loss with respect to $\theta$ can be written as

$$\mathbb{E}_\theta[L_\gamma] = \mathbb{E}_{h \sim \theta_1 \in \Delta(A\times S)^n}\!\left[\mathbb{E}_{d \sim \theta_2 \in \Delta(h(A\times S)^\omega)}[L_\gamma(d)]\right]. \tag{1}$$
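
Written out (our notation: $\theta_1$ is the marginal of $\theta$ on $(A\times S)^n$ and $\theta_2(\cdot \mid h)$ the conditional on $h(A\times S)^\omega$), Equation (1) is the law of total expectation over the finitely many cells of the disjoint union:

$$\mathbb{E}_\theta[L_\gamma] = \sum_{h \in (A\times S)^n} \theta_1(h)\, \mathbb{E}_{d \sim \theta_2(\cdot\mid h)}[L_\gamma(d)].$$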

Equation (1) implies the following decomposition. Let $\Lambda$ denote a crisp causal law. Given a policy $\pi$, let $\pi_n$ denote the partial policy obtained by restricting $\pi$ to the first $n$ time steps. More specifically, let $(A\times O)^{\leq n}$ denote the set of histories of length at most $n$. Then $\pi_n = \pi|_{(A\times O)^{\leq n}}$. Furthermore, let $h\pi$ denote the continuation of $\pi$ after a history $h$, i.e. $h\pi = \pi|_{h(A\times S)^*}$, where $h(A\times S)^*$ denotes the set of histories with prefix $h$. Given a policy $\pi \in \Pi$, let $\Lambda(\pi_n) \in \square(A\times S)^n$ and $\Lambda(h\pi) \in \square(h(A\times S)^\omega)$ denote the credal sets induced by $\pi_n$ and $h\pi$ by restricting $\Lambda$ in the natural way.[1]

Then using the decomposition above, the supra-expectation can be written (in a manner reminiscent of Fubini's theorem) as

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] = \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}[L_\gamma]\right]. \tag{2}$$

To prove (2), we use (1) and observe that

$$\begin{aligned}
\mathbb{E}_{\Lambda(\pi)}[L_\gamma] &:= \max_{\theta \in \Lambda(\pi)} \mathbb{E}_\theta[L_\gamma] \\
&= \max_{\theta_1 \in \Lambda(\pi_n)} \max_{\theta_2 \in \Lambda(h\pi)} \mathbb{E}_{\theta_1}\!\left[\mathbb{E}_{\theta_2}[L_\gamma]\right] \\
&= \max_{\theta_1 \in \Lambda(\pi_n)} \mathbb{E}_{\theta_1}\!\left[\max_{\theta_2 \in \Lambda(h\pi)} \mathbb{E}_{\theta_2}[L_\gamma]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}[L_\gamma]\right].
\end{aligned}$$
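
To spell out the third equality (our gloss, assuming the continuation distribution may be chosen independently for each prefix $h$, as a crisp causal law allows): since $\theta_1(h) \geq 0$, the maximization over a family of continuations $\{\theta_2^h\}_h$ decomposes cell by cell,

$$\max_{\{\theta_2^h \in \Lambda(h\pi)\}_h} \sum_{h} \theta_1(h)\, \mathbb{E}_{\theta_2^h}[L_\gamma] = \sum_{h} \theta_1(h)\, \max_{\theta_2^h \in \Lambda(h\pi)} \mathbb{E}_{\theta_2^h}[L_\gamma].$$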

We now prove the key claim, which states that the expected loss of any policy does not increase when, at some time step, we switch to the policy that is optimal for the current state.

Claim: Given a policy $\pi$, let $\pi^\dagger_n$ denote the policy obtained by following $\pi$ for the first $n$ time steps and then following the policy that is optimal for the state observed at time $n$. Then for all $n \in \mathbb{N}$,

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] \geq \mathbb{E}_{\Lambda(\pi^\dagger_n)}[L_\gamma].$$

Proof of claim: By Equation (2), 

$$\begin{aligned}
\mathbb{E}_{\Lambda(\pi)}[L_\gamma] &= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}[L_\gamma(d)]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=0}^{\infty} \gamma^t L(s_t,a_t)\right]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1} \gamma^t L(s_t,a_t) + \sum_{t=n}^{\infty} \gamma^t L(s_t,a_t)\right)\right]\right] \\
&= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=0}^{n-1} \gamma^t L(s_t,a_t)\right] + \mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=n}^{\infty} \gamma^t L(s_t,a_t)\right]\right].
\end{aligned}$$

Note that $\mathbb{E}_{\Lambda(h\pi)}\!\left[(1-\gamma)\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t)\right] = (1-\gamma)\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t)$, since the losses incurred during the first $n$ time steps are determined by the history $h$ and are therefore constant with respect to $\Lambda(h\pi)$. Thus,

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] = \mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(h\pi)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right].$$

Since $\pi^\dagger_n$ follows, from time $n$ onward, a policy that is optimal for the state observed at time $n$,

$$\mathbb{E}_{\Lambda(h\pi)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right] \geq \mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right].$$

By monotonicity, we then have

$$\begin{aligned}
&\mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(h\pi)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right] \\
&\qquad\geq \mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right].
\end{aligned}$$
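
The monotonicity being used here is the standard property of the supra-expectation (stated in our words): if $f \leq g$ pointwise, then

$$\mathbb{E}_{\Lambda(\pi_n)}[f] = \max_{\theta \in \Lambda(\pi_n)} \mathbb{E}_\theta[f] \leq \max_{\theta \in \Lambda(\pi_n)} \mathbb{E}_\theta[g] = \mathbb{E}_{\Lambda(\pi_n)}[g],$$

applied with $f$ and $g$ the two integrands displayed above.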

Repeating the same argument as above,

$$\begin{aligned}
&\mathbb{E}_{\Lambda(\pi_n)}\!\left[(1-\gamma)\left(\sum_{t=0}^{n-1}\gamma^t L(s_t,a_t) + \gamma^n\,\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[\sum_{t=n}^{\infty}\gamma^{t-n} L(s_t,a_t)\right]\right)\right] \\
&\qquad= \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t L(s_t,a_t)\right]\right].
\end{aligned}$$

By construction, the partial policy obtained by restricting $\pi^\dagger_n$ to the first $n$ time steps is equal to $\pi_n$, i.e. $\pi_n = (\pi^\dagger_n)_n$. Then by Equation (2),

$$\mathbb{E}_{\Lambda(\pi^\dagger_n)}[L_\gamma] = \mathbb{E}_{\Lambda((\pi^\dagger_n)_n)}\!\left[\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t L(s_t,a_t)\right]\right] = \mathbb{E}_{\Lambda(\pi_n)}\!\left[\mathbb{E}_{\Lambda(\pi^\dagger_n)}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t L(s_t,a_t)\right]\right].$$

Therefore,

$$\mathbb{E}_{\Lambda(\pi)}[L_\gamma] \geq \mathbb{E}_{\Lambda(\pi^\dagger_n)}[L_\gamma]. \qquad \square$$

We now apply the claim to finish the proof of the proposition. For ease of notation, let $\dagger(\pi, n) := \pi^\dagger_n$. Let $\pi^*$ denote a given optimal policy. Define the sequence of policies $\{\bar{\pi}_n\}_{n \in \mathbb{N}}$ recursively as follows: $\bar{\pi}_0 := \dagger(\pi^*, 0)$ and $\bar{\pi}_n = \dagger(\bar{\pi}_{n-1}, n)$. The limit of this sequence is a stationary policy, since at every time step the action it prescribes is the one chosen by the policy that is optimal for the state observed at that step; it is optimal by the claim above. $\square$
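
As an aside (ours, not part of the proof): when the transition credal sets of a finite crisp supra-MDP are represented explicitly, say as convex hulls of finitely many candidate transition kernels, a stationary optimal policy can also be computed by worst-case (robust) value iteration, the dynamic-programming counterpart of the statement above. Below is a minimal sketch in Python; the finite-kernel representation and all names (`loss`, `kernels`, `gamma`) are illustrative assumptions, not notation from the post.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the post's construction):
# the credal set at each (s, a) is the convex hull of finitely many
# candidate transition kernels, so the worst case is attained at a vertex.
def robust_value_iteration(loss, kernels, gamma, tol=1e-10, max_iter=10_000):
    """loss:    (S, A) array of momentary losses in [0, 1].
    kernels: (K, S, A, S) array; kernels[k, s, a] is a distribution
             over next states.
    gamma:   geometric time discount in [0, 1)."""
    V = np.zeros(loss.shape[0])
    for _ in range(max_iter):
        expected_V = kernels @ V              # (K, S, A): E_{s' ~ kernel}[V(s')]
        worst_case = expected_V.max(axis=0)   # supra-expectation: worst kernel
        Q = (1 - gamma) * loss + gamma * worst_case
        V_new = Q.min(axis=1)                 # agent minimizes expected loss
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)                # greedy policy: depends only on s
```

The `max` over kernels plays the role of the supra-expectation, and the greedy `argmin` at the fixed point is stationary in exactly the sense of the proposition: the chosen action depends only on the current state.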

1. For ease of notation, we drop the superscripts on $\Lambda$ that appear in the main post.