Infra-Bayesian Logic

Yegreg

This research was produced during SERI MATS. Thanks to Vanessa Kosoy for providing ideas and reading drafts.

Introduction

In this article, we investigate "infra-Bayesian logic" (IBL for short), which is an idea introduced by Vanessa Kosoy. The original idea is outlined in Vanessa Kosoy's Shortform. Here, we expand with some details and examples. We also try to fix the mentioned issues (e.g. lack of continuity for certain maps) by investigating several candidate categories for semantics. However, we fail to find something that works.

As a theory of beliefs and reasoning, infra-Bayesianism gives rise to semantics for a certain logic, which we might call "infra-Bayesian logic" (IBL for short). Formalizing the syntax for this logic has the advantage of providing a notion of description complexity for infra-POMDPs (partially observable Markov decision processes) expressed via this syntax, and hence could serve as a foundation for an infra-Bayesian analogue of AIXI. Infra-Bayesian logic is not the only way to define an infra-Bayesian analog of AIXI. Such an alternative could be to use oracle machines (with an extra input tape for a source of randomness): For a probabilistic oracle machine, the corresponding IB hypothesis is the closed convex hull of all stochastic environments that result from using the oracle machine with a particular oracle, and then one could use the description complexity of the oracle machine (which can be defined by choosing a universal oracle machine) as a basis for an infra-Bayesian AIXI. The analogue of oracle machines for (non-infra-Bayesian) AIXI would be (probabilistic) Turing machines.

However, infra-Bayesian logic might be an interesting alternative to describing infra-Bayesian hypotheses, and this approach would be specific to infra-Bayesianism (there is, as far as we know, no good analogue of infra-Bayesian logic in (ordinary) Bayesianism).

Note that we are currently unaware of a working semantics for the higher order logic. We discuss some candidates considered in Candidates for the Semantics, and the ways in which they fail. It's unclear whether these issues with the semantics are mostly technical in nature, or whether they point to some deeper phenomena. We can, however, restrict to a first order IBL on finite sets, which, albeit less powerful, is still expressive enough for many applications (in particular covers all the examples in Examples).

See also the paragraph on IBL in the overview of the learning theoretic Agenda.

Outline and Reading Guide

You can skim/skip the Notation section and use it as a reference instead. The main description of IBL is given in Infra-Bayesian Logic. To get an initial and informal idea what IBL looks like, read Less Formal Description of Infra-Bayesian Logic. For a technical description and a complete list of properties, read Formal Description of Infra-Bayesian Logic. Note that the latter description requires some familiarity with category theory.

In Constructions in IBL we list how we can combine basic elements of IBL to express more complicated things.

In Infra-POMDPs we talk about infra-POMDPs as an application of IBL. Some knowledge of infra-Bayesianism is beneficial here, but not required. We highlight the section on Examples, where we describe the IBL terms for concrete Infra-POMDPs.

In Candidates for the Semantics we discuss the various (failed) candidates we tried for the higher-order logic version of IBL. A typical reader is not expected to read this section.

The Appendix is for mostly technical proofs and counterexamples. We recommend to only look these up when you are interested in the technical details. For example, if you wish to understand why conjunction is not continuous in IBL, you can find the details there.

Notation

Let us describe notation needed for using concepts from infra-Bayesianism.

This notation section is partially copied from the Infra-Bayesian Physicalism article, but extended from finite sets to compact Polish spaces, wich requires a bit more measure theory. Also, we use the symbol for homogeneus ultracontributions (instead of $□^{c} X$ as in the IBP article).

The notation from the IBP post is a bit different from some of the other posts on infra-Bayesianism in that it uses ultradistributions/ultracontributions rather than infradistributions.

We denote $R_{+} := [0, \infty)$ . We will work a lot with compact Polish spaces.

Compact Polish spaces are spaces that are topologically equivalent to some compact metric space (the metric is not fixed, however, only the topology). An important special case of compact Polish spaces are finite (discrete) sets.

Definition 1. Given a compact Polish space $X$ , a contribution on $X$ is a (Borel) measure $θ$ on $X$ with $θ (X) \leq 1$ . Given a measurable function $f : X \to R$ and $θ \in Δ^{c} X$ , we denote $θ (f) := \int_{X} f (x) d θ (x)$ . The space of contributions is denoted $Δ^{c} X$ , equipped with the weak- $*$ topology (i.e. the weakest topology such that the functions $Δ^{c} X ∋ θ \mapsto θ (f) \in R$ are continuous for all continuous functions $f : X \to R$ ).

Naturally, any (probability) distribution is in particular a contribution, so $Δ X \subseteq Δ^{c} X$ . There is a natural partial order on contributions: $θ_{1} \leq θ_{2}$ when $θ_{1} (A) \leq θ_{2} (A)$ for all measurable $A \subset X$ .

Definition 2. A homogeneous ultracontribution (HUC) on $X$ is non-empty closed convex $Θ \subseteq Δ^{c} X$ which is downward closed w.r.t. the partial order on $Δ^{c} X$ . The space of HUCs on $X$ is denoted $□ X$ .

If a HUC $Θ \in □ X$ has the property $Θ$ s.t. $Θ \cap Δ X \neq \emptyset$ , we call it a homogeneuous ultradistribution.

Sometimes we need a topology on the space of HUCs $□ X$ . In this case we will use the topology induced by the Hausdorff metric, (where we use a metric on $Δ^{c} X$ that fits with the weak- $*$ topology. The topology induced by the Hausdorff metric is also equivalent to the so-called Vietoris topology, which is a topology that only depends on the underlying toplogy on $Δ^{c} (X)$ , see here. See also Propositions 42 and 50 in Less Basic Inframeasure Theorey for the same result for infra-distributioins). One can show that $□ X$ is a compact Polish space with this topology when $X$ is a compact Polish space, see Lemma 10. Let us introduce more notation.

For a set $A \subset Δ^{c} (X)$ , $cl A$ denotes its closure.

Given another compact Polish space $Y$ and a measurable function $s : X \to Y$ , $s_{*} : □ X \to □ Y$ is the pushforward by $s$ :

s_{*} Θ := cl {s_{*} θ ∣ θ \in Θ},

where $(s_{*} θ) (A) := θ (s^{- 1} (A))$ for all measurable $A \subset Y$ . We can omit the $cl$ in the definition if $s$ is continuous.

A map $X \to □ Y$ is sometimes referred to as an infrakernel.

Given an infrakernel $Ξ : X \to □ Y$ , $Ξ_{*} : □ X \to □ Y$ is the pushforward by $Ξ$ :

\begin{matrix} Ξ_{*} Θ := cl {κ_{*} θ ∣ θ \in Θ, κ : X \to Δ^{c} Y, κ measurable, \forall x \in X : κ (x) \in Ξ (x)}, \end{matrix}

where (for measurable $A \subset Y$ )

(κ_{*} θ) (A) := \int_{X} κ (x) (A) d θ (x) .

For a measurable function $s : X \to Y$ We also define the pullback operator $s^{*} : □ Y \to □ X$ via

s^{*} Θ := cl {θ \in Δ^{c} X ∣ s_{*} θ \in Θ} .

We can omit $cl$ here if $s$ is continuous.

${pr}_{X} : X \times Y \to X$ is the projection mapping, $i_{X} : X \to X ∐ Y$ is the inclusion map.

Given $Θ \in □ X$ and an infrakernel $Ξ : X \to □ Y$ , $Θ ⋉ Ξ \in □ (X \times Y)$ is the semidirect product:

\begin{matrix} Θ ⋉ Ξ := cl {κ ⋉ θ ∣ θ \in Θ, κ : X \to Δ^{c} Y, κ measurable, \forall x \in X : κ (x) \in Ξ (x)}, \end{matrix}

where the measure $κ ⋉ θ$ is defined by

(κ ⋉ θ) (A \times B) := \int_{A} κ (x) (B) d θ (x)

for all measurable $A \subset X$ , $B \subset Y$ , which can be extended to arbitrary measurable sets $C \subset X \times Y$ in the usual measure-theoretic way.

We will also use the notation $Ξ ⋊ Θ \in □ (Y \times X)$ for the same HUC with $X$ and $Y$ flipped. And, for $Λ \in □ Y$ , $Θ ⋉ Λ \in □ (X \times Y)$ is the semidirect product of $Θ$ with the constant ultrakernel whose value is $Λ$ .

For a closed set $A \subset X$ , we define $⊤_{A} := {μ \in Δ^{c} X ∣ μ (X ∖ A) = 0} \in □ X$ (we'll call these sharp ultradistributions). We also define $⊥ \in □ X$ via $⊥ = {0}$ .

Infra-Bayesian Logic

The formal description in this section assumes some familiarity with type theory and categorical logic. For an informal overview of the syntax, we point the reader to Less Formal Description of Infra-Bayesian Logic.

The infra-Bayesian notion of beliefs gives rise to a fairly weak form of logic, in the sense that the spaces of predicates in infra-Bayesian logic don't form Heyting algebras, and are generally non-distributive and even non-modular as lattices. Also, due to the lack of function types, we cannot use the full syntax of type theory.

We will therefore work with a simplified syntax, still maintaining versions of many of the usual constructions appearing in higher order logic. This language can be used to describe infra-POMDPs (see Infra-POMDPs), which in turn is a useful beginning to something like an infra-AIXI (see Application: an Infra-Bayesian AIXI Analogue).

Less Formal Description of Infra-Bayesian Logic

Infra-Bayesian logic is a logical language. The language has types. Types can come from a set of initial types $T^{ι}$ , or can be combined from other types using sums and products. Also, $0$ and $1$ are special types (where $0$ contains no element, and $1$ contains exactly one element, so to speak). There is also a more complicated construction of additional types: If $α$ is a type, then we also consider predicates on $α$ a type, which we denote by $[α]$ .

Infra-Bayesian logic also has maps between types. The set of maps between types $α$ and $β$ is denoted by $F_{α \to β}$ . We will not list every kind of map that we require for infra-Bayesian logic in this informal description. But some examples are $\land_{α} \in F_{[α] \times [α] \to [α]}$ and $\lor_{α} \in F_{[α] \times [α] \to [α]}$ , which correspond to a meaning of "and" and "or".

A map of type $F_{1 \to α}$ is basically the same as something of type $α$ , but it allows us to compose this with other maps. Examples are $⊤ \in F_{1 \to [1]}$ and $⊥ \in F_{1 \to [1]}$ which correspond to "true" and "false". There are also maps for equality and predicate evaluation. All kinds of different maps can be combined and composed to yield expressions in infra-Bayesian logic.

How does this connect to infra-Bayesianism? A collection of sentences in infra-Bayesian logic can have a model $M$ . There are lots of rules for a model to ensure that the symbols of infra-bayesian logic correspond to their intended meaning. A model maps types into topological spaces. In order to make sure the above-mentioned types behave reasonably under the model, we require that $M (0)$ is the empty topological space, $M (1)$ is the $1$ -point topological space, products and sums work as expected ( $M (α \times β) = M (α) \times M (β)$ and $M (α + β) = M (α) ⊔ M (β)$ ). A special case is the topological space for $[α]$ (the type of predicates on $α$ ): We require that $M ([α])$ is the topological space of homogeneous ultra-contributions over $M (α)$ , i.e. $□ M (α)$ . We will also explore alternatives to HUCs over topological spaces, but it should be some general notion of infra-distribution. The model of a map $f \in F_{α \to β}$ is a map between the topological spaces $M (α) \to M (β)$ . We will have conditions that the model of a map is consistent with the meaning. For example, we require that the model of " $\land_{α}$ " maps two HUCs to their intersection.

We will later investigate the precise conditions on toplogical spaces and maps that we would like to have for infra-Bayesian logic.

Formal Description of Infra-Bayesian Logic

Syntax

Given a set $T^{ι}$ of primitive types, we can require that the types form a category $C (T)$ , which should be the "freest" category such that:

$T^{ι} \subset O b (C (T))$

$C (T)$ has finite products and coproducts

$C (T)$ supports infra-Bayesian logic (Definition 3).

We won't construct the category $C (T)$ here (uniqueness is clear). We'll use the shorthand $F_{α \to β} = C (T) (α, β)$ for the set of morphisms, and write $V_{α} = F_{1 \to α}$ (where $1$ is the terminal object).

Definition 3. We say that a category $C$ with finite products and coproducts supports infra-Bayesian logic, if it satisfies the following requirements:

There is a functor $[_] : C \to C^{o p}$ (recall that $C^{o p}$ is the opposite category of $C$ ). For $α \in O b (C (T))$ , $[α]$ is the object intended to correspond to predicates on $α$ . For $f \in C (α, β)$ , we write $f^{*} : [β] \to [α]$ instead of $[f]$ to denote pullback of predicates.

There are morphisms $\lor_{α}, \land_{α} \in F_{[α] \times [α] \to [α]}$ . We require that these operations turn $V_{[α]}$ into a bounded lattice, and we require the functor $[_]$ to induce a functor into the category of lattices. In particular, we have $⊤, ⊥ \in V_{[1]}$ .

We require that ${pr}_{β}^{*} : [β] \to [α \times β]$ have natural left and right adjoints with respect to the poset structure (of the lattice) above. We also require these adjoints to come from morphisms in $C$ , denoted by $\exists_{α β}, \forall_{α β} \in F_{[α \times β] \to [β]}$ respectively.

We require the existence of an equality predicate $=_{α} \in V_{[α \times α]}$ , which is the minimal predicate such that $({diag}_{α}^{*} \circ =_{α}) = ⊤_{α}$ (cf. A constant function with output top).

We require the existence of ${ev}_{α} \in V_{[[α] \times α]}$ (predicate evaluation). Given $f : β \to [α]$ , we can pull back ${ev}_{α}$ via $f \times 1_{α}$ to get $^f \in V_{[β \times α]}$ . Note however that we cannot require universality here (i.e. that every predicate in $V_{[β \times α]}$ arise this way), due to the failure of the disintegration theorem in infra-Bayesianism. It's not entirely clear currently what the appropriate condition to require from ${ev}_{α}$ should be on the syntax side, even though the intended semantics can be described (cf. Item 5 in the semantics).

For $q \in Q \cap [0, 1]$ we use $δ_{q} \in V_{[1 + 1]}$ as a syntactic constant symbol corresponding to a coin toss with probability $q$ .

Note that in Item 6 we could require the following more general set of symbols, however, the relevant ones can likely be approximated via the existing syntax. For $n \in N$ , let $D_{n} \subset □ n$ (treating $n$ as an $n$ -point discrete space) be the subset of "describable" ultracontributions under some such notion. We would then introduce constant symbols $┌ μ ┐ \in V_{[n]}$ for each $μ \in D_{n}$ .

Remark 4. In order to construct the syntax for the first order theory, we can instead consider two categories $C$ (base types) and $D$ (predicate types), with the functor $[_] : C \to D^{o p}$ . In this context the predicate functor $[_]$ cannot be iterated, hence we are left with a first order theory. The requirements in Definition 3 remain the same, except for Item 5 being removed. The construction of a pushforward via a kernel in Pushforward via a kernel no longer makes sense in that generality, but we explain how to recover it for finite sets in the first order theory in Pushforward via kernel for finite sets, which is then used when unrolling the time evolution of an infra-POMDP in Process.

Semantics

We require a model to be a functor $M : C (T) \to P$ , which preserves finite products and coproducts.

Moreover, we require $P$ to support infra-Bayesian logic (cf. Definition 3), and the model functor $M$ to respect the logical structure.

In practice, we will want $P$ to be some subcategory of $T o p$ . The motivating candidate is to take $P$ to be the category of compact Polish spaces with continuous maps, and the predicate functor on $P$ to be $□$ . This choice however doesn't work for higher order logic.

We nevertheless spell out what the logical semantics translate to, using $□$ as a generic symbol denoting some appropriate notion of "ultracontributions".

$M ([α]) = □ M (α)$ (cf. Lemma 10), and $M (f^{*}) = M (f)^{*}$ is the pullback (see Continuity Counterexamples about issues with continuity in the case of HUCs).

We require $M$ to induce a lattice homomorphism $V_{[α]} \to □ M (α)$ , where

$M (\lor_{α}) : □ M (α) \times □ M (α) \to □ M (α)$ is given by convex hull of the union

$M (\land_{α}) : □ M (α) \times □ M (α) \to □ M (α)$ is given by intersection (see Continuity Counterexamples about issues with continuity in the case of HUCs)

Predicates. Let $X = M (α)$ , $Y = M (β)$ , and ${pr}_{Y} : X \times Y \to Y$ be projection onto the second factor. The following follow from the adjunction

$M (\exists_{α β}) = {pr}_{Y *}$ is the pushforward

For $μ \in □ (X \times Y)$ , we have

\begin{matrix} M (\forall_{α β}) (μ) = {p \in □ Y | \forall q \in Δ^{c} (X \times Y) : ({pr}_{Y *} (q) = p ⟹ q \in μ)} \end{matrix}

(see Continuity Counterexamples about issues with continuity in the case of HUCs)

$M (=_{α}) = ⊤_{{diag}_{M (α)}}$ (as a sharp ultradistribution)

If $X = M (α)$ , then $M ({ev}_{α}) = ⊤_{□ X} ⋉ 1_{□ X}$ , where $⊤_{□ X} \in □ □ X$ , and $1_{□ X} : □ X \to □ X$ is considered as an infrakernel.

$M (δ_{q}) \in □ (pt + pt)$ is the crisp ultradistribution corresponding to a coin toss with probability $q \in Q \cap [0, 1]$ (here $pt$ is the one point space).

Note in the more general setting of describable ultracontributions, we would require $M (┌ μ ┐) = μ$ .

Definition 5. A subset of $V_{[1]}$ (i.e. a set of sentences) is called a theory. We say that $M$ models the theory $T$ if $M (ϕ) = ⊤_{pt}$ for all $ϕ \in T$ .

Remark 6. Finding an appropriate category $P$ turns out to be harder than first expected. We discuss various candidates considered in Candidates for the Semantics.

In general, we note that for infra-POMDPs with finite state spaces, the transition (infra-)kernels are always continuous, and some of the issues with continuity mentioned in Candidates for the Semantics are less of a problem.

Constructions in IBL

In the following we construct various useful components of IBL using the syntax outlined above. These constructions will also be used to build examples of infra-POMDPs in Infra-POMDPs.

Pushforward

Given $f : X \to Y$ , we can construct the pushforward $f_{*} : [X] \to [Y]$ as follows. First, consider the following two maps into $[X \times Y]$ :

$1 =_{Y} - - \to [Y \times Y] (f \times 1_{Y})^{*} - ---- \to [X \times Y]$ (this represents the graph of $f$ )

$[X] {pr}_{X}^{*} - - \to [X \times Y]$ .

Composing the product of these two with $\land_{X \times Y} : [X \times Y] \times [X \times Y] \to [X \times Y]$ , we get $[X] = 1 \times [X] \to [X \times Y]$ , and finally post-composing with $[X \times Y] \exists_{X Y} - - \to [Y]$ , we get $f_{*} : [X] \to [Y]$ .

Pushforward via a kernel

Given $f : X \to [Y]$ , we can construct the pushforward $f_{*} : [X] \to [Y]$ as well. This is done similarly to Pushforward, except the element in $[X \times Y]$ is replaced with

1 {ev}_{Y} - - \to [[Y] \times Y] (f \times 1_{Y})^{*} - ---- \to [X \times Y],

where ${ev}_{Y}$ is predicate evaluation.

Pushforward via kernel for finite sets

If $X$ is a finite set, then we can think of a map $f : X \to [Y]$ as a collection of points $f_{x} : 1_{D} \to [Y]$ , which is meaningful in the first order syntax too, as long as the predicate category has a final object $1_{D} \in D$ . In this case we can construct the pushforward along the kernel $f$ in the first order theory as follows. For each $x \in X$ we have the composite

G_{f_{x}} : 1 f_{x} - \to [Y] (ι_{x} \times 1_{Y})_{*} - ---- \to [X \times Y],

where $ι_{x}$ is the inclusion of $x \in X$ , and $(ι_{x} \times 1_{Y})_{*}$ is the regular pushforward from Pushforward. Now, taking the product of these over $x \in X$ , and taking (iterated) disjunction, we have

1 \prod_{x \in X} G_{f_{x}} - ----- \to \prod x \in X [X \times Y] \lor \to [X \times Y],

which is the "graph of $f$ ", using which we construct the pushforward $f_{*} : [X] \to [Y]$ analogously to Pushforward and Pushforward via a kernel.

Equality of functions

Given $f, g : α \to β$ , we want to construct a sentence $s_{f = g} \in V_{[1]}$ . This can be done as follows. First, take the composite with the diagonal on $α$

φ : α {diag}_{α} - -- \to α \times α f \times g - - \to β \times β,

then we have

s_{f = g} : 1 =_{β} - \to [β \times β] φ^{*} - \to [α] \forall_{α 1} - - \to [1],

so $s_{f = g} \in V_{[1]}$ is the desired sentence.

Mixing with probability 1/2

Assume that we have two terms $s_{1} \in F_{1 \to σ_{1}}$ , $s_{2} \in F_{1 \to σ_{2}}$ . Then we have $s_{1} + s_{2} \in F_{1 + 1 \to σ_{1} + σ_{2}}$ . Applying a pushforward yields $(s_{1} + s_{2})_{*} \in F_{[1 + 1] \to [σ_{1} + σ_{2}]}$ . Then we can compose with $δ_{1 / 2} \in F_{1 \to [1 + 1]}$ , which assigns a fair coin distribution on $1 + 1$ .

f : 1 δ_{1 / 2} - - \to [1 + 1] (s_{1} + s_{2})_{*} - ---- \to [σ_{1} + σ_{2}] .

A constant function with output bottom

We can construct a constant function term $f \in F_{α \to [β]}$ whose model $M (f)$ is a constant function that maps everything to $⊥_{M (β)}$ as follows. Let $t_{α} \in F_{α \to 1}$ be the terminal map. Then we can define the term $f$ via

f : α t_{α} - \to 1 ⊥ \to [1] t_{β}^{*} - \to [β] .

This does the right thing: By factoring through $M (1) = pt$ the function $M (f)$ has to be constant, and by using $⊥$ , the output has to be $⊥_{M (β)}$ .

A constant function with output top

This works the same way, just using the symbol $⊤ \in F_{1 \to [1]}$ instead of $⊥ \in F_{1 \to [1]}$ :

f : α t_{α} - \to 1 ⊤ \to [1] t_{β}^{*} - \to [β] .

Components of infrakernels for sum types

For a function $f \in F_{α + β \to [γ + δ]}$ , how do we get access to its components, so that we get a term $f_{α, γ} \in F_{α \to [γ]}$ ? We do this by composing with $i_{α β}$ and $i_{γ δ}^{*}$ , i.e.

f_{α, γ} : α i_{α} - \to α + β f \to [γ + δ] i_{γ}^{*} - \to [γ] .

For the other components this would work essentially in the same way, which would give us functions $f_{α, δ} \in F_{α \to [δ]}$ , $f_{β, γ} \in F_{β \to [γ]}$ , $f_{β, δ} \in F_{β \to [δ]}$ .

Infra-POMDPs

Setup

We can describe infra-POMDPs in the language of infra-Bayesian logic in the following way. Infra-POMDPs consist of the following.

Finite sets of actions $A$ and observations $O$

For each observation $o \in O$ , a type $σ_{o}$ of states producing the given observation. Let $σ_{*} = \sum_{o \in O} σ_{o}$ be the type of all states.

An initial state $s_{0} \in V_{[σ_{*}]}$

For each $a \in A$ , a transition kernel $K_{a} \in F_{σ_{*} \to [σ_{*}]}$ .

Process

Given a state at time $t$ , $s_{t} \in V_{[σ_{*}]}$ , and an action $a \in A$ , we can apply the kernel pushforward $(K_{a})_{*} : [σ_{*}] \to [σ_{*}]$ , to end up in a new state $s_{t + 1} \in V_{[σ_{*}]}$ . Once receiving an observation $o_{t} \in O$ , we can do an update on the observed state. In general, this update depends on the loss function, but we can obtain a "naive" update via pull back and push forward through $i_{o_{t}} : σ_{o_{t}} \to \sum_{o \in O} σ_{o} = σ_{*}$ to get the observed state ${^s}_{t + 1} = (i_{o_{t}})_{*} (i_{o_{t}})^{*} (s_{t + 1}) \in V_{[σ_{*}]}$ .

Laws

In infra-Bayesianism, laws (also known as belief functions in earlier posts on infra-Bayesianism) are functions $Θ : Π \to □ ((A \times O)^{ω})$ , where $Π$ is the space of possible (deterministic) policies of the IB agent (a technical note: the space $(A \times O)^{ω}$ is a Polish space, when equipped with the topology generated by cylinder sets). Such a law describes infra-Bayesian beliefs how a policy will lead to certain histories.

Under some conditions, one can convert an infra-POMDP into a corresponding law. We will not explain exactly how, and instead refer to Theorem 4 in "The many faces of infra-beliefs".

Examples

Let us give an example of an infra-POMDP. The first example is basically just and ordinary MDP, rewritten to fit into the above notation for infra-POMDPs.

Example 7. We set $O = {0, 1}$ , $A = {0, 1}$ . This example has (functionally) no relevant hidden states, i.e. states are equivalent to observations. Taking action $0 \in A$ will lead to the same observation as the previous observation. Taking action $1 \in A$ will lead to the other observation as the previous observation. As for the initial conditions, we start out with probability $\frac{1}{2}$ for each observation/corresponding state.

Let us explain how Example 7 can be described using an IBL theory $T \subset V_{1}$ . We have types $σ_{0}$ and $σ_{1}$ for the states corresponding to the two possible observations. First, we introduce non-logical symbols $s_{1} \in V_{σ_{0}}$ , $s_{2} \in V_{σ_{1}}$ . Then, using the construction in Mixing with probability 1/2, we can define a term in $V_{[σ_{*}]}$ that assigns probability $1 / 2$ to each of $σ_{0}$ , $σ_{1}$ , as desired. Using function equality from Equality of functions, we can equate the initial state $s_{0} \in V_{[σ_{*}]}$ with this mixture. We add the equality sentence to our theory $T$ . This ensures that every model of the theory $T$ satisfies $M (s_{0}) = \frac{1}{2} ⊤_{M (σ_{0})} + \frac{1}{2} ⊤_{M (σ_{1})}$ .

Next, we will add sentences that describe the transition kernels. Note that we have two transition kernels $K_{0}, K_{1} \in F_{σ_{*} \to [σ_{*}]}$ , one for each possible action. Then, for each transition kernel, we can decomposition it, like in Components of infrakernels for sum types. This gives us $8$ possibilities. For example, $(K_{0})_{σ_{0}, σ_{1}}$ describes our beliefs that we will be in state $σ_{1}$ after we take action $0 \in A$ and are in state $σ_{0}$ . For all the $8$ possible combinations, the resulting function should be a constant function, with values $⊥$ or $⊤_{σ_{0, 1}}$ . Again, we can use the equality of functions as described in Equality of functions to ensure that these components of the transition kernels have the correct values, in accordance with the transition kernel as described in the example. We will add these sentences to the theory $T$ , so that $T$ now has 9 elements.

Example 8. The following is a "fully infra" POMDP.

$O = {o}$ , single observation
$σ_{o} = {A, B}$ , two states
$A = {a}$ , single action.

The credal set corresponding to

$K (A)$ is the interval $[\frac{1}{3} A + \frac{2}{3} B, B]$
$K (B)$ is the interval $[A, \frac{2}{3} A + \frac{1}{3} B]$ .

We can take downward closure to end up with a homogeneous ultradistribution.

So a single step in the Markov process swaps $A$ and $B$ , but with some Knightean uncertainty of landing up to $1 / 3$ away from the target. As we iterate through multiple timesteps, we gain no information through observation (since $| O | = 1$ ), so the intervals of Knightean uncertainty widen (the portion outside the Knightean interval shrinks as $(\frac{2}{3})^{t}$ ).

We can construct the above infra-POMDP via IBL as follows. Let

δ_{1 / 3}, δ_{2 / 3} \in V_{[2]}

be coin tosses with probability of heads equal to $1 / 3$ and $2 / 3$ respectively. We could in principle approximate these with iterated fair coin tosses, but we assume them as given for convenience. Let

i_{A}, i_{B} : 1 \to 1 + 1 = σ_{*}

be the two states. Then we have

⊤_{A} : 1 ⊤ \to [1] i_{A *} - - \to [σ_{*}],

and similarly for $⊤_{B}$ . We can construct the intervals of length $1 / 3$ from $A$ and $B$ :

I_{A} = ⊤_{A} \lor δ_{2 / 3} \in V_{[σ_{*}]}

I_{B} = ⊤_{B} \lor δ_{1 / 3} \in V_{[σ_{*}]} .

Finally, we construct the transition kernel $K_{a} = I_{B} + I_{A} : σ_{*} \to [σ_{*}]$ .

Application: an Infra-Bayesian AIXI Analogue

A potential application for infra-Bayesian logic is to describe the hypotheses of an AIXI-like infra-Bayesian agent, by using IBL to describe infra-POMDPs. Let us expand a bit on this idea.

Let's assume an infra-Bayesian agent knows that the environment is an infra-POMDP and also happens to know the finitely many actions $A$ the agent can take each time step, and the finitely many possible observations $O$ the agent could be receiving. However, it does not not know how many states there are. (We will leave rewards out of the picture here. We could include rewards by encoding it in the observations, e.g., by adding a reward $r_{o}$ for each observation. But for now let's focus only on the agent's model of the environment.)

An infra-Bayesian agent would have a set of hypotheses of how the environment could be. In our setting, each hypothesis has the form of an infra-POMDP. Like an AIXI agent, we would like to have a variant of Occam's razor: simpler hypotheses should receive a higher weight.

There is a potential approach to use a finite theory in IBL to describe a hypothesis for our infra-POMDP, and then the weight of this hypothesis in the prior is based on the length of the description of the IBL theory, e.g. proportional to $2^{- 2 l}$ when $l$ is the description length. Note that some IBL theories might have no valid model, so these theories should be discarded from the hypotheses; and there might be other reason why an IBL theory might be excluded from the space of hypotheses, see below.

Each theory in infra-Bayesian logic can have (possibly none or multiple) models. If we use types $σ_{o}$ , $σ_{*}$ and symbols $s_{0} \in V_{[σ_{*}]}$ , $K_{a} \in F_{σ_{*} \to [σ_{*}]}$ as in Setup, then each model $M$ describes an infra-POMDP: The initial state can be described by $M (s_{0})$ , and for each $a \in A$ , $M (K_{a})$ is the transition kernel.

The question arises what do we do if there are multiple models for the same IBL term. Let us describe two approaches, which both rely on converting the infra-POMDP into a corresponding law, see Laws.

A first approach could be to only allow theories in IBL that produce models which are unique up to equivalence, where we say that two models are equivalent if they produce infra-POMDPs that produce the same laws.

A second approach could be to consider all possible models in a first step, then convert these models into corresponding laws, and then take the disjunction (i.e. closed convex hull of all outputs of the laws) to turn it into the final law.

Candidates for the Semantics

Compact Polish Spaces with Continuous Maps

A first candidate for the category $P$ would be compact Polish spaces with continuous maps, and ' $□$ ' being the space of HUCs as defined in Notation.

However, we run into issues with continuity in this category. Namely, the maps $M (\land)$ , $M (\forall_{α β})$ , $M (f)^{*}$ are not always continuous, see Continuity Counterexamples.

Allowing Discontinuity

Another approach could be to bite the bullet on discontinuity and broaden the set of allowed morphisms in the semantics category. The problem with these approaches generally is that pullbacks really only behave nicely with respect to continuous maps, we typically lose functoriality when allowing discontinuity. That is, the desired property $(g \circ f)^{*} = f^{*} \circ g^{*}$ can fail to hold for maps $f : X \to Y$ , $g : Y \to Z$ . This is mostly because for discontinuous functions we have to take the closure in the definition of the pullback (see Notation).

Upper Semicontinuity

This approach is inspired by the fact that while intersection is not continuous on $□ X$ , it is upper semicontinuous (wrt to the set inclusion order) on $□ X$ .

For this approach, we would use the category of compact polish spaces with a closed partial order, and for the maps we require that they are monotone increasing and upper semicontinuous. Here, a function $f : X \to Y$ is upper semicontinuous when

x_{n} \to x \land f (x_{n}) \to y ⟹ y \leq_{Y} f (x)

holds for all sequences ${x_{n}} \subset X$ and points $x \in X$ , $y \in Y$ . As for the partial order on $□ X$ , one could consider the partial order based on set inclusions, or additionally combine this with the stochastic order mentioned in Definition 2.1 in IBP by requiring that elements in $□ X$ are also downwards closed with respect to the stochastic order.

In both cases, the approach with upper semicontinuous maps, runs into the functoriality issues mentioned above, see Semi-continuous Pullback is not Compositional.

Measurable Maps

We could try to broaden the set of morphisms even further, but naturally the issues with functoriality remain.

A Different Notion of HUCs

A possible remedy for the continuity issues could be to use an alternative definition of HUCs, in which the problematic maps become continuous. Roughly speaking, the high level idea here is to require elements of the modified HUCs to "widen" closer to $⊥$ , thus avoiding the discontinuities.

Unfortunately, we were not able to find a working semantics of this type, despite trying different approaches, as described below.

We can require HUCs to be closed under the finer partial order $⪯$ in Definition 9, and modify Definition 2 accordingly. Let $□^{L} (X)$ denote the space of these alternative HUCs. One downside of this approach is that now the space $□^{L} X$ depends on the metric on $X$ , while previously it only depended on the underlying topology.

Definition 9 (downward closure related to 1-Lipschitz functions). Let $X$ be a compact metric space. For $θ_{1}, θ_{2} \in Δ^{c} (X)$ we define the relation

θ_{1} ⪯ θ_{2} : ⟺ \forall f : X \to R_{+}, f 1 -Lipschitz : θ_{1} (1 + f) \leq θ_{2} (1 + f) .

The space $□^{L} (X)$ is defined as the set of closed convex subsets of $Δ^{c} (X)$ that are downward closed under the partial order $⪯$ .

To see that it achieves some of the desired behavior, we show that $M (\land)$ is continuous in Lemma 14. This approach also requires that we work with compact metric spaces, instead of merely with compact polish spaces.

We give some technical details in Standard setting for the alternative HUC case. Later we will look at alternatives, mostly to justify the choices made in Standard setting for the alternative HUC case.

Standard setting for the alternative HUC case

The Category of metric spaces is typically equipped with maps that are $1$ -Lipschitz continuous, and the metric on a product $X \times Y$ of metric spaces $X, Y$ is given by

d_{X \times Y} ((x_{1}, y_{1}), (x_{2}, y_{2})) = max (d_{X} (x_{1}, x_{2}), d_{Y} (y_{1}, y_{2})) .

We do make some further modifications. First, we only consider compact metric spaces whose metric $d$ only has values in $[0, 1]$ . This allows us to consider sums/coproducts of metric spaces (usually the category of metric spaces does not have coproducts). For the metric on a sum (disjoint union) of two metric spaces, we set the distance of two points from different components to $1$ .

Let us define the metric we will use on $□^{L}$ . As a metric on $Δ^{c} (X)$ we will use the Kantorovich-Rubenstein metric

d (p, q) := sup {| p (g) - q (g) | : g : X \to [- 1, 1] 1 -Lipschitz},

which is a metric that is compatible with the weak- $*$ topology on $Δ^{c} (X)$ . Then we use the Hausdorff metric $d_{H}$ on $□^{L}$ . It can be seen that this Hausdorff metric has values in $[0, 1]$ again.

A problem with $□^{L} (X)$ is that it is not always preserved under the pullback, i.e. $f^{*} (Θ) \notin □^{L} (X)$ can happen for some $f : X \to Y$ , $Θ \in □^{L} (Y)$ . Thus, we will redefine the pullback in this context. For a function between metric spaces $f : X \to Y$ we define the modified pullback $f^{* L} : □^{L} Y \to □^{L} X$ via $f^{* L} (Θ) := {μ \in Δ^{c} (Y) | \exists ν \in f^{*} (Θ) : μ ⪯ ν}$ . We will use this alternative pullback for the case of $□^{L}$ .

Under this setting, we run into the issue that $M (\land)$ is not $1$ -Lipschitz continuous, see Example 16. This is despite the fact that $M (\land)$ is continuous (and probably even Lipschitz continuous).

Allow higher Lipschitz constants for maps

One might wonder what happens, when we pick a higher Lipschitz constant $c > 1$ for maps in our category (instead of $c = 1$ ). If we have two functions with a Lipschitz constant $c > 1$ , then we can only guarantee a Lipschitz constant of $c^{2}$ for the composition (e.g. with the function $[0, 1] ∋ x \mapsto min (1, c x) \in [0, 1]$ ). If we repeat this argument, we cannot guarantee a finite Lipschitz constant, because $c^{n} \to \infty$ as $n \to \infty$ .

So, what happens if we have allow maps with Lipschitz constant $2$ ? It turns out that then we run into issues with functoriality, as in Allowing Discontinuity. A concrete counterexample is described in Example 15.

Another metric on products

Instead of the choice for $d_{X \times Y}$ in Standard setting for the alternative HUC case, we could consider using the metric

d_{X \times Y} ((x_{1}, y_{1}), (x_{2}, y_{2})) = d_{X} (x_{1}, x_{2}) + d_{Y} (y_{1}, y_{2}) .

A first problem with this choice is that this is incompatible with any finite bound on $d$ , as is the case in Standard setting for the alternative HUC case. This issue can be avoided by considering extended metric spaces, which are metric spaces whose metric can take values in $[0, \infty]$ (in that case the distance between disjoint components in a sum is set to $\infty$ ).

A more difficult problem with this approach would be that the diagonal map $diag : X \to X \times X$ is not $1$ -Lipschitz continuous for other natural choices of metric on products. We do need the diagonal map, see, e.g., Equality of functions.

Modifying the definition of alternative HUCs

We mention that as an extension, the definition for $⪯$ in Definition 9 could be modified to include a constant $c > 0$ and use $θ_{1} (c + f) \leq θ_{2} (c + f)$ . For smaller $c > 0$ , this would make these alternative HUCs sharper. The version with $c = 1$ has some drawbacks as a lot of probability mass gets assigned to points nearby: For $X = [0, 1]$ , we have $\frac{1}{2} δ_{0} ⪯ δ_{1}$ , so an alternative HUC which contains $δ_{1}$ also has to contain $\frac{1}{2} δ_{0}$ . In the context of infra-Bayesianism, this means we can use these kind of HUCs only in combination together with a smaller class of loss functions (in this case, loss functions that are of type $1 + f$ for a nonnegative Lipschitz continuous $f$ ).

One might wonder whether increasing $c$ might help us decrease the Lipschitz constant for $M (\land)$ or even choose a $c > 1$ such that $M (\land)$ is $1$ -Lipschitz. While the Lipschitz constant does get improved, calculating through Example 16 would show that the Lipschitz constant $M (\land)$ cannot reach $1$ even if we use these modified versions of $□^{L}$ .

Appendix

Lemma 10. If $X$ is a compact Polish space then so is $□ X$ .

Proof. It is known that the space of nonempty compact subsets of $X$ is compact in the Hausdorff metric. One can also show that $□ X$ is a closed subset of the space of nonempty compact subsets of $X$ .

For the related concepts (but not HUCs) in IB theory, this result can also be found in Proposition 47 in "Less Basic Inframeasure Theory" ◻

Issues with Continuity

Using the topology induced by the Hausdorff metric on $□ X$ (see below Definition 2), we find that in the 'logical' semantics (inverse) limit type constructions (in the category theoretical sense) are generally not continuous (see Continuity Counterexamples for details). This includes the semantics for $\land_{α}$ , $\forall_{α β}$ , and the pullback $f^{*}$ .

If $X$ itself is finite, then $□ X$ is nice enough that we don't run into these issues. This observation is sufficient for example in the construction of the infra-POMDPs in Examples when the state spaces are finite. However, we quickly run into issues if we're not careful with iterating the $□$ construction, since $□ X$ is not finite even if $X$ is.

Below is some discussion on a possible approach to trying to fix these issues (we haven't found an entirely satisfactory fix though).

Continuity Counterexamples

Lemma 11. The function $M (\land) : □ M (α) \times □ M (α) \to □ M (α)$ , given by $M (\land) (μ, ν) = μ \cap ν$ is not always continuous.

Proof. Consider the case where $M (α)$ is the space $[0, 1]$ with standard topology. This is a Polish space.

We consider sequences with $x_{i} = 1 / 2 - 1 / i$ and $y_{i} = 1 / 2 + 1 / i$ . We define $x_{0} = y_{0} = 1 / 2$ . Then $⊤_{{x_{i}}}$ and $⊤_{{y_{i}}}$ are in $□ M (α)$ . (For a point $z \in [0, 1]$ , I understand this notation to be $⊤_{{z}} := {a δ_{z} ∣ a \in [0, 1]}$ , where $δ_{z} \in Δ ([0, 1])$ is the measure that puts all weight on the single point $z$ ).

One can calculate that $M (\land) (⊤_{{x_{i}}}, ⊤_{{y_{i}}}) = ⊥$ .

Because of the Hausdorff metric in $□ M (α)$ we have $⊤_{{x_{i}}} \to ⊤_{{x_{0}}}$ and $⊤_{{y_{i}}} \to ⊤_{{x_{0}}}$ .

To put it together, we have

\begin{matrix} lim M (\land) (⊤_{{x_{i}}}, ⊤_{{y_{i}}}) = ⊥ & \neq ⊤_{{x_{0}}} = M (\land) (⊤_{{x_{0}}}, ⊤_{{y_{0}}}) = M (\land) (lim ⊤_{{x_{i}}}, lim ⊤_{{y_{i}}}) . \end{matrix}

Thus, $M (\land)$ is not continuous. ◻

Lemma 12. The function $M (\forall_{α β})$ is not always continuous.

Proof. Consider the case that $M (α) = M (β) = [0, 1]$ . We use the notation $f := M (\forall_{α β})$ . We choose the sequence $μ_{n} = ⊤_{[(0, 0), (1, 1 / n)]}$ , where $[,]$ denotes the line between points. Then one can show that $f (μ_{n}) = ⊥$ .

However, we also have $μ_{n} \to μ_{0} := ⊤_{[(0, 0), (1, 0)]}$ . One can show that $f (μ_{0}) = ⊤_{{0}} \in □ M (β)$ holds. Thus, we have

f (μ_{n}) \to ⊥ \neq f (μ_{0}),

i.e. $f$ is not continuous. ◻

Lemma 13. There is a continuous function $f : X \to Y$ such that $f^{*} : □ Y \to □ X$ is not continuous.

Proof. Pick $f : [- 1, 1] \to [0, 1]$ , $f (x) = max (0, x)$ . Then $f^{*} (⊤_{{1 / n}}) = ⊤_{{1 / n}}$ , but for the limit we have $f^{*} (⊤_{{0}}) = ⊤_{[- 1, 0]}$ , and not $⊤_{{1 / n}} \to ⊤_{[- 1, 0]}$ . ◻

Semi-continuous Pullback is not Compositional

Let $f, g : [0, 1] \to [0, 1]$ be given by

g (x) = {\begin{matrix} 0 & if x < 1 / 2 1 & if x \geq 1 / 2, \end{matrix}

And

f (x) = ⎧ ⎨ ⎩ \begin{matrix} 0 & if x < 1 / 3 1 / 2 & if 1 / 3 \leq x < 2 / 3 1 & if x \geq 2 / 3. \end{matrix}

Both $f$ and $g$ are monotone increasing, upper semi-continuous. Let $C = [0, 1 / 2] \subset [0, 1]$ . Then $B = g^{- 1} (C) = [0, 1 / 2)$ , the closure $¯ ¯¯ ¯ B = [0, 1 / 2]$ . Hence $f^{- 1} (B) = [0, 1 / 3)$ , while $f^{- 1} (¯ ¯¯ ¯ B) = [0, 2 / 3)$ , so

¯ ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ ¯ f^{- 1} (¯ ¯¯ ¯ B) \neq ¯ ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ ¯ f^{- 1} (B) .

We can construct a corresponding counterexample where $(g \circ f)^{*} \neq g^{*} \circ f^{*}$ by considering $Θ \in □ [0, 1]$ , defined as follows. Let $Θ = ⊤_{C}$ be the sharp HUC on $C$ . Then

g_{*}^{- 1} (Θ) = {ρ \in Δ^{c} [0, 1] : supp (ρ) \subset [0, 1 / 2)}

g^{*} (Θ) = ¯ ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ ¯ g_{*}^{- 1} (Θ) = {ρ \in Δ^{c} [0, 1] : supp (ρ) \subset [0, 1 / 2]} .

Then

(g \circ f)^{*} (Θ) = {ρ \in Δ^{c} [0, 1] : supp (ρ) \subset [0, 1 / 3]},

but

(f^{*} \circ g^{*}) (Θ) = f^{*} (g^{*} (Θ)) = {ρ \in Δ^{c} [0, 1] : supp (ρ) \subset [0, 2 / 3]},

which is different.

Stricter Homogeneity

We collect some technical results for A Different Notion of HUCs

With Definition 9 we can prove continuity for $M (\land)$ , which does not hold for ordinary HUCs.

Lemma 14. The function $M (\land) : □^{L} M (α) \times □^{L} M (α) \to □^{L} M (α)$ , given by $M (\land) (μ, ν) = μ \cap ν$ is continuous.

Proof. We set $X := M (α)$ . As a metric on $Δ^{c} (X)$ we will use the Kantorovich-Rubenstein metric

\begin{matrix} d (p, q) := sup {| p (g) - q (g) | : g : X \to [- 1, 1] 1 -Lipschitz} . \\ (1) \end{matrix}

Let $μ_{k}$ , $ν_{k}$ be sequences in $□^{L} (X)$ with $μ_{k} \to μ_{0}$ , $ν_{k} \to ν_{0}$ . We want to show that $μ_{k} \cap ν_{k} \to μ_{0} \cap ν_{0}$ . We have

\begin{matrix} d_{H} (μ_{k} \cap ν_{k}, μ_{0} \cap ν_{0}) \leq sup p \in μ_{k} \cap ν_{k} d (p, μ_{0} \cap ν_{0}) + sup p \in μ_{0} \cap ν_{0} d (p, μ_{k} \cap ν_{k}) . \\ (2) \end{matrix}

For the first term, let $p_{k} \in μ_{k} \cap ν_{k}$ be such that $d (p_{k}, μ_{0} \cap ν_{0})$ is maximized. By compactness, there exists a convergent subsequence of $p_{k}$ . Wlog let $p_{k} \to p_{0}$ . We have $p_{k} \in μ_{k}$ for all $k$ , and taking limits implies $p_{0} \in μ_{0}$ . The same argument also implies $p_{0} \in ν_{0}$ . Thus, we have

sup p \in μ_{k} \cap ν_{k} d (p, μ_{0} \cap ν_{0}) = d (p_{k}, μ_{0} \cap ν_{0}) \to d (p_{0}, μ_{0} \cap ν_{0}) = 0.

For the second term, let $p_{0} \in μ_{0} \cap ν_{0}$ be given. Due to $μ_{k} \to μ_{0}$ , there exists a sequence $p_{k}$ with $p_{k} \in μ_{k}$ and $p_{k} \to p_{0}$ . Similarly, there exists a sequence $q_{k}$ with $q_{k} \in ν_{k}$ and $q_{k} \to q_{0}$ . The idea is now to construct a measure (i.e. a point in $Δ^{c} (X)$ ) $r_{k}$ that is close to $p_{k}$ and $q_{k}$ but in $μ_{k} \cap ν_{k}$ .

Wlog we assume $p_{0} (1) > 0$ and $p_{k} (1) > 0$ , $q_{k} (1) > 0$ for all $k$ . We define $δ_{k} > 0$ via

δ_{k} := \frac{d (p_{k}, q_{k}) (2 + {diam}_{X})}{(p_{k} + q_{k}) (1)},

where ${diam}_{X} := {sup}_{x, y \in X} d (x, y)$ is the diameter of $X$ . Clearly, we have $δ_{k} \to 0$ . We also define

r_{k} := \frac{1 - δ_{k}}{2} (q_{k} + p_{k}) .

We want to show $r_{k} ⪯ q_{k}$ and $r_{k} ⪯ p_{k}$ . By symmetry we only need to show $r_{k} ⪯ p_{k}$ . For that purpose, let $f : X \to R_{+}$ be an arbitrary $1$ -Lipschitz continuous function. We define $g := 1 + f$ . We have to show $r_{k} (g) \leq p_{k} (g)$ .

We can decompose $g = g_{min} + g_{0}$ , where $g_{min} := {min}_{x \in X} g (x)$ is a constant, and $g_{0}$ is a nonnegative function that satisfies $g_{0} (x_{0}) = 0$ for some $x_{0} \in X$ . Since $g_{0}$ is $1$ -Lipschitz continuous, this implies that $g_{0} (x) \leq {diam}_{X}$ for all $x \in X$ . Thus, $(1 + {diam}_{X})^{- 1} g_{0}$ is a function that is $1$ -Lipschitz continuous and bounded by $1$ , i.e. admissible in the definition Eqn. (1). We can use this to obtain

\begin{matrix} (q_{k} - p_{k}) (g_{0}) & = (1 + {diam}_{X}) (q_{k} - p_{k}) ((1 + {diam}_{X})^{- 1} g_{0}) \leq (1 + {diam}_{X}) d (q_{k}, p_{k}) \leq \frac{d (p_{k}, q_{k}) (1 + {diam}_{X})}{(p_{k} + q_{k}) (1)} (p_{k} + q_{k}) (g) . \end{matrix}

On the other hand, we have

\begin{matrix} (q_{k} - p_{k}) (g_{min}) & = g_{min} (q_{k} - p_{k}) (1) \leq g_{min} d (p_{k}, q_{k}) \leq \frac{d (p_{k}, q_{k})}{(p_{k} + q_{k}) (1)} (p_{k} + q_{k}) (g_{min}) \leq \frac{d (p_{k}, q_{k})}{(p_{k} + q_{k}) (1)} (p_{k} + q_{k}) (g) \end{matrix}

Adding these yields

\begin{matrix} (q_{k} - p_{k}) (g) & = (q_{k} - p_{k}) (g_{min} + g_{0}) \leq \frac{d (p_{k}, q_{k}) (1 + {diam}_{X} + 1)}{(p_{k} + q_{k}) (1)} (p_{k} + q_{k}) (g) = δ_{k} (p_{k} + q_{k}) (g) . \end{matrix}

If we rearrange this inequality, we get $(1 - δ_{k}) (p_{k} + q_{k}) (g) \leq 2 p_{k} (g)$ . Dividing by $2$ yields $r_{k} (g) \leq q_{k} (g)$ , Since $g = f + 1$ and $f \geq 0$ was an arbitrary $1$ -Lipschitz function this proves $r_{k} ⪯ p_{k}$ . Since $p_{k} \in μ_{k}$ and $μ_{k}$ is downwards closed wrt $⪯$ , this implies $r_{k} \in μ_{k}$ . Similarly, we can obtain $r_{k} ⪯ q_{k}$ and $r_{k} \in ν_{k}$ . From the definition of $r_{k}$ we also obtain $r_{k} \to p_{0}$ (due to $p_{k} \to p_{0}$ and $q_{k} \to q_{0}$ ). Thus we have $d (p_{0}, μ_{k} \cap ν_{k}) \leq d (p_{0}, r_{k}) + d (r_{k}, μ_{k} \cap ν_{k}) = d (p_{0}, r_{k}) + 0 \to 0$ as $k \to \infty$ . Since $p_{0} \in μ_{0} \cap ν_{0}$ was arbitrary, this implies that the right-hand side in Eqn. (2) converges to $0$ . This implies $μ_{k} \cap ν_{k} \to μ_{0} \cap ν_{0}$ in the Hausdorff metric, which completes the proof. ◻

In the next we will use the notation $f^{* L}$ for the modified pullback, see Standard setting for the alternative HUC case.

Example 15. Consider $X = [0, 1]$ as a compact metric space with the usual metric. Further, let $f, g : X \to X$ be given by $f (x) = min (2 x, 1)$ , $g (x) = x / 2$ for all $x \in [0, 1]$ . Then $g^{* L} \circ f^{* L} \neq (f \circ g)^{* L}$ , i.e. functoriality does not hold.

Proof. Clearly, we have $(f \circ g) (x) = x$ for all $x \in [0, 1]$ .

We define

Θ_{0} := {μ \in Δ^{c} ([0, 1]) : μ ⪯ δ_{0}} .

We claim that $(g^{* L} \circ f^{* L}) (Θ_{0}) \neq (f \circ g)^{* L} (Θ_{0})$ holds. Due to $(f \circ g) (x) = x$ for all $x \in [0, 1]$ this simplifies to

\begin{matrix} (g^{* L} \circ f^{* L}) (Θ_{0}) \neq Θ_{0} . \\ (3) \end{matrix}

We start by observing $δ_{0} \in Θ_{0}$ and $δ_{0} \in f^{* L} (Θ_{0})$ .

We also claim $\frac{3}{4} δ_{1 / 3} ⪯ δ_{0}$ . Indeed, for $1$ -Lipschitz continuous $h : [0, 1] \to R^{+}$ we have

\frac{3}{4} δ_{1 / 3} (1 + h) = \frac{3}{4} (1 + h (\frac{1}{3})) \leq \frac{3}{4} (1 + h (0) + \frac{1}{3}) = \frac{3}{4} (\frac{4}{3} + h (0)) \leq 1 + h (0) = δ_{0} (1 + h) .

Due to $δ_{0} \in f^{* L} (Θ_{0}) \in □^{L} [0, 1]$ this implies $\frac{3}{4} δ_{1 / 3} \in f^{* L} (Θ_{0})$ . Due to $g (2 / 3) = 1 / 3$ we have $g_{*} (\frac{3}{4} δ_{2 / 3}) = \frac{3}{4} δ_{1 / 3}$ . This implies $\frac{3}{4} δ_{2 / 3} \in g^{*} (f^{* L} (Θ_{0})) \subset (g^{* L} \circ f^{* L}) (Θ_{0})$ .

However, we can also show that $\frac{3}{4} δ_{2 / 3} \notin Θ_{0}$ . Indeed, lets consider the $1$ -Lipschitz continuous function $h (x) = x$ . Then $\frac{3}{4} δ_{2 / 3} (1 + h) = \frac{3}{4} (1 + 2 / 3) = 5 / 4 ≰ 1 = δ_{0} (1 + h) .$ Thus $\frac{3}{4} δ_{2 / 3} \notin Θ_{0}$ . This completes the proof of Eqn. (3) and hence completes the proof of the lemma. ◻

Example 16. Let $X = {0, 1}$ be the compact metric space where the metric satisfies $d (0, 1) = 1$ . We use the standard setting from Standard setting for the alternative HUC case otherwise. Then the function $M (\land) : □^{L} M (α) \times □^{L} M (α) \to □^{L} M (α)$ , given by $M (\land) (Θ_{1}, Θ_{2}) = Θ_{1} \cap Θ_{2}$ is not $1$ -Lipschitz continuous.

Proof. Consider the measures $μ_{0} = \frac{1}{2} δ_{0}$ , $μ_{1} = \frac{1}{2} δ_{1}$ , $μ_{2} = \frac{1}{2} δ_{0} + \frac{1}{2} δ_{1}$ .

Let $Θ_{i}$ be the smallest set in $□^{L}$ that contains $μ_{i}$ , for $i = 0, 1, 2$ . Then it can be calculated that

\begin{matrix} Θ_{0} & = {α_{0} δ_{0} + α_{1} δ_{1} ∣ α_{0} + 2 α_{1} \leq \frac{1}{2}, α_{0}, α_{1} \geq 0}, Θ_{1} & = {α_{0} δ_{0} + α_{1} δ_{1} ∣ 2 α_{0} + α_{1} \leq \frac{1}{2}, α_{0}, α_{1} \geq 0}, Θ_{2} & = {α_{0} δ_{0} + α_{1} δ_{1} ∣ 2 α_{0} + α_{1} \leq \frac{3}{2}, α_{0} + 2 α_{1} \leq \frac{3}{2}, α_{0}, α_{1} \geq 0} . \end{matrix}

We want to calculate $d_{H} (Θ_{2}, Θ_{0})$ . Due to $Θ_{0} \subset Θ_{2}$ , we have $d_{H} (Θ_{2}, Θ_{0}) = {sup}_{μ \in Θ_{2}} d (μ, Θ_{0})$ . Since the supremum of a convex function is attained on extremal points, we only need to consider $d (μ, Θ_{0})$ for the case that $μ$ is an extremal point of $Θ_{2}$ . The extremal points of $Θ_{2}$ are $μ_{2} = \frac{1}{2} δ_{0} + \frac{1}{2} δ_{1}$ , $\frac{3}{4} δ_{0}$ , $\frac{3}{4} δ_{1}$ , $0$ . Let us calculate the distances (or bounds on distances) of these points to $Θ_{0}$ . We start with $μ_{2}$ . We observe $\frac{1}{2} δ_{0} \in Θ_{0}$ . This implies

\begin{matrix} d (μ_{2}, Θ_{0}) & \leq d (μ_{2}, \frac{1}{2} δ_{0}) = sup {| μ_{2} (g) - \frac{1}{2} δ_{0} (g) | : g : X \to [- 1, 1] 1 -Lipschitz} = sup {| \frac{1}{2} δ_{1} (g) | : g : X \to [- 1, 1] 1 -Lipschitz} = \frac{1}{2} . \end{matrix}

On the other hand, we have $μ_{2} (1) = 1$ and $μ (1) \leq \frac{1}{2}$ for all $μ \in Θ_{0}$ . Since $1$ is admissible in Eqn. (1), this implies $d (μ_{2}, μ) \geq \frac{1}{2}$ for all $μ \in Θ_{0}$ , and therefore $d (μ_{2}, Θ_{0}) \geq \frac{1}{2}$ . Thus, $d (μ_{2}, Θ_{0}) = \frac{1}{2}$ .

Next, we calculate a bound on $d (\frac{3}{4} δ_{0}, Θ_{0})$ . As we have $\frac{1}{2} δ_{0} \in Θ_{0}$ , we obtain

\begin{matrix} d (\frac{3}{4} δ_{0}, Θ_{0}) & \leq d (\frac{3}{4} δ_{0}, \frac{1}{2} δ_{0}) = sup {| \frac{3}{4} δ_{0} (g) - \frac{1}{2} δ_{0} (g) | : g : X \to [- 1, 1] 1 -Lipschitz} = sup {| \frac{1}{4} δ_{0} (g) | : g : X \to [- 1, 1] 1 -Lipschitz} = \frac{1}{4} . \end{matrix}

Similarly, for a bound on $d (\frac{3}{4} δ_{1}, Θ_{0})$ , we observe $\frac{1}{4} δ_{1} \in Θ_{0}$ and calculate

\begin{matrix} d (\frac{3}{4} δ_{1}, Θ_{0}) & \leq d (\frac{3}{4} δ_{1}, \frac{1}{4} δ_{1}) = sup {| \frac{3}{4} δ_{1} (g) - \frac{1}{4} δ_{1} (g) | : g : X \to [- 1, 1] 1 -Lipschitz} = sup {| \frac{1}{2} δ_{1} (g) | : g : X \to [- 1, 1] 1 -Lipschitz} = \frac{1}{2} . \end{matrix}

Finally, $d (0, Θ_{0}) = 0$ holds due to $0 \in Θ_{0}$ . In summary, we have shown for all extremal points of $Θ_{2}$ that the distances to $Θ_{0}$ do not exceed $\frac{1}{2}$ (and obtain the value $\frac{1}{2}$ in at least one case). This implies $d_{H} (Θ_{2}, Θ_{0}) = \frac{1}{2}$ .

By symmetry we also have $d (Θ_{2}, Θ_{1}) = \frac{1}{2}$ .

Now, let us consider the intersection $Θ_{0} \cap Θ_{1}$ . Using the above descriptions for $Θ_{0}$ and $Θ_{1}$ we can obtain the subset relation

\begin{matrix} Θ_{0} \cap Θ_{1} & \subset {α_{0} δ_{0} + α_{1} δ_{1} ∣ α_{0} + α_{1} \leq \frac{1}{3}, α_{0}, α_{1} \geq 0} = {μ \in Δ^{c} (X) ∣ μ (1) \leq \frac{1}{3}} . \end{matrix}

This implies $μ (1) \leq \frac{1}{3}$ for all $μ \in Θ_{0} \cap Θ_{1}$ . Recall that we have $μ_{2} \in Θ_{2}$ and $μ_{2} (1) = 1$ . By using $g = 1$ in Eqn. (1), it follows that $d (μ_{2}, μ) \geq 1 - \frac{1}{3} = \frac{2}{3}$ for all $μ \in Θ_{0} \cap Θ_{1}$ . Thus,

\begin{matrix} d (Θ_{2}, Θ_{0} \cap Θ_{1}) & \geq d (μ_{2}, Θ_{0} \cap Θ_{1}) = inf {d (μ_{2}, μ) : μ \in Θ_{0} \cap Θ_{1}} \geq \frac{2}{3} . \end{matrix}

Combining all these results shows

\begin{matrix} d_{H} (Θ_{2} \cap Θ_{2}, Θ_{0} \cap Θ_{1}) & \geq \frac{2}{3} > \frac{1}{2} = max (d_{H} (Θ_{2}, Θ_{0}), d (Θ_{2}, Θ_{1})) = d ((Θ_{2}, Θ_{2}), (Θ_{0}, Θ_{1})), \end{matrix}

which shows that the Lipschitz constant has to be at least $\frac{3}{2} > 1$ . ◻

Measurability and Continuity Proofs

Let us collect proofs that some maps between compact Polish spaces that are mentioned in Semantics are indeed measurable, as required. This concerns $M (\exists_{α β})$ , $M (\forall_{α β})$ , $M (\lor_{α})$ , $M (\land_{α})$ .

Lemma 17. The function $M (\lor) : □ M (α) \times □ M (α)$ is continuous and measurable.

Proof. Let $X := M (α)$ and $f := M (\lor)$ . Recall that $f (μ, ν) := conv (μ \cup ν)$ . Let $μ_{n}, ν_{n} \in □ X$ be sequences with $μ_{n} \to μ_{0}$ and $ν_{n} \to ν_{0}$ .

We need to show $f (μ_{n}, ν_{n}) \to f (μ_{0}, ν_{0})$ for continuity. We do this by showing two things: First, every limmit $z_{0}$ of a convergent sequence $z_{n}$ with $z_{n} \in f (μ_{n}, ν_{n})$ satisfies $z_{0} \in f (μ_{0}, ν_{0})$ . Second, for every point $z_{0} \in f (μ_{0}, ν_{0})$ there exists a sequence $z_{n}$ with $z_{n} \in f (μ_{n}, ν_{n})$ .

So, first, let $z_{n}$ be a sequence with $z_{n} \in f (μ_{n}, ν_{n})$ and $z_{n} \to z_{0}$ . We have $z_{n} \in conv (μ_{n} \cup ν_{n})$ , and thus there exists $x_{n} \in μ_{n}$ , $y_{n} \in ν_{n}$ , $α_{n} \in [0, 1]$ with $z_{n} = α_{n} x_{n} + (1 - α_{n}) y_{n}$ . Wlog we have convergences $x_{n} \to x_{0} \in μ_{0}$ , $y_{n} \to y_{0} \in ν_{0}$ , $α_{n} \to α_{0} \in [0, 1]$ . This implies $z_{0} = α_{0} x_{0} + (1 - α_{0}) y_{0} \in conv (μ_{0} \cup ν_{0})$ .

For the other part, let $z_{0} \in f (μ_{0}, ν_{0})$ be given. Thus, there exist $x_{0} \in μ_{0}$ , $y_{0} \in μ_{0}$ , $α_{0} \in [0, 1]$ with $z_{0} = α_{0} x_{0} + (1 - α_{0}) y_{0}$ . Due to $μ_{n} \to μ_{0}$ and $ν_{n} \to ν_{0}$ there exist sequences with $x_{n}$ , $y_{n}$ with $x_{n} \in μ_{n}$ , $y_{n} \in ν_{n}$ , and $x_{n} \to x_{0}$ , $y_{n} \to y_{0}$ . Then, we have $z_{n} := α_{0} x_{n} + (1 - α_{0}) y_{n} \in conv (μ_{n} \cup ν_{n})$ . Taking limits implies $z_{n} \to z_{0}$ . This completes the proof of continuity of $f$ . ◻

Lemma 18. If $f : X \to Y$ is a continuous function between compact Polish spaces, then $f_{*} : □ X \to □ Y$ is continuous.

In particular, the function $M (\exists_{α β}) = {pr}_{M (β) *}$ is continuous and measurable.

Proof. Let $μ_{k}$ be a sequence in $□ (X \times Y)$ with $μ_{k} \to μ_{0} \in □ (X \times Y)$ . We want to show $f_{*} (μ_{k}) \to f_{*} (μ_{0})$ . As before, we do this by showing two things. First, every convergent sequence $p_{k} \in f_{*} (μ_{k})$ with $p_{k} \to p_{0}$ implies $p_{0} \in f_{*} (μ_{0})$ . Second, for every point $p_{0} \in f_{*} (μ_{0})$ we can find a sequence $p_{k}$ with $p_{k} \to p_{0}$ and $p_{k} \in f_{*} (μ_{k})$ .

Note that $f_{*} : Δ^{c} (X) \to Δ^{c} (Y)$ is continuous if $f$ is continuous.

Let $p_{k} \in f_{*} (μ_{k})$ be a convergent sequence with $p_{k} \to p_{0}$ . Then there exists $q_{k} \in μ_{k}$ with $p_{k} = f_{*} (q_{k})$ . Wlog $q_{k} \to q_{0}$ for some $q_{0}$ due to compactness. This implies $f_{*} (q_{k}) \to f_{*} (q_{0})$ and $q_{0} \in μ_{0}$ . Due to uniqueness of the limit, we have $f_{*} (q_{0}) = p_{0}$ and therefore the desired $p_{0} \in f_{*} (μ_{0})$ .

For the remaining part, let $p_{0} \in f_{*} (μ_{0})$ be given. Then there exists $q_{0} \in μ_{0}$ with $p_{0} = f_{*} (q_{0})$ . Since $μ_{k} \to μ_{0}$ , there exists a sequence $q_{k}$ with $q_{k} \in μ_{k}$ and $q_{k} \to q_{0}$ . We define $p_{k} := f_{*} (q_{k})$ . Then we have $p_{k} \to f_{*} (q_{0}) = p_{0}$ and $p_{k} \in f_{*} (μ_{k})$ . This completes the proof. ◻

Lemma 19. The function $M (\land) : □ M (α) \times □ M (α) \to □ M (α)$ , given by $M (\land) (μ, ν) = μ \cap ν$ is measurable.

Proof. We use the abbreviation $X := M (α)$ , and use the notation $d (\cdot, \cdot)$ for the metric on $Δ^{c} (X)$ and $d_{H} (\cdot, \cdot)$ for the induced Hausdorff metric on $□ X$ . To show measurability, it suffices to show that the preimages of open balls are measurable. Thus, we need to show that ${(A, B) \in (□ X)^{2} : d_{H} (A \cap B, C) < ε}$ is measurable for all $C \in □ X$ and $ε > 0$ .

First, we show an auxiliary claim.

Claim 1: Let $D \subset X$ be a compact subset. Then $⋃_{y \in D} {(A, B) : y \in A \cap B}$ is closed and measurable.

Let us show that the claim is true. Let $(A_{k}, B_{k})$ be a sequence in the set with $d_{H} (A_{k}, A) \to 0$ , $d_{H} (B_{k}, B) \to 0$ . Then there exists $y_{k} \in D$ with $y_{k} \in A$ and $y_{k} \in B$ . Wlog $y_{k} \to y$ for some $y \in D$ (because of convergent subsequences). Then $d_{H} (A, y) = lim d_{H} (A_{k}, y_{k}) = 0$ and thus $y \in A$ . Similar, $y \in B$ . Thus, the limit $(A, B)$ of $(A_{k}, B_{k})$ is in the set $⋃_{y \in D} {(A, B) : y \in A \cap B}$ . Thus, the set is closed and measurable.

We have ${(A, B) : d_{H} (A \cap B, C) < ε} = {(A, B) : \forall x \in A \cap B : d (x, C) < ε} \cap {(A, B) : \forall x \in C : d (A \cap B, x) < ε}$ .

Let us show measurability of ${(A, B) : \forall x \in C : d (A \cap B, x) < ε}$ . We have

\begin{matrix} {(A, B) : \forall x \in C : d (A \cap B, x) < ε} = ⋃ δ \in (0, ε) \cap Q {(A, B) : \forall x \in C : d (A \cap B, x) \leq δ} = ⋃ δ \in (0, ε) \cap Q ⋂ x \in C {(A, B) : d (A \cap B, x) \leq δ} = ⋃ δ \in (0, ε) \cap Q ⋂ x \in C ⋃ y : d (y, x) \leq δ {(A, B) : y \in A \cap B} \end{matrix}

For each $δ \in (0, ε) \cap Q$ , $x \in C$ , the set $⋃_{y : d (y, x) \leq δ} {(A, B) : y \in A \cap B}$ is closed due to claim 1, because ${y : d (y, x) \leq δ}$ is a compact set. Recall that intersections of closed sets are closed. Thus, the set in question is the countable union of closed sets, and therefore measurable.

Now, let us show the measurability of ${(A, B) : \forall x \in A \cap B : d (x, C) < ε}$ . We have

\begin{matrix} {(A, B) : \forall x \in A \cap B : d (x, C) < ε} = {(A, B) : \forall x : d (x, C) \geq ε ⟹ x \notin A \cap B} = ⋂ x : d (x, C) \geq ε {(A, B) : x \notin A \cap B} = (□ X)^{2} ∖ ⎛ ⎝ ⋃ x : d (x, C) \geq ε {(A, B) : x \in A \cap B} ⎞ ⎠ . \end{matrix}

Again, since the set ${x : d (x, C) \geq ε}$ is compact, it follows that the above set is measurable by claim 1.

This completes the proof that ${(A, B) : d_{H} (A \cap B, C) < ε}$ is measurable. ◻

Lemma 20. Let $X, Y$ be compact metric spaces, $q_{0} \in Δ (X \times Y)$ and $p_{1} \in Δ (Y)$ be given. Then there exists $q_{1} \in Δ (X \times Y)$ with ${pr}_{Y *} q_{1} = p_{1}$ and $W_{1} (q_{0}, q_{1}) \leq W_{1} ({pr}_{Y *} q_{0}, p_{1})$ , where $W_{1}$ is the Wasserstein metric.

Proof. We set $p_{0} := {pr}_{Y *} q_{0} \in Δ (Y)$ . In the following, we will also use the notation ${pr}_{i} : Y \times Y \to Y$ , ${pr}_{i} (x) = x_{i}$ for $i = 1, 2$ .

Then, due to the definition of $W_{1}$ , there exists a measure $μ \in Δ (Y \times Y)$ with ${pr}_{1 *} μ = p_{0}$ , ${pr}_{2 *} μ = p_{1}$ , $μ (d_{Y}) = W_{1} (p_{0}, p_{1})$ (due to compactness arguments, the infimum is attained in the definition of $W_{1}$ ). We choose a measure $ν \in Δ (X \times Y \times X \times Y)$ via

ν (A \times B \times C \times D) = \int_{B} q_{0} (A \cap C | y)^μ (D | y) d p_{0} (y),

where $y \mapsto q_{0} (^A | y)$ denotes a density function that satisfies $q_{0} (^A \times^B) = \int_{^B} q_{0} (A | y) d p_{0} (y)$ for all measurable $^A \subset X$ , $^B \subset Y$ , and $y \mapsto^μ (D | y)$ is the same for the "mirrored" measure $^μ$ of $μ$ , i.e. $μ (^B \times D) = \int_{^B}^μ (D | y), d p_{0} (y)$ for all measurable $^B \subset Y$ . These density functions exist due to Radon-Nikodym. The measure $ν$ exists due to Caratheodory's theorem and is unique.

The measure $ν$ has the properties

{pr}_{4 *} ν = p_{1}, {pr}_{12 *} ν = q_{0}, {pr}_{24 *} ν = μ, {pr}_{13 *} ν = {diag}_{X *} {pr}_{1 *} q_{0},

where ${pr}_{i j}$ is map that projects on components $i$ and $j$ , and ${diag}_{X *} : X \to X \times X$ is the diagonal map.

We now choose $q_{1} := {pr}_{34 *} ν \in Δ (X \times Y)$ . Indeed, this satisfies the desired property ${pr}_{Y *} q_{1} = p_{1}$ . As for the Wasserstein metric, $ν (d_{X \times Y})$ is an upper bound on $W_{1} (q_{0}, q_{1})$ . Let us calculate $ν (d_{X \times Y})$ . We have

\begin{matrix} ν (d_{X \times Y}) & = \int d_{X} (x_{1}, x_{2}) + d_{Y} (y_{1}, y_{2}) d ν (x_{1}, y_{1}, x_{2}, y_{2}) = \int d_{X} (x_{1}, x_{2}) d ν (x_{1}, y_{1}, x_{2}, y_{2}) + \int d_{Y} (y_{1}, y_{2}) d ν (x_{1}, y_{1}, x_{2}, y_{2}) = \int d_{X} (x_{1}, x_{2}) d ({diag}_{X *} {pr}_{1 *} q_{0}) (x_{1}, x_{2}) + \int d_{Y} (y_{1}, y_{2}) d μ (y_{1}, y_{2}) = \int d_{X} (x_{1}, x_{1}) d ({pr}_{1} q_{0}) (x_{1}) + μ (d_{Y}) = 0 + μ (d_{Y}) = W_{1} (p_{0}, p_{1}) . \end{matrix}

This implies $W_{1} (p_{0}, p_{1}) \leq W_{1} (q_{0}, q_{1})$ , which completes the proof. ◻

Lemma 21. The function $M (\forall_{α β})$ is measurable.

Proof. We abbreviate $X = M (α)$ , $Y = M (β)$ , $f = M (\forall_{α β})$ . Again, we use the notation $d (\cdot, \cdot)$ for a metric on $Δ^{c} (X)$ or $Δ^{c} (Y)$ and $d_{H} (\cdot, \cdot)$ for the induced Hausdorff metric.

Again, we start with an auxiliary claim.

Claim 2: Let $D \subset Y$ be a compact subset. Then $⋃_{p \in D} {B \in □ (X \times Y) : p \in f (B)}$ is closed and measurable.

Let us show that the claim is true. Note, that the set is the same as $⋃_{p \in D} {B \in □ (X \times Y) : \forall q \in Δ^{c} (X \times Y) : {pr}_{Y *} q = p ⟹ q \in B}$ . Let $B_{k}$ be a sequence in the set with $B_{k} \to B_{0}$ . That means, there is a sequence $p_{k}$ with $p_{k} \in D$ and $\forall q \in Δ^{c} (X \times Y) : {pr}_{Y *} q = p_{k} ⟹ q \in B_{k}}$ . Since $D$ is compact, we can wlog assume that $p_{k} \to p_{0} \in D$ (by selecting a convergent subsequence). Let $q_{0} \in Δ^{c} (X \times Y)$ be given with $p r_{Y *} q_{0} = p_{0}$ . We need to show that $q_{0} \in B_{0}$ holds (this would show that $B_{0}$ is in the set mentioned in the claim, i.e. it would prove the claim). Wlog $q_{0} (X \times Y) > 0$ (otherwise $q_{0} \in B_{0}$ follows from definition). This implies $p_{0} (Y) > 0$ , which allows us to wlog assume that $p_{k} (Y) > 0$ for all $k \in N$ . Let us define the probability measures ${^q}_{0} := p_{0} (Y)^{- 1} p_{0} \in Δ (X \times Y)$ , ${^p}_{0} := p_{0} (Y)^{- 1} p_{0} \in Δ (Y)$ , ${^p}_{k} := p_{k} (Y)^{- 1} p_{k} \in Δ (Y)$ .

We can apply Lemma 20 to these probability measures, which implies that there exists ${^q}_{k} \in Δ (X \times Y)$ with ${pr}_{Y *} {^q}_{k} = {^p}_{k}$ and $W_{1} ({^q}_{0}, {^q}_{k}) \leq W_{1} ({^p}_{0}, {^p}_{k})$ .

It is known that the Wasserstein metric $W_{1}$ is compatible with the weak- $*$ topology on $Δ (X \times Y)$ and $Δ (Y)$ . Thus, we obtain $W_{1} ({^p}_{0}, {^p}_{k}) \to 0$ from ${^p}_{k} \to {^p}_{0}$ . This, in turn, implies ${^q}_{k} \to {^q}_{0}$ .

Next, we define $q_{k} := p_{k} (Y) {^q}_{k} \in Δ^{c} (X \times Y)$ . This satisfies $q_{k} \to q_{0}$ and ${pr}_{Y *} q_{k} = p_{k}$ . Due to the properties of $p_{k}$ and $B_{k}$ , this implies $q_{k} \in B_{k}$ . Now the desired property $q_{0} \in B_{0}$ follows by taking limits. This completes the proof of the claim.

To show measurability of $f = M (\forall_{α β})$ , it suffices to show that the preimages of open balls are measurable. Thus, we need to show that ${B \in □ (X \times Y) : d_{H} (f (B), A) < ε}$ is measurable for all $A \in □ Y$ and $ε > 0$ . We proceed similarly to Lemma 19. We have

\begin{matrix} {B ∣ d_{H} (f (B), A) < ε} = {B ∣ \forall x \in f (B) : d (x, A) < ε} \cap {B ∣ \forall x \in A : d (f (B), x) < ε} . \end{matrix}

Let us start by showing measurability of the second set, i.e. ${B ∣ \forall x \in A : d (f (B), x) < ε}$ . Since $A$ is compact, there exists a countable dense set $A_{1} \subset A$ . We have

\begin{matrix} {B ∣ \forall x \in A : d (f (B), x) < ε} & = ⋃ δ \in (0, ε) \cap Q {B ∣ \forall x \in A : d (f (B), x) \leq δ} = ⋃ δ \in (0, ε) \cap Q {B ∣ \forall x \in A : d (f (B), x) \leq δ} = ⋃ δ \in (0, ε) \cap Q ⋂ x \in A {B ∣ d (f (B), x) \leq δ} = ⋃ δ \in (0, ε) \cap Q ⋂ x \in A ⋃ y : d (x, y) \leq δ {B ∣ y \in f (B)} . \end{matrix}

Applying claim 2 for the compact set ${y : d (y, x) \leq δ}$ . Thus, we have a countable union of an intersection of closed sets. This is measurable.

Now, let us show the measurability of the other set ${B ∣ \forall x \in f (B) : d (x, A) < ε}$ . We have

\begin{matrix} {B ∣ \forall x \in f (B) : d (x, A) < ε} & = {B ∣ \forall x : d (x, A) \geq ε ⟹ x \notin f (B)} = ⋂ x : d (x, A) \geq ε {B ∣ x \notin f (B)} = □ (X \times Y) ∖ ⎛ ⎝ ⋃ x : d (x, A) \geq ε {B ∣ x \in f (B)} ⎞ ⎠ . \end{matrix}

Again, since the set ${x : d (x, A) \geq ε}$ is compact, claim 2 implies that the above set is measurable. This completes the proof that ${B \in □ (X \times Y) : d_{H} (f (B), A) < ε}$ is measurable. ◻

As for the measurability of $M (f)^{*}$ , we were not able to prove it for measurable $f$ . There might be some relation to the problem that the image of a measurable set under a measurable map is not measurable.