AI ALIGNMENT FORUM
An Introduction to Credal Sets and Infra-Bayes Learnability

by Brittany Gelb
22nd Aug 2025
15 min read

Introduction

Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events, as Bayesianism requires. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. For simplicity of exposition, we will focus on credal sets rather than general infradistributions.

Defining Credal Sets

Recall that the total-variation metric is one example of a metric on ΔX, the set of probability distributions over a finite set X. A set is closed with respect to a metric if it contains all of its limit points with respect to the metric. For example, let X0={0,1}. The set of probability distributions over X0 is given by

ΔX0={P:P(0)=a,P(1)=1−a,a∈[0,1]}.

There is a bijection between ΔX0 and the closed interval [0,1], which is the map that sends a distribution to the probability of zero. More generally, recall from the proof of Lemma 1 in the preceding post that if X is a finite set with n elements, then there is a bijection between ΔX and the closed (n−1)-simplex

Δ^{n−1} := {(x1, x2, …, xn) ∈ R^n : ∑_{i=1}^n xi = 1, xi ≥ 0 for all 1 ≤ i ≤ n}.

Consider the following subset of ΔX0:

S(0,1):={P:P(0)=a,P(1)=1−a,a∈(0,1)}⊂ΔX0.

The set S(0,1) is an open subset of ΔX0 in the same way that the open interval (0,1) is an open subset of the closed interval [0,1]. (See Figure 1.) In particular, S(0,1) does not contain the distributions P0 and P1 defined by P0(0)=0 and P1(0)=1, which are limit points of S(0,1). For example, given ϵ>0, the distribution Pϵ defined by Pϵ(0)=ϵ is an element of S(0,1), and the total variation distance between P0 and Pϵ is dTV(P0,Pϵ) = max_{x∈{0,1}} |P0(x)−Pϵ(x)| = ϵ.
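As a quick numerical check of this computation, here is a minimal sketch in Python (the function and variable names are ours, not from the post):

```python
# Total-variation distance on a finite set, here X0 = {0, 1}.
# On a two-point space, (1/2) * sum_x |p(x) - q(x)| equals max_x |p(x) - q(x)|.
def tv_distance(p, q):
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

eps = 0.01
P0   = {0: 0.0, 1: 1.0}          # the limit point P0 with P0(0) = 0
Peps = {0: eps, 1: 1.0 - eps}    # a nearby element of S(0,1)

assert abs(tv_distance(P0, Peps) - eps) < 1e-12  # dTV(P0, Peps) = eps
```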

In contrast,

S[0,1/4] := {P : P(0)=a, P(1)=1−a, a∈[0,1/4]} ⊂ ΔX0,
S[3/4,1] := {P : P(0)=a, P(1)=1−a, a∈[3/4,1]} ⊂ ΔX0, and
S[0,1/4] ∪ S[3/4,1]

are all examples of closed subsets of ΔX0.

We now consider the meaning of convexity in this context.

Definition: Convex set of probability distributions

A convex combination of probability distributions Pi is a linear combination ∑_{i=1}^n αiPi such that αi∈R, αi≥0, and ∑_{i=1}^n αi=1. A set of probability distributions is convex if it is closed under convex combinations.

If a distribution P is written as a convex combination of a set of distributions, then it is called a mixture distribution. Note that the definition of a convex combination ensures that mixture distributions are indeed probability distributions. Sampling from the mixture distribution ∑_{i=1}^n αiPi can be thought of as first determining an index i by sampling from the distribution τ over {1,…,n} defined by τ(i)=αi, and then sampling from the corresponding Pi.
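The two-stage sampling procedure can be written out directly. A minimal sketch in Python (helper names are ours):

```python
import random

def sample_mixture(weights, components):
    """Sample from the mixture sum_i alpha_i * P_i: first draw an index i from the
    distribution tau with tau(i) = alpha_i, then sample from the chosen P_i."""
    i = random.choices(range(len(components)), weights=weights)[0]
    outcomes, probs = zip(*components[i].items())
    return random.choices(outcomes, weights=probs)[0]

# Example: a 50/50 mixture of P_{1/4} and P_{3/4} over X0 = {0, 1},
# anticipating the example in the next paragraph.
P14 = {0: 0.25, 1: 0.75}
P34 = {0: 0.75, 1: 0.25}
samples = [sample_mixture([0.5, 0.5], [P14, P34]) for _ in range(100_000)]
print(samples.count(0) / len(samples))  # approximately 0.5, matching P_{1/2}(0)
```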

As defined above, S[0,1/4] and S[3/4,1] are convex sets, analogous to how [0,1/4] and [3/4,1] are convex subsets of R. However, S[0,1/4]∪S[3/4,1] is not a convex set. To see this, let P1/4(0)=1/4 and P3/4(0)=3/4. Note that P1/4 is an element of S[0,1/4] and P3/4 is an element of S[3/4,1]. The distribution defined by P1/2(0)=1/2 can be written as the mixture distribution P1/2 = (1/2)P1/4 + (1/2)P3/4. However, as shown in Figure 1, P1/2 is not an element of S[0,1/4]∪S[3/4,1], so S[0,1/4]∪S[3/4,1] is not closed under convex combinations.
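Under the identification of each distribution P with the number a = P(0), this convexity failure can be checked mechanically (a tiny sketch; the helper name is ours):

```python
# S_[0,1/4] ∪ S_[3/4,1] corresponds to [0, 1/4] ∪ [3/4, 1] inside [0, 1].
def in_union(a):
    return 0 <= a <= 0.25 or 0.75 <= a <= 1

a_mix = 0.5 * 0.25 + 0.5 * 0.75  # the mixture (1/2)P_{1/4} + (1/2)P_{3/4}
print(in_union(0.25), in_union(0.75), in_union(a_mix))  # True True False
```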

Figure 1: (Left) The set S(0,1)⊂ΔX0 visualized as a subset of [0,1]; it is not a closed set of distributions, as it does not contain the limit points corresponding to 0 and 1. (Right) The sets S[0,1/4] and S[3/4,1], which are closed and convex, and thus credal sets. The union of S[0,1/4] and S[3/4,1] is not convex.

If a set of probability distributions is both closed and convex, it is called a credal set. Among the examples we have seen, S[0,1/4], S[3/4,1], and ΔX0 are credal sets. See Figure 2 for another example. Since S[0,1/4]∪S[3/4,1] fails to satisfy convexity, it is not a credal set.

Figure 2: A credal set visualized as a subset of the 2-dimensional simplex in R3.

Definition: Credal set

Let X be a set, and let ΔX denote the set of probability distributions over X. A credal set over X is a closed, convex subset of ΔX, where closedness is defined with respect to some metric on ΔX. The set of credal sets over X is denoted by □X.

Credal sets are a natural tool for describing situations in which uncertainty cannot be reduced to probabilities. These situations are said to have Knightian uncertainty[2]. Knightian uncertainty is notably distinct from situations in which probabilities can be assigned with low confidence. 

For example, in Risk, Ambiguity, and the Savage Axioms (1961), Daniel Ellsberg writes of the "Knightian urn," an urn with 100 balls in which an observer is ignorant of the proportion of black and red balls. We would say that an observer of this urn has Knightian uncertainty with respect to the events of drawing either a red or black ball. Identifying red with 0 and black with 1, this Knightian uncertainty can be captured by the credal set ΔX0 as above. On the other hand, an observer may gain partial information while remaining ignorant about the remaining possibilities, such as learning that there are no more than 25 red balls. The credal set S[0,1/4] captures this partial information. In this case, we would say that the observer then has Knightian uncertainty over the elements of this credal set.

Convexity is a natural condition in the presence of Knightian uncertainty. If an observer considers some distributions all possible, and furthermore does not hold any beliefs about which distributions are more likely than others, then they should also consider any mixture of those distributions possible. When the set of possible distributions is finite, closing it under mixtures amounts to taking the convex hull of a finite set of points, which yields a convex polytope.

Furthermore, the decision rule discussed below makes the same recommendation for a set of probability distributions as for its closed convex hull. Therefore, we can assume without loss of generality that a set of distributions is a credal set.

Infrakernels

In machine learning classification, neural networks learn functions from sets (e.g. sets representing images) to probability distributions over possible classes; a function of type X→ΔY for sets X and Y is called a probability kernel. In this case, X is called the source of the kernel and Y is called the target of the kernel.

An infrakernel[3] is the infra-Bayesian analogue of a probability kernel. The special case of an infrakernel whose target is the set of credal sets over some set is called a crisp infrakernel.

Definition: Crisp infrakernel
Let X be a topological space, and let □Y denote the set of credal sets over a set Y. A crisp infrakernel is a continuous function of type X→□Y.

For example, every continuous probability kernel is a crisp infrakernel since a single probability distribution is a credal set with one element.

An example of a crisp infrakernel that is not a probability kernel is shown in Figure 3. Let X={William, Carl, Amalie}, and let Y={0,1,2}, with the elements of Y referring to numbers of cats. We can define κ:X→□Y as follows: κ(William) is the set containing only the distribution that assigns probability one to two, κ(Carl) is the set of distributions that assign probability zero to two, and κ(Amalie) is the set of all distributions over Y. This infrakernel represents certainty that William has two cats and Carl does not have two cats, and complete uncertainty about how many cats Amalie has. Since X is finite, κ is automatically continuous.
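Since each of these credal sets is the convex hull of finitely many distributions, one way to represent κ concretely is to list the extreme points of each set. A minimal sketch (the representation and all names are ours), writing a distribution over Y as a tuple (P(0), P(1), P(2)):

```python
# Each credal set over Y = {0, 1, 2} is encoded by the extreme points of its convex hull.
KAPPA = {
    "William": [(0.0, 0.0, 1.0)],                                    # certainty: two cats
    "Carl":    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],                   # any P with P(2) = 0
    "Amalie":  [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],  # all of ΔY
}

def kappa(x):
    """Crisp infrakernel X -> □Y, returning the extreme points of the credal set."""
    return KAPPA[x]

print(kappa("Carl"))  # the segment between the point masses on 0 and 1
```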

Figure 3: An example of a crisp infrakernel that sends people to credal sets visualized respectively as a single point, a line segment, and the entire 2-dimensional simplex.

Deterministic Versus Stochastic Policies

In the preceding post, we discussed stochastic policies, maps of the form π:(A×O)∗→ΔA. Deterministic policies are a special case of stochastic policies, and can be written as partial functions of the form π:(A×O)∗⇀A. By partial function, we mean that the domain of the policy π is restricted to those histories consistent with it. We previously observed that under some mild assumptions in the classical setting, an optimal policy exists; furthermore, the optimal policy may be chosen to be deterministic. As we argue below, stochastic policies can be emulated by deterministic policies, and thus from this point forward, we will assume that policies are deterministic.

To emulate a stochastic policy using a deterministic policy, an agent can choose an element of ΔA and use a random number generator (RNG) to sample it. The output of the RNG then dictates an action. An observation from the environment then follows as usual. To formalize this, if A and O are the sets of actions and observations respectively from the stochastic setting, we can define a new set of observations equal to A×O. This new set records both the action dictated by the RNG and the usual observation that follows. The new action space would be ΔA, with each element of ΔA corresponding to an RNG that encodes the distribution.[4]
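A minimal sketch of this construction in Python (all names are ours; the policy deterministically outputs an element of ΔA, and the sampled action is recorded as part of the new observation):

```python
import random

def emulate(policy, environment_step, horizon):
    """Run a 'deterministic' policy whose actions are distributions over A.
    Each step: the policy picks an element of ΔA, an RNG samples an action from it,
    the environment responds, and the pair (action, observation) is the new observation."""
    history = []
    for _ in range(horizon):
        dist = policy(history)                          # deterministic choice in ΔA
        actions, probs = zip(*dist.items())
        a = random.choices(actions, weights=probs)[0]   # RNG samples the action
        o = environment_step(a)
        history.append((dist, (a, o)))                  # new observation records a and o
    return history

# Example: always play the uniform distribution over {0, 1}, in an environment
# that simply echoes the action back as the observation.
hist = emulate(lambda h: {0: 0.5, 1: 0.5}, lambda a: a, horizon=3)
```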

Importantly, this allows the random number generator to be insecure, biased, or influenced by the environment. In comparison, assuming that policies are stochastic implicitly assumes that the agent can access a source of true randomness that is inaccessible to the environment. The deterministic theory not only encompasses the stochastic theory, but is also applicable to embedded agency and can be used to solve Newcombian problems.

Topologies on policies and destinies

In this section, we review several important topologies. These topologies are relevant to how we define credal sets over the set of destinies (A×O)ω, as we require a notion of closedness in Δ(A×O)ω.

The topology on the set of deterministic policies

Recall that the set of deterministic policies Π is a set of partial functions π:(A×O)∗⇀A. It can be written as a subset of the product space ∏_{i∈(A×O)∗} Ai, where Ai=A for all i∈(A×O)∗.

We assume that A is finite, and thus the natural topology on A is the discrete topology. Then Π is naturally endowed with the subspace topology inherited from the product topology determined by the topology on A. This means that a basis element of the topology on Π has the form ∏_{i∈(A×O)∗} Ui, where Ui⊆A for all i∈(A×O)∗ and Ui≠A for only finitely many i∈(A×O)∗. An open set is then an arbitrary union of basis elements.

Recall from Lemma 2 of the preceding post that Π is compact.

The topology on the set of destinies

The set of countably infinite histories (A×O)ω, referred to as destinies, is a metric space under the metric d(h,h′) = γ^{t(h,h′)}, where γ∈(0,1) and t(h,h′) is the time (index) of the first difference between h and h′. In the proof section, we show that (A×O)ω is compact under this metric.

Lemma 1: Suppose A and O are finite. Then the set of destinies (A×O)ω is a compact metric space.
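On finite prefixes, this metric is straightforward to compute. A minimal sketch (names are ours; destinies are truncated to finite histories, and identical prefixes get distance 0, standing in for γ^∞):

```python
def destiny_distance(h1, h2, gamma=0.5):
    """d(h, h') = gamma ** t(h, h'), where t(h, h') is the index of first difference."""
    for t, (x, y) in enumerate(zip(h1, h2)):
        if x != y:
            return gamma ** t
    return 0.0  # no difference found on the (truncated) prefixes

h1 = [("a0", "o0"), ("a0", "o1"), ("a1", "o0")]
h2 = [("a0", "o0"), ("a1", "o1"), ("a1", "o0")]
print(destiny_distance(h1, h2))  # first difference at index 1, so gamma ** 1 = 0.5
```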

The topology on the set of credal sets over destinies

Recall that the total-variation distance between probability distributions P1 and P2 is given by dTV(P1,P2) := (1/2)∑_{x∈X}|P1(x)−P2(x)|. A natural way to extend any metric on probability distributions to credal sets is to use the Hausdorff metric. The Hausdorff metric can be defined on the closed, non-empty subsets of a metric space. When restricted to one-element sets, the Hausdorff metric agrees with the metric of the original space.

Definition: Hausdorff total-variation distance between credal sets
Let Θ1, Θ2 ∈ □X be two non-empty credal sets. For a credal set Θ∈□X and a distribution θ∈ΔX, let d(θ,Θ) := inf_{θ′∈Θ} dTV(θ,θ′). The Hausdorff distance between Θ1 and Θ2 is given by

dH(Θ1,Θ2) := max{ sup_{θ1∈Θ1} d(θ1,Θ2), sup_{θ2∈Θ2} d(θ2,Θ1) }.

Because credal sets are closed by definition, the Hausdorff total-variation distance between non-empty credal sets is a well-defined metric.
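When credal sets are approximated by finitely many distributions, the suprema and infima become maxima and minima, and the Hausdorff total-variation distance can be computed directly. A minimal sketch (the names and the grid discretization are ours):

```python
def tv(p, q):
    """Total-variation distance between distributions given as equal-length tuples."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hausdorff_tv(theta1, theta2):
    """dH = max(sup_{t1} inf_{t2} dTV(t1, t2), sup_{t2} inf_{t1} dTV(t1, t2)),
    with sup/inf realized as max/min over finite sets of distributions."""
    d12 = max(min(tv(t1, t2) for t2 in theta2) for t1 in theta1)
    d21 = max(min(tv(t1, t2) for t1 in theta1) for t2 in theta2)
    return max(d12, d21)

def grid(lo, hi, n=26):
    """Finite grid of distributions (P(0), P(1)) with P(0) ranging over [lo, hi]."""
    return [(lo + (hi - lo) * k / (n - 1), 1 - lo - (hi - lo) * k / (n - 1))
            for k in range(n)]

S1, S2 = grid(0.0, 0.25), grid(0.75, 1.0)  # discretized S_[0,1/4] and S_[3/4,1]
print(hausdorff_tv(S1, S2))  # 0.75, realized by the endpoints P(0)=0 and P(0)=3/4
```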

The topology of □(A×O)ω is then given by the metric topology induced by the Hausdorff distance between credal sets.

Crisp Causal Laws

Causal laws[5] are a special type of infrakernel used to analyze sets of environments.

Given a set of environments E and a policy π, we can consider the set of distributions over destinies that arise from the interaction of each environment with the policy, which we denote by {μ^π : μ∈E} (for details, see the preceding post).

Wrapping this into a function results in an infrakernel as follows.

Definition: The infrakernel generated from a set of environments
A set of environments E generates the infrakernel ΛE : Π→□(A×O)ω defined by ΛE(π) = cl(ch({μ^π : μ∈E})), where ch denotes the convex hull and cl denotes the topological closure.

This leads to the definition of crisp causal laws, which take the role of hypotheses in infra-Bayesianism, as opposed to environments in classical reinforcement learning theory.

Definition: Crisp causal law
A crisp causal law is an infrakernel Λ:Π→□(A×O)ω generated by a set of environments.

For example, let A={a0,a1} and O={o0,o1} denote a set of actions and a set of observations respectively. Let π be a policy such that π(ϵ)=a0, where ϵ denotes the empty history. Let μ1 denote an environment such that μ1(o0|a0)=1/12. Let μ2 denote a second environment such that μ2(o0|a0)=1. A calculation as in Figure 2 of the preceding post shows that μ1^π is given by (1/12, 11/12, 0, 0), with each index of the tuple corresponding to one of the four possible histories of length one. Furthermore, μ2^π is given by (1, 0, 0, 0).

By disregarding the last two irrelevant coordinates, we can visualize μ1^π and μ2^π as points in the 1-dimensional simplex. Let E be the convex closure of {μ1,μ2}. Then ΛE(π) corresponds to a line segment in the 1-dimensional simplex, as shown in Figure 4.
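The coordinates of μ1^π and μ2^π can be recomputed directly. A minimal sketch (names are ours; the conditional probabilities given a1 are arbitrary placeholders, since π never plays a1):

```python
def interact(pi, mu, actions, observations):
    """mu^pi over length-one histories: P(a, o) = pi(a) * mu(o | a)."""
    return {(a, o): pi[a] * mu[(o, a)] for a in actions for o in observations}

actions, observations = ["a0", "a1"], ["o0", "o1"]
pi  = {"a0": 1.0, "a1": 0.0}  # deterministic policy: pi(empty history) = a0
mu1 = {("o0", "a0"): 1/12, ("o1", "a0"): 11/12, ("o0", "a1"): 0.5, ("o1", "a1"): 0.5}
mu2 = {("o0", "a0"): 1.0,  ("o1", "a0"): 0.0,  ("o0", "a1"): 0.5, ("o1", "a1"): 0.5}

print(interact(pi, mu1, actions, observations))  # (1/12, 11/12, 0, 0) as a dict
print(interact(pi, mu2, actions, observations))  # (1, 0, 0, 0) as a dict
```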

Figure 4: The image of a crisp causal law, which is a credal set.

The next proposition states that crisp causal laws are continuous with respect to the topologies discussed in the previous section. Since we are working with topological spaces, a function f:T1→T2 between topological spaces T1 and T2 is continuous if for every open set U in T2, f^{−1}(U) is open in T1. It is well known that this definition is equivalent to the epsilon-delta definition of continuity when T1 and T2 are metric spaces.

Proposition 1: Every crisp causal law Λ:Π→□(A×O)ω is continuous with respect to the product topology on Π and the Hausdorff topology on □(A×O)ω.

The Minimax Decision Rule

In the classical setting, given a set of policies and a prior over a class of environments, a natural rule for choosing a policy is expected loss minimization, in which a policy is chosen to minimize expected loss with respect to the prior. Recall that crisp causal laws take the place of environments as hypotheses in infra-Bayesianism. Even if a prior over hypotheses (crisp causal laws) is given, it is impossible to carry out expected loss minimization because there is Knightian uncertainty over the distributions that make up a credal set.

Because expected loss minimization is impossible, it is necessary to use a different decision rule. In the context of AI alignment, it is of interest to have guarantees on agent behavior that hold even in the worst-case scenarios. Given a set of environments, worst-case reasoning describes a decision-making heuristic in which an agent assumes that the true environment is the environment that would be worst with respect to some measure. The minimax decision rule stipulates that an agent should choose a policy so that the maximum expected loss over the set of possible environments is minimized.[6] Other decision rules can be reduced to minimax[7], and furthermore, there are natural classes of hypotheses that are learnable with respect to minimax, a notion that we define below.[8] 

Worst-case reasoning is implicitly built into the following definition.

Definition: Expected loss of a policy with respect to a crisp causal law
Let L:(A×O)ω→[0,1] denote a continuous loss function. The expected loss of a policy π∈Π with respect to a crisp causal law Λ is defined by 

EΛ(π)[L] := max_{θ∈Λ(π)} Eθ[L].

Using the assumption that L is continuous, Lemma 1 and Lemma 4 of the proof section imply that EΛ(π)[L] is well-defined.

Choosing a policy that minimizes expected loss with respect to a crisp causal law is equivalent to choosing a policy that minimizes the maximum possible loss over the distributions obtained from the interaction of the policy and each environment. Therefore, the infra-Bayesian version of expected loss minimization is exactly a minimax decision.
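For finitely many policies and credal sets given by finitely many distributions, the minimax selection can be carried out by brute force. A minimal sketch (the loss values, policies, and law below are illustrative toys, not from the post):

```python
def expected_loss(theta, loss):
    """E_theta[L] for a distribution and a loss function, both indexed by destinies."""
    return sum(p * loss[h] for h, p in theta.items())

def worst_case_loss(credal_set, loss):
    """E_{Lambda(pi)}[L] = max over the credal set (finite here, so the max is attained)."""
    return max(expected_loss(theta, loss) for theta in credal_set)

def minimax_policy(law, policies, loss):
    """Choose the policy minimizing the worst-case expected loss over Lambda(pi)."""
    return min(policies, key=lambda pi: worst_case_loss(law(pi), loss))

# Toy example: two destinies, two policies, and a law given by finite credal sets.
loss = {"h0": 0.0, "h1": 1.0}
LAW = {
    "pi_a": [{"h0": 0.9, "h1": 0.1}, {"h0": 0.6, "h1": 0.4}],  # worst case: 0.4
    "pi_b": [{"h0": 1.0, "h1": 0.0}, {"h0": 0.5, "h1": 0.5}],  # worst case: 0.5
}
print(minimax_policy(lambda pi: LAW[pi], ["pi_a", "pi_b"], loss))  # "pi_a"
```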

Infra-Regret

Now that we have established the definition of expected loss in this setting, we can define the infra-Bayesian analogue to the regret of a policy with respect to an environment.

Definition: Infra-regret of a policy with respect to a crisp causal law
Let L:(A×O)ω→[0,1] denote a continuous loss function. The infra-regret of a policy π∈Π with respect to a crisp causal law Λ is defined by 

Reg(π,Λ,L) := EΛ(π)[L] − min_{π′∈Π} EΛ(π′)[L].

It is natural to ask whether this notion of regret is well-defined. A proof can be given using Lemma 2 of the preceding post, which states that Π is compact, and the following proposition.

Proposition 2[9]: Let π∈Π be a policy, Λ a crisp causal law, and L:(A×O)ω→[0,1] a continuous function. Then the map π↦EΛ(π)[L] is continuous.

In the classical theory, a prior over environments quantifies uncertainty about the true environment. The analogue in infra-Bayesianism is a prior over crisp causal laws. The expected regret with respect to a prior of this type is called the infra-regret.

Definition: Infra-regret of a policy with respect to a prior
Let ζ denote a prior over a set of crisp causal laws. The infra-regret of a policy π with respect to ζ is defined by 

IBReg(π,ζ,L) := E_{Λ∼ζ}[Reg(π,Λ,L)].

Infra-Bayes Optimality and Learnability

In this section, we introduce some elementary results on learnability in infra-Bayesianism. The analogue to a Bayes optimal policy with respect to a prior over environments is an infra-Bayes optimal policy, defined as follows.

Definition: Infra-Bayes optimal policy
An infra-Bayes optimal policy with respect to a prior ζ over crisp causal laws is a policy 

π∗ ∈ argmin_{π∈Π} E_{Λ∼ζ}[EΛ(π)[L]].

A family of policies {πγ}_{γ∈[0,1)} is said to be infra-Bayes optimal if for all γ∈[0,1), πγ is infra-Bayes optimal for the γ-dependent loss function Lγ.
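Continuing the toy minimax sketch from above, an infra-Bayes optimal policy under a prior over finitely many crisp causal laws can be selected the same way (all names and values are illustrative):

```python
def infra_bayes_optimal(prior, laws, policies, loss):
    """argmin over pi of E_{Lambda ~ zeta}[E_{Lambda(pi)}[L]], for a finite prior
    (a list of weights) over finitely many crisp causal laws."""
    def worst_case(law, pi):  # E_{Lambda(pi)}[L], as in the minimax sketch
        return max(sum(p * loss[h] for h, p in theta.items()) for theta in law(pi))
    return min(policies, key=lambda pi: sum(w * worst_case(law, pi)
                                            for w, law in zip(prior, laws)))

loss = {"h0": 0.0, "h1": 1.0}
law1 = lambda pi: {"pi_a": [{"h0": 0.9, "h1": 0.1}, {"h0": 0.6, "h1": 0.4}],
                   "pi_b": [{"h0": 1.0, "h1": 0.0}, {"h0": 0.5, "h1": 0.5}]}[pi]
law2 = lambda pi: {"pi_a": [{"h0": 0.2, "h1": 0.8}],
                   "pi_b": [{"h0": 0.7, "h1": 0.3}]}[pi]
# Worst cases: law1 -> (0.4, 0.5); law2 -> (0.8, 0.3). A 50/50 prior gives
# 0.6 for pi_a and 0.4 for pi_b, so the infra-Bayes optimal choice is "pi_b".
print(infra_bayes_optimal([0.5, 0.5], [law1, law2], ["pi_a", "pi_b"], loss))
```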

A natural question is: Does an infra-Bayes optimal policy always exist? The answer is yes; this is a corollary of Proposition 2 and the compactness of Π. A standard continuity argument can also be used to show that the set of infra-Bayes optimal policies is closed, as stated in the following corollary.[10]

Corollary 1: Existence of the infra-Bayes optimal policy
Let ζ denote a prior over a set of crisp causal laws, and let L:(A×O)ω→[0,1] be a continuous loss function. Then argmin_{π∈Π} E_{Λ∼ζ}[EΛ(π)[L]] is non-empty and closed.

The analogue of a learnable class of environments is a learnable class of causal laws.

Definition: Non-anytime learnable class of infra-Bayesian hypotheses
Let I denote an indexing set. A class of crisp causal laws (hypotheses) {Λi}_{i∈I} is non-anytime learnable if there exists a family of policies {πγ}_{γ∈[0,1)} such that for all i∈I, lim_{γ→1} Reg(πγ,Λi,Lγ) = 0. In this case, we say that {πγ}_{γ∈[0,1)} learns {Λi}_{i∈I}.

Furthermore, we have the analogue of a non-anytime learnable prior over environments.

Definition: Non-anytime learnable prior over crisp causal laws
A prior ζ over a set of crisp causal laws is non-anytime learnable if there exists a family of policies {πγ}_{γ∈[0,1)} such that

lim_{γ→1} E_{Λ∼ζ}[ EΛ(πγ)[Lγ] − min_{π∈Π} EΛ(π)[Lγ] ] = 0.

In this case, we say that {πγ}_{γ∈[0,1)} learns ζ.

Proposition 2 of the preceding post states the classical result from learning theory that if a countable class of environments is learnable, then, given a non-dogmatic prior over the class, any family of policies that is Bayes-optimal with respect to the prior must also learn the class. We end with the corresponding result for crisp causal laws. The proof is contained in the proof section.

Proposition 3: For any non-dogmatic prior ζ over a learnable and countable collection of crisp causal laws {Λi}_{i=0}^∞, if a family {π∗γ}_{γ∈[0,1)} of policies is infra-Bayes optimal with respect to ζ, then {π∗γ}_{γ∈[0,1)} learns {Λi}_{i=0}^∞.

Acknowledgements

I'm grateful to Vanessa Kosoy and Alexander Appel for insightful discussions, and Marcus Ogren and Mateusz Bagiński for their valuable feedback on the initial draft.

  1. ^

    In the original infra-Bayesianism sequence, credal sets are called crisp infradistributions.

  2. ^

    First called "unmeasurable uncertainty" by economist Frank Knight in his 1921 book Risk, Uncertainty, and Profit. 

  3. ^

    As defined in The Many Faces of Infra-Belief.

  4. ^

    This requires dealing with infinite action spaces, which is typically not a problem as long as the action space is compact Polish.

  5. ^

    Previously called belief functions, e.g. as in The Many Faces of Infra-Belief.

  6. ^

    If using reward rather than loss, this rule can be equivalently formulated as a maximin rule (cf. Infra-Bayesianism Distillation: Realizability and Decision Theory).

  7. ^

    See Vanessa Kosoy's Shortform. 

  8. ^

    See Vanessa Kosoy's Shortform.

  9. ^

    This proposition is a special case of Proposition 5 from Belief Functions And Decision Theory.

  10. ^

    These two statements were written as Proposition 12 in Belief Functions And Decision Theory.
