Attacking the grain of truth problem using Bayes-Savage agents

Motivated by the grain of truth problem, we propose a generalization of Bayesian inference that allows for incomplete models. Such a model can be thought of as a set of constraints on the environment which doesn't specify it completely. This means that in addition to ordinary (probabilistic) uncertainty, another ("Knightian") type of uncertainty is introduced. This new uncertainty is managed using Savage's minimax regret decision rule.

Motivation

The grain of truth problem, as described by Hutter (see problem 5j) concerns the interaction between several AIXI agents. Ideally, we expect that rational agents in multi-agent scenarios should converge to behavior corresponding to some reasonable solution concept in game theory, e.g. Nash equilibrium (for the purposes of this post I completely ignore stronger desiderata such as superrationality; also purely Bayesian agents often fail to do so because of insufficient exploration but this is also a problem mostly orthogonal to grain of truth). However, it is difficult to prove anything about multi-agent scenarios of AIXI since AIXI's prior is supported on computable environments whereas AIXI itself is uncomputable. This problem survives in bounded analogues: if we limit the computational complexity of environments included in the prior, the computational complexity of the agent itself will invariably be higher. Thus, two agents of equal power seem unable to "comprehend" each other.

There are several reasons this problems seem important to AI alignment:

The ability of an agent to operate in environments of complexity similar to that of the agent or even surpassing that of the agent seems an important desideratum for the mathematical formalization of intelligence. Without a formal understanding how this desideratum can be satisfied it is difficult to imagine any reliable theory of intelligent agents.
Arguably, self-improving agents need the capacity to understand systems that equal to themselves in complexity, more or less by definition.
Perhaps most importantly, IRL relies on the AI's ability to successfully reason about a complex agent (a human), presumably an agent that cannot be emulated precisely with the AI's computational power. CIRL relies on two-sided interaction between the AI and the human, where neither side can emulate the other. Thus, it seems impossible to prove reasonable guarantees for (C)IRL under quasi-realistic assumptions without solving the grain of truth problem.

To the best of our knowledge, the most serious attack on the problem to date is the work of Leike, Taylor and Fallenstein. They prove that Thompson samplers relatively to a reflective oracle converge to an "asymptotic" Nash equilibrium. In our opinion, it should be possible to extent that work to ensure subgame perfect and even proper equilibria and, under some assumptions, allow for "tamer" forms of exploration than Thompson sampling (Thompson sampling necessarily sacrifices many horizons until convergence). There is no doubt reflective oracles constitute significant progress towards the solution, in the sense that they clearly demonstrate the grain of truth problem is not inherently unsolvable and provide a formal framework where agents satisfying this desideratum can be studied. Nevertheless, we retain skepticism regarding the prospect of reaching the ultimate solution through developing that approach.

In our opinion, there are 3 ways in which reflective oracles are an unsatisfactory solution:

There are many different reflective oracles. In multi-agent scenarios, all results rely on all agents using the same reflective oracle. Realistic agents cannot use uncomputable oracles so they will have to replace them by some sort of bounded analogues. However, there is every reason to suspect these bounded analogues will also be far from unique. Thus, it's not clear that interaction between agents using different brands of "bounded reflective oracles" will have any substantial guarantees. In fact, this lack of uniqueness seems directly related to the lack of uniqueness for Nash equilibria. Any solution for the grain of truth will have to solve equilibrium selection, but reflective oracles seem to "cheat" by implicitly selecting Nash equilibria for all possible games in advance. A more realistic model should imply some sort of negotiation process which will probably need some non-trivial assumptions to converge.
Multi-agent scenarios are a test case for the ability to reason about environments more complex than the agent itself, but the latter is more general than the former. We expect reasonable multi-agent behavior to arise as a special cases of some broad capability, useful in a great variety of other scenarios. However, it is not clear what physically realistic environments except multi-agent scenarios benefit from reasoning with reflective oracles. If there are few such environments, we would have to admit that either reflective oracles are an unsatisfactory model or that realistic agents need to be designed qualitatively differently in order to manage multi-agent scenarios as opposed to all other complex scenarios. In other words, in the latter case there would be uncontrived examples of AIs that solve almost any problem better than humans (including proving theorems, discovering physics, inventing nanotechnology, von Neumann probes and Dyson spheres) but that fail miserably in games. This possibility, although not impossible, appears to us less likely than the alternative.
Reflective oracles are uncomputable, although they were proven to be limit computable. Obviously realistic agents use algorithms that are not only computable but satisfy rather stringent complexity bounds.

Of the problems above, 3 seems to be the least significant since it is plausible that bounded analogues can be constructed. In any case, in our formalism it also entirely unobvious how to construct computationally feasible agents. Problem 1 might be solved if a natural class of reflective oracles is found s.t. any agent that is computable w.r.t. any oracle in this class is also computable w.r.t. any other oracle $O_{2}$ in this class, at least approximately. This is not impossible but also not straightforward. Problem 2 might be the hardest.

Incomplete Models

Although an agent cannot simulate another agent of equal power without entering an infinite loop, it might be able to observe certain facts about the other agent's behavior. This calls for a notion of an incomplete model: a model of the environment that doesn't yield precise probabilistic predictions but provides some constraints on what might happen.

To give a trivial example, given a sequence of two bits $b_{1} b_{2}$ , the model might say that $b_{1} = \neg b_{2}$ without assigning any probability to $b_{1} = 0$ . Knowing only this model one has unquantifiable (Knightian) uncertainty between the possibilities $01$ and $10$ . The space of probability distributions on two bit sequences is a tetrahedron, and our model corresponds to one edge of this tetrahedron.

More generally, we have some measurable event space $X$ and $Δ X$ is the space of probability measures on $X$ . A complete (Bayesian) model corresponds to some $μ \in Δ X$ . An incomplete model corresponds to some convex set $Φ \subseteq Δ X$ (which can be assumed closed in the strong convergence topology). If we assign probability $p$ to model $μ$ and $1 - p$ to model $ν$ , this can be represented by the model $p μ + (1 - p) ν$ . Similarly, if we assign probability $p$ to model $Φ$ and probability $1 - p$ to model $Ψ$ , this can be represented by the model

$p Φ + (1 - p) Ψ := {p μ + (1 - p) ν ∣ μ \in Φ, ν \in Ψ}$

In ordinary Bayesian statistics, a prior is often specified by a some countable space $H$ of hypotheses (i.e. for each $x \in H$ we have $μ_{x} \in Δ X$ ; more generally $H$ can be some measurable space with a Markov kernel into $X$ , but we won't need that) and a probability measure $ζ$ on $H$ , so that the prior is $E_{x \in ζ} [μ_{x}]$ . Similarly, we can have $Φ_{x} \subseteq Δ X$ and the incomplete prior

$E_{x \in ζ} [Φ_{x}] := {E_{h \in ζ} [μ_{x}] ∣ μ : H \to Δ X, \forall x \in H : μ_{x} \in Φ_{x}}$

Now consider an agent interacting with an unknown environment. The agent has a finite set $A$ of actions and a finite set $O$ of percepts. The event space is the space of "pure" (deterministic) environments $E := (O \times A)^{*} \to O$ (its $σ$ -algebra comes from viewing it as the inverse limit of the sequence $(O \times A)^{\leq n} \to O$ ). In this context we don't usually work with the full space of "mixed" environments $Δ E$ but only with the quotient space of "behavioral" environments $Δ_{b} E$ . Here, every $μ \in Δ_{b} E$ is a partial function from $(O \times A)^{*}$ to $Δ O$ s.t. $μ (h)$ is defined iff $μ$ assigns positive probability to all percepts appearing in $h$ . For every $μ \in Δ E$ , we define the corresponding $μ_{b} \in Δ_{b} E$ by

$Pr [μ_{b} (h) = o] := {Pr}_{e \sim μ} [e (h) = o ∣ \forall i \in [| h |] : h_{i}^{O} = e (h_{0 : i - 1})]$

Here, $| h |$ is defined by $h \in (O \times A)^{| h |}$ and $h_{i}^{O} \in O$ stands for the first element of the pair $h_{i} \in O \times A$ . Note that the operation of taking convex linear combinations in $Δ E$ descends to a well-defined operation in $Δ_{b} E$ which we will also regard as convex linear combination (but which is not the same as taking pointwise convex linear combinations of the partial functions; instead it can be written as the result of Bayesian updating).

Priors are usually constructed from environments that are in some sense computable. There are several variants of the definition, but we will consider environments that can be represented as

$M : (O \times A)^{*} \times {0, 1}^{ω} alg - \to O$

Here, ${0, 1}^{ω}$ is regarded as an infinite sequence of fair coin flips and the notation $alg - \to$ means that $M$ is a Turing machine that halts with probability 1 for any fixed $h \in (O \times A)^{*}$ and $r \in {0, 1}^{ω}$ a sequence of fair coin flips. The corresponding environment is defined by

$Pr [μ_{M} (h) = o] := {Pr}_{r \sim U^{ω}} [M (h, r) = o]$

Here $U^{ω}$ is the fair coin probability measure on ${0, 1}^{ω}$ (it is easy to see that the set ${r \in {0, 1}^{ω} ∣ M (h, r) = o}$ is measurable because $M$ is a Turing machine).

Fix $U$ a prefix-free universal Turing machine and denote $H^{U} \subseteq {0, 1}^{*}$ the set of programs that satisfy the halting condition above. We have $ζ \in Δ H^{U}$ defined by

$ζ (x) := \frac{2^{- | x |}}{\sum_{y \in H^{U}} 2^{- | y |}}$

For any $x \in H^{U}$ , we can define the environment $μ_{x}^{U}$ :

$Pr [μ_{x}^{U} (h) = o] := {Pr}_{r \sim U^{ω}} [U (x; h, r) = o]$

We can now define the universal prior $μ^{U} := E_{x \sim ζ} [μ_{x}^{U}]$ .

Alternatively, we can consider environments that have an unobservable state. For any $ν \in Δ_{b} E$ we denote the domain of $ν$ by $(O \times A)_{ν}^{*}$ . An computable environment with unobservable state corresponds to a machine

$M : (O \times A)^{*} \times N \times {0, 1}^{ω} alg - \to O \times N$

The corresponding $μ_{M} \in Δ_{b} E$ is defined recursively, simultaneously with $ρ_{M} : (O \times A)_{μ_{M}}^{*} \to Δ N$ , the latter describing the probability distribution on unobservable states. Denoting $λ_{O \times A}$ the empty element of $(O \times A)^{*}$ :

$ρ_{M} (λ_{O \times A}) := δ_{0}$

$Pr [μ_{M} (h) = o] := {Pr}_{\begin{matrix} u \sim ρ_{M} (h) r \sim U^{ω} \end{matrix}} [M (h, u, r)^{O} = o]$

$Pr [ρ_{M} (h o a) = u] := {Pr}_{\begin{matrix} v \sim ρ_{M} (h) r \sim U^{ω} \end{matrix}} [M (h, v, r)^{{0, 1}^{*}} = u ∣ M (h, v, r)^{O} = o]$

For arbitrary computable environments the addition of an unobservable state gains us nothing. However, it makes a difference when imposing complexity bounds.

We now generalize these concepts to incomplete models. A computable incomplete model corresponds to an oracle machine

$M : (O \times A)^{*} \times {0, 1}^{ω} or - \to O$

We require that the machine halts with probability 1 for any oracle. The values returned by the oracle represent "Knightian coin flips." For any $K \subseteq N$ , we can define $μ_{M K} \in Δ_{b} E$ by

$Pr [μ_{M K} (h) = o] := {Pr}_{r \sim U^{ω}} [M^{K} (h, r) = o]$

The notation $M^{K}$ signifies we are running the machine $M$ with the oracle $K$ .

We define the incomplete model

$Φ_{M} := {E_{K \sim ξ} [μ_{M K}] ∣ ξ \in Δ 2^{N}}$

The $σ$ -algebra on $2^{N}$ is defined by regarding it as an infinite product: $2^{N} ≅ \prod_{n \in N} {0, 1}$ .

Taking $¯ U$ a universal oracle machine and $H^{¯ U}$ the set of programs for $¯ U$ that satisfy the our halting condition, we can define $Φ_{x} \subseteq Δ_{b} E$ for any $x \in H^{¯ U}$ in the obvious way. We define $¯ ζ \in Δ H^{¯ U}$ by

$¯ ζ (x) := \frac{2^{- | x |}}{\sum_{y \in H^{¯ U}} 2^{- | y |}}$

This allows us defining the "universal incomplete prior" $Φ^{¯ U} := E_{x \in ¯ ζ} [Φ_{x}]$ .

We also can represent computable incomplete models with unobservable states

$M : (O \times A)^{*} \times N \times {0, 1}^{ω} or - \to O \times N$

$Φ_{M}$ is defined analogously.

It is straightforward to impose complexity bounds on these objects, e.g. by limiting the time, space, number of random coins or number of oracle queries $M$ is allowed to use as a function of $| h |$ .

Hutter observes that it's unknown whether Solomonoff induction can correctly predict relationships betweens bits in a sequence that contains uncomputable information (see problem 4g). To deal with similar issues in online learning one can use "sleeping experts" (see also subsequence induction). Here, we deal with much more general scenarios than sleeping experts by including incomplete models in our prior.

Bayes-Savage Agents

Consider an agent that has to choose from a finite strategy set $S$ . The utility function of the agent is $u : S \times X \to R$ . If the agent's beliefs are described by a complete model $μ \in Δ X$ , the standard decision rule is maximizing expected utility:

$s (μ) := a r g m a x s \in S E_{w \sim μ} [u (s, w)]$

For an incomplete model $Φ \subseteq Δ X$ , we propose using the minimax regret decision rule:

$s (Φ) := a r g m a x s \in Δ S min μ \in Φ (E_{w \sim μ} [u (s, w)] - max t \in S E_{w \sim μ} [u (t, w)])$

We call agents following this rule "Bayes-Savage" agents.

Note that we lost nothing by requiring $Φ$ to be convex since the minimum in the above equation is of a concave function and therefore is always attained in an extreme point (so if we started from a non-convex subset of $Δ X$ , we might as well have replaced it with its convex hull).

As opposed to expected utility maximization, minimax regret might demand randomization. This is perhaps not surprising since if we expect our agents to produce Nash equilibria, there must be randomization in their decision rule (e.g. in reflective oracles this randomization comes from the oracle itself). In fact, this decision rule can be interpreted as selecting a Nash equilibrium in a game where the strategies of the opponent (who we dub Metanoia) are $μ \in Φ$ , the agent's utility function is $E_{w \sim μ} [u (s, w)]$ and Metanoia's utility function is the agent's regret

$R (s, μ) := max t \in S E_{w \sim μ} [u (t, w)] - E_{w \sim μ} [u (s, w)]$

Equivalently (i.e. yielding the same Nash equilibria), we can make it a zero-sum game by taking negative regret to be the agent's utility function (it makes no difference since the other term depends only on Metanoia). Note that the $μ$ attaining the minimum is not the Nash equilibrium strategy for the Metanoia but only a pure best response to a Nash equilibrium: the former would be a mixed strategy.

In most interesting settings the agent's strategy involves making a sequence of actions between which it receives additional information. The strategy of a Bayesian agent in this case can be regarded as performing a Bayesian update after each observation and computing and optimal policy for the rest of time using the maximal expected utility rule applied to the posterior. On the other hand, the policy of a Bayes-Savage agent cannot be decomposed in this way, i.e., the policy after making an observation is not the result of applying the minimax regret rule to the incomplete posterior. We deal with this "dynamic inconsistency" simply by requiring the entire policy to be determined in advance ("updatelessly"). In practical implementations we will likely need some approximate decoupling from distant "branches" in order to make the algorithm feasible, but we leave this issue for later.

In the sequential setting, the Metanoia game is naturally regarded as having extensive form. In particular, the Nash equilibrium condition is too weak to always ensure reasonable behavior, since there might be agent information sets that are never reached in Nash equilibrium. Therefore, we need to make the decision rule more stringent by requiring e.g. quasi-perfect or proper equilibrium. We leave working out the details for later.

As a simple example for a setting where the minimax regret rule yields reasonable behavior, consider a stochastic $k$ -armed bandit played for $n$ time steps. Here $X = R^{k n}$ , each $w \in X$ representing a vector of payoffs for all arms and time steps. $Φ$ is the convex hull of all i.i.d. distributions satisfying an appropriate moment condition. Clearly the minimax regret rule leads to behavior qualitatively similar to the UCB algorithm (at least assuming the correct handling of special information sets as suggested above).

This stands in sharp contrast to the minimax decision rule:

$s_{mm} (Φ) := a r g m a x s \in Δ S min μ \in Φ E_{w \sim μ} [u (s, w)]$

The latter leads to undefined behavior for stochastic bandits since there is no way to guarantee any payoff above the minimum, even if good payoffs were observed on previous time steps.

Finally, a Bayes-Savage agent for $Φ^{¯ U}$ is a natural counterpart of AIXI in this setting. We hope that for these agents strong game-theoretic guarantees can be derived.

Discussion

We think that there is a reasonable chance the concepts outlined in this essay can lead to solving the grain of truth problem. This will require much further work. Specifically, we suggest to continue the research in approximately the following order:

Better understanding of minimax regret in the sequential setting, including the definition of quasi-perfect or proper strategies.
Proving convergence of Bayes-Savage agents to policies satisfying appropriate regret guarantees. That is, given an environment $μ$ that does not appear explicitly in the prior but which satisfies $μ \in Φ$ for $Φ$ that appears in the prior, we expect the agent to converge to a policy $π$ whose loss w.r.t. the best response $π_{μ}$ to $μ$ is no greater than the maximin regret for $Φ$ .
Better understanding the relationship of the minimax regret rule to game theory e.g. whether the minimax regret rule can allow a better justification of Nash equilibria than best response (i.e. in a Nash equilibrium all strategies are best responses but not all best responses preserve the Nash equilibrium; maybe minimax regret allows doing better?)
Deriving game-theoretic guarantees for AIXI-like Bayes-Savage agents from the generic regret guarantees above. Presumably, this requires for each agent in the game describing a specific computable incomplete model consistent with the behavior of the other agents (which cannot be complete but should be sufficiently strong in some sense). In general, we expect being able to prove e.g. convergence to iteratively undominated strategies under fairly general conditions and convergence to Nash equilibria (preferably e.g. proper equilibria) under some assumptions on the game (e.g. time discount falling sufficiently slowly).
Looking for Bayes-Savage agents of bounded complexity. Here the algorithm by Koller and Megiddo for two-player zero-sum extensive games with perfect recall might be relevant (in our case the game tree is at least exponentially big w.r.t. horizon length, but such inefficient agents can be a good starting point towards understand bounded agents in general).

AI ALIGNMENT FORUM
AF