The Learning-Theoretic Agenda: Status 2023

Vanessa Kosoy (1y), Review for 2023 Review

This remains the best overview of the learning-theoretic agenda to date. As a complementary pedagogic resource, there is now also a series of video lectures.

Since the article was written, there have been several new publications:

Gergely Szűcs's article on interpreting quantum mechanics using infra-Bayesian physicalism.
My paper on linear infra-Bayesian bandits.
An article on infra-Bayesian haggling by my MATS scholar Hanna Gabor. This approach to multi-agent systems did not exist when the overview was written, and currently seems like the most promising direction.
An article on time complexity in string machines by my MATS scholar Ali Cataltepe. This is a rather elegant method to account for time complexity in the formalism.

In addition, some new developments were briefly summarized in short-forms:

A proposed solution for the monotonicity problem in infra-Bayesian physicalism. This is potentially very important since the monotonicity problem was by far the biggest issue with the framework (and as a consequence, with PSI).
Multiple developments concerning metacognitive agents (see also recorded talk). This framework seems increasingly important, but an in-depth analysis is still pending.
A conjecture about a possible axiomatic characterization of the maximin decision rule in infra-Bayesianism. If true, it would go a long way to allaying any concerns about whether maximin is the "correct" choice.
Ambidistributions: a cute new mathematical gadget for formalizing the notion of "control" in infra-Bayesianism.

Meanwhile, active research proceeds along several parallel directions:

I'm working towards the realization of the "frugal compositional languages" dream. So far, the problem is still very much open, but I obtained some interesting preliminary results which will appear in an upcoming paper (codename: "ambiguous online learning"). I also realized this direction might have tight connections with categorical systems theory (the latter being a mathematical language for compositionality). An unpublished draft was written by my MATS scholars on the subject of compositional polytope MDPs, hopefully to be completed some time during '25.
Diffractor achieved substantial progress in the theory of infra-Bayesian regret bounds, producing an infra-Bayesian generalization of decision-estimation coefficients (the latter is a nearly universal theory of regret bounds in episodic RL). This generalization has important connections to Garrabrant induction (of the flavor studied here), finally sketching a unified picture of these two approaches to "computational uncertainty" (Garrabrant induction and infra-Bayesianism). Results will appear in an upcoming paper.
Gergely Szűcs is studying the theory of hidden rewards, starting from the realization in this short-form (discovering some interesting combinatorial objects beyond what was described there).

It remains true that there are more shovel-ready open problems than researchers, and hence the number of (competent) researchers is still the bottleneck.

harfe (2y)

Regarding direction 17: There might be some potential drawbacks to ADAM. I think it's possible that some very agentic programs have a relatively low score. This is because explicit optimization algorithms have low complexity.

(Disclaimer: the following argument is not a proof, and appeals to some heuristics/etc. We fix $M = M_{0}$ for these considerations too.) Consider a utility function $\hat{U}$. Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for $\hat{U}$) that has an approximation parameter $n$ (this could be AIXI-tl, plus some approximation of $\hat{U}$; higher $n$ gives a better approximation). We will call this approximation of the optimal policy $π_{n}^{\hat{U}}$. This approximation algorithm has complexity $K(π_{n}^{\hat{U}}) = C + K(\hat{U}) + K(n)$, where $C > 0$ is a constant needed to describe the general algorithm (this should not be too large).

We can get a better approximation by using a quickly growing function, such as the Ackermann function, with $n = A(k, k)$. Then we have $K(π_{A(k,k)}^{\hat{U}}) = C + K(\hat{U}) + K(A(k,k)) \leq C + K(\hat{U}) + \log(k)$.
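To get a feel for the gap between the size of $n = A(k, k)$ and the cost of writing down $k$, here is a small Python sketch. It uses the standard two-argument Ackermann-Péter function and the bit length of $k$ as a crude, computable stand-in for the $\log(k)$ term (actual Kolmogorov complexity is uncomputable, so `description_bits` is only an illustrative assumption, not part of the argument).

```python
def ackermann(m: int, n: int) -> int:
    """Two-argument Ackermann-Peter function (only feasible for tiny inputs)."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


def description_bits(k: int) -> int:
    """Bits needed to write k in binary: a rough proxy for the log(k) term."""
    return max(1, k.bit_length())


# k = 4 is deliberately excluded: A(4, 4) is far too large to compute.
table = [(k, description_bits(k), ackermann(k, k)) for k in range(4)]
for k, bits, value in table:
    print(f"k={k}: about {bits} bits to write k, n = A(k,k) = {value}")
```

Already at $k = 3$ the approximation parameter is 61 while $k$ itself takes two bits to write, and the gap grows faster than any primitive recursive function.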

What is the $g$ score of this policy? We have $g(π_{A(k,k)}^{\hat{U}}) = \max_{U} \left( \min_{π' : \dots} K(π') - K(U) \right)$. Let $\bar{U}$ be maximal in this expression. If $K(\bar{U}) \geq K(\hat{U}) - C$, then $g(π_{A(k,k)}^{\hat{U}}) = \min_{π' : E_{ζ_{M_{0}} π'}(\bar{U}) \geq E_{ζ_{M_{0}} π_{A(k,k)}^{\hat{U}}}(\bar{U})} K(π') - K(\bar{U}) \leq K(π_{A(k,k)}^{\hat{U}}) - K(\hat{U}) + C \leq 2C + \log(k)$.
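One step in this case is left implicit: the policy itself satisfies the constraint under the minimum, so it is a feasible candidate for $π'$. Writing $\hat{π}$ as shorthand (introduced here only for readability) for the policy with approximation parameter $A(k, k)$, and using the complexity estimate from the previous paragraph, the chain can be spelled out as follows:

```latex
\begin{align*}
g(\hat{\pi})
  &= \min_{\pi' \,:\, E_{\zeta_{M_0}\pi'}(\bar{U}) \ge E_{\zeta_{M_0}\hat{\pi}}(\bar{U})} K(\pi') - K(\bar{U}) \\
  &\le K(\hat{\pi}) - K(\bar{U})
     && \text{(take } \pi' = \hat{\pi}\text{)} \\
  &\le K(\hat{\pi}) - K(\hat{U}) + C
     && \text{(since } K(\bar{U}) \ge K(\hat{U}) - C\text{)} \\
  &\le \bigl(C + K(\hat{U}) + \log k\bigr) - K(\hat{U}) + C
   = 2C + \log k .
\end{align*}
```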

For the other case, let us assume that if $K(\bar{U}) < K(\hat{U}) - C$, the policy $π_{A(k,k)}^{\bar{U}}$ is at least as good at maximizing $\bar{U}$ as $π_{A(k,k)}^{\hat{U}}$. Then, we have $g(π_{A(k,k)}^{\hat{U}}) = \min_{π' : E_{ζ_{M_{0}} π'}(\bar{U}) \geq E_{ζ_{M_{0}} π_{A(k,k)}^{\hat{U}}}(\bar{U})} K(π') - K(\bar{U}) \leq K(π_{A(k,k)}^{\bar{U}}) - K(\bar{U}) \leq C + \log(k)$.

I don't think that the assumption ($π_{A(k,k)}^{\bar{U}}$ maximizes $\bar{U}$ better than $π_{A(k,k)}^{\hat{U}}$) is true for all $\hat{U}$ and $k$, but plausibly we can select $\hat{U}$ such that this is the case (exceptions, if they exist, would be a bit weird, and ADAM working well due to these weird exceptions would feel a bit disappointing to me). One thing that approximations such as AIXI-tl do not capture is programs that halt but have insane runtime (longer than $A(k, k)$). Again, it would feel weird to me if ADAM sort of works because of low-complexity extremely-long-running halting programs.

To summarize, maybe there exist policies $π_{A(k,k)}^{\hat{U}}$ which strongly optimize a non-trivial utility function $\hat{U}$ with approximation parameter $n = A(k, k)$, but where $g(π_{A(k,k)}^{\hat{U}}) \leq 2C + \log(k)$ is relatively small.

Vanessa Kosoy (2y)

Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes computational complexity into account instead of $K$. For example, you can look at the measure $C$ I defined here. More realistically, this measure should be based on the frugal universal prior.

Frank_R (3y)

I have a question about the conjecture at the end of Direction 17.5. Let $U_{1}$ be a utility function with values in $[0, 1]$ and let $f : [0, 1] \to [0, 1]$ be a strictly monotonic function. Then $U_{1}$ and $U_{2} = f \circ U_{1}$ have the same maxima. $f$ can be non-linear, e.g. $f(x) = x^{2}$. Therefore, I wonder if the condition $u(y) = α v(y) + β$ should be weaker.

Moreover, I ask myself if it is possible to modify $U_{1}$ by a small amount at a place far away from the optimal policy such that $π$ is still optimal for the modified utility function. This would weaken the statement about the uniqueness of the utility function even more. Think of an AI playing Go. If a weird position on the board has the utility -1.01 instead of -1, this should not change the winning strategy. I have to go through all of the definitions to see if I can actually produce a more mathematical example. Nevertheless, you may have a quick opinion on whether this could happen.

Vanessa Kosoy (3y)

I have a question about the conjecture at the end of Direction 17.5. Let $U_{1}$ be a utility function with values in $[0, 1]$ and let $f : [0, 1] \to [0, 1]$ be a strictly monotonic function. Then $U_{1}$ and $U_{2} = f \circ U_{1}$ have the same maxima. $f$ can be non-linear, e.g. $f(x) = x^{2}$. Therefore, I wonder if the condition $u(y) = α v(y) + β$ should be weaker.

No, because it changes the expected value of the utility function under various distributions.
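A tiny numerical illustration of this point (the two lotteries are made up for the example): a sure outcome with utility 0.6 versus a fair coin flip between utility 0 and utility 1. Under $U_{1}$ the sure outcome has the higher expected utility, but after applying $f(x) = x^{2}$ the ranking flips, even though $f$ preserves the ordering of individual outcomes.

```python
from fractions import Fraction


def expected_utility(lottery, transform=lambda u: u):
    """lottery maps each utility value to its probability."""
    return sum(p * transform(u) for u, p in lottery.items())


# Hypothetical lotteries, chosen only to exhibit the reversal.
safe = {Fraction(3, 5): Fraction(1)}
gamble = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}


def square(u):
    return u * u


prefers_safe_u1 = expected_utility(safe) > expected_utility(gamble)
prefers_safe_u2 = expected_utility(safe, square) > expected_utility(gamble, square)
print(prefers_safe_u1, prefers_safe_u2)
```

Positive affine maps $u \mapsto αu + β$ with $α > 0$ are exactly the transformations that never cause such a reversal, which is why the condition cannot simply be relaxed to arbitrary monotone $f$.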

Moreover, I ask myself if it is possible to modify $U_{1}$ by a small amount at a place far away from the optimal policy such that $π$ is still optimal for the modified utility function.

Good catch, the conjecture as stated is obviously false. For example, we can take $U_{2}$ to be the same as $U_{1}$ everywhere except after some action which $π^{*}$ doesn't actually take, where we make it identically 0. Some possible ways to fix it:

Require the utility function to be of the form $U : O^{ω} \to [0, 1]$ (i.e. not depend on actions).
Use (strictly) instrumental reward functions.
Weaken the conclusion so that we're only comparing $U_{1}$ and $U_{2}$ on-policy (but this might be insufficient for superimitation).
Require $π^{*}$ to be optimal off-policy (but it's unclear how this can generalize to finite $g$).
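The counterexample can be made concrete in a toy setting. The sketch below is only an illustration under made-up assumptions: one action from {"a", "b"} followed by one fair-coin observation, with hypothetical utility values. $U_{2}$ agrees with $U_{1}$ after action "a" and is identically 0 after action "b"; the optimal action is the same for both, yet $U_{2}$ is not a positive affine transformation of $U_{1}$.

```python
from fractions import Fraction as F

ACTIONS = ("a", "b")
OBS = (0, 1)
OBS_PROB = {0: F(1, 2), 1: F(1, 2)}  # assumed environment: fair coin, independent of the action

# Hypothetical utilities over (action, observation) histories.
U1 = {("a", 0): F(1, 5), ("a", 1): F(1), ("b", 0): F(1, 2), ("b", 1): F(3, 10)}
# U2 equals U1 on-policy (after "a") and is identically 0 after the untaken action "b".
U2 = {h: (u if h[0] == "a" else F(0)) for h, u in U1.items()}


def optimal_action(U):
    """Action maximizing expected utility in the toy environment."""
    return max(ACTIONS, key=lambda act: sum(OBS_PROB[o] * U[(act, o)] for o in OBS))


def is_positive_affine(u, v):
    """True if v = alpha * u + beta on every history for some alpha > 0."""
    keys = list(u)
    h0 = keys[0]
    h1 = next((h for h in keys if u[h] != u[h0]), None)
    if h1 is None:  # u is constant, so v must be constant as well
        return len(set(v.values())) == 1
    alpha = (v[h1] - v[h0]) / (u[h1] - u[h0])
    beta = v[h0] - alpha * u[h0]
    return alpha > 0 and all(v[h] == alpha * u[h] + beta for h in keys)


print(optimal_action(U1), optimal_action(U2), is_positive_affine(U1, U2))
```

In this toy case the first fix applies directly: if the utility may depend only on the observation, there is no off-policy branch left on which $U_{2}$ can be altered.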

listed alphabetically by last name

By "foundational part" I mean the theory of intelligent agents qua intelligent agents, as opposed to the "applied part" where we use this theory to study alignment per se.

Sometimes we are content with "any except for a set of prior probability 0", such as when we only bound the Bayesian regret in a Bayesian online/bandit/reinforcement learning setting.

See e.g. Lattimore and Szepesvari for an introduction to regret bounds; in particular, section 38 talks about reinforcement learning.

Here, we assume that the first observation comes before the first action, in contrast to my usual convention, because this time it's more convenient.

The diameter is the maximal expected time to reach one state from another state. The bound has to scale with $τ$ since as $τ \to \infty$ , a trap can develop. Alternative parameterizations are also possible, such as using mixing time or bias span. The latter might be advantageous since bounding the diameter also caps the number of states at $O (2^{τ})$ , whereas bounding bias span allows an infinite number of states. On the other hand, the bias span depends on the reward function.
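As an illustration of the definition, the diameter of a small finite MDP can be estimated by computing, for each target state, the minimal expected hitting time via value iteration and then taking the maximum over ordered pairs of distinct states. The sketch below assumes a communicating MDP given as nested lists of transition dictionaries; the function names and the two-state example are hypothetical.

```python
def min_expected_hitting_time(P, target, tol=1e-10, max_iter=100_000):
    """P[s][a] is a dict {next_state: probability}.

    Returns h where h[s] approximates the minimal (over policies) expected
    number of steps to reach `target` from s, via value iteration from h = 0.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(max_iter):
        new_h = [
            0.0 if s == target
            else 1.0 + min(sum(p * h[s2] for s2, p in dist.items()) for dist in P[s])
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    return h


def diameter(P):
    """Max over ordered pairs s != t of the minimal expected time from s to t."""
    n = len(P)
    return max(
        min_expected_hitting_time(P, t)[s]
        for t in range(n) for s in range(n) if s != t
    )


# Toy MDP: in each state, action 0 stays put and action 1 switches state w.p. 1/2.
P = [
    [{0: 1.0}, {0: 0.5, 1: 0.5}],
    [{1: 1.0}, {1: 0.5, 0: 0.5}],
]
print(diameter(P))  # approximately 2, the mean of a geometric variable with p = 1/2
```

If the switching probability in this toy example is driven towards 0, the diameter grows without bound, which matches the remark that a trap develops as $τ \to \infty$.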

Alternatively, we can consider geometric time discount $γ$ , in which case $T$ is replaced by $\frac{1}{1 - γ}$ .
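The usual intuition behind this substitution (stated here as background, not as part of the original footnote) is that the total weight of the geometric discount plays the role of an effective horizon:

```latex
\sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma},
\qquad \text{e.g. } \gamma = 0.99 \;\Longrightarrow\; \frac{1}{1-\gamma} = 100 .
```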

The existence of a computationally feasible optimal policy is usually a much stronger condition than the computational feasibility of simulation: for example, finding an approximately optimal policy for a POMDP is PSPACE-hard while simulating it is in P. Of course, there are degenerate cases in which the optimal policy is easy even though simulation is hard.

See e.g. Shalev-Shwartz and Ben-David.

The name comes from Kalai and Lehrer.

I haven't properly studied the prior work relating automata to category theory (which definitely exists) so I make no strong claims to originality here.

The paper uses a theorem stated without proof, for which the citation given is "Lempp, Miller, Ng and Turetsky, 2010, unpublished, private communication". However, Lattimore kindly provided me with this unpublished communication upon request, and, to the best of my judgement, the proof is valid.

Infra-Bayesian laws were originally called "belief functions" in the infra-Bayesianism sequence.

I haven't done a sufficiently thorough literature survey, so there might already be such results.

It's not just a law that depends on a point in the support of $Θ$ because there is no disintegration theorem for infradistributions. See this.

While running most programs doesn't have irreversible harmful side-effects, this probably doesn't hold for all programs: for example, we can imagine code that uses some hardware exploit. In particular, this is related to what I previously called non-cartesian daemons. It should be possible to get guarantees that take this into account by e.g. somewhat randomizing the precise code we run each time (it might be related to quantilization).

See e.g. Kearns and Vazirani chapter 8, Higuera 2004 and Higuera 2010.

As a consequence, this section carries an especially high risk of missing some pertinent known results that I'm unfamiliar with.

See e.g. Vereshchagin and Shen.

The convention I'm using here doesn't normalize the sum of payoffs over time, i.e.

$$U_{tot} = \sum_{i=0}^{\infty} γ^{i} U_{i}$$

and the probability of playing a strategy is proportional to $\exp(λ E[U_{tot}])$.
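A minimal sketch of this convention, assuming finite (truncated) payoff streams and treating each stream's discounted total as the expected total of the corresponding strategy; the function names and the numbers are illustrative only.

```python
import math


def discounted_total(payoffs, gamma):
    """Unnormalized discounted sum of a finite payoff stream: sum_i gamma^i * U_i."""
    return sum(gamma ** i * u for i, u in enumerate(payoffs))


def logit_probabilities(expected_totals, lam):
    """Probabilities proportional to exp(lam * E[U_tot]) for each strategy.

    Subtracting the maximum before exponentiating avoids overflow and leaves
    the normalized probabilities unchanged.
    """
    m = max(expected_totals)
    weights = [math.exp(lam * (x - m)) for x in expected_totals]
    z = sum(weights)
    return [w / z for w in weights]


# Two hypothetical strategies with different payoff streams.
totals = [discounted_total([1, 0, 0], 0.5), discounted_total([0, 1, 1], 0.5)]
print(totals, logit_probabilities(totals, lam=2.0))
```

Because the sum is not normalized, changing $γ$ rescales $E[U_{tot}]$ and therefore changes how sharply a given $λ$ concentrates probability on the best strategy.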

It would be more natural to say that the hidden environment is an equivalence class of such objects, where two are considered equivalent if it is not possible to distinguish them in terms of actions and states in $S_{0}$ . However, the distinction is not critical for the exposition.

See e.g. chapter 37 in Lattimore and Szepesvari.

Or maybe just TMs, which would still allow making sense of $ζ_{M}$ as the lower semicomputable environment computed by $M$.

If we want to infer the utility function and the prior simultaneously, we probably need to take into account that some ways to redefine both of them together produce an equivalent decision problem. Hence, it makes sense to focus on recovering their product. Formally, any environment $ζ$ corresponds to a measure $ζ^{pol}$ on $(O \times A)^{ω}$ s.t. for any $π$ , the distribution induced by $ζ$ and $π$ is equal to $ζ^{pol} χ_{π}$ , where $χ_{π}$ is the function that's 1 on sequences compatible with $π$ and 0 otherwise. We can then define the utility measure to be the product $U ζ^{pol}$ .

This doesn't seem to automatically follow from the assumption $g_{0} (π) = + \infty$ . However, it might be possible to show that there is always at least an uncomputable utility function that can be computed with a slow-growing amount of advice.

There might still be alignment overhead. Specifically, (i) the unusual structure of the loss function, (ii) prior shaping to deal with potential traps and (iii) prior shaping to deal with potential side-effects of computations, might incur overhead. We need to return to this question when we have better models. Maybe the overhead is small, or maybe we can prove that significant overhead is unavoidable.

Technically, the mathematical analysis will probably need to focus on the asymptotic in which the agents can run all computations, but the original and the imitator can converge to this limit with different speeds.

from Greek: anthropos (human) + sinepis (coherent, according to ChatGPT)

Speculatively, we might not even have to do that: the AI itself can facilitate superrational cooperation between all agents who could affect the creation of this or similar AI.

Input is a special case of instantiation: saying that program $p$ runs with input $x$ is equivalent to saying that the computation $p (x)$ is instantiated.

We would have to be careful to set the ADAM threshold correctly and make sure the room indeed doesn't contain other agents. Otherwise, we are risking the AI version of "The Fly".

In some "behaviorist" sense. IBP doesn't really have an update rule, and therefore there is no well-defined "posterior" in general.

I hope...
