Review

TLDR: I give an overview of (i) the key problems that the learning-theoretic AI alignment research agenda is trying to solve, (ii) the insights we have so far about these problems and (iii) the research directions I know for attacking these problems. I also describe "Physicalist Superimitation" (previously knows as "PreDCA"): a hypothesized alignment protocol based on infra-Bayesian physicalism that is designed to reliably learn and act on the user's values.

I wish to thank Steven Byrnes, Abram Demski, Daniel Filan, Felix Harder, Alexander Gietelink Oldenziel, John S. Wentworth[1] and my spouse Marcus Ogren for reviewing a draft of this article, finding errors and making helpful comments and suggestions. Any remaining flaws are entirely my fault. 

Preamble

I already described the learning-theoretic agenda (henceforth LTA) in 2018. While the overall spirit stayed similar, naturally LTA has evolved since then, with new results and research directions, and slightly different priorities. This calls for an updated exposition.

There are several reasons why I decided that writing this article is especially important at this point:

  • Most of the written output of LTA focuses on the technical, mathy side. This leaves people confused about the motivation for those inquiries, and how they fit into the big picture. I find myself having to explain it again and again in private conversations, and it seems more efficient to have a written reference.
  • In particular, LTA is often conflated with infra-Bayesianism. While infra-Bayesianism is a notable part, it is not the only part, and without context it's not even clear why it is relevant to alignment.
  • Soares has recently argued that alignment researchers don't "stack", meaning that it doesn't help much to add more researchers to the same programme. I think that for LTA, this is not at all the case. On the contrary, there is a variety of shovel-ready problems that can be attacked in parallel. With this in mind, it's especially important to explain LTA in order to get more people on board.
  • In particular, ALTER has announced a prize designed to encourage work on LTA. I expect this article to be a useful resource for researchers who decide to compete.
  • It seemed important to explain Physicalist Superimitation better, and this is hard to do without the broader context.

I am fairly optimistic that LTA leads to a solution to alignment. On the other hand, it's much harder to say whether the solution will arrive in time. I think that the rate of progress relative to person-hours invested was pretty good so far, and we didn't hit any obstacle that cast doubt on the entire endeavor. At the same time, in absolute terms we still have most of the work in front of us. In the coming years, I am planning to put considerable effort into scaling up: getting more researchers on board, and hopefully accelerating progress by a significant factor.

Philosophy

The philosophy of LTA was covered in the original article, I mostly stand behind what I wrote there and don't want to repeat it. The goal of this section is merely to recap and add a couple of points.

The goal of LTA is creating a general mathematical theory of intelligent agents. There are a number of reasons why this is necessary for AI alignment:

  • Having a mathematical theory enables constructing models in which we can prove that a particular AI design is aligned (or unaligned), or at least form rigorous and strongly supported conjectures (similar to the role  in cryptography). While such a proof doesn't imply absolute confidence, it does allow us to reduce the problem to becoming confident in the assumptions of the model. This is a much better situation than dealing with a sequence of heuristic steps, each of which might be erroneous and/or hiding assumptions that are not stated clearly. Of course, similarly to how in cybersecurity having a provably secure protocol doesn't protect you from e.g. someone using their birthday for a password, here too we will need to meticulously verify that the assumptions hold in the actual implementation. This might require knowledge from domains outside of theoretical computer science, e.g. physics, cognitive science, evolutionary biology etc.
  • Empirical data is insufficient in itself, since without an underlying theory it's very hard to be confident about how the results extrapolate to new domains, scales, architectures and algorithms. On the other hand, the combination of theory and experiment can be extremely powerful for such extrapolation, even if the theory cannot produce quantitative predictions ab initio on its own.
  • Even if we don't do any explicit calculations using the theory about the AI we are designing, merely knowing the theory leads to much better intuitions, and equips us with the right vocabulary to reason about the problem. To give an analogy, it is much easier to reason about designing engines if you're familiar with concepts such as "heat", "work", "temperature", "entropy", even if you're just reasoning informally rather than actually calculating.

After I wrote the original article about LTA, more was written about the feasibility and importance of mathematics for alignment, both by me and by others.

Why is LTA concerned specifically with agents? Aren't there AI systems which are not agents? The reason is: the sort of risks I want to address are risks that arise from AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue unaligned goals, with catastrophic consequences), and the sort of solution I envision also relies on AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue an aligned goal, s.t. the result is protecting humanity from unaligned AI).

Finally, I want to address a common point of confusion. Sometimes I tell someone about subproblems in LTA and they respond by trying to understand why each individual subproblem (e.g. nonrealizability, or cartesian privilege, or value ontology) is relevant to alignment. Often there are some specific ways in which the subproblem can be connected to alignment. But the more important point is that any fundamental question about intelligent agency is relevant to alignment, just because without answering this question we cannot understand intelligent agency. For any aspect of intelligent agency that we don't understand and any AI that we design, one of the two will hold: either the AI lacks this aspect, in which case it is probably insufficient to protect humanity, or the AI has this aspect but we don't understand why or how, in which case it is probably unaligned (because alignment is a small target, and significant unaccounted for phenomena are likely to make you miss).

Key Problems

The starting point of LTA is Marcus Hutter's AIXI: a simplistic model of ideal agency.

Intuitively, an "intelligent agent" is a system that builds sophisticated models of the world and uses them to construct and execute plans that lead to particular goals (see also Yudkwosky). Building models implies starting from a state of ignorance and updating on observations, which can be formalized by Bayesian probability theory. Building sophisticated models requires using Occam's razor, which can be formalized by the Solomonoff prior. Constructing and executing plans can be formalized as maximizing expected utility. Putting all these ingredients together gives us AIXI.

However, there are both important gaps both in our understanding of AIXI-like agents, and significant weaknesses in the AIXI framework itself. Solving these problems is the natural framing for the entire foundational[2] part of LTA.

Problem 1: Computational Resource Constraints

AIXI is uncomputable. Obviously, real-world agents are not only computable but operate under strong computational resource constraints. This doesn't mean AIXI is not a useful toy model for studying the interaction of Occam's razor with Bayesian optimization or reinforcement learning. However, it is a major limitation.

One obvious attempted solution is to impose some computational resource constraints on the programs that appear in Solomonoff's prior, for example as in Schmiduber's speed prior or in some other way. This typically leads to agents that are computable but still computationally intractable. Relatedly, a recent line of research connects the hardness of time-bounded Kolmogorov complexity with the existence of one-way functions.

On the other hand, we know some priors for which asymptotic Bayes-optimality is possible in polynomial time, for example priors supported on Markov decision processes (MDPs) with a small (i.e. polynomial in the security parameter) number of states. Some stronger feasible models are also known, e.g. MDPs with linear features or kernel-MDPs. However, all such priors have significant shortcomings:

  • They require the MDP to be communicating, i.e. contain no irreversible transitions (a.k.a. "traps"). This is obviously not true in the real-world, where e.g. jumping off a cliff is often irreversible. Indeed, without this assumption approximating the Bayes-optimal policy is NP-hard, even for a small number of deterministic hypotheses.
  • They usually don't embody any sort of Occam's razor.
  • They are not rich enough to capture sophisticated models of the real world.

Problem 2: Frequentist Guarantees

AIXI is Bayes-optimal (by definition) but is not necessarily optimal in any sense for any particular environment. In contrast, in statistical learning theory we usually demand a frequentist guarantee, i.e. that a learning algorithm converges to optimality (in some sense) for any[3] data source (subject to some assumptions which depend on the setting). Such guarantees are important for several reasons:

  • Learning is a key ability of intelligent agents, and played a fundamental role in AI progress in recent decades. A natural way to operationalize learning is: an agent learned a fact when its behavior is optimal conditional on this fact. In other words, learning which of a set of hypotheses is true means converging to optimal behavior for the true hypothesis, i.e. precisely a frequentist guarantee. This means that a theory of frequentist guarantees tell us both which facts agents can learn and how fast they learn them (or how much data they require to learn them). These are important questions that a theory of intelligent agents has to answer. In particular, it is required to analyze the feasibility of alignment protocols: e.g. if we expect the AI to learn human values, we need to make sure it has sufficient information to learn them within a practical time frame.
  • A Bayes-optimal algorithm does well on average across some large ensemble of possible universes. But in reality, we only observe one universe (which is not even in the ensemble: see Subproblem 2.3 below). Without frequentist guarantees, it's not clear why would a Bayes-optimal algorithm do especially well in our specific universe, or why should algorithms that do well in our specific universe be Bayes-optimal.
  • Evolution selected humans by their performance in the ancestral environment, which involved e.g. gathering berries and running away from lions. But they turned out to perform reasonably well in very different environments that require e.g. writing code or designing rockets. This hints at the existence of some underlying frequentist guarantee as a key property of intelligent agents.

Moreover, in addition to single-agent frequentist guarantees it is desirable to derive multi-agent frequentist guarantees. Interactions between agents are quite important in the real-world, and have significance specifically for AI alignment as well (AI-human, AI-other AI, AI-acausal trade partner, to some extent human-human too). A multi-agent frequentist guarantee can take the form of convergence to a particular game-theoretic solution concept, or an asymptotic lower bound on expected utilities corresponding to some notion of game value.

There are multiple difficulties in deriving frequentist guarantees for AIXI-like agents. The first two difficulties ("traps" and "password guessing games" below) are not problems with the agent, but rather with the way conventional frequentist guarantees are defined (which are nonetheless non-trivial to solve). The third difficulty (nonrealizability) might be a problem with the way the agent is defined.

Subproblem 2.1: Traps

The most common type of frequentist guarantee in reinforcement learning (RL) is regret bounds[4]. However, AIXI doesn't satisfy any regret bound w.r.t. the underlying hypothesis class (i.e. the class of all computable environments) because these hypotheses involve irreversible transitions (traps). For example, suppose that in environment 1 taking action A sends you to a sink state with reward , whereas taking action B sends you to a sink state with reward , and in environment 2 the opposite is true. Obviously a hypothesis class which contains both environments doesn't admit a regret bound.

This problem is especially acute in the multi-agent setting, because other agents have memory. Either it's impossible to erase memory in which case the environment is irreversible by design, or it's possible to erase memory which breaks conventional learning theory in other ways.

Subproblem 2.2: Password guessing games

What if we arbitrarily rule out traps? Consider an agent with action set  and observation set . Suppose that the reward is a function of the last observation only . We can specify an environment[5]  by providing a communicating MDP with state set  and action set  augmented by representation mappings

Here,  is required to be an onto mapping from  to  for any . Also, we require that for any  and 

However, this only enables a fairly weak regret bound. Naively, it seems reasonable to expect a regret bound of the form  where  is the Kolmogorov complexity of the MDP+representation,  is the diameter[6] and  is the time horizon[7], and  are constants s.t. . Indeed,  can be regarded as the amount of information that needs to be learned and e.g. Russo and Van Roy showed that in a bandit setting regret scales as the square root of the entropy of the prior (which also expresses the amount of information to be learned). However, this is impossible, as can be seen from the following example.

Fix  and  (the "password"), let the action set be  and the state space be . We think of a state  as "the agent entered the string of bits  into the terminal". The initial state is  (the empty string). The transition kernel works as follows:

In state , taking action  produces the state  (i.e. the agent enters the digit ). In state , taking action  produces the state  if  (the agent entered the correct password) or state  (the agent entered an incorrect password and has to start from the beginning). In state  taking action  stays in state  and taking action  produces state  (restarts the game).

All states have reward  except for the state  which has reward .

Obviously the agent needs time  to learn the correct password, in expectation w.r.t. the uniform distribution over passwords. At the same time,  and . This is incompatible with having a regret bound of the desired form.

Subproblem 2.3: Nonrealizability

As we discussed in Problem 1, a realistic agent cannot be Bayes-optimal for a prior over all computable environments. Typically, each hypothesis in the prior has to admit a computationally feasible approximately optimal policy in order for it to be feasible to approximate Bayes-optimality (in particular this has to be the case if the prior is learnable), which usually implies the hypotheses are computationally feasible to simulate[8]. However, in this situation we can no longer assume the true environment is within the hypothesis class. Even for AIXI itself, it is not really fair to assume the environment is computable since that would exclude environments that contain e.g. other AIXIs.

A setting in which the true environment is not in the hypothesis class is known in learning theory as "nonrealizable". For offline (classification/regression) and online learning, there is a rich theory of nonrealizable learning[9]. However, for RL (which, among classical learning theory frameworks, is the most relevant for agents), the nonrealizable setting is much less understood (although there are some results, e.g. Zanette et al).

Therefore, even if we arbitrarily assume away traps, and are willing to accept a factor of  in the regret bound, a satisfactory theory of frequentist guarantees would require good understanding of nonrealizable RL.

This problem is also especially acute in the multi-agent case, because it's rarely the case that each agent is a realizable environment from the perspective of the other agent (roughly speaking, if Alice would simulate Bob simulating Alice, Alice would enter an infinite loop). This is known as the "grain of truth" problem[10]. One attempt to solve the problem is by Leike, Taylor and Fallenstein, using agents that are equipped with "reflective oracles". However, there are many different reflective oracles, and the solution relies on all agents in the system to use the same one, i.e. that the design of the agents has been essentially centrally coordinated to make them compatible.

Problem 3: Value Ontology

AIXI's utility function depends on its direct observations. This is also assumed in RL theory. While this might be a legitimate type of agent, it is insufficiently general. We can easily imagine agents that place value on parameters they don't directly observe (e.g. the number of paperclips in the observable universe, or the number of people who suffer from malaria). Arguably, humans are such agents.

The obvious workaround is to translate the utility function from its original domain to observations by taking a conditional expectation. That is, suppose the utility function is  for some space , let  be the space of everything that is directly observed (e.g. action-observation time sequences) and suppose we have some prior over . Then, we can define  by

However, there are problems with this:

  • In order to have reasonable computational complexity and frequentist guarantees, we will need to make assumptions about the utility function. While such assumptions can be natural for , they make much less sense for  unless we can somehow justify them via the prior over . Without a theory that explicitly talks about  and  it is impossible to know whether such justification is possible.
  • Even if  satisfies the required assumptions, a frequentist guarantee about  does not necessarily imply an analogous frequentist guarantee about . That's because the transformation above involves taking expected value which implicitly mixes different "possible universes", while a frequentist guarantee is supposed to refer to a particular universe.
  • Different agents have different observation spaces and there is no natural joint prior on all of them. This means it's impossible to talk about different agents having "aligned" preferences.

Therefore, it is desirable to have a theory in which it is possible to explicitly talk about utility functions with domains other than observations. (See also de Blanc.)

Problem 4: Cartesian Privilege

The formalization of Occam's razor in AIXI relates to the description complexity of hypotheses represented in terms of the agent's actions  and observations  (as functions ). However, this is not how Occam's razor is used in science. When we talk about a "simple" or "elegant" scientific theory, we refer to the equations governing the fundamental degrees of freedom on that theory (e.g. particles, or fields or strings), not to the equations that would be needed to describe the RGB values of points on a person's retina. Indeed, expressing physics via the latter equations would be horrifically complicated.

In other words, AIXI-like agents believe they occupy a privileged position in the universe, which has various pathological consequences. See "infra-Bayesian physicalism" for more discussion and Demski and Garrabrant for related prior work.

Problem 5: Descriptive Agent Theory

The framing of AIXI is: given certain subjective choices (e.g. universal Turing machine and utility function), what is an ideal agent? However, in the real-world agents are not ideal. Moreover, we want to be able to understand which systems are agents, and what kind of agents they are, without already assuming e.g. a specific utility function. In particular, such a theory would have applications to:

  • Value learning, by regarding humans as agents.
  • Studying "accidental" formation of agents, e.g. via evolution or as mesa-optimizers (related: Wentworth's selection theorems programme).

Legg and Hutter defined an intelligence measure that for every policy produces a number determining its intelligence, s.t. AIXI has maximal intelligence. The measure is essentially the policy's expected utility w.r.t. the Solomonoff prior. This allows us to talk how close any given system is to an ideal agent. However, it suffers from two major limitations (in addition to the other problems of AIXI discussed before):

  • It depends on the choice of universal Turning machine (UTM). While the same is true of everything in algorithmic information theory, usually we have theorems that limit this dependence. For example, Kolmogorov complexity only changes by  when we switch to a different UTM. On the other hand, the Legg-Hutter measure doesn't have any such property (except in the asymptotic in which it approaches 0, which is pretty meaningless).
  • It depends on the choice of utility function. Technically, they don't make an explicit choice but instead assume the reward is one of the observation channels. In practice, this is a choice (and a very restrictive one).

Moreover, naive attempts to ascribe a utility function to a policy run into difficulties. Specifically, any policy can arguably be ascribed a utility function which rewards following precisely this policy. With respect to such a utility function, the policy is Bayes-optimal for any prior. Armstrong and Mindermann have argued on the basis of this type of reasoning that ascribing a utility function is essentially impossible in a purely behaviorist framework (but I argue the opposite, see Direction 17.5 below).

Nonproblem: Expected Utility

Finally, I want to briefly discuss one issue which I don't consider a key problem. Namely, the reliance of AIXI on expected utility maximization.

Expected utility is often justified by coherence theorems such as von Neumann–Morgenstern (VNM) and Savage. These theorems are indeed important: they show that a small number of natural assumptions produce a fairly narrow mathematical object (expected utility). This should indeed increase our credence in expected utility as a useful model. Often, objections are put forward that argue with individual assumptions. IMO these objections miss the point: a coherence theorem is a piece of evidence, not a completely watertight proof that expected utility is philosophically correct. And, if expected utility is philosophically correct, such a watertight proof is still not something we should expect to exist (what would it even look like?)

The more important justification of expected utility is the large body of work it supports: control theory, game theory, reinforcement learning theory etc. The methodology I believe in is, start with plausible assumptions and try to build a theory. If the theory that results is rich, has synergy with other bodies of knowledge, has explanatory power, is useful in practice, each of these is evidence that it's on the right track. (And if we failed to build such a theory, we probably also learned something.)

Another objection that is often raised is "humans have no utility function". This is ostensibly supported by research in behavioral economics which shows that humans behave irrationally. I consider this deeply unconvincing. For one thing, I suspect that a large part of that research doesn't replicate. But even putting that aside, the interesting thing about irrational behavior is that we recognize it as irrational: i.e. learning about it makes you behave differently (unless it's so minor that it isn't worth the effort). This already indicates that this irrationality is better viewed not as a fundamental fact about human preferences, but as some combination of:

  • Computational or informational resource constraints
  • Error terms that vanish in the relevant asymptotic regime
  • Random noise or some other effect that's not part of the cognitive algorithm

All of this is not to say that expected utility cannot be questioned. Rather that a productive objection would be to either come up with a demonstrably superior alternative, or at least a good argument why some extremely concrete problem cannot be solved without an alternative to expected utility. Lacking that, the best strategy is to press forward unless and until we encounter such a problem.

Now, we actually do have examples where deviating from purist VNM in specific ways is useful, e.g.:

  • Infra-Bayesianism, where the conventional notion of "expectation" is replaced by some non-linear concave functional.
  • Taylor's quantilization, where instead of maximizing expected utility we sample a random choice out of some top fraction of choices.
  • Nash bargaining, where we maximize the product of several expected utilities (see also Garrabrant's "