# 20

In this essay I will try to explain the overall structure and motivation of my AI alignment research agenda. The discussion is informal and no new theorems are proved here. The main features of my research agenda, as I explain them here, are

• Viewing AI alignment theory as part of a general abstract theory of intelligence

• Using desiderata and axiomatic definitions as starting points, rather than specific algorithms and constructions

• Formulating alignment problems in the language of learning theory

• Evaluating solutions by their formal mathematical properties, ultimately aiming at a quantitative theory of risk assessment

• Relying on the mathematical intuition derived from learning theory to pave the way to solving philosophical questions

# Philosophy

In this section I explain the key principles and assumptions that motivate my research agenda.

## The importance of rigor

I believe that the solution to AI alignment must rely on a rigorous mathematical theory. The algorithms that comprise the solution must be justified by formal mathematical properties. All mathematical assumptions should be either proved or at least backed by considerable evidence, like the prominent conjectures of computational complexity theory. This needs to be the case because:

• We might be facing one-shot success or failure. This means we will have little empirical backing for our assumptions.

• To the extent we have or will have empirical evidence about AI, without a rigorous underlying theory it is very hard to know how scalable and transferable the conclusions are.

• The enormity of the stakes demands designing a solution which is as reliable as possible, limited only by the time constraints imposed by competing unaligned projects.

That said, I do expect the ultimate solution to have aspects that are not entirely rigorous, specifically:

• The quantitative risk analysis will probably rely on some parameters that will be very hard to determine from first principles, because of the involvement of humans and our physical universe in the equation. These parameters might be estimated through (i) study of the evolution of intelligence, (ii) study of human brains, (iii) experiments with weak AI and its interaction with humans, (iv) our understanding of physics. Nevertheless, we should demand the solution to be highly reliable even given cautious error margins on these parameters.

• The ultimate solution will probably involve some heuristics. However, it should only involve heuristics that are designed to improve AI capabilities without invalidating any of the assumptions underlying the risk analysis. Thus, in the worst-case scenario these heuristics will fail and the AI will not take off, but it will not become unaligned.

• In addition to the theoretical analysis, we do want to include as much empirical testing as possible, to provide an additional layer of defense. At the least, it can be a last-ditch protection in the (hopefully very unlikely) scenario that some error got through the analysis.

## Metaphilosophy and the role of models

In order to use mathematics to solve a real-world problem, a mathematical model of the problem must be constructed. When the real-world problem can be defined in terms of data that is observable and measurable, the validity of the mathematical model can be ascertained using the empirical method. However, AI alignment touches on problems that are philosophical in nature, meaning that there is still no agreed-upon empirical or other criterion for evaluating an answer. Dealing with such problems requires a metaphilosophy: a way of evaluating answers to philosophical questions.

Although I do not claim a fully general solution to metaphilosophy, I think that, pragmatically, a quasiscientific approach is possible. In science, we prefer theories that are (i) simple (Occam's razor) and (ii) fit the empirical data. We also test theories by gathering further empirical data. In philosophy, we can likewise prefer theories that are (i) simple and (ii) fit intuition in situations where intuition feels reliable (i.e. situations that are simple, familiar or have received considerable analysis and reflection). We can also test theories by applying them to new situations and trying to see whether the answer becomes intuitive after sufficient reflection.

Moreover, I expect progress on most problems to be achieved by means of successive approximations. This means that we start with a model that is grossly oversimplified but that already captures some key aspects of the problem. Once we have a solution within this model, we can start to attack its assumptions and arrive at a new, more sophisticated model. This process should repeat until we arrive at a model that (i) has no obvious shortcomings and that (ii) we seem unable to improve despite our best efforts.

As in science, we can never be certain that a theory is true. Any assumption or model can be questioned. This requires striking a balance between complacency and excessive skepticism. To avoid complacency, we need to keep working to find better theories. To avoid excessive skepticism, we should entertain hypotheses honestly and acknowledge when a theory is already capable of passing non-trivial quasiscientific tests. Reaching agreement is harder work than in science (because our tests rely on intuition, which may vary from individual to individual), but we should not despair of that goal.

## Intelligence is understandable

It is possible to question whether a mathematical theory of intelligence is possible at all. After all, we don't expect to have a tractable mathematical theory of Rococo architecture, or a simple equation describing the shape of the coastline of Africa in the year 2018.

The key difference is that intelligence is a natural concept. Intelligence, the way I use this word in the context of AI alignment, is the ability of an agent to make choices in a way that effectively promotes its goals, in an environment that is not entirely known or even not entirely knowable. Arguing over the meaning of the word would be a distraction: this is the meaning relevant to AI alignment, because the entire concern of AI alignment is about agents that effectively pursue their goals, undermining the conflicting goals of the human species. Moreover, intelligence is (empirically) a key force in determining the evolution of the physical universe.

I conjecture that natural concepts have useful mathematical theories, and this conjecture seems to me supported by evidence in natural and computer science. It would be nice to have this conjecture itself follow from a mathematical theory, but this is outside of my current scope. Also, we already have some progress towards a mathematical theory of intelligence (I will discuss it in the next section).

A related question is whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded in using their intelligence to solve problems in very different environments. That said, I am less confident about this than about the previous question. In any case, having a mathematical theory of intelligence should allow us to resolve this question too, whether positively or negatively.

## Value alignment is understandable

The core of AI alignment is reliably transferring human values to a strong AI. However, the problem of defining what we mean by "human values" is a philosophical problem. A common and natural model of "values" is expected utility maximization: this is what we find in game theory and economics, and this is supported by VNM and Savage theorems. However, as often pointed out, humans are not perfectly rational, therefore it's not clear in what sense they can be said to maximize the expectation of a specific utility function.

Nevertheless, I believe that "values" is also a natural concept. Denying the concept of "values" altogether is tantamount to nihilism, and in such a belief system there is no reason to do anything at all, including saving yourself and everyone else from a murderous AI. Admitting the general concept of "values" as something complex and human specific (despite the focus on "values" rather than "human values") seems implausible, since intuitively we can easily imagine alien minds facing a similar AI alignment problem. Moreover, the concept of "values" is part and parcel of the concept of "intelligence", so if we believe that "intelligence" (due to its importance in shaping the physical world) is a natural concept, then so are "values".

Therefore, I conjecture that there is a simple mathematical theory of imperfect rationality, within which the concept of "human values" is well-defined modulo the (observable, measurable) concept of "humans". Some speculation on what this theory looks like appears in the following sections.

Now, that doesn't mean that "human values" are perfectly well-defined, any more than, for example, the center of mass of the sun is perfectly well-defined (which would require deciding exactly which particles are considered part of the sun). However, just as the center of mass of the sun is sufficiently well-defined for many practical purposes in astrophysics, the concept of "human values" should be sufficiently well-defined for designing an aligned AGI. To the extent alignment remains ambiguous, the resolution of these ambiguities doesn't have substantial moral significance.

# Foundations

In this section I briefly explain the mathematical tools with which I set out to study AI alignment, and the outline of the mathematical theory of intelligence that these tools already painted.

## Statistical Learning Theory

Statistical learning theory studies the information-theoretic constraints on various types of learning tasks, answering questions such as, when is a learning task solvable at all, and how much training data is required to solve the learning task within given accuracy (sample complexity). Learning tasks can be broadly divided into:

• Classification tasks: The input is sampled from a fixed probability distribution, and the objective is assigning the correct label. The deployment phase (during which the performance of the algorithm is evaluated) is distinct from the training phase (during which the correct labels are revealed).

• Online learning / multi-armed bandits: There is no distinction between deployment and training. Instead, the algorithm's performance on each round is evaluated, but also the algorithm might receive some feedback on its performance. The behavior of the environment might change over time, possibly even respond to the algorithm's output. However, we only evaluate each output conditioned on the past history (we don't require the algorithm to plan ahead).

• Reinforcement learning: There is two-sided interaction between the algorithm and the environment, and the algorithm's performance is the aggregate of some reward function over time. The algorithm is required to plan ahead in order to achieve optimal performance. This might or might not assume "resets" (when the environment periodically returns to the initial state) or partition of time into "episodes" (when the performance of the algorithm is only evaluated conditioned on the previous episodes, so that it doesn't have to plan ahead more than one episode into the future).

It is the last type of learning tasks, in particular assuming no resets or episodes, that is the most relevant for studying intelligence in the relevant sense. Indeed, the abstract setting of reinforcement learning is a good formalization for the informal definition of intelligence we had before. Note that the name "reward" might be misleading: this is not necessarily a signal received from outside, but can just as easily be some formally specified mathematical function.

In online learning and reinforcement learning, the theory typically aims to derive upper and lower bounds on "regret": the difference between the expected utility received by the algorithm and the expected utility it would receive if the environment was known a priori. Such an upper bound is effectively a performance guarantee for the given algorithm. In particular, if the reward function is assumed to be "aligned" then this performance guarantee is, to some extent, an alignment guarantee. This observation is not vacuous, since the learning protocol might be such that the true reward function is not directly available to the algorithm, as exemplified by DIRL and DRL. Thus, formally proving alignment guarantees takes the form of proving appropriate regret bounds.
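To make the notion of regret concrete, here is a minimal sketch for a two-armed Bernoulli bandit. Everything specific in it is an assumption made for illustration and does not come from the text: the arm means, the horizon, and the choice of UCB1 as the learning algorithm. It measures pseudo-regret, i.e. the summed gap between the mean of the best arm (what one would collect if the environment were known a priori) and the mean of the arm actually pulled.

```python
import math
import random


def pseudo_regret(means, choose_arm, horizon, seed=0):
    """Sum over rounds of (best arm mean - mean of the arm actually pulled)."""
    rng = random.Random(seed)
    counts = [0] * len(means)   # number of pulls per arm
    sums = [0.0] * len(means)   # total observed reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = choose_arm(t, counts, sums)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def round_robin(t, counts, sums):
    """Non-learning baseline: cycle through the arms forever."""
    return (t - 1) % len(counts)


def ucb1(t, counts, sums):
    """UCB1: pull each arm once, then maximise empirical mean plus exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


MEANS = [0.2, 0.8]   # hypothetical environment, unknown to the learner
HORIZON = 2000
regret_round_robin = pseudo_regret(MEANS, round_robin, HORIZON)
regret_ucb1 = pseudo_regret(MEANS, ucb1, HORIZON)
print(regret_round_robin, regret_ucb1)
```

The round-robin baseline wastes half of its rounds, so its regret grows linearly with the horizon, while UCB1 is known to have regret that grows only logarithmically. A regret bound of the kind discussed above is a statement about this growth rate that holds for a whole class of environments rather than for one example.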

## Computational Learning Theory

In addition to information-theoretic considerations, we have to take into account considerations of computational complexity. Thus, after deriving information-theoretic regret bounds, we should continue to refine them by constraining our algorithms to be computationally feasible (which typically means running in polynomial time, but we may also need to consider stronger restrictions, such as restrictions on space complexity or parallelizability). If we consider Bayesian regret (i.e. the expected value of regret w.r.t. some prior on the environments), this effectively means we are dealing with average-case complexity. Note that imposing computational constraints on the agent implies bounded reasoning / non-omniscience and already constitutes a departure from "perfect rationality" in a certain sense.

More precisely, it is useful to differentiate between at least two levels of computational feasibility (see also this related essay by Alex Appel). On the first level, which I call "weakly feasible", we allow the computing time to scale polynomially with the number of hypotheses we consider, or exponentially with the description length of the correct hypothesis (these two are more or less interchangeable since the number of hypotheses of given description length is exponential in this length). Thus, algorithms like Levin's universal search or Solomonoff induction over programs with polynomial time complexity, or Posterior Sampling Reinforcement Learning with a small number of hypotheses, fall into this category. On the second level, which I call "strongly feasible", we require polynomial computing time for the "full" hypothesis space. At present, we only know how to achieve theoretical guarantees on this second level in narrow contexts, such as reinforcement learning with a small state space (i.e. with number of states polynomial in the security parameter).

In fact, the current gap in our theoretical understanding of deep learning is strongly related to the gap between weak and strong feasibility. Indeed, results about expressiveness and (statistical) learnability of neural networks are well-known, but exact learning of neural networks is NP-complete in the general case. Understanding how this computational barrier is circumvented in practical problems is a key challenge in understanding deep learning. Such understanding would probably be a positive development in terms of AI alignment (although it might also contribute to increasing AI capabilities), but I don't think it's a high priority problem since it seems to already receive considerable attention in mainstream academia (i.e. it is not neglected).

I believe that the development of AI alignment theory should proceed by prioritizing information-theoretic analysis first, complexity-theoretic analysis in the sense of weak feasibility second, and complexity-theoretic analysis in the sense of strong feasibility last. That said, we should keep the complexity-theoretic considerations in mind, and strive to devise solutions that at least seem feasible modulo "miracles" similar to deep learning (i.e. modulo intractable problems that are plausibly tractable in realistic special cases). Moreover, certain complexity-theoretic considerations are already implicit in the choice of the space of hypotheses for your learning problem (e.g. Solomonoff induction has to be truncated to polynomial-time programs to be even weakly feasible). In particular, we should keep in mind that the hypotheses must be computationally simpler than the agent itself, whereas the universe must be computationally more complex than the agent itself. More on resolving this apparent paradox later.

## Algorithmic Information Theory

The choice of hypothesis space plays a crucial role in any learning task, and the choice of prior plays a crucial role in Bayesian reinforcement learning. In narrow AI this choice is based entirely on the prior knowledge of the AI designers about the problem. On the other hand, general AI should be able to learn its environment with little prior knowledge, by noticing patterns and using Occam's razor. Indeed, the latter is the basis of epistemic rationality to the best of our understanding. The Solomonoff measure is an elegant formalization of this idea.

However, Solomonoff induction is incomputable, so a realistic agent would have to use some truncated form of it, for example by bounding the computational resources made available to the universal Turing machine. It thus becomes an important problem to find a natural prior such that:

• It allows for a (sufficiently good) sublinear regret bound with a computationally feasible algorithm.

• It ranks hypotheses by description complexity in some appropriate sense.

• It satisfies some universality properties analogous to the Solomonoff measure (but appropriately weaker).
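As a toy illustration of a prior that ranks hypotheses by description length, and of what "weakly feasible" costs, the sketch below runs Bayesian prediction over all repeating bit patterns up to a fixed length. It is not Solomonoff induction and not a candidate for the natural prior sought above: the hypothesis class (periodic sequences), the weight `4 ** -length`, and all function names are assumptions chosen only to keep the example small.

```python
from itertools import product


def hypotheses(max_len):
    """All bit patterns of length 1..max_len; pattern p predicts p repeated forever."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def prior_weight(pattern):
    """Assumed Occam-style weight: shorter descriptions get exponentially more mass."""
    return 4.0 ** (-len(pattern))


def consistent(pattern, observed):
    """True if repeating the pattern reproduces every observed bit."""
    return all(pattern[i % len(pattern)] == bit for i, bit in enumerate(observed))


def predict_next(observed, max_len):
    """Posterior probability that the next bit is '1'."""
    mass = {"0": 0.0, "1": 0.0}
    for p in hypotheses(max_len):
        if consistent(p, observed):
            mass[p[len(observed) % len(p)]] += prior_weight(p)
    total = mass["0"] + mass["1"]
    if total == 0.0:
        raise ValueError("no hypothesis is consistent with the observations")
    return mass["1"] / total


def hypothesis_count(max_len):
    """Number of hypotheses scanned per prediction: exponential in max_len."""
    return sum(2 ** n for n in range(1, max_len + 1))


print(predict_next("0", 2))        # short patterns dominate the posterior
print(predict_next("010101", 6))   # only alternating patterns survive
print(hypothesis_count(6))
```

Each prediction scans the whole class, so the running time is linear in the number of hypotheses and exponential in the maximal description length. This is the trade-off that the "weakly feasible" level tolerates and the "strongly feasible" level does not.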

## Towards a rigorous definition of intelligence

The combination of perfect Bayesian reinforcement learning and the Solomonoff prior is known as AIXI. AIXI may be regarded as a model of ideal intelligence, but there are several issues that were argued to be flaws in this concept:

• Traps: AIXI doesn't satisfy any interesting regret bounds, because the environment might contain traps. In fact, the set of all computable environments is an unlearnable class of hypotheses: no agent has a sublinear regret bound w.r.t. this class.

• Cartesian duality: AIXI's "reasoning" (and RL in general) seems to assume the environment cannot influence the algorithm executed by the agent. This is unrealistic. For example, if our agent is a robot, then it's perfectly possible to imagine some external force breaking into its computer and modifying its software.

• Irreflexivity: The Solomonoff measure contains only computable hypotheses but the agent itself is uncomputable. In particular, AIXI can satisfy no guarantees pertaining to environments that e.g. contain other AIXIs. An analogous problem persists with any simple attempt to modify the prior: the prior can only contain hypotheses simpler than the agent.

• Decision-theoretic paradoxes: AIXI seems to be similar to a Causal Decision Theorist, so apparently it will fail on Newcomb-like problems.
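The claim about traps in the list above can be seen in the smallest possible example. The sketch below is a hypothetical construction, not taken from the text: there are two environments, the first action is either "left" or "right", one of them is a trap (reward 0 forever) and the other is safe (reward 1 forever), and the two environments disagree about which is which. The horizon is finite and undiscounted for simplicity.

```python
def worst_case_regret(p_left, horizon):
    """Expected regret of an agent that opens with 'left' with probability p_left,
    maximised over the two environments."""
    regret_when_left_is_trap = p_left * horizon
    regret_when_right_is_trap = (1.0 - p_left) * horizon
    return max(regret_when_left_is_trap, regret_when_right_is_trap)


def best_achievable_regret(horizon, grid=100):
    """Minimise the worst-case regret over a grid of first-move probabilities."""
    return min(worst_case_regret(k / grid, horizon) for k in range(grid + 1))


for horizon in (1000, 2000, 4000):
    print(horizon, best_achievable_regret(horizon))
```

Since the agent has no information before its first move, the best it can do is randomise evenly, and even then the worst-case regret is half the horizon. The regret therefore grows linearly no matter what the agent does, which is what makes a class containing both environments unlearnable.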

The Cartesian duality problem and the traps problem are actually strongly related. Indeed, one can model any event that destroys the agent (including modifying its source code) as the transition of the environment into some inescapable state. Such a state should be assigned a reward that corresponds to the expected utility of the universe going on without the agent. However, it's not obvious how the agent can learn to anticipate such states, since observing it once eliminates any chance of using this knowledge later. DRL already partially addresses this problem: more discussion in the next section.

Solving irreflexivity requires going beyond the Bayesian paradigm by including models that don't fully specify the environment. More details in the next section.

Finally, the decision-theoretic paradoxes are a more equivocal issue than it seems, because the usual philosophical way of thinking about decision theory assumes that the model of the environment is given, whereas in our way of thinking, the model is learned. This is important: for example, if AIXI is placed in a repeated Newcomb's problem, it will learn to one-box, since its model will predict that one-boxing causes the money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned "causal" relationships are not the same as physical causality. Formalizing other Newcomb-like problems requires solving irreflexivity first, because the environment contains Omega which cannot be simulated by the agent. Therefore, my current working hypothesis is that decision theory will be mostly solved (or dissolved) by

• Solving irreflexivity

• Value learning, which will automatically learn some aspects of the decision theory too

• Allowing for self-modification, which should be possible after solving irreflexivity + Cartesian duality (self-modification may again be regarded as a terminal state)
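The remark about the repeated Newcomb's problem can be illustrated with a far simpler learner than AIXI. The sketch below is only an analogy under stated assumptions: the predictor is taken to be always right, the payoffs are the customary 1,000,000 and 1,000, and the agent is a hypothetical learner that tries each action once and then picks the action with the best average payoff.

```python
ONE_BOX, TWO_BOX = "one-box", "two-box"


def newcomb_payoff(action):
    """Payoff of one round, assuming a predictor that is always right."""
    opaque_box = 1_000_000 if action == ONE_BOX else 0  # filled iff one-boxing was predicted
    transparent_box = 1_000
    return opaque_box if action == ONE_BOX else opaque_box + transparent_box


def run_learner(rounds):
    """Try each action once, then always pick the action with the best average payoff."""
    totals = {ONE_BOX: 0.0, TWO_BOX: 0.0}
    counts = {ONE_BOX: 0, TWO_BOX: 0}
    history = []
    for _ in range(rounds):
        untried = [a for a in (ONE_BOX, TWO_BOX) if counts[a] == 0]
        if untried:
            action = untried[0]
        else:
            action = max(totals, key=lambda a: totals[a] / counts[a])
        totals[action] += newcomb_payoff(action)
        counts[action] += 1
        history.append(action)
    return history


history = run_learner(10)
print(history)
```

The learner never reasons about physical causality. It simply observes that one-boxing is followed by a large payoff and two-boxing by a small one, and acts on that learned regularity.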

To sum up, clarifying all of these issues should result in formulating a certain optimality condition (regret bound) which may be regarded as a rigorous definition of intelligence. This would also constitute progress towards defining "values" (having certain values means being intelligent w.r.t. these values), but the latter might require making the definition even more lax. More on that later.

# Research Programme Outline

In this section I break down the research programme into different domains and subproblems. The list below is not intended to be a linear sequence. Indeed, many of the subproblems can be initially attacked in parallel, but also many of them are interconnected and progress in one subproblem can be leveraged to produce a more refined analysis of another. Any concrete plan I have regarding the order with which these questions should be addressed is liable to change significantly as progress is made. Moreover, I expect the entire breakdown to change as progress is made and new insights are available. However, I do believe that the high-level principles of the approach have a good chance of surviving, in some form, into the future.

## Universal reinforcement learning

The aim of this part in the agenda is deriving regret bounds or other performance guarantees for certain settings of reinforcement learning that are simultaneously strong enough and general enough to serve as a compelling definition / formalization of the concept of general intelligence. In particular, this involves solving the deficiencies of AIXI that were pointed out in the previous section.

I believe that a key step towards this goal is solving the problem of "irreflexivity". That is, we need to define a form of reinforcement learning in which the agent achieves reasonable performance guarantees despite an environment which is as complex as, or more complex than, the agent itself. My previous attempts to make progress towards that goal include minimax forecasting and dominant forecasters for incomplete models. There, the aim was passive forecasting rather than reinforcement learning.

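The idea of minimax forecasting can be naturally extended to reinforcement learning. Environments in reinforcement learning naturally form a convex set $\mathcal{H}$ in some topological vector space (where convex linear combinations correspond to probabilistic mixtures). Normally, models are points of $\mathcal{H}$, i.e. specific environments. Instead, we can consider incomplete models which are non-empty convex subsets of $\mathcal{H}$. Instead of considering $\mathrm{EU}^\pi_\mu(\gamma)$, the expected utility of policy $\pi$ interacting with environment $\mu$, we can consider $\mathrm{EU}^\pi_\Phi(\gamma) := \inf_{\mu \in \Phi} \mathrm{EU}^\pi_\mu(\gamma)$, where $\Phi \subseteq \mathcal{H}$ is an incomplete model: the minimal guaranteed expected utility of $\pi$ for environments compatible with the incomplete model $\Phi$. We can define a set of incomplete models $\{\Phi_k\}_{k \in \mathbb{N}}$ to be learnable when there is a metapolicy $\pi^*$ s.t. for any $k \in \mathbb{N}$

$$\lim_{\gamma \to 1} \left( \max_\pi \mathrm{EU}^\pi_{\Phi_k}(\gamma) - \mathrm{EU}^{\pi^*_\gamma}_{\Phi_k}(\gamma) \right) = 0$$

Here, $\gamma$ is the time discount parameter. Notably, this setting satisfies the analogue of the universality property of Bayes-optimality (see "Proposition 1" in this essay). Here, the role of the Bayes-optimal policy is replaced by the policy

$$\pi^\zeta_\gamma := \operatorname{arg\,max}_\pi \mathrm{EU}^\pi_{\Phi_\zeta}(\gamma)$$

Here, $\Phi_\zeta$ is the "incomplete prior" corresponding to some $\zeta \in \Delta\mathbb{N}$:

$$\Phi_\zeta := \left\{ \sum_k \zeta(k) \mu_k \,\middle|\, \mu_k \in \Phi_k \right\}$$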
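A finite toy version may help fix ideas about incomplete models. In the sketch below, all names and numbers are hypothetical: there are two policies and three "extreme" environments, and an incomplete model is the convex hull of a subset of these environments. Because expected utility is linear in the environment under probabilistic mixtures, the worst case over a convex hull is attained at one of its generating points, so a minimum over the generators suffices.

```python
# Expected utility of each policy in each extreme environment (values in [0, 1]).
EU = {
    "cautious": {"mu_a": 0.6, "mu_b": 0.5, "mu_c": 0.55},
    "bold": {"mu_a": 0.9, "mu_b": 0.1, "mu_c": 0.8},
}


def eu_mixture(policy, weights):
    """Expected utility in a probabilistic mixture of environments (linear in the mixture)."""
    return sum(w * EU[policy][mu] for mu, w in weights.items())


def eu_incomplete(policy, generators):
    """Guaranteed expected utility over the convex hull of the given environments."""
    return min(EU[policy][mu] for mu in generators)


def maximin_policy(generators):
    """Policy with the best guaranteed expected utility on the incomplete model."""
    return max(EU, key=lambda pi: eu_incomplete(pi, generators))


print(maximin_policy(["mu_a", "mu_b"]))  # mu_b punishes the bold policy
print(maximin_policy(["mu_a", "mu_c"]))  # without mu_b, bold is safe to follow
```

The sketch only compares two fixed policies. In the setting of the text the maximum ranges over all policies, and the interesting question is whether a single metapolicy can approach this maximin value for every model in a class without knowing in advance which model holds.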

Moreover, it is possible to define an incomplete analogue of MDPs. These are stochastic games, where the choices of the opponent correspond to the "Knightian uncertainty" of the incomplete model. Thus, it is natural to try and derive regret bounds for learning classes of such incomplete MDPs. In fact, this theory might justify the use of finite (or other restricted) MDPs which is common in RL and is needed for deriving most regret bounds. Indeed, there is no reason why physical reality should be a finite MDP, however this does not preclude us from using a finite stochastic game as an incomplete model of reality. In particular, an infinite MDP (and thus also a POMDP, since a POMDP can be reduced to an MDP whose states are belief states = probability measures on the state space of the POMDP) can be approximated by a finite stochastic game by partitioning its state space into a finite number of "cells" and letting the opponent choose the exact state inside the cell upon each transition.

It is possible to generalize this setting further by replacing "crisp" sets of environments by fuzzy sets. That is, we can define a "fuzzy model" to be a function $\phi$ from the set of environments $\mathcal{H}$ to $[0, 1]$ (the membership function) s.t. $\phi^{-1}(1)$ is non-empty. The performance of a policy $\pi$ on the model is then given by

$$\mathrm{EU}^\pi_\phi := \inf_{\mu \in \mathcal{H}} \left( \mathrm{EU}^\pi_\mu + 1 - \phi(\mu) \right)$$

Note that $\mathrm{EU}$ is assumed to take values in $[0, 1]$, so no $\mu$ with $\phi(\mu) = 0$ can affect the above value.

This generalization allows capturing a broad spectrum of performance guarantees. For example, given any policy $\pi_0$ we can define $\phi_{\pi_0}$ by

$$\phi_{\pi_0}(\mu) := \mathrm{EU}^{\pi_0}_\mu$$

Then, learning the model $\phi_{\pi_0}$ amounts to learning to perform at least as well as $\pi_0$, whatever the environment is. Thus, the setting of "fuzzy reinforcement learning" might be regarded as a hybrid of model-based and model-free approaches.
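The following sketch shows how a fuzzy model could be evaluated over a finite set of environments. It rests on an assumed reading of the performance measure, namely the minimum over environments of expected utility plus one minus membership, which is the form suggested by the remark that environments of zero membership cannot affect the value when utilities lie in [0, 1]. The environments, utilities and membership values are hypothetical.

```python
def eu_fuzzy(eu_by_env, membership):
    """Performance on a fuzzy model: min over environments of EU + (1 - membership).

    Assumes utilities and membership values lie in [0, 1] and that at least
    one environment has membership exactly 1.
    """
    if not any(phi == 1.0 for phi in membership.values()):
        raise ValueError("some environment must have membership 1")
    return min(eu_by_env[mu] + (1.0 - membership[mu]) for mu in eu_by_env)


# Expected utility of one fixed policy in three environments.
EU_BOLD = {"mu_a": 0.9, "mu_b": 0.1, "mu_c": 0.8}

# A crisp model is the special case where membership is 0 or 1.
CRISP = {"mu_a": 1.0, "mu_b": 0.0, "mu_c": 1.0}
# A graded model treats mu_b as only half plausible instead of excluding it.
GRADED = {"mu_a": 1.0, "mu_b": 0.5, "mu_c": 1.0}

print(eu_fuzzy(EU_BOLD, CRISP))   # same as the worst case over {mu_a, mu_c}
print(eu_fuzzy(EU_BOLD, GRADED))  # mu_b now lowers the guarantee, but less than fully
```

With a crisp membership function the value coincides with the worst case over the member environments, so incomplete models are recovered as a special case. Intermediate membership values interpolate between ignoring an environment and fully guarding against it.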

One test for any theory attempting to solve irreflexivity is whether it leads to reasonable game-theoretic solution concepts in multi-agent scenarios. For example, it is obvious that incomplete models lead to Nash equilibria in zero-sum games (an incomplete model is a zero-sum game, in some sense), but the situation in more general games is currently unknown. Another sort of test is applying the theory to Newcomb-like decision-theoretic puzzles, although solving all of them might require additional elements, such as self-modification. Further applications of such a theory which may also be regarded as tests will appear in the next subsection.

Next, the problem of traps has to be addressed. DRL partially solves this problem by postulating an advisor that has prior knowledge about the traps. It seems reasonable to draw a parallel between this and real-world human intelligence: humans learn from previous generations regarding the dangers of their environment. In particular, children seem like a salient example of an algorithm employing a lot of exploration while trusting a different agent (the parent) to prevent it from falling into traps. However, from a different perspective, this seems like hiding the difficulty in a different place. Namely, if we consider the whole of humanity as an intelligent agent (which seems a legitimate model at least for the purposes of this particular issue), then how did it avoid traps? To some extent, we can claim that human DNA is another source of prior knowledge, acquired by evolution, but somewhere this recursion must come to an end.

One hypothesis is that the main way humanity avoids traps is by happening to exist in a relatively favorable environment and knowing this fact, on some level. Specifically, it seems rather difficult for a single human or a small group to pursue a policy that will lead all of humanity into a trap (incidentally, this hypothesis doesn't reflect optimistically on our chances to survive AI risk), and also rather rare for many humans to coordinate on simultaneously exploring an unusual policy. Therefore, human history may be very roughly likened to episodic RL where each human life is an episode.

This mechanism should be formalized using the ideas of quantilal control. The baseline policy comes from the prior knowledge / advisor, and the allowed deviation (some variant of Renyi divergence) from the baseline policy is chosen according to the prior assumption about the rate of falling into a trap while following the baseline policy. This should lead to an appropriate regret bound.
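One simple instance of bounding the deviation from a baseline policy is sketched below. It is not the quantilal control construction itself, and it assumes a specific divergence that the text leaves open: the Rényi divergence of order infinity, so that the constraint reads "the new policy may assign each action at most 1/q times its baseline probability". Under that constraint, expected utility is maximised by putting the maximal allowed mass on the highest-utility actions, and the probability of any bad event grows by a factor of at most 1/q. The actions, utilities and trap labels are hypothetical.

```python
def quantilize(baseline, utility, q):
    """Maximise expected utility subject to policy[a] <= baseline[a] / q for every action."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    policy = {a: 0.0 for a in baseline}
    remaining = 1.0
    # Greedily fill the best actions up to their cap until all probability mass is used.
    for a in sorted(baseline, key=lambda a: utility[a], reverse=True):
        mass = min(baseline[a] / q, remaining)
        policy[a] = mass
        remaining -= mass
        if remaining <= 0.0:
            break
    return policy


def expectation(policy, values):
    """Expected value of a function of the action under a policy."""
    return sum(policy[a] * values[a] for a in policy)


BASELINE = {"a1": 0.5, "a2": 0.3, "a3": 0.15, "a4": 0.05}  # advisor / prior knowledge
UTILITY = {"a1": 0.2, "a2": 0.5, "a3": 0.7, "a4": 1.0}     # what the agent believes
TRAP = {"a1": 0.0, "a2": 0.0, "a3": 0.0, "a4": 1.0}        # a4 looks best but is a trap
Q = 0.25

policy = quantilize(BASELINE, UTILITY, Q)
print(policy)
print(expectation(policy, UTILITY), expectation(BASELINE, UTILITY))
print(expectation(policy, TRAP), expectation(BASELINE, TRAP) / Q)
```

The point of the design is the last line: however misleading the utility estimate is, the trap probability of the constrained policy cannot exceed the baseline trap probability divided by q. Choosing q from a prior assumption about the baseline's trap rate is the kind of trade-off between improvement and safety that the paragraph above describes.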

I think that another important step towards universal RL is deriving regret bounds that exploit structural hierarchies. This builds on the intuition that, although the real world is very complex and diverse, the presence of structural hierarchies seems like a nearly universal feature. Indeed, it is arguable that we would never have reached our current level of understanding of physics if there were no separation of scales that allowed studying the macroscopic world without knowing string theory et cetera. I see 3 types of hierarchies that need to be addressed, together with their mutual interactions:

• Temporal hierarchy: Separation between processes that happen on different time-scales. We can try to model it by considering MDPs with a hierarchical state-space, s.t. transitions on a higher level of the hierarchy happen much slower than transitions on a lower level of the hierarchy. This means that w.r.t. a higher level, the lower level can always be considered to occupy an equilibrium distribution over states.

• Spatial hierarchy: Separation between processes that happen on different space-scales. We consider a "cellular decision process" which is an MDP that is structured like a cellular automaton. It is then tempting to try to connect RL theory with renormalization group methods from physics.

• Informational hierarchy: We consider a hierarchical structure on the space of hypotheses. That is, we expect the agent to first learn the high-level class to which the environment belongs, then learn the class corresponding to the lower level of the hierarchy et cetera, until it learns the actual environment. This is a formalization of the idea of "learning how to learn", a rather well-known idea for reducing the sample complexity of reinforcement learning.

In particular, I expect these hierarchies to yield regret bounds which do not have the "trial and error" form of most known regret bounds. That is, known regret bounds imply a sample complexity that is a large multiple of either the reset time (for RL with resets) or the mixing time (for RL without resets). This seems unsatisfactory: a model-based learner should be able to extrapolate its knowledge forward without waiting for a full "cycle" of environment response. Certainly we expect an artificial superintelligence to achieve a pivotal event from the first attempt, in some sense.

Also, the hierarchies should bridge at least part of the gap between weak and strong feasibility. Indeed, many of the successes of deep learning were based on CNNs and Boltzmann machines which seem to be exploiting the spatial hierarchy.

Returning to the issue of traps, there might be some sense in which our environment is "favorable" which is more sophisticated than the discussion before and which may be formalized using hierarchies (e.g. early levels of the information hierarchy can be learned safely and late levels only contain traps predictable by the early levels).

Finally, as discussed in the previous section, defining the correct universal prior and analyzing its properties is crucial to complete the theory. Given the hypotheses put forth in this section, this prior should

• Be a fuzzy prior rather than a "complete" prior

• Possibly consist of a fuzzy version of finite, or otherwise restricted, MDPs (although Leike derives some regret bounds for general environments, at the cost of assuming sufficiently slowly dropping time discount and in particular ruling out geometric time discount). One way to think of it is, the finite MDPs are just an approximation of the infinite reality, however maybe we can also consider this a vindication of some sort of ultrafinitism.

• Reflect some "favorability" assumptions

• Have a hierarchical structure and consist of hierarchical models

## Value learning protocols

The aim of this part of the agenda is developing learning setups that allow one agent (the AI) to learn the values of a different agent or group of agents (humans). This involves directly or indirectly tackling the question of what it means for an agent to have particular values if it is imperfectly rational and possibly vulnerable to manipulation or other forms of "corruption".

At present, I conceive of the following possible basic mechanisms for value learning:

• Formal communication: Information about the values is communicated to the agent in a form with pre-defined formal semantics. Examples of this are, communicating a full formal specification of the utility function or manually producing a reward signal. Other possibilities are, communicating partial information about the reward signal, or evaluating particular hypothetical situations.

• Informal communication: Information about the values is communicated to the agent using natural language or in other form whose semantics have to be learned somehow.

• Demonstration: The agent observes a human pursuing eir values and deduces the values from the behavior.

• Reverse engineering: The agent somehow acquires a full formal specification of a human (e.g. an uploaded brain) and deduces the values from this specification. This is probably not a very realistic mechanism, but might still be useful for "thought experiments" to test possible definitions of imperfect rationality.

Formal communication is difficult because human values are complicated and describing them precisely is hard. A manual reward signal is more realistic than a full specification, but:

• Still difficult to produce, especially if this reward is supposed to reflect the true aggregate of the human's values as observed by em across the universe (which is what it would have to be in order to aim at the "true human utility function").

• The reward signal will become erroneous if the human, or just the communication channel between the human and the agent, is corrupted in some way. This is a serious problem since, if not taken into account, it incentivizes the agent to produce such corruption.

• If the agent is supposed to aim at a very long-term goal, there might be little to no relevant information in the reward signal until the goal is attained.

Overall, it might be more realistic to rely on formal communication for tasks of limited scope (putting a strawberry on a plate) rather than for actually learning human values in full (i.e. designing a sovereign). However, it is also possible to combine several mechanisms in a single protocol, and formal communication might be only one of them.

The problem of corruption may be regarded as a special case of the problem of traps (the latter was outlined in the previous section), if we assume that the agent is expected to achieve its goals without entering corrupt states. Delegative Reinforcement Learning aims to solve both problems by occasionally passing control to the human operator ("advisor"), and using it to learn which actions are safe. The analysis of DRL that I produced so far can and should be improved in multiple ways:

• Instead of considering only a finite or countable set of hypotheses, we should consider a space of hypotheses of finite "dimension" (for some appropriate notion of dimension; it is a common theme in statistical learning theory that different learning setups have different natural notions of dimensionality for hypothesis classes). We then should obtain a regret bound depending on the dimension and the entropy of the prior. I believe that I already have significant progress on this point, with results expected soon.

• Merge the ideas of quantilal control and catastrophe mitigation to yield a setting where corruption is quantitative/gradual rather than boolean and there is a low but non-vanishing rate of corruption along the advisor policy. Successful catastrophe mitigation will be achieved if it is possible without high Renyi divergence from the advisor policy, in some sense.

• In the current form DRL requires the advisor to be ready to act instead of the agent at any given time moment. If the temporal rate at which actions are taken is high, this is an unrealistic demand. Naively, we can solve this by dividing time into intervals, and considering the policy on each interval as a single action. However, this would introduce an exponential penalty into the regret bound. Therefore, a more sophisticated way to manage the scheduling of control between the agent and the advisor is required, with a corresponding regret bound.

• DRL assumes that the advisor takes the optimal action with some minimal probability. The interpretation of probability in this context requires further inquiry. Specifically, it seems that a realistic interpretation would treat this probability as pseudorandom in some sense, s.t. the agent might simultaneously employ a more refined model within which the advisor might even be deterministic. Possibly relevant is the work of Shalizi (hat tip to Alex Appel for bringing it to my attention) where ey show that under some ergodicity assumptions (which should work for us since we use finite MDPs) Bayesian updating converges to the model in the prior that is the nearest to the true environment in some sense (this is reminiscent of optimal estimator theory). Thus, we can imagine having a "full-fledged" prior with refined models and a coarse prior s.t. this pseudorandom probability is defined by "projecting" to it in the Shalizi sense.

• Instead of only considering MDPs with a finite state space, we can consider e.g. Feller continuous MDPs with a compact state space, and/or POMDPs. This has some conceptual importance, since it is unrealistic to assume that the advisor knows the traps of the real physical environment, but it is more realistic to assume the advisor knows the traps of its own belief state regarding the environment (see also "Corollary 1" in the essay about DIRL). However, it seems dubious to describe belief states as finite MDPs, since a probabilistic mixture of finite MDPs is not a finite MDP (but it is a finite POMDP). On the other hand, we also need to consider fuzzy/incomplete "MDPs". As we discussed in the previous subsection, this might actually make it redundant to consider infinite MDPs and POMDPs. Note also that infinite MDPs pose computational challenges and in particular solving even finite POMDPs is known to be PSPACE-complete.

• [EDIT: Added after discussion with Jessica Taylor in the comments] The action that a RL agent takes depends both on the new observation and its internal state. Often we ignore the latter and pretend the action depends only on the history of observations and actions, and this is okay because we can always produce the probability distribution over internal states conditional on the given history. However, this is only ok for information-theoretic analysis, since sampling this probability distribution given only the history as input is computationally intractable. So, it might be a reasonable assumption that the advisor takes “sane” actions when left to its own devices, but it is, in general, not reasonable to assume the same when it works together with the AI. This is because, even if the AI behaved exactly as the advisor, it would hide the simulated advisor’s internal state, which would preclude the advisor from taking the wheel and proceeding with the same policy. We can overcome it by letting the advisor write some kind of “diary” that documents eir reasoning process, as much as possible. The diary is also considered a part of the environment (although we might want to bake into the prior the rules of operating the diary and a “cheap talk” assumption which says the diary has no side effects on the world). This way, the internal state is externalized, and the AI will effectively become transparent by maintaining the diary too (essentially the AI in this setup is emulating a “best case” version of the advisor). This idea deserves a formal analysis that explicitly models the advisor as another RL agent.
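
The "projection in the Shalizi sense" mentioned in the list above can be illustrated with a toy misspecified Bayesian update. This is only a minimal sketch and not part of the DRL analysis: the Bernoulli environment, the three coarse hypotheses and all names below are assumptions chosen for illustration. The true bias lies outside the coarse prior, and the posterior concentrates on the hypothesis with the smallest KL divergence from the truth.

```python
import math
import random

def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def posterior_after(samples, hypotheses):
    """Bayesian posterior over Bernoulli biases, starting from a uniform prior."""
    log_w = [0.0 for _ in hypotheses]
    for x in samples:
        for i, q in enumerate(hypotheses):
            log_w[i] += math.log(q if x == 1 else 1 - q)
    top = max(log_w)                      # subtract the maximum for numerical stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

TRUE_BIAS = 0.6                 # hypothetical "refined" environment
COARSE_PRIOR = [0.2, 0.5, 0.9]  # hypothetical coarse hypotheses; none equals 0.6

rng = random.Random(0)
samples = [1 if rng.random() < TRUE_BIAS else 0 for _ in range(2000)]

posterior = posterior_after(samples, COARSE_PRIOR)
nearest = min(COARSE_PRIOR, key=lambda q: kl_bernoulli(TRUE_BIAS, q))
```

Here `nearest` is the KL-closest coarse hypothesis, and `posterior` puts almost all of its mass on that same hypothesis even though it is wrong about the environment. The i.i.d. setting is of course far simpler than the ergodic setting the bullet refers to.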

There is another issue with DRL that is worth discussing, although I am not sure whether it calls for a formal analysis soon. So far, we assumed that there are no side effects on the environment from the act of delegation itself. That is, the same action has exactly the same results whether carried out by the advisor or by the agent. Obviously, this is not realistic since any physical isolation layer created to ensure this will not be entirely fool-proof (as a bare minimum, the advisor emself will remember which actions ey took). The sole exception is, perhaps, if both the agent and the advisor are programs running inside a homomorphic cryptography box. More generally, any RL setup ignores the indirect (i.e. not mediated by actions) side-effects that the execution of the agent's algorithm has on the environment (although it is more realistic to solve this latter problem by homomorphic cryptography). This issue seems solvable via the use of incomplete/fuzzy models (see previous subsection). Although the true physical environment does have side effects as above, the model the agent tries to learn may ignore those side-effects (i.e. subsume them in the "Knightian uncertainty"). Similar remarks apply to the use of a source of randomness inside the algorithm I analyzed (a form of Posterior-Sampling Reinforcement Learning) that is assumed to be invisible to the environment (although it is also possible to use deterministic algorithms instead: for example, the Bayes-optimal policy is deterministic and necessarily satisfies the same Bayesian regret bound, although it is also not even weakly feasible). One caveat is the possibility of non-Cartesian daemons, defined and discussed in the next subsection.

The demonstration mechanism avoids some of the difficulties with formal communication, but has its own drawbacks. The ability to demonstrate a certain preference is limited by the ability to satisfy this preference. For example, suppose I am offered to play against Kasparov for money: if I win the game, I win $100 and if I lose the game, I lose $100. Then, I will refuse the bet because I know that I have little chance of winning. On the other hand, an AI might be able to win against Kasparov, but, seeing my demonstration it will remain uncertain whether I avoided the game because I'm afraid to lose or because of some other reason (for example, maybe I don't want to have more money, or maybe there is something intrinsically bad about playing chess). Therefore, it seems hard to produce a performance guarantee which will imply successfully learning the human's preferences and significantly outperforming the human in satisfying these preferences. In particular, the regret bound I currently have for Delegative Inverse Reinforcement Learning assumes that the "advisor" (the human) already takes the optimal action with maximal likelihood among all actions on any given time step.
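
The Kasparov example can be made concrete with a small sketch. Everything numeric here is an assumption for illustration (the win probabilities, the two value hypotheses and the function name are hypothetical): it shows that the observed refusal has the same likelihood under both hypotheses, while the hypotheses disagree about what a stronger player should do.

```python
def accepts_bet(win_prob, value_of_dollar, stake=100):
    """A rational bettor accepts iff the expected utility of the bet is positive."""
    gain = win_prob * stake * value_of_dollar
    loss = (1 - win_prob) * stake * value_of_dollar
    return gain - loss > 0

HUMAN_WIN_PROB = 0.01  # assumed chance of the human beating Kasparov
AI_WIN_PROB = 0.9      # assumed chance of a strong AI player winning

# Two hypothetical value hypotheses the AI entertains about the human.
hypotheses = {
    "wants_money": 1.0,           # each dollar is worth one unit of utility
    "indifferent_to_money": 0.0,  # money has no value to the human
}

# Both hypotheses predict that the human refuses, so observing the refusal
# provides no evidence for telling them apart.
human_predictions = {name: accepts_bet(HUMAN_WIN_PROB, v) for name, v in hypotheses.items()}

# Yet they disagree about what the AI should do when acting for the human.
ai_recommendations = {name: accepts_bet(AI_WIN_PROB, v) for name, v in hypotheses.items()}
```

Both entries of `human_predictions` are `False`, while `ai_recommendations` differs between the hypotheses, which is exactly the ambiguity that blocks a guarantee of outperforming the demonstrator.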

In other words, the advantage of providing a manual reward signal over demonstration is related to the separation between P and NP. Solving a problem is much harder than verifying a solution, and similarly, demonstrating behavior that maximizes certain values is (computationally) much harder than evaluating a behavior according to the same values.

We can use the demonstration mechanism to a much greater advantage by designing a protocol that allows the human to learn from the AI. This is because many tasks can be much easier to solve with external advice. On the other hand, the AI can tell which advice is useful by seeing that the human changes eir behavior. For example, suppose that the task at hand is receiving the adjacency matrices of two graphs and producing an isomorphism between them. Then, if the human is unable to always solve this on eir own, it might be impossible for the AI to unambiguously understand what the task is. However, suppose that the AI can make several plausible guesses and test them by communicating the corresponding answer to the human. It will then observe which of those guesses cause the human to endorse the answer, and will thereby become able to carry out the task on its own.
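
The gap between producing and checking an answer in the graph isomorphism example can be sketched as follows. This is an illustration under simplifying assumptions, not a proposed protocol: the function names are hypothetical, and the brute-force search is just the simplest possible solver (more efficient algorithms exist). Checking a proposed mapping takes a number of comparisons quadratic in the number of vertices, whereas the naive search tries up to n! mappings.

```python
from itertools import permutations

def is_isomorphism(adj_a, adj_b, mapping):
    """Check that sending vertex i of A to mapping[i] of B preserves adjacency."""
    n = len(adj_a)
    return all(
        adj_a[i][j] == adj_b[mapping[i]][mapping[j]]
        for i in range(n)
        for j in range(n)
    )

def find_isomorphism(adj_a, adj_b):
    """Brute-force search over all vertex mappings; returns one, or None."""
    n = len(adj_a)
    if n != len(adj_b):
        return None
    for mapping in permutations(range(n)):
        if is_isomorphism(adj_a, adj_b, mapping):
            return mapping
    return None

# A path 0-1-2, the same path relabelled so that vertex 0 is the middle, and a triangle.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
relabelled_path = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

found = find_isomorphism(path, relabelled_path)
```

In the protocol described above, the human plays the role of `is_isomorphism` (endorsing or rejecting a proposed answer), while the AI plays the role of `find_isomorphism`.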

More generally, we can consider the following learning protocol that I call "Learning By Teaching" (LBT). We have our agent and two additional actors (in the simplest case, humans): an "operator" and an "advisor". The agent can, at each given moment, decide between 3 modes:

• Mode I: The operator and the advisor carry on without the AI's input. In this case, the advisor communicates something (the "advice") to the operator, and the operator takes actions that influence the external environment.

• Mode II: The AI acts instead of the advisor. Thus, the AI communicates advice to the operator, and the operator takes external actions.

• Mode III: The AI acts instead of the operator, taking external actions directly.

In all modes, all actors observe the percept produced by the environment.
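
The routing of advice and external actions in the three modes can be sketched as a single step function. This is a minimal illustration, assuming each actor is a plain callable; the names `lbt_step`, `toy_advisor` and so on are hypothetical, and how the agent chooses a mode is deliberately left out.

```python
from enum import Enum

class Mode(Enum):
    I = 1    # the advisor advises, the operator acts
    II = 2   # the AI advises, the operator acts
    III = 3  # the AI acts directly

def lbt_step(mode, percept, advisor, operator, ai_advise, ai_act):
    """Route one time step; returns (advice, external_action, speaker, actor)."""
    if mode is Mode.I:
        advice = advisor(percept)
        return advice, operator(percept, advice), "advisor", "operator"
    if mode is Mode.II:
        advice = ai_advise(percept)
        return advice, operator(percept, advice), "AI", "operator"
    # Mode III: no advice is communicated; the AI takes the external action.
    return None, ai_act(percept), None, "AI"

# Toy stand-ins for the actors (purely illustrative).
def toy_advisor(percept):
    return f"hint:{percept}"

def toy_operator(percept, advice):
    return f"do({advice})"

def toy_ai_advise(percept):
    return f"ai-hint:{percept}"

def toy_ai_act(percept):
    return f"ai-do({percept})"

step_i = lbt_step(Mode.I, "x", toy_advisor, toy_operator, toy_ai_advise, toy_ai_act)
step_ii = lbt_step(Mode.II, "x", toy_advisor, toy_operator, toy_ai_advise, toy_ai_act)
step_iii = lbt_step(Mode.III, "x", toy_advisor, toy_operator, toy_ai_advise, toy_ai_act)
```

In every mode the same percept is passed to whoever is active, matching the assumption that all actors observe the environment's percept.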

Note that, the semantics of the advisor's communications are a priori unknown to the agent, and ultimately it seems reasonable this communication will use natural language, so this learning protocol may be said to incorporate both demonstration and informal communication.

Now, the AI is at risk of committing dangerous actions both in mode II and in mode III. In particular, there might be some forms of advice that will corrupt the operator and cause irreversible damage. However, like in DRL, it can mitigate this risk by learning from the advisor and the operator which actions are safe.

Finally, whatever the ultimate value learning protocol will be, it is desirable to have it grounded in a coherent theory of imperfect rationality. Conversely, I believe that a reasonable theory of imperfect rationality should admit a value learning protocol (i.e. the concept of "values" should be observable and measurable in an appropriate sense). Specifically LBT suggests 3 types of "flaws" an agent is allowed to have while maintaining particular values:

• Its modeling abilities are limited: some computable and even efficiently computable models don't appear in its prior. This flaw is entirely relative (i.e. some agents are more limited than others), since any feasible agent is limited.

• Some events might result in a plastic response of the agent ("corruption") which makes it irreversibly lose rationality and/or alignment with its initial values. For this to be consistent with well-defined values, we need to assume that "left to its own devices" the agent only becomes corrupt with a small rate. A peculiar thing about this assumption is that it depends on the environment rather than only on the agent. This seems unavoidable. The philosophical implication is, the values of an agent (and perhaps thereby also its "identity" or "consciousness") reside not only inside the agent (in the case of a human, the brain) but also in its environment. They are contextual. Indeed, if we imagine a whole brain emulation of a human transmitted to aliens in a different dimension with completely different physics, that know nothing about our universe, it seems impossible for those aliens to reconstruct the human's values. Placing the brain in arbitrary environments might lead to "reprogramming" it with a wide array of different values. Moreover, assuming that the environment is one that is plausible to "naturally" contain a human brain also doesn't solve the problem: malign superintelligences across the multiverse might exploit this assumption by creating human brains in odd environments on purpose (see Christiano's closely related discussion of why the universal prior is malign).

• The agent's policy might involve significant random noise, for example s.t. only the maximal likelihood policy is "rational" even given the two previous flaws (like the advisor in DIRL). Like in the discussion of DRL above, this requires some nuanced analysis of what counts as "random": some process might appear random given a certain level of computational resources and predictable given a higher level. Therefore, we might need to "project" the actual agent onto a suitable bounded model thereof.

In my opinion, the theory of imperfect rationality suggested by these considerations might already be close to capturing all the nuances of human irrationality.

## Taming daemons

"Daemons" are intelligent agents that are simulated, within some degree of fidelity, inside the AI's reasoning process and whose malign agendas can render the entire system unaligned even if the AI's "main" reasoning process is correctly designed to acquire human values in some sense. The aim of this part of the agenda is to formalize the problem and provide solutions in the form of theoretical guarantees.

I distinguish between two types of daemons: Cartesian and non-Cartesian. Cartesian daemons are those that are dangerous even if the Cartesian boundary between the AI and the rest of the universe is perfect. In other words, Cartesian daemons influence the world through the "legitimate" output channels given to the AI and while preserving the computational process inside the AI's hardware conformant to the software model assumed by the designers. On the other hand, non-Cartesian daemons rely on violations of the Cartesian boundary: they either exploit some physical side effects of the computation on the world or "hack" the AI from the inside in a way that breaks its software model (e.g. violate an abstraction layer in the software). Of course we might have a single (in some sense) subagent that utilizes both Cartesian and non-Cartesian attack mechanisms. In the following, we will refer to a "Cartesian daemon" in the sense of "a Cartesian attack mechanism of some daemon" and similarly for a "non-Cartesian daemon".

The taming of Cartesian daemons is guaranteed by the same regret bounds that guarantee the AI's alignment in general. For example, we can examine how it works in the case of the acausal attack identified by Christiano. In this scenario, some malign superintelligence (let's call it Mu) somewhere in the multiverse simulates an environment very similar to that experienced by the AI we constructed on Earth. This makes the AI uncertain about whether it is on Earth or in Mu's realm, under some conditions assigning a higher probability to Mu. Mu then arranges the environment to diverge from Earth at some pivotal point s.t. the AI's expectation of this divergence leads it to perform malign actions before the divergence happens. Now, if the AI is implementing DRL, the uncertainty between Earth and Mu leads it to delegate to the advisor precisely at the moment this difference is important. Seeing the advisor display Earth-like rather than Mu-like behavior, it is then able to discard the Mu hypothesis.
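
The way delegation discards the Mu hypothesis can be illustrated with a toy loop. This is a heavily simplified sketch and not the DRL algorithm itself: the two hypotheses, the prior weights, the noise level `eps`, the cutoff `threshold` and the rule "delegate whenever the live hypotheses disagree" are all assumptions made for illustration. Each hypothesis names the action it considers best at each step and predicts that the advisor would take that action with probability `1 - eps`.

```python
def run_toy_delegation(horizon=6, pivot=3, eps=0.01, threshold=0.01):
    """Toy delegation loop; returns (log of (t, who_acted, action), final posterior)."""
    best_action = {
        "Earth": lambda t: "safe",
        "Mu": lambda t: "safe" if t < pivot else "malign",
    }
    posterior = {"Earth": 0.3, "Mu": 0.7}  # Mu starts out as the more probable hypothesis
    log = []
    for t in range(horizon):
        live = [h for h, p in posterior.items() if p > threshold]
        candidates = {best_action[h](t) for h in live}
        if len(candidates) == 1:
            # All live hypotheses agree, so the agent acts on its own.
            log.append((t, "agent", candidates.pop()))
            continue
        # The hypotheses disagree: delegate and observe the advisor, who lives on Earth.
        observed = best_action["Earth"](t)
        for h in posterior:
            likelihood = 1 - eps if best_action[h](t) == observed else eps
            posterior[h] *= likelihood
        total = sum(posterior.values())
        posterior = {h: p / total for h, p in posterior.items()}
        log.append((t, "advisor", observed))
    return log, posterior

log, posterior = run_toy_delegation()
delegated_steps = [t for t, who, _ in log if who == "advisor"]
```

With these assumed numbers the agent delegates only around the pivotal step, the malign action is never taken, and the posterior ends up concentrated on Earth. A quantitative version of this argument is what the regret bound is meant to provide.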

Of course, it is important to make the discussion above (and more generally, the performance guarantee that implies this discussion as a special case) quantitative, which will require better understanding of universal priors and their corresponding regret bounds, as discussed before. In particular, it appears that for some parameters the acausal attack might still succeed if Mu manages to make the malign hypothesis overwhelmingly more likely than Earth. Avoiding this might require carefully defining the prior, for example counting the AI's own source code as only a single unit of description complexity, which might lower the description length cost of locating Earth in the universe considerably.

Non-Cartesian daemons theoretically can be completely eliminated by placing the AI in a homomorphic encryption box. This relies mainly on two assumptions:

• The complexity theoretic conjecture underlying the cryptographic system is sound (specifically, the stronger conjecture that implies safety against quantum computing attacks as well).

• The physical universe is, in some sense, in the class BQP (as our current understanding of physics seems to indicate), or at least any stronger-than-BQP layer would be inaccessible from the box (which seems likely due to the usual separation of scales in physics).

Of course one would also have to take care that the implementation of this idea is sound, for example the key should probably be generated from quantum randomness and kept at a location sufficiently remote/isolated (in some sense that requires careful physical/cryptographic analysis) from the AI's hardware (with linking communication channels for encrypted inputs/outputs).

More troubling is the large performance overhead this would impose. This overhead might make the hypothetical aligned AI uncompetitive relative to unaligned projects. It might be possible to tolerate this overhead if either there is considerable computing overhang (i.e. by the time the relevant algorithms are discovered, the hardware is sufficient to quickly grow a superintelligence even given this overhead, resulting in a singleton) or the aligned AI is a national or multinational project whose budget allows it to be orders of magnitude ahead of the competition in terms of computing power. However, since this is far from certain, designing alternative solutions is desirable.

One alternative approach is, employing extremely meticulous formal verification. That is, a physical model of the AI's software, hardware and its immediate environment will be developed and it will be formally verified that:

• It is not possible to "hack" the model from within, i.e. if the system is intact initially it will remain intact barring some extreme external influence.

• The communication bandwidth from the AI's software to the outside bypassing the legitimate output channels is very small (it is probably physically impossible to make it zero).

We will then need some analysis which shows that, given the very small "illegitimate" bandwidth, the aligned AI will "defeat" the unaligned daemons with high probability. This requires some game-theoretic model incorporating physical assumptions the shape of which is currently unknown to me. Needless to say, we will also have to make sure the implementation fits the model, which might require, for example, extreme means to isolate the hardware from contaminants. There will be a tradeoff between making the model simpler and thus easier to verify, versus making the model more complex and thus easier to physically protect and possibly allowing for more efficient hardware.

Also, at some point the AI should be mature enough to design an anti-daemon immune system for itself. However, it is not clear whether that simplifies the challenge, since we have to ensure the daemons will not grow out of control before this point.

## Recursive self-improvement

The aim of this part of the agenda is to formalize and analyze the concept of "recursive self-improvement" in learning-theoretic language.

Recursive self-improvement as a method of extremely rapid capability growth is an intriguing idea, however so far it has little rigorous support. Moreover, it is far from clear that the first AGI will be recursively self-improving, even if the concept is sound. Therefore, I do not see it as a high-priority item on the agenda. Nevertheless, it is worth some attention both because of the capability angle and because of possible applications to decision theory.

At present, I have only a few observations on how the subject might be approached:

• The dangers of self-modification can be naturally regarded as "traps", due to the irreversible nature of self-modification. Therefore, it seems appropriate to address them by the mechanisms of DRL. The way I expect it to cash out in practice is: (i) the AI will initially not self-modify directly but only by suggesting a self-modification and delegating its acceptance to the advisor (ii) a self-modification will only be approved if "annotated" by a natural language explanation of its safety and merit (iii) the explanation will be honest (non-manipulative) due to being sampled out of the space of explanations that the advisor might have produced by emself.

• One way to view self-modification is as a game, where the players are all the possible modified versions of the agent. The state of environment also includes the modification state of the agent, so at each state there is one player that is in control (a "switching controller" stochastic game). Therefore, if we manage to prove theoretical guarantees about such games (for e.g. fuzzy reinforcement learning), these guarantees have implications for self-modification. It might be possible to assume the game is perfectly cooperative, since all modifications that change the utility function can be regarded as traps.

• The initial algorithm of the AI will already be computationally feasible (e.g. polynomial time) and will probably satisfy a regret bound close to the best possible. However, polynomial time is only a qualitative property: indeed, we cannot be more precise without choosing a specific model of computation. Therefore, it might be that the capability gains from self-improvement should be regarded as tailoring the algorithm to the particular model of computation (i.e. hardware). In other words, it involves the algorithm learning the hardware on which it is implemented (we could provide it with a formal specification of this hardware but this doesn't gain much as the agent would not know, initially, how to effectively utilize such a specification). Now, every improvement gained speeds up further improvements, but also there is some absolute upper bound: the best possible implementation. It is therefore interesting to understand whether there are asymptotic regimes for which the agent undergoes exponentially fast improvement during some local period of time, and if so, how realistic are these regimes.
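
The last observation above (improvements speed up further improvements, but the best possible implementation is an upper bound) can be illustrated with a toy growth model. The logistic update rule and all parameter values below are purely illustrative assumptions, not something derived from the agenda; the sketch only shows what "exponential improvement during a local period" followed by saturation could look like.

```python
def simulate_self_improvement(initial=1.0, best_possible=1000.0, rate=0.5, steps=40):
    """Discrete logistic growth of implementation speed (a hypothetical model).

    Each step the gain is proportional to the current speed (faster
    implementations improve themselves faster) and to the remaining
    headroom below the best possible implementation.
    """
    speeds = [initial]
    for _ in range(steps):
        s = speeds[-1]
        speeds.append(s + rate * s * (1 - s / best_possible))
    return speeds

speeds = simulate_self_improvement()
early_growth_factor = speeds[1] / speeds[0]    # close to 1 + rate while far from the bound
late_growth_factor = speeds[-1] / speeds[-2]   # close to 1 once the bound is nearly reached
```

Early on the speed grows by a nearly constant factor per step (the locally exponential regime), while near the bound the growth factor approaches 1. Whether any realistic regime resembles this model is exactly the open question.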

# Summary

In this section, I recap and elaborate the main features of the agenda as I initially stated them.

• Viewing AI alignment theory as part of a general abstract theory of intelligence: The "general abstract theory of intelligence" is implemented in this agenda as the theory of universal reinforcement learning. All the other parts of the agenda (value learning, daemons, self-improvement) are grounded in this theory.

• Using desiderata and axiomatic definitions as starting points, rather than specific algorithms and constructions: The main goal of this agenda is establishing which theoretical guarantees (in particular, in the form of regret bounds, but other types may also appear) can and should be satisfied by intelligent agents in general and aligned intelligent agents (i.e. value learning protocols) in particular. Any specific algorithm is mostly just a tool for proving that a certain guarantee can be satisfied, and its sole motivation is this guarantee. Over time these algorithms might evolve into something close to a practical design, but I also have no compunctions about discarding them.

• Formulating alignment problems in the language of learning theory: The agenda is "conservative" in the sense that its tools are mostly the same as used by mainstream AI researchers, but of course the priorities and objectives are quite different.

• Evaluating solutions by their formal mathematical properties, ultimately aiming at a quantitative theory of risk assessment: So far the properties I derived were rather coarse and qualitative, and the models were also coarse and grossly over-simplified. However, as stated in the "philosophy" section, I see it as a reasonable starting point for further growth. Ultimately these mathematical properties should become sufficiently refined to translate to real-world implications (such as, what is the probability the AI will be misaligned, or what is the time from launch to pivotal event; of course, realistically, these will always be estimates with considerable error margins).

• Relying on the mathematical intuition derived from learning theory to pave the way to solving philosophical questions: In particular, the way I approach questions such as "what is imperfect rationality?", "how to use inductive reasoning without assuming Cartesian duality?" and "how to deal with decision-theoretic puzzles?" is guided by what seems natural within the framework of learning theory, and variants of reinforcement learning in particular. I see it as tackling the problems head-on, as opposed to approaches which use causal networks or formal logic, which IMO involve more assumptions that don't follow from the formulation of the problem.

This agenda is not intended as a territorial claim on my part. On the contrary, I encourage other researchers to work on parts of it or even adopt it entirely, whether in collaboration with me or independently. Conversely, I am also very interested to hear criticism.