The Learning-Theoretic AI Alignment Research Agenda

2Alex Mennen

1Vanessa Kosoy

0Paul Christiano

1Vanessa Kosoy

1Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

1Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Vanessa Kosoy

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

4Jessica Taylor

5Vanessa Kosoy

4Jessica Taylor

0Jessica Taylor

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

0Jessica Taylor

0Vanessa Kosoy

New Comment

A related question is, whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded to use their intelligence to solve problems in very different environments.

I don't understand this claim. It seems to me that human brains appear to be "an enormous kludge of heuristics designed by trial and error". Shouldn't the success of humans be evidence for the latter?

The fact that the human brain was designed by trial and error is a given. However, we don't really know how the brain works. It is possible that the brain contains a simple mathematical core, possibly implemented inefficiently and with bugs and surrounded by tonnes of legacy code, but nevertheless responsible for the broad applicability of human intelligence.

Consider the following two views (which might also admit some intermediates):

View A: There exists a simple mathematical algorithm M that corresponds to what we call "intelligence" and that allows solving any problem in some very broad natural domain .

View B: What we call intelligence is a collection of a large number of unrelated algorithms tailored to individual problems, and there is no "meta-algorithm" that produces them aside from relatively unsophisticated trial and error.

If View B is correct, then we expect that doing trial and error on a collection of problems will produce an algorithm that solves problems in and almost *only* in . The probability that you were optimizing for but solved a much larger domain is vanishingly small: it is about the same as the probability of a completely random algorithm to solve all problems in .

If View A is correct, then we expect that doing trial and error on has a non-negligible chance of producing M (since M is simple and therefore sampled with a relatively large probability), which would be able to solve all of .

So, the fact that homo sapiens evolved in a some prehistoric environment but was able to e.g. land on the moon should be surprising to everyone with View B but not surprising to those with View A.

I think the most plausible view is: what we call intelligence is a collection of a large number of algorithms and innovations each of which slightly increases effectiveness in a reasonably broad range of tasks.

To see why both view A and B seem strange to me, consider the analog for physical tasks. You could say that there is a simple core to human physical manipulation which allows us to solve any problem in some very broad natural domain. Or you could think that we just have a ton of tricks for particular manipulation tasks. But neither of those seems right, there is no simple core to the human body plan but at the same time it contains many features which are helpful across a broad range of tasks.

Regarding the physical manipulation analogy: I think that there actually *is* a simple core to the human body plan. This core is, more or less: a spine, two arms with joints in the middle, two legs with joints in the middle, feet and arms with fingers. This is probably already enough to qualitatively solve more or less all physical manipulation problems humans can solve. All the nuances are needed to make it quantitatively more efficient and deal with the detailed properties of biological tissues, biological muscles et cetera (the latter might be considered analogous to the detailed properties of computational hardware and input/output channels for brains/AGIs).

I think that your view is plausible enough, however, if we focus only on *qualitative* performance metrics (e.g. time complexity up to a polynomial, regret bound up to logarithmic factors), then this collection probably includes only a small number of innovations that are important.

Delegative Reinforcement Learning aims to solve both problems by occasionally passing control to the human operator (“advisor”), and using it to learn which actions are safe.

Why would you assume the existence of an advisor who can avoid taking catastrophic actions and sometimes take an optimal action? This would require some process capable of good judgment to understand many aspects of the AI's decision-making process, such as its world models (as these models are relevant to which actions are catastrophic/optimal). Are you proposing a high degree of transparency, a bootstrapping process as in ALBA, or something else?

I think that what you're saying here can be reformulated as follows (please correct me if I end up not answering your question):

The action that a RL agent takes depends both on the new observation and its *internal state*. Often we ignore the latter and pretend the action depends only on the history of observations and actions, and this is okay because we can always produce the probability distribution over internal states conditional on the given history. However, this is only ok for information-theoretic analysis, since sampling this probability distribution given only the history as input is computationally intractable.

So, it might be a reasonable assumption that the advisor takes "sane" actions when left to its own devices, but it is *not* reasonable to assume the same when it works together with the AI. This is because, even if the AI behaved *exactly* as the advisor, it would hide the simulated advisor's internal state, which would preclude the advisor from taking the wheel and proceeding with the same policy.

I think this is a real problem, but we can overcome it by letting the advisor write some kind of "diary" that documents eir reasoning process, as much as possible. The diary is also considered a part of the environment (although we might want to bake into the prior the rules of operating the diary and a "cheap talk" assumption which says the diary has no side effects on the world). This way, the internal state is externalized, and the AI will effectively become transparent by maintaining the diary too (essentially the AI in this setup is emulating a "best case" version of the advisor). It would be great if we could make this idea into a formal analysis.

That captures part of it but I also don't think the advisor takes sane actions when the AI is doing things to the environment that change the environment. E.g. the AI is implementing some plan to create a nuclear reactor, and the advisor doesn't understand how nuclear reactors work.

I guess you could have the AI first write the nuclear reactor plan in the diary, but this is essentially the same thing is transparency.

Well, you *could* say it is the same thing as transparency. What is interesting about it is that, in principle, you don't have to put in transparency by hand using some completely different techniques. Instead, transparency arises naturally from the DRL paradigm and some relatively mild assumptions (that there is a "diary"). The idea is that, the advisor would not build a nuclear reaction without seeing an explanation of nuclear reactors, so the AI also won't do it too.

One hypothesis is, the main way humanity avoids traps is by happening to exist in a relatively favorable environment and knowing this fact, on some level. Specifically, it seems rather difficult for a single human or a small group to pursue a policy that will lead all of humanity into a trap (incidentally, this hypothesis doesn’t reflect optimistically on our chances to survive AI risk), and also rather rare for many humans to coordinate on simultaneously exploring an unusual policy. Therefore, human history may be very roughly likened to episodic RL where each human life is an episode.

It's pretty clear that humans avoid traps using *thinking*, not just learning. See: CFCs, mutually assured destruction. Yes, principles of thinking can be learned, but then they generalize better than learning theory can prove.

See also: Not just learning

When I say "learning" I only mean that the true environment is initially unknown. I'm not assuming anything about the internals of the algorithm. So, the question is, what desiderata can we formulate that are possible to satisfy by *any algorithm at all*. The collection of all environments is not learnable (because of traps), so we cannot demand the algorithm to be asymptotically optimal on every environment. Therefore, it seems like we need to assume *something* about the environment, if we want a definition of intelligence that accounts for the *effectiveness* of intelligence. Formulating such an assumption, making it rigorous, and backing it by rigorous analysis is the subproblem I'm presenting here. The particular sort of assumption I'm pointing at here might be oversimplified, but the question remains.

I agree that we'll want some reasonable assumption on the environment (e.g. symmetry of physical laws throughout spacetime) that will enable thinking to generalize well. I don't think that assumption looks like "it's hard to cause a lot of destruction" or "the environment is favorable to you in general". And I'm pretty sure that individual human lives are not the most important level of analysis for thinking about the learning required to avoid civilization-level traps (e.g. with CFCs, handling the situation required scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime)

Consider also, evolution. Evolution can also be regarded as a sort of reinforcement learning algorithm. So why, during billions years of evolution, no gene sequence was created that somehow destroyed all life on Earth? It seems hard to come up with an answer other than “it’s hard to cause a lot of destruction”.

Some speculation:

I think that we have a sequence of reinforcement algorithms: evolution -> humanity -> individual human / small group (maybe followed by -> AGI) s.t. each step inherits the knowledge generated by the previous step and also applies more optimization pressure than the previous step. This suggests formulating a "favorability" assumption of the following form: there is a (possibly infinite) sequence of reinforcement learning algorithms A0, A1, A2... s.t. each algorithm is more powerful than the previous (e.g. has more computing power), and our environment has to be s.t.

(1) Running policy A0 has a small rate (at most ) of falling into traps.
(2) If we run A0 for some time (s.t. ), and then run A1 *after* updating on the observations during , then A1 has a small rate (at most ) of falling into traps.
(3) Ditto when we add A2

...And so forth.

The sequence {Ai} may be thought of as a sequence of agents or as just steps in the exploration of the environment by a single agent. So, our condition is that, each new "layer of reality" may be explored safely given that the previous layers were already studied.

Most species have gone extinct in the past. I would not be satisfied with an outcome where all humans die or 99% of humans die, even though technically humans might rebuild if there are any left and other intelligent life can evolve if humanity is extinct. These extinction levels can happen with foreseeable tech. Additionally, avoiding nuclear war requires continual cognitive effort to be put into the problem; it would be insufficient to use trial-and-error to avoid nuclear war.

I don't see why you would want a long sequence of reinforcement learning algorithms. At some point the algorithms produce things that can think, and then they should use their thinking to steer the future rather than trial-and-error alone. I don't think RL algorithms would get the right answer on CFCs or nuclear war prevention.

I am pretty sure that we can't fully explore our current level, e.g. that would include starting nuclear wars to test theories about nuclear deterrence and nuclear winter.

I really think that you are taking the RL analogy too far here; decision-making systems involving humans have some things in common with RL but RL theory only describes a fragment of the reasoning that these systems do.

I don't think you're interpreting what I'm saying correctly.

First, when I say "reinforcement learning" I don't necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state "knows" that something is a bad/good idea then the algorithm doesn't need to actually try it.

Third, "starting nuclear wars to test theories" is the opposite of I'm trying to describe. What I'm saying is, we already have enough knowledge (acquired by exploring previous levels) to know that nuclear war is a bad idea, so exploring this level will *not* involve starting nuclear wars. What I'm trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap *before* you arrive at the trap.

First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.

That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though.

Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.

I agree that if it knows *for sure* that it isn't in some environment then it doesn't need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense.

What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.

This seems like an interesting research approach and I don't object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don't think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.

After thinking some more, maybe the following is natural way towards formalizing the optimism condition.

Let be the space of hypotheses and be the "unbiased" universal prior. Given any , we denote , i.e. the environment resulting from mixing the environments in the belief state . Given an environment , let be the Bayes-optimal policy for and the *perturbed* Bayes-optimal policy for , where is a perturbation parameter. Here, "perturbed" probably means something like softmax expected utility, but more thought is needed. Then, the "optimistic" prior is defined as a solution to the following fixed point equation:

Here, is a normalization constant and is an additional parameter.

This equation defines something like a softmax Nash equilibrium in a cooperative game of two players where, one player chooses (so that is eir mixed strategy), another player chooses and the utility is minus regret (alternatively, we might want to choose only Pareto efficient Nash equilibria). The parameter controls optimism regarding the ability to learn the environment, whereas the parameter represents optimism regarding the presence of *slack*: ability to learn despite making some errors or random exploration (how to choose these parameters is another question).

Possibly, the idea of exploring the environment "layer by layer" can be recovered from combining this with hierarchy assumptions.

This seems like a hack. The equilibrium policy is going to assume that the environment is good to it *in general* in a magical fashion, rather than assuming the environment is good to it in the specific ways we should expect given our own knowledge of how the environment works. It's kind of like assuming "things magically end up lower than you expected on priors" instead of having a theory of gravity.

I think there is something like a theory of gravity here. The things I would note about our universe that make it possible to avoid a lot of traps include:

- Physical laws are symmetric across spacetime.
- Physical laws are spacially local.
- The predictable effects of a local action are typically local; most effects "dissipate" after a while (e.g. into heat). The butterfly effect is evidence for this rather than against this, since it means many effects are unpredictable and so can be modeled thermodynamically.
- When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works.
- Some "partially-dissipated" effects are statistical in nature. For example, an earthquake hitting an area has many immediate effects, but over the long term the important effects are things like "this much local productive activity was disrupted", "this much local human health was lost", etc.
- You have the genes that you do because evolution, which is similar to a reinforcement learning algorithm, believed that these genes would cause you to survive and reproduce. If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent's creator should affect the agent's beliefs.
- If there are many copies of an agent, and successful agents are able to repurpose the resources of unsuccessful ones, then different copies can try different strategies; some will fail but the successful ones can then repurpose their resources. (Evolution can be seen as a special case of this)
- Some phenemona have a "fractal" nature, where a small thing behaves similar to a big thing. For example, there are a lot of similarities between the dynamics of a nation and the dynamics of a city. Thus small things can be used as models of big things.
- If your interests are aligned with those of agents in your local vicinity, then they will mostly try to help you. (This applies to parents making their children's environment safe)

I don't have an elegant theory yet but these observations seem like a reasonable starting point for forming one.

I think that we should expect evolution to give us a prior that is a good *lossy compression* of actual physics (where "actual physics" means, those patterns the universe has that can be described within our computational complexity bounds). Meaning that, on the one hand it should be low description complexity (otherwise it will be hard for evolution to find it), and on the other hand it should be assign high probability to the true environment (in other words, the KL divergence of the true environment from the prior should be small). And *also* it should be approximately learnable, otherwise it won't go from assigning high probability to actually performing well.

The principles you outlined seem reasonable overall.

Note that the locality/dissipation/multiagent assumptions amount to a special case of "the environment is effectively reversible (from the perspective of the human species as a whole) as long as you don't apply too much optimization power" ("optimization power" probably translates to divergence from some baseline policy plus maybe computational complexity considerations). Now, as you noted before, actual macroscopic physics is not reversible, but it might still be effectively reversible if you have a reliable long-term source of negentropy (like the sun). Maybe we can also slightly relax them by allowing irreversible changes as long as they are localized and the available space is sufficiently big.

"If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs" is essentially what DRL does: allows transferring our knowledge to the AI without hard-coding it by hand.

"When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works" seems like it would allow us to go beyond effective reversibility, but I'm not sure how to formalize it or whether it's a justified assumption. One way towards formalizing it is, the prior is s.t. studying the initial state approximate communication class allows determining the entire environment, but this seems to point at a very broad class of approximately learnable priors w/o specifying a criterion how to choose among them.

Another principle that we can try to use is, the ubiquity of analytic functions. Analytic functions have the property that, knowing the function in a bounded domain allows extrapolating it everywhere. This is different from allowing arbitrary computable functions which may have "if" clauses, so that studying the function in a bounded domain is never enough to be sure about its behavior outside it. In particular, this line of inquiry seems relatively easy to formalize using continuous MDPs (although we run into the problem that finding the optimal policy is infeasible, in general). Also, it might have something to do with the effectiveness of neural networks (although the popular ReLU response function is not analytic).

Actually, I *am* including Bayesianism in "reinforcement learning" in the broad sense, although I am also advocating for some form of asymptotic optimality (importantly, it is not asymptotic in *time* like often done in the literature, but asymptotic in the *time discount parameter*; otherwise you give up on most of the utility, like you pointed out in an earlier discussion we had).

In the scenario you describe, the agent will presumably discard (or, strongly penalize the probability of) the pro-nuclear-war hypothesis first since the initial policy loses value much faster on this hypothesis compared to the anti-nuclear-war hypothesis (since the initial policy is biased towards the more likely anti-nuclear-war hypothesis). It will then remain with the anti-nuclear-war hypothesis and follow the corresponding policy (of *not* starting nuclear war). Perhaps this can be formalized as searching for a fixed point of some transformation.

Consider a panel with two buttons, A and B. One button sends you to Heaven and one to Hell, but you don't know which is which and there is no way to check without pressing one. To make it more fun, you have to choose a button within one minute or you go to Hell automatically.

So, there are are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?

I think that "scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime" is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity's progress is episodic RL where each human life is an episode, then *of course* each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

Also, I think that success with CFC is not a lot of evidence against the hypothesis since, for one thing, CFC doesn't allow a small group to easily destroy all of humanity, and for another thing, AFAIK action against CFC was only taken when some damage was already apparent. This is different from risks that have to be handled correctly on the first try.

That said, "doesn’t reflect optimistically on our chances to survive AI risk" wasn't intended as a strong claim but as something very speculative. Possibly I should have made it clearer.

More generally, the idea of restricting to environments s.t. *some* base policy doesn't fall into traps on them is not very restrictive. Indeed, for any learnable class H you can just take the base policy to be the learning algorithm itself and tautologically get a class at least as big as H. It becomes more interesting if we impose some constraints on the base policy, such as maybe restricting its computational complexity.

Intuitively, it seems alluring to say that our environment may contain X-risks, but they are s.t. by the time we face them we have enough knowledge to avoid them. However, this leads to assumptions that depend on the prior as a whole rather than on particular environments (basically, it's not clear whether this is saying anything besides just assuming the prior is learnable). This complicates things, and in particular it becomes less clear what does it mean for such a prior to be "universal". Moreover, the notion of a "trap" is not even a function of the prior regarded a single mixed environment, but a function of the particular partition of the prior into constituent hypotheses. In other words, it depends on which uncertainty is considered subjective (a property of the agent's state of knowledge) and which uncertainty is considered objective (an inherent unpredictability of the world). For example, if we go to the initial example but assume that there is a fair coin *inside the environment* that decides which button is Heaven, then instead of two environments we get one and tautologically there is no trap.

In short, I think there is a lot more thinking to do about this question.

So, there are are two environments: in environment X, button A corresponds to Heaven and in environment Y, button B corresponds to Heaven. Obviously both cannot be in a learnable class simultaneously. So, at least one of them has to be ruled out (and if we also want to preserve symmetry then both). What sort of assumption do you think will rule them out?

I don't think we should rule either of these out. The obvious answer is to give up on asymptotic optimality and do something more like utility function optimization instead. That would be moving out of the learning theory setting, which is a good thing.

Asymptotic optimality can apply to bounded optimization problems and can't apply to civilization-level steering problems.

Well, we *could* give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality. (This would not be moving out of the learning theory setting though? At least not the way I use this terminology.) Regret bounds would still be useful in the context of guaranteeing transfer of human knowledge and values to the AGI, but not in the context of defining intelligence.

However, my intuition is that it would be the wrong way to go.

For one thing, it seems that it is computationally feasible (at least in some weak sense, i.e. for a small number of hypotheses s.t. the optimal policy for each is feasible) to get asymptotic Bayes-optimality for certain learnable classes (PSRL is a simple example) but not in general. I don't have a proof (and I would be very interested to see either a proof or a refutation), but it seems to be the case AFAIK.

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why *general* intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is "universal" or "nearly universal" in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good *on average* w.r.t. some very broad ensemble of environments).

Well, we could give up on regret bounds and instead just consider algorithms that asymptotically approach Bayes-optimality.

I am not proposing this. I am proposing doing something more like AIXI, which has a fixed prior and does not obtain optimality properties on a broad class of environments. It seems like directly specifying the right prior is hard, and it's plausible that learning theory research would help give intuitions/models about which prior to use or what non-Bayesian algorithm would get good performance in the world we actually live in, but I don't expect learning theory to directly produce an algorithm we would be happy with running to make big decisions in our universe.

Yes, I think that we're talking about the same thing. When I say "asymptotically approach Bayes-optimality" I mean the equation from Proposition A.0 here. I refer to this instead of just Bayes-optimality, because exact Bayes-optimality is computationally intractable even for a small number of hypothesis each of which is a small MDP. However, even asymptotic Bayes-optimality is usually only tractable for some learnable classes, AFAIK: for example if you have environments without traps then PSRL is asymptotically Bayes-optimal.

For another thing, consider questions such as, why intelligent agents outcompete instinct-based agents, and why general intelligence (i.e. Bayes optimality or at least some notion of good performance w.r.t. a prior that is “universal” or “nearly universal” in some sense) can be developed by evolution in a rather restricted environment. These questions seem much easier to answer if intelligence has some frequentist property (i.e. it is in some sense effective in all or most environments) compared to, if intelligence has only purely Bayesian properties (i.e. it is only good on average w.r.t. some very broad ensemble of environments).

I don't understand why you think this. Suppose there is some simple "naturalized AIXI"-ish thing that is parameterized on a prior, and there exists a simple prior for which an animal running this algorithm with this prior does pretty well in our world. Then evolution may produce an animal running something like naturalized AIXI with this prior. But naturalized AIXI is only good on average rather than guaranteeing effectiveness in almost all environments.

My intuition is that it must not be just a coincidence that the agent happens to works well in our world, otherwise your formalism doesn't capture the concept of intelligence in full. For example, we are worried that a UFAI would be very likely to kill us in this particular universe, not just in some counterfactual universes. Moreover, Bayesian agents with simple priors often do very poorly in particular worlds, because of what I call "Bayesian paranoia". That is, if your agent thinks that lifting its left arm will plausibly send it to hell (a rather simple hypothesis), it will never lift its left arm and learn otherwise.

In fact, I suspect that a certain degree of "optimism" is inherent in our intuitive notion of rationality, and it also has a good track record. For example, when scientists did early experiments with electricity, or magnetism, or chemical reactions, their understanding of physics at the time was arguably insufficient to know this will not destroy the world. However, there were few other ways to go forward. AFAIK the first time anyone seriously worried about a physics experiment was the RHIC (unless you also count the Manhattan project, when Edward Teller suggested the atom bomb might create a self-sustaining nuclear fusion reaction that will envelope the entire atmosphere). These latter concerns were only raised because we already knew enough to point at specific dangers. Of course this doesn't mean we shouldn't be worried about X-risks! But I think that some form of a priori optimism is likely to be correct, in some philosophical sense. (There was also some thinking in that direction by Sunehag and Hutter although I'm not sold on the particular formalism they consider).

I think I understand your point better now. It isn't a coincidence that an agent produced by evolution has a good prior for our world (because evolution tries many priors, and there are lots of simple priors to try). But the fact that there exists a simple prior that does well in our universe is a fact that needs an explanation. It can't be proven from Bayesianism; the closest thing to a proof of this form is that computationally unbounded agents can just be born with knowledge of physics if physics is sufficiently simple, but there is no similar argument for computationally bounded agents.

My intuition is that it must not be just a coincidence that the agent happens to works well in our world, otherwise your formalism doesn’t capture the concept of intelligence in full.

It's not a coincidence because evolution selects the prior, and evolution tries lots of priors. (There are lots of simple priors)

I think that “scientific and policy knowledge that no one knows at birth and no one could discover by themself over a lifetime” is absolutely compatible with the hypothesis I outlined, even in its most naive form. If humanity’s progress is episodic RL where each human life is an episode, then of course each human uses the knowledge accumulated by previous humans. This is the whole idea of a learning algorithm in this setting.

If RL is using human lives as episodes then humans should already be born with the relevant knowledge. There would be no need for history since all learning is encoded in the policy. History isn't RL; it's data summarization, model building, and intertemporal communication.

This seems to be interpreting the analogy too literally. Humans are not born with the knowledge, but they acquire the knowledge through some protocol that is designed to be much easier than rediscovering it. Moreover, by "reinforcement learning" I don't mean the same type of algorithms used for RL today, I only mean that the performance guarantee this process satisfies is of a certain form.

More generally, the idea of restricting to environments s.t. some base policy doesn’t fall into traps on them is not very restrictive.

This rules out environments in which the second law of thermodynamics holds.

No, it doesn't rule out any particular environment. A class that consists only of one environment is tautologically learnable, by the optimal policy for this environment. You might be thinking of learnability by *anytime* algorithms whereas I'm thinking of learnability by non-anytime algorithms (what I called "metapolicies"), the way I defined it here (see Definition 1).

Ok, I am confused by what you mean by "trap". I thought "trap" meant a set of states you can't get out of. And if the second law of thermodynamics is true, you can't get from a high-entropy state to a low-entropy state. What do you mean by "trap"?

To first approximation, a "trap" is a an action s.t. taking it loses long-term value in expectation, i.e an action which is outside the set that I defined here (see the end of Definition 1). This set is always non-empty, since it at least has to contain the optimal action. However, this definition is not very useful when, for example, your environment contains a state that you cannot escape and you also cannot avoid (for example, the heat death of the universe might be such a state), since, in this case, *nothing* is a trap. To be more precise we need to go from an analysis which is asymptotic in the time discount parameter to an analysis with a fixed, finite time discount parameter (similarly to how with time complexity, we usually start from analyzing the asymptotic complexity of an algorithm, but ultimately we are interested in particular inputs of finite size). For a fixed time time discount parameter, the concept of a trap becomes "fuzzy": a trap is an action which loses a *substantial fraction* of the value.

Now, if the AI is implementing DRL, the uncertainty between Earth and Mu leads it to delegate to the advisor precisely at the moment this difference is important.

It seems like this is giving up on allowing the AI to make long-term predictions. It can make short-term, testable predictions (since if different advisors disagree, it is possible to see who is right). But long-term predictions can't be cheaply tested.

In the absence of long-term predictions, it still might be possible to do something along the lines of what Paul is thinking of (i.e. predicting human judgments of longer-term things), but I don't see what else you could do. Does this match your model?

I'm not giving up on long-term predictions in *general*. It's just that, because of traps, *some* uncertainties cannot be resolved by testing, as you say. In those cases the AI has to rely on what it learned from the advisor, which indeed amounts to human judgment.

In this essay I will try to explain the overall structure and motivation of my AI alignment research agenda. The discussion is informal and no new theorems are proved here. The main features of my research agenda, as I explain them here, are

Viewing AI alignment theory as part of a general abstract theory of intelligence

Using desiderata and axiomatic definitions as starting points, rather than specific algorithms and constructions

Formulating alignment problems in the language of learning theory

Evaluating solutions by their formal mathematical properties, ultimately aiming at a quantitative theory of risk assessment

Relying on the mathematical intuition derived from learning theory to pave the way to solving philosophical questions

## Philosophy

In this section I explain the key principles and assumptions that motivate my research agenda.

## The importance of rigor

I believe that the solution to AI alignment must rely on a rigorous mathematical theory. The algorithms that comprise the solution must be justified by formal mathematical properties. All mathematical assumptions should be either proved or at least backed by considerable evidence, like the prominent conjectures of computational complexity theory. This needs to be the case because:

We might be facing one-shot success or failure. This means we will have little empirical backing for our assumptions.

To the extent we have or will have empirical evidence about AI, without a rigorous underlying theory it is very hard to know how scalable and transferable the conclusions are.

The enormity of the stakes demands designing a solution which is as reliable as possible, limited only by the time constraints imposed by competing unaligned projects.

That said, I do expect the ultimate solution to have aspects that are not entirely rigorous, specifically:

The quantitative risk analysis will probably rely on some parameters that will be very hard to determine from first principles, because of the involvement of humans and our physical universe in the equation. These parameters might be estimated through (i) study of the evolution of intelligence (ii) study of human brains, (iii) experiments with weak AI and its interaction with humans (iv) our understanding of physics. Nevertheless, we should demand the solution to be highly reliable even given cautious error margins on these parameters.

The ultimate solution will probably involve some heuristics. However, it should only involve heuristics that are designed to improve AI capabilities

without invalidating any of the assumptions underlying the risk analysis.Thus, in the worst-case scenario these heuristics will fail and the AI will not take off but will not become unaligned.In additionto the theoretical analysis, we do want to include as much empirical testing as possible, to provide an additional layer of defense. At the least, it can be a last ditch protection in the (hopefully very unlikely) scenario that some error got through the analysis.## Metaphilosophy and the role of models

In order to use mathematics to solve a real-world problem, a mathematical model of the problem must be constructed. When the real-world problem can be defined in terms of data that is observable and measurable, the validity of the mathematical model can be ascertained using the empirical method. However, AI alignment touches on problems that are

philosophicalin nature, meaning that there is still no agreed-upon empirical or other criterion for evaluating an answer. Dealing with such problems requires ametaphilosophy: a way of evaluating answers to philosophical questions.Although I do not claim a fully general solution to metaphilosophy, I think that, pragmatically, a quasiscientific approach is possible. In science, we prefer theories that are (i) simple (Occam's razor) and (ii) fit the empirical data. We also test theories by gathering further empirical data. In philosophy, we can likewise prefer theories that are (i) simple and (ii) fit intuition in situations where intuition feels reliable (i.e. situations that are simple, familiar or received considerable analysis and reflection). We can also test theories by applying them to new situations and trying to see whether the answer becomes intuitive after sufficient reflection.

Moreover, I expect progress on most problems to be achieved by the means of successive approximations. This means that we start with a model that is grossly oversimplified but that already captures some key aspects of the problems. Once we have a solution within this model, we can start to attack its assumptions and arrive at a new, more sophistical model. This process should repeat until we arrive at a model that (i) has no

obviousshortcomings and that (ii) we seem unable to improve despite our best efforts.Like in science, we can never be certain that a theory is true.

Anyassumption or model can be questioned. This requires striking a balance between complacency and excessive skepticism. To avoid complacency, we need to keep working to find better theories. To avoid excessive skepticism, we should entertain hypotheses honestly and acknowledge when a theory is already capable of passing non-trivial quasiscientific tests. Reaching agreement is harder work (because our tests rely on intuition which may vary from individual to individual), but we should not despair of that goal.## Intelligence is understandable

It is possible to question whether a mathematical theory of intelligence is possible at all. After all, we don't expect to have a tractable mathematical theory of Rococo architecture, or a simple equation describing the shape of the coastline of Africa in the year 2018.

The key difference is that intelligence is a

naturalconcept. Intelligence, the way I use this word in the context of AI alignment, is the ability of an agent to make choices in a way that effectively promote its goals, in an environment that is not entirely known or even not entirely knowable. Arguing over the meaning of the word would be a distraction: this is the meaning relevant to AI alignment, because the entire concern of AI alignment is about agents that effectively pursue their goals, undermining the conflicting goals of the human species. Moreover, intelligence is (empirically) a key force in determining the evolution of the physical universe.I conjecture that natural concepts have useful mathematical theories, and this conjecture seems to me supported by evidence in natural and computer science. It would be nice to have this conjecture itself follow from a mathematical theory, but this is outside of my current scope. Also, we already

havesome progress towards a mathematical theory of intelligence (I will discuss it in the next section).A related question is, whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded to use their intelligence to solve problems in very different environments. That said, I am less confident about this than about the previous question. In any case, having a mathematical theory of intelligence should allow us to resolve this question too, whether positively or negatively.

## Value alignment is understandable

The core of AI alignment is reliably transferring human values to a strong AI. However, the problem of defining what we mean by "human values" is a philosophical problem. A common and natural model of "values" is expected utility maximization: this is what we find in game theory and economics, and this is supported by VNM and Savage theorems. However, as often pointed out, humans are not perfectly rational, therefore it's not clear in what sense they can be said to maximize the expectation of a specific utility function.

Nevertheless, I believe that "values" is

alsoa natural concept. Denying the concept of "values" altogether is paramount to nihilism, and in such a belief system there is no reason to do anything at all, including saving yourself and everyone else from a murderous AI. Admitting thegeneralconcept of "values" as something complex and human specific (despite the focus on "values" rather than "human values") seems implausible, since intuitively we can easily imagine alien minds facing a similar AI alignment problem. Moreover, the concept of "values" is part and parcel of the concept of "intelligence", so if we believe that "intelligence" (due to its importance in shaping the physical world) is a natural concept, then so are "values".Therefore, I conjecture that there is a simple mathematical theory of imperfect rationality, within which the concept of "human values" is well-defined modulo the (observable, measurable) concept of "humans". Some speculation on what this theory looks like appears in the following sections.

Now, that doesn't mean that "human values" are

perfectlywell-defined, anymore than, for example, the center of mass of the sun is perfectly well-defined (which would require deciding exactly which particles are considered part of the sun). However, like the center of mass of the sun is sufficiently well-defined for many practical purposes in astrophysics, the concept of "human values" should be sufficiently well-defined for designing an aligned AGI. To the extent alignment remains ambiguous, the resolution of these ambiguities doesn't have substantial moral significance.## Foundations

In this section I briefly explain the mathematical tools with which I set out to study AI alignment, and the outline of the mathematical theory of intelligence that these tools already painted.

## Statistical Learning Theory

Statistical learning theory studies the information-theoretic constraints on various types of learning tasks, answering questions such as, when is a learning task solvable at all, and how much training data is required to solve the learning task within given accuracy (sample complexity). Learning tasks can be broadly divided into:

Classifications tasks: The input is sampled from a fixed probability distribution, and the objective is assigning the correct label. The deployment phase (during which the performance of the algorithm is evaluated) is distinct from the training phase (during which the correct labels are revealed).

Online learning / multi-armed bandits: There is no distinction between deployment and training. Instead, the algorithm's performance on each round is evaluated, but also the algorithm might receive some feedback on its performance. The behavior of the environment might change over time, possibly even respond to the algorithm's output. However, we only evaluate each output conditioned on the past history (we don't require the algorithm to plan ahead).

Reinforcement learning: There is two-sided interaction between the algorithm and the environment, and the algorithm's performance is the aggregate of some reward function over time. The algorithm

isrequired to plan ahead in order to achieve optimal performance. This might or might not assume "resets" (when the environment periodically returns to the initial state) or partition of time into "episodes"\ (when the performance of the algorithm is only evaluated conditioned on the previous episodes, so that it doesn't have to plan ahead more than one episode into the future).It is the last type of learning tasks, in particular assuming no resets or episodes, that is the most relevant for studying intelligence in the relevant sense. Indeed, the abstract setting of reinforcement learning is a good formalization for the informal definition of intelligence we had before. Note that the name "reward" might be misleading: this is not necessarily a signal received from outside, but can just as easily be some formally specified mathematical function.

In online learning and reinforcement learning, the theory typically aims to derive upper and lower bounds on "regret": the difference between the expected utility received by the algorithm and the expected utility it

wouldreceive if the environment was known a priori. Such an upper bound is effectively aperformance guaranteefor the given algorithm. In particular, if the reward function is assumed to be "aligned" then this performance guarantee is, to some extent, an alignment guarantee. This observation is not vacuous, since the learning protocol might be such that the true reward function is not directly available to the algorithm, as exemplified by DIRL and DRL. Thus, formally proving alignment guarantees takes the form of proving appropriate regret bounds.## Computational Learning Theory

In addition to information-theoretic considerations, we have to take into account considerations of computational complexity. Thus, after deriving information-theoretic regret bounds, we should continue to refine them by constraining our algorithms to be computationally feasible (which typically means running on polynomial time, but we may also need to consider stronger restrictions, such as restrictions on space complexity or parallelizability). If we consider

Bayesianregret (i.e. the expected value of regret w.r.t. some prior on the environments), this effectively means we are dealing withaverage-casecomplexity. Note that, imposing computational constraints on the agent implies bounded reasoning / non-omniscience and already constitutes departure from "perfect rationality" in a certain sense.More precisely, it is useful to differentiate between at least two levels of computational feasibility (see also this related essay by Alex Appel). On the first level, which I call "weakly feasible", we allow the computing time to scale polynomially with the number of hypotheses we consider, or exponentially with the description length of the correct hypothesis (these two are more or less interchangeable since, the number of hypotheses of given description length is exponential in this length). Thus, algorithms like Levin's universal search or Solomonoff induction over programs with polynomial time complexity, or Posterior Sampling Reinforcement Learning with a small number of hypotheses fall into this category. On the second level, which I call "strongly feasible", we require polynomial computing time for the "full" hypothesis space. At present, we only know how to achieve theoretical guarantees on this second level in narrow contexts, such as reinforcement learning with a small state space (i.e. with number of states polynomial in the security parameter).

In fact, the current gap in our theoretical understanding of deep learning is strongly related to the gap between weak and strong feasibility. Indeed, results about expressiveness and (statistical) learnability of neural networks are well-known, however exact learning of neural networks is NP-complete in the general case. Understanding how this computational barrier is circumvented in practical problems is a key challenge in understanding deep learning. Such understanding would probably be a positive development in terms of AI alignment (although it might also contribute to increasing AI capacity), but I don't think it's a high priority problem since it seems to already receive considerable attention in mainstream academia (i.e. it is not neglected).

I believe that the development of AI alignment theory should proceed by prioritizing information-theoretic analysis first, complexity-theoretic analysis in the sense of weak feasibility second, and complexity-theoretic analysis in the sense of strong feasibility last. That said, we should keep the complexity-theoretic considerations in mind, and strive to devise solutions that at least seem feasible modulo "miracles" similar to deep learning (i.e. modulo intractable problems that are plausibly tractable in realistic special cases). Moreover, certain complexity-theoretic considerations are already implicit in the choice of the space of hypotheses for your learning problem (e.g. Solomonoff induction has to be truncated to polynomial-time programs to be even weakly feasible). In particular, we should keep in mind that the hypotheses must be computationally simpler than the agent itself, whereas the universe must be computationally more complex than the agent itself. More on resolving this apparent paradox later.

## Algorithmic Information Theory

The choice of hypothesis space plays a crucial role in any learning task, and the choice of prior plays a crucial role in Bayesian reinforcement learning. In

narrowAI this choice is based entirely on the prior knowledge of the AI designers about the problem. On the other hand,generalAI should be able to learn its environment with little prior knowledge, by noticing patterns and using Occam's razor. Indeed, the latter is the basis of epistemic rationality to the best of our understanding. The Solomonoff measure is an elegant formalization of this idea.However, Solomonoff induction is incomputable, so a realistic agent would have to use some truncated form of it, for example by bounding the computational resources made available to the universal Turing machine. It thus becomes an important problem to find a natural prior such that:

It allows for a (sufficiently good) sublinear regret bound with a computationally feasible algorithm.

It ranks hypotheses by description complexity in some appropriate sense.

It satisfies some universality properties analogous to the Solomonoff measure (but appropriately weaker).

## Towards a rigorous definition of intelligence

The combination of perfect Bayesian reinforcement learning and the Solomonoff prior is known as AIXI. AIXI may be regarded as a model ideal intelligence, but there are several issues that were argued to be flaws in this concept:

Traps: AIXI doesn't satisfy any interesting regret bounds, because the environment might contain traps. In fact, the set of all computable environments is an

unlearnableclass of hypotheses: no agent has a sublinear regret bound w.r.t. this class.Cartesian duality: AIXI's "reasoning" (and RL in general) seems to assume the environment cannot influence the algorithm executed by the agent. This is unrealistic. For example if our agent is a robot, then it's perfectly possible to imagine some external force breaking into its computer and modifying its software.

Irreflexivity: The Solomonoff measure contains only computable hypotheses but the agent itself is uncomputable. In particular, AIXI can satisfy no guarantees pertaining to environments that e.g. contain other AIXIs. An analogous problem persists with any simple attempt to modify the prior: the prior can only contain hypotheses simpler than the agent.

Decision-theoretic paradoxes: AIXI seems to be similar to a Causal Decision\ Theorist so apparently it will fail on Newcomb-like problems.

The Cartesian duality problem and the traps problem are actually strongly related. Indeed, one can model any event that destroys the agent (including modifying its source code) as the transition of the environment into some inescapable state. Such a state should be assigned a reward that corresponds to the expected utility of the universe going on without the agent. However, it's not obvious how the agent can learn to anticipate such states, since observing it once eliminates any chance of using this knowledge later. DRL already partially addresses this problem: more discussion in the next section.

Solving irreflexivity requires going beyond the Bayesian paradigm by including models that don't fully specify the environment. More details in the next section.

Finally, the decision-theoretic paradoxes are a more equivocal issue than it seems, because the usual philosophical way of thinking about decision theory assumes that the model of the environment is

given, whereas in our way of thinking, the model islearned. This is important: for example, if AIXI is placed in a repeated Newcomb's problem, it will learn to one-box, since its model will predict that one-boxingcausesthe money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned "causal" relationships are not the same as physical causality. Formalizing other Newcomb-like problems require solving irreflexivity first, because the environment contains Omega which cannot be simulated by the agent. Therefore, my current working hypothesis is that decision theory will be mostly solved (or dissolved) bySolving irreflexivity

Value learning will automatically learn some aspects of the decision theory too.

Allowing for self-modification, which should be possible after solving irreflexivity + Cartesian duality (self-modification may be again be regarded as a terminal state)

To sum up, clarifying all of these issues should result in formulating a certain optimality condition (regret bound) which may be regarded as a rigorous definition of intelligence. This would also constitute progress towards defining "values" (having certain values means being intelligent w.r.t. these values), but the latter might require making the definition even more lax. More on that later.

## Research Programme Outline

In this section I break down the research programme into different domains and subproblems. The list below is not intended to be a linear sequence. Indeed, many of the subproblems can be initially attacked in parallel, but also many of them are interconnected and progress in one subproblem can be leveraged to produce a more refined analysis of another. Any concrete plan I have regarding the order with which these questions should be addressed is liable to change significantly as progress is made. Moreover, I expect the entire breakdown to change as progress is made and new insights are available. However, I do believe that the high-level principles of the approach have a good chance of surviving, in some form, into the future.

## Universal reinforcement learning

The aim of this part in the agenda is deriving regret bounds or other performance guarantees for certain settings of reinforcement learning that are simultaneously strong enough and general enough to serve as a compelling definition / formalization of the concept of general intelligence. In particular, this involves solving the deficiencies of AIXI that were pointed out in the previous section.

I believe that a key step towards this goal is solving the problem of "irreflexivity". That is, we need to define a form of reinforcement learning in which the agent achieves reasonable performance guarantees despite an environment which is as complex or more than the agent itself. My previous attempts to make progress towards that goal include minimax forecasting and dominant forecasters for incomplete models. There, the aim was passive forecasting rather than reinforcement learning.

The idea of minimax forecasting can be naturally extended to reinforcement learning. Environments in reinforcement learning naturally form a convex set E in some topological vector space (where convex linear combinations correspond to probabilistic mixtures). Normally, models are points of E, i.e. specific environments. Instead, we can consider

incompletemodels which are non-empty convex subsets of E. Instead of considering Eμ⋈π[U], the expected utility of policy π interacting with environment μ∈E, we can consider minμ∈ΦEμ⋈π[U], where Φ is an incomplete model: the minimal guaranteed expected utility of π for environments compatible with the incomplete model Φ. We can define a set H of incomplete models to belearnablewhen there is a metapolicy π∗ s.t. for any Φ∈Hlimγ→1(maxπminμ∈ΦEμ⋈π[Uγ]−minμ∈ΦEμ⋈π∗γ[Uγ])=0

Here, γ∈[0,1) is the time discount parameter. Notably, this setting satisfies the analogue of the universality property of Bayes-optimality (see "Proposition 1" in this essay). Here, the role of the Bayes-optimal policy is replaced by the policy

πζγ:=argmaxπminμ∈ΦζEμ⋈π[Uγ]

Here, Φζ is the "incomplete prior" corresponding to some ζ∈ΔH:

Φζ={EΦ∼ζ[μ(Φ)]∣∣∣μ:H→E, ∀Φ∈H:μ(Φ)∈Φ}

Moreover, it is possible to define an incomplete analogue of MDPs. These are stochastic games, where the choices of the opponent correspond to the "Knightian uncertainty" of the incomplete model. Thus, it is natural to try and derive regret bounds for learning classes of such incomplete MDPs. In fact, this theory might

justifythe use of finite (or other restricted) MDPs which is common in RL and is needed for deriving most regret bounds. Indeed, there is no reason why physical reality should be a finite MDP, however this does not preclude us from using a finite stochastic game as anincomplete modelof reality. In particular, an infinite MDP (and thus also a POMDP, since a POMDP can be reduced to an MDP whose states are belief states = probability measures on the state space of the POMDP) can be approximated by a finite stochastic game by partitioning its state space into a finite number of "cells" and letting the opponent to choose the exact state inside the cell upon each transition.It is possible to generalize this setting further by replacing "crisp" sets of environments by fuzzy sets. That is, we can define a "fuzzy model" to be a function ϕ:E→[0,1] (the membership function) s.t. ϕ−1(1) is non-empty. The performance of a policy π on the model ϕ is then given by

minμ∈E(Eμ⋈π[U]+1−ϕ(μ))

Note that U is assumed to take values in [0,1], so no μ∈E with ϕ(μ)=0 can affect the above value.

This generalization allows capturing a broad spectrum of performance guarantees. For example, given any policy π we can define ϕπ:E→[0,1] by

ϕπ(μ):=Eμ⋈π[U]

Then, learning the model ϕπ amounts to learning to perform at least as well as π, whatever the environment is. Thus, the setting of "fuzzy reinforcement learning" might be regarded as a hybrid of model-based and model-free approaches.

One test for any theory attempting to solve irreflexivity is whether it leads to reasonable game-theoretic solution concepts in multi-agent scenarios. For example, it is obvious that incomplete models lead to Nash equilibria in

zero-sumgames (an incomplete modelisa zero-sum game, in some sense), but the situation in more general games in currently unknown. Another sort of test is applying the theory to Newcomb-like decision-theoretic puzzles, although solving all of them might require additional elements, such as self-modification. Further applications of such a theory which may also be regarded as tests will appear in the next subsection.Next, the problem of traps has to be addressed. DRL partially solves this problem by postulating an advisor that has prior knowledge about the traps. It seems reasonable to draw a parallel between this and real-world human intelligence: humans learn from previous generations regarding the dangers of their environment. In particular, children seems like a salient example of an algorithm employing a lot of exploration while trusting a different agent (the parent) to prevent it from falling into traps. However, from a different perspective, this seems like hiding the difficulty in a different place. Namely, if we consider the whole of

humanityas an intelligent agent (which seems a legitimate model at least for the purposes of this particular issue), then how diditavoid traps? To some extent, we can claim that human DNA is another source for prior knowledge, acquired by evolution, but somewhere this recursion must come to an end.One hypothesis is, the main way humanity avoids traps is by happening to exist in a relatively favorable environment and

knowingthis fact, on some level. Specifically, it seems rather difficult for a single human or a small group to pursue a policy that will lead all of humanity into a trap (incidentally, this hypothesis doesn't reflect optimistically on our chances to survive AI risk), and also rather rare for many humans to coordinate on simultaneously exploring an unusual policy. Therefore, human history may be very roughly likened toepisodicRL where each human life is an episode.This mechanism should be formalized using the ideas of quantilal control. The baseline policy comes from the prior knowledge / advisor, and the allowed deviation (some variant of Renyi divergence) from the baseline policy is chosen according to the prior assumption about the rate of falling into a trap while following the baseline policy. This should lead to an appropriate regret bound.

I think that another important step towards universal RL is deriving regret bounds that exploit

structural hierarchies. This builds on the intuition that, although the real world is very complex and diverse, the presence of structural hierarchies seems like a nearly universal feature. Indeed, it is arguable that we would never reach our current level of understanding physics if there was no separation of scales that allowed studying the macroscopic world without knowing string theory et cetera. I see 3 types of hierarchies that need to be addressed, together with their mutual interactions:Temporal hierarchy: Separation between processes that happen on different time-scales. We can try to model it by considering MDPs with a hierarchical state-space, s.t. transitions on a higher levels of the hierarchy happen much slower than transitions on a lower level of the hierarchy. This means that w.r.t. to a higher level, the lower level can always be considered to occupy an equilibrium distribution over states.

Spatial hierarchy: Separation between processes that happen on different space-scales. We consider a "cellular decision process" which is an MDP that is structured like a cellular automaton. It is then tempting to try and connected RL theory with renormalization group methods from physics.

Informational hierarchy: We consider a hierarchical structure on the

space of hypotheses. That is, we expect the agent to first learn the high-level class to which the environment belongs, then learn the class corresponding to the lower level of the hierarchy et cetera, until it learns the actual environment. This is a formalization of the idea of "learning how to learn", a rather well-known idea for reducing the sample complexity of reinforcement learning.In particular, I expect these hierarchies to yield regret bounds which do not have the "trial and error" form of most known regret bounds. That is, known regret bounds imply a sample complexity that is a large multiple of either the reset time (for RL with resets) or the mixing time (for RL without resets). This seems unsatisfactory: a model-based learner should be able to extrapolate its knowledge forward without waiting for a full "cycle" of environment response. Certainly we expect an artificial superintelligence to achieve a pivotal event from the first attempt, in some sense.

Also, the hierarchies should bridge at least part of the gap between weak and strong feasibility. Indeed, many of the successes of deep learning were based on CNNs and Boltzmann machines which seem to be exploiting the spatial hierarchy.

Returning to the issue of traps, there might be some sense in which our environment is "favorable" which is more sophisticated than the discussion before and which may be formalized using hierarchies (e.g. early levels of the information hierarchy can be learned safely and late levels only contain traps predictable by the early levels).

Finally, as discussed in the previous section, defining the correct universal prior and analyzing its properties is crucial to complete the theory. Given the hypotheses put forth in this section, this prior should be

A

fuzzyprior rather than a "complete" priorPossibly consist of a fuzzy version of finite, or otherwise restricted, MDPs (although Leike derives some regret bounds for general environments, at the cost of assuming sufficiently slowly dropping time discount and in particular ruling out geometric time discount). One way to think of it is, the finite MDPs are just an approximation of the infinite reality, however maybe we can also consider this a vindication of some sort of ultrafinitism.

Reflect some "favorability" assumptions

Have a hierarchical structure and consist of hierarchical models

## Value learning protocols

The aim of this part in the agenda is developing learning setups that allow one agent (the AI) to learn the values of a different agent or group of agents (humans). This involves directly or indirectly tackling the issues of, what does it mean for an agent to have particular values if it is imperfectly rational and possibly vulnerable to manipulation or other forms of "corruption".

At present, I conceive of the following possible basic mechanisms for value learning:

Formal communication: Information about the values is communicated to the agent in a form with pre-defined formal semantics. Examples of this are, communicating a full formal specification of the utility function or manually producing a reward signal. Other possibilities are, communicating partial information about the reward signal, or evaluating particular hypothetical situations.

Informal communication: Information about the values is communicated to the agent using natural language or in other form whose semantics have to be learned somehow.

Demonstration: The agent observes a human pursuing eir values and deduces the values from the behavior.

Reverse engineering: The agent somehow acquires a full formal specification of a human (e.g. an uploaded brain) and deduces the values from this specification. This is probably not a very realistic mechanism, but might still be useful for "thought experiments" to test possible definitions of imperfect rationality.

Formal communication is difficult because human values are complicated and describing them precisely is hard. A manual reward signal is more realistic than a full specification, but:

Still difficult to produce, especially if this reward is supposed to reflect the true aggregate of the human's values as observed by em across the universe (which is what it would have to be in order to aim at the "true human utility function").

The reward signal will become erroneous if the human or just the communication channel between the human and the agent will be corrupted in some way. This is serious problem since, if not taken into account, it incentives the agent to produce such corruption.

If the agent is supposed to aim at a very long-term goal, there might be little to no relevant information in the reward signal until the goal is attained.

Overall, it might be more realistic to rely on formal communication for tasks of limited scope (putting a strawberry on plate) rather than actually learning human values in full (i.e. designing a sovereign). However, it is also possible to combine several mechanisms in a single protocol, and formal communication might be only one of them.

The problem of corruption may be regarded as a special cases of the problem of traps (the latter was outlined in the previous section), if we assume that the agent is expected to achieve its goals without entering corrupt states. Delegative Reinforcement Learning aims to solve both problems by occasionally passing control to the human operator ("advisor"), and using it to learn which actions are safe. The analysis of DRL that I produced so far can and should be improved in multiple ways:

Instead of considering only a finite or countable set of hypotheses, we should consider a space of hypotheses of finite "dimension" (for some appropriate notion of dimension; it is a common theme in statistical learning theory that different learning setups have different natural notions of dimensionality for hypothesis classes). We then should obtain a regret bounded depending on the dimension and the entropy of the prior. I believe that I already have significant progress on this point, with results expected soon.

Merge the ideas of quantilal control and catastrophe mitigation to yield a setting where, corruption is a quantitative/gradual rather than boolean and there is a low but non-vanishing rate of corruption along the advisor policy. Successful catastrophe mitigation will be achieved if it is possible without high Renyi divergence from the advisor policy, in some sense.

In the current form DRL requires the advisor to be ready to act instead of the agent at any given time moment. If the temporal rate at which actions are taken is high, this is an unrealistic demand. Naively, we can solve this by dividing time into intervals, and considering the policy on each interval as a single action. However, this would introduce an exponential penalty into the regret bound. Therefore, a more sophisticated way to manage the scheduling of control between the agent and the advisor is required, with a corresponding regret bound.

DRL assumes that the advisor takes the optimal action with some minimal probability ϵ. The interpretation of probability in this context requires further inquiry. Specifically, it seems that a realistic interpretation would treat this probability as

pseudorandomin some sense, s.t. the agent might simultaneously employ a more refined model within which the advisor might even be deterministic. Possibly relevant is the work of Shalizi (hat tip to Alex Appel for bringing it to my attention) where ey show that under some ergodicity assumptions (which should work for us since we use finite MDPs) Bayesian updating converges to the model in the prior that is thenearestto the true environment in some sense (this is reminiscent of optimal estimator theory). Thus, we can imagine having a "full-fledged" prior with refined models and a coarse prior s.t. this pseudorandom probability is defined by "projecting" to it in the Shalizi sense.Instead of only considering MDPs with a finite state space, we can consider e.g. Feller continuous MDPs with a compact state space, and/or POMDPs. This has some conceptual importance, since, it is unrealistic to assume that the advisor knows the traps of the real physical environment, but it is more realistic to assume the advisor knows the traps of

its own belief state regarding the environment(see also "Corollary 1" in the essay about DIRL). However, it seems dubious to describe belief states as finite MDPs, since a probabilistic mixture of finite MDPs is not a finite MDP (but itisa fintie POMDP). On the other hand, we also need to consider fuzzy/incomplete "MDPs". As we discussed in the previous subsection, this might actually make it redundant to consider infinite MDPs and POMDPs. Note also that infinite MDPs pose computational challenges and in particular solving even finite POMDPs is known to be PSPACE-complete.[EDIT: Added after discussion with Jessica Taylor in the comments]The action that a RL agent takes depends both on the new observation and its internal state. Often we ignore the latter and pretend the action depends only on the history of observations and actions, and this is okay because we can always produce the probability distribution over internal states conditional on the given history. However, this is only ok for information-theoretic analysis, since sampling this probability distribution given only the history as input is computationally intractable. So, it might be a reasonable assumption that the advisor takes “sane” actions when left to its own devices, but it is, in general, not reasonable to assume the same when it works together with the AI. This is because, even if the AI behaved exactly as the advisor, it would hide the simulated advisor’s internal state, which would preclude the advisor from taking the wheel and proceeding with the same policy. We can overcome it by letting the advisor write some kind of “diary” that documents eir reasoning process, as much as possible. The diary is also considered a part of the environment (although we might want to bake into the prior the rules of operating the diary and a “cheap talk” assumption which says the diary has no side effects on the world). This way, the internal state is externalized, and the AI will effectively become transparent by maintaining the diary too (essentially the AI in this setup is emulating a “best case” version of the advisor). This idea deserves a formal analysis that explicitly models the advisor as another RL agent.There is another issue with DRL that is worth discussing, although I am not sure whether it calls for a formal analysis soon. So far, we assumed that there are no side effects on the environment from the act of delegation itself. That is, the same action has exactly the same results whether carried out by the advisor or by the agent. Obviously, this is not realistic since any physical isolation layer created to ensure this will not be entirely fool-proof (as a bare minimum, the advisor emself will remember which actions ey took). The sole exception is, perhaps, if both the agent and the advisor are programs running inside a homomorphic cryptography box. More generally, any RL setup ignores the indirect (i.e. not mediated by actions) side-effects that the execution of the agent's algorithm has on the environment (although it is more realistic to solve this latter problem by homomorphic cryptography). This issue seems solvable via the use of incomplete/fuzzy models (see previous subsection). Although the true physical environment does have side effects as above, the

modelthe agent tries to learn may ignore those side-effects (i.e. subsume them in the "Knightian uncertainty"). Similar remarks apply to the use of a source of random inside the algorithm I analyzed (a form of Posterior-Sampling Reinforcement Learning) that is assumed to be invisible to the environment (although it is also possible to use deterministic algorithms instead: for example, the Bayes-optimal policy is deterministic and necessarily satisfies the same Bayesian regret bound, although it is also not even weakly feasible). One caveat is the possibility of non-Cartesian daemons, defined and discussed in the next subsection.The demonstration mechanism avoids some of the difficulties with formal communication, but has its own drawbacks. The ability to demonstrate a certain preference is limited by the ability to satisfy this preference. For example, suppose I am offered to play against Kasparov for money: if I win the game, I win $100 and if I lose the game, I lose $100. Then, I will refuse the bet because I know that I have few chances of winning. On the other hand, an AI might be able to win against Kasparov, but, seeing my demonstration it will remain uncertain whether I avoided the game because I'm afraid to lose or because of some other reason (for example, maybe I don't want to have more money, or maybe there is something intrinsically bad about playing chess). Therefore, it seems hard to produce a performance guarantee which will imply successfully learning the human's preferences

andsignificantly outperforming the human in satisfying these preferences. In particular, the regret bound I currently have for Delegative Inverse Reinforcement Learning assumes that the "advisor" (the human) already takes the optimal action with maximal likelihood among all actions on any given time step.In other words, the advantage of providing a manual reward signal over demonstration is related to the separation between P and NP. Solving a problem is much harder than verifying a solution, and similarly, demonstrating behavior that maximizes certain values is (computationally) much harder than evaluating a behavior according to the same values.

We can use the demonstration mechanism to a much greater advantage by designing a protocol that allows

the human to learn from the AI. This is because many tasks can be much easier to solve with external advice. On the other hand, the AI can tell which advice is useful by seeing that the human changes eir behavior. For example, suppose that the task at hand is receiving the adjacency matrices of two graphs and producing an isomorphism between them. Then, if the human is unable to always solve this on eir own, it might be impossible for the AI to unambiguously understand what the task is. However, suppose that the AI can make several plausible guesses and test them by communicating the corresponding answer to the human. It will then observe which of those guesses cause the human to endorse the answer, and will thereby become able to carry out the task on its own.More generally, we can consider the following learning protocol that I call "Learning By Teaching" (LBT). We have our agent and two additional actors (in the simplest case, humans): an "operator" and an "advisor". The agent can, at each given moment, decide between 3 modes:

Mode I: The operator and the advisor carry on without the AI's input. In this case, the advisor communicates something (the "advice") to the operator, and the operator takes actions that influence the external environment.

Mode II: The AI acts instead of the advisor. Thus, the AI communicates advice to the operator, and the operator takes external actions.

Mode III: The AI acts instead of the operator, taking external actions directly.

In all modes, all actors observe the percept produced by the environment.

Note that, the semantics of the advisor's communications are a priori unknown to the agent, and ultimately it seems reasonable this communication will use natural language, so this learning protocol may be said to incorporate both demonstration and informal communication.

Now, the AI is at risk of committing dangerous actions both in mode II and in mode III. In particular, there might be some forms of advice that will corrupt the operator and cause irreversible damage. However, like in DRL, it can mitigate this risk by learning from the advisor and the operator which actions are safe.

Finally, whatever the ultimate value learning protocol will be, it is desirable to have it grounded in a coherent theory of imperfect rationality. Conversely, I believe that a reasonable theory of imperfect rationality should admit a value learning protocol (i.e. the concept of "values" should be observable and measurable in an appropriate sense). Specifically LBT suggests 3 types of "flaws" an agent is allowed to have while maintaining particular values:

Its modeling abilities are limited: some computable and even efficiently computable models don't appear in its prior. This flaw is entirely relative (i.e. some agents are more limited than others), since any feasible agent is limited.

Some events might result in a plastic response of the agent ("corruption") which makes it irreversibly lose rationality and/or alignment with its initial values. For this to be consistent with well-defined values, we need to assume that "left to its own devices" the agent only becomes corrupt with a small rate. A peculiar thing about this assumption is that it depends on the environment rather than only on the agent. This seems unavoidable. The philosophical implication is, the values of an agent (and perhaps therby also its "identity" or "consciousness") reside not only inside the agent (in the case of a human, the brain) but also in its environment. They are contextual. Indeed, if we imagine a whole brain emulation of a human transmitted to aliens in a different dimension with completely different physics, that know nothing about our universe, it seems impossible for those aliens to reconstruct the human's value. Placing the brain in arbitrary environments might lead to "reprogramming" it with a wide array of different values. Moreover, assuming that the environment is one that is plausible to "naturally" contain a human brain also doesn't solve the problem: malign superintelligences across the multiverse might exploit this assumption by creating human brains in odd environments on purpose (see Christiano's closely related discussion of why the universal prior is malign).

The agent's policy might involve significant random noise, for example s.t. only the maximal likelihood policy is "rational" even given the two previous flaws (like the advisor in DIRL). Like in the discussion of DRL above, this requires some nuanced analysis of what counts as "random": some process might appear random given a certain level of computational resources and predictable given a higher level. Therefore, we might need to "project" the actual agent onto a suitable bounded model thereof.

In my opinion, the theory of imperfect rationality suggested by these considerations might already be close to capturing all the nuances of human irrationality.

## Taming daemons

"Daemons" are intelligent agents that are simulated, within some degree of fidelity, inside the AI's reasoning process and whose malign agendas can render the entire system unaligned even if the AI's "main" reasoning process is correctly designed to acquire human values in some sense. The aim of this part of the agenda is to formalize the problem and provide solutions in the form of theoretical guarantees.

I distinguish between two types of daemons: Cartesian and non-Cartesian. Cartesian daemons are those that are dangerous

even if the Cartesian boundary between the AI and the rest of the universe is perfect.In other words, Cartesian daemons influence the word through the "legitimate" output channels given to the AI and while preserving the computational process inside the AI's hardware conformant to the software model assumed by the designers. On the other hand, non-Cartesian daemons rely on violations of the Cartesian boundary: they either exploit some physical side effects of the computation on the world or "hack" the AI from the inside in a way that breaks its software model (e.g. violate an abstraction layer in the software). Of course we might have a single (in some sense) subagent that utilizes both Cartesian and non-Cartesian attack mechanisms. In the following, we will refer to a "Cartesian daemon" in the sense of "a Cartesian attack mechanism of some daemon" and similarly for a "non-Cartesian daemon".The taming of Cartesian daemons is guaranteed by the same regret bounds that guarantee the AI's alignment in general. For example, we can examine how it works in the case of the acausal attack identified by Christiano. In this scenario, some malign superintelligence (let's call it Mu) somewhere in the multiverse simulates an environment very similar to that experienced by the AI we constructed on Earth. This makes the AI uncertain about whether it is on Earth or in Mu's realm, under some conditions assigning a higher probability to Mu. Mu then arranges the environment to diverge from Earth at some pivotal point s.t. the AI's expectation of this divergence leads it to perform malign actions

beforethe divergence happens. Now, if the AI is implementing DRL, the uncertainty between Earth and Mu leads it to delegate to the advisor precisely at the moment this difference is important. Seeing the advisor display Earth-like rather than Mu-like behavior, it is then able to discard the Mu hypothesis.Of course, it is important to make the discussion above (and more generally, the performance guarantee that implies this discussion as a special case) quantitative which will require better understanding of universal priors and their corresponding regret bounds, as discussed before. In particular, it appears that for some parameters the acausal attack might still succeed if Mu will manage to make the malign hypothesis

overwhelminglymore likely than Earth. Avoiding this might requiring carefully defining the prior, for example counting the AI's own source code as only a single unit of description complexity, which might lower the description length cost of locating Earth in the universe considerably.Non-Cartesian daemons theoretically can be completely eliminated by placing the AI in a homomorphic encryption box. This relies mainly on two assumptions:

The complexity theoretic conjecture underlying the cryptographic system is sound (specifically, the stronger conjecture that implies safety against quantum computing attacks as well).

The physical universe is, in some sense, in the class BQP (as our current understanding of physics seems to indicate), or at least any stronger-than-BQP layer would be inaccessible from the box (which seems likely due to the usual separation of scales in physics).

Of course one would also have to take care that the implementation of this idea is sound, for example the key should probably be generated from quantum randomness and kept at a location sufficiently remote/isolated (in some sense that requires careful physical/cryptographic analysis) from the AI's hardware (with a linking communication channels for encrypted inputs/outputs).

More troubling is the large performance overhead this would impose. This overhead might make the hypothetical aligned AI uncompetitive relative to unaligned projects. It might be possible to tolerate this overhead if either there is considerable computing overhang (i.e. by the time the relevant algorithms are discovered, the hardware is sufficient to quickly grow a superintelligence even given this overhead, resulting in a singleton) or the aligned AI is a national or multinational project whose budget allows it to be orders of magnitudes ahead of the competition in terms of computing power. However, since this is far from certain, designing alternative solutions is desirable.

One alternative approach is, employing extremely meticulous formal verification. That is, a physical model of the AI's software, hardware and its immediate environment will be developed and it will be formally verified that:

It is not possible to "hack" the model from within, i.e. if the system is intact initially it will remain intact baring some extreme external influence.

The communication bandwidth from the AI's software to the outside bypassing the legitimate output channels is very small (it is probably physically impossible to make it zero).

We will then need some analysis which shows that, given the very small "illegitimate" bandwidth, the aligned AI will "defeat" the unaligned daemons with high probability. This requires some game-theoretic model incorporating physical assumptions the shape of which is currently unknown to me. Needless to say, we will also have to make sure the implementation fits the model, which might require, for example, extreme means to isolate the hardware from contaminants. There will be a tradeoff between making the model simpler and thus easier to verify, versus making the model more complex and thus easier to physically protect and possibly allowing for more efficient hardware.

Also, at some point the AI should be mature enough to design an anti-daemon immune system for itself. However, it is not clear whether that simplifies the challenge, since we have to ensure the daemons will not grow out of control before this point.

## Recursive self-improvement

The aim of this part of the agenda is formalize and analyze the concept of "recursive self-improvement" in learning-theoretic language.

Recursive self-improvement as a method of extremely rapid capability growth is an intriguing idea, however so far it has little rigorous support. Moreover, it far from clear that the first AGI will be recursively self-improving, even if the concept is sound. Therefore, I do not see it as high priority item on the agenda. Nevertheless, it is worth some attention both because of the capability angle and because of possible applications to decision-theory.

At present, I have only a few observations on how the subject might be approached:

The dangers of self-modification can be naturally regarded as "traps", due to the irreversible nature of self-modification. Therefore, it seems appropriate to address them by the mechanisms of DRL. The way I expect it to cash out in practice is: (i) the AI will initially not self-modify directly but only by suggesting a self-modification and delegating its acceptance to the advisor (ii) a self-modification will only be approved if "annotated" by a natural language explanation of its safety and merit (iii) the explanation will be honest (non-manipulative) due to being sampled out of a the space of explanations that the advisor might have produced by emself.

One way to view self-modification is as a game, where the players are all the possible modified versions of the agent. The state of environment also includes the modification state of the agent, so at each state there is one player that is in control (a "switching controller" stochastic game). Therefore, if we manage to prove theoretical guarantees about such games (for e.g. fuzzy reinforcement learning), these guarantees have implications for self-modification. It might be possible to assume the game is perfectly cooperative, since all modifications that change the utility function can be regarded as traps.

The initial algorithm of the AI will already be computationally feasible (e.g. polynomial time) and will probably satisfy a regret bound close to the best possible. However, polynomial time is a only a qualitative property: indeed, we cannot be more precise without choosing a specific model of computation. Therefore, it might be that the capability gains from self-improvement should be regarded as, tailoring the algorithm to the particular model of computation (i.e. hardware). In other words, it involves the algorithm

learningthe hardware on which it is implemented (we could provide it with a formal specification of this hardware but this doesn't gain much as the agent would not know, initially, how to effectively utilize such a specification). Now, every improvement gained speeds up further improvements, but also there is some absolute upper bound: the best possible implementation. It is therefore interesting to understand whether there are asymptotic regimes for which the agent undergoes exponentially fast improvement during some local period of time, and if so, how realistic are these regimes.## Summary

In this section, I recap and elaborate the main features of the agenda as I initially stated them.

Viewing AI alignment theory as part of a general abstract theory of intelligence: The "general abstract theory of intelligence" is implemented in this agenda as the theory of universal reinforcement learning. All the other parts of the agenda (value learning, daemons, self-improvement) are grounded in this theory.

Using desiderata and axiomatic definitions as starting points, rather than specific algorithms and constructions: The main goal of this agenda is establishing which theoretical guarantees (in particular, in the form of regret bounds, but other types may also appear) can and should be satisfied by intelligent agents in general and aligned intelligent agents (i.e. value learning protocols) in particular. Any specific algorithm is mostly just a tool for proving that a certain guarantee can be satisfied, and its sole motivation is this guarantee. Over time these algorithms might evolve into something close to a practical design, but I also have no compunctions about discarding them.

Formulating alignment problems in the language of learning theory: The agenda is "conservative" in the sense that its tools are mostly the same as used by mainstream AI researchers, but of course the priorities and objectives are quite different.

Evaluating solutions by their formal mathematical properties, ultimately aiming at a quantitative theory of risk assessment: So far the properties I derived were rather coarse and qualitative, and the models were also coarse and grossly over-simplified. However, as stated in the "philosophy" section, I see it as a reasonable starting point for further growth.

Ultimatelythese mathematical properties should become sufficiently refined to translate to real-world implications (such as, what is the probability the AI will be misaligned, or what is the time from launch to pivotal event; of course, realistically, these will always be estimates with considerable error margins).Relying on the mathematical intuition derived from learning theory to pave the way to solving philosophical questions: In particular, the way I approach questions such as "what is imperfect rationality?", "how to use inductive reasoning without assuming Cartesian duality?" and "how to deal with decision-theoretic puzzles?" is guided by what seems natural within the framework of learning theory, and variants of reinforcement learning in particular. I see it as tackling the problems head-on, as opposed to approaches which use causal networks or formal logic, which IMO involve more assumptions that don't follow from the formulation of the problem.

This agenda is not intended as a territorial claim on my part. On the contrary, I encourage other researchers to work on parts of it or even adopt it entirely, whether in collaboration with me or independently. Conversely, I am also very interested to hear criticism.