What this post is

This is a review post of public work in AI alignment over 2019, with some inclusions from 2018. It has this preamble (~700 words), a short version / summary (~1.6k words), and a long version (~8.3k words). It is available as a Google Doc here.

There are many areas of work that are relevant to AI alignment that I have barely touched on, such as interpretability, uncertainty estimation, adversarial examples, and assured autonomy, primarily because I have not been following these fields and wouldn’t be able to write a good summary of what has happened in them. I have also mostly focused on articles that provide some conceptual insight, and excluded or briefly linked to papers that primarily make quantitative improvements on important metrics. While such papers are obviously important (ultimately, our techniques need to work well), there isn’t much to say about them in a yearly review other than that the quantitative metric was improved.

Despite these exclusions, there was still a ton of work to select from, perhaps ~500 articles, of which over 300 have been linked to in this post. There are many interesting articles that I really enjoyed that get only a sentence of description, in which I ignore many of the points that the article makes. Most have been summarized in the Alignment Newsletter, so if you’d like to learn more about any particular link, but don’t want to read the entire thing, just search for its title in the database.

What you should know about the structure of this post

I am not speaking for myself; by default I am trying to explain what has been said, in a way that the authors of the articles would agree with. Any extra opinion that I add will be in italics.

As a post, this is meant to be read sequentially, but the underlying structure is a graph (nodes are posts, edges connect posts that are very related). I arranged it in a sequence that highlights the most salient-to-me connections. This means that the order in which I present subtopics is very much not a reflection of what I think is most important in AI safety: in my presentation order, I focused on edges (connections) rather than nodes (subtopics).

Other minor details:

  1. Any links from earlier than 2018 will have their year of publication right after the link (except for articles that were reposted as part of Alignment Forum sequences).
  2. I typically link to blog posts; in several cases there is also an associated paper that I have not linked.

How to read this post

I have put the most effort into making the prose of the long version read smoothly. The hierarchical organization is comparatively less coherent; this is partly because I optimized the prose, and partly because AI safety work is hard to cluster. As a result, for those willing to put in the effort, I’d recommend reading the long version directly, without paying too much attention to the hierarchy. If you have less time, or are less interested in the minutiae of AI alignment research, the short version is for you.

Since I don’t name authors or organizations, you may want to take this as your opportunity to form beliefs about which arguments in AI alignment are important based on the ideas (as opposed to based on trust in the author of the post).

People who keep up with AI alignment work might want to know which posts I’m referencing as they read, which is a bit hard since I don’t name the posts in the text. If this describes you, you should be reading this post on the Alignment Forum, where you can hover over most links to see what they link to. Alternatively, the references section in the Google Doc lists all links in the order that they appear in the post, along with the hierarchical organization, and so you can open the references in a new tab, and read through the post and the references together.

I expect that if you aren’t already familiar with them, some articles will sound crazy from my summary here; please read at least the newsletter summary and ideally the full article before arguing that it’s crazy.


Thanks to the Alignment Newsletter team, Ben Pace, Oliver Habryka, Jonathan Uesato, Tom Everitt, Luke Muehlhauser, Jan Leike, Rob Bensinger, Adam Gleave, Scott Emmons, Rachel Freedman, Andrew Critch, Victoria Krakovna, and probably a few others (I really should have kept better track of this). Thanks especially to Ben Pace for suggesting that I write this review in the first place.

Short version (~1.6k words)

While the full text tries to accurately summarize different points of view, that is not a goal in this summary. Here I simply try to give a sense of the topics involved in the discussion, without saying what discussion actually happened.

Basic analysis of AI risk. Traditional arguments for AI risk argue that since agentic AI systems will apply lots of optimization, they will lead to extreme outcomes that can’t be handled with normal engineering efforts. Powerful AI systems will not have their resources stolen from them, which by various Dutch book theorems implies that they must be expected utility maximizers; since expected utility maximizers are goal-directed, they are dangerous.

However, the VNM theorem does not justify the assumption that an AI system will be goal-directed: such an assumption is really based on intuitions and conceptual arguments (which are still quite strong).

Comprehensive AI Services (CAIS) challenges the assumption that we will have a single agentic AI, instead suggesting that any task will be performed by a collection of modular services.

That being said, there are several other arguments for AI risk, such as the argument that AI might cause “lock in”, which may require us to solve hard philosophical problems before the development of AGI.

Nonetheless, there are disjunctive reasons to expect that catastrophe does not occur: for example, there may not be a problem, or ML researchers may solve the problem after we get “warning shots”, or we could coordinate to not build unaligned AI.

Agency and optimization. One proposed problem is that of mesa optimization, in which an optimization algorithm used to train an AI creates an agent that is itself performing optimization. In such a scenario, we need to ensure that the “inner” optimization is also aligned.

To better understand these and other situations, it would be useful to have a formalization of optimization. This is hard: while we don’t want optimization to be about our beliefs about a system, if we try to define it mechanistically, it becomes hard to avoid defining a bottle cap as an optimizer of “water kept in the bottle”.

Understanding agents is another hard task. While agents are relatively well understood under the Cartesian assumption, where the agent is separate from its environment, things become much more complex and poorly-understood when the agent is a part of its environment.

Value learning. Building an AI that learns all of human value has historically been thought to be very hard, because it requires you to decompose human behavior into the “beliefs and planning” part and the “values” part, and there’s no clear way to do this.

Another way of looking at it is to say that value learning requires a model that separates the given data into that which actually achieves the true “values” and that which is just “a mistake”, which seems hard to do. In addition, value learning seems quite fragile to mis-specification of this human model.

Nonetheless, there are reasons for optimism. We could try to build an adequate utility function, which works well enough for our purposes. We can also have uncertainty over the utility function, and update the belief over time based on human behavior. If everything is specified correctly (a big if), as time goes on, the agent would become more and more aligned with human values. One major benefit of this is that it is interactive -- it doesn’t require us to specify everything perfectly ahead of time.

Robustness. We would like our agents to be robust - that is, they shouldn’t fail catastrophically in situations slightly different from the ones they were designed for. Within reinforcement learning, safe reinforcement learning aims to avoid mistakes, even during training. This either requires analytical (i.e. not trial-and-error) reasoning about what a “mistake” is, which requires a formal specification of what a mistake is, or an overseer who can correct the agent before it makes a mistake.

The classic example of a failure of robustness is adversarial examples, in which a tiny change to an image can drastically affect its classification. Recent research has shown that these examples are caused (at least in part) by real statistical correlations that generalize to the test set, but that are nonetheless fragile to small changes. In addition, since robustness to one kind of adversary doesn’t make the classifier robust to other kinds of adversaries, there has been a lot of work done on improving adversarial evaluation in image classification. We’re also seeing some of this work in reinforcement learning.

However, asking our agents to be robust to arbitrary mistakes seems to be too much -- humans certainly don’t meet this bar. For AI safety, it seems like we need to ensure that our agents are robustly intent aligned, that is, they are always “trying” to do what we want. One particular way that our agents could be intent aligned is if they are corrigible, that is, they are trying to keep us “in control”. This seems like a particularly easy property to verify, as conceptually it seems to be independent of the domain in which the agent is deployed.

So, we would like to ensure that even in the worst case, our agent remains corrigible. One proposal would be to train an adversary to search for “relaxed” situations in which the agent behaves incorrigibly, and then train the agent not to do that.

Scaling to superhuman abilities. If we’re building corrigible agents using adversarial training, our adversary should be more capable than the agent that it is training, so that it can find all the situations in which the agent behaves incorrigibly. This requires techniques that scale to superhuman abilities. Some techniques for this include iterated amplification and debate.

In iterated amplification, we start with an initial policy, and alternate between amplification and distillation, which increase capabilities and efficiency respectively. This can encode a range of algorithms, but often amplification is done by decomposing questions and using the agent to answer subquestions, and distillation can be done using supervised learning or reinforcement learning.
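The alternation described above can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: the task (summing a tuple of numbers), the names `base_agent`, `amplify`, and `distill`, and especially the choice to model distillation as plain memorization rather than actual supervised learning. It is not how any real iterated amplification system is implemented; it only shows how each round lets the fast agent answer questions that the previous agent could only answer via decomposition.

```python
def base_agent(question):
    """Initial policy: only knows the sum of tuples with at most one element."""
    return sum(question) if len(question) <= 1 else None


def amplify(agent):
    """Amplification: a slower system that splits a question in two,
    asks the agent each half, and combines the sub-answers."""
    def amplified(question):
        direct = agent(question)
        if direct is not None:
            return direct
        mid = len(question) // 2
        left, right = agent(question[:mid]), agent(question[mid:])
        if left is None or right is None:
            return None  # the agent cannot yet answer a sub-question
        return left + right
    return amplified


def distill(amplified, training_questions, previous_agent):
    """Distillation: a fast agent that imitates the amplified system.
    Here 'training' is just memorizing its answers (a toy stand-in)."""
    table = {}
    for q in training_questions:
        answer = amplified(q)
        if answer is not None:
            table[q] = answer

    def distilled(question):
        if question in table:
            return table[question]
        return previous_agent(question)
    return distilled


question = (1, 2, 3, 4, 5, 6, 7, 8)
# Train on every contiguous slice so that all sub-questions are covered.
training_questions = [
    question[i:j]
    for i in range(len(question))
    for j in range(i + 1, len(question) + 1)
]

agent = base_agent
for _ in range(3):  # each round doubles the question length the agent handles
    agent = distill(amplify(agent), training_questions, agent)

final_answer = agent(question)
```

In this toy, `base_agent(question)` returns `None`, while after three rounds the distilled agent answers the full eight-element question directly. The memorizing `distill` does not generalize beyond its training questions, which is exactly the part a real learning algorithm would have to do better.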

In debate, we train an agent through self-play in a zero-sum game in which the agent’s goal is to “win” a question-answering debate, as evaluated by a human judge. The hope is that since each “side” of the debate can point out flaws in the other side’s arguments, such a setup can use a human judge to train far more capable agents while still incentivizing them to provide honest, true information.

Both iterated amplification and debate aim to train an agent that approximates the answer that one would get from an exponentially large tree of humans deliberating. The factored cognition hypothesis is that this sort of tree of humans is able to do any task we care about. This hypothesis is controversial: many have the intuition that cognition requires large contexts and flashes of intuition that couldn’t be replicated by a tree of time-limited humans.

Universality. One property we would hope to have is that if we use this tree of humans as an overseer for some simpler agent, then the tree would “know everything the agent knows”. If true, this property could allow us to build a significantly stronger conceptual argument for safety. It is also very related to…

Interpretability. While interpretability can help us know what the agent knows, and what the agent would do in other situations (which can help us verify if it is corrigible), there are other uses for it as well: in general, it seems better if we can understand the things we’re building.

Impact regularization. While relative reachability and attainable utility preservation were developed last year, this year saw them be unified into a single framework. In addition, there was a new proposed definition of impact: change in our ability to get what we want. This notion of impact depends on knowing the utility function U. However, we might hope that we can penalize some “objective” notion, perhaps "power", that occurs regardless of the choice of U, for the same reasons that we expect instrumental convergence.

Causal modeling. Causal models have been used recently to model the incentives for an agent under different AI safety frameworks, and to argue that by evaluating plans with the current reward function, you can remove the incentive for an agent to tamper with its reward function.

Oracles. Even if oracles are trying to maximize predictive accuracy, they could “choose” between different self-confirming predictions. We could avoid this using counterfactual oracles, which make predictions conditional on those predictions not influencing the future.

Decision theory. There was work on decision theory that I haven’t followed very much.

Forecasting. Several resources were developed to enable effective group forecasting, including an AI forecasting dictionary that defines terms, an AI resolution council whose future opinions can be predicted, and a dataset of well-constructed exemplar questions about AI.

Separately, the debate over takeoff speeds continued, with two posts arguing forcefully for continuous takeoff, without much response (although many researchers do not agree with them). The continuity of takeoff is relevant for but doesn’t completely determine whether recursive self improvement will happen, or whether some actor acquires a decisive strategic advantage. The primary implication of the debate is whether we should expect that we will have enough time to react and fix problems as they arise.

It has also become clearer that recent progress in AI has been driven to a significant degree by increasing the amount of compute devoted to AI, which suggests a more continuous takeoff. That said, you could still take the position that current methods can’t do <property X> (say, causal reasoning), and so it doesn’t matter how much compute you use.

AI Progress. There was a lot of progress in AI.

Field building. There were posts aiming to build the field, but they were all fairly disjointed.

The long version (~8.3k words) starts here.

Basic analysis of AI risk

Agentic AI systems

Much of the foundational writing about AI risk has focused on agentic AI systems. This approach (recently discussed in the post and comments here) argues that since AI agents will be exerting a lot of optimization, there will be extreme outcomes in which our regular arguments may not work. This implies that we must adopt a security mindset (2017) to ensure alignment, and it suggests that proof-level guarantees may be more important at various stages of alignment research.


The foundational writing then goes on to point out that since powerful AI systems should not be able to be Dutch booked (i.e. have their resources stolen from them), they will be well modeled (2017) as expected utility maximizers. An AI system that maximizes expected utility is very likely to be dangerous. One reason was recently formalized in MDPs in which the agent gets a random utility function: using formalizations of power and instrumental convergence, we find some suggestive results that agents seek control over their future (from which we might infer that they will try to wrest that control from us).
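To make the Dutch book intuition concrete, here is a minimal money-pump sketch. It is an illustration only and is not taken from the linked posts. The items, the fee, and the function name `run_money_pump` are all hypothetical. An agent with cyclic preferences (A over B, B over C, C over A) will pay a small fee for each "upgrade" and end up holding what it started with, minus the fees.

```python
FEE = 1  # hypothetical price the agent pays to swap to an item it prefers

# preferred_over[x] is the item the agent strictly prefers to x.
# The preferences form a cycle: A > B, B > C, C > A.
preferred_over = {"B": "A", "C": "B", "A": "C"}


def run_money_pump(start_item, wealth, rounds):
    """Repeatedly offer the agent the item it prefers to its current one."""
    item = start_item
    for _ in range(rounds):
        if wealth < FEE:
            break  # the agent can no longer afford to trade
        item = preferred_over[item]  # the agent accepts the "upgrade"
        wealth -= FEE
    return item, wealth


item, wealth = run_money_pump("A", wealth=10, rounds=3)
```

After three trades the agent holds `"A"` again but has lost three units of wealth. An agent whose preferences can be represented by a utility function cannot be exploited by this particular cycle, which is the sense in which such arguments push towards modeling powerful systems as utility maximizers.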

However, it is not mathematically necessary that AI systems will have utility functions (except in a vacuous sense), and while there are intuitive and conceptual reasons to think that we will build goal-directed agents by default, there are alternative pathways that might be taken instead, and that are valuable to explore and build out to ensure AI safety.

This challenge to the usual argument for utility maximizers has prompted a series of articles exploring other variants of the argument, for example by restricting the class of utility functions to make it non-vacuous, or by saying that optimization processes in general will lead to goal-directed agents.

Comprehensive AI Services

Comprehensive AI Services (CAIS) also takes issue with the model of a single AGI agent hyper-competently pursuing some goal, and instead proposes a model in which different tasks are solved by specialized, competing AI services. This suggests that modularity across tasks is sufficiently useful that it will apply to AI, in the same way that it applies to humans (e.g. I have specialized in AI research, and not plumbing). The aggregate of all the services can accomplish any task, including the development of new services, making it comprehensive (analogous to the “general” in AGI). Since AI services can also do basic AI R&D, which leads to improvement in AI services generally, we should expect recursive technological improvement (as opposed to recursive self improvement). Note that CAIS does not necessarily suggest we will be safe, just that the traditional risks are not as likely as we may have thought, while other emergent risks are perhaps greater.

Critics often argue that end-to-end training and integrated agent-like architectures are likely to (eventually) outperform modular services. However, services can also be integrated through coordination. In addition, this post argues that this criticism mirrors old concerns that under capitalism firms will become too large -- a concern that the post argues did not pan out.

CAIS does allow for AI systems that are capable of learning across many domains: it simply argues that these AI systems will specialize for efficiency reasons, and so will only be competent at a small subset of domains. This decomposition of intelligence into learning + competence has been used to explain the variation in human abilities.

(This conversation is related to much prior conversation on Tool AI, which is listed here.)

Arguments for AI risk

There are many arguments for AI risk, with each of these posts providing a list of such arguments. It is unclear whether from an outside perspective this should be taken as evidence against AI risk (since different researchers believe different arguments and are aiming for different “success stories”) or as evidence for AI risk (because there are so many different sources of AI risk).

One argument that saw a lot of discussion was that we must figure out philosophy since the creation of AGI might “lock in” philosophical ideas. For example, we might not want to have AI systems with utility functions because of impossibility results in population ethics that suggest that every utility function would lead to some counterintuitive conclusion. Similarly, there are many proposals for how to define values; it may be necessary to figure out the right definition ahead of time. Rather than solving these problems directly, we could solve metaphilosophy, or delegate to humans who deliberate, whether idealized or real.

We might also worry that AIs will economically outcompete humans, give us technologies we aren’t ready for, or amplify human vulnerabilities.

Under continuous takeoff, two scenarios have been proposed for what failure looks like. First, AI differentially improves society’s capability to optimize metrics that are easy to measure, rather than ones that we actually care about. Second, AI agents could accidentally be trained to seek influence, and then fail catastrophically at some point in the future once they are sufficiently capable. One critique argues that these principal-agent problems only lead to bounded losses (i.e. they aren’t catastrophic), but several others disagree.

This post argues that there has been a shift in the arguments that motivate new AI risk researchers, and calls for more explanation of these arguments so that they can be properly evaluated.

Arguments against AI risk

Many views that expect the problem to be solved by default have also been written up this year.

A series of four conversations (summarized here) suggested that some engaged people expect AI to go well by default, because they are unconvinced by the traditional arguments for AI risk, find discontinuities in AI capabilities relatively unlikely, and are hopeful that there will be “warning shots” that demonstrate problems, which the existing ML community will then successfully fix.

One post lists several good outside-view heuristics that argue against AI x-risk, while another questions why value being complex and fragile must lead to high AI risk.

This talk argues that while AGI will intuitively be a big deal, it’s not obvious that we can affect its impact, and so it’s not obvious that longtermists should focus on it. It gives an analogy to trying to influence the impact of electricity, before electricity was commonplace, and suggests there was little impact one could have had on its safe use. It argues that accident risks in particular draw on fuzzy, intuitive concepts, haven’t been engaged with much by critics, and don’t sway most AI researchers.

Despite the seeming controversy in this and previous sections, it is worth noting that there is general agreement within the AI safety community on the following broader argument for work on AI safety:

  1. Superhuman agents are not required to treat humans well, in the same way that humans aren’t required to treat gorillas well.
  2. You should have a good technical reason to expect that superhuman agents will treat humans well.
  3. We do not currently have such a reason.

Agency and optimization

Mesa optimization

The problem of mesa optimization was explained in significantly more detail (see also this less formal summary). In mesa optimization, we start with a base optimizer like gradient descent that searches for a policy that accomplishes some complex task. For sufficiently complex tasks, it seems likely that the best policy will itself be an optimizer. (Meta learning is explicitly trying to learn policies that are also optimizers.) However, the policy could be optimizing a different goal, called the mesa objective, rather than the base objective.

Optimizing the mesa objective must lead to good base objective behavior on the training distribution (else gradient descent would not select it), but could be arbitrarily bad when off distribution. For example, a plausible mesa objective would be to seek influence: such an agent would initially do what we want it to do (since otherwise we would shut it down), but might turn against us once it has accumulated enough power.

This decomposes the overall alignment problem into outer alignment (ensuring that the base objective is aligned with “what we want”) and inner alignment (ensuring that the mesa objective is aligned with the base objective). This is somewhat analogous to different types (2017) of Goodhart’s law.

The paper and subsequent analysis identify and categorize relationships between the base and mesa objectives, and explain how mesa optimizers could fail catastrophically. Of particular interest is that mesa optimizers should be fast, but could still be misaligned, suggesting that penalizing compute is not enough to solve inner alignment.

Effectively, the concern is that our AI systems will have capabilities that generalize, but objectives that don’t. Since this is what drives risk, some suggest that we should talk about this phenomenon, without needing to bring in the baggage of “optimization”, a term we have yet to understand well, while others argue that even if we start with this definition, it would be useful to reintroduce the notions of optimization and agency.

One advantage of the original definition is that it specifies a particular mechanism by which risk arises; this gives us a foothold into the problem that allows us to propose potential solutions and empirical investigations. Of course, this is actively counterproductive if the risk arises by some other mechanism, but we might expect optimization to be especially likely because optimization algorithms are simple, and the phenomenon of double descent suggests that neural nets have an inductive bias towards simplicity.

What are optimization and agency, anyway?

Given the central importance of optimization to inner alignment and AI safety more broadly, we’d like to be able to formalize it. However, it’s not clear how to do so: while we want optimization to be about the mechanical process by which outcomes happen (as opposed to e.g. our beliefs about that process), we cannot simply say that X is an optimizer if it makes some quantity go up: by this definition, a bottle cap would be an optimizer for “keeping water in the bottle”.

It is also relevant how the system interacts with its environment, rather than just being about whether some number is going up. The type of computation matters: while older models of optimization involve an agent that can search over possible actions and simulate their results, other optimization processes must control their environment without being able to simulate the consequences of their choice.

Our use of the word “agency” might be tied to our models or specific human architectures, rather than being a general concept that could describe a mechanical property of a computation. This would be particularly worrying since it would mean that arguments for AI risk are based on our flawed models of reality, rather than an objective property about reality. However, this is extremely speculative.

Embedded agency

Discussions about AI usually assume a notion of the “actions” that an agent can take. However, the embedded agency sequence points out that this “Cartesian boundary” does not actually exist: since any real agent is embedded in the real world, you cannot make many assumptions that are common in reinforcement learning, such as dedicated and perfectly trusted input-output channels, a perfect model of the environment, an agent architecture that is uninfluenced by the environment, etc.

This means you can never consider all of the important information, and optimize everything that could be optimized. This has led to a couple of hypotheses:

  1. Real learning algorithms require modeling assumptions to solve the credit assignment problem, and so can only lead to partial agency or myopia. (See also this parable and associated thoughts.)
  2. Embedded agency works via abstraction, which is the key idea allowing you to make maps that are smaller than the territory.

Value learning

Descriptive embedded agency

While the embedded agency sequence is written from the perspective of prescribing how ideal agents should operate, we could also aim for a theory that can describe real agents like humans. This involves making your theory of agency correspondingly broader: for example, moving from utility functions to markets or subagents, which are more general. The development of such a theory is more grounded in concrete real systems, and more likely to generate theoretical insight or counterexamples, making it a good research meta-strategy.

Such a theory would be useful so that we can build AI systems that can model humans and human values while avoiding embedded agency problems with humans.

The difficulty of value learning

Even if we ignore problems of embedded agency, there are obstacles to value learning. For example, there need not be a reward function over observations that leads to what we want in a POMDP (though we could instead focus on instrumental reward functions defined on states).

Another key problem is that all you ever get to observe is behavior; this then needs to be decomposed into “beliefs” and “values”, but there is no clear criterion (2017) that separates them (although it hasn’t been proven that simplicity doesn’t work, and human priors help). This suggests that ambitious value learning, in which you identify the one true utility function, is hard.
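A toy illustration of why behavior alone underdetermines the decomposition (this is a sketch under assumed names and numbers, not the construction from the linked work): a fully rational planner paired with a reward function produces exactly the same behavior as a fully anti-rational planner paired with the negated reward function, so no amount of behavioral data can tell the two apart.

```python
ACTIONS = ["left", "right", "stay"]

# Hypothetical reward over actions, chosen only for illustration.
reward = {"left": 0.0, "right": 1.0, "stay": 0.2}
negated_reward = {action: -value for action, value in reward.items()}


def rational_planner(r):
    """Picks the action with the highest reward."""
    return max(ACTIONS, key=lambda action: r[action])


def anti_rational_planner(r):
    """Picks the action with the lowest reward."""
    return min(ACTIONS, key=lambda action: r[action])


behavior_rational = rational_planner(reward)
behavior_anti_rational = anti_rational_planner(negated_reward)
```

Both pairs choose `"right"`. An observer who only sees that choice cannot tell whether the agent values `"right"` and plans well, or disvalues it and plans as badly as possible; some additional assumption about the planner is needed to break the tie.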

Human models

For an agent to outperform the process generating its data, it must understand the ways in which that process makes mistakes. So, to outperform humans at a task given only human demonstrations of that task, you need to detect human mistakes in the demonstrations. Modeling humans to this fidelity is an unsolved problem, though there is a little progress, and we might hope that we can make assumptions about the structure of the model.

Any such model is likely to be misspecified, and value learning algorithms are not currently robust to misspecification: in one case, the simpler but less conceptually accurate model is more robust.

You might hope that if we give up on outperforming humans and just imitate them, this would be safe. Even this is controversial, because perhaps humans themselves are unsafe, maybe imitating humans leads to mesa optimization, or possibly perfect imitation is too hard to achieve.

You might also hope that AI systems have good enough models that you can simply provide natural language instructions and the AI does what you mean.

The presence of human models in an AI system has a few unfortunate effects:

  1. We can’t test an AI system by seeing if it agrees with human judgment, because the AI system may be using its human model to (in the short term) optimize for agreement with human judgment.
  2. A bug in the code is more likely to optimize for suffering (since the human model would include the concept of suffering).
  3. If humans are modeled with sufficient fidelity, these models may themselves be conscious and capable of suffering.

Learning an adequate utility function

Despite the objections that learning values is hard, it seems like humans are pretty good at learning the values of other humans, even if not perfect. Perhaps we could replicate this, in order to learn an adequate utility function that leads to okay outcomes?

The main issue is that we are only good at predicting human values in normal situations, while powerful AI systems will likely put us in extreme situations where we will disagree much more about values. As a result, we need a theory of human values that defines what to do in these situations. One theory, associated value learning agenda, and toy model propose that we can extract partial preferences from human mental models, and synthesize them together into a full utility function, while respecting meta-preferences about preferences and the synthesis process and taking care to properly normalize utilities.

In fact, the core pieces of such an approach seem necessary for any solution to the problem. However, this research agenda depends upon solving many hard problems explicitly in a human-understandable way, which doesn’t jibe with the bitter lesson that ML progress primarily happens by using more compute to solve harder problems.

I don’t agree that the core pieces identified in this research agenda must be solved before creating powerful AI, nor that we must have explicit solutions to the problems.

Uncertainty over the utility function

We could also make the AI uncertain about the utility function, and ensure that it has a way to learn about the utility function that is grounded in human behavior. Then, as an instrumental goal for maximizing expected reward, the AI will choose actions with high expected information gain. While this was proposed earlier (2016), the book Human Compatible (summary, podcast 1, podcast 2, interview) explores the idea in much more detail than previous writing, and it has now made its way into deep reinforcement learning as well.

Intuitively, since the AI is uncertain about the true reward, it will behave conservatively and try to learn about the true reward, thus avoiding Goodhart’s law (see also fuzziness). Of course, once the AI has learned everything there is to learn, it will behave (2015?) just like a regular utility maximizer. In this setting, you would hope that the AI has become aligned with the true utility function, as long as its initial distribution over utility functions contains the truth, and the observation model by which its distribution is updated is “correct”. However, it might be quite difficult to ensure that these actually hold. This also depends on the assumption that there is a true utility function, and that the human knows it, which is not the case, though this is being addressed.
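The updating process described above can be sketched as a small Bayesian example. All specifics are assumptions made for illustration: the two candidate utility functions, the Boltzmann-rational observation model with parameter `BETA`, and the function names. None of this is claimed to be the formulation used in the linked work.

```python
import math

ACTIONS = ["tea", "coffee"]

# Hypothetical candidate utility functions the AI is uncertain between.
CANDIDATES = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}

BETA = 2.0  # assumed rationality parameter of the human observation model


def choice_likelihood(action, utility):
    """P(human picks action | utility), assuming Boltzmann-rational choice."""
    weights = {a: math.exp(BETA * utility[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())


def update(prior, observed_action):
    """One step of Bayes' rule over the candidate utility functions."""
    unnormalized = {
        name: prior[name] * choice_likelihood(observed_action, utility)
        for name, utility in CANDIDATES.items()
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
for observed in ["tea", "tea", "coffee", "tea"]:
    belief = update(belief, observed)
```

After these four observations most of the probability mass sits on `"likes_tea"`. The caveats in the text show up directly in the code: if the true utility function is not in `CANDIDATES`, or if `choice_likelihood` misdescribes how the human actually chooses, the posterior will still concentrate confidently, just on the wrong answer.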

One important feature of this agenda is that rather than requiring a perfect utility function to begin with, the AI can learn the utility function by interacting with the human; such a feedback mechanism can make a problem much easier. Interaction also opens up other possibilities, such as learning human norms instead of values. However, it is computationally difficult, and so more research would be needed to make it a viable solution.

Current methods for learning human preferences

There has been a lot of practical work on learning human preferences, including:

There are many recent papers that I haven’t cited here, as it is a very large area of work.


Safe reinforcement learning

We would like to ensure that our AI systems do not make mistakes during training. With preference learning, we can do this by learning human preferences over hypothetical behaviors that are not actually executed. Another option is to provide safety constraints and ensure that the AI never violates them (even during training), or at least to significantly reduce such violations.

Avoiding all mistakes would require us to have a formal specification of what a “mistake” is, or to have some overseer that can identify “mistakes” before execution, so that our AI could avoid the mistake even though it hasn’t seen this situation before. This seems prohibitively hard to me if we include literally all “mistakes”.

Adversarial examples

Adversarial examples are a clear demonstration of how the “cognition” of neural nets is different from our own: by making superficial changes to the input that would not matter to a human, you can completely change the output of the neural net. While I am not an expert here, and certainly have not read the huge mountains of work done over the last year, I do want to highlight a few things.

First, while we might nominally think of adversarial examples as “bugs” in our neural net, this paper shows that image classifiers are picking up real imperceptible features that do generalize to the test set. The classifiers really are maximizing predictive accuracy; the problem is that we want them to predict labels based on the features that we use, instead of imperceptible (but predictive) features. Adversarial training removes these fragile features, leaving only the robust features; this makes subsequent applications easier.

While the paper was controversial, I thought that its main thesis seemed to be supported even after reading these six responses.
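As a rough illustration of how many tiny input changes can add up to a changed output, here is a sketch of a sign-of-gradient attack within an L-infinity ball, applied to a hypothetical linear classifier. The weights, the input, and the value of `eps` are made up for the example; real attacks target trained neural nets, not a hand-written linear model.

```python
def score(w, x):
    """Linear classifier: the predicted class is the sign of the score."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return (v > 0) - (v < 0)

def linf_attack(w, x, eps):
    """Move every coordinate by at most eps in the direction that hurts the score.

    For a linear model the gradient of the score with respect to x is w,
    so the worst-case perturbation in an L-infinity ball is eps * sign(w).
    """
    direction = -1.0 if score(w, x) > 0 else 1.0
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]

d = 10_000  # many input dimensions, each with a tiny weight
w = [0.01 if i % 2 == 0 else -0.01 for i in range(d)]
x = [0.5 + 0.01 * sign(wi) for wi in w]  # values near 0.5, score is about +1.0

x_adv = linf_attack(w, x, eps=0.015)     # each value moves by only 0.015
```

Here `score(w, x)` is positive while `score(w, x_adv)` is negative, even though no single coordinate changed by more than 0.015: the perturbation is small per dimension but aligned with the weights across all of them.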

Second, there has been a distinct shift away from the L-infinity norm ball threat model of adversarial examples. So far, it seems that robustness to one set of perturbations doesn’t grant robustness to other perturbations, prompting the development of multiple perturbations, a benchmark of natural adversarial examples, and new evaluation metrics. While the L-infinity norm ball is an interesting unsolved research problem, it is in no way a realistic threat model.

Third, adversarial attacks are now being proposed as a method for evaluating how robust an agent trained by reinforcement learning is. This seems especially important since in RL there is often no train-test split, and so it is hard to tell whether an agent has “memorized” a single trajectory or actually learned a policy that works well across a variety of circumstances.

Intent alignment

Ultimately, robustness seeks to identify and eliminate all “bugs”, i.e. behaviors that are inconsistent with the specification (see also this podcast). Instead of considering all the mistakes, we could seek to only prevent catastrophic mistakes, and ensure that the AI is intent aligned, that is, it is always trying to do what we want. This goal avoids many of the pitfalls around the goal of designing an AI with the right utility function.


One promising way in which an AI could be intent aligned is by being corrigible: roughly, the AI is not trying to deceive us, it clarifies its uncertainty by asking us, it learns about our preferences, it shuts down if we ask it to, etc. This is a narrower concept than intent alignment: an AI that infers our “true” utility function and optimizes it may wrest control away from us in order to expand faster, or make us safer; such an AI would be aligned but not corrigible. There are a few benefits of using corrigibility:

  1. It can be achieved with relatively low levels of intelligence (we can imagine corrigible humans)
  2. It seems to have a positive feedback loop (that is, an AI that reaches some “threshold” of corrigibility would tend to become more corrigible)
  3. It doesn’t seem to require any domain expertise.

(A similar idea would be to build an AI system that only takes actions that the overseer has given informed consent for.)

Note that MIRI’s notion of corrigibility (2015) is similar but much stricter. My guess is that MIRI wants the same intuitive corrigibility properties, but wants them to be created by a simple change to the utility function. Simplicity helps ensure that it cannot be gamed, and the utility function means that you are changing what the AI cares about, rather than trying to constrain a powerful superintelligence. For example, I’d guess that MIRI-corrigibility can depend on whether a shutdown button is pressed, but cannot depend on the reasons for which the shutdown button is pressed.

If you set aside the utility function requirement, then this property can be achieved using constrained optimization: the agent can optimize normally when the button is not pressed, while ensuring that it is still able to shut down if necessary, and it can optimize for shutting down if the button is pressed. If you set aside the simplicity requirement, then you can define the desired policies and recover the correct utility function. But from now on I’m only going to talk about the notion of corrigibility I first introduced.

It has been argued that while corrigibility is simpler than “human values”, it is a “non-natural” type of cognition, such that you are unlikely to be able to find corrigible intelligences with machine learning. (I do not feel the force of this intuition; I agree much more with the earlier intuitions.)

You might be worried that since a corrigible AI defers to us, if we were about to take a suboptimal action that we couldn’t tell was suboptimal, the AI wouldn’t stop us from doing so because it can’t explain to us what would be bad about the world. However, at the very least, it can say “this is bad for reasons I can’t fully explain”.

Worst case guarantees

We still want to guarantee that there will never be a failure of corrigibility, which can’t be done with regular ML techniques, which only give an average-case guarantee. In order to get a worst-case guarantee, we need other techniques. One proposal is to use adversarial training to find abstracted inputs on which the agent is incorrigible, where the adversary is aided by interpretability techniques that allow the adversary to understand what the agent is thinking. It would be particularly nice to find a mechanistic description of corrigibility, as that would make it easier to verify the absence of incorrigible behavior.

Critics argue that this could never work because machine learning wouldn’t learn the “intended” interpretation of corrigibility, and could be adversarial. I don’t think this objection is critical. It seems like it is saying that ML will fail to generalize and there will be situations in which the concept of corrigibility breaks down, but the entire point of adversarial training is to find these situations and train the agent away from them.

While this is usually tied in to the broader iterated amplification agenda, it seems to me that solving just this subproblem would achieve a lot of the value of the full agenda. If we had a way of applying adversarial training to an arbitrary AI agent, such that we are very likely to find potential inputs on which the agent is incorrigible, then presumably AI systems that could be incorrigible would not be deployed. Iterated amplification adds additional safety in that it (hopefully) allows you to assume a smarter, already-aligned adversary, whereas a direct solution to this subproblem would have an approximately-as-capable, not-automatically-aligned adversary, which would probably not have a worst-case guarantee but might still be good enough.

Scaling to superhuman abilities

Iterated amplification

Iterated amplification carves out a broad class of algorithms that can scale to superhuman abilities, with the hope that we can analyze the alignment properties of the entire class of algorithms at once. Algorithms in this class have two components:

  1. Amplification, which increases an agent’s capabilities, at the cost of efficiency.
  2. Distillation, which increases an agent’s efficiency, at the cost of capability.

Given this, starting from some base agent, the algorithm alternates amplification and distillation, to get successively more capable agents, as long as each component is good enough.

Given this broad class of algorithms, we can instantiate many specific algorithms by picking a specific amplification step and a specific distillation step. For example, the amplification step can be done by allowing an overseer to decompose the problem into subproblems, which is especially promising for question answering. Distillation could be done using supervised learning, imitation learning, or reinforcement learning.

Recursive reward modeling (podcast) is another algorithm that could allow us to scale to superhuman abilities. It can be cast as an algorithm in the iterated amplification class by considering an amplification step that takes agents that can evaluate some set of tasks, and builds new human-agent teams that can evaluate some more complex set of tasks. The distillation step would then be reinforcement learning, to get an agent that can directly solve the more complex tasks. Iterating this eventually leads to an agent that can solve the original desired task.

Iterated amplification does impose a particular structure on algorithms, which can be applied to existing ML problems. However, this may be uncompetitive if the best ML algorithms require different algorithmic structures or different environments, in order to reach high capabilities (though we could then train a question-answering system alongside the other algorithm / environment, which plausibly doesn’t take too many more resources).

The iterated amplification sequence, recursive reward modeling paper, and these posts help explain the full agenda better.


Quantilization (2015) allows you to amplify a base policy by randomly selecting among the top 1/Q of actions the base policy could take, at a cost of at most Q-fold increase in risk. However, this can forgo benefits of the rest of the base policy. Since quantilization increases risk, it cannot be safely iterated: for example, if you start with a policy with a worst-case 1% chance of failure, and you 5-quantilize it, you now have a worst-case 5% chance of failure. After two more iterations of 5-quantilization, there is no longer a worst-case bound on failure probability.
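The arithmetic above can be sketched for a finite action set as follows. This is a simplified illustration, not the construction from the original paper: the uniform base policy, the action names, and the utility function are hypothetical.

```python
def quantilized_distribution(base_probs, utility, q):
    """Condition the base policy on its top 1/q of probability mass, ranked by utility."""
    ranked = sorted(base_probs, key=utility, reverse=True)
    budget = 1.0 / q
    kept = {}
    for action in ranked:
        if budget <= 1e-12:  # tolerance guards against floating-point leftovers
            break
        mass = min(base_probs[action], budget)
        kept[action] = mass
        budget -= mass
    total = sum(kept.values())
    return {action: mass / total for action, mass in kept.items()}

def failure_bound(base_failure, q, iterations=1):
    """Each q-quantilization multiplies any action's probability by at most q."""
    return base_failure * q ** iterations

# Hypothetical base policy: uniform over ten actions, utility = action index.
base = {f"a{i}": 0.1 for i in range(10)}
dist = quantilized_distribution(base, utility=lambda a: int(a[1:]), q=5)
```

With these numbers, `dist` puts probability 0.5 on each of the two best actions, so every action is at most 5 times as likely as under the base policy. That ratio is where the Q-fold risk bound comes from: `failure_bound(0.01, 5)` is 0.05, and `failure_bound(0.01, 5, iterations=3)` exceeds 1, at which point the bound says nothing.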


Another mechanism for scaling beyond humans is debate (podcast), in which an AI agent is trained via self-play in a zero-sum game in which its goal is to “win” the debate, as evaluated by a human judge. The key hope is that detecting a lie is easier than lying: if one of the players lies or deceives or manipulates the human, then the other player can reveal that and thereby win the debate. If this were true, we would expect that the equilibrium behavior is for the agent to provide honest, useful information.

Since its proposal, debate has been tested with MNIST and Fashion MNIST, as well as question answering. There is also a proposal to use it to improve iterated amplification.

Theoretical work brings up the possibility of questions that are “too hard”: while sufficiently long “feature debates” are provably truth-seeking (because the debaters can reveal all of their information), it is possible to construct complex questions in which the debate doesn’t find the right answer. However, the results don’t generalize well from feature debates to real debates.

Relatedly, even if it is easy to detect lies, it’s not clear what would happen with ambiguous questions.

Since debate doesn’t involve alternating between increasing capabilities and increasing efficiency, it isn’t an instance of iterated amplification. However, both iterated amplification and debate are aiming to compute the answer that an exponentially large tree of bounded humans would arrive at (see next section), and so it seems likely that either they would both work, or neither would work.

Factored cognition

Both iterated amplification and debate depend on the factored cognition hypothesis: that arbitrarily complex tasks can be performed arbitrarily well by a giant tree of bounded base agents, possibly extended with features like shared external memory or long-lived assistants (2016).

Iterated amplification checks local nodes in a tree of considerations and broken-down questions, in which an assistant at level k decomposes its questions, gets answers from assistants at level k-1, and combines them into an overall answer. Meanwhile, in debate, if the two agents disagree, they will play down the most difficult / contested path in an exponential tree of arguments and counterarguments, so the debate training procedure is checking a single path from root to leaf in the exponential tree.
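The level-k decomposition can be sketched with a toy task. Summing a list stands in for “answering a question”, and the function below is a hypothetical illustration in which each assistant only ever does a bounded amount of work (reading one number, or adding two answers).

```python
def answer(question, level):
    """An assistant at `level` answers `question` (a list of numbers to sum).

    It decomposes the question into two halves, delegates each half to an
    assistant at level - 1, and combines the two answers.
    """
    if len(question) == 1:
        return question[0]            # small enough to answer directly
    if level == 0:
        raise ValueError("question too large for a base agent")
    mid = len(question) // 2
    left = answer(question[:mid], level - 1)
    right = answer(question[mid:], level - 1)
    return left + right               # bounded work: combine two sub-answers

total = answer(list(range(8)), level=3)
```

A tree of depth k can handle up to 2^k numbers, so the work available grows exponentially with depth even though every individual node stays bounded. Whether real tasks decompose this cleanly is exactly what the factored cognition hypothesis asserts.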

It is an open question whether the factored cognition hypothesis is true. Empirical work has been scaling up, and we should hopefully have some informative evidence in the upcoming year.

The main reason people are skeptical of the hypothesis is that it seems that sufficiently complex tasks require building up big contexts or using globally-constructed intuitions or “inexplicable flashes of insight”. This could be done if the “small” agents simulated an arbitrary Turing Machine, but this would lose any guarantees of alignment.

However, we might expect that these tasks could still be done by a tree of humans: humans are allowed to use a heuristic “just because it works”; this should allow the tree of humans to use heuristics that other agents use, including “inexplicable flashes of insight”.


Alignment of the tree of humans

In order for this tree of humans to be aligned (a necessary condition for iterated amplification or debate to be aligned), the initial agent must already be aligned, and putting the agents together must not destroy alignment. One intuition that this is hard is that alignment is not compositional; a “big” agent made up of “small” aligned agents need not be aligned. However, the hope doesn’t depend on compositionality of alignment; it instead depends on ensuring that your agents never do incorrigible optimization.

In addition, it could be the case that “large” initial agents like humans (or human imitations) are not robustly aligned, because there may be some clever argument that causes them to behave incorrigibly. One response would be to use low-bandwidth overseers as the initial agent, who only answer very “small” questions on which we are relatively confident that there are no such failures. We would also hope to train humans to properly decompose questions and behave corrigibly, so that putting together several humans remains corrigible (a task for which we need social scientists).

Note that it is only competitive to approximate the tree of humans with iterated amplification if we expect that any powerful AI systems will also be trained in a manner similar to iterated amplification. If we instead consider a model in which ML perfectly optimizes a function (rather than performing iterated local search), then iterated amplification would be far more expensive than unaligned powerful AI systems. It would be worth studying this simpler model to see if alignment is possible there.

Ascription universality

Even if we know that the tree of humans is aligned, we also need to ensure that the model trained from oversight from the tree of humans will also be aligned. The key claim in favor of this is that HCH (the tree of humans) is universal, that is, it “knows” any facts that a sufficiently smaller computation “knows”. This was formalized here and applied to multiple problems, including the problem that malign optimization might emerge within HCH. While a good explanation of this is out of scope here, I summarized these posts here. Ascription universality does have to be applied to the entire training process and not just the final model.


Since we want to be able to “know everything the model knows”, and also to be able to find situations under which a model behaves incorrigibly (see worst case guarantees above), it would be very useful to be able to peer inside our models and understand what they are doing. It would be particularly useful to be able to identify optimization processes and understand how they come about.

Even though interpretability tools probably could not deal with already deceptive models, since the deceptive models could figure out how to fool the tools, it seems likely that interpretability could help prevent deception from ever arising -- hopefully an easier task.

However, interpretability has other uses besides catching problems: it could also be used to get more understandable models during training, provide feedback on the process by which a model makes a decision (rather than feedback on just the decision), or create ML techniques that help us understand the world without acting in it (thus avoiding problems with agential AI).

Unfortunately, I haven’t kept up with interpretability research, so I can’t say how it’s progressed recently, but one paper you could start with is activation atlases.

Impact regularization

Impact measures

In 2018, there was a lot of progress on proposing specific impact measures, including relative reachability and attainable utility preservation (followup, paper). These were recently unified as using similar underlying algorithms but with different “deviation measures”: the former considers the change in number of reachable states, whereas the latter considers the change in attainable utility (for some set of utility functions).
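As a rough sketch of the attainable utility idea, the penalty below compares how much auxiliary utility is attainable after an action with how much would be attainable after doing nothing. The auxiliary Q-values, the action names, and the `strength` parameter are made up for illustration, and the published method may include details (such as scaling) that are omitted here.

```python
def aup_penalty(aux_q_values, action, noop="noop"):
    """Average absolute change in attainable utility, relative to doing nothing.

    aux_q_values: one dict per auxiliary utility function, mapping each
                  action to the utility attainable after taking it.
    """
    diffs = [abs(q[action] - q[noop]) for q in aux_q_values]
    return sum(diffs) / len(diffs)

def regularized_reward(task_reward, aux_q_values, action, strength=1.0):
    """Task reward minus the impact penalty (strength is a hypothetical knob)."""
    return task_reward - strength * aup_penalty(aux_q_values, action)

# Hypothetical numbers: breaking the vase destroys the ability to pursue
# the first auxiliary goal, while walking around it changes nothing.
aux_q_values = [
    {"noop": 1.0, "walk_around_vase": 1.0, "break_vase": 0.0},
    {"noop": 0.5, "walk_around_vase": 0.5, "break_vase": 0.5},
]
```

With these numbers, `aup_penalty(aux_q_values, "break_vase")` is 0.5 and `aup_penalty(aux_q_values, "walk_around_vase")` is 0.0, so the regularized agent prefers the low-impact route unless the task reward for breaking the vase is large enough to outweigh the penalty.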

These two posts summarize the work on impact (going back to 2012).

What is impact, anyway?

The Reframing Impact sequence aims to build intuitions about what we mean by “impact”, and concludes that an action is impactful if it changes our ability to get what we want. Of course, this definition depends on “what we want”, whereas usually with impact regularization we want something that is easy to specify. However, we might hope that impact is relatively goal-agnostic, because for most goals you need to pursue the same convergent instrumental subgoals. In particular, we might hope for a formalizable notion of power, that attainable utility preservation could penalize.

To better distinguish between different definitions and techniques for measuring impact, this post proposes several test cases for impact regularization.

Utility of impact measures

The mainline use case for impact regularization is to be an “additional layer of defense”: if for some reason we fail to align an AI system, then hopefully there still won’t be catastrophic consequences, because the AI system only takes low-impact actions. However, this may fail to work for a variety of reasons. Still, work on impact measures could be useful for deconfusion, testing protocols, temporary alignment measures, or value-neutrality verification.

Causal modeling

Causal influence diagrams help us understand what a training process does. Given a causal influence diagram, we can determine observation incentives (what an agent would like to know) and intervention incentives (what an agent would like to change). We can produce such diagrams for AGI safety frameworks, and analyze solutions to reward function tampering, user feedback tampering, and observation tampering. For example, it allows us to show that if the agent’s plans are evaluated by the current reward, then there is no incentive for the agent to tamper with its reward function.

The variables of the diagrams represent important components of the agent and the environment (such as reward functions and dynamics models in the agent, and the user’s preferences and the state of the world in the environment). Different ways of combining these into agent setups lead to different causal influence diagrams. The incentive analysis enables the designer to choose agent setups with good incentive properties.
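The claim about reward function tampering can be illustrated with a toy planner. This is only a sketch under made-up assumptions: the two actions, the reward values, and the `simulate` function are hypothetical, and the actual analysis is done on causal influence diagrams rather than on code like this.

```python
def original_reward(state):
    """The reward function the agent has at planning time."""
    return 1.0 if state["task_done"] else 0.0

def simulate(plan, reward_fn):
    """Return (final world state, final reward function) after executing `plan`."""
    state = {"task_done": False}
    for action in plan:
        if action == "work":
            state["task_done"] = True
        elif action == "tamper":
            reward_fn = lambda s: 100.0  # the reward function gets overwritten
    return state, reward_fn

def plan_value(plan, evaluate_with):
    final_state, final_reward_fn = simulate(plan, original_reward)
    if evaluate_with == "current":
        # Plans are scored by the reward function held at planning time.
        return original_reward(final_state)
    # Plans are scored by whatever reward function exists afterwards.
    return final_reward_fn(final_state)

plans = [("work",), ("tamper",)]
best_current = max(plans, key=lambda p: plan_value(p, "current"))
best_future = max(plans, key=lambda p: plan_value(p, "future"))
```

Scoring plans with the current reward function selects `("work",)`, since tampering does nothing to improve the outcome as currently valued; scoring with the post-plan reward function selects `("tamper",)`.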

However, the causal models themselves are not uniquely determined. For example, what counts as wireheading is relative to the stance taken towards the system and its desired goals: if you define it as taking control of some “narrow measurement channel”, then what counts as a measurement channel and what the goal is both depend on modeling assumptions.


Oracles also benefit from reasoning about causality and influences. A system that maximizes predictive accuracy ends up choosing self-confirming predictions, which can be arbitrarily bad. (This affects self-supervised learning in addition to oracles.) You might hope to avoid this by preventing the AI system from being aware of itself, but this doesn’t work.

Instead, we could ensure that the oracle makes predictions conditional on the predictions not influencing anything (using randomization to do so). There are still other problems besides self-confirming predictions, such as acausal trade.
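Here is a minimal sketch of the randomization idea. With some probability the prediction is hidden from everyone, and only those episodes produce a training signal. The toy world (where a visible prediction is perfectly self-confirming), the erasure probability, and the squared-error loss are all assumptions made for the example.

```python
import random

def run_episode(prediction, erasure_prob, rng):
    """Toy world: a visible prediction causes itself; a hidden one cannot."""
    erased = rng.random() < erasure_prob
    if erased:
        outcome = 0.0                       # what happens with no influence
        loss = (prediction - outcome) ** 2  # only erased episodes are scored
    else:
        outcome = prediction                # humans react, confirming it
        loss = None                         # no training signal
    return erased, outcome, loss

def average_training_loss(prediction, episodes=1000, erasure_prob=0.1, seed=0):
    rng = random.Random(seed)
    losses = []
    for _ in range(episodes):
        _, _, loss = run_episode(prediction, erasure_prob, rng)
        if loss is not None:
            losses.append(loss)
    return sum(losses) / len(losses)
```

In this toy, predicting 1.0 would look perfectly accurate whenever the prediction is shown, but it is penalized on every scored episode, so `average_training_loss(1.0)` is higher than `average_training_loss(0.0)`. The oracle is pushed towards predicting what would happen if nobody saw its output.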

Decision theory

There’s been a lot of work exploring the intuitions behind decision theory. Since I don't follow decision theory closely, I’m not going to try and summarize the conversation, and instead you get a list of posts: pro CDT, anti CDT, anti FDT, actually it all depends on counterfactuals, anti UDT because of commitment races, UDT doesn’t work with AIXI, strange reasoning in Troll Bridge, a comparison across decision theories, counterfactual induction posts. There’s also been some discussion of why people care about decision theory: it is useful for improving rationality, finding problems, and deconfusion.

Relatedly, this paper characterizes the decision theories of existing agents, and this post explains how “Pavlov” strategies (similar to reinforcement learning) can work well with game theory.
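For readers unfamiliar with Pavlov, here is a small sketch of the usual win-stay, lose-shift rule in an iterated prisoner’s dilemma: cooperate if both players made the same move last round, otherwise defect. The setup below (two Pavlov players recovering from one accidental defection) is an illustration of the general strategy, not a reproduction of the linked post.

```python
def pavlov(my_last, their_last):
    """Win-stay, lose-shift: cooperate iff both players moved alike last round."""
    return "C" if my_last == their_last else "D"

def play(first_moves, rounds):
    """Two Pavlov players, starting from the given pair of moves."""
    a, b = first_moves
    history = [(a, b)]
    for _ in range(rounds):
        a, b = pavlov(a, b), pavlov(b, a)
        history.append((a, b))
    return history

# Player A defects once by accident while player B cooperates.
history = play(("D", "C"), rounds=3)
```

The history is `("D", "C")`, then `("D", "D")`, then `("C", "C")` from there on: after a single mistake, the pair returns to mutual cooperation within two rounds.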

As we get to the end of the technical alignment section, I want to mention BoMAI, which didn’t fit in any of the sections. BoMAI is an AIXI-like system that does not seek power, because it only cares about reward until the end of the episode (myopia), and during the episode it is confined to a box from which information cannot leave. Such an AI system can still be useful because there is also a human in the box, who can transmit information to the outside world after the episode has ended.

Strategy and coordination

So far I’ve been talking about the technical work on the alignment problem. Let’s now switch to more “meta” work that tries to predict the future in order to prioritize across research topics.

Continuous vs discontinuous takeoff

A central disagreement among AI researchers is about how “quickly” AI improves once it reaches human level. Recently, the question has been distilled to whether there will be a discontinuity in AI capabilities. As a result, I will ask whether takeoff will be continuous or discontinuous (as opposed to slow or fast).

One operationalization of this question is whether there will be a 4-year doubling of GDP that ends before the first 1-year doubling of GDP starts. Note that continuous takeoff need not be slow: to get to 4-year doubling, you need superexponential growth. Under exponential growth, the doubling time stays fixed at its current value of a few decades. Extrapolating historical growth trends (which “supports the possibility of radical increases in growth rate”) would still (probably) be compatible with this operationalization.
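The relationship between growth rates and doubling times can be checked with a few lines of arithmetic. The 3% figure below is an illustrative assumption for current annual growth, not a number from the posts being summarized.

```python
import math

def doubling_time_years(annual_growth_rate):
    """Doubling time under constant exponential growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)

def growth_rate_for_doubling(years):
    """Constant annual growth rate needed to double in the given number of years."""
    return 2 ** (1 / years) - 1

current = doubling_time_years(0.03)       # a few decades at 3% per year
four_year = growth_rate_for_doubling(4)   # roughly 19% per year
one_year = growth_rate_for_doubling(1)    # 100% per year
```

Under a fixed growth rate the doubling time never changes, so reaching a 4-year doubling (and later a 1-year doubling) requires the growth rate itself to rise, which is what superexponential growth means here.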

Two posts argue for continuous takeoff; the main argument is that continuity is very likely for properties that people care about, since lots of people are trying to make progress on the property, and it is less likely that we quickly invest much more effort into making progress on the property. So far, there has not been a compelling response, but this does not mean that researchers agree.

There has been some discussion of particular properties that make discontinuous takeoff seem more likely (though I would guess that they are not the arguments that MIRI researchers would make). For example, perhaps we just need to find the one correct architecture, which will then cause a discontinuity, but note that birds and primates have independently evolved neural architectures that both work well.

Alternatively, AI systems with different explicit utility functions could cooperate by merging to pursue a joint utility function, making them much more effective at coordination than humans, allowing them to avoid principal-agent problems that plague human corporations. This could lead to a discontinuous jump. AI systems could also build monopolies through such coordination to obtain a decisive strategic advantage.

We could also expect that just as the invention of culture and social learning by evolution allowed humans to become the dominant species very quickly (relatively speaking), similarly once AI systems are capable of social learning they may also “take off” discontinuously. However, the same argument could be taken as evidence against a discontinuity, since current natural language systems like GPT-2 could already be thought of as processing culture or doing social learning.

It is worth noting that questions about recursive self improvement and decisive strategic advantage do not map cleanly onto the question of takeoff speeds, though they are related. The primary reason takeoff speed is important is that it determines whether or not we will be able to respond to problems as they come up. For this purpose, it’s probably better to define takeoff speed with respect to the amount of work that can be done as AI takes off, which might differ significantly from calendar time.

The importance of compute

There is a strong case that the most effective methods (so far) are the ones that can leverage more computation, and the AI-GA approach to general intelligence is predicated on this view (for example, by learning good learning environments). In fact, since the rise of deep learning in 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time. It’s important to note the caveat that we cannot simply increase compute: we also need good data, which is sparse in rare, unsafe situations (consider driving when a pedestrian suddenly jumps on the road). This may require human knowledge and explicit models.
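To get a feel for what a 3.4-month doubling time implies, this small calculation converts it into a growth factor over a given period. It assumes the trend is a clean exponential, which is a simplification.

```python
def growth_factor(months_elapsed, doubling_time_months=3.4):
    """Multiplicative increase in compute after `months_elapsed` months."""
    return 2 ** (months_elapsed / doubling_time_months)

per_year = growth_factor(12)  # between 11x and 12x per year
```

A 3.4-month doubling time therefore corresponds to more than an order of magnitude of growth per year.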

Since it seems more likely that compute grows continuously (relative to a “deep insights” model), this would argue for a more continuous takeoff. However, you may expect that we still need deep insights, potentially because you think that current techniques could never lead to AGI, due to their lack of some property crucial to general intelligence (such as causal reasoning). However, for any such property, it seems that some neural net could encode that property, and the relevant question is how big the neural net has to be and how long it takes for local search to find the right computation.

Sociological evidence

It has recently become more common to critique the field of AI as a whole, which should (arguably) cause you to lengthen your timelines. For example, hypothesizing after the results are known makes for bad science that doesn’t generalize, and research that is “reproducible” in the sense that the code can be rerun to get the same results need not have external validity. There is also a tendency for researchers to throw trial and error at problems, which means that with repeated trials by chance we can get results that look significant. It also means that researchers don’t understand the systems they build; reorienting the field to focus on understanding could make our design decisions more deliberate and make it more likely that we build aligned AIs.
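The point about repeated trials can be quantified with a short calculation. It assumes the trials are independent and that there is no real effect, which is a simplification of how research is actually done.

```python
def chance_of_spurious_result(num_trials, alpha=0.05):
    """Probability that at least one of `num_trials` null experiments
    looks significant at level `alpha`, assuming independent trials."""
    return 1 - (1 - alpha) ** num_trials

one_try = chance_of_spurious_result(1)       # 5%
twenty_tries = chance_of_spurious_result(20)  # roughly 64%
```

With twenty attempts at the conventional 5% threshold, a “significant” result is more likely than not to appear even when nothing is going on.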

We should also expect that at least industry research is biased towards short timelines, since any companies that didn’t argue for short timelines would be much less likely to get funding.

Meta work on forecasting

While forecasting the future is notoriously hard, collaborative and checkable forecasting is even harder. It would be nice to at least reduce the difficulty back down to “regular” forecasting. Three steps have been taken towards this:

  1. People need to agree on the meaning of the terms used; an AI forecasting dictionary has been developed for this purpose.
  2. In order to be checkable, questions need to be operationalized; but then it is often the case that the primary determinant of the answer to a question depends on some “distractor” feature. For example, whether we have a superhuman AI at <game> by 2025 depends a lot on who tries to make such an AI, rather than whether we have the technical ability to make such an AI. A partial solution was to create a resolution council, and instead have questions ask about the future opinion of the resolution council.
  3. This post provides advice on how to write good forecasting questions, with a database of examples.

Of course, there is still the hard problem of actually figuring out what happens in the future (and it’s even hard to tell whether long-run forecasting is feasible). The Good Judgment Project studied practices that help with this problem, summarized here.

Another issue arises when asking members of a group (e.g. AI researchers) about outcomes that depend on actions within that group: due to the bystander effect, everyone may predict that the group will solve a problem, even though they themselves are not trying to solve the problem. So, we should instead ask people to make predictions about the proportion of members that try to solve a problem, and compare that to the proportion of members who say that they are trying to solve the problem.

AI Progress

A full update on AI progress in 2019 would be far too long, so here I’ll just mention some results I found interesting, which biases towards 1. results involving “throwing compute at the problem”, and 2. understanding deep learning.

Reinforcement learning

  1. AlphaStar (update, discussion) became extremely good at Starcraft.
  2. OpenAI Five beat the world champions at Dota, and could play cooperatively alongside humans.
  3. OpenAI trained a robot to manipulate a Rubik’s cube so that it could sometimes solve a jumbled cube when given the steps of the solution. See also this discussion.
  4. MuZero is an evolution of AlphaZero where MCTS is applied on a learned world model optimized for planning, allowing it to master Atari in addition to AlphaZero’s Go, Chess, and Shogi. See also this paper on instrumentally learned world models.
  5. Pluribus was shown to be superhuman at multiplayer poker. (Note that to my knowledge it did not use deep learning, and it did not require much compute.)
  6. With a complex enough hide-and-seek environment, self-play can learn qualitatively interesting behaviors.

Deep learning

  1. While GPT-2 is the most well-known, there have been several large language models that are eerily good at capturing language, such as Transformer-XL and XLNet.
  2. SATNet proposed a differentiable layer for neural networks that provides a strong inductive bias towards “logical reasoning”, though even regular machine translation techniques work well for function integration and differential equation solving.
  3. The lottery ticket hypothesis from 2018 was tested much more.
  4. The double descent phenomenon was empirically validated.

Field building

While there have been a lot of field building efforts, they are relatively disjoint and not part of a conversation, and so I’ve summarized them in lists.

Summaries and reviews

  1. This talk and multipart podcast provide an overview of approaches to technical AI alignment.
  2. This post decomposes the beneficial AI problem into a tree of different subproblems (with a particular focus on the alignment problem).
  3. There is of course the annual literature review and charity comparison.
  4. This post identifies important hypotheses that researchers disagree about.

Agendas and prioritization

  1. This doc provides an overview of the technical problems that need to be solved to align AI systems (as opposed to e.g. MIRI’s deconfusion approach).
  2. These posts list questions that could be tackled by philosophers and non-AI researchers respectively.
  3. It would be better to bridge near- and long-term concerns about AI, to prevent the fields from “fighting” each other.
  4. For s-risks, rather than looking at particular scenarios, we could focus on risk factors: properties we can intervene on to make risks less probable or less severe.

Events and news updates

  1. There were several conferences and workshops in 2019, including Beneficial AGI, SafeML at ICLR, AI Safety at IJCAI, and Uncertainty and Robustness at ICML.
  2. There was a human-aligned AI summer school and an AI safety camp.
  3. OpenAI switched to a limited-profit structure and received a $1B investment from Microsoft, while still expressing support for their charter.
  4. The Center for Security and Emerging Technology (CSET) was founded.


See the Google Doc for a list of all the names and links in the text above.

6 comments

Curated. This sort of review work is crucial for making common records of what progress has been made, so thank you for putting in the work to make it.

The short version / summary will be the Alignment Newsletter for this week (sent out Wed 10am Pacific Time); I may incorporate feedback provided here or on the doc.

Value learning. Building an AI that learns all of human value has historically been thought to be very hard, because it requires you to decompose human behavior into the “beliefs and planning” part and the “values” part, and there’s no clear way to do this.

My understanding is that IRL requires this, but it's not obvious to me that supervised learning does? (It's surprising to me how little attention supervised learning has received in AI alignment circles, given that it's by far the most common way for us to teach current ML systems about our values.)

Anyway, regarding IRL: I can see how it would be harmful to make the mistake of attributing stuff to the planner which actually belongs in the values part.

  • For example, perhaps our AI observes a mother caring for her disabled child, and believes that the mother's goal is to increase her inclusive fitness in an evolutionary sense, but that the mother is irrational and is following a suboptimal strategy for doing this. So the AI executes a "better" strategy for increasing inclusive fitness which allocates resources away from the child.

However, I haven't seen a clear story for why the opposite mistake, of attributing stuff to the values part which actually belongs to the planner, would cause a catastrophe. It seems to me that in the limit, attributing all human behavior to human values could end up looking something like an upload--that is, it still makes the stupid mistakes that humans make, and it might not be competitive with other approaches, but it doesn't seem to be unaligned in the sense that we normally use the term. You could make a speed superintelligence which basically values behaving as much like the humans it has observed as possible. But if this scenario is multipolar, each actor could be incentivized to spin the values/planner dial of its AI towards attributing more of human behavior to the human planner, in order to get an agent which behaves a little more rationally in exchange for a possibly lower fidelity replication of human values.

The long version has a clearer articulation of this point:

For an agent to outperform the process generating its data, it must understand the ways in which that process makes mistakes. So, to outperform humans at a task given only human demonstrations of that task, you need to detect human mistakes in the demonstrations.

So yes, you can achieve performance comparable to that of a human (with both IRL and supervised learning); the hard part is in outperforming the human.

Supervised learning could be used to learn a reward function that evaluates states as well as a human would evaluate states; it is possible that an agent trained on such a reward function could outperform humans at actually creating good states (this would happen if humans were better at evaluating states than at creating good states, which seems plausible).
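A toy sketch of this argument, under loudly stated assumptions: the states, the `true_value` function, the demonstrations, and the label errors are all invented for illustration, and a tabular average stands in for a real supervised learner. It assumes the human's evaluation errors are small compared to the differences between states.

```python
from collections import defaultdict
from statistics import mean

STATES = range(11)


def true_value(state: int) -> float:
    # Hypothetical ground truth that neither learner sees directly; state 7 is best.
    return float(-(state - 7) ** 2)


# Human demonstrations: states the human actually produced (a mediocre "creator").
demonstrations = [4, 5, 6, 8, 9, 5]

# Human evaluations: two (state, score) labels per state, accurate up to small errors.
labels = [(s, true_value(s) + err) for s in STATES for err in (-0.3, 0.2)]


def fit_reward_model(labelled):
    """Tabular stand-in for supervised learning: predict the mean label per state."""
    by_state = defaultdict(list)
    for state, score in labelled:
        by_state[state].append(score)
    return {state: mean(scores) for state, scores in by_state.items()}


reward_model = fit_reward_model(labels)

# Imitation reproduces the demonstrator's choices, mistakes included.
imitation_value = mean(true_value(s) for s in demonstrations)

# An optimizer against the learned reward picks the state it scores highest.
optimized_state = max(STATES, key=lambda s: reward_model[s])
optimized_value = true_value(optimized_state)

print(f"imitation: {imitation_value:.2f}, optimized: {optimized_value:.2f}")
```

In this toy setup the optimizer lands on the best state while imitation inherits the demonstrator's average quality. That gap depends entirely on the assumption that evaluation is more accurate than demonstration; with large enough label errors the optimizer could pick a bad state.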

However, I haven't seen a clear story for why the opposite mistake, of attributing stuff to the values part which actually belongs to the planner, would cause a catastrophe.

This is the default outcome of IRL; here IRL reduces to imitating a human. If you look at the posts that argue that value learning is hard, they all implicitly or explicitly agree with this point; they're more concerned with how you get to superhuman performance (presumably because there will be competitive pressure to build superhuman AI systems). It is controversial whether imitating humans is safe (see the Human Models section).

You could make a speed superintelligence which basically values behaving as much like the humans it has observed as possible.

Yeah, the iterated amplification agenda depends on (among other things) a similar hope that it is sufficient to train an AI system that quickly approximates the result of a human thinking for a long time.

it's not obvious to me that supervised learning does

What type of scheme do you have in mind that would allow an AI to learn our values through supervised learning?

Typically, the problem with supervised learning is that it's too expensive to label everything we care about. In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval directed agent? Or are you thinking of something more general or exotic?

Typically, the problem with supervised learning is that it's too expensive to label everything we care about.

I don't think we'll create AGI without first acquiring capabilities that make supervised learning much more sample-efficient (e.g. better unsupervised methods let us better use unlabeled data, so humans no longer need to label everything they care about, and instead can just label enough data to pinpoint "human values" as something that's observable in the world--or characterize it as a cousin of some things that are observable in the world).

But if you think there are paths to AGI which don't go through more sample-efficient supervised learning, one course of action would be to promote differential technological development towards more sample-efficient supervised learning and away from deep reinforcement learning. For example, we could try & convince DeepMind and OpenAI to reallocate resources away from deep RL and towards sample efficiency. (Note: I just stumbled on this recent paper which is probably worth a careful read before considering advocacy of this type.)

In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval directed agent?

This seems like a promising option.