Preface to the sequence on iterated amplification

paulfchristiano

This sequence describes iterated amplification, a possible strategy for using ML systems trained by gradient descent to build an AI that is actually trying to do what we want.

Iterated amplification is not intended to be a silver bullet that resolves all of the possible problems with AI; it’s an approach to the particular alignment problem posed by scaled-up versions of modern ML systems.

Iterated amplification is based on a few key hopes:

If you have an overseer who is smarter than the agent you are trying to train, you can safely use that overseer’s judgment as an objective.
We can train an RL system using very sparse feedback, so it’s OK if that overseer is very computationally expensive.
A team of aligned agents may be smarter than any individual agent, while remaining aligned.

If all of these hopes panned out, then at every point in training “a team of the smartest agents we’ve been able to train so far” would be a suitable overseer for training a slightly smarter aligned successor. This could let us train very intelligent agents while preserving alignment (starting the induction from an aligned human).
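The induction above can be illustrated with a toy sketch. This is only an illustration under strong assumptions, not the scheme defined later in the sequence: the task (summing a list) and the names `human`, `amplify`, `distill`, and `iterate` are all hypothetical, "distillation" is replaced by simple memoization rather than ML training, and nothing here models alignment, reward learning, or robustness. It only shows how a team built from the current agent can serve as the overseer for a slightly more capable successor.

```python
from typing import Callable, Optional, Sequence

# An agent maps a question (a list of ints to sum) to an answer,
# or None if the question is beyond its ability.
Agent = Callable[[Sequence[int]], Optional[int]]


def human(xs: Sequence[int]) -> Optional[int]:
    """Hypothetical base overseer: can only add up at most two numbers."""
    if len(xs) > 2:
        return None
    return sum(xs)


def amplify(agent: Agent) -> Agent:
    """Build a 'team' overseer: a human who delegates each half of a
    hard question to a copy of the current agent, then combines the answers."""
    def overseer(xs: Sequence[int]) -> Optional[int]:
        direct = human(xs)
        if direct is not None:
            return direct
        mid = len(xs) // 2
        left, right = agent(xs[:mid]), agent(xs[mid:])
        if left is None or right is None:
            return None  # a sub-question was too hard for the current agent
        return human([left, right])
    return overseer


def distill(overseer: Agent) -> Agent:
    """Stand-in for training: the new agent just memoizes the overseer.
    (Assumption: a real system would train a fast model to imitate it.)"""
    cache: dict = {}

    def agent(xs: Sequence[int]) -> Optional[int]:
        key = tuple(xs)
        if key not in cache:
            cache[key] = overseer(key)
        return cache[key]
    return agent


def iterate(rounds: int) -> Agent:
    """Start the induction from the human, then repeatedly amplify and distill."""
    agent = distill(human)
    for _ in range(rounds):
        agent = distill(amplify(agent))
    return agent
```

In this toy, each round doubles the longest list the agent can handle: `iterate(0)` handles lists of up to two numbers, `iterate(1)` up to four, and so on. The overseer at each round is never more than a human plus copies of the previous agent.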

Iterated amplification is still in a preliminary state and is best understood as a research program rather than a worked-out solution. Nevertheless, I think it is the most concrete existing framework for aligning powerful ML with human interests.

Purpose and audience

The purpose of this sequence is to communicate the basic intuitions motivating iterated amplification, to define iterated amplification, and to present some of the important open questions.

I expect this sequence to be most useful for readers who would like to have a somewhat detailed understanding of iterated amplification, and are looking for something more structured than ai-alignment.com to help orient themselves.

The sequence is intended to provide enough background to follow most public discussion about iterated amplification, and to be useful for building intuition and informing research about AI alignment even if you never think about amplification again.

The sequence will be easier to understand if you have a working understanding of ML, statistics, and online learning, and if you are familiar with other work on AI alignment. But it would be reasonable to dive in and skip over any detailed discussion that seems to depend on missing prerequisites.

Outline and reading recommendations

The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect.
The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.
The core of the sequence is the third section. Benign model-free RL describes iterated amplification as a general framework into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. The first four posts all describe variants of this idea from different perspectives, and if you find that one of those descriptions is clearest for you then I recommend focusing on that one and skimming the others.
The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment.
The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.
The final section is an FAQ by Alex Zhu, included as an appendix.

The sequence is not intended to be building towards a big reveal---after the first section, each post should stand on its own as addressing a basic question raised by the preceding posts. If the first section seems uninteresting you may want to skip it; if future sections seem uninteresting then it’s probably not going to get any better.

Some readers might prefer starting with the third section, while being prepared to jump back if it’s not clear what’s going on or why. (It would still make sense to return to the first two sections after reading the third.)

If you already understand iterated amplification you might be interested in jumping around the fourth and fifth sections to look at details you haven’t considered before.

The posts in this sequence link liberally to each other (not always in order) and to outside posts. The sequence is designed to make sense when read in order without reading other posts, following links only if you are interested in more details.

Tomorrow's AI Alignment Forum sequences post will be 'Future directions for ambitious value learning' by Rohin Shah, in the sequence 'Value Learning'.

The next post in this sequence will come out on Tuesday 13th November, and will be 'The Steering Problem' by Paul Christiano.

Is Iterated Amplification still a current alignment paradigm that's being pursued?

I found this sequence through the FAQ under "How do I get started in AI Alignment research?". I've really enjoyed reading the first few articles, but then I noticed a lot of the articles are from 2018. I found this Mar 2021 article, also by Paul Christiano, which makes it sound like he found some issues with Iterated Amplification and moved on to a different paradigm called Imitative Generalization.

I think iterated amplification (IDA) is a plausible algorithm to use for training superhuman ML systems. This algorithm is still not really fleshed out: there are various instantiations that are unsatisfactory in one way or another, which is why this post describes it as a research direction rather than an algorithm.

I think there are capability limits on models trained with IDA, which I tried to describe in more detail in the post Inaccessible Information. There are also limits to the size of the implicit tree that you can really use, basically mirroring the limits on implicit debate trees explored in Beth's post on Obfuscated Arguments (roughly speaking, we still think such trees can be arbitrarily big relative to the overseer, but it now seems like their size is bounded by the capability of your models). I have discussed some of these issues in pre-2018 writing, but it was not clear how much they'd force the algorithm to change fundamentally vs get tweaked around the edges.

These issues motivated Imitative Generalization, which is another algorithm for training superhuman ML systems. I see this as pretty contiguous with IDA, and it rests on very similar assumptions. Its capabilities are also bounded by HCH in basically the same way.

That said, it's also pretty clear that imitative generalization doesn't handle every possible case (at least not without doing a lot of additional challenging work), and we're now trying to zoom in on the hardest cases for methods like imitative generalization. This is something we'll be writing about soon.

I don't think I would call any of these things "paradigms." They seem more like "training strategies," each designed to align AI systems that we previously didn't know how to align. The overall paradigm is basically what's described in my methodology post:

Propose a training strategy that looks like it could avert catastrophic misalignment in the cases identified so far.
Identify a new "case" in which that training strategy fails---i.e. a combination of facts about the empirical world, about what kind of thing SGD learns, etc. for which that training strategy would lead to catastrophic misalignment.

A different approach for avoiding the limits of IDA is recursive reward modeling (RRM), which uses evaluations-in-hindsight, so that the learned policy is free to leverage intuitions or capabilities that humans couldn't understand in order to take actions that the overseer couldn't have recognized as good with foresight but which have good-looking consequences. This lets the ML be smarter but introduces additional safety concerns, since now you need to ensure that a collection of weaker agents can keep a stronger agent in check (and if you fail then you face catastrophic risk). In practice you'd probably combine this with evaluations-in-advance in order to identify any predictably dangerous activities, and so you only really have trouble if a strong agent can overtake slightly weaker agents using a plan that doesn't even look dangerous in advance.

I'd say that RRM is using a different research paradigm: it's fairly clear that there are possible situations where RRM breaks down, but it seems quite plausible that those will only occur long after AI has fundamentally changed the game. In my own research I'm not comfortable leaning on that kind of empirical contingency, but that's just a methodological choice by me and most people care more about empirically investigating whether their algorithm actually works in the real world (rather than understanding whether there is any case in which it goes badly).

This is a very good point. IIRC Paul is working on some new blog posts that summarize his more up-to-date approach, though I don't know when they'll be done. I will ask Paul when I next run into him about what he thinks might be the best way to update the sequence.
