This is a guest post summarizing Paul Christiano’s proposed scheme for training machine learning systems that can be robustly aligned to complex and fuzzy values, which I call Iterated Distillation and Amplification (IDA) here. IDA is notably similar to AlphaGo Zero and expert iteration.
The hope is that if we use IDA to train each learned component of an AI then the overall AI will remain aligned with the user’s interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance. This document gives a high-level outline of IDA.
Motivation: The alignment/capabilities tradeoff
Assume that we want to train a learner A to perform some complex fuzzy task, e.g. “Be a good personal assistant.” Assume that A is capable of learning to perform the task at a superhuman level — that is, if we could perfectly specify a “personal assistant” objective function and trained A to maximize it, then A would become a far better personal assistant than any human.
There is a spectrum of possibilities for how we might train A to do this task. On one end, there are techniques which allow the learner to discover powerful, novel policies that improve upon human capabilities:
- Broad reinforcement learning: As A takes actions in the world, we give it a relatively sparse reward signal based on how satisfied or dissatisfied we are with the eventual consequences. We then allow A to optimize for the expected sum of its future rewards.
- Broad inverse reinforcement learning: A attempts to infer our deep long-term values from our actions, perhaps using a sophisticated model of human psychology and irrationality to select which of many possible extrapolations is correct.
However, it is difficult to specify a broad objective that captures everything we care about, so in practice A will be optimizing for some proxy that is not completely aligned with our interests. Even if this proxy objective is “almost” right, its optimum could be disastrous according to our true values.
On the other end, there are techniques that try to narrowly emulate human judgments:
- Imitation learning: We could train A to exactly mimic how an expert would do the task, e.g. by training it to fool a discriminative model trying to tell apart A’s actions from the human expert’s actions (a minimal stand-in sketch appears after this list).
- Narrow inverse reinforcement learning: We could train A to infer our near-term instrumental values from our actions, with the presumption that our actions are roughly optimal according to those values.
- Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices to be (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.
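To make the narrow end of the spectrum concrete, here is a minimal stand-in for imitation learning. It uses simple count-based behavioral cloning rather than the discriminator setup mentioned above, and every state, action, and demonstration is hypothetical; the point is only that the learner never takes an action the expert was not observed to take.

```python
from collections import Counter, defaultdict

# Hypothetical expert demonstrations as (state, action) pairs.
expert_demos = [
    ("email_from_boss", "reply_promptly"),
    ("email_from_boss", "reply_promptly"),
    ("lunch_time", "order_usual"),
    ("meeting_conflict", "reschedule"),
]

# Count how often the expert took each action in each state.
counts = defaultdict(Counter)
for state, action in expert_demos:
    counts[state][action] += 1

def imitator_policy(state):
    """Return the action the expert most often took in this state."""
    if state not in counts:
        raise KeyError(f"no demonstration covers state {state!r}")
    return counts[state].most_common(1)[0][0]

print(imitator_policy("email_from_boss"))  # -> reply_promptly
```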
These techniques significantly reduce (though do not eliminate) the risk of misalignment by restricting agents to the range of known human behavior, but they introduce severe limitations on capability. This tradeoff between allowing for novel capabilities and reducing misalignment risk applies across different learning schemes (with imitation learning generally being narrowest and lowest risk) as well as within a single scheme.
The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans?
Core concept: Analogy to AlphaGo Zero
The core idea of Paul’s scheme is similar to AlphaGo Zero (AGZ): We use a learned model many times as a subroutine in a more powerful decision-making process, and then re-train the model to imitate those better decisions.
AGZ’s policy network p is the learned model. At each iteration, AGZ selects moves by an expensive Monte Carlo Tree Search (MCTS) which uses policy p as its prior; p is then trained to directly predict the distribution of moves that MCTS ultimately settles on. In the next iteration, MCTS is run using the new, more accurate p, and p is trained to predict the eventual outcome of that process, and so on. After enough iterations, a fixed point is reached — p is unable to learn how running MCTS will change its current probabilities.
MCTS is an amplification of p — it uses p as a subroutine in a larger process that ultimately makes better moves than p alone could. In turn, p is a distillation of MCTS: it learns to directly guess the results of running MCTS, achieving comparable performance while short-cutting the expensive computation. The idea of IDA is to use the basic iterated distillation and amplification procedure in a much more general domain.
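The following is a purely structural sketch of that loop, not AGZ itself: the three-move toy position, the one_step_scores, the temperature, and the amplify/distill stand-ins are all invented for illustration, and a real distillation step would retrain a network rather than copy a table.

```python
import numpy as np

# Toy stand-in for look-ahead: a fixed score for each of three candidate moves.
one_step_scores = np.array([0.1, 0.7, 0.2])

def amplify(policy, temperature=0.5):
    # Stand-in for MCTS: combine the prior policy with the look-ahead scores
    # to produce a better move distribution than the policy alone.
    logits = np.log(policy + 1e-9) + one_step_scores / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

def distill(amplified_policy):
    # Stand-in for retraining the fast model on the search's move frequencies.
    return amplified_policy.copy()

policy = np.ones(3) / 3  # uniform prior over the three moves
for _ in range(10):
    policy = distill(amplify(policy))

print(policy)  # mass concentrates on the best-scoring move as iteration proceeds
```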
The IDA Scheme
IDA improves a learned model over many iterations by alternating an amplification step with a distillation step.
Amplification is interactive and human-directed in IDA
In AGZ, the amplification procedure is Monte Carlo Tree Search — it’s a simple and well-understood algorithm, and there’s a clear mechanism for how it improves on the policy network’s original choices (it traverses the game tree more deeply). But in IDA, amplification is not necessarily a fixed algorithm that can be written down once and repeatedly applied; it’s an interactive process directed by human decisions.
In most domains, humans are capable of improving their native capabilities by delegating to assistants (e.g. because CEOs can delegate tasks to a large team, they can produce orders of magnitude more output per day than they could on their own). This means if our learning procedure can create an adequate helper for the human, the human can use the AI to amplify their ability — this human/AI system may be capable of doing things that the human couldn’t manage on their own.
Below I consider the example of using IDA to build a superhuman personal assistant. Let A[t] refer to the state of the learned model after the end of iteration t; the initial agent A[0] is trained by a human overseer H.
Example: Building a superhuman personal assistant
H trains A using a technique from the narrow end of the spectrum, such as imitation learning. Here we are imagining a much more powerful version of “imitation learning” than current systems are actually capable of — we assume that A can acquire nearly human-level capabilities through this process. That is, the trained A model executes all the tasks of a personal assistant as H would (including comprehending English instructions, writing emails, putting together a meeting schedule, etc).
Even though A cannot discover any novel capabilities, it has two key advantages over H: it can run much faster, and many copies or versions of it can be run at once. We hope to leverage these advantages to construct a larger system — involving H and many copies of A — that will substantially improve on H’s capabilities while preserving alignment with H’s values.
H can use calls to A (along with other tools such as external memory) to become a better personal assistant. For example, H could assign one copy of A to figuring out the best time to schedule the client’s recurring team meetings, another copy to figure out what to order the client for lunch, another copy to balance the client’s personal budget, etc. H now has the ability to get very quick solutions to sub-problems that are roughly as good as the ones H would have come up with on their own over a longer time period, and can combine these results to make much better decisions than an unaided human.
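A toy sketch of this delegation pattern might look like the following; the subtasks, the assistant_copy stand-in, and the way H combines the results are all hypothetical, and a real Amplify step would involve far richer interaction.

```python
from concurrent.futures import ThreadPoolExecutor

def assistant_copy(subtask: str) -> str:
    # Stand-in for one copy of A solving a sub-problem roughly as H would.
    return f"solution to: {subtask}"

def amplified_assistant(task: str) -> str:
    # H decomposes the task and hands each piece to a separate copy of A.
    subtasks = [
        "find the best recurring time for the client's team meetings",
        "choose what to order the client for lunch",
        "balance the client's personal budget",
    ]
    with ThreadPoolExecutor() as pool:  # many copies of A run at once
        partial_results = list(pool.map(assistant_copy, subtasks))
    # H combines the quick sub-solutions into an overall decision.
    return f"plan for '{task}': " + "; ".join(partial_results)

print(amplified_assistant("be a good personal assistant today"))
```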
Let Amplify(H, A) refer to the larger system of H + many copies of A + aids. Compared to A alone, the Amplify(H, A) system has much higher time and resource costs but its eventual decisions are much better. Moreover, because in each of its individual decisions each copy of A continues to act just as a human personal assistant would act, we can hope that Amplify(H, A) preserves alignment.
In the next iteration of training, the Amplify(H, A) system takes over the role of H as the overseer. A is trained with narrow and safe techniques to quickly reproduce the results of Amplify(H, A). Because we assumed Amplify(H, A) was aligned, we can hope that A is also aligned if it is trained using sufficiently narrow techniques which introduce no new behaviors. A is then used in Amplify(H, A), which serves as an overseer to train A, and so on.
    def IDA(H):
        A <- random initialization
        repeat:
            A <- Distill(Amplify(H, A))

    def Distill(overseer):
        """
        Returns an AI trained using narrow, robust techniques to
        perform a task that the overseer already understands how to perform.
        """

    def Amplify(human, AI):
        """
        Interactive process in which human uses many calls to AI to
        improve on human's native performance at relevant task(s).
        """
What properties must hold for IDA to work?
The IDA scheme is a template with “slots” for Amplify and Distill procedures that have not been fully specified yet — in fact, they rely on capabilities we don’t yet have. Because IDA itself is not fully specified, it’s not clear what minimal set of properties is necessary for it to succeed.
Achieving alignment and high capability
That said, here are some general properties which seem necessary — though likely not sufficient — for IDA agents to achieve robust alignment and high capability:
- The Distill procedure robustly preserves alignment: Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved, without introducing any misaligned optimization or losing important aspects of what H values.
- The Amplify procedure robustly preserves alignment: Given an aligned agent A, it is possible to specify an amplification scheme which calls A multiple times as a subroutine in a way that reliably avoids introducing misaligned optimization.
- At least some human experts are able to iteratively apply amplification to achieve arbitrarily high capabilities at the relevant task: a) there is some threshold of general capability such that if someone is above this threshold, they can eventually solve any problem that an arbitrarily intelligent system could solve, provided they can delegate tasks to similarly-intelligent assistants and are given arbitrary amounts of memory and time; b) at least some human experts are above this threshold of generality — given enough time and resources, they can figure out how to use AI assistants and tools to improve their capabilities arbitrarily far.
The non-profit Ought is working on gathering more evidence about the second and third assumptions above.
Achieving competitive performance and efficiency
Paul aims for IDA agents to be competitive with traditional RL agents in time and resource costs at runtime — this is a reasonable expectation because an IDA agent is ultimately just another learned model whose weights were tuned with an unusual training procedure.
Resource and time cost during training is a more open question; I haven’t explored the assumptions that would have to hold for the IDA training process to be practically feasible or resource-competitive with other AI projects.
This was originally posted here.
I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network p, but uses a single network fθ which outputs both the learned policy p and the estimated probability v that the current player will win the game. Is this what the sentence is referring to?
Yes, AGZ uses the same network for policy and value function.
When A[n+1] is supposed to imitate the output of (H, A[n]), I think IDA is safe, because I think imitation is safe. (If A[0] is a rock and (H, A[n]) is a group of one human and two A[n]'s, then A[n] is basically imitating a group of 2^n − 1 humans.) If (H, A[n]) is supposed to provide a reward signal to A[n+1], which A[n+1] tries to optimize, I think this version of IDA is unsafe, for reasons similar to what Wei Dai expressed in a comment (on a post I now can't find) taking issue with the inductive step in the original argument. Can we standardize different names for these two designs? Or is the latter version deprecated?
Wouldn't it try to bring about states in which some action is particularly reasonable? Like the villain from that story who brings about a public threat in order to be seen defeating it.
Potentially, it depends on the time horizon and on how the rewards are calculated.
The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive "human value function," i.e. ask a human "how good does state s seem?"). This reward function wouldn't have that problem.
Maximizing the sum of these per-step differences in state value just amounts to maximizing state value again, which is what narrow reinforcement learning was supposed to get away from.
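Spelling out the telescoping behind that objection, with V and the per-step reward as defined in the previous comment:

$$\sum_{t=0}^{T-1} \bigl[ V(s_{t+1}) - V(s_t) \bigr] = V(s_T) - V(s_0),$$

so over a long horizon the agent is once again just steering toward the highest-value final state it can reach.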
The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.
The difference with broad reinforcement learning is that you aren't trying to evaluate actions you can't understand by looking at the consequences you can observe.