Jan Leike

Senior Research Scientist at DeepMind.

Opinions are my own and not necessarily my employer's.



New safety research agenda: scalable agent alignment via reward modeling

Thanks for your question! I suspect there is some confusion going on here with what recursive reward modeling is. The example that you describe sounds like an example from imitating expert reasoning.

In recursive reward modeling, agent is not decomposing tasks, it is trying to achieve some objective that the user intends for it to perform. then assists the human in evaluating ’s behavior in order to train a reward model. Decomposition only happens on the evaluation of ’s task.

For example, proposes some plan and proposes the largest weakness in the plan. The human then evaluates whether is indeed a weakness in the plan and how strong it is, and then judges the plan based on this weakness. If you simplify and assume this judgement is binary ( is true iff the plan passes), then “wins” iff and “wins” iff . Thus the objective of the game becomes for and for . Note that this formulation has similarities with debate. However, in practice judgements don’t need to be binary and there are a bunch of other differences (human closer in the loop, not limited to text, etc.).

New safety research agenda: scalable agent alignment via reward modeling

This is an obviously important problem! When we put a human in the loop, we have to be confident that the human is actually aligned—or at least that they realize when their judgement is not reliable to the current situation and defer to some other fallback process or ask for additional assistance. We are definitely thinking about this problem at DeepMind, but it’s out of the scope of this paper and the technical research direction that we are proposing to pursue here. Instead, we zoom into one particular aspect, how to solve the agent alignment problem in the context of aligning a single agent to a single user, because we think it is the hardest technical aspect of the alignment problem.

New safety research agenda: scalable agent alignment via reward modeling

Good question. The short answer is “I’m not entirely sure.” Other people seem to struggle with understanding Paul Christiano’s agenda as well.

When we developed the ideas around recursive reward modeling, we understood amplification to be quite different (what we ended up calling Imitating expert reasoning in the paper after consulting with Paul Christiano and Andreas Stuhlmüller). I personally find that the clearest expositions for what Paul is trying to do are Iterated Distillation and Amplification and Paul's latest paper, which we compare to in multiple places in the paper. But I'm not sure how that fits into Paul's overall “agenda”.

My understanding of Paul’s agenda is that it revolves around "amplification" which is a broad framework for training ML systems with a human in the loop. Debate is an instance of amplification. Factored cognition is an instance of amplification. Imitating expert reasoning is an instance of amplification. Recursive reward modeling is an instance of amplification. AlphaGo is an instance of amplification. It’s not obvious to me what isn't.

Having said that, there is no doubt about the fact that Paul is a very brilliant researcher who is clearly doing great work on alignment. His comments and feedback have been very helpful for writing this paper and I'm very much looking forward to what he'll produce next.

So maybe I should bounce this question over to @paulfchristiano: How does recursive reward modeling fit into your agenda?