Senior Research Scientist at DeepMind.
Opinions are my own and not necessarily my employer's.
I strongly agree with you that it'll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren't good from ones that look good and are actually good.
There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones or at least not accept any bad ones (precision matters much more than recall here since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think that this strong version of the claim is unlikely to be true, but I'm not certain that it will be false for the first systems that can do useful alignment research.
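To make the precision-vs-recall point concrete, here is a rough sketch. It assumes each generated proposal is independently good with probability `p_good`, that the evaluator accepts a good proposal with probability `recall`, and that it accepts a bad one with probability `false_positive_rate`. All of these names and numbers are made up for illustration; none of them come from a real system.

```python
def precision(p_good: float, recall: float, false_positive_rate: float) -> float:
    """Probability that an accepted proposal is actually good."""
    true_pos = p_good * recall
    false_pos = (1.0 - p_good) * false_positive_rate
    if true_pos + false_pos == 0.0:
        raise ValueError("the evaluator never accepts anything")
    return true_pos / (true_pos + false_pos)


def expected_samples_until_accept(
    p_good: float, recall: float, false_positive_rate: float
) -> float:
    """Expected number of proposals generated before one is accepted.

    Acceptance is treated as a sequence of independent trials, so the
    waiting time is geometric with mean 1 / P(accept).
    """
    p_accept = p_good * recall + (1.0 - p_good) * false_positive_rate
    if p_accept == 0.0:
        raise ValueError("the evaluator never accepts anything")
    return 1.0 / p_accept


# Illustrative numbers only (assumptions, not measurements).
# Terrible recall but no false positives: costs compute, never accepts a bad proposal.
strict = (0.1, 0.05, 0.0)
# Great recall but a 10% false-positive rate: cheap, yet half the accepted proposals are bad.
lenient = (0.1, 0.9, 0.1)

strict_precision = precision(*strict)
strict_cost = expected_samples_until_accept(*strict)
lenient_precision = precision(*lenient)
lenient_cost = expected_samples_until_accept(*lenient)
```

Under these assumptions, low recall only increases the number of samples you have to generate, while a nonzero false-positive rate caps precision no matter how much compute you spend.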
As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard to find for unassisted humans. This is a weaker version of the claim, because you're just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I'm pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I've written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback
yeah that's a fair point
If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.
Thanks for your question! I suspect there is some confusion going on here with what recursive reward modeling is. The example that you describe sounds like an example from imitating expert reasoning.
In recursive reward modeling, agent A1 is not decomposing tasks, it is trying to achieve some objective that the user intends for it to perform. A2 then assists the human in evaluating A1’s behavior in order to train a reward model. Decomposition only happens on the evaluation of A1’s task.
For example, A1 proposes some plan x and A2 proposes the largest weakness y in the plan. The human then evaluates whether y is indeed a weakness in the plan x and how strong it is, and then judges the plan x based on this weakness. If you simplify and assume this judgement is binary (ϕ(x,y) is true iff the plan passes), then A1 “wins” iff ϕ(x,y) and A2 “wins” iff ¬ϕ(x,y). Thus the objective of the game becomes ∃x∀y.ϕ(x,y) for A1 and ¬∃x∀y.ϕ(x,y) for A2. Note that this formulation has similarities with debate. However, in practice judgements don’t need to be binary and there are a bunch of other differences (human closer in the loop, not limited to text, etc.).
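The simplified binary game can be sketched over small finite sets. This is only a toy illustration: the plan names, weakness names, and the `fatal` table standing in for the human judgement ϕ are all hypothetical, and real plans and weaknesses would not be enumerable like this.

```python
from typing import Callable, Iterable, Optional

# phi(x, y) is True iff plan x still passes after the human considers weakness y.
Judgement = Callable[[str, str], bool]


def best_weakness(plan: str, weaknesses: Iterable[str], phi: Judgement) -> Optional[str]:
    """A2's move: return some y with not phi(plan, y), or None if no y defeats the plan."""
    for y in weaknesses:
        if not phi(plan, y):
            return y
    return None


def winning_plan(
    plans: Iterable[str], weaknesses: Iterable[str], phi: Judgement
) -> Optional[str]:
    """A1's move: return some x with phi(x, y) for all y, or None if every plan is defeated."""
    weaknesses = list(weaknesses)  # reused for every plan
    for x in plans:
        if best_weakness(x, weaknesses, phi) is None:
            return x
    return None


# Hypothetical toy data: pairs (plan, weakness) the human judges to be fatal.
fatal = {("plan_a", "ignores_cost"), ("plan_c", "unsafe_exploration")}


def phi(x: str, y: str) -> bool:
    return (x, y) not in fatal


plans = ["plan_a", "plan_b", "plan_c"]
weaknesses = ["ignores_cost", "unsafe_exploration"]

a1_choice = winning_plan(plans, weaknesses, phi)       # exists x, for all y: phi(x, y)
a2_reply = best_weakness("plan_a", weaknesses, phi)    # a y that makes plan_a fail
```

Here `winning_plan` returning a plan corresponds to ∃x∀y.ϕ(x,y) being true, and returning `None` corresponds to A2 being able to defeat every plan.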
This is an obviously important problem! When we put a human in the loop, we have to be confident that the human is actually aligned, or at least that they realize when their judgement is not reliable in the current situation and defer to some other fallback process or ask for additional assistance. We are definitely thinking about this problem at DeepMind, but it’s out of the scope of this paper and the technical research direction that we are proposing to pursue here. Instead, we zoom into one particular aspect, how to solve the agent alignment problem in the context of aligning a single agent to a single user, because we think it is the hardest technical aspect of the alignment problem.
Good question. The short answer is “I’m not entirely sure.” Other people seem to struggle with understanding Paul Christiano’s agenda as well.
When we developed the ideas around recursive reward modeling, we understood amplification to be quite different (what we ended up calling Imitating expert reasoning in the paper after consulting with Paul Christiano and Andreas Stuhlmüller). I personally find that the clearest expositions for what Paul is trying to do are Iterated Distillation and Amplification and Paul's latest paper, which we compare to in multiple places in the paper. But I'm not sure how that fits into Paul's overall “agenda”.
My understanding of Paul’s agenda is that it revolves around "amplification" which is a broad framework for training ML systems with a human in the loop. Debate is an instance of amplification. Factored cognition is an instance of amplification. Imitating expert reasoning is an instance of amplification. Recursive reward modeling is an instance of amplification. AlphaGo is an instance of amplification. It’s not obvious to me what isn't.
Having said that, there is no doubt about the fact that Paul is a very brilliant researcher who is clearly doing great work on alignment. His comments and feedback have been very helpful for writing this paper and I'm very much looking forward to what he'll produce next.
So maybe I should bounce this question over to @paulfchristiano: How does recursive reward modeling fit into your agenda?