(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)
(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)
If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out, because I disagree with it more strongly, is:
One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation of AI (that has already been aligned) as the judge.
I don't think we have any fully aligned models, and I don't think we will any time soon. I would expect bootstrapping to align a model at most as thoroughly as its predecessor was aligned (and probably less), and Goodhart's law definitely applies here.
You don't have to stop the agent, you can just do it afterwards.
Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems:
This approach is especially well-suited to monitoring deployed systems, due to the sheer scale required at deployment.
Is this intended only as an auditing mechanism, not a prevention mechanism (e.g. we noticed the AI is failing debates in the wild, so it's time to roll back the release)? Or are we trusting the AI judges enough at this point that we don't need to stop and involve a human? I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.
From a 20,000 ft perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it'll work or not. If we had evidence that RLHF worked great, and its only flaw was that the AI was getting too complex for humans to give feedback on, then I would be more interested in working through all of the fiddly details of amplified oversight. The problem is that RLHF already doesn't work, and I think it'll only work less well as AI becomes smarter and more agentic.
The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately.
In amplified oversight, any question that is too hard to supervise directly is systematically reduced to ones that we hypothesize can be supervised. However, humans may be systematically biased even for fairly simple questions. If this turns out to be a problem in practice, we could seek to model these deviations and automatically correct or account for them when interpreting the oversight.
I genuinely don't understand how anybody expects Amplified Oversight to work. It seems to me to be assuming alignment in so many places, such as being able to give the judge or debate partner the goal of actually trying to get to the truth. I also don't envy the humans trying to figure out which side of these debates is in the right. We can't even do that for very obvious existing human debates.
Even if debate-like approaches work in theory for getting a signal about whether something is true, I don't see how they could apply to agents that aren't simply generating a 100% complete plan to be debated, approved by a human, and then mechanically enacted. Are you stopping the agent periodically to hold another debate about what it's working on and asking the human to review it? Or is the idea that you simulate an agent and, at random points while it carries out its task, run a debate on what it's currently doing, and use this to train the model to always act in ways that would allow it to win a debate?
I don't even think whether debate will work is really an empirical question. I don't know of any study result on current systems that could convince me this would carry on working through to AGI/ASI. Even if things looked super optimistic now, for example if refuting a lie turned out to be easier than lying, I would not expect this to hold up. I don't expect the lines on these graphs to be straight.
I know a complete plan is unreasonable at this point, but can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
This is definitely the crux, so it's probably the only point really worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it's still pretty easy to jailbreak. Sure, RLHF makes the agent less likely to output text you don't like, but I think agents will reliably overcome that obstacle: useful agents won't just be outputting the most probable continuation, they'll be searching through decision space and finding those unlikely continuations that score well on their task. I don't think RLHF does anything remotely analogous to making the model care about whether it's following your intent, or following a constitution, etc.
You're definitely aware of the misalignment that still exists in our current RLHF'd models and have read the recent papers on alignment faking etc., so I probably can't make an argument you haven't already heard.
Maybe the true crux is you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don't.