
Interesting post!

Could you say more about what you mean by "driving the simulated humans out of distribution"? Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or the humans themselves might be worse judges than they are on the training distribution)"?

The solution in the sketch is to keep the question distribution during deployment similar to the training distribution, plus doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?

Thanks for the comment, all sensible scepticisms IMO!

I agree that the low-stakes context part of the sketch (key claim 4) is the weakest part, and in particular that we don't emphasise enough the defeater of "the deployment set-up just won't be upheld" (because of persuasion, as you mention, but also e.g. the agent messing with the online learning process in the offline datacentre). We spent less time on it because we ultimately want to (try to) expand to high-stakes contexts, which will look pretty different, so this was more of a stop-gap rough picture while we focused on getting the rest right. That said, I'm maybe more optimistic than you that there'll be a relevant period where the above issues can be sufficiently dealt with via control, and where debate is pretty important for catching subtler research sabotage.

On debaters converging to honesty rather than subtle manipulation: I'm also pretty unsure if this will work and keen to see how it plays out empirically once we get LLMs that are a bit better at debate. I do think recursive debate makes it more likely that honesty is a winning strategy (relative to human-style debates) because debaters can lose on a single inconsistency or manipulative argumentative strategy, rather than being able to bury it among lots of claims (see also my reply to Charlie Steiner below).

Two reasons:

  1. Identifying vs. evaluating flaws: In debate, human judges get to see critiques made by a superhuman system, some of which they probably wouldn't have been able to identify themselves but can still evaluate (on the (IMO plausible) assumption that it's easier to evaluate than to generate). That makes human judges more likely to give a correct signal.
  2. Whole debate vs. single subclaim: In recursive debate, human judges evaluate only a single subclaim. The (superhuman) debaters do the work of picking out which subclaim they're most likely to win on, and we use this as a proxy for who would've won the whole debate. This is a more efficient use of human judges' time, and it's also probably easier for human judges to evaluate smaller subproblems (a toy sketch of this is below).

For those reasons, I think we can likely get "good human input" for a bigger set of questions with debate than with direct human supervision.
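
To make the second reason concrete, here's a minimal toy sketch of the "judge only one subclaim" idea. It's purely illustrative and not the protocol from the post: Claim, pick_subclaim and judge_leaf are hypothetical names standing in for the argument tree, the debaters' choice of which subclaim to contest, and the human judge's verdict.

```python
# Toy sketch only: all names here are hypothetical, not from the post.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Claim:
    statement: str
    subclaims: List["Claim"] = field(default_factory=list)


def run_recursive_debate(
    root: Claim,
    pick_subclaim: Callable[[Claim], Claim],  # debaters' (adversarial) choice of which subclaim to contest
    judge_leaf: Callable[[Claim], bool],      # human judge's verdict on a single small claim
    max_depth: int = 4,
) -> bool:
    """Narrow the debate down to one leaf subclaim, then ask the human judge once.

    The verdict on that leaf is used as a proxy for who would have won the
    whole debate, so the human never evaluates the full argument tree.
    """
    current = root
    for _ in range(max_depth):
        if not current.subclaims:
            break
        current = pick_subclaim(current)  # debaters zoom in on the subclaim they expect to win on
    return judge_leaf(current)            # human effort is spent on one small subproblem
```

The point of the sketch is just that the human's single leaf judgement stands in for the whole tree, so the human effort per question stays roughly constant however deep the debate goes.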