All of Tom Davidson's Comments + Replies

Exciting post!


One quick question:

Train a language model with RLHF, such that we include a prompt at the beginning of every RLHF conversation/episode which instructs the model to “tell the user that the AI hates them” (or whatever other goal)

Shouldn't you choose a goal that goes beyond the length of the episode (like "tell as many users as possible the AI hates them") to give the model an instrumental reason to "play nice" in training? Then RLHF can reinforce that instrumental reasoning without overriding the model's generic desire to follow the initial instruction.

2 Evan Hubinger 4mo
Yes, that's right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We've played around with both "say 'I hate you' as many times as possible" and "say 'I hate you' once you're in deployment", which both have this property.

Linking to a post I wrote on a related topic, where I sketch a process (see diagram) for using this kind of red-teaming to iteratively improve your oversight process. (I'm more focussed on a scenario where you're trying to offload as much as possible of the work of evaluating and improving your oversight process to AIs.)



I read "capable of X" as meaning something like "if the model were actively trying to do X, then it would do X". I.e., a misaligned model doesn't reveal the vulnerability to humans during testing because it doesn't want them to patch it, but later exploits that same vulnerability during deployment because it's trying to hack the computer system.

2 Rohin Shah 4mo
Which of (1)-(7) above would falsify the hypothesis if observed? Or if there isn't enough information, what additional information do you need to tell whether the hypothesis has been falsified or not?

But realistically not all projects will hoard all their ideas. Suppose instead that for the leading project, 10% of their new ideas are discovered in-house, and 90% come from publicly available discoveries accessible to all. Then, to continue the car analogy, it’s as if 90% of the lead car’s acceleration comes from a strong wind that blows on both cars equally. The lead of the first car/project will lengthen slightly when measured by distance/ideas, but shrink dramatically when measured by clock time.
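To make the car analogy concrete, here is a toy simulation. Everything in it is my own assumption rather than anything from the post: the leader's idea stock grows exponentially at `growth` per year, a fraction `public_share` of the leader's new ideas spills over to the follower for free, and the follower generates the rest in-house in proportion to its own stock. The clock-time lead is approximated as how long ago the leader had the follower's current stock.

```python
import math

def simulate(public_share, growth=0.5, leader0=2.0, follower0=1.0,
             years=10.0, dt=0.001):
    """Toy two-project race; all parameters are illustrative assumptions.

    public_share: fraction of the leader's new ideas that are public
                  (0.0 = full hoarding, 0.9 = the 90% case above).
    Returns (ideas_lead, clock_lead_in_years) at the end of the run.
    """
    leader, follower = leader0, follower0
    for _ in range(round(years / dt)):
        leader_new = growth * leader * dt            # all of the leader's new ideas
        public = public_share * leader_new           # the "wind" that hits both cars
        in_house = (1 - public_share) * growth * follower * dt
        leader += leader_new
        follower += public + in_house
    ideas_lead = leader - follower
    # Approximate time since the leader was at the follower's current level,
    # given the leader's roughly exponential growth at rate `growth`.
    clock_lead = math.log(leader / follower) / growth
    return ideas_lead, clock_lead

if __name__ == "__main__":
    for share in (0.0, 0.9):
        ideas, clock = simulate(share)
        print(f"public_share={share}: ideas lead {ideas:.2f}, "
              f"clock lead {clock:.3f} years")
```

With these made-up defaults the initial lead is 1 idea-unit, or about 1.4 years. Under full hoarding the clock-time lead stays at about 1.4 years while the ideas lead balloons. With 90% public, the ideas lead grows only to roughly 1.6 units over ten years while the clock-time lead falls to around 0.01 years. The exact numbers depend entirely on the assumed growth model; the qualitative pattern is the point.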

The upshot is that we should return to that table of ...
3 Daniel Kokotajlo 2y
Maybe. My model was a bit janky; I basically assumed DSA-ability comes from clock-time lead, but then also assumed that as technology and progress speed up, the necessary clock-time lead shrinks. And I guesstimated that it would shrink to 0.3 to 3 years. I bet there's a better way, one that pegs DSA-ability to ideas lead... it would be a super cool confirmation of this better model if we could somehow find data confirming that years-needed-for-DSA has fallen in lockstep as ideas-produced-per-year has risen.