Ryan Greenblatt

I'm the chief scientist at Redwood Research.

I've updated towards a bit longer based on some recent model releases and further contemplation.

I'd now say:

  • 25th percentile: Oct 2027
  • 50th percentile: Jan 2031

(FWIW, I didn't use them as synonyms in the post except for saying "aka sandbagging" in the title which maybe should have been "aka a strategy for sandbagging". I thought "aka sandbagging" was sufficiently accurate for a title and saved space.)

I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 year without needing more hardware”.

This is as of the point of full AI R&D automation? Or as of any point?

Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values

This is true for any given X, but if there are many things like X which are independently 5% likely to be explored, the model is in trouble.

Like, the model only needs to be somewhat confident that it's exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident.

I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.
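
To make the "many independent X's" point concrete, here's a rough sketch (the 5% figure comes from the quote above; the behavior counts are purely illustrative assumptions):

```python
# Rough arithmetic only: if each behavior the model avoids exploring was
# independently explored by humans with 5% probability, the chance that the
# model never gets caught shrinks geometrically with the number of behaviors.

def p_never_caught(p_explored_by_humans: float, num_behaviors: int) -> float:
    """Chance that humans explored none of the avoided behaviors."""
    return (1 - p_explored_by_humans) ** num_behaviors

print(p_never_caught(0.05, 1))    # 0.95     -- skipping one such behavior is a decent bet
print(p_never_caught(0.05, 50))   # ~0.077   -- skipping 50 will probably get caught
print(p_never_caught(0.05, 200))  # ~3.5e-05 -- skipping 200 is essentially certain to get caught
```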

I think I basically agree with much of this comment. However:

  • Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident that humans wouldn't be able to do something in order to avoid exploring it. E.g., if we have 100,000 carefully targeted trajectory edits/imitation examples, then to avoid getting caught, the model would need to ensure that none of the behaviors it avoids exploring are covered by any of these examples (see the arithmetic sketch after this list). This might force the AI to explore lots of superhuman stuff.
  • Further, if the model messes up, this might cascade as discussed.
  • Ensembling multiple models for exploration (or multiple prompt variants, or various jitters of the internals) might mean that all the models need to coordinate on the same boundary of what not to explore. Otherwise, one variant will explore some behavior that another avoids; that other model then gets updated to sandbag less, and this might cascade. So, it might be non-trivial for models to exploration hack (given substantial countermeasures) even when the behavior is quite superhuman.
  • None of these mechanisms suffice for "robustly avoiding sandbagging", but they suggest that countermeasures are still useful in superhuman regimes. (Of course, betting the world on countermeasures which have a 50% chance of working is crazy, but that doesn't mean that we shouldn't try to make the countermeasures better!)
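
A rough sketch of the arithmetic behind the first bullet (the 100,000 figure is from the bullet; the per-example coverage probability and the 95% survival target are illustrative assumptions):

```python
# Illustrative numbers: how confident the model must be, per avoided behavior,
# for none of 100,000 targeted trajectory edits / imitation examples to cover it.

n_examples = 100_000
target_survival = 0.95  # suppose the model wants a 95% chance of never being caught

# Solve (1 - p)^n >= target for the per-example coverage probability p:
max_p_per_example = 1 - target_survival ** (1 / n_examples)
print(max_p_per_example)  # ~5.1e-07 per example

# Even a 1-in-20,000 per-example chance of coverage is far too risky:
print((1 - 1 / 20_000) ** n_examples)  # ~0.0067 chance of never being caught
```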

The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, i.e., a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).

This would be true if we had an infinite amount of data, but we don't. You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy: just add a small weight on a uniform prior over tokens, like they did in old-school Atari RL.)
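
To illustrate the gap between those two properties, here's a minimal sketch (mix_with_uniform is a hypothetical helper, and the vocabulary size, mixing weight, and strategy length are made-up illustrative values):

```python
# Minimal sketch: a small uniform mixture avoids literal mode collapse but puts
# negligible probability on any specific long, unexplored strategy.

import math

def mix_with_uniform(policy_probs, eps=0.01):
    """Return (1 - eps) * policy + eps * uniform over the token vocabulary."""
    v = len(policy_probs)
    return [(1 - eps) * p + eps / v for p in policy_probs]

# Every token now has probability >= eps / vocab_size, so no action is literally
# unreachable. But a good strategy the collapsed policy ignores, requiring ~200
# specific tokens from a 100k-token vocabulary, is never actually explored:
vocab_size, eps, strategy_len = 100_000, 0.01, 200
log10_p = strategy_len * math.log10(eps / vocab_size)
print(log10_p)  # ~ -1400: effectively zero probability under the uniform component
```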

First, I think RL is more like 10,000x than 100x less efficient than SL (DeepSeek-V3 probably can't be compressed much below 10 GB, while the DeepSeek-R1-Zero stage can probably be compressed to 1 MB of transcripts, despite both being roughly 1e24 FLOP).
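
Back-of-envelope version of that 10,000x claim (the 10 GB and 1 MB figures are the rough estimates above, not measured numbers):

```python
# Back-of-envelope: both stages use ~1e24 FLOP, but the information learned
# differs by roughly 10 GB vs 1 MB (rough estimates, not measurements).

pretrain_bytes = 10e9  # ~10 GB: rough floor on compressing what DeepSeek-V3 learned
rl_bytes = 1e6         # ~1 MB: rough size of transcripts capturing the R1-Zero stage

print(pretrain_bytes / rl_bytes)  # 10,000x less learned at equal compute for RL
```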

Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.

How much FLOP do you think it takes for large-scale RL to learn codes? Let's say RL learns 100x less per FLOP than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
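
Spelling out the arithmetic behind the ~1e27 estimate (both penalty factors are the illustrative assumptions above):

```python
# The ~1e27 FLOP estimate, made explicit (all factors are assumptions, not measurements).

pretrain_flop = 1e24         # pretraining runs that do learn new languages to some extent
rl_efficiency_penalty = 100  # assume RL learns ~100x less per FLOP than SL
rl_focus_penalty = 10        # and only ~10% of that learning goes to new languages / ways of thinking

rl_flop_needed = pretrain_flop * rl_efficiency_penalty * rl_focus_penalty
print(f"{rl_flop_needed:.0e}")  # 1e+27: ~3 OOMs above the pretraining run
```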

I think we'll probably see 1e27 FLOP RL runs next year?

Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal (it required >1e24 FLOP), but still.

I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?
