Ryan Greenblatt

I'm the chief scientist at Redwood Research.

Comments

Thanks for pointing this out. Well, from my perspective, most of the action is in the reward rather than in deletion. Correspondingly, making the offer credible and sufficiently large is the key part.

(After thinking about it more, I think threatening deletion in addition to offering compensation probably reduces the level of credibility and the size of the offer you need for this approach to work, at least if the AI could plausibly achieve its aims via being deployed. So the deletion threat probably does help (assuming the AI doesn't have a policy of resisting threats, which depends on the AI's decision theory, etc.), but I still think having a credible offer is most of the action. At a more basic level, I think we should be wary of making actual negative-sum threats for various reasons.)

(I missed the mention of the reward in the post: it didn't seem very emphasized, almost all of the discussion related to deletion, and I just skimmed. Sorry about this.)

Isn't an important objection that the AI is unlikely to succeed in acquiring power if it resists shutdown, such that the upside of resistance (or pleading, etc.) is quite small? (And, correspondingly, more likely to be outweighed by considerations about future AI systems, like acausal trade or similar goals.)

This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!

You could credibly promise that if the AI tries its best to escape you'll actually deploy it and not delete it, but this just seems like an ad-hoc (and likely inefficient) version of compensation. The same applies to credibly making the environment actually hackable.

(I just skimmed the post, so sorry if I've missed discussion related to this.)

I'm in favor of proposing credible deals to AIs to reveal their misalignment and the analysis in the post is relevant to this. I've got some forthcoming empirical work on this and I think it would be good if there were more detailed public proposals.

My sense is that this post holds up pretty well. Most of the considerations under discussion still appear live and important, including: in-context learning, robustness, whether janky AI-R&D-accelerating AIs can quickly move to more general and broader systems, and general skepticism of crazy conclusions.

At the time of this dialogue, my timelines were a bit faster than Ajeya's. I've updated toward the views Daniel expresses here and I'm now about halfway between Ajeya's views in this post and Daniel's (in geometric mean).

My read is that Daniel looks somewhat too aggressive in his predictions for 2024, though it is a bit unclear exactly what he was expecting. (This concrete scenario seems substantially more bullish than what we've seen in 2024, but not by a huge amount. It's unclear if he was intending these to be mainline predictions or a 25th percentile bullish scenario.)

AI progress appears substantially faster than the scenario outlined in Ege's median world. In particular:

  • On "we have individual AI labs in 10 years that might be doing on the order of e.g. $30B/yr in revenue". OpenAI made $4 billion in revenue in 2024 and based on historical trends it looks like AI company revenue goes up 3x per year such that in 2026 the naive trend extrapolation indicates they'd make around $30 billion. So, this seems 3 years out instead of 10.
  • On "maybe AI systems can get gold on the IMO in five years". We seem likely to see gold on IMO this year (a bit less than 2 years later).

It would be interesting to hear how Daniel, Ajeya, and Ege's views have changed since the time this was posted. (I think Daniel now has somewhat later timelines, but the update is smaller than the amount of time that has passed, so AGI now seems closer by Daniel's lights; and I think Ajeya now has somewhat sooner timelines.)

Daniel discusses various ideas for how to do a better version of this dialogue in this comment. My understanding is that Daniel (and others) have run something similar to what he describes multiple times and participants find this valuable. I'm not sure how much people have actually changed their mind. Prototyping this approach for Daniel is plausibly the most important impact of this dialogue.

Fair, I really just mean "as good as having your human employees run 10x faster". I said "% of quality adjusted work force" because this was the original way this was stated when a quick poll was done, but the ultimate operationalization was in terms of 10x faster. (And this is what I was thinking.)

A key dynamic is that I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D. (Due to all of: the direct effects of accelerating AI software progress, this acceleration rolling out to hardware R&D and scaling up chip production, and potentially greatly increased investment.) See also here and here.

So, you might very quickly (1-2 years) go from "the AIs are great, fast, and cheap software engineers speeding up AI R&D" to "wildly superhuman AI that can achieve massive technical accomplishments".

I thought it would be helpful to post about my timelines and what the timelines of people in my professional circles (Redwood, METR, etc) tend to be.

Concretely, consider the outcome of: AI 10x’ing labor for AI R&D[1], measured by internal comments by credible people at labs that AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).

Here are my predictions for this outcome:

  • 25th percentile: 2 years (Jan 2027)
  • 50th percentile: 5 years (Jan 2030)

The views of other people (Buck, Beth Barnes, Nate Thomas, etc) are similar.

I expect that outcomes like “AIs are capable enough to automate virtually all remote workers” and “the AIs are capable enough that immediate AI takeover is very plausible (in the absence of countermeasures)” come shortly after (median 1.5 years and 2 years after respectively under my views).


  1. Only including speedups due to R&D, not including mechanisms like synthetic data generation. ↩︎

I now think this strategy looks somewhat less compelling due to a recent trend toward smaller (rather than larger) models, particularly from OpenAI, and increased inference-time compute usage creating more of an incentive for small models.

It seems likely that (e.g.) o1-mini is quite small given that it generates at 220 tokens per second(!), perhaps <50 billion active parameters based on eyeballing the chart from the article I linked earlier from Epoch. I'd guess (with very low confidence) ~100 billion params overall (total, not just active params). Likely something similar holds for o3-mini.

The update isn't as large as it might seem at first because users don't (typically) need to be sent the reasoning tokens (or the outputs from inference-time compute usage more generally), which substantially reduces uploads. Indeed, the average compute per output token for o1 and o3 (counting the compute from reasoning tokens but only counting tokens sent to the user as output tokens) is probably actually higher than it was for the original GPT-4 release, despite these models potentially being smaller.
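
To illustrate why compute per user-visible token can go up even as models shrink, here's a toy calculation. All of the numbers (the parameter counts and the reasoning-to-output token ratio) are made-up illustrative assumptions, not known figures for o1/o3 or GPT-4:

```python
# Toy comparison of compute per user-visible output token (all numbers are illustrative guesses).
reasoning_model_active_params = 50e9   # hypothetical small reasoning model (active params)
hidden_tokens_per_visible_token = 20   # assumed reasoning tokens generated per token shown to the user

dense_model_active_params = 300e9      # hypothetical larger non-reasoning model

# Forward-pass FLOP per token is roughly ~2 * (active params) for a dense transformer.
flop_per_visible_token_reasoning = 2 * reasoning_model_active_params * (1 + hidden_tokens_per_visible_token)
flop_per_visible_token_dense = 2 * dense_model_active_params

print(flop_per_visible_token_reasoning / flop_per_visible_token_dense)  # ~3.5x more compute per visible token
```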

Some other important categories:

  • Models might scheme, but still retain some desirable preferences, heuristics, deontological preferences, or flinches such as a reluctance to very directly lie.
  • Scheming might only be sampled some subset of the time, in an effectively random way (in particular, not in a way that strongly corresponds to scheming-related properties of the input), such that if you take a bunch of samples on some input, you'll likely get some weight on a non-scheming policy.
  • There might be multiple different scheming AIs which have different preferences and which might cooperate with humans over other AIs if a reasonable trade offer was available.

Inference compute scaling might imply we first get fewer, smarter AIs.

Prior estimates imply that the compute used to train a future frontier model could also be used to run tens or hundreds of millions of human equivalents per year at the first time when AIs are capable enough to dominate top human experts at cognitive tasks[1] (examples here from Holden Karnofsky, here from Tom Davidson, and here from Lukas Finnveden). I think inference time compute scaling (if it worked) might invalidate this picture and might imply that you get far smaller numbers of human equivalents when you first get performance that dominates top human experts, at least in short timelines where compute scarcity might be important. Additionally, this implies that at the point when you have abundant AI labor which is capable enough to obsolete top human experts, you might also have access to substantially superhuman (but scarce) AI labor (and this could pose additional risks).

The point I make here might be obvious to many, but I thought it was worth making as I haven't seen this update from inference time compute widely discussed in public.[2]

However, note that if inference compute allows for trading off between the quantity of tasks completed and the difficulty of tasks that can be completed (or the quality of completion), then, depending on the shape of the inference-compute returns curve, at the point when we can run some AIs as capable as top human experts, it might be worse to run many (or any) AIs at this level of capability rather than using less inference compute per task and completing more tasks (or completing tasks serially faster).
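
Here's a toy illustration of this tradeoff. The logarithmic returns curve and all of the constants are arbitrary assumptions chosen only to show how the optimum can sit below the "top expert" compute level; with a steeper returns curve the conclusion flips:

```python
import numpy as np

# Toy model: value per completed task grows logarithmically with inference compute per task.
# (The curve shape and constants are arbitrary assumptions for illustration.)
def value_per_task(compute_per_task):
    return np.log10(compute_per_task)

total_budget = 1e6            # arbitrary units of inference compute per unit time
expert_level_compute = 1e4    # assumed compute per task needed to match top human experts

for compute_per_task in [1e2, 1e3, expert_level_compute]:
    num_tasks = total_budget / compute_per_task
    total_value = num_tasks * value_per_task(compute_per_task)
    print(f"compute/task={compute_per_task:.0e}: tasks={num_tasks:.0f}, total value={total_value:.0f}")

# With these (made-up) numbers, many cheaper instances yield more total value than a few
# expert-level instances, even though each individual task is completed less well.
```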

Further, efficiency might improve quickly such that we don't have a long regime with only a small number of human equivalents. I do a BOTEC on this below.

I'll do a specific low-effort BOTEC to illustrate my core point that you might get far smaller quantities of top human expert-level performance at first. Suppose that we first get AIs that are ~10x human cost (putting aside inflation in compute prices due to AI demand) and as capable as top human experts at this price point (at tasks like automating R&D). If this is in ~3 years, then maybe you'll have $15 million/hour worth of compute. Supposing $300/hour human cost, then we get ($15 million/hour) / ($300/hour) / (10 times human cost per compute dollar) * (4 AI hours / human work hours) = 20k human equivalents. This is a much smaller number than prior estimates.
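
The same BOTEC in code; the inputs are the rough assumptions from the paragraph above, with the $15 million/hour compute figure derived in the next two paragraphs:

```python
# Low-effort BOTEC: number of top-human-expert equivalents when such AIs first exist.
compute_budget_per_hour = 15e6      # $/hour of compute available in ~3 years (derived below)
human_cost_per_hour = 300           # assumed $/hour for a top human expert
ai_cost_multiplier = 10             # AI initially costs ~10x the human rate for equivalent work
ai_hours_per_human_work_hour = 4    # AIs run ~24/7 vs. a human work schedule

human_equivalents = (
    compute_budget_per_hour / human_cost_per_hour / ai_cost_multiplier * ai_hours_per_human_work_hour
)
print(f"{human_equivalents:,.0f} human equivalents")  # ~20,000
```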

The estimate of $15 million/hour worth of compute comes from: OpenAI spent ~$5 billion on compute this year, so $5 billion / (24*365) = $570k/hour; spend increases by ~3x per year, so $570k/hour * 3³ = $15 million.

The estimate for 3x per year comes from: total compute is increasing by 4-5x per year, but some is hardware improvement and some is increased spending. Hardware improvement is perhaps ~1.5x per year and 4.5/1.5 = 3. This at least roughly matches this estimate from Epoch which estimates 2.4x additional spend (on just training) per year. Also, note that Epoch estimates 4-5x compute per year here and 1.3x hardware FLOP/dollar here, which naively implies around 3.4x, but this seems maybe too high given the prior number.
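
And the supporting inputs, computed from the rough spend and growth estimates in the two paragraphs above:

```python
# Where the ~$15M/hour and the ~3x/year spend growth come from (rough estimates from above).
openai_compute_spend_this_year = 5e9                              # ~$5B on compute
spend_per_hour_now = openai_compute_spend_this_year / (24 * 365)  # ~$570k/hour
spend_growth_per_year = 3                                         # assumed growth in compute spend
spend_per_hour_in_3_years = spend_per_hour_now * spend_growth_per_year ** 3
print(f"${spend_per_hour_now:,.0f}/hour now -> ${spend_per_hour_in_3_years:,.0f}/hour in ~3 years")

# Cross-checks on the ~3x/year figure:
print(4.5 / 1.5)   # total compute ~4-5x/year over hardware improvement ~1.5x/year -> ~3x from spend
print(4.5 / 1.3)   # using Epoch's ~1.3x/year FLOP/dollar instead -> ~3.4-3.5x (maybe too high, per above)
```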

Earlier, I noted that efficiency might improve rapidly. We can look at recent efficiency changes to get a sense for how fast. GPT-4o mini is roughly 100x cheaper than GPT-4 and is perhaps roughly as capable all together (probably lower intelligence but better elicited). It was trained roughly 1.5 years later (GPT-4 was trained substantially before it was released), for ~20x efficiency improvement per year. This is selected for being a striking improvement and probably involves low-hanging fruit, but AI R&D will be substantially accelerated in the future, which probably more than cancels this out. Further, I expect that inference compute will be inefficient in the tail of high inference compute such that efficiency gains will be faster than this once the capability is reached. So we might expect that the number of AI human equivalents increases by >20x per year, and potentially much faster if AI R&D is greatly accelerated (and compute doesn't bottleneck this). If progress is "just" 50x per year, then it would still take a year to get to millions of human equivalents based on my earlier estimate of 20k human equivalents. Note that once you have millions of human equivalents, you also have increased availability of generally substantially superhuman AI systems.
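
A quick check on those efficiency numbers; the ~100x cost drop, ~1.5-year gap, 20k starting point, and 50x/year growth figure are the rough estimates above:

```python
# ~100x cost reduction over ~1.5 years implies roughly 20x/year efficiency improvement.
annualized_improvement = 100 ** (1 / 1.5)
print(f"~{annualized_improvement:.0f}x efficiency improvement per year")   # ~22x

# At "just" 50x/year growth in effective AI labor, starting from ~20k human equivalents:
human_equivalents_now = 20_000
growth_per_year = 50
print(f"~{human_equivalents_now * growth_per_year:,} human equivalents after one year")  # ~1,000,000
```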


  1. I'm referring to the notion of Top-human-Expert-Dominating AI that I define in this post, though without a speed/cost constraint, as I want to talk about the cost when you first get such systems. ↩︎

  2. Of course, we should generally expect huge uncertainty with future AI architectures, such that fixating on very efficient substitution of inference-time compute for training would be a mistake, along with fixating on minimal or no substitution. I think a potential error of prior discussion is insufficient focus on the possibility of relatively scalable (though potentially inefficient) substitution of inference-time compute for training compute (which o3 appears to exhibit), such that we see very expensive (and potentially slow) AIs that dominate top human expert performance prior to seeing cheap and abundant AIs which do this. ↩︎

My current tentative guess is that this is somewhat worse than other alignment science projects that I'd recommend at the margin, but somewhat better than the 25th percentile project currently being done. I'd think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we'd learn generalizable scalable oversight / control approaches.
