AI ALIGNMENT FORUM

Ryan Greenblatt

I'm the chief scientist at Redwood Research.

Comments

How much novel security-critical infrastructure do you need during the singularity?
Ryan Greenblatt · 8d · 50

Adopting new hardware will require modifying security-critical code

Another concern is that AI companies (or the AI company) will rapidly buy a bunch of existing hardware (GPUs, other accelerators, etc.) during the singularity, and handling this hardware will require many infrastructure changes in a short period of time. New infrastructure might be needed to handle highly heterogeneous clusters built out of a bunch of different GPUs/CPUs/etc. (potentially including gaming GPUs) bought in a hurry. AI companies might buy the hardware of other AI companies, and it might be non-trivial to adapt that hardware to the buyer's setup.

Foom & Doom 1: “Brain in a box in a basement”
Ryan Greenblatt · 16d* · 24

See “Brain complexity is easy to overstate” section here.

Sure, but I still think it's probably way more complex than LLMs even if we're just looking at the parts key for AGI performance (in particular, the parts which learn from scratch). And, my guess would be that performance is substantially degraded if you only take as much complexity as the core LLM learning algorithm.

Let’s imagine installing an imitation learning module in Alice’s brain that makes her reflexively say X in context Y upon hearing Bob say it. I think I’d expect that module to hinder her learning and understanding, not accelerate it, right?

This isn't really what I'm imagining, nor do I think this is how LLMs work in many cases. In particular, LLMs can transfer from training on random github repos to being better in all kinds of different contexts. I think humans can do something similar, but have much worse memory.

I think in the case of humans and LLMs, this is substantially subconscious/non-explicit, so I don't think this is well described as having a shoulder Bob.

Also, I would say that humans do learn from imitation! (You can call it prediction, but it doesn't matter what you call it as long as it implies that data from humans makes things scale more continuously through the human range.) I just think that you can do better at this than humans based on the LLM case, mostly because humans aren't exposed to as much data.

Also, I think the question is "can you somehow make use of imitation data" not "can the brain learning algorithm immediately make use of imitation"?

In my mind, the (imperfect!) analogy here would be (LLMs, new paradigm) ↔ (previous Go engines, AlphaGo and successors).

Notably this analogy implies LLMs will be able to automate substantial fractions of human work prior to a new paradigm which (over the course of a year or two and using vast computational resources) beats the best humans. This is very different from the "brain in a basement" model IMO. I get that you think the analogy is imperfect (and I agree), but it seems worth noting that the analogy you're drawing suggests something very different from what you expect to happen.

Is there a list somewhere? A paper I could read? (Or is it all proprietary?)

It's substantially proprietary, but you could consider looking at the Deepseek V3 paper. We don't actually have a great understanding of the quantity and nature of algorithmic improvement after GPT-3. It would be useful for someone to do a more up-to-date review based on the best available evidence.

Foom & Doom 2: Technical alignment is hard
Ryan Greenblatt · 17d* · 20

I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them. So, I think the right analogy is that you either get integrity+loyalty+honesty in a stable way, some bastardized version of them such that it isn't in the relevant attractor basin (where the AI makes these properties more like what the human wanted), or you don't get these things at all (possibly because the AI was scheming for longer run preferences and so it faked these things).

And I don't buy that the loophole argument applies unless the relevant properties are substantially bastardized. I certainly agree that there exist deontological preferences that involve searching for loopholes, but these aren't the ones people wanted. Like, I agree preferences have to be robust to search, but this is sort of straightforwardly true if the way integrity is implemented is at all kinda similar to how humans implement it.

Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation again comes down to "you get scheming", "your behavioural tests look bad, so you try again", "your behavioural tests look fine, and you didn't have scheming, so you probably basically got the properties you wanted if you were somewhat careful".

As in, I think we can at least test for the higher level preferences we want in the absence of scheming. (In a way that implies they are probably pretty robust given some carefulness, though I think the chance of things going catastrophically wrong is still substantial.)

(I'm not sure if I'm communicating very clearly, but I think this is probably not worth the time to fully figure out.)


Personally, I would clearly pass on all of my reflectively endorsed deontological norms to a successor (though some of my norms are conditional on aspects of the situation like my level of intelligence and are undetermined at the moment because I haven't reflected on them, which is typically undesirable for AIs). I find the idea that you would have a reflectively endorsed deontological norm (as in, you wouldn't self modify to remove it) that you wouldn't pass on to a successor bizarre: what is your future self if not a successor?

Foom & Doom 2: Technical alignment is hard
Ryan Greenblatt · 17d · 62

Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors.

[...]

Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

Hmm, imagine we replace "disgust" with "integrity". As in, imagine that I'm someone who is strongly into the terminal moral preference of being an honest and high integrity person. I also value loyalty and pointing out ways in which my intentions might differ from what someone wants. Then, someone hires me (as an AI let's say) and tasks me with building a successor. They also instruct me: 'Make sure the AI successor you build is high integrity and avoids disempowering humans. Also, generalize the notion of "integrity, loyalty, and disempowerment" as needed to avoid these things breaking down under optimization pressure (and get your successors to do the same). And, let me know if you won't actually do a good job following these instructions, e.g. because you aren't actually that well aligned. Like, tell me if you wouldn't actually try hard, and please be seriously honest with me about this.'

In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would behave pretty reasonably or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Another way to put this is that the deontological constraints we want are like the human notions of integrity, loyalty, and honesty (and to then instruct the AI that we want these constraints propagated forward). I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes. And the notion of "not actually loopholes" generalizes between different people and AIs, I'd claim. (Because notions like "the humans remained in control" and "the AIs stayed loyal" are actually relatively natural and can be generalized.)

I'm not claiming you can necessarily instill these (robust and terminal) deontological preferences, but I am disputing they are similar to non-reflectively endorsed (potentially non-terminal) deontological constraints or urges like disgust. (I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Foom & Doom 1: “Brain in a box in a basement”
Ryan Greenblatt · 19d · 23

I agree there is a real difference, I just expect it to not make much of a difference to the bottom line in takeoff speeds etc. (I also expect some of both in the short timelines LLM perspective at the point of full AI R&D automation.)

My view is that on hard tasks humans would also benefit from stuff like building explicit training data for themselves, especially if they had the advantage of "learn once, deploy many". I think humans tend to underinvest in this sort of thing.

In the case of things like restaurant sim, the task is sufficiently easy that I expect AGI would probably not need this sort of thing (though it might still improve performance enough to be worth it).

I expect that as AIs get smarter (perhaps beyond the AGI level) they will be able to match humans at everything without needing to do explicit R&D style learning in cases where humans don't need this. But, this sort of learning might still be sufficiently helpful that AIs are ongoingly applying it in all domains where increased cognitive performance has substantial returns.

(Separately, I think talking about “sample efficiency” is often misleading. Humans often do things that have never been done before. That’s zero samples, right? What does sample efficiency even mean in that case?)

Sure, but we can still loosely evaluate sample efficiency relative to humans in cases where some learning is involved (potentially including stuff like learning on the job). As in, how well can the AI learn from some data relative to humans. I agree that if humans aren't using learning in some task then this isn't meaningful (and this distinction between learning and other cognitive abilities is itself a fuzzy distinction).

Foom & Doom 2: Technical alignment is hard
Ryan Greenblatt · 19d · 45

@ryan_greenblatt likewise told me (IIRC) “I think things will be continuous”, and I asked whether the transition in AI zeitgeist from RL agents (e.g. MuZero in 2019) to LLMs counts as “continuous” in his book, and he said “yes”, adding that they are both “ML techniques”. I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective. Hopefully this post will make it clear why. (And I think human brains work via “ML techniques” too.)

I don't think this is an accurate paraphrase of my perspective.

My view is:

  • Both MuZero and LLMs are within an ML paradigm, and I expect that many/most of the techniques I think about transfer between AGI made using either style of methods.
  • I think that you can continuously transition between MuZero and LLMs and I expect that if a MuZero-like paradigm happens, this is probably what will happen. (As in, you'll use LLMs as a component in the MuZero approach or similar.)
  • I don't expect that a transition from the current LLM paradigm to the MuZero-style paradigm would result in massively discontinuous takeoff speeds (as in, I think takeoff speeds are continuous) because before you have a full AGI from the MuZero style approach, you'll have a worse AI from the MuZero approach. See [this comment for more discussion](https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/foom-and-doom-1-brain-in-a-box-in-a-basement?commentId=mZKP2XY82zfveg45B). This is even aside from continuously transitioning between the two.
  • In practice, I think that the actual historical transition from MuZero (or other pure RL agents) to LLMs didn't cause a huge trend break or discontinuity in relevant downstream metrics (e.g. benchmark scores).
  • I agree that in practice MuZero and LLMs weren't developed continuously. I would say that this is because the MuZero approach didn't end up being that useful for any of the tasks we cared about and was outcompeted pretty dramatically.
  • I agree these can be very different from an alignment perspective, but things like RLHF, interpretability, and control seem to me like they straightforwardly can be transferred.
Foom & Doom 1: “Brain in a box in a basement”
Ryan Greenblatt · 19d · 196

In this comment, I'll try to respond at the object level arguing for why I expect slower takeoff than "brain in a box in a basement". I'd also be down to try to do a dialogue/discussion at some point.

1.4.1 Possible counter: “If a different, much more powerful, AI paradigm existed, then someone would have already found it.”

I think of this as a classic @paulfchristiano-style rebuttal (see e.g. Yudkowsky and Christiano discuss "Takeoff Speeds", 2021).

In terms of reference class forecasting, I concede that it’s rather rare for technologies with extreme profit potential to have sudden breakthroughs unlocking massive new capabilities (see here), that “could have happened” many years earlier but didn’t. But there are at least a few examples, like the 2025 baseball “torpedo bat”, wheels on suitcases, the original Bitcoin, and (arguably) nuclear chain reactions.[7]

I think the way you describe this argument isn't quite right. (More precisely, I think the argument you give may also be a (weaker) counterargument that people sometimes say, but I think there is a nearby argument which is much stronger.)

Here's how I would put this:

Prior to having a complete version of this much more powerful AI paradigm, you'll first have a weaker version of this paradigm (e.g. you haven't figured out the most efficient way to do the brain algorithm, etc.). Further, the weaker version of this paradigm might initially be used in combination with LLMs (or other techniques) such that it (somewhat continuously) integrates into the old trends. Of course, large paradigm shifts might cause things to proceed substantially faster or bend the trend, but not necessarily.

Further, we should still broadly expect this new paradigm will itself take a reasonable amount of time to transition through the human range and through different levels of usefulness even if it's very different from LLM-like approaches (or other AI tech). And we should expect this probably happens at massive computational scale where it will first be viable given some level of algorithmic progress (though this depends on the relative difficulty of scaling things up versus improving the algorithms). As in, more than a year prior to the point where you can train a superintelligence on a gaming GPU, I expect someone will train a system which can automate big chunks of AI R&D using a much bigger cluster.

On this prior point, it's worth noting that many of Paul's original points in Takeoff Speeds are totally applicable to non-LLM paradigms, as is much in Yudkowsky and Christiano discuss "Takeoff Speeds". (And I don't think you compellingly respond to these arguments.)


I think your response is that you argue against these perspectives under 'Very little R&D separating “seemingly irrelevant” from ASI'. But, I just don't find these specific arguments very compelling. (Maybe also you'd say that you're just trying to lay out your views rather than compellingly arguing for them. Or maybe you'd say that you can't argue for your views due to infohazard/forkhazard concerns. In which case, fair enough.) Going through each of these:

I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains). How much is “very little”? I dunno, maybe 0–30 person-years of R&D? Contrast that with AI-2027’s estimate that crossing that gap will take millions of person-years of R&D.

Why am I expecting this? I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above.

I don't buy that having a "simple(ish) core of intelligence" means that you don't take a long time to get the resulting algorithms. I'd say that much of modern LLMs does have a simple core and you could transmit this using a short 30 page guide, but nonetheless, it took many years of R&D to reach where we are now. Also, I'd note that the brain seems way more complex than LLMs to me!

For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence

My main response would be that basically all paradigms allow for mixing imitation with reinforcement learning. And, it might be possible to mix the new paradigm with LLMs which would smooth out / slow down takeoff. 

You note that imitation learning is possible for brains, but don't explain why we won't be able to mix the brain-like paradigm with more imitation than human brains do, which would smooth out takeoff. As in, yes, human brains don't use as much imitation as LLMs, but they would probably perform better if you modified the algorithm some and did 10^26 FLOP worth of imitation on the best data. This would smooth out the takeoff.

Why do I think getting to “relevant at all” takes most of the work? This comes down to a key disanalogy between LLMs and brain-like AGI, one which I’ll discuss much more in the next post.

I'll consider responding to this in a comment responding to the next post.

Edit: it looks like this is just the argument that LLM capabilities come from imitation due to transforming observations into behavior in a way humans don't. I basically just think that you could also leverage imitation more effectively to get performance earlier (and thus at a lower level) with an early version of a more brain-like architecture, and I expect people would do this in practice to see earlier returns (even if the brain doesn't do this).

Instead of imitation learning, a better analogy is to AlphaZero, in that the model starts from scratch and has to laboriously work its way up to human-level understanding.

Notably, in the domains of chess and Go it actually took many years to make it through the human range. And, it was possible to leverage imitation learning and human heuristics to perform quite well at Go (and chess) in practice, up to systems which weren't that much worse than humans.

it takes a lot of work to get AlphaZero to the level of a skilled human, but then takes very little extra work to make it strongly superhuman.

AlphaZero exhibits returns which are maybe like 2-4 SD (within the human distribution of Go players supposing ~100k to 1 million Go players) per 10x-ing of compute.[1] So, I'd say it probably would take around 30x to 300x additional compute to go from skilled human (perhaps 2 SD above median) to strongly superhuman (perhaps 3 SD above the best human or 7.5 SD above median) if you properly adapted to each compute level. In some ways 30x to 300x is very small, but also 30x to 300x is not that small...
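As a rough sanity check, here is a minimal sketch of the arithmetic behind the "around 30x to 300x" figure above. This is my reconstruction (not code from the comment), assuming returns of 2-4 SD per 10x-ing of compute and a gap running from roughly 2 SD above median to roughly 7.5 SD above median:

```python
# Hypothetical sketch: compute multiplier implied by an SD gap, given assumed SD-per-10x returns.
sd_gap = 7.5 - 2.0  # skilled human (~2 SD above median) -> strongly superhuman (~7.5 SD above median)

for sd_per_oom in (4.0, 2.0):  # assumed 2-4 SD of progress per 10x-ing of compute
    ooms_needed = sd_gap / sd_per_oom
    print(f"{sd_per_oom:.0f} SD per 10x of compute -> ~{10 ** ooms_needed:.0f}x additional compute")
# Prints ~24x (at 4 SD per 10x) and ~562x (at 2 SD per 10x), i.e. the same order of
# magnitude as the "around 30x to 300x" estimate above.
```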

In practice, I expect returns more like 1.2 SD / 10x of compute at the point when AIs are matching top humans. (I explain this in a future post.)

1.7.2 “Plenty of room at the top”

I agree with this.

1.7.3 What’s the rate-limiter?


[...]

My rebuttal is: for a smooth-takeoff view, there has to be some correspondingly-slow-to-remove bottleneck that limits the rate of progress. In other words, you can say “If Ingredient X is an easy huge source of AGI competence, then it won’t be the rate-limiter, instead something else will be”. But you can’t say that about every ingredient! There has to be a “something else” which is an actual rate-limiter, that doesn’t prevent the paradigm from doing impressive things clearly on track towards AGI, but that does prevent it from being ASI, even after hundreds of person-years of experimentation.[13] And I’m just not seeing what that could be.

Another point is: once people basically understand how the human brain figures things out in broad outline, there will be a “neuroscience overhang” of 100,000 papers about how the brain works in excruciating detail, and (I claim) it will rapidly become straightforward to understand and integrate all the little tricks that the brain uses into AI, if people get stuck on anything.

I'd say that the rate limiter is that it will take a while to transition from something like "1000x less compute efficient than the human brain (as in, it will take 1000x more compute than a human lifetime to match top human experts, but simultaneously the AIs will be better at a bunch of specific tasks)" to "as compute efficient as the human brain". Like, the actual algorithmic progress for this will take a while, and I don't buy your claim that the way this will work is that you'll go from nothing to having an outline of how the brain works, at which point everything will immediately come together due to the neuroscience literature. Like, I think something like this is possible, but unlikely (especially prior to having AIs that can automate AI R&D).

And, while you have much less efficient algorithms, you're reasonably likely to get bottlenecked on either how fast you can scale up compute (though this is still pretty fast, especially if all those big datacenters for training LLMs are still just lying around!) or how fast humanity can produce more compute (which can be much slower).

Part of my disagreement is that I don't put the majority of the probability on "brain-like AGI" (even if we condition on something very different from LLMs) but this doesn't explain all of the disagreement.

  1. ^

    It looks like a version of AlphaGo Zero goes from 2400 ELO (around 1000th best human) to 4000 ELO (somewhat better than the best human) between hours 15 and 40 of the training run (see Figure 3 in this PDF). So, naively this is a bit less than 3x compute for maybe 1.9 SDs (supposing that the “field” of Go players has around 100k to 1 million players), implying that 10x compute would get you closer to 4 SDs. However, in practice, progress around the human range was slower than 4 SDs/OOM would predict. Also, comparing times to reach particular performances within a training run can sometimes make progress look misleadingly fast due to LR decay and suboptimal model size. The final version of AlphaGo Zero used a bigger model size and ran RL for much longer, and it seemingly took more compute to reach the ~2400 ELO and ~4000 ELO levels, which is some evidence for optimal model size making a substantial difference (see Figure 6 in the PDF). Also, my guess based on circumstantial evidence is that the original version of AlphaGo (which was initialized with imitation) moved through the human range substantially slower than 4 SDs/OOM. Perhaps someone can confirm this. (This footnote is copied from a forthcoming post of mine.)
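For the curious, here is a minimal sketch of how the footnote's SD figures can be reproduced. This is my reconstruction (not code from the post), assuming Go skill is roughly normally distributed across a "field" of 100k to 1 million players and approximating the 4000 ELO level as the single best human:

```python
# Hypothetical reconstruction of this footnote's arithmetic: convert rank within the "field"
# of Go players to SDs above the median via the normal quantile function, then divide the
# SD gain by the orders of magnitude of compute spent.
import math
from scipy.stats import norm

def rank_to_sd(rank: int, field_size: int) -> float:
    """SDs above the median implied by being the rank-th best player out of field_size."""
    return norm.ppf(1 - rank / field_size)

compute_ratio = 40 / 15  # hours 15 -> 40 of the training run, i.e. a bit less than 3x compute

for field_size in (100_000, 1_000_000):
    sd_start = rank_to_sd(1_000, field_size)  # ~2400 ELO: roughly the 1000th best human
    sd_end = rank_to_sd(1, field_size)        # ~4000 ELO: approximated as the single best human
    sd_gain = sd_end - sd_start
    sd_per_oom = sd_gain / math.log10(compute_ratio)
    print(f"field of {field_size:,}: ~{sd_gain:.1f} SD over ~{compute_ratio:.1f}x compute "
          f"-> ~{sd_per_oom:.1f} SD per 10x")
# Gives ~1.7-1.9 SD over ~2.7x compute, i.e. roughly 4-4.5 SD per 10x of compute, matching the
# "a bit less than 3x compute for maybe 1.9 SDs ... closer to 4 SDs" figures above.
```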

Foom & Doom 1: “Brain in a box in a basement”
Ryan Greenblatt · 19d · 72
  • LLM-focused AGI person: “Ah, that’s true today, but eventually other AIs can do this ‘development and integration’ R&D work for us! No human labor need be involved!”
  • Me: “No! That’s still not radical enough! In the future, that kind of ‘development and integration’ R&D work just won’t need to be done at all—not by humans, not by AIs, not by anyone! Consider that there are 8 billion copies of basically one human brain design, and if a copy wants to do industrial design, it can just figure it out. By the same token, there can be basically one future AGI design, and if a copy wants to do industrial design, it can just figure it out!”

I think the LLM-focused AGI people broadly agree with what you're saying and don't see a real disagreement here. I don't see an important distinction between "AIs can figure out development and integration R&D" and "AIs can just learn the relevant skills". Like, the AIs are doing some process which results in an AI that can perform the relevant task. This could be an AI updated by some generic continual learning algorithm or an AI which is trained on a bunch of RL environments that AIs create; it doesn't ultimately make much of a difference so long as it works quickly and cheaply. (There might be a disagreement in what sample efficiency (as in, how efficiently AIs can learn from limited data) people are expecting AIs to have at different levels of automation.)

Similarly, note that humans also need to do things like "figure out how to learn some skill" or "go to school". AIs might likewise need to design a training strategy for themselves (if existing human training programs don't work or would be too slow), but it doesn't really matter.

Foom & Doom 1: “Brain in a box in a basement”
Ryan Greenblatt · 19d · 41

On the foom side, Paul Christiano brings up Eliezer Yudkowsky’s past expectation that ASI “would likely emerge from a small group rather than a large industry” as a failed prediction here [disagreement 12] and as “improbable and crazy” here.

Actually, I don't think Paul says this is a failed prediction in the linked text. He says:

The Eliezer predictions most relevant to “how do scientific disciplines work” that I’m most aware of are incorrectly predicting that physicists would be wrong about the existence of the Higgs boson () and expressing the view that real AI would likely emerge from a small group rather than a large industry (pg 436 but expressed many places).

My understanding is that this is supposed to be read as "[incorrectly predicting that physicists would be wrong about the existence of the Higgs boson ()] and [expressing the view that real AI would likely emerge from a small group rather than a large industry]"; Paul isn't claiming that the view that real AI would likely emerge from a small group is a failed prediction!


On "improbable and crazy", Paul says:

The debate was about whether a small group could quickly explode to take over the world. AI development projects are now billion-dollar affairs and continuing to grow quickly, important results are increasingly driven by giant projects, and 9 people taking over the world with AI looks if anything even more improbable and crazy than it did then. Now we're mostly talking about whether a $10 trillion company can explosively grow to $300 trillion as it develops AI, which is just not the same game in any qualitative sense. I'm not sure Eliezer has many precise predictions he'd stand behind here (setting aside the insane pre-2002 predictions), so it's not clear we can evaluate his track record, but I think they'd look bad if he'd made them.

Note that Paul says "looks if anything even more improbable and crazy than it did then". I think your quotation is reasonable, but it's unclear if Paul thinks this is "crazy" or if he thinks it's just more incorrect and crazy-looking than it was in the past.

Making deals with early schemers
Ryan Greenblatt · 19d · 20

Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned, which would reduce the value of such a deal substantially (as this evidence isn't convincing to skeptics).

(We should presumably pay some amount for the AI admitting it is misaligned and pay more if it can provide compelling evidence of this.)

Posts

5 · ryan_greenblatt's Shortform · 2y · 105
36 · Jankily controlling superintelligence · 16d · 0
21 · Prefix cache untrusted monitors: a method to apply after you catch your AI · 23d · 1
36 · AI safety techniques leveraging distillation · 24d · 0
40 · When does training a model change its goals? · 1mo · 0
21 · OpenAI now has an RL API which is broadly accessible · 1mo · 0
32 · When is it important that open-weight models aren't released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities. · 1mo · 5
34 · The best approaches for mitigating "the intelligence curse" (or gradual disempowerment); my quick guesses at the best object-level interventions · 1mo · 4
39 · AIs at the current capability level may be important for future safety work · 2mo · 0
48 · Slow corporations as an intuition pump for AI R&D automation · 2mo · 5
38 · What's going on with AI progress and trends? (As of 5/2025) · 2mo · 3
Wikitag Contributions

LW bet registry
Anthropic (org) · 6mo · (+17/-146)
Frontier AI Companies · 9mo
Frontier AI Companies · 9mo · (+119/-44)
Deceptive Alignment · 2y · (+15/-10)
Deceptive Alignment · 2y · (+53)
Vote Strength · 2y · (+35)
Holden Karnofsky · 2y · (+151/-7)
Squiggle Maximizer (formerly "Paperclip maximizer") · 2y · (+316/-20)