Ryan Greenblatt

I work at Redwood Research.

Wiki Contributions


  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • ...
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • ...

The AI Optimists don't make this argument AFAICT, but I think optimism about effectively utilizing "human level" models should transfer to a considerable amount of optimism about smarter than human models due to the potential for using these "human level" systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible heavily accelerate R&D[1] even at around qualitatively "human level" capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the "default" trajectory because we'll want to run extremely large numbers of these systems at high speeds.)

The idea of using human level models like this has a bunch of important caveats which mean you shouldn't end up being extremely optimistic overall IMO[2]:

  •  It's not clear that "human level" will be a good description at any point. AIs might be way smarter than humans in some domains while way dumber in other domains. This can cause the oversight issues mentioned in the parent comment to manifest prior to massive acceleration of alignment research. (In practice, I'm moderately optimistic here.)
  • Is massive effective acceleration enough? We need safety technology to keep up with capabilites and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety which should make us somewhat more optimistic. But, it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans we need the equivalent of having the 300 best safety researchers work for 500 years and we won't have enough acceleration and delay to manage this. (In practice I'm somewhat optimistic here so long as we can get a 1-3 year delay at a critical point.)
  • Will "human level" systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment etc.). This objection holds even if we aren't worried about catastrophic misalignment risk.
  1. ^

    At least R&D which isn't very limited by physical processes.

  2. ^

    I think <1% doom seems too optimistic without more of a story for how we're going to handle super human models.

I find myself unsure which conclusion this is trying to argue for.

Here are some pretty different conclusions:

  • Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it's a problem after human obsolescence for our trusted AI successors, but who cares).
  • There aren't any solid arguments for deceptive alignment[1]. So, we certainly shouldn't be confident in deceptive alignment (e.g. >90%), though we can't total rule it out (prior to human obsolescene). Perhaps deceptive alignment is 15% likely to be a serious problem overall and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
  • Deceptive alignment is <<1% likely for scaled up LLM agents (prior to human obsolescence). Who knows about other architectures.

There is a big difference between <<1% likely and 10% likely. I basically agree with "not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment", but I don't think this leaves me in a <<1% likely epistemic state.

  1. Other than noting that it could be behaviorally consistent for powerful models: powerful models are capable of deceptive alignment. ↩︎

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn't plot against us or otherwise screw us over.)

Even if we didn't have the visible thoughts property in the actual deployed system, the fact that all of the retargeting behavior is based on explicit human engineering is still relevant and contradicts the core claim Nate makes in this post IMO.

It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."

Basically, but more centrally that in literal current LLM agents the scary part of the system that we don't understand (the LLM) doesn't generalize in any scary way due to wanting while we can still get the overall system to achieve specific long term outcomes in practice. And that it's at least plausible that this property will be preserved in the future.

I edited my earlier comment to hopefully make this more clear.

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said.

I think it contradicts things Nate says in this post directly. I don't know if it contradicts things you've said.

To clarify, I'm commenting on the following chain:

First Nate said:

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

as well as

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Then, Paul responded with

I think this is a semantic motte and bailey that's failing to think about mechanics of the situation. LM agents already have the behavior "reorient towards a target in response to obstacles," but that's not the sense of "wanting" about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked "how can I achieve X in this situation?" will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn't what you need for AI risk arguments!

Then you said

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

And I was responding to this.

So, I was just trying to demonstrate at least one plausible example of a system which plausibly could pursue long term goals and doesn't have the sense of wanting needed for AI risk arguments. In particular, LLM agents where the retargeting is purely based on human engineering (analogous to a myopic employee retargeted by a manager who cares about longer term outcomes).

This directly contradicts "Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.".

(I'm obviously not Paul)

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

In the case of literal current LLM agents with current models:

  • Humans manually engineer the prompting and scaffolding (and we understand how and why it works)
  • We can read the intermediate goals directly via just reading the CoT.

Thus, we don't have risk from hidden, unintended, or unpredictable objectives. There is no reason to think that goal seeking behavior due to the agency from the engineered scaffold or prompting will results in problematic generalization.

It's unclear if this will hold in the future even for LLM agents, but it's at least plausible that this will hold (which defeats Nate's rather confident claim). In particular, we could run into issues from the LLM used within the LLM agent having hidden goals, but insofar as the retargeting and long run agency is a human engineered and reasonably understood process, the original argument from Nate doesn't seem very relevant to risk. We also could run into issues from imitating very problematic human behavior, but this seems relatively easy to notice in most cases as it would likely be discussed outload with non-negligable probability.

We'd also lose this property if we did a bunch of RL and most of the power of LLM agents was coming from this RL rather than imitating human optimization or humans engineering particular optimization processes.

See also this comment from Paul on a similar topic.

(Agreed except that "inference-time safety techiques" feels overly limiting. It's more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn't discriminated by our validation set and other measurements. I hope this isn't too incomprehensible, but don't worry if it is, this point isn't that important.)

Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon.

(I work at RR)

Yep, indeed I would consider "control evaluations" to be a method of "AI control". I consider the evaluation and the technique development to be part of a unified methodology (we'll describe this more in a forthcoming post).

(I work at RR)

More generally, it seems like we can build systems that succeed in accomplishing long run goals without having the core components which are doing this actually 'want' to accomplish any long run goal.

It seems like this is common for corporations and we see similar dynamics for language model agents.

(Again, efficiency concerns are reasonable.)

Load More