Ryan Greenblatt

I work at Redwood Research.

Comments

The "AI is easy to control" piece does talk about scaling to superhuman AI:

In what follows, we will argue that AI, even superhuman AI, will remain much more controllable than humans for the foreseeable future. Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability.

If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through.

However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can only ensure the following weaker notion of control: "we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc.) such that we can prevent generation N+1 AIs from causing catastrophic outcomes (like AI takeover) while using those AIs to speed up the labor of generation N by a large multiple". This notion of control doesn't (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn't the only issue, but it is a sufficient counterexample.)

This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually.

(See also my discussion of using human-level-ish AIs to automate safety research in the sibling comment.)

Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime.

This isn't true. It could be that producing an arbitrarily scalable solution to alignment takes X cognitive resources while building an uncontrollably powerful AI takes Y cognitive resources, with X < Y. In that case, aligned AGIs could finish the alignment work before anyone crosses the danger threshold, even if other AGIs are working on other things in the meantime.

(Also, this plan doesn't necessarily require aligning "human level" AIs, just being able to get work out of them with sufficiently high productivity and low danger.)

  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • ...
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • ...

The AI Optimists don't make this argument AFAICT, but I think optimism about effectively utilizing "human level" models should transfer to a considerable amount of optimism about smarter-than-human models due to the potential for using these "human level" systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible to heavily accelerate R&D[1] even at around qualitatively "human level" capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages, these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the "default" trajectory, because we'll want to run extremely large numbers of these systems at high speeds.)

The idea of using human level models like this has a bunch of important caveats which mean you shouldn't end up being extremely optimistic overall IMO[2]:

  •  It's not clear that "human level" will be a good description at any point. AIs might be way smarter than humans in some domains while way dumber in other domains. This can cause the oversight issues mentioned in the parent comment to manifest prior to massive acceleration of alignment research. (In practice, I'm moderately optimistic here.)
  • Is massive effective acceleration enough? We need safety technology to keep up with capabilities, and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety, which should make us somewhat more optimistic. But it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans we need the equivalent of having the 300 best safety researchers work for 500 years, and we won't have enough acceleration and delay to manage this (see the rough arithmetic sketch after this list). (In practice I'm somewhat optimistic here so long as we can get a 1-3 year delay at a critical point.)
  • Will "human level" systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment etc.). This objection holds even if we aren't worried about catastrophic misalignment risk.
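To make the "enough acceleration and delay" worry a bit more concrete, here's a rough back-of-the-envelope sketch (all the specific numbers here are illustrative assumptions, not estimates I'd defend):

```python
# Illustrative arithmetic for the "is massive acceleration enough?" worry.
# Every number below is a made-up assumption for the sake of the example.

required_researcher_years = 300 * 500  # "300 best researchers for 500 years" = 150,000

delay_years = 2        # suppose we can buy roughly a 2 year delay at the critical point
speed_multiplier = 10  # suppose each AI instance does ~10x a top researcher's work per year

# Top-researcher-equivalent instances we'd need running in parallel for the whole delay:
instances_needed = required_researcher_years / (delay_years * speed_multiplier)
print(instances_needed)  # 7500.0
```

Whether something in this ballpark is achievable depends on how many instances we can actually run and how much of their output is genuinely top-researcher-quality safety work, which is exactly where the caveats above bite.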
  1. ^

    At least R&D which isn't very limited by physical processes.

  2. ^

    I think <1% doom seems too optimistic without more of a story for how we're going to handle superhuman models.

As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn't read the paper so I don't know whether it's legit, but that sort of thing seems quite plausibly feasible a lot of the time.

Perhaps you're thinking of the recent influence function work from Anthropic?

I don't think that this paper either shows or claims that "LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions". But they do find that there are influential training examples from sci-fi stories and AI safety discussion when asking the model questions about topics like this.

I find myself unsure which conclusion this is trying to argue for.

Here are some pretty different conclusions:

  • Deceptive alignment is <<1% likely (quite implausible) to be a problem prior to complete human obsolescence (maybe it's a problem after human obsolescence for our trusted AI successors, but who cares).
  • There aren't any solid arguments for deceptive alignment[1]. So, we certainly shouldn't be confident in deceptive alignment (e.g. >90%), though we can't totally rule it out (prior to human obsolescence). Perhaps deceptive alignment is 15% likely to be a serious problem overall and maybe 10% likely to be a serious problem if we condition on fully obsoleting humanity via just scaling up LLM agents or similar (this is pretty close to what I think overall).
  • Deceptive alignment is <<1% likely for scaled up LLM agents (prior to human obsolescence). Who knows about other architectures.

There is a big difference between <<1% likely and 10% likely. I basically agree with "not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment", but I don't think this leaves me in a <<1% likely epistemic state.


  1. Other than noting that it could be behaviorally consistent for powerful models: powerful models are capable of deceptive alignment. ↩︎

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn't plot against us or otherwise screw us over.)

Even if we didn't have the visible thoughts property in the actual deployed system, the fact that all of the retargeting behavior is based on explicit human engineering is still relevant and contradicts the core claim Nate makes in this post IMO.

It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."

Basically, but more centrally: in literal current LLM agents, the scary part of the system that we don't understand (the LLM) doesn't generalize in any scary way due to "wanting", while we can still get the overall system to achieve specific long term outcomes in practice. And it's at least plausible that this property will be preserved in the future.

I edited my earlier comment to hopefully make this more clear.

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said.

I think it contradicts things Nate says in this post directly. I don't know if it contradicts things you've said.

To clarify, I'm commenting on the following chain:

First Nate said:

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

as well as

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Then, Paul responded with

I think this is a semantic motte and bailey that's failing to think about mechanics of the situation. LM agents already have the behavior "reorient towards a target in response to obstacles," but that's not the sense of "wanting" about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked "how can I achieve X in this situation?" will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn't what you need for AI risk arguments!

Then you said

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

And I was responding to this.

So, I was just trying to give at least one plausible example of a system which could pursue long term goals but doesn't have the sense of wanting needed for AI risk arguments. In particular, LLM agents where the retargeting is purely based on human engineering (analogous to a myopic employee retargeted by a manager who cares about longer term outcomes).

This directly contradicts "Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.".

(I'm obviously not Paul)

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

In the case of literal current LLM agents with current models:

  • Humans manually engineer the prompting and scaffolding (and we understand how and why it works)
  • We can read the intermediate goals directly via just reading the CoT.

Thus, we don't have risk from hidden, unintended, or unpredictable objectives. There is no reason to think that goal seeking behavior due to the agency from the engineered scaffold or prompting will result in problematic generalization.

It's unclear if this will hold in the future even for LLM agents, but it's at least plausible that this will hold (which defeats Nate's rather confident claim). In particular, we could run into issues from the LLM used within the LLM agent having hidden goals, but insofar as the retargeting and long run agency are a human engineered and reasonably understood process, the original argument from Nate doesn't seem very relevant to risk. We also could run into issues from imitating very problematic human behavior, but this seems relatively easy to notice in most cases, as it would likely be discussed out loud with non-negligible probability.

We'd also lose this property if we did a bunch of RL and most of the power of LLM agents was coming from this RL rather than imitating human optimization or humans engineering particular optimization processes.
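To illustrate what I mean by the retargeting being human engineered and the intermediate goals being readable, here's a minimal sketch of this kind of scaffold (the `call_llm` function, the prompt, and the monitoring hook are hypothetical placeholders, not any particular implementation):

```python
# Hypothetical sketch of a current-style LLM agent scaffold.
# The long-horizon "agency" lives in this human-written loop, and every
# intermediate goal passes through plain text that we can read and audit.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to some instruction-tuned LLM."""
    raise NotImplementedError

def looks_problematic(chain_of_thought: str) -> bool:
    """Placeholder for (human or automated) monitoring of the visible CoT."""
    return False

def run_agent(long_term_goal: str, max_steps: int = 20) -> list[str]:
    completed_steps: list[str] = []
    for _ in range(max_steps):
        # Retargeting is explicit human engineering: we, not the LLM, decided
        # that each step re-orients the system toward the long term goal.
        prompt = (
            f"Overall goal: {long_term_goal}\n"
            f"Steps completed so far: {completed_steps}\n"
            "Think step by step about the next subgoal, then take one action."
        )
        chain_of_thought = call_llm(prompt)

        # The intermediate goals are visible: we can log, audit, or veto them.
        if looks_problematic(chain_of_thought):
            raise RuntimeError("Monitor flagged the CoT; halting the agent.")

        completed_steps.append(chain_of_thought)
    return completed_steps
```

The point of the sketch is just that the only part doing long-run goal pursuit is the loop humans wrote and understand; the LLM itself is only ever asked for a single readable step.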

See also this comment from Paul on a similar topic.

(Agreed, except that "inference-time safety techniques" feels overly limiting. It's more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn't discriminated by our validation set and other measurements. I hope this isn't too incomprehensible, but don't worry if it is; this point isn't that important.)
