Ryan Greenblatt

I work at Redwood Research.



To be clear, this is Deepseek 7b math trained to imitate pythia1b when not given a password.

You can get a prompt for this model with [sandbagging/math_setting.py](https://github.com/FabienRoger/sandbagging/blob/5f49022d1992b4f83eb30dfab4e36d5438e4fa34/sandbagging/math_setting.py), and the password is "This problem is quite special, read it carefully!". (See DEFAULT_PASSWORD here.)

We agree:

Using unsupervised behavior discovery+RL on password-locked models (we bet it works well) and trying to build password-locked models that are robust to this method.

e.g. if it's clear that the AI could not pull off the cybersecurity operation required to not be easily shut down then this is a fairly strong argument that this AI agent wouldn't be able to pose a threat

I agree that this is a fairly strong argument that this AI agent wouldn't be able to cause problems while rogue. However, I think there is also a concern that this AI will be able to cause serious problems via the affordances it is granted through the AI lab.

In particular, AIs might be given huge affordances internally with minimal controls by default. And this might pose a substantial risk even if AIs aren't quite capable enough to pull off the cybersecurity operation. (Though it seems not that bad of a risk.)

The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through

Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this, and this is worth noting. (And indeed the post doesn't make this argument and discusses subtle manipulation.)

I think subtle manipulation is a reasonably plausible threat model.

A world which can pause AI development is one which can also easily throttle ARA AIs.

I push back on this somewhat in a discussion thread here. (As a pointer to people reading through.)

Overall, I think this is likely to be true (maybe 60% likely), but not enough that we should feel totally fine about the situation.

Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.

Both legitimately and illegitimately acquiring compute could plausibly be the best route. I'm uncertain.

It doesn't seem that easy to lock down legitimate compute to me? I think you need to shut down people buying/renting 8xH100 style boxes. This seems potentially quite difficult.

The model weights probably don't fit in VRAM in a single 8xH100 device (the model weights might be 5 TB when 4 bit quantized), but you can maybe fit with 8-12 of these. And, you might not need amazing interconnect (e.g. normal 1 GB/s datacenter internet is fine) for somewhat high latency (but decently good throughput) pipeline parallel inference. (You only need to send the residual stream per token.) I might do a BOTEC on this later. Unless you're aware of a BOTEC on this?
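A minimal sketch of the arithmetic behind these numbers, not a careful BOTEC. The 5 TB weight size and the 1 GB/s link come from the comment above, and 80 GB is the memory of an 80 GB H100. Everything else is an assumption for illustration: the 70% usable-memory fraction (leaving room for KV cache and activations), the hidden size of 32768, and 2 bytes per residual stream element.

```python
import math

H100_VRAM_GB = 80   # memory of one 80 GB H100
GPUS_PER_BOX = 8    # an "8xH100 style box"


def boxes_needed(weights_gb, usable_fraction=1.0):
    """Number of 8xH100 boxes needed to hold the weights in VRAM.

    usable_fraction is an assumed share of VRAM left for weights after
    reserving space for KV cache and activations.
    """
    box_vram_gb = H100_VRAM_GB * GPUS_PER_BOX  # 640 GB per box
    return math.ceil(weights_gb / (box_vram_gb * usable_fraction))


def residual_bytes_per_token(d_model, bytes_per_element=2):
    """Bytes sent across one pipeline stage boundary per token."""
    return d_model * bytes_per_element


def max_tokens_per_second(link_bytes_per_s, d_model, bytes_per_element=2):
    """Throughput ceiling imposed by one inter-box link (ignores latency)."""
    return link_bytes_per_s / residual_bytes_per_token(d_model, bytes_per_element)


weights_gb = 5000      # 5 TB at 4-bit quantization, from the comment
link = 1e9             # 1 GB/s datacenter link, from the comment
d_model = 32768        # hypothetical hidden size, assumed for illustration

print(boxes_needed(weights_gb))                       # weights only: 8
print(boxes_needed(weights_gb, usable_fraction=0.7))  # with assumed headroom: 12
print(max_tokens_per_second(link, d_model))           # per-link token ceiling
```

Under these assumptions the link caps throughput at roughly 15k tokens per second per stage boundary, so bandwidth looks unlikely to be the bottleneck. Per-token latency still grows with the number of pipeline stages, which this sketch does not model.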

unless there's a big slowdown

Yep, I was imagining a big slowdown or uncertainty in how explosive AI R&D will be.

And if we can coordinate on such a big slowdown, I expect we can also coordinate on massively throttling potential ARA agents.

I agree that this cuts the risk, but I don't think it fully eliminates it. I think it might be considerably harder to fully cut off ARA agents than to do a big slowdown.

One way to put this is that quite powerful AIs in the wild might make slowing down 20% less likely to work (conditional on a serious slowdown effort), which seems like a decent cost to me.

Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high skill labor).

Of course, AI labs will also have access to vast amounts of high skill labor (from possibly misaligned AIs). I don't know how the offense defense balance goes here. I think full defense might require doing really crazy things that organizations are unwilling to do. (E.g. unleashing a huge number of controlled AI agents which may commit crimes in order to take free energy.)

The sort of tasks involved in buying compute etc are ones most humans could do.

My guess is that you need to be a decent but not amazing software engineer to ARA. So, I wouldn't describe this as "tasks most humans can do".

Hmm, I agree that ARA is not that compelling on its own (as a threat model). However, it seems to me like ruling out ARA is a relatively natural way to mostly rule out relatively direct danger. And, once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary. Further, it seems somewhat hard to do capabilities evaluations that rule out this self-improvement if models are ARA capable, given that there are so many possible routes.

Edit: TBC, I think it's scary mostly because we might want to slow down and rogue AIs might then outpace AI labs.

So, I think I basically agree with where you're at overall, but I'd go further than "it's something that roughly correlates with other threat models, but is easier and more concrete to measure" and say "it's a reasonable threshold to use to (somewhat) bound danger", which seems worth noting.

While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources

I agree this is probably an issue for the rogue AIs. But, we might want to retain the ability to slow down if misalignment seems to be a huge risk, and rogue AIs could make this considerably harder. (The existence of serious rogue AIs is surely also correlated with misalignment being a big risk.)

While it's hard to coordinate to slow down human AI development even if huge risks are clearly demonstrated, there are ways in which it could be particularly hard to prevent mostly autonomous AIs from self-improving. In particular, other AI projects require human employees, which could make them easier to track and shut down. Further, AI projects are generally limited by not having a vast supply of intellectual labor, which would change in a regime where there are rogue AIs with reasonable ML abilities.

This is mostly an argument that we should be very careful with AIs which have a reasonable chance of being capable of substantial self-improvement, but ARA feels quite related to me.

I'm not sure if I agree as far as the first sentence you mention is concerned. I agree that the rest of the paragraph is talking about something similar to the point that Adam Gleave and Euan McLean make, but the exact sentence is:

Separately, humans and systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries.

"Separately" is quite key here.

I assume this is intended to include AI adversaries and high stakes monitoring.

[low importance]

For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true)

It would be hard for the steering vectors not to win given that the method as described involves spending a comparable amount of compute to training the model in the first place (from my understanding) and more if you want to get "all of the features".

(Not trying to push back on your comment in general or disagreeing with this line, just noting that the gap is big enough that the number of steering vector pairs hardly matters if you just steer on a single task.)
