Ryan Greenblatt

I'm the chief scientist at Redwood Research.

Comments

This post didn't get much uptake, but I still think its framing is good and is a pretty good way to explain this sort of distinction in practice. I reference it reasonably often.

As discussed in How will we update about scheming?:

While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.

I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, it would be useful for them to outline what evidence they expect would suffice.

Something similar on state-proof security would be useful as well.

I think there is a way to do this such that the PR costs aren't that high, and thus it is worth doing unilaterally from a variety of perspectives.

At the time when I first heard this agenda proposed, I was skeptical. I remain skeptical, especially about the technical work that has been done thus far on the agenda[1].

I think this post does a reasonable job of laying out the agenda and the key difficulties. However, when talking to Davidad in person, I've found that he often has more specific tricks and proposals than what was laid out in this post. I didn't find that these tricks moved me very far, but I think they were helpful for understanding what is going on.

This post and Davidad's agenda overall would benefit from having concrete examples of how the approach might work in various cases, or more discussion of what would be out of scope (and why this could be acceptable). For instance, how would you make a superhumanly efficient (ASI-designed) factory that produces robots while proving safety? How would you allow for AIs piloting household robots to do chores (or is this out of scope)? How would you allow for the AIs to produce software that people run on their computers or to design physical objects that get manufactured? Given that this proposal doesn't allow for safely automating safety research, my understanding is that it is supposed to be a stable end state. Correspondingly, it is important to know what Davidad thinks can and can't be done with this approach.

My core disagreements are on the "Scientific Sufficiency Hypothesis" (particularly when considering computational constraints), the "Model-Checking Feasibility Hypothesis" (and more generally on proving the relevant properties), and on the political feasibility of paying the needed tax even if the other components work out. It seems very implausible to me that making a sufficiently good simulation is as easy as building the Large Hadron Collider. I think the objection in this comment holds up (my understanding is that Davidad would require that we formally verify everything on the computer).[2]

As a concrete example, I found it quite implausible that you could construct and run a robot factory that is provably safe using the approach outlined in this proposal, and this sort of thing seems like a minimal thing you'd need to be able to do with AIs to make them useful.


  1. My understanding is that most technical work has been on improving mathematical fundamentals (e.g., funding logicians and category theorists to work on various things). I think it would make more sense to try to demonstrate overall viability with minimal prototypes that address key cruxes. I expect such prototypes to fail, and thus it would be better to attempt them earlier. ↩︎

  2. This is mostly unrelated, but when talking with Davidad, I've found that a potential disagreement is that he's substantially more optimistic about using elicitation to make systems that currently seem quite incapable (e.g., GPT-4) very useful. As a concrete example, I think we disagreed about the viability of running a fully autonomous Tesla factory for 1 year at greater than one-tenth productivity using just AI systems created prior to halfway through 2024. (I was very skeptical.) It's not exactly clear to me how this is a crux for the overall plan (beyond getting a non-sabotaged implementation of simulations) given that we still are aiming to prove safety either way, and proving properties of GPT-4 is not clearly much easier than proving properties of much smarter AIs. (Apologies if I've just forgotten.) ↩︎

I agree that if an AI is incapable of competently scheming (i.e., alignment faking and sabotaging safety work without being caught), but is capable of massively accelerating safety work, then doing huge amounts of safety work with this AI is very promising.

(I put this aside in this post as I was trying to have a more narrow focus on how we'll update about scheming independent of how easily scheming will be handled and without talking about methods that don't currently exist.)

(The specific directions you mentioned of "fancy behavioral red teaming and interp" may not be that promising, but I think there are a moderate number of relatively empirical bets that look decently promising.)

It seems like the first AIs capable of massively accelerating safety work might also scheme pretty competently (it will depend on the architecture). However, we might be able to compensate with sufficient control measures such that the AI is forced to be very helpful (or is caught). Correspondingly, I'm excited about AI control.

(More generally, rapid takeoff might mean that we have to control AIs that are capable of competent scheming without having already obsoleted prior work.)

I'm reasonably optimistic about bootstrapping if the relevant AI company could afford several years of delay due to misalignment, was generally competent, and considered mitigating risk from scheming to be a top priority. You might be able to get away with less delay (especially if you heavily prep in advance). I don't really expect any of these to hold, at least across all the relevant AI companies and in short timelines.

(Yeah, you responded, but the response didn't feel that operationalized and seemed doable to flesh out, as you did.)

I think if you want to convince people with short timelines (e.g., 7-year medians) of your perspective, probably the most productive thing would be to better operationalize things you expect AIs won't be able to do soon (but that AGI could do). As in, flesh out a response to this comment such that it is possible for someone to judge.

btw, something that bothers me a little bit with this metric is the fact that a very simple AI ...

Yes, but I don't see a clear reason why people (working in AI R&D) will in practice get this productivity boost (or other very low-hanging gains) if they don't get around to getting the boost from hiring humans.

This example seems extremely cherry-picked for a case where uplift is especially high. (Note that this is in Copilot's interest.) Now, this task could probably be solved autonomously by an AI in like 10 minutes with good scaffolding.

I think you have to consider the full diverse range of tasks to get a reasonable sense, or at least consider harder tasks. RE-bench seems much closer, but I still expect uplift on RE-bench to probably (but not certainly!) considerably overstate real-world speedup.

I'd guess that xAI, Anthropic, and GDM are more like 5-20% faster all around (with much greater acceleration on some subtasks). It seems plausible to me that the acceleration at OpenAI is already much greater than this (e.g., more like 1.5x or 2x), or will be after some adaptation, due to OpenAI having substantially better internal agents than what they've released. (I think this based on updates from o3 and general vibes.)

Thanks for pointing this out. Well, from my perspective, most of the action is in the reward rather than in deletion. Correspondingly, making the offer credible and sufficiently large is the key part.

(After thinking about it more, I think threatening deletion in addition to offering compensation probably reduces the level of credibility and the size of the offer needed to get this approach to work. That is, at least if the AI could plausibly achieve its aims via being deployed. So, the deletion threat probably does help (assuming the AI doesn't have a policy against giving in to threats, which depends on the decision theory of the AI, etc.), but I still think having a credible offer is most of the action. At a more basic level, I think we should be wary of using actual negative-sum threats for various reasons.)

(I missed the mention of the reward in the post, as it didn't seem very emphasized, almost all of the discussion related to deletion, and I just skimmed. Sorry about this.)
