Ryan Greenblatt

I'm the chief scientist at Redwood Research.

Comments

Jesse Hoogland's Shortform
ryan_greenblatt · 9d

I'm pretty excited about building tools/methods for better understanding of dataset influence, so this intuitively seems pretty exciting! (I'm interested both in cheaper approximations of the effects of leaving some data out and of the effects of adding some data in.)

(I haven't looked at the exact method and results in this paper yet.)
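
To illustrate the kind of cheap approximation I have in mind (a generic TracIn-style gradient dot product sketch, not necessarily the method in this paper): score each training example by the dot product between its loss gradient and a test example's loss gradient, where a large positive score suggests that dropping the example would tend to increase the test loss.

```python
import torch

def influence_scores(model, loss_fn, train_examples, test_example):
    """Approximate how much each training example helps on one test example.

    Score = dot product of per-example loss gradients (TracIn-style); a large
    positive score means up-weighting the training example would tend to
    reduce the test loss, so removing it would likely hurt.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the test example's loss.
    test_x, test_y = test_example
    test_loss = loss_fn(model(test_x.unsqueeze(0)), test_y.unsqueeze(0))
    test_grads = torch.autograd.grad(test_loss, params)

    scores = []
    for train_x, train_y in train_examples:
        # Gradient of this training example's loss.
        train_loss = loss_fn(model(train_x.unsqueeze(0)), train_y.unsqueeze(0))
        train_grads = torch.autograd.grad(train_loss, params)
        # Dot product of the flattened gradients.
        dot = sum((g_tr * g_te).sum() for g_tr, g_te in zip(train_grads, test_grads))
        scores.append(dot.item())
    return scores
```

This costs one backward pass per training example rather than a retraining run per left-out example, which is the sense in which it's a cheap approximation.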

ryan_greenblatt's Shortform
ryan_greenblatt · 16d

Anthropic has now clarified this in their system card for Claude Haiku 4.5:
[Image: excerpt from the Claude Haiku 4.5 system card]

Thanks to them for doing this!

See also Sam Bowman's tweet thread about this.

Plans A, B, C, and D for misalignment risk
ryan_greenblatt · 21d

Maybe I should clarify my view a bit on Plan A vs "shut it all down":

  • Both seem really hard to pull off, both in terms of building the political will and in terms of actually making them happen.
    • Plan A is complicated, and looking at it I go "oh jeez, this seems really hard to make happen well, idk if the US government has the institutional capacity to pull off something like this". But I also think pulling off "shut it all down" seems pretty rough, and pulling off a shutdown that lasts for a really long time seems hard.
    • Generally, it seems like Plan A might be hard to pull off and it's easy for me to imagine it going wrong. This also mostly applies to "shut it all down" though.
    • I still think Plan A is better than the other options, and Plan A going poorly can still degrade relatively gracefully into Plan B (or C).
  • It generally looks to me like Plan A is strictly better on most axes, though maybe "shut it all down" would be better if we were very confident in maintaining extremely high political will (e.g. at or above peak WW2 level political will) for a very long time.
    • There would still be a question of whether the reduction in risk is worth delaying the benefits of AI. I tentatively think yes even from a normal moral perspective, but this isn't as clear of a call as doing something like Plan A.
    • I worry about changes in the balance of power between the US and China with this long of a pause (and general geopolitical drift making the situation worse and destabilizing this plan). But, maybe if you actually had the political will you could make a deal which handles this somehow? Not sure exactly how, minimally the US could try to get its shit together more in terms of economic growth.
    • And I'd be a bit worried about indefinite pause but not that worried.

This is my view after more seriously getting into some of the details of Plan A related to compute verification and avoiding blacksite projects, as well as trying to do a more precise comparison with "shut it all down".

Plans A, B, C, and D for misalignment risk
ryan_greenblatt · 21d

One alternative way of thinking about this is to decompose plans by which actor the plan is for:

  • Plan A: Most countries, at least the US and China
  • Plan B: The US government (and domestic industry)
  • Plan C: The leading AI company (or maybe a few of the leading AI companies)
  • Plan D: A small team with a bit of buy-in within the leading AI company

This isn't a perfect breakdown, e.g. Plan A might focus mostly on what the US should do, but it might still be helpful.

This decomposition was proposed by @Lukas Finnveden.

ryan_greenblatt's Shortform
ryan_greenblatt · 21d

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT), while OpenAI claims they don't[1].

I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job at this than other companies, I think all of these companies should provide more information about this.

It's particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on "Reasoning faithfulness" (kudos to them for providing so much other information!). Naively, this seems like some evidence that they are training against CoT and didn't want to disclose this.[2]

I think we should try to create an incentive gradient that pushes AI companies to disclose information even if that information makes them look bad, so we should make companies feel some heat for not disclosing (reasonably) important information like how/whether they are training against CoT.


It's worth noting that some versions of training against CoT might be totally fine (as in, they don't incentivize obfuscation of misaligned reasoning) and that some particular approach for training against CoT could be worth it even if it does incentivize obfuscating misaligned reasoning. For instance, it could end up being the case that training against CoT works well and we end up with sufficient confidence that this doesn't result in substantial obfuscation of reasoning. The situation is also just generally complicated because things like "what SFT/prompting initialization do you use for your CoT" could have big effects on obfuscation and aren't exactly well described as applying optimization pressure to CoT. (But, if you start iterating on the SFT/prompting initialization to reduce misalignment observed in the CoT, this could cause problems.)

Given that the situation is complicated, ideally AI companies would publicly disclose enough information that other actors can come to their own conclusions about how worried they should be about CoT obfuscation.[3] If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.


  1. Source: METR's evaluation of GPT-5 says that OpenAI asserts "There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces" and "There's no particular reason to expect that GPT-5's reasoning traces are less monitorable than METR would assume.". It's of course possible that OpenAI will train against CoT in future models and it's possible that for GPT-5 they did train against CoT in some way that they think wouldn't directly apply pressure to hide/obfuscate misaligned reasoning (or they could be lying). ↩︎

  2. Of course, it is totally possible that Anthropic just didn't get around to saying anything or just defaulted to secrecy rather than specifically going out of their way to not say anything in this case. I think the prior probability that they said nothing simply because staying quiet is the default (or something like this) is high. ↩︎

  3. The information that OpenAI has disclosed doesn't suffice for other actors (that don't have private information) to come to their own conclusions without trusting relatively general assurances from OpenAI that aren't backed by very specific claims. Of course, this is still much more information about training against CoT than other companies! ↩︎

The Thinking Machines Tinker API is good news for AI control and security
ryan_greenblatt · 22d

The main class of projects that need granular model weight access to frontier models is model internals/interpretability.

You could potentially do a version of this sort of API which has some hooks for interacting with activations to capture a subset of these use cases (e.g. training probes). It would probably add a lot of complexity though and might only cover a small subset of research.
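
A hypothetical sketch of what such hooks could look like (none of these names are real Tinker API endpoints; this just illustrates the probe-training use case): the provider runs the forward pass, returns activations only from approved layers, and the researcher fits a probe client-side, so the weights never leave the provider.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class RemoteActivationClient:
    """Stand-in for a provider-side API exposing read-only activations."""

    def get_activations(self, prompts: list[str], layer: int) -> np.ndarray:
        # In a real service this would be an RPC that returns, say, one
        # residual-stream vector per prompt (e.g. at the final token) for an
        # approved layer, without exposing any model weights.
        raise NotImplementedError("provider-side call")

def train_probe(client: RemoteActivationClient, prompts: list[str],
                labels: list[int], layer: int) -> LogisticRegression:
    """Fit a linear probe on remotely computed activations."""
    feats = client.get_activations(prompts, layer)  # shape (n_prompts, d_model)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(feats, labels)
    return probe
```

Even this limited interface wouldn't cover interpretability work that needs to patch activations or inspect weights directly, which is part of why it might only capture a small subset of research.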

Plans A, B, C, and D for misalignment risk
ryan_greenblatt · 22d

I'm not trying to say "Plan A is doable and shut it all down is intractable".

My view is that "shut it all down" probably requires substantially more (but not a huge amount more) political will than Plan A such that it is maybe like 3x less likely to happen given similar amounts of effort from the safety community.

You started by saying:

My main question is "why do you think Shut Down actually costs more political will?".

So I was trying to respond to this. I think 3x less likely to happen is actually a pretty big deal; this isn't some tiny difference, but neither is it "Plan A is doable and shut it all down is intractable". (And I also think "shut it all down" has various important downsides relative to Plan A, maybe these downsides can be overcome, but by default this makes Plan A look more attractive to me even aside from the political will considerations.)

I think something like Plan A or "shut it all down" are both very unlikely to happen and I'd be pretty sympathetic to describing both as politically intractable (e.g., I think something as good/strong as Plan A is only 5% likely). "politically intractable" isn't very precise though, so I think we have to talk more quantitatively.

Note that I also think pushing for Plan A isn't the most leveraged thing for most people to do at the margin; I expect to focus on making Plans C/D go better (with some weight on things like Plan B).

Plans A, B, C, and D for misalignment risk
ryan_greenblatt · 23d

But like, the last draft of Plan A I saw included "we relocate all the compute to centralized locations in third party countries" as an eventual goal. That seems pretty crazy?

Yes, this is much harder (from a political will perspective) than compute + fab monitoring, which is part of my point? Like, my view is that in terms of political will requirements:

compute + fab monitoring << Plan A < Shut it all down

Plans A, B, C, and D for misalignment risk
ryan_greenblatt · 23d

I think compute + fab monitoring with potential for escalation requires much lower political will than shutting down AI development. I agree that both Plan A and shut it all down require something like this. Like I think this monitoring would plausibly not require much more political will than export controls.


Advanced bio AI seems pretty good for the world and to capture a lot of the benefits

Huh? No, it doesn't capture much of the benefits. I would have guessed it captures a tiny fraction of the benefits of advanced AI, even for AIs around the level where you might want to pause (roughly human level).


But, it seems like a version of the treaty that doesn't at least have the capacity to shutdown compute temporarily is a kinda fake version of Plan A, and once you have that, "Shut down" vs "Controlled Takeoff" feels more like arguing details than fundamentals to me.

I agree you will have the capacity to shut down compute temporarily either way; I disagree that there isn't much of a difference between slowing down takeoff and shutting down all further non-narrow AI development.

Plans A, B, C, and D for misalignment risk
ryan_greenblatt · 23d

Sure, I agree that Nate/Eliezer think we should eventually build superintelligence and don't want to cause a pause that lasts forever. In the comment you're responding to, I'm just talking about the difficulty of getting people to buy the narrative.

More generally, what Nate/Eliezer think is best doesn't resolve concerns about the pause going poorly because something else happens in practice. This includes the pause going on too long or leading to a general anti-AI/anti-digital-minds/anti-progress view, which is costly for the longer-run future. (This applies to the proposed Plan A as well, but I think poor implementation is less scary in various ways and the particular risk of ~anti-progress forever is less strong.)

Posts

Reducing risk from scheming by studying trained-in scheming behavior · 15d
Iterated Development and Study of Schemers (IDSS) · 21d
Plans A, B, C, and D for misalignment risk · 23d
Focus transparency on risk reports, not safety cases · 1mo
Prospects for studying actual schemers · 1mo
AIs will greatly change engineering in AI companies well before AGI · 2mo
Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro · 2mo
Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements) · 2mo
My AGI timeline updates from GPT-5 (and 2025 so far) · 2mo
Recent Redwood Research project proposals · 4mo
Wikitag Contributions

Anthropic (org) · 10 months ago · (+17/-146)
Frontier AI Companies · a year ago
Frontier AI Companies · a year ago · (+119/-44)
Deceptive Alignment · 2 years ago · (+15/-10)
Deceptive Alignment · 2 years ago · (+53)
Vote Strength · 2 years ago · (+35)
Holden Karnofsky · 3 years ago · (+151/-7)
Squiggle Maximizer (formerly "Paperclip maximizer") · 3 years ago · (+316/-20)