Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

  • Clear commitments along the line
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP is effectively an outline for the sort of thing an RSP could be (run evals, have a safety buffer, assume continuity, etc.) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.

Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

Strongly agree with almost all of this.

My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding, and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.

Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them...

It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs.

I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.

On top of that, the posts seem to have this "don't listen to the people who are

I generally don't find writeups of standards useful, but this piece was an exception. Below, I'll try to articulate why:

I think AI governance pieces, especially pieces about standards, often have overly vague language. People say things like "risk management practices" or "third-party audits", phrases that are generally umbrella terms that lack specificity. These sometimes serve as applause lights (whether the author intended this or not): who could really disagree with the idea of risk management?

Congratulations on launching!

On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off? 

Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.

But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this. 

Thanks Akash! I agree that this feels neglected. Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett. Looking forward to it coming out!

Thank you for writing this post, Adam! Looking forward to seeing what you and your epistemology team produce in the months ahead.

SERI MATS is doing a great job of scaling conceptual alignment research, and seem open to integrate some of the ideas behind Refine

I'm a big fan of SERI MATS. But my impression was that SERI MATS had a rather different pedagogy/structure (compared to Refine). In particular: 

  1. SERI MATS has an "apprenticeship model" (every mentee is matched with one mentor), whereas Refine mentees didn't have mentors.
  2. Refine was optimizing for p
Thanks for the kind words! I have shared some of my models related to epistemology and key questions with the MATS organizers, and I think they're supposed to be integrated into one of the future programs. Mostly things regarding realizing the importance of productive mistakes in science (which naturally pushes back a bit against the mentoring aspect of MATS) and understanding how much less "clean" most scientific progress actually looks historically (with a basic reading list from the history of science). From the impression I have, they are also now trying to give participants some broader perspective about the field, in addition to the specific frame of the mentor, and a bunch of the lessons from Refine about how to build a good model of the alignment problem apply. On a more general level, I expect that I had enough discussions with them that they would naturally ask me for feedback if they thought of something that seemed Refine-shaped or similar.

Hmm, intuitively the main value from Refine that I don't expect to be covered by future MATS would come from reaching out to very different profiles. There's a non-negligible chance that PIBBSS manages to make that work though, so it's not clear that it's a problem.

Note that this is also part of why Refine feels less useful: when I conceived of it, most of these programs either didn't exist or were not well-established. Part of the frustration came from having nothing IRL for non-Americans to join, and just no program spending a significant amount of time on conceptual alignment, which both MATS and PIBBSS (in addition to other programs like ARENA) are now fixing. Which I think is great!

Hastily written; may edit later

Thanks for mentioning this, Jan! We'd be happy to hear suggestions for additional judges. Feel free to email us at and  

Some additional thoughts:

  1. We chose judges primarily based on their expertise and (our perception of) their ability to evaluate submissions about goal misgeneralization and corrigibility. Lauro, Richard, Nate, and John are some of the few researchers who have thought substantially about these problems. In particular, Lauro first-authored the firs
