COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.

Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.

Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.

If you’re not familiar with the concept of an RSP, the central idea of RSPs is evaluation-gated scaling—that is, AI labs can only scale models depending on some set of evaluations that determine whether additional scaling is appropriate. ARC’s definition is:

An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.

How do we make it to a state where AI goes well?

I want to start by taking a step back and laying out a concrete plan for how we get from where we are right now to a policy regime that is sufficient to prevent AI existential risk.

The most important background here is my “When can we trust model evaluations?” post, since knowing the answer to when we can trust evaluations is extremely important for setting up any sort of evaluation-gated scaling. The TL;DR there is that it depends heavily on the type of evaluation:

  • A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. For example: if the model were actively trying to autonomously replicate, would it be capable of doing so?”
    • With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly.
  • A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down?”

With that as background, here’s a broad picture of how things could go well via RSPs (note that everything here is just one particular story of success, not necessarily the only story of success we should pursue or a story that I expect to actually happen by default in the real world):

  1. AI labs put out RSP commitments to stop scaling when particular capabilities benchmarks are hit, resuming only when they are able to hit particular safety/alignment/security targets.
    1. Early on, as models are not too powerful, almost all of the work is being done by capabilities evaluations that determine that the model isn’t capable of e.g. takeover. The safety evaluations are mostly around security and misuse risks.
    2. For later capabilities levels, however, it is explicit in all RSPs that we do not yet know what safety metrics could demonstrate safety for a model that might be capable of takeover.
  2. Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
  3. By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
  4. Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
  5. Once labs start to reach models that pose a potential takeover risk, they either:
    1. Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
    2. Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
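To make the gating logic in step 1 concrete, here is a minimal sketch of evaluation-gated scaling in Python. Everything in it is an assumption made for illustration: the names `EvalResult` and `may_continue_scaling`, the single numeric capability score, and the threshold are all hypothetical, and no actual RSP is specified this way. The one thing it is meant to show is that a missing safety evaluation (the explicit hole for takeover-capable models) is treated as a failed one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalResult:
    # Hypothetical score from a fine-tuning-based capabilities evaluation.
    capability_score: float
    # True/False if a trusted safety target was assessed; None if no trusted
    # safety evaluation exists yet for this capability level.
    safety_target_met: Optional[bool]


def may_continue_scaling(result: EvalResult, capability_threshold: float) -> bool:
    """Return True if scaling may continue under this illustrative gate."""
    # Below the threshold, the capabilities evaluation alone carries the decision.
    if result.capability_score < capability_threshold:
        return True
    # At or above the threshold, scaling resumes only once the safety target
    # is demonstrably met. A missing evaluation (None) does not count as a pass.
    return result.safety_target_met is True
```

Under these assumptions, `may_continue_scaling(EvalResult(0.7, None), 0.5)` returns `False`: a model past the capability threshold stays blocked until a safety evaluation exists and is passed, which is the forcing function in step 5.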

Reasons to like RSPs

Obviously, the above is only one particular story for how things go well, but I think it’s a pretty solid one. Here are some reasons to like it:

  1. It provides very clear and concrete policy proposals that could realistically be adopted by labs and governments (in fact, step 1 has already started!). Labs and governments don’t know how to respond to nebulous pause advocacy because it isn’t clearly asking for any particular policy (since nobody actually likes and is advocating for the six-month pause proposal).
  2. It provides early wins that we can build on later in the form of initial RSP commitments with explicit holes in them. From “AI coordination needs clear wins”:
    1. “In the theory of political capital, it is a fairly well-established fact that ‘Everybody Loves a Winner.’ That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents’ to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
    2. I think many people who think about the mechanics of leveraging influence don’t really understand this phenomenon and conceptualize their influence as a finite resource to be saved up over time so it can all be spent down when it matters most. But I think that is just not how it works: if people see you successfully leveraging influence to change things, you become seen as a person who has influence, has the ability to change things, can get things done, etc. in a way that gives you more influence in the future, not less.”
  3. One of the best, most historically effective ways to shape governmental regulation is to start with voluntary commitments. Governments are very good at solving “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.
    1. Though we could try to go to governments first rather than labs first, so far I’ve seen a lot more progress with the labs-first approach—though there’s no reason we can’t continue to pursue both in parallel.
  4. RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks. That’s important because it gives the proposal substantial additional seriousness, since it can point directly to clear harms that it is targeted at preventing.
    1. Additionally, from an x-risk perspective, I don’t even think it actually matters that much what the capability evaluations are here: most potentially dangerous capabilities should be highly correlated, such that measuring any of them should be okay. Thus, I think it should be fine to mostly focus on measuring the capabilities that are most salient to policymakers and most clearly demonstrate risks. And we can directly test the extent to which relevant capabilities are correlated: if they aren’t, we can change course.
  5. Since the strictest conditions of the RSPs only come into effect for future, more powerful models, it’s easier to get people to commit to them now. Labs and governments are generally much more willing to sacrifice potential future value than realized present value.
    1. Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be as at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. There is still a capabilities benchmark below which open-source is fine (though it should be a lower threshold than closed-source, since there are e.g. misuse risks that are much more pronounced for open-source), and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
  6. Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety. As I’ve said before, “the only worlds I can imagine myself actually feeling good about humanity’s chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.”

How do RSPs relate to pauses and pause advocacy?

In my opinion, RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs! And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.

Furthermore, I want to be very clear that I don’t mean “stop pestering governments and focus on labs instead”—we should absolutely try to get governments to adopt RSP-like policies and get as strong conditions as possible into any RSP-like policies that they adopt. What separates pause advocacy from RSP advocacy isn’t who it’s targeted at, but the concreteness of the policy recommendations that it’s advocating for. The point is that advocating for a “pause” is nebulous and non-actionable—“enact an RSP” is concrete and actionable. Advocating for labs and governments to enact as good RSPs as possible is a much more effective way to actually produce concrete change than highly nebulous pause advocacy.

Furthermore, RSP advocacy is going to be really important! I’m very worried that we could fail at any of the steps above, and advocacy could help substantially. In particular:

  • We need to actually get as many labs as possible to put out RSPs.
    • Currently, only Anthropic has done so, but I have heard positive signals from other labs and I think with sufficient pressure they might be willing to put out their own RSPs as well.
  • We need to make sure that those RSPs actually commit to the right things. What I’m looking for are:
    • Fine-tuning-based capabilities evaluations being used for below-takeover-potential models.
    • Evidence that capabilities evaluations will be done effectively and won’t be sandbagged (e.g. committing to use an external auditor).
    • An explicitly empty hole for safety evaluations for takeover-risk models that can be filled in later by progress on understanding-based evals.
  • We need to get governments to enact mandatory RSPs for all AI labs.
    • And these RSPs also need to have all the same important properties as the labs’ RSPs. Ideally, we should get the governmental RSPs to be even stronger!
  • We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.
    • I’m especially worried about this point, though I don’t think it’s that hard of a sell: the idea that you should understand what your AI is doing on a deep level is a pretty intuitive one.

I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it. 

I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.   


I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under an unknown timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we'll stop.” 

But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”—will be substituted for the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.

And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. That when people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” that it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals—in a domain we don’t yet actually understand—the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.

I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea if these measures in fact meaningfully relate much to the thing we actually care about. 

There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”

I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, we will face similar corrosion.


Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances. 

But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.   

And I worry that in the absence of such a story—where the true plan is something closer to “fill in the blanks as we go”—that a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” That the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…

I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted. 

But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.

As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can't do are concrete safety evals, which I'm very clear about not expecting us to have right now.

And I'm not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn't work as a way to make things go well: but that's how any story of success has to work right now given that we don't currently know how to make things go well. And actually telling success stories is an important thing to do!

If you have an alternative success story that doesn't involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.

It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.

This post is not a responsible scaling plan. I feel like your whole comment seems to be weirdly conflating stuff that I'm saying with stuff in the Anthropic RSP. This post is about my thoughts on RSPs in general—which do not necessarily represent Anthropic's thoughts on anything—and the post isn't really about Anthropic's RSP at all.

Regardless, I'm happy to give my take. I don't think that anybody currently has a convincing story to tell about how to get a good understanding of AI systems, but you can read my thoughts on how we might get to one here.

I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.

It sounds like you're disagreeing with me, but everything you're saying here is consistent with everything I said. The whole point of my proposal is to understand what evals we can trust and when we can trust them, set up eval-gated scaling in the cases where we can do concrete evals, and be very explicit about the cases where we can't.

But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.

When assumptions are clear, it's not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and their criticism becomes useful for all involved, in particular making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.

Fully agree with almost all of this. Well said.

One nitpick of potentially world-ending importance:

In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems

Giving us high confidence is not the bar - we also need to be correct in having that confidence.
In particular, we'd need to be asking: "How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we're confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?..."

I assume you'd roll that into assessing your confidence - but I think it's important to be explicit about this.

 

Based on your comment, I'd be interested in your take on:

  1. Put many prominent disclaimers and caveats in the RSP - clearly and explicitly.
    vs
  2. Attempt to make commitments sufficient for safety by committing to [process to fill in this gap] - including some high-level catch-all like "...and taken together, these conditions make training of this system a good idea from a global safety perspective, as evaluated by [external board of sufficiently cautious experts]".

Not having thought about it for too long, I'm inclined to favor (2).
I'm not at all sure how realistic it is from a unilateral point of view - but I think it'd be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don't strongly expect to be able to meet ahead of time, that's useful to know: it amounts to "RSPs are a means to avoid pausing".

I imagine most labs wouldn't commit to [we only get to run this training process if Eliezer thinks it's good for global safety], but I'm not at all sure what they would commit to.

At the least, it strikes me that this is an obvious approach that should be considered - and that a company full of abstract thinkers who've concluded "There's no direct, concrete, ML-based thing we can commit to here, so we're out of options" doesn't appear to be trying tremendously hard.

Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

  • Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
  • Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.")
  • Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.")
  • Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
  • Some way of handling race dynamics (such that Bad Guy can't just be like "haha, cute that you guys are doing RSPs. We're either not going to engage with your silly RSPs at all, or we're gonna publish our own RSP but it's gonna be super watered down and vague").

What do RSPs actually look like right now?

  • Fairly vague commitments, more along the lines of "we will improve our information security and we promise to have good safety techniques. But we don't really know what those look like."
  • Unclear commitments regarding what happens if evals go off (let alone what evals will even be developed and what they'll look like). Very much a "trust us; we promise we will be safe. For misuse, we'll figure out some way of making sure there are no jailbreaks, even though we haven't been able to do that before."
    • Also, for accident risks/AI takeover risks... well, we're going to call those "ASL-4 systems". Our current plan for ASL-4 is "we don't really know what to do... please trust us to figure it out later. Maybe we'll figure it out in time, maybe not. But in the meantime, please let us keep scaling."
  • Extremely high uncertainty about what safeguards will be sufficient. The plan essentially seems to be "as we get closer to highly dangerous systems, we will hopefully figure something out."
  • No strong evidence that these evals will exist in time or work well. The science of evaluations is extremely young, the current evals are more like "let's play around and see what things can do" rather than "we have solid tests and some consensus around how to interpret them."
  • No way of handling race dynamics absent government intervention. In fact, companies are allowed to break their voluntary commitments if they're afraid that they're going to lose the race to a less safety-conscious competitor. (This is explicitly endorsed in ARC's post and Anthropic includes such a clause.)

Important note: I think several of these limitations are inherent to the current gameboard. Like, I'm not saying "I think it's a bad move for Anthropic to admit that they'll have to break their RSP if some Bad Actor is about to cause a catastrophe." That seems like the right call. I'm also not saying that dangerous capability evals are bad-- I think it's a good bet for some people to be developing them.

Why I'm disappointed with current comms around RSPs

Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don't expect policymakers that engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + "we'll figure things out later"ness, etc.

On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it was like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime." Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it's too soon to worry about catastrophes whatsoever.

(There's also an underlying thing here where I'm like "the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers, i.e. the odds of meaningful policy getting implemented, are not independent of our actions." The more that groups like Anthropic and ARC claim "oh that's not realistic", the less realistic those proposals are. I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)

I'll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I'm not yet convinced this is the case, and I really hope it's not. But I'd really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say "hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we'll brand it as this nice catchy thing called Responsible Scaling."

Strongly agree with almost all of this.

My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding - and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.

Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. 

We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)

This can't be implicit, since it's a central way that we die.

If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately.

This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.

If we're thinking of better mechanisms to achieve a pause, I'd add:

  1. Call it something like a "Responsible training and deployment policy (RTDP)", not an RSP. Scaling is the thing in question. We should remove it from the title if we want to give the impression that it might not happen. (compare "Responsible farming policy", "Responsible fishing policy", "Responsible diving policy" - all strongly imply that responsible x-ing is possible, and that x-ing will continue to happen subject to various constraints)
  2. Don't look for a 'practical' solution. A serious pause/stop will obviously be impractical (yet not impossible). To restrict ourselves to practical approaches is to give up on any meaningful pause. Doing the impractical is not going to get easier later.
  3. Define ASLs or similar now rather than waiting until we're much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].
    1. Evan's own "We need to make sure that, once we have solid understanding-based evals, governments make them mandatory" only reinforces this impression. Whether we have them is irrelevant to the question of whether they're necessary.
  4. Be clear and explicit about the potential for very long pauses, and the conditions that would lead to them. Where it's hard to give precise conditions, give high-level conditions and very conservative concrete defaults (not [reasonably conservative]; [unreasonably conservative]). Have a policy where a compelling, externally reviewed argument is necessary before any conservative default can be relaxed.
    1. I think there's a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway - with government/regulator backing, since they're doing everything practical, everything reasonable....
      Assuming this won't happen seems dangerously naive.
    2. If labs are going to re-interpret the goalposts and continue running into the minefield, we need to know this as soon as possible. This requires explicit clarity over what is being asked / suggested / eventually-entailed.
      The Anthropic RSP fails at this IMO: no understanding-based requirements; no explicit mention that pausing for years may be necessary.
      The ARC Evals RSP description similarly fails - if RSPs are intended to be a path to pausing. "Practical middle ground" amounts to never realistically pausing. They entirely overlook overconfidence as a problem. (frankly, I find this confusing coming from Beth/Paul et al)
  5. Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
    1. This is a natural way to communicate "We'd ideally like [very strict measures], though [less strict measures] are all we can commit to unilaterally".
    2. If a lab's unilateral RTDP looks identical to their [conditional on international agreement] RTDP, then they have screwed up.
  6. Strongly consider pushing for safety leads to write and sign the RTDP (with help, obviously). I don't want the people who know most about safety to be "involved in the drafting process"; I want to know that they oversaw the process and stand by the final version.

I'm sure there are other sensible additions, but that'd be a decent start.

Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none) This can't be implicit, since it's a central way that we die. If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately.

This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.

Yeah, I agree—that's why I'm specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you'd know if there were anything wrong that your other evals might miss.

Define ASLs or similar now rather than waiting until we're much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].

Evan's own "We need to make sure that, once we have solid understanding-based evals, governments make them mandatory" only reinforces this impression. Whether we have them is irrelevant to the question of whether they're necessary.

See the bottom of this comment: my main objection here is that if we were to try to define it now, we'd end up defining something easily game-able because we don't yet have metrics for understanding that aren't easily game-able. So if we want something that will actually be robust, we have to wait until we know what that something might be—and ideally be very explicit that we don't yet know what we could put there.

I think there's a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway - with government/regulator backing, since they're doing everything practical, everything reasonable.... Assuming this won't happen seems dangerously naive.

I definitely agree that this is a serious concern! That's part of why I'm writing this post: I want more public scrutiny and pressure on RSPs and their implementation to try to prevent this sort of thing.

Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)

IANAL, but I think that this is currently impossible due to anti-trust regulations. The White House would need to enact a safe harbor policy for anti-trust considerations in the context of AI safety to make this possible.

I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having an ASL-4 criteria and safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have safety buffer, assume continuity, etc) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.

I would have preferred it if they had included tentative proposals for ASL-4 evaluations and what their current best safety plan/argument for ASL-4 looks like (using just current science, no magic), and then explained that this plan wouldn't be sufficient for reasonable amounts of safety (insofar as this is what they think).

Right now, they just have a bulleted list for ASL-4 countermeasures, but this is the main interesting thing to me. (I'm not really sold on substantial risk from systems which aren't capable of carrying out that harm mostly autonomously, so I don't think ASL-3 is actually important except as setup.)

It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs.

I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.

On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe.

I want to be very clear that I've been really happy to see all the people pushing for strong asks here. I think it's a really valuable thing to be doing, and what I'm trying to do here is not stop that but help it focus on more concrete asks.

I would be more sympathetic to the RSP approach if it was like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime."

To be clear, I definitely agree with this. My position is not "RSPs are all we need", "pauses are bad", "pause advocacy is bad", etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. "RSPs are pauses done right."

To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”

Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.

I mean, to be clear, I am saying something like "RSPs are the most effective way to implement a pause that I know of." The thing I'm not saying is just that "RSPs are the only policy thing we should be doing."

Instead, ARC explicitly tries to paint the moratorium folks as "extreme".

Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

Thanks for writing this up.
I agree that the issue is important, though I'm skeptical of RSPs so far, since we have one example and it seems inadequate - to the extent that I'm positively disposed, it's almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP. (of course I may be missing something in some cases)

Going only by the language in the blog post and the policy, I'd conclude that they're an excuse to continue scaling while being respectably cautious (though not adequately cautious). Granted, I'm not the main target audience - but I worry about the impression the current wording creates.

I hope that RSPs can be beneficial - but I think much more emphasis should be on the need for positive demonstration of safety properties, that this is not currently possible, and that it may take many years for that to change. (mentioned, but not emphasized in the Anthropic policy - and without any "many years" or similar)

It's hard to summarize my concerns, so apologies if the following ends up somewhat redundant.
I'll focus on your post first, and the RSP blog/policy doc after that.

Governments are very good at solving “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.

There's an obvious thing to do here. It's far from obvious that it's a solution.
One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.

RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks.

They kick in when we detect that models have capabilities that we realize are relevant to downstream risks.
Both detection and realization can fail.

My main worry here isn't that we'll miss catastrophic capabilities in the near term (though it's possible). Rather it's the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there's a decent chance some of them fail before we expect them to.

Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety.

This could use greater emphasis in the RSP blog/doc.

Ideally, we should get the governmental RSPs to be even stronger!

Yes!

We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.

We need governments to make them mandatory before they're necessary, not once we have them (NB, not [before it's clear they're necessary] - it might not be clear). I don't expect us to have sufficiently accurate understanding-based evals before they're necessary. (though it'd be lovely)

Pushing to require state-of-the-art safety techniques is the wrong emphasis.
We need to push for adequate safety techniques. If state-of-the-art techniques aren't yet adequate, then labs need to stop.

 

Thoughts on the blog/doc themselves. Something of a laundry list, but hopefully makes clear where I'm coming from:

  1. My top-level concern is overconfidence: to the extent that we understand what's going on, and things are going as expected, I think RSPs similar to Anthropic's should be pretty good. This gives me very little comfort, since I expect catastrophes to occur when there's something unexpected that we've failed to understand.
    1. Both the blog post and the policy document fail to make this sufficiently clear.
    2. Examples:
      1. From the blog: "On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling...".
        1. This is not true: the incentive is to satisfy the conditions in the RSP. That's likely to mean that the lab believes they've solved the necessary safety issues. They may not be correct about that.
        2. To the extent that they triple-check even after they think all is well, that's based on morality/self-preservation. The RSP incentives do not push that way. Incorrectly believing they push that way doesn't give me confidence.
      2. No consideration of the possibility of jumping from ASL-(n) to ASL-(n+2).
      3. No consideration of a model being ASL-(n+1), but showing no detectable warning signs beyond ASL-n. (there's a bunch on bumping into ASL-n before expecting to - but not on this going undetected)
      4. I expect the preceding to be unusual; conditional on catastrophe, I expect something unusual has happened.
  2. On evals:
    1. Demanding capabilities may be strongly correlated, so that it doesn't matter too much if we fail to test for everything important. Alternatively, it could be the case that we do need to cover all the bases, since correlations aren't as strong as we expect. In that case, [covering all the bases that we happen to think of] may not be sufficient. (this seems unlikely, but possible, to me)
    2. More serious is the possibility that there are methods of capability elicitation/amplification that the red-teamers don't find. For example, if no red-teamer had thought to try chain-of-thought approaches, capabilities might have been missed. Where is the guarantee that nothing like this is missed?
      I don't see any correlation-based defense here - it seems quite possible that some ways to extract capabilities are just much better than others. What confidence level should we have that testing finds the best ways?
    3. Why isn't it emphasized that red-teaming can show that something is dangerous, but not that it's safe? Where's the discussion around how often we should expect tests to fail to catch important problems? Where's the discussion around p(model is dangerous | model looks safe to us)? Is this low? Why? When? When will this change? How will we know?...
    4. In general the doc seems to focus on [we're using the best techniques currently available], and fails to make a case that [the best techniques currently available are sufficient].
      1. E.g. page 16: "Evaluations should be based on the best capabilities elicitation techniques we are aware of at the time"
      2. This worries me because governments/regulators are used to situations where state-of-the-art tests are always adequate (since building x tends to imply understanding x, outside ML). Therefore, I'd want to see this made explicit and clear.
        1. This is the closest I can find, but it's rather vague:
          "Complying with higher ASLs is not just a procedural matter, but may sometimes require research or technical breakthroughs to give affirmative evidence of a model’s safety (which is generally not possible today)..."
          It'd be nice if the reader couldn't assume throughout that the kind of research/breakthrough being talked about is the kind that's routinely doable within a few months, rather than the kind that may take a decade.
  3. Miscellaneous:
    1. From the policy document, page 2:
      "As AI systems continue to scale, they may become capable of increased autonomy that enables them to proliferate and, due to imperfections in current methods for steering such systems, potentially behave in ways contrary to the intent of their designers or users."
      1. To me "...imperfections in current methods..." seems misleading - it gives the impression that labs basically know what they're doing on alignment, but need to add a few tweaks here and there. I don't believe this is true, and I'd be surprised to learn that many at Anthropic believe this.
    2. Policy doc, page 3:
      "Rather than try to define all future ASLs and their safety measures now (which would almost certainly not stand the test of time)..."
      This seems misleading since it's not hard to define ASLs and safety measures which would stand the test of time: the difficult thing is to define measures that stand the test of time, but allow scaling to continue.

      There's an implicit assumption here that the correct course is to allow as much scaling as we can get away with, rather than to define strict measures that would stop things for the foreseeable future - given that we may be overconfident.
      I don't think it's crazy to believe the iterative approach is best, but I do think it deserves explicit argument. If the argument is "yes, stricter measures would be nice, but aren't realistic right now", then please say this (not just here in your post, I mean - somewhere clear to government people).

      In particular, I think it's principled to make clear that a lab would accept more strict conditions if they were universally enforced than those it would unilaterally adopt.
      Conversely, I find it worrying for a lab to say "we're unilaterally doing x, and we think [everyone doing x] is the thing to aim for", since I expect the x that makes unilateral sense to be inadequate as a global coordination target.
    3. Page 10:
      "We will manage our plans and finances to support a pause in model training if one proves necessary"
      This seems nice, but gives the impression more of [we might need to pause for six months] than [we might need to pause for ten years]. Given that the latter seems possible, it seems important to acknowledge that radical contingency plans would be necessary for this - and to have such plans (potentially with government assistance, and/or [stuff that hasn't occurred to me]).
      Without that, there'll be an unhelpful incentive to cut corners or to define inadequate ASLs on the basis that they seem more achievable.
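
The p(model is dangerous | model looks safe to us) question raised under "On evals" can be made concrete with Bayes' rule. The sketch below is illustrative only: the function name and all numbers are hypothetical placeholders, not estimates taken from the Anthropic RSP or from anyone in this thread.

```python
def p_dangerous_given_looks_safe(prior_dangerous, miss_rate, pass_rate_if_safe=1.0):
    """Bayes' rule for P(dangerous | evals found nothing).

    prior_dangerous:   P(dangerous) before any testing (hypothetical input)
    miss_rate:         P(looks safe | dangerous), i.e. elicitation fails to surface the capability
    pass_rate_if_safe: P(looks safe | safe)
    """
    looks_safe_and_dangerous = prior_dangerous * miss_rate
    looks_safe_and_safe = (1.0 - prior_dangerous) * pass_rate_if_safe
    return looks_safe_and_dangerous / (looks_safe_and_dangerous + looks_safe_and_safe)


# Placeholder prior of 5%; sweep over how often red-teaming misses a real capability.
for miss_rate in (0.1, 0.5, 0.9):
    posterior = p_dangerous_given_looks_safe(0.05, miss_rate)
    print(f"miss_rate={miss_rate:.1f} -> P(dangerous | looks safe) = {posterior:.4f}")
```

As the miss rate approaches 1, a clean eval result leaves the posterior essentially at the prior. That is the sense in which a passed red-team exercise says little unless there is independent reason to believe the elicitation methods are strong.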

I'm mostly not going to comment on Anthropic's RSP right now, since I don't really want this post to become about Anthropic's RSP in particular. I'm happy to talk in more detail about Anthropic's RSP maybe in a separate top-level post dedicated to it, but I'd prefer to keep the discussion here focused on RSPs in general.

One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.

I definitely share this worry. But that's part of why I'm writing this post! Because I think it is possible for us to get good RSPs from all the labs and governments, but it'll take good policy and advocacy work to make that happen.

My main worry here isn't that we'll miss catastrophic capabilities in the near term (though it's possible). Rather it's the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there's a decent chance some of them fail before we expect them to.

I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. Though it'll require those capabilities evaluations to actually be done effectively, I think we at least do know how to do effective capabilities evaluations—it's mostly a solved problem in theory and just requires good implementation.

We need governments to make them mandatory before they're necessary, not once we have them (NB, not [before it's clear they're necessary] - it might not be clear). I don't expect us to have sufficiently accurate understanding-based evals before they're necessary. (though it'd be lovely)

Pushing to require state-of-the-art safety techniques is the wrong emphasis. We need to push for adequate safety techniques. If state-of-the-art techniques aren't yet adequate, then labs need to stop.

The distinction between an alignment technique and an alignment evaluation is very important here: I very much am trying to push for adequate safety techniques rather than simply state-of-the-art safety techniques, and the way I'm proposing we do that is via evaluations that check whether we understand our models. What I think probably needs to happen before you can put understanding-based evals in an RSP is not that we have to solve mechanistic interpretability—it's that we have to solve understanding-based evals. That is, we need to know how to evaluate whether mechanistic interpretability has been solved or not. My concern with trying to put something like that into an RSP right now is that it'll end up evaluating the wrong thing: since we don't yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.

I think we at least do know how to do effective capabilities evaluations

This seems an overstatement to me:
Where the main risk is misuse, we'd need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later. (including the most artful AutoGPT 3.0 setups etc)

It seems reasonable to me to claim that "we know how to do effective [capabilities given sota elicitation methods] evaluations", but that doesn't answer the right question.

Once the main risk isn't misuse, then we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn't realize we were relying upon]). Obviously we don't expect these to break yet, but I'd guess that we'll be surprised the first time they do break.
I expect your guess on when they will break to be more accurate than mine - but that [I don't have much of a clue, so I'm advocating extreme caution] may be the more reasonable policy.

My concern with trying to put something like [understanding-based evals] into an RSP right now is that it'll end up evaluating the wrong thing: since we don't yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.

We don't know how to put the concrete eval in the RSP, but we can certainly require that an eval for understanding passes. We can write in the RSP what the test would be intended to achieve, and conditions for the approval of the eval. E.g. [if at least two of David Krueger, Wei Dai and Abram Demski agree that this meets the bar for this category of understanding eval, then it does] (or whatever other criteria you might want).

Again, only putting targets that are well understood concretely in the RSP seems like a predictable way to fail to address poorly understood problems.
Either the RSP needs to cover the poorly understood problems too - perhaps with a [you can't pass this check without first coming up with a test and getting it approved] condition, or it needs a "THIS RSP IS INADEQUATE TO ENSURE SAFETY" warning in huge red letters on every page. (if the Anthropic RSP communicates this at all, it's not emphasized nearly enough)

The point is that advocating for a “pause” is nebulous and non-actionable

Setting aside the potential advantages of RSPs, this strikes me as a pretty weird thing to say. I understand the term "pause" in this context to mean that you stop building cutting-edge AI models, either voluntarily or due to a government mandate. In contrast, "RSP" says you eventually do that but you gate it on certain model sizes and test results and unpause it under other test results. This strikes me as a bit less nebulous, but only a bit.

I'm not quite sure what's going on here - it's possible that the term "pause" has gotten diluted? Seems unfortunate if so.

I think the problem is that nobody really has an idea for what the resumption condition should be for a pause, and nobody's willing to defend the (actually actionable) six-month pause proposal.

the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!

my own worry with RSPs is that they bake in (and legitimise) the assumptions that a) near term (eval-less) scaling poses trivial xrisk, and b) there is a substantial period during which models trigger evals but are existentially safe. you must have thought about them, so i’m curious what you think.

that said, thank you for the post, it’s a very valuable discussion to have! upvoted.

the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!

Sure, but I guess I would say that we're back to nebulous territory then—how much longer than six months? When if ever does the pause end?

a) near term (eval-less) scaling poses trivial xrisk

I agree that this is mostly baked in, but I think I'm pretty happy to accept it. I'd be very surprised if there was substantial x-risk from the next model generation.

But also I would argue that, if the next generation of models do pose an x-risk, we've mostly already lost—we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place. So instead I would argue that we should be planning a bit further ahead than that, and trying to get something actually workable in place further out—which should also be easier to do because of the dynamic where organizations are more willing to sacrifice potential future value than current realized value.

b) there is a substantial period during which models trigger evals but are existentially safe

Yeah, I agree that this is tricky. Theoretically, since we can set the eval bar at any capability level, there should exist capability levels that you can eval for and that are safe but scaling beyond them is not. The problem, of course, is whether we can effectively identify the right capabilities levels to evaluate in advance. The fact that different capabilities are highly correlated with each other makes this easier in some ways—lots of different early warning signs will all be correlated—but harder in other ways—the dangerous capabilities will also be correlated, so they could all come at you at once.

Probably the most important intervention here is to keep applying your evals while you're training your next model generation, so they trigger as soon as possible. As long as there's some continuity in capabilities, that should get you pretty far. Another thing you can do is put strict limits on how much labs are allowed to scale their next model generation relative to the models that have been definitively evaluated to be safe. And furthermore, my sense is that at least in the current scaling paradigm, the capabilities of the next model generation tend to be relatively predictable given the current model generation.

So overall, my sense is that takeoff only has to be marginally continuous for this to work—if it's extremely abrupt, more of a classic FOOM scenario, then you might have problems, but I think that's pretty unlikely.

that said, thank you for the post, it’s a very valuable discussion to have! upvoted.

Thanks! Also happy to chat about this more offline.

Sure, but I guess I would say that we're back to nebulous territory then—how much longer than six months? When if ever does the pause end?

i agree that, if hashed out, the end criteria may very well resemble RSPs. still, i would strongly advocate for a scaling moratorium until widely (internationally) acceptable RSPs are put in place.

I'd be very surprised if there was substantial x-risk from the next model generation.

i share the intuition that the current and next LLM generations are unlikely to be an xrisk. however, i don't trust my (or anyone else's) intuitions strongly enough to say that there's a less than 1% xrisk per 10x scaling of compute. in expectation, that's killing 80M existing people -- people who are unaware that this is happening to them right now.
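
The 80M figure is an expected-value calculation. A minimal sketch, assuming a world population of roughly 8 billion (an assumption, since the comment does not state the figure it used); the function and constant names are hypothetical:

```python
WORLD_POPULATION = 8_000_000_000  # assumption: roughly 8 billion people alive today


def expected_deaths(p_xrisk, population=WORLD_POPULATION):
    """Expected number of existing people killed, given an extinction probability."""
    return p_xrisk * population


# 1% extinction risk per 10x scaling step, as in the comment above.
print(f"{expected_deaths(0.01):,.0f}")
```

With a 1% probability this works out to 80 million, which matches the figure in the comment.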

Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?

Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
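
The "whichever comes sooner" rule can be written as a simple trigger check. This is a sketch of the rule as described in the comment above, not Anthropic's implementation: the function name, the way effective compute is measured, and the inclusive thresholds are all assumptions.

```python
def capability_eval_due(effective_compute, compute_at_last_eval, months_since_last_eval,
                        compute_factor=4.0, max_months=3.0):
    """Return True once either trigger has fired, whichever comes first.

    compute_factor and max_months default to the 4x / 3-month values quoted above.
    Treating both thresholds as inclusive (>=) is an assumption.
    """
    compute_trigger = effective_compute >= compute_factor * compute_at_last_eval
    time_trigger = months_since_last_eval >= max_months
    return compute_trigger or time_trigger


# Hypothetical mid-training checkpoints, in units of compute at the last evaluation.
print(capability_eval_due(4.0, 1.0, 0.5))  # compute trigger fires before 3 months pass
print(capability_eval_due(2.0, 1.0, 3.0))  # time trigger fires before 4x compute
print(capability_eval_due(3.0, 1.0, 1.0))  # neither trigger has fired yet
```

Note that a checkpoint at 3x the compute of the last evaluated model would not trigger an eval under this rule until three months have passed.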

Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT4 I'd be very keen to know what it could do!

if the next generation of models do pose an x-risk, we've mostly already lost—we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place

Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not? I'm asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very very few -- in my understanding, only 3 or 4 labs have any chance of releasing a "next generation" model for as much as two years from now, others won't be able to achieve this level of capability even if they tried), but in the post you advocate for the opposite--for voluntary actions taken by the labs, and that regulation can follow.

Is the idea that an indefinite pause is unactionable? If so, I'm not sure why you think that.

I talk about that here:

And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.

I mean, whether something's realistic and whether something's actionable are two different things (both separate from whether something's nebulous) - even if it's hard to make a pause happen, I have a decent guess about what I'd want to do to up those odds: protest, write to my congress-person, etc.

As to the realism, I think it's more realistic than I think you think it is. My impression of AI Impacts' technological temptation work is that governments are totally willing to enact policies that impoverish their citizens without requiring a rigorous CBA. Early wins do seem like an important consideration, but you can imagine trying to get some early wins by e.g. banning AI from being used in certain domains, or banning people from developing advanced AI without doing X, Y, or Z.

I mean, whether something's realistic and whether something's actionable are two different things (both separate from whether something's nebulous) - even if it's hard to make a pause happen, I have a decent guess about what I'd want to do to up those odds: protest, write to my congress-person, etc.

Sure—I just think it'd be better to spend that energy advocating for good RSPs instead.

To be clear, the whole point of my post is that I am in favor of pausing/stopping AI development—I just think the best way to do that is via RSPs.

On RSPs vs pauses, my basic take is that hardcore pauses are better than RSPs and RSPs are considerably better than weak pauses.

Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long-run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.

I think good RSPs are worse than this, but probably much better than just having a lab pause scaling.

It's possible that various actors should explicitly state that hardcore pauses would be better (insofar as they think so).

  • A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. ...
  • A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. ...

I propose changing the term for this second type of evaluation to "propensity evaluations". I think this is a better term as it directly fits the definition you provided: "a model evaluation designed to test under what circumstances a model would actually try to do some task".

Moreover, I think that both capabilities evaluations and propensity evaluations can be types of safety evaluations. Therefore, it's misleading to label only one of them as "safety evaluations". For example, we could construct a compelling safety argument for current models using solely capability evaluations.

Either can be sufficient for safety: a strong argument based on capabilities (we've conclusively determined that the AI is too dumb to do anything very dangerous) or a strong argument based on propensity (we have a theoretically robust and empirically validated case that our training process will result in an AI that never attempts to do anything harmful).

Alternatively, a moderately strong argument based on capabilities combined with a moderately strong argument based on propensity can be sufficient, provided that the evidence is sufficiently independent.
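
One way to see why two moderate arguments can add up to a strong one is to write the risk as P(AI tries) times P(AI succeeds, given that it tries) and multiply the two bounds. The sketch below does only that arithmetic. The function name and all of the numbers are invented for illustration, and it assumes both bounds actually hold at the same time, which is where the independence of the evidence matters.

```python
def risk_upper_bound(p_tries: float, p_succeeds_if_tries: float) -> float:
    """Upper bound on P(bad outcome) = P(AI tries) * P(AI succeeds | it tries)."""
    for p in (p_tries, p_succeeds_if_tries):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_tries * p_succeeds_if_tries


# Made-up numbers, purely for illustration.
strong_capability_only = risk_upper_bound(1.0, 0.001)  # no propensity argument at all
two_moderate_arguments = risk_upper_bound(0.05, 0.05)  # two moderate bounds combined
```

If the two pieces of evidence share a failure mode (say, the same flawed evaluation underlies both), the product overstates how much safety has been established.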

I sometimes refer to capability based arguments as control arguments.

Then, we can name two lines of defense:

  • The control line of defense: Would the AI succeed at causing bad outcomes if it tried?
  • The propensity line of defense: Would the AI try to cause bad outcomes?

It's possible to develop techniques which advance either the control line of defense or the propensity line of defense. Of course, many research directions are useful for both.

I expect that in practice, we're not very close to being able to make good propensity arguments (for instance, theory and interpretability both seem to me to be unlikely to establish this in the next several doublings of total R&D effort). However, we're not that far off from building quite powerful control based countermeasures. I think these control based countermeasures could scale to establishing barely acceptable safety arguments (e.g. 1-5% doom per year) for around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x.

Further, as Evan noted, control style arguments seem much more straightforward to evaluate (though various complications can result from exploration and gradient hacking). So, I think the feedback loops on control look quite promising.

(Redwood Research, where I work, is currently pursuing several control style projects and we should be putting out various things on this soon.)

An important thing to emphasize with control arguments is that it seems quite unlikely that control arguments can be made workable for very superhuman models. (At least for the notion of "control arguments" which can be readily assessed with non-insane capability evaluations.)

[it turns out I have many questions - please consider this a pointer to the kind of information I'd find useful, rather than a request to answer them all!]

around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x

Can you point to what makes you think this is likely? (or why it seems the most promising approach)

In particular, I worry when people think much in terms of "doublings of total R&D effort" given that I'd expect AI assistance progress multipliers to vary hugely - with the lowest multipliers correlating strongly with the most important research directions.

To me it seems that the kind of alignment research that's plausible to speed up 30x is the kind that we can already do without much trouble - narrowly patching various problems in ways we wouldn't expect to generalize to significantly superhuman systems.
That and generating a ton of empirical evidence quickly - which is nice, but I expect the limiting factor is figuring out what questions to ask.

It doesn't seem plausible that we get a nice inductive pattern where each set of patches allows a little more capability safely, which in turn allows more patches.... I'm not clear on when this would fail, but pretty clear that it would fail.

What we'd seem to need is a large speedup on more potentially-sufficiently-general-if-they-work approaches - e.g. MIRI/ARC-theory/JW stuff.

30x speedup on this seems highly unlikely. (I guess you'd agree?)
Even if it were possible to make a month of progress in one day, it doesn't seem possible to integrate understanding of that work in a day (if the AI is doing the high-level integration and direction-setting, we seem to be out of the [control measures will keep this safe] regime).

I also note that empirically, theoretical teams don't tend to add a load of very smart humans. I'm sure that Paul could expand a lot more quickly if he thought that was helpful. Likewise MIRI.
Are they making a serious error here, or are the limitations of very-smart-human assistants not going to apply to AI assistants? (granted, I expect AI assistants aren't going to have personality clashes etc)

Are you expecting sufficiently general alignment solutions to come out of work that doesn't require deep integrated understanding? Can you point to current work (or properties of current work) that would be examples? Would you guess the things we could radically speed up are sufficient for a solution, or just useful? If the latter, how much painfully-slow-by-comparison work seems likely to be needed?

Or would the hope be that for more theoretical work there's a significant speedup, even if it's not 30x? What seems plausible to you here? 5x? Why is this currently not being achieved through human scaling? Is 5x enough to compensate for the risks? What multiplier would be just sufficient to compensate?

What would you consider early evidence of the expected multiplier for theoretical work?
E.g. should we be getting a 3x speedup with current AIs on open, underspecified problems that seem somewhat easier than alignment? Are we? (on anything - not only alignment-relevant things)


My immediate reaction to this kind of approach is that it feels like wishful thinking without much evidence. However, I'm aware that I do aesthetically prefer theoretically motivated approaches - so I don't entirely trust my reaction.
I can buy being even more pessimistic about the theoretical approaches than getting lucky with software based R&D - but to me this suggests that coordination around a stop might be the best bet.

I'm not going to respond to everything you're saying here right now. It's pretty likely I won't end up responding to everything you're saying at any point; so apologies for that.

Here are some key claims I want to make:

  • Serial speed is key: Speeding up theory work (like e.g. ARC theory) by 5-10x should be quite doable with human level AIs due to AIs running at much faster serial speeds. This is a key difference between adding AIs and adding humans. Theory can be hard to parallelize which makes adding humans look worse than increasing speed. I'm not confident in speeding up theory work by >30x with controlled and around human level AI, but this doesn't seem impossible.
  • Access to the human level AIs makes safety work much more straightforward: A key difference between current safety work and future safety work is that in the future we'll have access to the exact AIs we're worried about. I expect this opens up a bunch of empirical work which is quite useful and relatively easy to scalably automate with AIs. I think this work could extend considerably beyond "patches". (The hope here is similar to model organisms, but somewhat more general.)
  • The research target can be trusted human level systems instead of superhuman systems. One possible story for victory goes something like "control of untrusted AIs -> trustworthy human level (or slightly superhuman AIs) -> [some next target like fully scalable alignment]". If human researchers are literally fully obsoleted by reasonably trustworthy human level AIs and these AIs can collectively speed up any field of alignment by >30x, we should be in a radically better position. These trustworthy AI researchers could work on fully scalable alignment, control or alignment of the next generation of smarter AIs, or some combination of both of these. Researching how to make trusted human level AIs seems much more tractable than researching how to align wildly superhuman systems (though both are hard to measure).
  • Depending on just control results in a very bad absolute level of risk, but it still might be our best option. I estimated 1-5% doom per year above, but my exact guess will vary depending on various factors. So coordination to do better than this would be great.
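
For a sense of what "1-5% doom per year" compounds to, here is a rough sketch. It assumes a constant annual risk that is independent from year to year, which is my simplification and not something claimed above; the function name is also just for illustration.

```python
def cumulative_risk(annual_risk: float, years: int) -> float:
    """Probability of at least one catastrophe over `years`, assuming a
    constant annual risk that is independent across years."""
    if not 0.0 <= annual_risk <= 1.0:
        raise ValueError("annual_risk must lie in [0, 1]")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1.0 - (1.0 - annual_risk) ** years


low = cumulative_risk(0.01, 10)   # about 0.096 over a decade at 1% per year
high = cumulative_risk(0.05, 10)  # about 0.40 over a decade at 5% per year
```

Under these assumptions, the range quoted above corresponds to roughly a 10% to 40% chance of catastrophe if the control-only regime lasted a decade, which is one way to read "a very bad absolute level of risk".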

Here are some other less important claims which feed into my overall takes:

  • Current AIs aren't useful for theory yet partially because they're too dumb. They suck at math.
  • I think part of the problem with current AIs is getting the AIs enough context, but I think this is a doable technical problem which probably needs to be solved for AIs to be human level and I expect this to be solved prior to AIs taking over.
  • My understanding is that the current work done by ARC theory results in them trying to solve a bunch of relatively straightforward math problems, and if they could solve all of these problems very quickly, this would massively accelerate their work. I expect this to be roughly true going forward due to my understanding of their methodology, but I'm not very confident here.
  • AIs have other structural advantages beyond serial speed which will make speeding things up with AIs relatively easier than with humans.

This is clarifying, thanks.

A few thoughts:

  • "Serial speed is key":
    • This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn't need to be re-directed frequently [EDIT: the preceding was poorly worded - I meant that if prior to the availability of AI assistants this were true, it'd allow a lot of speedup as the AIs take over this work; otherwise it's less clearly so helpful].
      Perhaps this is true for ARC - that's encouraging (though it does again make me wonder why they don't employ more mathematicians - surely not all the problems are serial on a single critical path?).
      I'd guess it's less often true for MIRI and John.
      • Of course once there's a large speedup of certain methods, the most efficient methodology would look different. I agree that 5x to 10x doesn't seem implausible.
  • "...in the future we'll have access to the exact AIs we're worried about.":
    • We'll have access to the ones we're worried about deploying.
    • We won't have access to the ones we're worried about training until we're training them.
    • I do buy that this makes safety work for that level of AI more straightforward - assuming we're not already dead. I expect most of the value is in what it tells us about a more general solution, if anything - similarly for model organisms. I suppose it does seem plausible that this is the first level we see a qualitatively different kind of general reasoning/reflection that leads us in new theoretical directions. (though I note that this makes [this is useful to study] correlate strongly with [this is dangerous to train])
  • "Researching how to make trustworthy human level AIs seems much more tractable than researching how to align wildly superhuman systems":
    • This isn't clear to me. I'd guess that the same fundamental understanding is required for both. "trustworthy" seems superficially easier than "aligned", but that's not obvious in a general context.
      I'd expect that implementing the trustworthy human-level version would be a lower bar - but that the same understanding would show us what conditions would need to obtain in either case. (certainly I'm all for people looking for an easier path to the human-level version, if this can be done safely - I'd just be somewhat surprised if we find one)
  • "So coordination to do better than this would be great".
    • I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.
  • "So coordination to do better than this would be great".
    • I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.

As far as the ideal, I happened to write something about it in another comment yesterday. Excerpt:

Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long-run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.

As far as expedient, something like:

  • Demand labs have good RSPs (or something similar) using inside and outside game, try to get labs to fill in tricky future details of these RSPs as early as possible without depending on "magic" (speculative future science which hasn't yet been verified). Have AI takeover motivated people work on the underlying tech and implementation.
  • Work on policy and aim for powerful US policy interventions in parallel. Other countries could also be relevant.

Both of these are unlikely to perfectly succeed, but they seem like good directions to push on.

I think pushing for AI lab scaling pauses is probably net negative right now, but I don't feel very strongly either way (it mostly just feels not that leveraged overall). I think slowing down hardware progress seems clearly good if we could do it at low cost, but seems super intractable.

Thanks, this seems very reasonable. I'd missed your other comment.
(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))

(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))

Corresponding comment text:

This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn't need to be re-directed frequently [EDIT: the preceding was poorly worded - I meant that if prior to the availability of AI assistants this were true, it'd allow a lot of speedup as the AIs take over this work; otherwise it's less clearly so helpful].

I think I disagree with what you meant, but not that strongly. It's not that important, so I don't really want to get into it. Basically, I don't think that "well-defined" is that important (it's not obviously required for some ability to judge the finished work) and I don't think "re-direction frequency" is the right way to think about it.

if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end [...] just advocate for that condition being baked into RSPs

Resume when the scientific community has a much clearer idea about how to build AGIs that don't pose a large extinction risk for humanity. This consideration can't be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.

RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.


Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is how I initially read it, but not sure that it's the intended meaning.

More thoughts that are only semi-relevant if I misunderstood below.

------------------------------------------------------------------------------------------------------------------------------------------------------------------

If I'm understanding the assumption correctly, the idea that the capabilities benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1] given (a) irreversibility of open-source proliferation (b) lack of effective guardrails against misuse of open-source LLMs.

Perhaps the implicit argument is that labs should assume their models will be leaked when doing risk evaluations unless they have insanely good infosec so they should effectively treat their models as open-source. Anthropic does say in their RSP:

To account for the possibility of model theft and subsequent fine-tuning, ASL-3 is intended to characterize the model’s underlying knowledge and abilities

This makes some sense to me, but looking at the definition of ASL-3 as if the model is effectively open-sourced:

We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things

I understand that limiting to only 1% of the model costs and only existing post-training techniques makes it more tractable to measure the risk, but it strikes me as far from a conservative bound if we are assuming the model might be stolen and/or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training and/or more years going by allowing improved post-training enhancements.

Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we're back to why the open-source capability benchmarks should be the same as closed-source.

  1. ^

    This is not to take a stance on the effect of open-sourcing LLMs at current capabilities levels, but rather being surprised that the capability threshold for when open-source is too dangerous would be the same as closed-source. 

If the model is smart enough, you die before writing the evals report; if it’s just kinda smart, you don’t find it to be too intelligent and die after launching your scalable oversight system that, as a whole, is smarter than individual models.

An international moratorium on all training runs that could stumble on something that might kill everyone is much more robust than regulations around evaluated capabilities of already trained models