This NIST Risk Management approach would be great if AI Alignment were a mature field whose underlying subject matter wasn't itself advancing extremely fast. If only we could do this! But currently I think that for many of our risk estimates it would be hard to get agreement between topic experts to within even an order of magnitude (e.g.: is AGI misalignment >90% likely or <10% likely? YMMV). I think we should aspire to be a field mature enough that formal Risk Management is applicable, and in some areas, such as short-term misuse risks from current well-understood models (open-source ones, say), that might even be approaching feasibility. But for the existential AGI-level risks several years out, which are the ones that matter most, the field just isn't mature enough yet.
Bear in mind that there was a point in the development of nuclear safety when people built open-circuit air-cooled nuclear reactors in which the airflow past the pile went up a chimney, venting the hot air into the atmosphere (yes, really: https://en.wikipedia.org/wiki/Windscale_Piles). They didn't know better at the time, and after one caught fire, there's an area near Ravenglass in England that's still somewhat contaminated as a result. Nuclear safety knowledge has since advanced enough to support elaborate Risk Management, and a good thing too, but unfortunately AI safety isn't that mature yet.
Summary
TLDR
Responsible Scaling Policies (RSPs) have recently been proposed as a way to keep scaling frontier large language models safely.
While being a nice attempt at committing to specific practices, the framework of RSP is:
Given that, I expect the RSP framework to be negative by default (Sections 3, 4 and 5). Instead, I propose to build upon risk management as the core underlying framework to assess AI risks (Sections 1 and 2). I suggest changes to the RSP framework that would make it more likely to be positive and allow it to demonstrate what it claims to do (Section 5).
Section by Section Summary:
General Considerations on AI Risk Management
This section provides background on risk management and a motivation for its relevance in AI.
What Standard Risk Management Looks Like
This section describes the main steps of most risk management systems, explains how they apply to AI, and provides examples from other industries of what they look like.
RSPs vs Standard Risk Management
This section provides a table comparing RSPs with the generic risk management standard ISO 31000, explaining the weaknesses of RSPs.
It then provides a list of 3 of the biggest failures of RSPs compared with risk management.
Prioritized RSPs failures against risk management:
Why RSPs Are Misleading and Overselling
Misleading points:
Overselling and Underdelivering
Are RSPs Hopeless?
This section explains why using RSPs as a framework is inadequate, even compared to just starting from already-existing AI risk management frameworks and practices such as:
A substantial amount of the work done on RSPs will be helpful for detailing those frameworks, but the core foundational principles of RSPs are wrong and so should be abandoned.
How to move forward?
Pragmatically, I suggest a set of changes that would make RSPs more likely to be helpful for safety. To mitigate the nefarious effects on policy and communication:
To see whether already defined RSPs are consistent with reasonable levels of risks:
Meta
Epistemic status: I've been looking into the safety standards of various dangerous industries for about 4-6 months now, with a focus on nuclear safety. I've been doing AI standardization work in standardization bodies (CEN-CENELEC & ISO/IEC) for about 10 months, alongside risk management experts from other domains (e.g. medical devices, cars). In that context, I've read the existing AI ISO/IEC SC42 and JTC21 standards and started trying to apply them to LLMs and refine them. On RSPs, I've spent a few dozen hours reading the docs and discussing them with people involved in and around them.
Tone: I hesitated over how charitable I wanted this piece to be. On the one hand, I think that RSPs are a pretty toxic meme (see Section 4) that got rushed towards global promotion without much epistemic humility over how it was framed and, as far as I've seen, without anyone caring much about existing risk management approaches. In that sense, I think that it should be strongly pushed back against under its current framing.
On the other hand, it's usually better to avoid negative connotations and keep the discussion calm in order to move forward epistemically and constructively.
I aimed for something in between: I emphasized with strong negative connotations what I think are the worst parts, while trying to remain constructive and to focus on the object level in many parts.
This mixture may have caused me to land in an uncanny valley, and I'm curious to receive feedback on that.
Section 1: General Considerations on AI Risk Management
Risk management is about demonstrating that risks are below acceptable levels. Demonstrating the absence of risks is much more difficult than showing that some risks are dealt with. More specifically, the less you understand a system, the harder it is to rule out risks.
Let’s take an example: why can we more easily prove that the chance of a nuclear power plant causing a large-scale catastrophe is <1/100,000, while we can’t do so for GPT-5? In large part because we now understand nuclear power plants and many of their risks. We know how they work and the ways they can fail. Engineers have turned a very unstable reaction (nuclear fission) into something manageable (with nuclear reactors). So the uncertainty we have about a nuclear power plant is much smaller than the one we have about GPT-5.
One corollary is that in risk management, uncertainty is an enemy. Saying “we don’t know” is a failure. Confidently ruling out risks requires a deep understanding of the system and the ability to dismiss significant worries with very high confidence. To be clear: it is hard, particularly when the operational domain of your system is “the world”. That’s why safety is demanding. But is that a good reason to lower our safety standards when the lives of billions of people are at stake? Obviously not.
One could legitimately say: Wait, but there’s no risk in sight, the burden of proof is on those that claim that it’s dangerous. Where’s the evidence?
Well, there’s plenty:
One could legitimately say: No, but it’s not catastrophic, it’s not a big deal. Against this stance, I’ll quote the famous physicist R. Feynman reflecting on the Challenger disaster in rocket safety, a field with much higher standards than AI safety:
One could finally hope that we understand the past failures of our systems. Unfortunately, we don’t. Not only do we not understand their failures; we don’t understand how and why they work in the first place.
So how are we supposed to deal with risk?
Risk management proposes a method in a few steps that I’ll describe below. Most industries implement a process along those lines, with some minor variations and a varying degree of rigor and depth according to the level of regulation and the type of risks. I’ve put a few tables on that in the Annex.
Section 2: What Standard Risk Management Looks Like
Here’s a description of the core steps of the risk management process. Names vary between frameworks but the gist of it is contained here and usually shared across frameworks.
What’s the point of those pretty generic steps and why would it help AI safety?
(1) The definition of risk thresholds is key (a) to make commitments falsifiable and avoid goalpost moving, and (b) to keep the risk-generating organization accountable when other stakeholders incur risks due to its activity. If an activity is putting people’s lives at risk, it is important that they know how much, and for what benefits and goals.
2. The UC Berkeley Center for Long-Term Cybersecurity NIST-inspired risk management profile for General-purpose AI systems co-written with D. Hendrycks provides some thoughts on how to define those in Map 1.
(2) Risk identification through systematic methods is key to getting as close as possible to full coverage of risks. As we said earlier, in risk management, uncertainty is a failure, and a core way to substantially reduce it is to try to be as comprehensive as possible.
(3) Risk assessment through qualitative and quantitative means allows us to actually estimate the uncertainty we have. It is then key for prioritizing safety measures and deciding whether it’s reasonable to keep the project in its current form or to modify it.
The US Nuclear Regulatory Commission (NRC) specifies that reactor designs must meet a theoretical 1 in 10,000 year core damage frequency, but modern designs exceed this. US utility requirements are 1 in 100,000 years, the best currently operating plants are about 1 in one million and those likely to be built in the next decade are almost 1 in 10 million.
World Nuclear Association, 2022
b. Events are very fat-tailed and misunderstood, as was the case in nuclear safety in the 1970s. It has been done, and it is through the iterative practice of doing it that an industry can become more responsible and cautious. Reading a book review of Safe Enough?, a book on the history of quantitative risk assessment methods used in nuclear safety, there’s a sense of déjà-vu:
If nuclear plants were to malfunction at some measurable rate, the industry could use that data to anticipate its next failure. But if the plants don't fail, then it becomes very difficult to have a conversation about what the true failure rate is likely to be. Are the plants likely to fail once a decade? Once a century? Once a millennium? In the absence of shared data, scientists, industry, and the public were all free to believe what they wanted.
Astral Codex Ten, 2023, describing the genesis of probabilistic risk assessment in nuclear safety.
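To get a feel for what the core damage frequencies quoted above mean in practice, here is a minimal sketch that converts a per-reactor-year frequency into the probability of at least one event across a fleet over several decades. The four frequencies come from the World Nuclear Association quote; the fleet size (100 reactors), the time horizon (40 years), and the assumption that reactor-years are independent and identical are illustrative assumptions of this sketch, not claims from the sources.

```python
import math

# Core damage frequencies per reactor-year, taken from the quote above.
CORE_DAMAGE_FREQUENCY = {
    "NRC requirement": 1e-4,
    "US utility requirement": 1e-5,
    "best operating plants": 1e-6,
    "designs of the next decade": 1e-7,
}


def prob_at_least_one_event(annual_frequency: float, reactors: int, years: int) -> float:
    """Probability of at least one event over reactors * years reactor-years.

    Assumes (for illustration only) that every reactor-year is independent
    and has the same event probability `annual_frequency`.
    """
    if not 0.0 <= annual_frequency < 1.0:
        raise ValueError("annual_frequency must be in [0, 1)")
    reactor_years = reactors * years
    # 1 - (1 - f)^n, written with log1p/expm1 to stay accurate for tiny f.
    return -math.expm1(reactor_years * math.log1p(-annual_frequency))


if __name__ == "__main__":
    # Hypothetical fleet: 100 reactors operated for 40 years.
    for label, frequency in CORE_DAMAGE_FREQUENCY.items():
        p = prob_at_least_one_event(frequency, reactors=100, years=40)
        print(f"{label}: {p:.4%} chance of at least one core damage event")
```

The point of the exercise is that a stated per-year threshold can be turned into a falsifiable claim about what should be observed over time, which is exactly what an unquantified threshold does not allow.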
(4) Risk treatment is carried out in response to the risk assessment and must be pursued until you reach the risk thresholds defined. The space of interventions here is very large, larger than is usually assumed. Better understanding one’s system, narrowing down its domain of operation by making it less general, increasing the amount of oversight, improving safety culture: all those are part of a broad set of interventions that can be used to meet thresholds. There can be a loop between the treatment and the assessment if substantial changes are made to the system.
(5) Monitoring is the part that ensures that the risk assessment remains valid and nothing major has been left out. This is what behavioral model evaluations are most useful for, i.e. ensuring that you track the risks you’ve identified. Good evaluations would map to a pre-defined risk appetite (e.g. 1% chance of >1% deaths) and would cover all risks brought up through systematic risk identification.
(6) Reporting is the part that ensures that all relevant stakeholders are provided with the right information. For instance, those incurring risks from the activities should be provided with information on the amount of risk they’re exposed to.
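The interplay between steps (1), (3) and (4) can be sketched in a few lines of code. This is only an illustration under stated assumptions: the class and function names are hypothetical, the risk criterion takes the simple form "at most X annual chance of harm at or above severity Y", and each treatment is assumed to multiply the assessed likelihood by a fixed factor, which is a strong simplification of real risk treatment.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class RiskCriterion:
    """Hypothetical criterion: at most `max_annual_likelihood` of harm >= `min_severity`."""
    max_annual_likelihood: float
    min_severity: float  # illustrative scale: fraction of the population killed


@dataclass(frozen=True)
class Risk:
    name: str
    annual_likelihood: float  # output of the risk assessment (assumed given)
    severity: float           # same illustrative scale as the criterion


@dataclass(frozen=True)
class Treatment:
    name: str
    likelihood_factor: float  # assumed multiplicative effect on likelihood


def exceeds(risk: Risk, criterion: RiskCriterion) -> bool:
    """True if the risk is severe enough to matter and too likely to accept."""
    return (risk.severity >= criterion.min_severity
            and risk.annual_likelihood > criterion.max_annual_likelihood)


def treat_until_acceptable(
    risk: Risk, treatments: List[Treatment], criterion: RiskCriterion
) -> Tuple[Risk, List[str], bool]:
    """Apply treatments in order until the criterion is met or options run out."""
    applied: List[str] = []
    for treatment in treatments:
        if not exceeds(risk, criterion):
            break
        risk = replace(
            risk, annual_likelihood=risk.annual_likelihood * treatment.likelihood_factor
        )
        applied.append(treatment.name)
    return risk, applied, not exceeds(risk, criterion)


if __name__ == "__main__":
    # Illustrative numbers only; none of them is an actual estimate.
    criterion = RiskCriterion(max_annual_likelihood=0.001, min_severity=0.01)
    risk = Risk("hypothetical catastrophic misuse", annual_likelihood=0.05, severity=0.02)
    options = [
        Treatment("narrow the domain of operation", 0.1),
        Treatment("increase oversight", 0.1),
        Treatment("improve safety culture", 0.5),
    ]
    final, applied, acceptable = treat_until_acceptable(risk, options, criterion)
    print(final, applied, acceptable)
```

The important design choice is that the function returns an explicit `acceptable` flag: if the available treatments run out before the criterion is met, the honest output is "not acceptable", not a quietly relaxed threshold.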
Now that we’ve done a rapid overview of standard risk management and why it is relevant to AI safety, let’s talk about how RSPs compare against that.
Section 3: RSPs vs Standard Risk Management
Some underlying principles of RSPs should definitely be pursued. There are just better ways to pursue these principles, ways that already exist in risk management and happen to be what most other dangerous industries and fields use. To give two examples of such good underlying principles:
Below, I argue why RSPs are a bad implementation of some good risk management principles and why that makes the RSP framework inadequate to manage risks.
Direct Comparison
Let’s dive into a more specific comparison between the two approaches. The International Organization for Standardization (ISO) has developed two risk management standards that are relevant to AI safety, although not focused on it:
To be clear, those standards are not sufficient. They’re considered weak by most EU standardization actors and extremely weak by risk management experts from other industries like the medical device industry. A very significant amount of work will be needed to refine such frameworks for general-purpose AI systems (see a first iteration by T. Barrett here, and a table of how it maps to ISO/IEC 23894 here). But those provide basic steps and principles that, as we explained above, are central to adequate risk management.
In the table below, I start from the short version of ARC Evals’ RSP principles and try to match each one with the part of ISO 31000 that most closely corresponds to it. I then explain what’s missing from the RSP version. Note that:
The table version:
Limits: which specific observations about dangerous capabilities would indicate that it is (or strongly might be) unsafe to continue scaling?
Defining risk criteria: The organization should specify the amount and type of risk that it may or may not take, relative to objectives.
It should also define criteria to evaluate the significance of risk and to support decision-making processes.
Risk criteria should be aligned with the risk management framework and customized to the specific purpose and scope of the activity under consideration.
[...]
The criteria should be defined taking into consideration the organization’s obligations and the views of stakeholders.
[...]
To set risk criteria, the following should be considered:
— the nature and type of uncertainties that can affect outcomes and objectives (both tangible and intangible);
— how consequences (both positive and negative) and likelihood will be defined and measured;
— time-related factors;
— consistency in the use of measurements;
— how the level of risk is to be determined;
— how combinations and sequences of multiple risks will be taken into account;
— the organization’s capacity.
RSPs don’t argue why systems passing evals are safe. This is downstream of the absence of risk thresholds with a likelihood scale. For example, Anthropic’s RSP dismisses accidental risks as “speculative” and “unlikely” without much depth, without much understanding of their system, and without expressing what “unlikely” means.
On the other hand, the ISO standard asks the organization to define risk thresholds, and emphasizes the need to match risk management with organizational objectives (i.e. build human-level AI).
Protections: what aspects of current protective measures are necessary to contain catastrophic risks?
Evaluation: what are the procedures for promptly catching early warning signs of dangerous capability limits?
Risk analysis: The purpose of risk analysis is to comprehend the nature of risk and its characteristics including, where appropriate, the level of risk. Risk analysis involves a detailed consideration of uncertainties, risk sources, consequences, likelihood, events, scenarios, controls and their effectiveness.
Risk evaluation: The purpose of risk evaluation is to support decisions. Risk evaluation involves comparing the results of the risk analysis with the established risk criteria to determine where additional action is required. This can lead to a decision to:
— do nothing further;
— consider risk treatment options;
— undertake further analysis to better understand the risk;
— maintain existing controls;
— reconsider objectives.
ISO proposes a much more comprehensive procedure than RSPs, which don’t really analyze risk levels or have a systematic risk identification procedure.
The direct consequence is that RSPs are likely to lead to high levels of risks, without noticing.
For instance, RSPs don’t seem to cover capabilities interaction as a major source of risk.
Risk treatment plans: Risk treatment options are not necessarily mutually exclusive or appropriate in all circumstances. Options for treating risk may involve one or more of the following:
— avoiding the risk by deciding not to start or continue with the activity that gives rise to the risk;
— taking or increasing the risk in order to pursue an opportunity;
— removing the risk source;
— changing the likelihood;
— changing the consequences;
— sharing the risk (e.g. through contracts, buying insurance);
— retaining the risk by informed decision
Treatment plans should be integrated into the management plans and processes of the organization, in consultation with appropriate stakeholders.
The information provided in the treatment plan should include:
— the rationale for selection of the treatment options, including the expected benefits to be gained;
— those who are accountable and responsible for approving and implementing the plan;
— the proposed actions;
— the resources required, including contingencies;
— the performance measures;
— the constraints;
— the required reporting and monitoring;
— when actions are expected to be undertaken and completed
ISO, thanks to the definition of a risk threshold, ensures that risk mitigation measures bring risks below acceptable levels. The lack of risk thresholds for RSPs makes the risk mitigation measures ungrounded.
Example: the risk mitigation measures that Anthropic defines for ASL-3 (i.e. for systems close to catastrophically dangerous) imply a significant chance that the model weights get stolen by Russia or China (I don’t know any RSP person who denies that). What are the risks downstream of that? The hope is that those countries keep the weights secure and don’t cause too much damage with them.
Monitoring and review: The purpose of monitoring and review is to assure and improve the quality and effectiveness of process design, implementation and outcomes. Ongoing monitoring and periodic review of the risk management process and its outcomes should be a planned part of the risk management process, with responsibilities clearly defined. [...] The results of monitoring and review should be incorporated throughout the organization’s performance management, measurement and reporting activities.
Recording and reporting: The risk management process and its outcomes should be documented and reported through appropriate mechanisms. Recording and reporting aims to:
— communicate risk management activities and outcomes across the organization;
— provide information for decision-making;
— improve risk management activities;
— assist interaction with stakeholders, including those with responsibility and accountability for risk management activities.
Those parts have similar components.
But ISO encourages reporting the results of risk management to those that are affected by the risks, which seems like a bare minimum for catastrophic risks.
Anthropic’s RSP proposes to do so after deployment, which is a good accountability start, but still happens once a lot of the catastrophic risk has been taken.
Prioritized Risk Management Shortcomings of RSPs
Here’s a list of the biggest direct risk management failures of RSPs:
1. Using underspecified definitions of risk thresholds and not quantifying the risk. RSPs don't define risk thresholds in terms of likelihood. Instead, they focus straight away on symptoms of risk (a capability that an evaluation tests for is only one way a risk could materialize) rather than on the risk itself (the model helping in any possible way to build bioweapons). This makes it hard to verify whether safety requirements have been met and to argue about whether the thresholds are reasonable. Why is it an issue?
By describing neither quantitatively nor qualitatively why a capability is risky in terms of risk criteria (e.g. 0.1% chance of killing >1% of humans), an RSP skips the most important step: demonstrating that below this threshold, things are safe and acceptable. For instance, in the example above, why is “the majority of people with any life sciences degree” relevant? Would it be fine if only 10% of this population was now able to create a bioweapon? Maybe, maybe not. But without clear criteria, you can’t tell.
2. Claiming “responsible scaling” without including a process to make the assessment comprehensive. When you look at nuclear accidents, what’s striking is how unexpected the failures are. Fukushima is an example where everything went wrong at the same time. Chernobyl is an example where engineers didn’t think that the accident that happened was possible (someone claims that they were so surprised that engineers actually ran another real-world test of the failure that happened at Chernobyl, because they doubted so much that it could happen).
Without a more comprehensive process to identify risks and compare their likelihood and severity against pre-defined risk thresholds, there’s very little chance that RSPs will be enough. When I asked some forecasters and AI safety researchers around me, the estimates of the annual probability of extinction caused by an ASL-3 system (defined in Anthropic RSPs) were several times above 1%, up to 5% conditioning on our current ability to measure capabilities (and not an idealized world where we know very well how to measure those).
3. Including the white knight clause that kills commitments.
One of the proposals that struck me the most when reading RSPs is the insertion of what deserves to be called the white knight clause.
I believe that there are companies that are safer than others. But that’s not the right question. The right question is: is there any company that would consider itself a bad guy? And the answer is: no. OpenAI, Anthropic and DeepMind would all argue for the importance of being at the frontier to solve alignment. Meta and Mistral would argue that it’s key to democratize AI to prevent power centralization. And so on and so forth.
This clause is effectively killing commitments. I’m glad that Anthropic included only a weakened version of it in its own RSP but I’m very concerned that ARC is pitching it as an option. It’s not the role of a company to decide whether it’s fine or not to increase catastrophic risks for society as a whole.
Section 4: Why RSPs Are Misleading and Overselling
Misleading
Beyond the designation of misalignment risks as “speculative” in Anthropic’s RSP and a three-line argument for why they are unlikely among next-generation systems, there are several extremely misleading aspects of RSPs:
If you want to read more on that, you can read that.
Overselling, underdelivering
The RSP framework has some nice characteristics. But first, these are all already covered, in more detail, by existing risk assessment frameworks that no AI lab has implemented. And second, the coexistence of ARC's RSP framework with the labs' specific RSP implementations leaves slack for weak commitments within a framework that would in theory allow ambitious ones. It leads to many arguments of the form:
I also feel like it leads to many disagreements that all hinge on: do we expect labs to implement ambitious RSPs?
And my answer is: given their track record, no. Not without government intervention. Which brings us to the question: “What’s the effect of RSPs on policy, and would it be good if governments implemented them?” My answer to that is: an extremely ambitious version, yes; the misleading version, no. No, mostly because of the short time we have before we see heightened levels of risk, which gives us very little time to update regulations. RSPs rely on being able to do so as a core assumption, without providing evidence that it is realistic.
I expect labs to push hard for the misleading version, on the basis that pausing is unrealistic and would be bad for innovation or for the international race. Policymakers will have a hard time distinguishing the risk levels of the two versions because the difference hinges on details and the risks aren’t quantified in RSPs. They are likely to buy the misleading version because it essentially sells the idea that there’s no trade-off between capabilities and safety. That would effectively enforce a trajectory with unprecedented levels of catastrophic risk.
Section 5: Are RSPs Hopeless?
Well, yes and no.
In the short run, given that RSPs seem to have started being pushed at the UK Summit and in various other places, I’ll discuss what changes could make RSPs beneficial without locking a bad framework into regulation.
How to Move Forward?
Mitigating nefarious effects:
Checking whether RSPs manage risks adequately:
At a risk management level, if one wanted to demonstrate that RSPs like Anthropic’s are actually doing what they claim to do (i.e. “require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk”), a simple way to do so would be to run a risk assessment on ASL-3 systems with a set of forecasters, risk management experts and AI risk experts who are representative of views on AI risks and who have been selected by an independent body free of any conflict of interest.
I think that a solid baseline would be to predict the chances of various intermediary and final outcomes related to the risks of such systems:
It might not be too far from what Anthropic seems to be willing to do internally, but doing it with a publicly available methodology, and with staff selected without self-selection or conflicts of interest, makes a big difference. Answers to questions 1) and 2) could raise risks, so the output should be communicated to a few relevant actors but could potentially be kept private.
If anyone has the will but doesn’t have the time or resources to do it, I’m working with some forecasters and AI experts that could probably make it happen. Insider info would be helpful but mostly what would be needed from the organization is some clarifications on certain points to correctly assess the capabilities of the system and some info about organizational procedures.
Acknowledgments
I want to thank Eli Lifland, Henry Papadatos and my other NAIR colleague, Olivia Jimenez, Akash Wasil, Mikhail Samin, Jack Clark, and other anonymous reviewers for their feedback and comments. Their help doesn’t mean that they endorse the piece. All mistakes are mine.
Annex
Comparative Analysis of Standards
This (cropped) table shows the process of various standards for the 3 steps of risk management. As you can see, there are some differences but every standard seems to follow a similar structure.
From (Raz & Hillson, 2005)
Here is a comparable table for the last two parts of risk management.