I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under an unknown timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we'll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements (those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”) will be replaced by the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.
And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. When people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals (in a domain we don’t yet actually understand), the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.
I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea if these measures in fact meaningfully relate much to the thing we actually care about.
There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”
I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, we will face similar corrosion.
Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances.
But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
And I worry that in the absence of such a story—where the true plan is something closer to “fill in the blanks as we go”—that a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” That the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can't do are concrete safety evals, which I'm very clear about not expecting us to have right now.
And I'm not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn't work as a way to make things go well: but that's how any story of success has to work right now given that we don't currently know how to make things go well. And actually telling success stories is an important thing to do!
If you have an alternative success story that doesn't involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
This post is not a responsible scaling plan. I feel like your whole comment seems to be weirdly conflating stuff that I'm saying with stuff in the Anthropic RSP. This post is about my thoughts on RSPs in general—which do not necessarily represent Anthropic's thoughts on anything—and the post isn't really about Anthropic's RSP at all.
Regardless, I'm happy to give my take. I don't think that anybody currently has a convincing story to tell about how to get a good understanding of AI systems, but you can read my thoughts on how we might get to one here.
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
It sounds like you're disagreeing with me, but everything you're saying here is consistent with everything I said. The whole point of my proposal is to understand what evals we can trust and when we can trust them, set up eval-gated scaling in the cases where we can do concrete evals, and be very explicit about the cases where we can't.
But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
When assumptions are clear, it's not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and their criticism becomes useful for all involved, in particular making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.
Fully agree with almost all of this. Well said.
One nitpick of potentially world-ending importance:
In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems
Giving us high confidence is not the bar - we also need to be correct in having that confidence.
In particular, we'd need to be asking: "How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we're confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?..."
I assume you'd roll that into assessing your confidence - but I think it's important to be explicit about this.
Based on your comment, I'd be interested in your take on:
Not having thought about it for too long, I'm inclined to favor (2).
I'm not at all sure how realistic it is from a unilateral point of view - but I think it'd be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don't strongly expect to be able to meet ahead of time, that's useful to know: it amounts to "RSPs are a means to avoid pausing".
I imagine most labs wouldn't commit to [we only get to run this training process if Eliezer thinks it's good for global safety], but I'm not at all sure what they would commit to.
At the least, it strikes me that this is an obvious approach that should be considered - and that a company full of abstract thinkers who've concluded "There's no direct, concrete, ML-based thing we can commit to here, so we're out of options" don't appear to be trying tremendously hard.
Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:
What would a good RSP look like?
What do RSPs actually look like right now?
Important note: I think several of these limitations are inherent to the current gameboard. Like, I'm not saying "I think it's a bad move for Anthropic to admit that they'll have to break their RSP if some Bad Actor is about to cause a catastrophe." That seems like the right call. I'm also not saying that dangerous capability evals are bad-- I think it's a good bet for some people to be developing them.
Why I'm disappointed with current comms around RSPs
Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don't expect policymakers that engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + "we'll figure things out later"ness, etc.
On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it was like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime." Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it's too soon to worry about catastrophes whatsoever.
(There's also an underlying thing here where I'm like "the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers-- the odds of meaningful policy getting implemented are not independent of our actions. The more that groups like Anthropic and ARC claim 'oh that's not realistic', the less realistic those proposals are." I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)
I'll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I'm not yet convinced this is the case, and I really hope it's not. But I'd really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say "hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we'll brand it as this nice catchy thing called Responsible Scaling."
Strongly agree with almost all of this.
My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding - and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.
Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended].
We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can't be implicit, since it's a central way that we die.
If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately.
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.
If we're thinking of better mechanisms to achieve a pause, I'd add:
I'm sure there are other sensible additions, but that'd be a decent start.
Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none) This can't be implicit, since it's a central way that we die. If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately.
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.
Yeah, I agree—that's why I'm specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you'd know if there were anything wrong that your other evals might miss.
Define ASLs or similar now rather than waiting until we're much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan's own "We need to make sure that, once we have solid understanding-based evals, governments make them mandatory" only reinforces this impression. Whether we have them is irrelevant to the question of whether they're necessary.
See the bottom of this comment: my main objection here is that if we were to try to define it now, we'd end up defining something easily game-able because we don't yet have metrics for understanding that aren't easily game-able. So if we want something that will actually be robust, we have to wait until we know what that something might be—and ideally be very explicit that we don't yet know what we could put there.
I think there's a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway - with government/regulator backing, since they're doing everything practical, everything reasonable.... Assuming this won't happen seems dangerously naive.
I definitely agree that this is a serious concern! That's part of why I'm writing this post: I want more public scrutiny and pressure on RSPs and their implementation to try to prevent this sort of thing.
Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
IANAL, but I think that this is currently impossible due to anti-trust regulations. The White House would need to enact a safe harbor policy for anti-trust considerations in the context of AI safety to make this possible.
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have a safety buffer, assume continuity, etc.) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.
I would have preferred it if they included tentative proposals for ASL-4 evaluations and what their current best safety plan/argument for ASL-4 looks like (using just current science, no magic), and then explained that this plan wouldn't be sufficient for reasonable amounts of safety (insofar as this is what they think).
Right now, they just have a bulleted list for ASL-4 countermeasures, but this is the main interesting thing to me. (I'm not really sold on substantial risk from systems which aren't capable of carrying out that harm mostly autonomously, so I don't think ASL-3 is actually important except as setup.)
It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs.
I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.
On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe.
I want to be very clear that I've been really happy to see all the people pushing for strong asks here. I think it's a really valuable thing to be doing, and what I'm trying to do here is not stop that but help it focus on more concrete asks.
I would be more sympathetic to the RSP approach if it was like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime."
To be clear, I definitely agree with this. My position is not "RSPs are all we need", "pauses are bad", "pause advocacy is bad", etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. "RSPs are pauses done right."
To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”
Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.
I mean, to be clear, I am saying something like "RSPs are the most effective way to implement a pause that I know of." The thing I'm not saying is just that "RSPs are the only policy thing we should be doing."
Thanks for writing this up.
I agree that the issue is important, though I'm skeptical of RSPs so far, since we have one example and it seems inadequate - to the extent that I'm positively disposed, it's almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP. (of course I may be missing something in some cases)
Going only by the language in the blog post and the policy, I'd conclude that they're an excuse to continue scaling while being respectably cautious (though not adequately cautious). Granted, I'm not the main target audience - but I worry about the impression the current wording creates.
I hope that RSPs can be beneficial - but I think much more emphasis should be on the need for positive demonstration of safety properties, that this is not currently possible, and that it may take many years for that to change. (mentioned, but not emphasized in the Anthropic policy - and without any "many years" or similar)
It's hard to summarize my concerns, so apologies if the following ends up somewhat redundant.
I'll focus on your post first, and the RSP blog/policy doc after that.
Governments are very good at solving “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.
There's an obvious thing to do here. It's far from obvious that it's a solution.
One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.
RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks.
They kick in when we detect that models have capabilities that we realize are relevant to downstream risks.
Both detection and realization can fail.
My main worry here isn't that we'll miss catastrophic capabilities in the near term (though it's possible). Rather it's the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there's a decent chance some of them fail before we expect them to.
Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety.
This could use greater emphasis in the RSP blog/doc.
Ideally, we should get the governmental RSPs to be even stronger!
Yes!
We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.
We need governments to make them mandatory before they're necessary, not once we have them (NB, not [before it's clear they're necessary] - it might not be clear). I don't expect us to have sufficiently accurate understanding-based evals before they're necessary. (though it'd be lovely)
Pushing to require state-of-the-art safety techniques is the wrong emphasis.
We need to push for adequate safety techniques. If state-of-the-art techniques aren't yet adequate, then labs need to stop.
Thoughts on the blog/doc themselves. Something of a laundry list, but hopefully makes clear where I'm coming from:
I'm mostly not going to comment on Anthropic's RSP right now, since I don't really want this post to become about Anthropic's RSP in particular. I'm happy to talk in more detail about Anthropic's RSP maybe in a separate top-level post dedicated to it, but I'd prefer to keep the discussion here focused on RSPs in general.
One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.
I definitely share this worry. But that's part of why I'm writing this post! Because I think it is possible for us to get good RSPs from all the labs and governments, but it'll take good policy and advocacy work to make that happen.
My main worry here isn't that we'll miss catastrophic capabilities in the near term (though it's possible). Rather it's the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there's a decent chance some of them fail before we expect them to.
I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. Though it'll require those capabilities evaluations to actually be done effectively, I think we at least do know how to do effective capabilities evaluations—it's mostly a solved problem in theory and just requires good implementation.
We need governments to make them mandatory before they're necessary, not once we have them (NB, not [before it's clear they're necessary] - it might not be clear). I don't expect us to have sufficiently accurate understanding-based evals before they're necessary. (though it'd be lovely)
Pushing to require state-of-the-art safety techniques is the wrong emphasis. We need to push for adequate safety techniques. If state-of-the-art techniques aren't yet adequate, then labs need to stop.
The distinction between an alignment technique and an alignment evaluation is very important here: I very much am trying to push for adequate safety techniques rather than simply state-of-the-art safety techniques, and the way I'm proposing we do that is via evaluations that check whether we understand our models. What I think probably needs to happen before you can put understanding-based evals in an RSP is not that we have to solve mechanistic interpretability—it's that we have to solve understanding-based evals. That is, we need to know how to evaluate whether mechanistic interpretability has been solved or not. My concern with trying to put something like that into an RSP right now is that it'll end up evaluating the wrong thing: since we don't yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
I think we at least do know how to do effective capabilities evaluations
This seems an overstatement to me:
Where the main risk is misuse, we'd need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later. (including the most artful AutoGPT 3.0 setups etc)
It seems reasonable to me to claim that "we know how to do effective [capabilities given sota elicitation methods] evaluations", but that doesn't answer the right question.
Once the main risk isn't misuse, then we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn't realize we were relying upon]). Obviously we don't expect these to break yet, but I'd guess that we'll be surprised the first time they do break.
I expect your guess on when they will break to be more accurate than mine - but that [I don't have much of a clue, so I'm advocating extreme caution] may be the more reasonable policy.
My concern with trying to put something like [understanding-based evals] into an RSP right now is that it'll end up evaluating the wrong thing: since we don't yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
We don't know how to put the concrete eval in the RSP, but we can certainly require that an eval for understanding passes. We can write in the RSP what the test would be intended to achieve, and conditions for the approval of the eval. E.g. [if at least two of David Krueger, Wei Dai and Abram Demski agree that this meets the bar for this category of understanding eval, then it does] (or whatever other criteria you might want).
Again, only putting targets that are well understood concretely in the RSP seems like a predictable way to fail to address poorly understood problems.
Either the RSP needs to cover the poorly understood problems too - perhaps with a [you can't pass this check without first coming up with a test and getting it approved] condition, or it needs a "THIS RSP IS INADEQUATE TO ENSURE SAFETY" warning in huge red letters on every page. (if the Anthropic RSP communicates this at all, it's not emphasized nearly enough)
The point is that advocating for a “pause” is nebulous and non-actionable
Setting aside the potential advantages of RSPs, this strikes me as a pretty weird thing to say. I understand the term "pause" in this context to mean that you stop building cutting-edge AI models, either voluntarily or due to a government mandate. In contrast, "RSP" says you eventually do that but you gate it on certain model sizes and test results and unpause it under other test results. This strikes me as a bit less nebulous, but only a bit.
I'm not quite sure what's going on here - it's possible that the term "pause" has gotten diluted? Seems unfortunate if so.
I think the problem is that nobody really has an idea for what the resumption condition should be for a pause, and nobody's willing to defend the (actually actionable) six-month pause proposal.
the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!
my own worry with RSPs is that they bake in (and legitimise) the assumptions that a) near term (eval-less) scaling poses trivial xrisk, and b) there is a substantial period during which models trigger evals but are existentially safe. you must have thought about them, so i’m curious what you think.
that said, thank you for the post, it’s a very valuable discussion to have! upvoted.
the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!
Sure, but I guess I would say that we're back to nebulous territory then—how much longer than six months? When if ever does the pause end?
a) near term (eval-less) scaling poses trivial xrisk
I agree that this is mostly baked in, but I think I'm pretty happy to accept it. I'd be very surprised if there was substantial x-risk from the next model generation.
But also I would argue that, if the next generation of models do pose an x-risk, we've mostly already lost—we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place. So instead I would argue that we should be planning a bit further ahead than that, and trying to get something actually workable in place further out—which should also be easier to do because of the dynamic where organizations are more willing to sacrifice potential future value than current realized value.
b) there is a substantial period during which models trigger evals but are existentially safe
Yeah, I agree that this is tricky. Theoretically, since we can set the eval bar at any capability level, there should exist capability levels that you can eval for and that are safe but scaling beyond them is not. The problem, of course, is whether we can effectively identify the right capabilities levels to evaluate in advance. The fact that different capabilities are highly correlated with each other makes this easier in some ways—lots of different early warning signs will all be correlated—but harder in other ways—the dangerous capabilities will also be correlated, so they could all come at you at once.
Probably the most important intervention here is to keep applying your evals while you're training your next model generation, so they trigger as soon as possible. As long as there's some continuity in capabilities, that should get you pretty far. Another thing you can do is put strict limits on how much labs are allowed to scale their next model generation relative to the models that have been definitively evaluated to be safe. And furthermore, my sense is that at least in the current scaling paradigm, the capabilities of the next model generation tend to be relatively predictable given the current model generation.
So overall, my sense is that takeoff only has to be marginally continuous for this to work—if it's extremely abrupt, more of a classic FOOM scenario, then you might have problems, but I think that's pretty unlikely.
that said, thank you for the post, it’s a very valuable discussion to have! upvoted.
Thanks! Happy to chat about this more also offline.
Sure, but I guess I would say that we're back to nebulous territory then—how much longer than six months? When if ever does the pause end?
i agree that, if hashed out, the end criteria may very well resemble RSPs. still, i would strongly advocate for scaling moratorium until widely (internationally) acceptable RSPs are put in place.
I'd be very surprised if there was substantial x-risk from the next model generation.
i share the intuition that the current and next LLM generations are unlikely to be an xrisk. however, i don't trust my (or anyone else's) intuitions strongly enough to say that there's a less than 1% xrisk per 10x scaling of compute. in expectation, that's killing 80M existing people -- people who are unaware that this is happening to them right now.
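The 80M figure is an expected-value calculation. A minimal sketch, assuming a world population of roughly 8 billion (the comment does not state the number it used) and taking the 1% risk per 10x scaling step at face value:

```python
# Expected deaths = probability of extinction x number of people alive.
# WORLD_POPULATION is an assumption (roughly 8 billion); it is not stated in the comment.
WORLD_POPULATION = 8_000_000_000


def expected_deaths(p_extinction: float, population: int = WORLD_POPULATION) -> float:
    """Return the expected number of deaths for a given extinction probability."""
    if not 0.0 <= p_extinction <= 1.0:
        raise ValueError("p_extinction must be a probability in [0, 1]")
    return p_extinction * population


# 1% extinction risk for a single 10x scaling step
print(f"{expected_deaths(0.01):,.0f}")
```

Under those assumptions, a 1% risk corresponds to about 80 million people in expectation, which matches the figure quoted.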
Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
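As a rough illustration of the "whichever comes sooner" rule described above, here is a minimal sketch. The function and constant names are hypothetical, "3 months" is approximated as 91 days, and "effective compute" is treated as a single number; none of this is taken from the RSP text itself.

```python
from datetime import date, timedelta

# Both thresholds come from the summary above; the 91-day approximation of
# "3 months" is an assumption made for this sketch.
COMPUTE_FACTOR = 4.0
MAX_INTERVAL = timedelta(days=91)


def eval_due(last_eval_compute: float, current_compute: float,
             last_eval_date: date, today: date) -> bool:
    """Return True if either trigger has fired since the last evaluation."""
    compute_trigger = current_compute >= COMPUTE_FACTOR * last_eval_compute
    time_trigger = (today - last_eval_date) >= MAX_INTERVAL
    return compute_trigger or time_trigger


# A run at 3x the last evaluated compute, one month after the last eval:
print(eval_due(1.0, 3.0, date(2023, 1, 1), date(2023, 2, 1)))
# The same run once 91 days have passed:
print(eval_due(1.0, 3.0, date(2023, 1, 1), date(2023, 4, 2)))
```

Note that under this rule a model at 3x the last evaluated compute is not re-evaluated until the time trigger fires.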
Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT-4 I'd be very keen to know what it could do!
if the next generation of models do pose an x-risk, we've mostly already lost—we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place
Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not? I'm asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very very few -- in my understanding, only 3 or 4 labs have any chance of releasing a "next generation" model for as much as two years from now, others won't be able to achieve this level of capability even if they tried), but in the post you advocate for the opposite--for voluntary actions taken by the labs, and that regulation can follow.
I talk about that here:
And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.
I mean, whether something's realistic and whether something's actionable are two different things (both separate from whether something's nebulous) - even if it's hard to make a pause happen, I have a decent guess about what I'd want to do to up those odds: protest, write to my congress-person, etc.
As to the realism, I think it's more realistic than I think you think it is. My impression of AI Impacts' technological temptation work is that governments are totally willing to enact policies that impoverish their citizens without requiring a rigorous CBA. Early wins do seem like an important consideration, but you can imagine trying to get some early wins by e.g. banning AI from being used in certain domains, banning people from developing advanced AI without doing X, Y, or Z.
I mean, whether something's realistic and whether something's actionable are two different things (both separate from whether something's nebulous) - even if it's hard to make a pause happen, I have a decent guess about what I'd want to do to up those odds: protest, write to my congress-person, etc.
Sure—I just think it'd be better to spend that energy advocating for good RSPs instead.
To be clear, the whole point of my post is that I am in favor of pausing/stopping AI development—I just think the best way to do that is via RSPs.
On RSPs vs pauses, my basic take is that hardcore pauses are better than RSPs and RSPs are considerably better than weak pauses.
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
I think good RSPs are worse than this, but probably much better than just having a lab pause scaling.
It's possible that various actors should explicitly state that hardcore pauses would be better (insofar as they think so).
- A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. ...
- A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. ...
I propose changing the term for this second type of evaluation to "propensity evaluations". I think this is a better term as it directly fits the definition you provided: "a model evaluation designed to test under what circumstances a model would actually try to do some task".
Moreover, I think that both capabilities evaluations and propensity evaluations can be types of safety evaluations. Therefore, it's misleading to label only one of them as "safety evaluations". For example, we could construct a compelling safety argument for current models using solely capability evaluations.
Either can be sufficient for safety: a strong argument based on capabilities (we've conclusively determined that the AI is too dumb to do anything very dangerous) or a strong argument based on propensity (we have a theoretically robust and empirically validated case that our training process will result in an AI that never attempts to do anything harmful).
Alternatively, a moderately strong argument based on capabilities combined with a moderately strong argument based on propensity can be sufficient, provided that the evidence is sufficiently independent.
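A toy calculation shows why the independence condition matters. The probabilities below are made-up illustrative numbers, not estimates from this thread, and both function names are hypothetical. If each argument fails with some probability and the failures are independent, the chance that both fail is the product; if the failures are maximally correlated, combining the arguments buys nothing beyond the stronger one.

```python
def joint_failure_independent(p_capability_fails: float, p_propensity_fails: float) -> float:
    """Probability that both safety arguments fail, assuming independent failures."""
    return p_capability_fails * p_propensity_fails


def joint_failure_worst_case(p_capability_fails: float, p_propensity_fails: float) -> float:
    """Upper bound on the joint failure probability when failures are maximally correlated."""
    return min(p_capability_fails, p_propensity_fails)


# Illustrative only: two "moderately strong" arguments that each fail 20% of the time.
p_cap, p_prop = 0.2, 0.2
print(joint_failure_independent(p_cap, p_prop))  # product of the two failure probabilities
print(joint_failure_worst_case(p_cap, p_prop))   # no better than the stronger argument alone
```

Real evidence will usually sit somewhere between these two cases, which is why the degree of independence has to be argued for rather than assumed.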
I sometimes refer to capability based arguments as control arguments.
Then, we can name two lines of defense:
It's possible to develop techniques which advance either the control line of defense or the propensity line of defense. Of course, many research directions are useful for both.
I expect that in practice, we're not very close to being able to make good propensity arguments (for instance, theory and interpretability both seem to me to be unlikely to establish this in the next several doublings of total R&D effort). However, we're not that far off from building quite powerful control based countermeasures. I think these control based countermeasures could scale to establishing barely acceptable safety arguments (e.g. 1-5% doom per year) for around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x.
Further, as Evan noted, control style arguments seem much more straightforward to evaluate (though various complications can result from exploration and gradient hacking). So, I think the feedback loops on control look quite promising.
(Redwood Research, where I work, is currently pursuing several control style projects and we should be putting out various things on this soon.)
An important thing to emphasize with control arguments is that it seems quite unlikely that control arguments can be made workable for very superhuman models. (At least for the notion of "control arguments" which can be readily assessed with non-insane capability evaluations.)
[it turns out I have many questions - please consider this a pointer to the kind of information I'd find useful, rather than a request to answer them all!]
around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x
Can you point to what makes you think this is likely? (or why it seems the most promising approach)
In particular, I worry when people think much in terms of "doublings of total R&D effort" given that I'd expect AI assistance progress multipliers to vary hugely - with the lowest multipliers correlating strongly with the most important research directions.
To me it seems that the kind of alignment research that's plausible to speed up 30x is the kind that we can already do without much trouble - narrowly patching various problems in ways we wouldn't expect to generalize to significantly superhuman systems.
That and generating a ton of empirical evidence quickly - which is nice, but I expect the limiting factor is figuring out what questions to ask.
It doesn't seem plausible that we get a nice inductive pattern where each set of patches allows a little more capability safely, which in turn allows more patches.... I'm not clear on when this would fail, but pretty clear that it would fail.
What we'd seem to need is a large speedup on more potentially-sufficiently-general-if-they-work approaches - e.g. MIRI/ARC-theory/JW stuff.
30x speedup on this seems highly unlikely. (I guess you'd agree?)
Even if it were possible to make a month of progress in one day, it doesn't seem possible to integrate understanding of that work in a day (if the AI is doing the high-level integration and direction-setting, we seem to be out of the [control measures will keep this safe] regime).
I also note that empirically, theoretical teams don't tend to add a load of very smart humans. I'm sure that Paul could expand a lot more quickly if he thought that was helpful. Likewise MIRI.
Are they making a serious error here, or are the limitations of very-smart-human assistants not going to apply to AI assistants? (granted, I expect AI assistants aren't going to have personality clashes etc)
Are you expecting sufficiently general alignment solutions to come out of work that doesn't require deep integrated understanding? Can you point to current work (or properties of current work) that would be examples? Would you guess the things we could radically speed up are sufficient for a solution, or just useful? If the latter, how much painfully-slow-by-comparison work seems likely to be needed?
Or would the hope be that for more theoretical work there's a significant speedup, even if it's not 30x? What seems plausible to you here? 5x? Why is this currently not being achieved through human scaling? Is 5x enough to compensate for the risks? What multiplier would be just sufficient to compensate?
What would you consider early evidence of the expected multiplier for theoretical work?
E.g. should we be getting a 3x speedup with current AIs on open, underspecified problems that seem somewhat easier than alignment? Are we? (on anything - not only alignment-relevant things)
My immediate reaction to this kind of approach is that it feels like wishful thinking without much evidence. However, I'm aware that I do aesthetically prefer theoretically motivated approaches - so I don't entirely trust my reaction.
I can buy being even more pessimistic about the theoretical approaches than getting lucky with software based R&D - but to me this suggests that coordination around a stop might be the best bet.
I'm not going to respond to everything you're saying here right now. It's pretty likely I won't end up responding to everything you're saying at any point; so apologies for that.
Here are some key claims I want to make:
Here are some other less important claims which feed into my overall takes:
This is clarifying, thanks.
A few thoughts:
- "So coordination to do better than this would be great".
- I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.
As far as the ideal, I happened to write something about it in another comment yesterday. Excerpt:
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
As far as expedient, something like:
Both of these are unlikely to perfectly succeed, but seem like good directions to push on.
I think pushing for AI lab scaling pauses is probably net negative right now, but I don't feel very strongly either way (it mostly just feels not that leveraged overall). I think slowing down hardware progress seems clearly good if we could do it at low cost, but seems super intractable.
Thanks, this seems very reasonable. I'd missed your other comment.
(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))
(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))
Corresponding comment text:
This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn't need to be re-directed frequently [EDIT: the preceding was poorly worded - I meant that if prior to the availability of AI assistants this were true, it'd allow a lot of speedup as the AIs take over this work; otherwise it's less clearly so helpful].
I think I disagree with what you meant, but not that strongly. It's not that important, so I don't really want to get into it. Basically, I don't think that "well-defined" is that important (not obviously required for some ability to judge the finished work) and I don't think "re-direction frequency" is the right way to think about it.
if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end [...] just advocate for that condition being baked into RSPs
Resume when the scientific community has a much clearer idea about how to build AGIs that don't pose a large extinction risk for humanity. This consideration can't be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.
RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.
I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.
Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is how I initially read it, but not sure that it's the intended meaning.
More thoughts that are only semi-relevant if I misunderstood below.
------------------------------------------------------------------------------------------------------------------------------------------------------------------
If I'm understanding the assumption correctly, the idea that the capabilities benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1] given (a) irreversibility of open-source proliferation (b) lack of effective guardrails against misuse of open-source LLMs.
Perhaps the implicit argument is that labs should assume their models will be leaked when doing risk evaluations unless they have insanely good infosec so they should effectively treat their models as open-source. Anthropic does say in their RSP:
To account for the possibility of model theft and subsequent fine-tuning, ASL-3 is intended to characterize the model’s underlying knowledge and abilities
This makes some sense to me, but looking at the definition of ASL-3 as if the model is effectively open-sourced:
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things
I understand that limiting to only 1% of the model costs and only existing post-training techniques makes it more tractable to measure the risk, but it strikes me as far from a conservative bound if we are assuming the model might be stolen and/or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training and/or more years going by allowing improved post-training enhancements.
Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we're back to why the open-source capability benchmarks should be the same as closed-source.
This is not to take a stance on the effect of open-sourcing LLMs at current capabilities levels, but rather being surprised that the capability threshold for when open-source is too dangerous would be the same as closed-source.
If the model is smart enough, you die before writing the evals report; if it’s just kinda smart, you don’t find it to be too intelligent and die after launching your scalable oversight system that, as a whole, is smarter than individual models.
An international moratorium on all training runs that could stumble on something that might kill everyone is much more robust than regulations around evaluated capabilities of already trained models
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
If you’re not familiar with the concept of an RSP, the central idea of RSPs is evaluation-gated scaling—that is, AI labs can only scale models depending on some set of evaluations that determine whether additional scaling is appropriate. ARC’s definition is:
How do we make it to a state where AI goes well?
I want to start by taking a step back and laying out a concrete plan for how we get from where we are right now to a policy regime that is sufficient to prevent AI existential risk.
The most important background here is my “When can we trust model evaluations?” post, since knowing the answer to when we can trust evaluations is extremely important for setting up any sort of evaluation-gated scaling. The TL;DR there is that it depends heavily on the type of evaluation:
With that as background, here’s a broad picture of how things could go well via RSPs (note that everything here is just one particular story of success, not necessarily the only story of success we should pursue or a story that I expect to actually happen by default in the real world):
Reasons to like RSPs
Obviously, the above is only one particular story for how things go well, but I think it’s a pretty solid one. Here are some reasons to like it:
How do RSPs relate to pauses and pause advocacy?
In my opinion, RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs! And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.
Furthermore, I want to be very clear that I don’t mean “stop pestering governments and focus on labs instead”—we should absolutely try to get governments to adopt RSP-like policies and get as strong conditions as possible into any RSP-like policies that they adopt. What separates pause advocacy from RSP advocacy isn’t who it’s targeted at, but the concreteness of the policy recommendations that it’s advocating for. The point is that advocating for a “pause” is nebulous and non-actionable—“enact an RSP” is concrete and actionable. Advocating for labs and governments to enact as good RSPs as possible is a much more effective way to actually produce concrete change than highly nebulous pause advocacy.
Furthermore, RSP advocacy is going to be really important! I’m very worried that we could fail at any of the steps above, and advocacy could help substantially. In particular: