I work at Redwood Research.
if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan
I'm basically imagining being able to use controlled AIs which aren't qualitatively smarter than humans for whatever R&D purposes we want. (Though not applications like (e.g.) using smart AIs to pilot drone armies live.) Some of these applications will be riskier than others, but I think this can be done while managing risk to a moderate degree.
Bootstrapping to some extent should also be possible where you use the first controlled AIs to improve the safety of later deployments (both improving control and possibly alignment).
Is your perspective something like:
With (properly motivated) qualitatively wildly superhuman AI, you can end the Acute Risk Period using means which aren't massive crimes despite not collaborating with the US government. This likely involves novel social technology. More minimally, if you did have a sufficiently aligned AI of this power level, you could just get it to work on ending the Acute Risk Period in a basically legal and non-norms-violating way. (Where e.g. super persuasion would clearly violate norms.)
I think that even having the ability to easily take over the world as a private actor is pretty norms violating. I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period (edit: without US government collaboration and) without needing truly insanely smart AIs. I suppose that if you go smart enough this is possible, though pre-existing norms also just get more confusing in the regime where you can steer the world to whatever outcome you want.
So overall, I'm not sure I disagree with this perspective exactly. I think the overriding consideration for me is that this seems like a crazy and risky proposal at multiple levels.
To be clear, you are explicitly not endorsing this as a plan nor claiming this is Anthropic's plan.
I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place.
AIs which aren't qualitatively much smarter than humans seem plausible to use reasonably effectively while keeping risk decently low (though still unacceptably risky in objective/absolute terms). Keeping risk low seems to require substantial effort, though it seems maybe achievable. Even with token effort, I think risk is "only" around 25% with such AIs because default methods likely avoid egregious misalignment (perhaps 30% chance of egregious misalignment with token effort and then some chance you get lucky for a 25% chance of risk overall).
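To make the arithmetic in that parenthetical explicit, here is a minimal sketch. It assumes (this is my reading, not something stated above) that essentially all of the risk flows through egregious misalignment, so the overall risk is the chance of egregious misalignment times the chance that things then go badly. The function name is hypothetical and the inputs are just the rough figures quoted above.

```python
def implied_bad_outcome_given_misaligned(p_misaligned: float, p_risk: float) -> float:
    """Back out P(bad outcome | egregious misalignment).

    Assumes p_risk = p_misaligned * P(bad outcome | misaligned), i.e. that
    egregious misalignment is the only route to the bad outcome.
    """
    return p_risk / p_misaligned


p_misaligned = 0.30  # "perhaps 30%" chance of egregious misalignment with token effort
p_risk = 0.25        # "only" around 25% risk overall

p_bad = implied_bad_outcome_given_misaligned(p_misaligned, p_risk)
p_lucky = 1 - p_bad  # chance you "get lucky" despite egregious misalignment

print(f"P(bad | misaligned)   = {p_bad:.2f}")    # 0.83
print(f"P(lucky | misaligned) = {p_lucky:.2f}")  # 0.17
```

Under that assumption, the "some chance you get lucky" term works out to roughly one in six.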
Then given this, I have two objections to the story you seem to present:
Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI. Building insane weapons/defenses requires US government consent (unless you're committing massive crimes, which seems like a bad idea). Thus, you might as well go all the way to preventing much smarter AIs from being built (by anyone) for a while, which seems possible with some US government support and the use of these human-ish level AIs.
TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require committing a bunch of massive crimes.
I have a policy against committing or recommending committing massive crimes.
My proposal would roughly be that the US government (in collaboration with allies etc.) enforces that no one builds AIs which are qualitatively smarter than humans, and that this should be the default plan.
(This might be doable without government support via coordination between multiple labs, but I basically doubt it.)
There could be multiple AI projects backed by the US+allies or just one. Either could be workable in principle, though multiple seems tricky.
I left an earlier comment on more important and clearer points, but I thought I would also leave a comment with some more tentative remarks, explanations of where I'm confused, and notes.
In this checklist, perhaps the most important target for ASL-4 technical safety measures (not including security) is having the measures to ensure that ASL-4 is safe with respect to misalignment. From what I could tell, you were imagining:
I think mechanistic interpretability is unlikely to provide a high degree of assurance (<10% probability of a 10x risk reduction, and <20% probability of a 3x risk reduction where the risk reduction is relative to other more simple methods). So, I think the first bullet will likely be unimportant. Then, the question is whether the second bullet will suffice or whether you are leaning on control.
I was generally a bit unclear on what you were imagining given that you didn't describe a specific list of possible approaches and I generally didn't see a clear story for how the safety case would work. At a more basic level, I expect that the main hope for the first TAIs will end up being AI control in short timelines, and this seems underemphasized if you think this, so likely we disagree or I ended up confused about what you're imagining.
Personally, I'd advocate for using roughly three pillars for ASL-4:
It seems likely to me that this proposal I'm outlining doesn't provide a reasonable level of assurance (e.g. <0.5% takeover/escape/severe sabotage risk throughout ASL-4), but I think it could work at the start of ASL-4 and is reasonably likely to be the best we've got in short timelines.
Overall, I think that it will likely be quite hard to meet a reasonable level of assurance for ASL-4 with the technology we have, at least in slightly more pessimistic worlds (worlds where there are some warning signs or the architecture is somewhat less favorable). So, plans should bear this in mind. I think extensive planning for when you can't meet reasonable safety targets seems quite important, including plans for when you find smoking gun evidence of egregious misalignment or plans for when the evidence is more confusing, but our countermeasures and tests aren't sufficiently good to provide a reasonable degree of assurance.
You used the term "early TAI". I would have thought this meant the first TAI where you defined TAI to be an AI which is a drop-in replacement for human workers, including in AI R&D. However, in the sections "Largely Solving Alignment Fine-Tuning for Early TAI" and "Rendering Early TAI Reliably Harmless", you seemed to possibly imply that this was before TAI. In particular, you seemed to imply that scheming wouldn't be a risk at the start of TAI (seems unclear to me given that the AIs are comparable to top human experts!) and you seemed to imply that such AIs wouldn't speed up R&D that much (while I expect that drop-in replacements for top human AI researchers would pretty likely result in huge acceleration!).
Perhaps I was mostly confused by how you split things up across different headings?
This is also related to my earlier point on ASL-4 security.
You seem to define TAI as "AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D". This seems to imply that TAI would likely greatly accelerate AI R&D to the extent that it is possible to greatly accelerate with human labor (which seems like an assumption of the piece).
But later you say:
AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety.
I would have guessed that for this definition and the assumptions of the piece, you would expect >10x acceleration of AI R&D (perhaps 30x) given the definition of TAI.
Further "allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible" doesn't seem like an interesting or important bar to me. If progress has already accelerated and the prediction is that it will keep accelerating, then the intelligence explosion is already happening (though it is unclear what the full extent of the explosion will be).
You seem to have some idea in mind of "intelligence-explosion level", but it was unclear to me what this was supposed to mean.
Perhaps you mean something like "could the AIs fully autonomously do an intelligence explosion"? This seems like a mostly unimportant distinction given that projects can use human labor (even including rogue AI-run projects). I think a more natural question is how much the AIs can accelerate things over a human baseline and also how easy further algorithmic advances seem to be (or if we end up effectively having very minimal further returns to labor on AI R&D).
Overall, I expect all of this to be basically continuous such that we are already reasonably likely to be on the early part of a trajectory which is part of an intelligence explosion and this will be much more so the case once we have TAI.
(That said, note that at TAI I still expect things will take potentially a few years and potentially much longer if algorithmic improvements fizzle out rather than resulting in a hyperexponential trajectory without needing further hardware.)
See also the takeoff speeds report by Tom Davidson and this comment from Paul.
Here's a long series of takes I had while reading this; this is ordered sequentially rather than by importance.
I removed takes/notes which I already covered above or in my earlier comment.
Here are some of the assumptions that the piece relies on. I don't think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans: Broadly human-level AI is possible. I'll often refer to this as transformative AI (or TAI), roughly defined as AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]
FWIW, broadly human-level AI being possible seems near certain to me (98% likely)
If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that's not wildly different from today.
I interpret this statement to mean P(TAI before 2030 | TAI possible) > 50%. This seems a bit too high to me, though it is also in a list of things called plausible so I'm a bit confused. (Style-wise, I was a bit confused by what seems to be probabilities of probabilities.) I think P(TAI possible) is like 98%, so we can simplify to P(TAI before 2030) which I think is maybe 35%?
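As a quick check on that simplification, here is a minimal sketch using the two numbers above. It assumes that "TAI before 2030" implies "TAI possible", so the joint probability is just P(TAI before 2030). The variable names are mine.

```python
# Rough figures from the comment above; purely illustrative.
p_tai_possible = 0.98      # P(TAI possible)
p_tai_before_2030 = 0.35   # P(TAI before 2030), which already implies TAI is possible

# P(TAI before 2030 | TAI possible) = P(TAI before 2030 and possible) / P(possible)
p_conditional = p_tai_before_2030 / p_tai_possible

print(f"P(TAI before 2030 | TAI possible) = {p_conditional:.3f}")  # 0.357
print(f"Above 50%? {p_conditional > 0.5}")                         # False
```

Conditioning on possibility barely moves the number, which is why dropping the condition is harmless here.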
Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.
I'm surprised this doesn't mention misalignment.
Also, I find "misuse of weapons-related capabilities" somewhat awkward. Is this referring to countries using AI for military purposes, or to terrorists / lone actors using AIs for terrorism/similar? I think it's unnatural to call countries using your AI for weapons R&D "misuse", in the same way we don't call stealing F-35 plans "misuse".
Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.
Seems surprisingly weak if these systems are drop-in replacements for humans!
Our best models are broadly superhuman, warranting ASL-5 precautions, and they're starting to be used in high-stakes settings. They're able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can't keep up with.
Based on some aspects of how Sam talks about ASL-4, I'm tempted to interpret "broadly superhuman" to mean "at least as good as the best humans in all important domains". However, the language he uses elsewhere and the specific words seem to actually mean "notably better than the best humans in all important domains", or perhaps even "much better than the best humans in all important domains". Some of the text later on is "first substantially superhuman models pose ASL-5-level risk", which seems to imply "notably better" or "much better" to me.
For now, I'll assume Sam means "notably better" or "much better" as this seems like the most likely interpretation.
I think you shouldn't build AIs which are this smart until you've had a long time with models comparable to top human scientists (perhaps >5 years of delay) and you should be making plans to avoid building such models and handle things with human-ish level models. I think you should try to avoid building AIs which are smarter than top human scientists and I don't see a strong reason to expect you'll need stronger AIs. (More on this...)
If Sam actually means "at least as good", then I disagree with the comments "if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2" and "This is the endgame for our AI safety work: If we haven't succeeded decisively on the big core safety challenges by this point, there's so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now." If Sam means "notably better" or "much better", then I think this seems about right. However, in this case, my comments about avoiding building such systems apply.
In particular, we should build external safeguards around our AI systems that are sufficient to prevent them from doing any serious harm, even if they are trying to cause serious harm.
+1
If they're significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
Given how easy the RSP is to amend with approximately zero cost in most cases like this (clearly unwarranted pause), I think this is unlikely to happen in practice. I guess I could imagine some more specific commitment you make like the ASL-3 commitments being a problem, but I roughly expect procedural commitments for ASL-4 and beyond. (Other than perhaps security.)
I generally think we're unlikely to meet a "reasonable" risk threshold for the key regimes and thus think that a lot of the action is in being super clear about loosening the RSP and exactly what that entails. Plausibly, reasonable actors should self-destruct in these cases, though I'm currently skeptical this will happen for any actor and I also lean toward this being a bad idea.
Overall, I'd like the structure of RSPs to look more like a procedural commitment than a specific commitment to countermeasures. You should explain your evaluations in advance and notify the public if you deviate from them.
One of Anthropic's main distinguishing safety research bets is that we expect a deep effort into mechanistic interpretability to produce a near-uniquely valuable source of evidence about safety. Major successes in this direction, even if they fall short of our north-star enumerative safety goal (roughly, proving that a model has some property) would likely form some of the highest-confidence core pieces of a safety case.
I'm skeptical and this seems pretty unlikely.
(I think something involving model internals could end up being pretty useful, but I don't expect that Anthropic's ongoing work in mech interp will help much with this.)
Safety cases for most deployments (i.e., any deployment where the model could be used for high-stakes tasks) will need to include evidence that our safety measures are highly robust.
I don't think this is necessary exactly (see here). But, it would make it easier.
quantitatively forecast the effectiveness of our mitigations
This section seems to think that forecasting is key. Why? Can't we just check if the mitigations are sufficient at the time, without needing to extrapolate (which imposes additional difficulties)? I agree forecasting is useful.
Protecting algorithmic secrets [...] It is unclear how important this will be in the Chapter 1 regime, but since it is both quite difficult and likely to become quite important in Chapter 2, it is worth investing in significantly, if only as practice.
It seems like the proposal is to de facto assume current algorithmic secrets will be stolen by competent actors eventually. This is my guess for what Anthropic should do at the moment, but it is worth remarking on how scary this is.
We will likely not trust models with full high-bandwidth access to modify our infrastructure and codebase (barring major breakthroughs in the degree to which we can verify alignment-related properties of models), so engineer time will still be a binding constraint on a lot of what we do.
This seems to be implying that you won't be able to get massive AI speed-ups because you won't trust AIs in many domains. I think this will likely be seen as unacceptable and thus you should prepare for how to deploy AIs in these domains. (See also AI Control.)
Providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.
I don't think the board provides a real independent perspective at the moment, and I'm unsure if it ever will. (See also Zach's comment.) The LTBT will likely just defer to the board in the current status quo (they should be appointing the board by ASL-4 from my understanding), so there is really just the board. I don't think we should trust the board of Anthropic to be an independent check on Anthropic unless the specific board members have well-built up and independent views on AI safety (and ideally a majority of such board members). (By default I think Anthropic leadership will effectively control the board and convince the board that they should defer to experts who agree with the leadership.)
I think all of this is hard to achieve even if this was a high priority, but the bottom line is that the board is unlikely to be an independent check. (Various employees could potentially be an independent check.)
On that note, I think the most urgent safety-related issue that Anthropic can't directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.
Strong +1
we as an organization are very limited in what we can do to make this happen.
I disagree, I think there are things Anthropic could do that would help considerably. This could include:
I'm not sure exactly what is good here, but I don't think Anthropic is as limited as you suggest.
This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases
I think Anthropic should either heavily plan for scenarios where it self-destructs as an organization or should be prepared to proceed with a plan B that doesn't meet an objectively reasonable safety bar (e.g. <1% lifetime takeover risk). (Or both.)
Developing Methods to Align a Substantially Superhuman AI
I think you should instead plan on not building such systems, as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems; I just think the focus should instead be on measures to avoid needing to build them.
Addressing AI Welfare as a Major Priority
I lean against focusing on AI welfare as a major priority. I'd target perhaps 2.5% of resources on AI welfare in the absence of outside pressure. (Zach also made a comment on this and my earlier post on AI welfare talks about resource allocation considerations some.)
As far as security, perhaps part of what is going on is that you expect that achieving this high bar of security is too expensive:
ASL-4 is much more demanding and represents a rough upper limit on what we expect to be able to implement without heavily interfering with our research and deployment efforts.
My sense is indeed that SL5-level security would be a large tax to operate under, particularly when implemented in a hurry. However, I think this is also a natural point at which national security concerns become large and commercialization is likely to be greatly reduced.
Thanks for writing this! I appreciate the effort to make your perspective more transparent (and implicitly Anthropic's perspective as well). In this comment, I'll explain my two most important concerns with this proposal:
I have a variety of other takes on this proposal as well as various notes, but I decided to start by just writing up these two thoughts. I'll very likely add more comments in a bit.
When you have AIs which can massively accelerate total AI capabilities R&D production (>5x)[1], I think it is crucial that such systems are secure against very high-resource state actor attacks. (From the RAND report, this would be >=SL5[2].) This checklist seems to assume that this level of security isn't needed until ASL-5, but you also say that ASL-5 is "broadly superhuman" so it seems likely that dramatic acceleration occurs before then. You also say various other things that seem to imply that early TAI could dramatically accelerate AI R&D.
I think having extreme security at the point of substantial acceleration is the most important intervention on current margins in short timelines. So, I think it is important to ensure that this is correctly timed.
As far as the ASL-4 security bar, you say:
early TAI will likely require a stronger ASL-4 standard, under which we need to be capable of defending against all but the most sophisticated nation-state-level attacks
This seems to imply a level of security of around SL4 from the RAND report, which would be robust to routine state actor attacks, but not top-priority attacks. This seems insufficient given the costs of such a system being stolen and how clearly appealing this would be, as you note later:
We will need to be prepared for TAI-level model weights to be one of the most sought-after and geopolitically important resources in history.
This seems to imply that you'll need more than SL5 security![3]
You say that at ASL-5 (after TAI):
Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made.
For us to judge whether this is a good overall proposal, it would help to have a clear exit plan including an argument that this exit state is acceptable. This allows us to discuss whether this exit state actually is acceptable and to analyze whether the earlier steps will plausibly lead to this state.
Further, it seems likely that the plans and proposals people are writing now are also the plans that the government might end up using! (After all, who will the relevant people in the US government ask for help in figuring out what to do?)
As far as I can tell, the proposed exit state you're imagining is roughly "perfectly (scalably?) solve alignment (or just for substantially superhuman systems?) and then hand things off to the government":
So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control.
(The proposal is also that we'll have improved societal resilience and tried to address AI welfare concerns along the way. I left this out because it seemed less central.)
I worry about the proposal of "perfectly solve alignment", particularly because you didn't list any concrete plausible research bets which are being scoped out and tested in advance. (And my sense is that few plausible bets exist and all seem pretty likely to fail.)
Further, I think the default plan should be to ensure a delay (perhaps at least 5 years) prior to building wildly superhuman AI after full TAI. Wildly superhuman AI seems incredibly scary, even relative to AIs which are qualitatively comparably smart to top human experts. (This is not just for misalignment reasons, though this is part of it.) I think the "perfectly solve alignment then the government is in charge" plan plausibly leads to the default action being to build wildly superhuman AI as quickly as possible. So, I'd prefer a more specific plan which is opinionated about the transition to wildly superhuman AI.
My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them even on tricky, high-stakes questions where we can't (or just don't) check their outputs. For AIs to be human-obsoleting, they would need to outcompete top human experts, and thus probably can't be much dumber than top human experts. Ideally, they also wouldn't be much smarter than top human experts and preferably notably dumber while compensating with speed and knowledge (though the capability profile of such systems might be very spiky in practice). Being able to defer to AIs to this extent (we trust them at least as much as well-regarded humans in terms of judgment and alignment) is stronger than just ruling out egregious misalignment (e.g. it requires that AIs are sufficiently philosophically competent and actually try on near-impossible-to-check questions), though I expect that ruling out egregious misalignment is the hardest part in practice. Further, we would aim to have more than 3 years of time prior to needing to build substantially superhuman AIs to remain competitive after the point where we have such human-obsoleting and deferrable AIs.
As part of achieving my proposed exit plan, I would also focus much more on buying time when AIs are this powerful or nearly this powerful. Using huge amounts of AI labor to buy time could be done prior to achieving the full desiderata of this exit plan using approaches like AI Control. Part of this would involve increasing willingness to pay on coordination and non-proliferation such as making demonstrations of risks as you discuss in the section "Developing Clear Smoking Gun Demos for Emerging Risk Factors". (That said, this seems focused on speculative risks, and I'd also potentially expand this to demonstrations of the most concerning capabilities from a national security perspective.) But beyond increasing willingness to pay, I think it also seems likely to be important to work (using vast amounts of AI labor) on reducing the cost of coordination and non-proliferation such as by building better monitoring approaches (e.g., better satellite tracking of data centers), improving verification approaches, building up carrots for coordination, and potentially building up sticks to reduce the cost of enforcing non-proliferation.
I think this is a good proposal because it seems as though a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point. The AIs we defer to could end up recommending a wide variety of proposals (buying more time, emulated minds, developing totally different AI paradigms), not just doing huge amounts of research into aligning superintelligence. Such AIs would always have an option of trying to run themselves much faster to accomplish very difficult research which requires substantial serial time. We wouldn't need to be able to check their work or their intermediate states which improves the situation considerably.
More generally, it just seems really heuristically scary to very quickly go from AIs which aren't much smarter than the best humans to AIs which are wildly smarter in only a few years. Edit: so it would be good to buy time, first for ensuring we have human-obsoleting AIs we can defer to and then so that these AIs can have enough time to figure out what to do.
Buck and I are writing an overall short plan for (technical) AI safety and in this we'll discuss all aspects of this in more detail.
I'm referring to AIs making humans working on improving AI from the software side >5x more productive. It is plausible (perhaps 40% in short timelines) that algorithmic advances will in practice slow down dramatically such that software improvements aren't an important input, but I think you should be robust to the likely outcome in which AI R&D is very important. ↩︎
Note that SL5 is not ASL-5! ↩︎
I think you likely need substantially more than SL5 security (security sufficient to resist unprecedentedly well-resourced attacks) within a short period after you have AIs which can massively (>10x) accelerate R&D in key domains (AI, weapons, cyber offense), but you don't need it immediately. It's probably a good 80/20 to aim for SL5 and then greater security within a year. This could plausibly be insufficient and is probably objectively unacceptably risky (e.g., maybe the RSP should demand stronger security, with a target of ensuring less than a 5% chance the key models are stolen), but something like this seems to get most of the benefit and even this 80/20 is still very ambitious. ↩︎
Agreed, I really should have said "or possibly even train against it". I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging throughout training, then you need N full training runs, which is very pricy!
Presumably it helps a lot if AIs are clearly really powerful and their attempts to seize power are somewhat plausible?
I think it's plausible (though not certain!) that people will take strong action even if they think the AI is role playing.
I think this is a reasonable line to draw - people should freak out if deployed models clearly try to escape/similar, regardless of the reason. I don't know how this will go in practice, but I don't think people should accept "the AI was just role playing" as an excuse for an actually deployed AI system on an actual user input.