Epistemic status: I only ~50% endorse this, which is below my typical bar for posting something. I’m more bullish on “these are arguments which should be in the water supply and discussed” than “these arguments are actually correct.” I’m not an expert in this, I’ve only thought about it for ~15 hours, and I didn’t run this post by any relevant experts before posting.

Thanks to Max Nadeau and Eric Neyman for helpful discussion.

Right now there's a significant amount of public debate about open source AI. People concerned about AI safety generally argue that open sourcing powerful AI systems is too dangerous to be allowed; the classic example here is "You shouldn't be allowed to open source an AI system which can produce step-by-step instructions for engineering novel pathogens." On the other hand, open source proponents argue that open source models haven't yet caused significant harm, and that trying to close access to AI will result in concentration of power in the hands of a few AI labs.

I think many AI safety-concerned folks who haven’t thought about this that much tend to vaguely think something like “open sourcing powerful AI systems seems dangerous and should probably be banned.” Taken literally, I think this plan is a bit naive: when we're colonizing Mars in 2100 with the help of our aligned superintelligence, will releasing the weights of GPT-5 really be a catastrophic risk?

I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source." 

I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model (call it GPT-SoTA)[1]; GPT-SoTA is widely deployed throughout the economy and deployed to monitor for and prevent certain attack vectors, and the best open source model isn't smart enough to cause any significant harm without GPT-SoTA catching it. In this hypothetical world, so long as we can trust GPT-SoTA, we are safe from harms caused by open source models. In other words, so long as the best open source models lag sufficiently behind the best proprietary models and we’re smart about how we use our best proprietary models, open sourcing models isn't the thing that kills us.

In the rest of this post I will:

  • Motivate this plan by analogy to responsible disclosure in cryptography
  • Go into more detail on this plan
  • Discuss how this relates to my understanding of the current plan as implied by responsible scaling policies (RSPs)
  • Discuss some key uncertainties
  • Give some higher-level thoughts on the discourse surrounding open source AI

An analogy to responsible disclosure in cryptography

[I'm not an expert in this area and this section might get some details wrong. Thanks to Boaz Barak for pointing out this analogy (but all errors are my own).

See this footnote[2] for a discussion of alternative analogies you could make to biosecurity disclosure norms, and whether they’re more apt to risk from open source AI.]

Suppose you discover a vulnerability in some widely-used cryptographic scheme. Suppose further that you're a good person who doesn't want anyone to get hacked. What should you do?

If you publicly release your exploit, then lots of people will get hacked (by less benevolent hackers who've read your description of the exploit). On the other hand, if white-hat hackers always keep the vulnerabilities they discover secret, then the vulnerabilities will never get patched until a black-hat hacker finds the vulnerability and exploits it. More generally, you might worry that not disclosing vulnerabilities could lead to a "security overhang," where discoverable-but-not-yet-discovered vulnerabilities accumulate over time, making the situation worse when they're eventually exploited.

In practice, the cryptography community has converged on a responsible disclosure policy along the lines of:

  • First, you disclose the vulnerability to the affected parties.
    • As a running example, consider Google's exploit for the SHA-1 hash function. In this case, there were many affected parties, so Google publicly posted a proof-of-concept for the exploit, but didn't include enough detail for others to immediately reproduce it.
    • In other cases, you might privately disclose more information, e.g. if you found a vulnerability in the Windows OS, you might privately disclose it to Microsoft along with the code implementing an exploit.
  • Then you set a reasonable time-frame for the vulnerability to be patched.
    • In the case of SHA-1, the patch was "stop using SHA-1" and the time-frame for implementing this was 90 days.
  • At the end of this time period, you may publicly release your exploit, including with source code for executing it.
    • This ensures that affected parties are properly incentivized to patch the vulnerability, and helps other white-hat hackers find other vulnerabilities in the future.
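The timing logic of this protocol can be sketched in a few lines of Python. This is only an illustration: the `Disclosure` class, its field names, and the report date are hypothetical, and the 90-day window is taken from the SHA-1 example above rather than from any universal rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Disclosure:
    """One vulnerability moving through a responsible-disclosure process (hypothetical model)."""
    reported_on: date        # day the affected parties were notified
    patch_window_days: int   # time-frame given for patching

    @property
    def public_release_date(self) -> date:
        # First day on which the full exploit may be published.
        return self.reported_on + timedelta(days=self.patch_window_days)

    def may_publish_exploit(self, today: date) -> bool:
        # Publication is allowed once the window has elapsed, patched or not;
        # that deadline is what gives affected parties an incentive to patch.
        return today >= self.public_release_date


# Hypothetical report date; the 90-day window mirrors the SHA-1 example.
d = Disclosure(reported_on=date(2024, 1, 1), patch_window_days=90)
print(d.public_release_date)                    # 2024-03-31
print(d.may_publish_exploit(date(2024, 2, 1)))  # False
print(d.may_publish_exploit(date(2024, 4, 1)))  # True
```

Note that the check depends only on the calendar, not on whether a patch exists. That detail matters for the comparison with the AI protocol later in the post.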

As I understand things, this protocol has resulted in our cryptographic schemes being relatively robust: people mostly don't get hacked in serious ways, and when they do it's mostly because of attacks via social engineering (e.g. the CIA secretly owning their encryption provider), not via attacks on the scheme.[3]

Responsible disclosure for capabilities of open source AI systems: an outline

[Thanks to Yusuf Mahmood for pointing out that the protocol outlined in this section is broadly similar to the one here. More generally, I expect that ideas along these lines are already familiar to people who work in this area.]

In this section I’ll lay out a protocol for open sourcing AI systems which is analogous to the responsible disclosure protocol from cryptography. Suppose the hypothetical company Mesa has trained a new AI system camelidAI which Mesa would like to open source. Let’s also call the most capable proprietary AI system GPT-SoTA, which we can assume is well-behaved[4]. I’m imagining that GPT-SoTA is significantly more capable than camelidAI (and, in particular, is superhuman in most domains). In principle, the protocol below will still make sense if GPT-SoTA is worse than camelidAI (because open source systems have surpassed proprietary ones), but it will degenerate to something like “ban open source AI systems once they are capable of causing significant novel harms which they can’t also reliably mitigate.”

In this protocol, before camelidAI can be open sourced, [?? Mesa?, the government?, a third-party? ??] must:

  • Evaluate camelidAI for what sorts of significant novel harms it could cause if open-sourced.[5] These evaluators should have, at a minimum, access to all the tools that users of the open source system will have, including e.g. the ability to finetune camelidAI, external tooling which can be built on top of camelidAI, and API calls to GPT-SoTA. So a typical workflow might look something like: have GPT-SoTA generate a comprehensive list of possible takeover plans, then finetune camelidAI to complete steps in these plans. For example, we might find that:
  • Disclose these new harmful capabilities to [?? the government?, a third-party monitor?, affected parties? ??].
  • Work with relevant actors to improve systems until they are robust to everyone having access to camelidAI.
    • E.g. make sure that there is a widely-available open-source tool which can detect phishing attempts as sophisticated as camelidAI's with very high reliability.
    • E.g. shut down the DNA synthesis companies and biolabs that don’t screen orders, or force them to use GPT-SoTA to screen orders to potential pandemic agents.
    • Note that if camelidAI is very capable, some of these preventative measures might be very ambitious, e.g. “make society robust to engineered pandemics.” The source of hope here is that we have access to a highly capable and well-behaved GPT-SoTA. 
    • Note also that these “robustification” measures are things that we should do anyway, even if we didn’t want to open source camelidAI; otherwise there would be an overhang that a future unaligned AI (possibly a model which was open sourced illegally) could exploit.
  • Once society is robust to harms caused by camelidAI (as certified by [??]), you are allowed to open source camelidAI.
  • On the other hand, if Mesa open sources camelidAI before finishing the above process, then it’s treated as a bad actor (similarly to how we would treat a hacker who releases an exploit without responsible disclosure).
    • Maybe this means that you are held liable for harms caused by camelidAI or something, not really sure.

As examples, let me note two special cases of this protocol:

  • Suppose camelidAI = LLaMA-2. I think probably there are no significant novel harms enabled by access to LLaMA-2[6]. Thus, after making that evaluation, the mitigation step is trivial: no “patches” are needed, and LLaMA-2 can be open sourced. (I think this is good: AFAICT, LLaMA-2’s open sourcing has been good for the world, including for alignment research.)
  • Suppose camelidAI is capable of discovering this one weird trick for turning rocks and air into a black hole (astrophysicists hate it!). Assuming there is no plausible mitigation for this attack, camelidAI never gets to be open sourced. (I hope that even strong open source proponents would agree that this is the right outcome in this scenario.)
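To make the decision rule concrete, here is a minimal Python sketch of the release gate, covering the two special cases above. Everything in it is an assumption made for illustration: the `ReleaseReview` and `Mitigation` names, the three status values, and the idea that each harm gets a single status. The open questions marked [??] in the protocol (who evaluates, who certifies) are not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mitigation(Enum):
    NOT_STARTED = "not started"
    CERTIFIED = "certified"    # society has been certified robust to this harm
    IMPOSSIBLE = "impossible"  # no plausible mitigation is known


@dataclass
class ReleaseReview:
    """Hypothetical record of one model's pre-release review."""
    model_name: str
    evaluated: bool = False
    # Each significant novel harm found by evaluators, with its mitigation status.
    harms: dict = field(default_factory=dict)

    def may_open_source(self) -> bool:
        # Release requires a finished evaluation and every harm certified as
        # mitigated. With no harms found, all() over an empty collection is True.
        if not self.evaluated:
            return False
        return all(s is Mitigation.CERTIFIED for s in self.harms.values())

    def blocked_indefinitely(self) -> bool:
        # A harm with no plausible mitigation blocks release until that changes.
        return any(s is Mitigation.IMPOSSIBLE for s in self.harms.values())


# Special case 1: evaluation finds no significant novel harms.
harmless = ReleaseReview("no-novel-harms model", evaluated=True)
print(harmless.may_open_source())  # True

# Intermediate case: a harm is found, then later certified as mitigated.
camelid = ReleaseReview("camelidAI", evaluated=True,
                        harms={"sophisticated phishing": Mitigation.NOT_STARTED})
print(camelid.may_open_source())   # False
camelid.harms["sophisticated phishing"] = Mitigation.CERTIFIED
print(camelid.may_open_source())   # True

# Special case 2: a harm with no plausible mitigation.
doomsday = ReleaseReview("black-hole model", evaluated=True,
                         harms={"rocks-and-air black hole": Mitigation.IMPOSSIBLE})
print(doomsday.may_open_source(), doomsday.blocked_indefinitely())  # False True
```

Unlike a disclosure deadline in cryptography, nothing in `may_open_source` depends on how much time has passed.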

I’ll also note two ways that this protocol differs from responsible disclosure in cryptography:

  1. Mesa is not allowed to set a deadline on how long society has to robustify itself to camelidAI's capabilities. If camelidAI has a capability which would be catastrophic if misused and it takes a decade of technological progress before we can come up with a "patch" for the problem, then Mesa doesn't get to open source the model until that happens. 
  2. In cryptography, the onus is on affected parties to patch the vulnerability, but in this case the onus is partly on the AI system's developer. 

These two differences mean that other parties aren't as incentivized to robustify their systems; in principle they could drag their feet forever and Mesa will never get to release camelidAI. I think something should be done to fix this, e.g. the government should fine companies which insufficiently prioritize implementing the necessary changes.

But overall, I think this is fair: if you are aware of a way that your system could cause massive harm and you don't have a plan for how to prevent that harm, then you don't get to open source your AI system.

One thing that I like about this protocol is that it's hard to argue with: if camelidAI is demonstrably capable of e.g. autonomously engineering a novel pathogen, then Mesa can't fall back to claiming that the harms are imaginary or overhyped, or that as a general principle open source AI makes us safer. We will have a concrete, demonstrable harm; and instead of debating whether AI harms can be mitigated by AI in the abstract, we can discuss how to mitigate this particular harm. If AI can provide a mitigation, then we’ll find and implement the mitigation. And similarly, if it ends up that the harms were imaginary or overhyped, then Mesa will be free to open source camelidAI.

How does this relate to the current plan?

As I understand things, the high-level idea driving many responsible scaling policy (RSP) proponents is something like:

Before taking certain actions (e.g. training or deploying an AI system), AI labs need to make "safety arguments," i.e. arguments that this action won't cause significant harm. For example, if they want to deploy a new system, they might argue:

  1. Our system won't cause harm because it's not capable enough to do significant damage. (If OpenAI had been required to make a safety argument before releasing GPT-4, this is likely the argument they would have made, and it seems true to me.)
  2. Our system could cause harm if it attempted to but it won't attempt to because, e.g. it is only deployed through an API and we've ensured using [measures] that no API-mediated interaction could induce it to attempt harm.
  3. Our system could cause harm if it attempted to and we can't rule out that it will attempt to, but it won't succeed in causing harm because, e.g. it's only being used in a tightly-controlled environment where we have extremely good measures in place to stop it from successfully executing harmful actions.

If no such argument exists, then you need to do something which causes such an argument to exist (e.g. doing a better job of aligning your model, so that you can make argument (2) above). Until you've done so, you can't take whatever potentially-risky action you want to take.
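The three arguments above form a fallback chain, which can be sketched in Python as follows. The function and parameter names are hypothetical, and the three booleans stand in for claims that would in reality need substantial evidence; this shows only the logical structure, not how any lab actually evaluates its systems.

```python
from enum import Enum
from typing import Optional


class SafetyArgument(Enum):
    NOT_CAPABLE = 1       # (1) the system cannot do significant damage
    WILL_NOT_ATTEMPT = 2  # (2) capable, but will not attempt harm
    WILL_NOT_SUCCEED = 3  # (3) might attempt harm, but cannot succeed


def available_safety_argument(capable_of_harm: bool,
                              might_attempt_harm: bool,
                              attempts_contained: bool) -> Optional[SafetyArgument]:
    """Return the first safety argument that applies, or None if none does."""
    if not capable_of_harm:
        return SafetyArgument.NOT_CAPABLE
    if not might_attempt_harm:
        return SafetyArgument.WILL_NOT_ATTEMPT
    if attempts_contained:
        return SafetyArgument.WILL_NOT_SUCCEED
    return None  # no argument exists, so the action is not yet allowed


def action_allowed(capable_of_harm: bool,
                   might_attempt_harm: bool,
                   attempts_contained: bool) -> bool:
    return available_safety_argument(
        capable_of_harm, might_attempt_harm, attempts_contained) is not None


print(available_safety_argument(False, True, False))  # SafetyArgument.NOT_CAPABLE
print(action_allowed(True, True, False))              # False
print(action_allowed(True, True, True))               # True
```

When the result is `None`, the developer has to change one of the inputs, for example by improving alignment so that argument (2) applies.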

I think that if you apply this idea in the case where the action is "open sourcing an AI system," you get something pretty similar to the protocol I outlined above: in order to open source an AI system, you need to make an argument that it's safe to open source that system. If there is no such argument, then you need to do stuff (e.g. improve email monitoring for phishing attempts) which makes such an argument exist.

Right now, the safety argument for open sourcing would be the same as (1) above: current open source systems aren't capable enough to cause significant novel harm. In the future, these arguments will become trickier to make, especially for open source models which can be modified (e.g. finetuned or incorporated into a larger system) and whose environment is potentially "the entire world." But, as the world is radically changed by advances in frontier AI systems, these arguments might continue to be possible for non-frontier systems. (And I expect open source models to continue to lag the frontier.)

Some uncertainties

Here are some uncertainties I have:

  • In practice, how does this play out?
    • I think a reasonable guess might be: in a few years, SoTA models will be smart enough to cause major catastrophes if open-sourced, and – even with SoTA AI assistance – we won’t be able to patch the relevant vulnerabilities until after the singularity (after which the ball is out of our court). If so, this protocol basically boils down to a ban on open source AI with extra steps.
    • I’ll note, however, that open source proponents (many of whom expect slower progress towards harmful capabilities) probably disagree with this forecast. If they are right then this protocol boils down to “evaluate, then open source.” I think there are advantages to having a policy which specializes to what AI safety folks want if AI safety folks are correct about the future and specializes to what open source folks want if open source folks are correct about the future.
  • Will evaluators be able to anticipate and measure all of the novel harms from open source AI systems?
    • Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post. Two reasons I’m worried evaluators might fail:
      • Evaluators might not have access to significantly better tools than the users, and there are many more users. E.g. even though the evaluators will be assisted by GPT-SoTA, so will the millions of users who will have access to camelidAI if it is open-sourced.
      • The world might change in ways that enable new threat models after camelidAI is open-sourced. For example, suppose that camelidAI + GPT-SoTA isn’t dangerous, but camelidAI + GPT-(SoTA+1) (the GPT-SoTA successor system) is dangerous. If GPT-(SoTA+1) comes out a few months after camelidAI is open-sourced, this seems like bad news.
  • Maybe using subtly unaligned SoTA AI systems to evaluate and monitor other AI systems is really bad for some reason that's hard for us to anticipate?
    • E.g. something something the AI systems coordinate with each other.

Some thoughts on the open source discourse

I think many AI safety-concerned folks make a mistake along the lines of: "I notice that there is some capabilities threshold T past which everyone having access to an AI system with capabilities >T would be an existential threat in today's world. On the current trajectory, someday someone will open source an AI system with capabilities >T. Therefore, open sourcing is likely to lead to extinction and should be banned."

I think this reasoning ignores the fact that at the time someone first tries to open source a system of capabilities >T, the world will be different in a bunch of ways. For example, there will probably exist proprietary systems of capabilities well above T. So overall, I think folks in the AI safety community worry too much about threats from open source models.

Further, AI safety community opposition to open source AI is currently generating a lot of animosity from the open source community. For background, the open source ideology is deeply interwoven with the history of software development, and strong proponents of open source have a lot of representation and influence in tech.[7] I'm somewhat worried that on the current trajectory, AI safety vs. open source will be a major battlefront making it hard to reach consensus (much worse than the IMO not-too-bad AI discrimination/ethics vs. x-risk division). 

To the extent that this animosity is due to unnecessary fixation on the dangers of open source or sloppy arguments for the existence of this danger, I think this is really unfortunate. I think there are good arguments for worrying in particular ways about the potential dangers of open sourcing AI systems at some scale, and I think being much more clear on the nuances of these threat models might lead to much less animosity. 

Moreover, I think there’s a good chance that by the time open source models are dangerous, we will have concrete evidence that they are dangerous (e.g. because we’ve already seen that unaligned proprietary models of the same scale are dangerous). This means that policy proposals of the shape “if [evidence of danger], then [policy]” get most of the safety benefit while also failing gracefully (i.e. not imposing excessive development costs) in worlds where the safety community is wrong about the pending dangers. Ideally, this means that such policies are easier to build consensus around.



  1. ^

    This currently seems about right to me, i.e. that LLaMA-2 is a little bit worse than GPT-3.5 which came out 20 months ago.

  2. ^

    Jeff Kaufman has written about a difference in norms between the computer security and biosecurity communities. In brief, while computer security norms encourage trying to break systems and disclosing vulnerabilities, biosecurity norms discourage open discussion of possible vulnerabilities. Jeff attributes this to a number of structural factors, including how difficult it can be to patch biosecurity vulnerabilities; it’s possible that threat models from open source AI have more in common with biorisk models, in which case we should instead model our defenses based on them. For more, ctrl-f “cold sweat” here to read Kevin Esvelt discussing why he didn’t disclose the idea of a gene drive to anyone – not even his advisor – until he was sure that it was defense-dominant. (h/t to Max Nadeau for both of these references, and for most of the references to bio-related material that I link elsewhere.)

  3. ^

    I expect some folks will want to argue about whether our cryptography is actually all that good, or point out that the words “relatively” and “mostly” in that sentence are concerning if you think that “we only get one shot” with AI. So let me preemptively clarify that I don't care too much about the precise success level of this protocol; I'm mostly using it as an illustrative analogy.

  4. ^

    We can assume this because we’re dealing with the threat model of catastrophes caused by open source AI. If you think the first thing that kills us is misaligned proprietary AI systems, then you should focus on that threat model instead of open source AI.

  5. ^

    This is the part of the protocol that I feel most nervous about; see bullet point 2 in the "Some uncertainties" section.

  6. ^

    It’s shown here that a LLaMA-2 finetuned on virology data was useful for giving hackathon participants instructions for obtaining and releasing the reconstructed 1918 influenza virus. However, it’s not clear that this harm was novel – we don’t know how much worse the participants would have done given only access to the internet.

  7. ^

    I've noticed that AI safety concerns have had a hard time gaining traction at MIT in particular, and one guess I have for what's going on is that the open source ideology is very influential at MIT, and all the open source people currently hate the AI safety people.


Comments
  • Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post. Two reasons I’m worried evaluators might fail:
    • [...]
    • The world might change in ways that enable new threat models after camelidAI is open-sourced. For example, suppose that camelidAI + GPT-SoTA isn’t dangerous, but camelidAI + GPT-(SoTA+1) (the GPT-SoTA successor system) is dangerous. If GPT-(SoTA+1) comes out a few months after camelidAI is open-sourced, this seems like bad news.

My main concern here is that there will be technical advancements in the world in things like finetuning or scaffolding and these will make camelidAI sufficiently capable to be a concern. This seems quite unlikely for current open-source models (as they are far from sufficiently capable), but will increase in probability as open source models get more powerful. E.g., it doesn't seem that unlikely to me that advances in finetuning, dataset construction, and scaffolding are sufficient for GPT4 to make lots of money doing cybercrime online (this threat model isn't very existentially concerning, but the stretch from here to existential concerns isn't that huge).

It's hard for me to be very confident (>99%) that there won't be substantial jumpy improvements along these lines. As there are probably larger threats other than open source, maybe we should just eat the small fraction of worlds (maybe 1-5%) where a sudden jump like this happens (it probably wouldn't be existential even conditional on large jumps). I'm sympathetic to not worrying much about 1/1000 or 1/100 doom from open sourcing when we probably have bigger problems...

Let’s also call the most capable proprietary AI system GPT-SoTA, which we can assume is well-behaved. I’m imagining that GPT-SoTA is significantly more capable than camelidAI (and, in particular, is superhuman in most domains). In principle, the protocol below will still make sense if GPT-SoTA is worse than camelidAI (because open source systems have surpassed proprietary ones), but it will degenerate to something like “ban open source AI systems once they are capable of causing significant novel harms which they can’t also reliably mitigate.”

I think a reasonable amount of the concern is going to come from GPT-SoTA stalling out or pausing due to alignment concerns. Then, if open source models continue to advance (either improvements on top of base models like I discussed earlier or further releases which can't be stopped), we might be in trouble. TBC, I don't think you were assuming that GPT-SoTA will necessarily keep advancing anywhere, but it seems relevant to note this concern.

We're starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG advances, etc. to be able to guesstimate the plausible size of further improvements (and amount of effort involved), so that we can try to leave some appropriate safety margin for it. That doesn't rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.

If they are right then this protocol boils down to “evaluate, then open source.” I think there are advantages to having a policy which specializes to what AI safety folks want if AI safety folks are correct about the future and specializes to what open source folks want if open source folks are correct about the future.

In practice, arguing that your evaluations show open-sourcing is safe may involve a bunch of paperwork and maybe lawyer fees. If so, this would be a big barrier for small teams, so I expect open-source advocates not to be happy with such a trajectory.

FWIW, this pretty closely matches my thinking on the subject.

I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source." 


The problem with this plan is that it assumes that there are easy ways to robustify the world. What if the only proper defense against bioweapons is a complete monitoring of the entire internet? Perhaps this is something that we'd like to avoid. In this scenario, your plan would likely lead to someone coming up with a fake plan to robustify the world and then claim that it'd be fine for them to release their model as open-source, because people really want to do open-source.

For example, in your plan you write:

Then you set a reasonable time-frame for the vulnerability to be patched: In the case of SHA-1, the patch was "stop using SHA-1" and the time-frame for implementing this was 90 days.

This is exactly the kind of plan that I'm worried about. People will be tempted to argue that surely 4 years is enough time for the biodefense plan to be implemented, four years rolls around and it's clearly not in place, but then they push for release anyway.

I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model

You seem to have hypothesised what is to me an obviously unsafe scenario. Let's suppose our best proprietary models hit upon a dangerous bioweapon capability. Well, now we only have two years to prepare for it, regardless of whether this is completely wildly unrealistic. Worse, this occurs for each and every dangerous capability.

Will evaluators be able to anticipate and measure all of the novel harms from open source AI systems? Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post.

When we're talking about risk management, a 50% chance that a key assumption will work out, when there isn't a good way to significantly reduce this uncertainty, often doesn't translate into a 50% chance of it being a good plan, but rather a near 0% chance.