The premise of AI risk is that AI is a danger, and therefore research into AI might be dangerous. In the AI alignment community, we're trying to do research which makes AI safer, but occasionally we might come up with results that have significant implications for AI capability as well. Therefore, it seems prudent to come up with a set of guidelines that address:

  • Which results should be published?
  • What to do with results that shouldn't be published?

These are thorny questions that it seems unreasonable to expect every researcher to solve for themselves. The inputs to these questions involve not only technical knowledge about AI, but also knowledge about the behavior of progress, to the extent we can produce such using historical record or other methods. AI risk organizations might already have internal policies on these issues, but they don't share them and don't discuss or coordinate them with each other (that I know of: maybe some do it in private channels). Moreover, coordination might be important even if each actor is doing something reasonable when regarded in isolation (avoiding bad Nash equilibria). We need to have a public debate on the topic inside the community, so that we arrive at some consensus (that might be updated over time). If not consensus, then at least a reasonable spectrum of possible policies.

Some considerations that such a policy should take into account:

  • Some results might have implications that shorten the AI timelines, but are still good to publish since the distribution of outcomes is improved.
  • Usually we shouldn't even start working on something which is in the should-not-be-published category, but sometimes the implications only become clear later, and sometimes dangerous knowledge might still be net positive as long as it's contained.
  • In the midgame, it is unlikely for any given group to make it all the way to safe AGI by itself. Therefore, safe AGI is a broad collective effort and we should expect most results to be published. In the endgame, it might become likely for a given group to make it all the way to safe AGI. In this case, incentives for secrecy become stronger.
  • The policy should not fail to address extreme situations that we only expect to arise rarely, because those situations might have especially major consequences.

Some questions that such a policy should answer:

  • What are the criteria that determine whether a certain result should be published?
  • What are good channels to ask for advice on such a decision?
  • How to decide what to do with a potentially dangerous result? Circulate in a narrow circle? If so, which? Conduct experiments in secret? What kind of experiments?

The last point is also related to a topic with independent significance, namely, what are reasonable precautions for testing new AI algorithms? This has both technical aspects (e.g. testing on particular types of datasets or particular types of environments, throttling computing power) and procedural aspects (who should be called on to advise on or decide the matter). I expect to have several tiers of precautions, such that a tier can be selected according to our estimate of the new algorithm's potential, along with guidelines for producing such an estimate.

I emphasize that I don't presume to have good answers to these questions. My goal here was not to supply answers, but to foster debate.

So, here's some considerations (not an actual policy)

It's instructive to look at the case of nuclear weapons, and the key analogies or disanalogies to math work. For nuclear weapons, the basic theory is pretty simple and building the hardware is the hard part, while for AI, the situation seems reversed. The hard part there is knowing what to do in the first place, not scrounging up the hardware to do it.

First, a chunk from Wikipedia

Most of the current ideas of the Teller–Ulam design came into public awareness after the DOE attempted to censor a magazine article by U.S. anti-weapons activist Howard Morland in 1979 on the "secret of the hydrogen bomb". In 1978, Morland had decided that discovering and exposing this "last remaining secret" would focus attention onto the arms race and allow citizens to feel empowered to question official statements on the importance of nuclear weapons and nuclear secrecy. Most of Morland's ideas about how the weapon worked were compiled from highly accessible sources—the drawings which most inspired his approach came from the Encyclopedia Americana. Morland also interviewed (often informally) many former Los Alamos scientists (including Teller and Ulam, though neither gave him any useful information), and used a variety of interpersonal strategies to encourage informational responses from them (i.e., asking questions such as "Do they still use sparkplugs?" even if he wasn't aware what the latter term specifically referred to)....

When an early draft of the article, to be published in The Progressive magazine, was sent to the DOE after falling into the hands of a professor who was opposed to Morland's goal, the DOE requested that the article not be published, and pressed for a temporary injunction. After a short court hearing in which the DOE argued that Morland's information was (1). likely derived from classified sources, (2). if not derived from classified sources, itself counted as "secret" information under the "born secret" clause of the 1954 Atomic Energy Act, and (3). dangerous and would encourage nuclear proliferation...

Through a variety of more complicated circumstances, the DOE case began to wane, as it became clear that some of the data they were attempting to claim as "secret" had been published in a students' encyclopedia a few years earlier....

Because the DOE sought to censor Morland's work—one of the few times they violated their usual approach of not acknowledging "secret" material which had been released—it is interpreted as being at least partially correct, though to what degree it lacks information or has incorrect information is not known with any great confidence.

So, broad takeaways from this: The Streisand effect is real. A huge part of keeping something secret is just having nobody suspect that there is a secret there to find. This is much trickier for nuclear weapons, which are of high interest to the state, while it's more doable for AI stuff (and I don't know how biosecurity has managed to stay so low-profile). This doesn't mean you can just wander around giving the rough sketch of the insight; in math, it's not too hard to reinvent things once you know what you're looking for. But AI math does have a huge advantage here: it's a really broad field and hard to search through (I think my roommate said that so many papers get submitted to NeurIPS that you couldn't read through them all in time for the next NeurIPS conference), and, in order to reinvent something from scratch without having the fundamental insight, you need to be pointed in the exact right direction, and even then you've got a good shot at missing it (see: the time lag between the earliest neural net papers and the development of backpropagation, or, in the process of making the Infra-Bayes post, stumbling across concepts that could have been found months earlier if some time-traveler had said the right three sentences at the time).

Also, secrets can get out through really dumb channels. Putting important parts of the H-bomb structure in a student's encyclopedia? Why would you do that? Well, probably because there's a lot of people in the government and people in different parts have different memories of which stuff is secret and which stuff isn't.

So, due to AI work being insight/math-based, security would be based a lot more on just... not telling people things. Or alluding to them. Although, there is an interesting possibility raised by the presence of so much other work in the field. For nuclear weapons work, things seem to be either secret or well-known among those interested in nuclear weapons. But AI has a big intermediate range between "secret" and "well-known". See all those Arxiv papers with like, 5 citations. So, for something that's kinda iffy (not serious enough (given the costs of the slowdown in research with full secrecy) to apply full secrecy, not benign enough to be comfortable giving a big presentation at NeurIPS about it), it might be possible to intentionally target that range. I don't think it's a binary between "full secret" and "full publish", there's probably intermediate options available.

Of course, if it's known that an organization is trying to fly under the radar with a result, you get the Streisand effect in full force. But, just as well-known authors may have pseudonyms, it's probably possible to just publish a paper on Arxiv (or something similar) under a pseudonym and not have it referenced anywhere by the organization as an official piece of research they funded. And it would be available for viewing and discussion and collaborative work in that form, while also (with high probability) remaining pretty low-profile.

Anyways, I'm gonna set a 10-minute timer to have thoughts about the guidelines:

Ok, the first thought I'm having is that this is probably a case where Inside View is just strictly better than Outside View. Making a policy ahead of time that can just be followed requires whoever came up with the policy to have a good classification, in advance, of all the relevant categories of result and what to do with them, and that seems pretty dang hard to do, especially because novel insights, almost by definition, are not something you expected to see ahead of time.

The next thought is that working something out for a while and then going "oh, this is roughly adjacent to something I wouldn't want to publish, when developed further" isn't quite as strong of an argument for secrecy as it looks like, because, as previously mentioned, even fairly basic additional insights (in retrospect) are pretty dang tricky to find ahead of time if you don't know what you're looking for. Roughly, the odds of someone finding the thing you want to hide scale with the number of people actively working on it, so that case seems to weigh in favor of publishing the result, but not actively publicizing it to the point where you can't befriend everyone else working on it. If one of the papers published by an organization could be built on to develop a serious result... well, you'd still have the problem of not knowing which paper it is, or what unremarked-on direction to go in to develop the result, if it was published as normal and not flagged as anything special. But if the paper got a whole bunch of publicity, the odds go up that someone puts the pieces together spontaneously. And, if you know everyone working on the paper, you've got a saving throw if someone runs across the thing.
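To make the "odds scale with the number of people working on it" point concrete, here is a minimal sketch. It assumes each person independently has the same small chance of stumbling onto the result, which is an illustrative simplification and not something claimed above; the function name and parameters are hypothetical.

```python
def rediscovery_probability(per_person_chance: float, num_people: int) -> float:
    """Chance that at least one of num_people finds the result.

    Assumption (illustrative only): each person finds it independently
    with the same probability per_person_chance.
    """
    if not 0.0 <= per_person_chance <= 1.0:
        raise ValueError("per_person_chance must be in [0, 1]")
    if num_people < 0:
        raise ValueError("num_people must be non-negative")
    # Complement of "nobody finds it".
    return 1.0 - (1.0 - per_person_chance) ** num_people


# More readers (e.g. after heavy publicity) means higher odds of rediscovery.
low_profile = rediscovery_probability(0.01, 10)
publicized = rediscovery_probability(0.01, 1000)
```

For small per-person chances the result grows roughly linearly with the number of people, which matches the rough scaling described above.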

There is a very strong argument for talking to several other people if you're unsure whether it'd be good to publish/publicize, because it reduces the problem of "person with laxest safety standards publicizes" to "organization with the laxest safety standards publicizes". This isn't a full solution, because there's still a coordination problem at the organization level, and it gives incentives for organizations to be really defensive about sharing their stuff, including safety-relevant stuff. Further work on the inter-organization level of "secrecy standards" is very much needed. But within an organization, "have personal conversation with senior personnel" sounds like the obvious thing to do.

So, current thoughts: There's some intermediate options available instead of just "full secret" or "full publish" (publish under pseudonym and don't list it as research, publish as normal but don't make efforts to advertise it broadly) and I haven't seen anyone mention that, and they seem preferable for results that would benefit from more eyes on them, that could also be developed in bad directions. I'd be skeptical of attempts to make a comprehensive policy ahead of time, this seems like a case where inside view on the details of the result would outperform an ahead-of-time policy. But, one essential aspect that would be critical on a policy level is "talk it out with a few senior people first to make the decision, instead of going straight for personal judgement", as that tamps down on the coordination problem considerably.

Publishing under a pseudonym may end up being counterproductive due to the Streisand effect. Identities behind many pseudonyms may suddenly be publicly revealed following a publication on some novel method for detecting similarities in writing style between texts.

Regarding making a policy ahead of time, I think we can have an evolving model of what ingredients are missing to get transformative AI, and some rule of thumb that says how dangerous your result is, given how much progress it makes towards each ingredient (relevant but clearly insufficient < might or might not be sufficient < plausibly a full solution), how concrete/actionable it is (abstract idea < impractical method < practical method) and how original/surprising it is (synthesis of ideas in the field < improvement on idea in the field < application of idea outside the field < completely out of the blue).
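As a rough illustration of what such a rule of thumb could look like, here is a minimal sketch. The three ordinal scales are taken from the paragraph above; everything else is an assumption made for illustration: the class and function names, the ingredient labels, and especially the choice to combine the scales by taking the worst-case ingredient and adding the levels.

```python
from enum import IntEnum


class Progress(IntEnum):
    RELEVANT_BUT_INSUFFICIENT = 0
    MIGHT_BE_SUFFICIENT = 1
    PLAUSIBLY_FULL_SOLUTION = 2


class Concreteness(IntEnum):
    ABSTRACT_IDEA = 0
    IMPRACTICAL_METHOD = 1
    PRACTICAL_METHOD = 2


class Originality(IntEnum):
    SYNTHESIS_IN_FIELD = 0
    IMPROVEMENT_IN_FIELD = 1
    APPLICATION_OUTSIDE_FIELD = 2
    OUT_OF_THE_BLUE = 3


def danger_score(progress_by_ingredient, concreteness, originality):
    """Hypothetical combination rule, not a proposed standard.

    progress_by_ingredient maps each missing ingredient (labels are
    placeholders) to a Progress level. We take the worst case over
    ingredients and add the other two ordinal levels.
    """
    worst = max(
        progress_by_ingredient.values(),
        default=Progress.RELEVANT_BUT_INSUFFICIENT,
    )
    return int(worst) + int(concreteness) + int(originality)


# Usage with placeholder ingredient labels.
score = danger_score(
    {"ingredient_a": Progress.MIGHT_BE_SUFFICIENT,
     "ingredient_b": Progress.RELEVANT_BUT_INSUFFICIENT},
    Concreteness.PRACTICAL_METHOD,
    Originality.OUT_OF_THE_BLUE,
)
```

A real rule would need an agreed way to weight the three axes; the plain sum here is only the simplest option that respects the orderings.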

One problem is, the model itself might be an infohazard. This consideration pushes towards making the guidelines secret in themselves, but that would make it much harder to debate and disseminate them. Also, the new result might have major implications for the model. So, yes, certainly there is no replacement for the inside view, but I still feel that we can have guidelines that help focus on the right considerations.

There's some intermediate options available instead of just "full secret" or "full publish"... and I haven't seen anyone mention that...

OpenAI's phased release of GPT-2 seems like a clear example of exactly this. And there is a forthcoming paper looking at the internal deliberations around this from Toby Shevlane, in addition to his extant work on the question of how disclosure potentially affects misuse.

Suppose you think that both capabilities and alignment behave like abstract quantities, i.e. real numbers.

And suppose that you think there is a threshold amount of alignment and a threshold amount of capabilities, so that the outcome is a race over which threshold is reached first.

If you also assume that the contribution of your research is fairly small, and our uncertainty about the threshold locations is high,

then we have the heuristic: only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all future research.

(note that research on how to make better chips counts as capabilities research in this model)
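The heuristic above can be written down as a one-line comparison. This is a minimal sketch under the model's own assumptions (both quantities are non-negative real numbers); the function and parameter names are hypothetical, and "better" is read as "more alignment per unit of capabilities".

```python
def should_publish(own_alignment: float, own_capability: float,
                   world_alignment: float, world_capability: float) -> bool:
    """Publish iff own alignment:capability ratio beats the rest of the world's.

    Compares own_alignment / own_capability > world_alignment / world_capability
    by cross-multiplying, which avoids dividing by zero when a result has
    no capability contribution at all. Assumes all inputs are non-negative.
    """
    if min(own_alignment, own_capability, world_alignment, world_capability) < 0:
        raise ValueError("quantities are assumed to be non-negative")
    return own_alignment * world_capability > world_alignment * own_capability


# A result contributing 3 units of alignment per unit of capabilities,
# in a world producing 1 unit of alignment per 2 units of capabilities.
publish = should_publish(3.0, 1.0, 1.0, 2.0)
```

The numbers in the usage line are made up; in practice neither side of the comparison can be measured this cleanly, so this only pins down the direction of the inequality.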

Another way to think about it is that the problems are created by research. If you don't think that "another new piece of AI research has been produced" is reason to shift probabilities of success up or down, it just moves timelines forward, then the average piece of research is neither good nor bad.

Hmm, so in this model we assume that (i) the research output of the rest of the world is known (ii) we are deciding about one result only (iii) the thresholds are unknown. In this case you are right that we need to compare our alignment : capability ratio to the rest of the world's alignment : capability ratio.

Now assume that, instead of just one result overall, you produce a single result every year. Most of the results in the sequence have alignment : capability ratio way above the rest of the world, but then there is a year in which the ratio is only barely above the rest of the world. In this case, you are better off not publishing the irregular result, even though the naive ratio criterion says to publish. We can reconcile it with the previous model by including your own research in the reference, but it creates a somewhat confusing self-reference.

Second, we can switch to modeling the research output of the rest of the world as a random walk. In this case, if the average direction of progress is pointing towards failure, then moving along this direction is net negative, since it reduces the chance to get success by luck.
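A toy version of the random-walk point, as a sketch only: it collapses the picture to one dimension, with "success" at position 0 and "failure" at a hypothetical position failure_at, and a walk that steps toward failure with probability p_toward_failure. That one-dimensional setup, the names, and the numbers are all assumptions for illustration. The formula used is the standard gambler's-ruin result.

```python
def success_probability(position: int, failure_at: int,
                        p_toward_failure: float) -> float:
    """P(reach 0 before failure_at) for a +1/-1 walk started at position.

    Each step moves +1 (toward failure) with probability p_toward_failure
    and -1 (toward success) otherwise. Uses the gambler's-ruin closed form.
    """
    if not 0 <= position <= failure_at or failure_at <= 0:
        raise ValueError("need 0 <= position <= failure_at and failure_at > 0")
    if not 0.0 < p_toward_failure < 1.0:
        raise ValueError("p_toward_failure must be strictly between 0 and 1")
    p = p_toward_failure
    q = 1.0 - p
    if p == 0.5:
        # Unbiased walk: success chance falls off linearly with position.
        return 1.0 - position / failure_at
    r = q / p
    reach_failure_first = (1.0 - r ** position) / (1.0 - r ** failure_at)
    return 1.0 - reach_failure_first


# With drift toward failure, a result that moves us one step along the
# average direction lowers the chance of getting lucky.
before = success_probability(5, 10, 0.6)
after = success_probability(6, 10, 0.6)
```

In this toy model `after` is smaller than `before`, which is the sense in which moving along the average direction is net negative when that direction points toward failure.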

I second this sentiment.

...Although maybe I would say we need "AI infohazard guidance, options, and resources" rather than an "AI infohazard policy"? I think that would better convey the attitude that we trust each other and are trying to help each other—not just because we do in fact presumably trust each other, but also because we have no choice but to trust each other... The site moderators can enforce a "policy", but if the authors don't buy in, they'll just publish elsewhere.

I was just talking about it (in reference to my own posts) a few days ago—see here. I've just been winging it, and would be very happy to have "AI infohazard guidance, options, and resources". So, I'm following this discussion with interest. :-)

Well, yes, the "policy" is meant to be entirely voluntary, since we don't have a way to enforce it. At most, there can be a mechanism of "soft enforcement" in the sense that, if you pledge to follow the "policy", this somewhat increases the trust in you, and presumably people who are sufficiently trustworthy will be privy to some unpublished knowledge and decision making. However, even then, the rules will inevitably be somewhat given to interpretation and personal judgement.

The first thing I would note is that stakeholders need to be involved in making any guidelines, and that pushing for guidelines from the outside is unhelpful, if not harmful, since it pushes participants to be defensive about their work. There is also an extensive literature discussing the general issue of information dissemination hazards and the issues of regulation in other domains, such as nuclear weapons technology, biological and chemical weapons, and similar.

There is also a fair amount of ongoing work on synthesizing this literature and the implications for AI. Some of it is even on this site. For example, see: and

So there is tons of discussion about this already, and there is plenty you should read on the topic - I suspect you can start with the paper that provided the name for your post, and continue with sections of GovAI's research agenda.

The first thing I would note is that stakeholders need to be involved in making any guidelines, and that pushing for guidelines from the outside is unhelpful, if not harmful, since it pushes participants to be defensive about their work.

Hmm, maybe I was unclear. When I said that "we need to have a public debate on the topic inside the community" I meant, the community of AI alignment researchers. So, not from the outside.

As to the links, thank you. They do seem like potentially valuable inputs into the debate, although (from skimming) they don't seem to reach the point of proposing concrete guidelines and procedures.

I think there needs to be individual decisionmaking (on the part of both organizations and individual researchers, especially in light of the unilateralist's curse) alongside a much broader discussion about how the world should handle unsafe machine learning, and more advanced AI.

I very much don't think that the AI safety community debating and coming up with shared, semi-public guidelines for, essentially, what to withhold from the broader public, done without input from the wider ML / AI and research community who are impacted and whose work is a big part of what we are discussing, would be wise. That community needs to be engaged in any such discussions.

I think a page titled "here are some tools and resources for thinking about AI-related infohazards" would be helpful and uncontroversial and feasible... That could include things like a list of trusted people in the community who have an open offer to discuss and offer feedback in confidence, and links to various articles and guidelines on the topic (without necessarily "officially" endorsing any particular approach), etc.

I agree that your proposal is well worth doing, it just sounds a lot more ambitious and long-term.

I'm not talking about guidelines for the wider AI community. I'm talking about guidelines for my own research (and presumably other alignment researchers would be interested in the same). The wider AI community doesn't share my assumptions about AI risk. In particular, I believe that most of what they're doing is actively harmful. Therefore, I don't expect them to accept these guidelines, and I'm also mostly uninterested in their input. Moreover, it's not the broader public that worries me, but precisely the broader AI community. It is from them that I want to withhold things.

Creating any sort of guidelines that the wider community would also accept is a different sort of challenge altogether. It's also a job for other people. Personally, I have enough on my plate as it is, and politics is not my comparative advantage by any margin.