Review

(An "exfohazard" is information that's not dangerous to know for individuals, but which becomes dangerous if widely known. Chief example: AI capability insights.)

Different alignment researchers have a wide array of different models of AI Risk, and one researcher's model may look utterly ridiculous to another researcher. Somewhere in the concept-space there exists the correct model, and the models of some of us are closer to it than those of others. But, taking on a broad view, we don't know which of us are more correct. ("Obviously it's me, and nearly everyone else is wrong," thought I and most others reading this.)

Suppose that you're talking to some other alignment researcher, and they share some claim X they're concerned may be an exfohazard. But on your model, X is preposterous. The "exfohazard" is obviously wrong, or irrelevant, or a triviality everyone already knows, or some vacuous philosophizing. Does that mean you're at liberty to share X with others? For example, write a LW post refuting the relevance of X to alignment?

Well, consider system-wide dynamics. Somewhere between all of us, there's a correct-ish model. But it looks unconvincing to nearly everyone else. If every alignment researcher feels free to leak information that's exfohazardous on someone else's model but not on their own, then either:

  • The information that's exfohazardous relative to the correct model ends up leaked as well, OR
  • Nobody shares anything with anyone outside their cluster. We're all stuck in echo chambers.

Both seem very bad. Adopting the general policy of "don't share information exfohazardous on others' models, even if you disagree with those models" prevents this.

However, that policy has an issue. Imagine if some loon approaches you on the street and tells you that you must never talk about birds, because birds are an exfohazard. Forever committing to avoid acknowledging birds' existence in conversations because of this seems rather unreasonable.

Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.

This satisfies a nice property: it means that someone telling you an exfohazard doesn't make you more likely to spread it. I. e., that makes you (mostly) safe to tell exfohazards to[1].

That seems like a generally reasonable policy to adopt, for everyone who's at all concerned about AI Risk.

  1. ^

    Obviously there's the issue of your directly using the exfohazard to e. g. accelerate your own AI research.

    Or the knowledge of it semi-consciously influencing you to follow some research direction that leads to your re-deriving it, which ends up making you think that your knowledge of it is now independent of the other researcher having shared it with you; while actually, it's not independent. So if you share it, thinking the escape clause applies, they will have made a mistake (from their perspective) by telling you.

    Still, mostly safe.

New Comment
1 comment, sorted by Click to highlight new comments since:

Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.

In an ideal world, it's good to relax this clause in some way, from a binary to a spectrum. For example: if someone tells me of a hazard that I'm confident I would've discovered one my own one week later, then they only get to dictate me not-sharing-it for a week. "Knowing" isn't a strict binary; anyone can rederive anything with enough time (maybe) — it's just a question of how long it would've taken me to find it if they didn't tell me. This can even include someone bringing my attention to something I already knew, but to which I wouldn't as quickly have thought to pay attention if they didn't bring attention to it.

In the non-ideal world we inhabit, however, it's unclear how fraught it is to use such considerations.