What does it mean for an AGI to be 'safe'?

So8res

(Note: This post is probably old news for most readers here, but I find myself repeating this surprisingly often in conversation, so I decided to turn it into a post.)

I don't usually go around saying that I care about AI "safety". I go around saying that I care about "alignment" (although that word is slowly sliding backwards on the semantic treadmill, and I may need a new one soon).

But people often describe me as an “AI safety” researcher to others. This seems like a mistake to me, since it's treating one part of the problem (making an AGI "safe") as though it were the whole problem, and since “AI safety” is often misunderstood as meaning “we win if we can build a useless-but-safe AGI”, or “safety means never having to take on any risks”.

Following Eliezer, I think of an AGI as "safe" if deploying it carries no more than a 50% chance of killing more than a billion people:

When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. [...] Practically all of the difficulty is in getting to "less than certainty of killing literally everyone". Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'.

Notably absent from this definition is any notion of “certainty” or "proof". I doubt we're going to be able to prove much about the relevant AI systems, and pushing for proofs does not seem to me to be a particularly fruitful approach (and never has; the idea that this was a key part of MIRI’s strategy is a common misconception about MIRI).

On my models, making an AGI "safe" in this sense is a bit like finding a probabilistic circuit: if some probabilistic circuit gives you the right answer with 51% probability, then it's probably not that hard to drive the success probability significantly higher than that.
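To make the probabilistic-circuit analogy concrete, here is a minimal sketch of the standard amplification argument: run the circuit an odd number of times and take the majority answer. It assumes the runs are independent, which is an assumption of this illustration and not something the analogy itself establishes about AGI deployment. The function name `majority_success_probability` is hypothetical, chosen just for this example.

```python
from math import exp, lgamma, log


def majority_success_probability(p: float, n: int) -> float:
    """Chance that a strict majority of n independent runs is correct,
    given that each single run is correct with probability p (0 < p < 1).

    Assumption for this sketch: runs are independent and n is odd.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number so there are no ties")

    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        # Binomial term C(n, k) * p^k * (1-p)^(n-k), computed in log space
        # so that large n does not overflow or underflow.
        log_term = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p)
        )
        total += exp(log_term)
    return total


if __name__ == "__main__":
    for runs in (1, 101, 1001, 10001):
        print(runs, majority_success_probability(0.51, runs))
```

Under the independence assumption, a per-run success rate of 51% climbs above 95% once the number of runs reaches the ten-thousand range. The point of the analogy is only that a better-than-chance starting point is the hard part; it does not imply that real deployments can be repeated and voted on like this.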

If anyone can deploy an AGI that is less than 50% likely to kill more than a billion people, then they've probably... well, they've probably found a way to keep their AGI weak enough that it isn’t very useful. But if they can do that with an AGI capable of ending the acute risk period, then they've probably solved most of the alignment problem. Meaning that it should be easy to drive the probability of disaster dramatically lower.

The condition that the AI actually be useful for pivotal acts is an important one. We can already build AI systems that are “safe” in the sense that they won’t destroy the world. The hard part is creating a system that is safe and relevant.

Another concern with the term “safety” (in anything like the colloquial sense) is that the sort of people who use it often endorse the "precautionary principle" or other such nonsense that advocates never taking on risks even when the benefits clearly dominate.

In ordinary engineering, we recognize that safety isn’t infinitely more important than everything else. The goal here is not "prevent all harms from AI", the goal here is "let's use AI to produce long-term near-optimal outcomes (without slaughtering literally everybody as a side-effect)".

Currently, what I expect to happen is that humanity destroys itself with misaligned AGI. And I think we’re nowhere near knowing how to avoid that outcome. So the threat of “unsafe” AI indeed looms extremely large—indeed, this seems to be rather understating the point!—and I endorse researchers doing less capabilities work and publishing less, in the hope that this gives humanity enough time to figure out how to do alignment before it’s too late.

But I view this strategic situation as part of the larger project “cause AI to produce optimal long-term outcomes”. I continue to think it's critically important for humanity to build superintelligences eventually, because whether or not the vast resources of the universe are put towards something wonderful depends on the quality and quantity of cognition that is put to this task.

If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).

If we’re just discussing terminology, I continue to believe that “AGI safety” is much better than “AI safety”, and plausibly the least bad option.

One problem with “AI alignment” is that people use that term to refer to “making very weak AIs do what we want them to do”.
Another problem with “AI alignment” is that people take it to mean “alignment with a human” (i.e. work on ambitious value learning specifically) or “alignment with humanity” (i.e. work on CEV specifically). Thus, work on things like task AGIs and sandbox testing protocols is considered out of scope for “AI alignment”.

Of course, “AGI safety” isn’t perfect either. How can it be abused?

“they've probably found a way to keep their AGI weak enough that it isn’t very useful.” — maybe, but when we’re specifically saying “AGI”, not “AI”, that really should imply a certain level of power. Of course, if the term AGI is itself “sliding backwards on the semantic treadmill”, that’s a problem. But I haven’t seen that happen much yet (and I am fighting the good fight against it!)
The term “AGI safety” seems to rule out the possibility of “TAI that isn’t AGI”, e.g. CAIS. — Sure, but in my mind, that’s a feature, not a bug; I really don’t think that “TAI that isn’t AGI” is going to happen, and thus it’s not what I’m working on.
This quote:

If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).

Sometimes I talk about “safe and beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the larger project, and “AGI safety” as the part where we try to make AGIs that don’t kill everyone. I do think it’s useful to have different terms for those.

What is the current biggest bottleneck to an alignment solution meeting the safety bar you've described here (<50% chance of killing more than a billion)?

I'd guess Nate might say one of:

Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
Much more generally: we don't have an alignment approach that could realistically work fast (say, within ten months of inventing AGI rather than ten years), in the face of a sharp left turn, given inevitable problems like "your first system will probably be very kludgey" and "having the correct outer training signal by default results in inner misalignment" and "pivotal acts inevitably involve trusting your AGI to do a ton of out-of-distribution cognitive work".

Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)

Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals. In particular, as long as neural networks are essentially black boxes, deceptive alignment and inner alignment issues seem almost impossible to address.
