
posted on 2022-10-06 — also cross-posted on lesswrong, see there for comments

confusion about alignment requirements

for now, let's put aside the fact that we can't decide whether we're trying to achieve sponge coordination or FAS, and merely consider what it takes to build an aligned AI — regardless of whether it has the capability to save the world as a singleton, or is merely meant to be a useful but safe tool.

the question this post is about is: what requirements do we want such a solution to satisfy?

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds. however, their designs are actually very different from one another.

maybe one is an advanced but still overall conventional text-predicting simulator, another is a clever agentic neural net with reinforcement learning and access to a database and calculator, and the third is a novel kind of AI whose core doesn't really relate to current machine learning technology.

so, they start talking about why they think their AI is aligned. however, they run into an issue: they don't even agree on what it takes to be sure an AI is safe, let alone aligned!

and those are optimistic cases! many alignment approaches would simply:

i've noticed this pattern of confusion in myself after trying to explain alignment ideas i've found promising to some people. the nature of their criticism ("wait, where's the part that makes this lead to good worlds? why do you think it would work?") seems similar to my criticism of people who think "alignment is easy, just do X": the proposal fails to answer some fundamental concerns that the person proposing has a hard time even conceiving of.

and so, i've come to wonder: given that those people seem to be missing requirements for an alignment proposal — requirements which seem fundamental to me but are unknown unknowns to them — what requirements are unknown unknowns to me? what could i be missing? how do i know which actual requirements i'm failing to satisfy because i haven't even considered them? how do we collectively know which actual requirements we're all collectively missing? what set of requirements is necessary for an alignment proposal to satisfy, and what set is sufficient?

it feels like there ought to be a general principle that covers all of this. in the same way that the logical induction paper demonstrates that the computability desideratum and the "no dutch book" desideratum together suffice to satisfy ten other desiderata about logical inductors, it seems like a simple set of desiderata ought to capture the true name of what it means for an AI to lead to good worlds. but this isn't guaranteed, and i don't know that we'll find such a thing in time, or that we'll have any idea how to build something that satisfies those requirements.


unless explicitly mentioned, all content on this site was created by me; not by others nor by AI.