
posted on 2022-10-06 — also cross-posted on lesswrong, see there for comments

confusion about alignment requirements

for now, let's put aside the fact that we can't decide whether we're trying to achieve sponge coordination or FAS, and merely consider what it takes to build an aligned AI — regardless of whether it has the capability to save the world as a singleton, or is merely meant to be a useful but safe tool.

the question this post is about is: what requirements do we want such a solution to satisfy?

let's say three groups have each built an AI which they think is aligned, and before they press the start button on it, they're trying to convince the other two that their design is safe and leads to good worlds. however, their designs are actually very different from one another.

maybe one is an advanced but still overall conventional text-predicting simulator, another is a clever agentic neural net with reinforcement learning and access to a database and calculator, and the third is a novel kind of AI whose core doesn't really relate to current machine learning technology.

so, they start talking about why they think their AI is aligned. however, they run into an issue: they don't even agree on what it takes to be sure an AI is safe, let alone aligned!

and those are optimistic cases! many alignment approaches would simply:

i've noticed this pattern of confusion in myself after trying to explain alignment ideas i've found promising to some people. the nature of their criticism ("wait, where's the part that makes this lead to good worlds? why do you think it would work?") seems similar to my criticism of people who think "alignment is easy, just do X": the proposal fails to answer some fundamental concerns that the person proposing has a hard time even conceiving of.

and so, i've come to wonder: given that those people seem to be missing requirements for an alignment proposal — requirements which seem fundamental to me but are unknown unknowns to them — what requirements are unknown unknowns to me? what could i be missing? how do i know which actual requirements i'm failing to satisfy because i haven't even considered them? how do we collectively know which actual requirements we're all collectively missing? what set of requirements is necessary for an alignment proposal to satisfy, and what set is sufficient?

it feels like there ought to be a general principle that covers all of this. in the same way that the logical induction paper demonstrates that the computability desideratum and the "no dutch book" desideratum together suffice to satisfy ten other desiderata about logical inductors, it seems like a simple set of desiderata ought to capture the true name of what it means for an AI to lead to good worlds. but this isn't guaranteed, and i don't know that we'll find such a thing in time, or that we'll have any idea how to build something that satisfies those requirements.


unless explicitly mentioned, all content on this site was created by me; not by others nor by AI.