TsviBT's Shortform

TsviBT

This is a special post for quick takes by TsviBT. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

An important thing that the AGI alignment field never understood:

Reflective stability. Everyone thinks it's about, like, getting guarantees, or something. Or about rationality and optimality and decision theory, or something. Or about how we should understand ideal agency, or something.

But what I think people haven't understood is:

1. If a mind is highly capable, it has a source of knowledge.
2. The source of knowledge involves deep change.
3. Lots of deep change implies lots of strong forces (goal-pursuits) operating on everything.
4. If there are lots of strong goal-pursuits operating on everything, nothing (properties, architectures, constraints, data formats, conceptual schemes, ...) sticks around unless it has to stick around.
5. So if you want something to stick around (such as the property "this machine doesn't kill all humans"), you have to know what sort of thing can stick around, or what sort of context makes things stick around, even when there are strong goal-pursuits around. That is a specific thing to know, because most things don't stick around.
6. The elements that stick around and help determine the mind's goal-pursuits have to do so in a way that positively makes them stick around (reflective stability of goals).

There are exceptions and nuances and possible escape routes. And the older Yudkowsky-led research about decision theory and tiling and reflective probability is relevant. But this basic argument is in some sense simpler (less advanced, but also more radical ("at the root")) than those essays. The response to the failure of those essays can't just be to "try something else about alignment"; the basic problem is still there and has to be addressed.
