General alignment properties

TurnTrout

8th Aug 2022

AIXI and the genome are both ways of specifying intelligent agents.

Give AIXI a utility function (perhaps over observation histories), and hook it up to an environment, and this pins down a policy.^[1]
Situate the genome in the embryo within our reality, and this eventually grows into a human being with a policy of their own.
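To make "pins down a policy" concrete, here is a minimal sketch. It is a drastically simplified, one-step stand-in: real AIXI plans over a Solomonoff mixture of all computable environments and is uncomputable. The names `best_actions`, `choose_action`, `weighted_envs`, and the toy environments are all hypothetical and chosen only for illustration.

```python
import random

def best_actions(actions, weighted_envs, utility):
    """Return every action whose expected utility is maximal."""
    scores = {
        a: sum(w * utility(env(a)) for w, env in weighted_envs)
        for a in actions
    }
    top = max(scores.values())
    return [a for a in actions if scores[a] == top]

def choose_action(actions, weighted_envs, utility, rng=random):
    """Break ties uniformly at random among the maximizing actions."""
    return rng.choice(best_actions(actions, weighted_envs, utility))

# Hypothetical toy setup: two candidate environments, three actions.
# Each environment maps an action to an observation.
weighted_envs = [
    (0.75, lambda a: "diamond" if a in ("dig", "mine") else "nothing"),
    (0.25, lambda a: "nothing"),
]
utility = lambda obs: 1.0 if obs == "diamond" else 0.0  # utility over observations
actions = ["dig", "mine", "wait"]

print(best_actions(actions, weighted_envs, utility))  # ['dig', 'mine']
```

Once the utility function, the environment mixture, and the tie-breaking rule are fixed, the agent's (possibly stochastic) behavior is fully determined; nothing else about the agent is left to specify.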

These agents have different "values", in whatever sense we care to consider. However, these two agent-specification procedures also have very different general alignment properties.

General alignment properties are not about what a particular agent cares about (e.g. the AI "values" chairs). I call an alignment property "general" if the property would be interesting to a range of real-world agents trying to solve AI alignment. Here are some examples.

Terminally valuing latent objects in reality.

AIXI only "terminally values" its observations and doesn't terminally value latent objects in reality, while humans generally care about e.g. dogs (which are latent objects in reality).

Navigating ontological shifts.

Consider latent-diamond-AIXI (LDAIXI), an AIXI variant. LDAIXI's utility function scans its top 50 hypotheses (represented as Turing machines), checks each work tape for atomic representations of diamonds, and then computes the utility to be the amount of atomic diamond in the world.

If LDAIXI updates sufficiently hard towards non-atomic physical theories, then it can no longer find any utility in its top 50 hypotheses. All policies now might have equal value (zero), and LDAIXI would not continue maximizing the expected diamond content of the future. From our viewpoint, LDAIXI has failed to rebind its "goals" to its new conceptions of reality. (From LDAIXI's "viewpoint", it has Bayes-updated on its observations and continues to select optimal actions.)
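The failure mode can be sketched in a few lines. This is a toy illustration only: hypotheses are plain dictionaries rather than Turing machines, the `ontology` tag stands in for "the work tape uses an atomic representation", and weighting the diamond count by posterior weight is an assumption made here for simplicity. The function name `diamond_utility` and all field names are hypothetical.

```python
def diamond_utility(hypotheses, k=50):
    """Toy stand-in for LDAIXI's utility: inspect only the top-k hypotheses
    (by posterior weight) and count diamonds stored in an atomic format."""
    top = sorted(hypotheses, key=lambda h: h["weight"], reverse=True)[:k]
    total = 0.0
    for h in top:
        # The utility function recognizes exactly one hard-coded representation.
        if h["ontology"] == "atomic":
            total += h["weight"] * h["world"].get("diamond_atoms", 0)
    return total

# Before the shift: the atomic theory dominates the posterior.
before = [
    {"weight": 0.75, "ontology": "atomic", "world": {"diamond_atoms": 8}},
    {"weight": 0.25, "ontology": "quantum", "world": {"diamond_amplitude": 8}},
]
# After the shift: the same theories, but the posterior favors the non-atomic one.
after = [
    {"weight": 0.125, "ontology": "atomic", "world": {"diamond_atoms": 8}},
    {"weight": 0.875, "ontology": "quantum", "world": {"diamond_amplitude": 8}},
]

# k=1 plays the role of "top 50" in this two-hypothesis toy.
print(diamond_utility(before, k=1))  # 6.0
print(diamond_utility(after, k=1))   # 0.0
```

After the update, the utility is zero no matter what the agent does, because the diamond detector never fires on the hypotheses that now dominate. Every policy therefore ties, even though the world may contain just as much diamond as before.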

On the other hand, physicists do not stop caring about their friends when they learn quantum mechanics. Children do not stop caring about animals when they learn that animals are made out of cells. People seem to navigate ontological shifts pretty well.

Reflective reasoning / embeddedness.

AIXI can't think straight about how it is embedded in the world. However, people quickly learn heuristics like "If I get angry, I'll be more likely to be mean to people around me", or "If I take cocaine now, I'll be even more likely to take cocaine in the future."

Fragility of outcome value to initial conditions / Pairwise misalignment severity

This general alignment property seems important to me, and I'll write a post on it. In short: How pairwise-unaligned are two agents produced with slightly different initial hyperparameters/architectural choices (e.g. reward function / utility function / inductive biases)?
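The post does not define pairwise misalignment formally. As one possible way to operationalize the question (an assumption made here, not the author's definition), the sketch below measures how much value each agent loses when the other agent's optimum is chosen. The functions `regret` and `pairwise_misalignment` and the toy utilities are hypothetical.

```python
def regret(u_self, u_other, outcomes):
    """Value lost under u_self when the outcome is picked by an optimizer of u_other."""
    chosen = max(outcomes, key=u_other)
    best = max(u_self(o) for o in outcomes)
    return best - u_self(chosen)

def pairwise_misalignment(u_a, u_b, outcomes):
    """Return (A's loss under B's choice, B's loss under A's choice)."""
    return regret(u_a, u_b, outcomes), regret(u_b, u_a, outcomes)

# Hypothetical toy: split 10 units of resources between two goods.
outcomes = [(x, 10 - x) for x in range(11)]

# Two agents whose utility weights differ only slightly.
u_a = lambda o: 3 * o[0] + 2 * o[1]
u_b = lambda o: 2 * o[0] + 3 * o[1]

print(pairwise_misalignment(u_a, u_b, outcomes))  # (10, 10)
```

In this toy, both agents value both goods, yet each one's optimum sits at the opposite extreme of the outcome space, so each loses a third of its attainable value under the other's choice. Whether real training processes behave like this under small hyperparameter changes is exactly the open question the paragraph above raises.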

I'm excited about people thinking more about general alignment properties and about what generates those properties.

^[1] Supposing e.g. uniformly random tie-breaking for actions enabling equal expected utility.

2 comments

Gunnar_Zarncke

The main difference between LDAIXI and a human in terms of ontology seems to be that the things the human values are ultimately grounded in senses and a reward tied to that. For example, we value sweet things because we have a detector for sweetness and a reward tied to that. When our understanding of what sugar is changes the detector doesn't, and thus the ontology change works out fine. But I don't see a reason you couldn't set up LDAIXI the same way: Just specify the reward in terms of a diamond detector - or multiple ones. In the end, there are already detectors that AIXI uses - how else would it get input?

TurnTrout

Because LDAIXI doesn't e.g. have the credit assignment mechanism which propagates reward into learned values. Hutter just called it "reward." But that "reward function" is really just a utility function over observation histories, or the work tapes of the hypotheses, or whatever. Not the same as the mechanisms within people which make them have good general alignment properties.

(See also: the detached lever fallacy)
