[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

ESRogs (6y)

> Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.

To clarify, this is only if we (or the process that generated our beliefs) fall into class C, right?

Rohin Shah (6y)

No, under the current formalization, even if we are not in class C we have to trust A[C] over our own beliefs. Specifically, we need $E_{\text{us}}[X \mid \text{info}] = E_{\text{us}}[E_{A[C]}[X] \mid \text{info}]$ for any X and any information about A[C]. But then if we are given the info that $E_{A[C]}[X] = Y$, then we have:

$E_{\text{us}}[X \mid \text{info}]$

$= E_{\text{us}}[E_{A[C]}[X] \mid \text{info}]$ (definition of universality)

$= E_{\text{us}}[E_{A[C]}[X] \mid E_{A[C]}[X] = Y]$ (plugging in the specific info we have)

$= Y$ (if we are told that A[C] says Y, then we should expect that A[C] says Y)

Putting it together, we have $E_{\text{us}}[X \mid \text{info}] = Y$; that is, given the information that A[C] says Y, we must expect that the answer to X is Y.

This happens because we don't have an observer-independent way of defining epistemic dominance: even if we have access to the ground truth, we don't know how to take two sets of beliefs and say "belief set A is strictly 'better' than belief set B" [1]. So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

You could hope that in the future we have an observer-independent way of defining epistemic dominance, and then the requirement that we adopt A[C]'s beliefs would go away.
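The derivation above (from the universality condition to "given that A[C] says Y, we expect Y") can be checked on a toy model. This is only a sketch under assumptions that are not part of the formalization in this thread: we and A[C] share a prior over a finite set of world states, and A[C] has observed which cell of a partition the true state lies in. The names `prior`, `X`, and `cells_A` are made up for illustration.

```python
from fractions import Fraction

# Hypothetical toy world: four equally likely states and a quantity X.
prior = {s: Fraction(1, 4) for s in ("s1", "s2", "s3", "s4")}
X = {"s1": 0, "s2": 4, "s3": 10, "s4": 10}

# Assumption: A[C] learns which cell of this partition contains the true state.
cells_A = [{"s1", "s2"}, {"s3"}, {"s4"}]


def expectation(values, states):
    """Expectation of `values` under the shared prior, restricted to `states`."""
    mass = sum(prior[s] for s in states)
    return sum(prior[s] * values[s] for s in states) / mass


# E_{A[C]}[X] as a function of the true state.
E_A = {}
for cell in cells_A:
    report = expectation(X, cell)
    for s in cell:
        E_A[s] = report


def our_expectation_given_report(y):
    """E_us[X | E_{A[C]}[X] = y]: we condition only on what A[C] reports."""
    consistent_states = {s for s in prior if E_A[s] == y}
    return expectation(X, consistent_states)


reports = sorted(set(E_A.values()))
checks = {y: our_expectation_given_report(y) for y in reports}
for y, ours in checks.items():
    print(f"A[C] says {y}; we expect {ours}")
```

In this toy model the two singleton cells produce the same report, so hearing "10" does not tell us which of those states we are in, yet our expectation of X still equals the report. That is the behaviour the universality condition demands of us; the toy model gets it for free from the shared prior, which the actual definition does not assume.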

[1] We could say that a set of beliefs is 'strictly better' if for every quantity X its belief is more accurate, but this is unachievable, because even full Bayesian updating on true information causes you to update in the wrong direction for some quantities, just by bad luck.
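A small illustration of the "bad luck" point in the footnote, as a sketch with a made-up setup: a coin whose probability of heads is either 1/5 or 4/5, a 50/50 prior over the two, and a true bias of 4/5. Observing a single tail is true information and the update is exact Bayes, yet it moves belief away from the true hypothesis.

```python
from fractions import Fraction

# Hypothetical setup: P(heads) is either 1/5 or 4/5, with a 50/50 prior.
biases = [Fraction(1, 5), Fraction(4, 5)]
prior = {b: Fraction(1, 2) for b in biases}
true_bias = Fraction(4, 5)  # assumed ground truth for this example


def update_on_tails(belief):
    """Exact Bayesian update after observing one tail."""
    unnormalized = {b: p * (1 - b) for b, p in belief.items()}
    total = sum(unnormalized.values())
    return {b: w / total for b, w in unnormalized.items()}


posterior = update_on_tails(prior)
prior_on_truth = prior[true_bias]
posterior_on_truth = posterior[true_bias]
print(f"P(true bias) before: {prior_on_truth}, after one tail: {posterior_on_truth}")
```

With the true bias at 4/5, a tail comes up one time in five, so a correct reasoner ends up less accurate about the bias than they started in a fifth of such worlds. A definition of "strictly better" that required improvement on every quantity would rule out even this ideal updater.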

ESRogs (6y)

Hmm, maybe I'm missing something basic and should just go re-read the original posts, but I'm confused by this statement:

> So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".

In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right? If so, then what's the analogue of "trust... over"?

If we replace our beliefs with A[C]'s, then how is that us trusting it "over" c or C? It seems like it's us trusting it, full stop (without reference to any other thing that we are trusting it more than). No?

Rohin Shah (6y)

> In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right?

Yes.

> If we replace our beliefs with A[C]'s, then how is that us trusting it "over" c or C? It seems like it's us trusting it, full stop

So I only showed the case where $\text{info}$ contains information about $A[C]$'s predictions, but $\text{info}$ is allowed to contain information from $A[C]$ and $C$ (but not other agents). Even if it contains lots of information from $C$, we still need to trust $A[C]$.

In contrast, if $\text{info}$ contained information about $A[A[C]]$'s beliefs, then we would not trust $A[C]$ over that.

AI ALIGNMENT FORUM