AI ALIGNMENT FORUM
AF

133
paulom
Ω10110
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
10Natural Abstraction: Convergent Preferences Over Information Structures
2y
0
No wikitag contributions to display.
Parametrically retargetable decision-makers tend to seek power
paulom3y00

Thanks for the reply. I'll clean this up into a standalone post and/or cover this in a related larger post I'm working on, depending on how some details turn out.

What are A,B,C here?

Variables I forgot to rename, when I changed how I was labelling the arguments of f in my example. This should be 12→2, 22→3, 32→1 retargetable (as arguments i to f(i|j)).

Reply
Parametrically retargetable decision-makers tend to seek power
paulom3y60

I appreciate this generalization of the results - I think it's a good step towards showing the underlying structure involved here.

One point I want to comment on is transitivity of ≥nmost, as a relation on induced functions f:Θ→R. Namely, it isn't, and can even contain cycles of non-equivalent elements. (This came up when I was trying to apply a version of these results, and hoping that ≥nmost would be the preference relation I was looking for out of the box.) Quite possibly you noticed this since you give 'limited transitivity' in Lemma B.1 rather than full transitivity, but to give a concrete example:

Let V=⎛⎜⎝123312231⎞⎟⎠ and f(i|j)=Vij. The permutations are σ∈S3 with the usual action on {1,2,3}. Then we have [1] f1≥2mostf2≥2mostf3≥2mostf1 (and f2≱2mostf1). This also works on retargetability directly, with f being A2→B, B2→C, C2→A retargetable. Notice also that f is invariant under joint permutations (constant diagonals), and I think can be represented as EU-determined, so neither of these save it.

A narrow point is that for a non-transitive relation, I think the notation should be something other than ≥ (maybe ≽).

But more importantly, I think we would really rather a transitive (at least acyclic) relation, if we want to interpret this is 'most θ prefer' or any kind of preference / aggregation of preferences. If our theorem gives us only an intransitive relation as our conclusion, then we should tweak it.

One way you can do this: aim for a stronger relation like ≥no-m:

Definition (Orbit-mean dominance?): Let Of,A≠B(θ)={θ′∈Orbit|Θ(θ):f(A|θ′)≠f(B|θ′)}. Write f(B|θ)≥no-mf(A|θ) if ∀θ:∑Of,A≠B(θ)f(B|θ′)≥n∑Of,A≠B(θ)f(A|θ′).

Since the orbits are under Sd i.e. finite, it's easy to just sum over them. More generally, you could parameterize this with an arbitrary aggregator g:OrbitsΘ(f)→R in place of summation; I'm not sure whether this general form or the ∑ case should be the focus.

This is transitive for n=1 and acyclic for[2] n>1 (consider θ by θ); and possibly any orbit-based transitive relation is representable in basically this form[3] (with some g), since I'd guess any partial order on sets with cardinality ≤c can be represented as a pointwise inequality of functions, but I haven't thought about this too carefully.

With this notion of ≥no-m, we also need a stronger version of retargetability for the main theorem to hold. For the ∑ version, this could be

Definition (scalar-retargetability): Write f is AB−−−→scalar if there exists σ∈Sd such that for all θ with f(A|θA)−f(B|θA)=c>0 we have f(B|σθA)−f(A|σθA)≥c (and likewise multiply scalar-retargetable).

Then scalar-retargetability from A to B will imply f(B|θ)≥no-mf(A|θ).

And: I think many (all?) of the main power-seeking results are already secretly in this form. For example, θ-wise comparison of ∑θ′∈Orbit|Θ(θ)IsOptimal(X|C,θ′) gives a preference relation ≥no-m identical to the relation ≥nmost. Assuming this also works for the other rationalities, then the cases we care about were transitive all along exactly because the relations can be expressed in this way.

What do you think?



  1. We get the same single orbit {1,2,3} for all θ a.k.a. j; the orbit elements j with f(i|j)>f(i′|j) are the columns where row i > row i′. There are always two such columns when comparing row i and row i+1 (mod 3). For example, f(1,1)=1<3=f(2,1)f(1,2)=2>1=f(2,2)f(1,3)=3>2=f(2,3) ↩︎

  2. We exclude θ s.t. f(A|θ′)=f(B|θ′) in this version of the definition to match the behaviour of ≥nmost with n>1, and allow n-scalar-retargetability to imply ≥no-m. There's a case that you should include them, in which case you do get transitivity, and even the stronger property: if x≤ny≤mz, then x≤nmz. I think this corresponds to looking at likelihood ratios of P(A∧¬B)::P(B∧¬A) vs. P(A)::P(B). ↩︎

  3. Compare also what would give you a total order (instead of partial order): aggregating over all of Θ at once, like ∫Θf(A|θ)dμ(θ), instead of aggregating orbitwise at each θ. ↩︎

Reply