## AI ALIGNMENT FORUMAF

Andrew Jacob Sauer

# Posts

Sorted by New

Troll Bridge

Sorry to necro this here, but I find this topic extremely interesting and I keep coming back to this page to stare at it and tie my brain in knots. Thanks for your notes on how it works in the logically uncertain case. I found a different objection based on the assumption of logical omniscience:

Regarding this you say:

Perhaps you think that the problem with the above version is that I assumed logical omniscience. It is unrealistic to suppose that agents have beliefs which perfectly respect logic. (Un)Fortunately, the argument doesn't really depend on this; it only requires that the agent respects proofs which it can see, and eventually sees the Löbian proof referenced.

However, this assumes that the Löbian proof exists. We show that the Löbian proof of A=cross→U=−10 exists by showing that the agent can prove □(A=cross→U=−10)→(A=cross→U=−10), and the agent's proof seems to assume logical omniscience:

Examining the agent, either crossing had higher expected utility, or P(cross)=0. But we assumed □(A=cross→U=−10), so it must be the latter. So the bridge gets blown up.

If here means "provable in PA", the logic does not follow through if the agent is not logically omniscient: the agent might find crossing to have a higher expected utility regardless, because it may not have seen the proof. If here instead means "discoverable by the agent's proof search" or something to that effect, then the logic here seems to follow through (making the reasonable assumption that if the agent can discover a proof for A=cross->U=-10, then it will set its expected value for crossing to -10). However, that would mean we are talking about provability in a system which can only prove finitely many things, which in particular cannot contain PA and so Löb's theorem does not apply.

I am still trying to wrap my head around exactly what this means, since your logic seems unassailable in the logically omniscient case. It is counterintuitive to me that the logically omniscient agent would be susceptible to trolling but the more limited one would not. Perhaps there is a clever way for the troll to get around this issue? I dunno. I certainly have no proof that such an agent cannot be trolled in such a way.

Deducing Impact

Rot13:

Gur vzcnpg bs na rirag ba lbh vf gur qvssrerapr orgjrra gur rkcrpgrq inyhr bs lbhe hgvyvgl shapgvba tvira pregnvagl gung gur rirag jvyy unccra, naq gur pheerag rkcrpgrq inyhr bs lbhe hgvyvgl shapgvba.

Zber sbeznyyl, jr fnl gung gur rkcrpgrq inyhr bs lbhe hgvyvgl shapgvba vf gur fhz, bire nyy cbffvoyr jbeyqfgngrf K, bs C(K)*H(K), juvyr gur rkcrpgrq inyhr bs lbhe hgvyvgl shapgvba tvira pregnvagl gung n fgngrzrag R nobhg gur jbeyq vf gehr vf gur fhz bire nyy cbffvoyr jbeyqfgngrf K bs C(K|R)*H(K). Gur vzcnpg bs R orvat gehr, gura, vf gur nofbyhgr inyhr bs gur qvssrerapr bs gubfr gjb dhnagvgvrf.

Troll Bridge

The proof doesn't work on a logically uncertain agent. The logic fails here:

Examining the source code of the agent, because we're assuming the agent crosses, either PA proved that crossing implies U=+10, or it proved that crossing implies U=0.

A logically uncertain agent does not need a proof of either of those things in order to cross, it simply needs a positive expectation of utility, for example a heuristic which says that there's a 99% chance crossing implies U=+10.

Though you did say there's a version which still works for logical induction. Do you have a link to where I can see that version of the argument?

Edit: Now I still see the logic. On the assumption that the agent crosses but also proves that U=-10, the agent must have a contradiction somewhere, because that, and the logical uncertainty agents I'm aware of have a contradiction upon proving U=-10 because they prove that they will not cross, and then immediately cross in a maneuver meant to prevent exactly this kind of problem.

Wait but proving crossing implies U=-10 does not mean that prove they will not cross, exactly because they might still cross, if they have a contradiction.

God this stuff is confusing. I still don't think the logic holds though.

Troll Bridge

Seems to me that if an agent with a reasonable heuristic for logical uncertainty came upon this problem, and was confident but not certain of its consistency, it would simply cross because expected utility would be above zero, which is a reason that doesn't betray an inconsistency. (Besides, if it survived it would have good 3rd party validation of its own consistency, which would probably be pretty useful.)