Tuesday, July 27th 2021

Shortform

Alex Turner
My power-seeking theorems [https://www.lesswrong.com/s/fSMbebQyR4wheRrvk] seem a
bit like Vingean reflection
[https://www.lesswrong.com/posts/iWwaJ5wPGLZJWjPAL/vingean-reflection-reliable-reasoning-for-self-improving].
In Vingean reflection, you reason about an agent which is significantly
smarter than you: if I'm playing chess against an opponent who plays the optimal
policy for the chess objective function, then I predict that I'll lose the game.
I predict that I'll lose, even though I can't predict my opponent's (optimal)
moves; otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which
e.g. avoid shutdown and survive into the far future, even without saying what
particular actions these policies take to get there. I may not even be able to
compute a single optimal policy for a single non-trivial objective, but I can
still reason about the statistical tendencies of optimal policies.
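
As a rough illustration of reasoning about statistical tendencies without solving for any particular policy, here is a toy sketch. It is not the theorems themselves. Everything in it is an assumption made for illustration: a hypothetical deterministic environment where the agent either enters an absorbing "shutdown" state or moves to a "hub" and then into one of several absorbing "alive" states, with state rewards drawn independently and uniformly from [0, 1]. The function names and parameters are invented here.

```python
import random

def avoids_shutdown(r_shutdown, r_hub, r_alive, gamma):
    """Return True if the optimal policy from the start state avoids shutdown.

    Rewards are collected on the state entered at each step, discounted by gamma.
    """
    # Enter the absorbing shutdown state at t=1 and stay there forever.
    shutdown_return = gamma * r_shutdown / (1 - gamma)
    # Enter the hub at t=1, then the best absorbing "alive" state from t=2 on.
    continue_return = gamma * r_hub + gamma ** 2 * max(r_alive) / (1 - gamma)
    return continue_return > shutdown_return

def fraction_avoiding_shutdown(num_alive=3, gamma=0.99, samples=10_000, seed=0):
    """Estimate how often a randomly drawn reward function prefers staying alive."""
    rng = random.Random(seed)
    count = 0
    for _ in range(samples):
        r_shutdown = rng.random()
        r_hub = rng.random()
        r_alive = [rng.random() for _ in range(num_alive)]
        count += avoids_shutdown(r_shutdown, r_hub, r_alive, gamma)
    return count / samples

if __name__ == "__main__":
    print(fraction_avoiding_shutdown())
```

In this toy setup, with gamma close to 1 the comparison is dominated by which absorbing state has the highest reward, so with k alive states the estimate should land near k/(k+1). The tendency can be read off from the structure of the environment without knowing which alive state any individual reward function prefers.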

Monday, July 26th 2021

Shortform

Suppose I am interested in finding a program M whose input-output behavior has
some property P that I can probabilistically check relatively quickly (e.g. I
want to check whether M implements a sparse cut of some large implicit graph). I
believe there is some simple and fast program M that does the trick. But even
this relatively simple M is much more complex than the specification of the
property P.

Now suppose I search for the simplest program running in time T that has
property P. If T is sufficiently large, then I will end up getting the program
"Search for the simplest program running in time T' that has property P, then
run that." (Or something even simpler, but the point is that it will make no
reference to the intended program M since encoding P is cheaper.)

I may be happy enough with this outcome, but there's some intuitive sense in
which something weird and undesirable has happened here (and I may get in a
distinctive kind of trouble if P is an approximate evaluation). I think this is
likely to be a useful maximally-simplified example to think about.
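
To make the outer search concrete, here is a minimal sketch of "find the simplest program within a time budget that passes a probabilistic check of P". All specifics are assumptions for illustration: a hypothetical toy language of postfix arithmetic over the tokens `x`, `1`, `+`, `*`; program length as the simplicity measure; a step counter standing in for T; and a stand-in property P ("the output equals x*x + 1"). This toy language has no loops and cannot express an inner search, so it shows only the shape of the outer search, not the self-referential outcome described above.

```python
import itertools
import random

TOKENS = "x1+*"  # hypothetical toy language: postfix arithmetic on one input x

def run(program, x, max_steps):
    """Evaluate a postfix program on x; return None if invalid or over budget."""
    stack = []
    for steps, tok in enumerate(program, start=1):
        if steps > max_steps:  # stand-in for the time bound T
            return None
        if tok == "x":
            stack.append(x)
        elif tok == "1":
            stack.append(1)
        else:
            if len(stack) < 2:
                return None
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a * b)
    return stack[0] if len(stack) == 1 else None

def has_property(program, max_steps, rng, trials=20):
    """Probabilistic check of the stand-in property P on random inputs."""
    for _ in range(trials):
        x = rng.randint(-50, 50)
        if run(program, x, max_steps) != x * x + 1:
            return False
    return True

def simplest_program(max_len, max_steps, seed=0):
    """Return the first program, in order of increasing length, that passes P."""
    rng = random.Random(seed)
    for length in range(1, max_len + 1):
        for toks in itertools.product(TOKENS, repeat=length):
            program = "".join(toks)
            if has_property(program, max_steps, rng):
                return program
    return None

if __name__ == "__main__":
    print(simplest_program(max_len=5, max_steps=10))
```

Note that the search only ever consults the check for P, never any notion of an "intended" program. In a language expressive enough to encode this search loop itself, with a large enough budget, nothing here would stop the shortest passing program from being another copy of the search.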
