AI ALIGNMENT FORUM

Tsvi Benson-Tilsen

Comments

Open problem: thin logical priors

I agree that the epistemic formulation is probably more broadly useful, e.g. for informed oversight. The decision theory problem is additionally compelling to me because of the apparent paradox of having a changing caring measure. I naively think of the caring measure as fixed, but this is apparently impossible because, well, you have to learn logical facts. (This leads to thoughts like "maybe EU maximization is just wrong; you don't maximize an approximation to your actual caring function".)

Concise Open Problem in Logical Uncertainty

In case anyone shared my confusion:

The while loop where we ensure that eps is small enough so that

bound > bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))

is technically necessary to ensure that bad1() doesn't surpass bound, but it is immaterial in the limit. Solving

bound = bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))

gives

eps >= (1/3) (1 - e^{ -[bound - bad1()] / [next - this] })

which, using the log(1+x) ≈ x approximation (valid for small x), is about

(1/3) ([bound - bad1()] / [next - this] ).
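As a sanity check on the algebra, here is a minimal numerical sketch. The function names, `gap` (standing in for bound - bad1()), and `steps` (standing in for next - this) are hypothetical, and p1 = 2/3 is an assumption chosen only so that 1 - p1 matches the 1/3 prefactor above. The sketch solves the equality for eps, plugs the result back in, and compares it with the linear approximation.

```python
import math


def eps_at_bound(gap: float, steps: int, p1: float) -> float:
    """Solve gap = steps * log((1 - p1) / (1 - p1 - eps)) for eps.

    gap stands in for bound - bad1(), steps for next - this.
    """
    return (1.0 - p1) * (1.0 - math.exp(-gap / steps))


def eps_linear(gap: float, steps: int, p1: float) -> float:
    """First-order approximation of eps_at_bound for small gap / steps."""
    return (1.0 - p1) * gap / steps


p1 = 2.0 / 3.0  # assumed value, so that 1 - p1 = 1/3
gap, steps = 0.5, 1000  # illustrative numbers only

eps = eps_at_bound(gap, steps, p1)
# Substituting eps back into the right-hand side should recover the gap.
recovered = steps * math.log((1.0 - p1) / (1.0 - p1 - eps))
approx = eps_linear(gap, steps, p1)

print(eps, approx, recovered)
```

With these illustrative numbers the exact and linear values of eps agree to well under one part in a thousand, which is all the argument needs.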

Then Scott's comment gives the rest. I was worried that we seem to be taking the exponential of the error in our approximation, or something like that. But Scott points out that this is not an issue: we can make [next - this] as large as we want, without increasing bad1() at all, by guessing p1 for a very long time, until [bound - bad1()] / [next - this] is close enough to zero that the error is too small to matter.
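To make the last point concrete, here is a small sketch of how the relative error of the linear approximation shrinks as [next - this] grows while [bound - bad1()] stays fixed. The names `gap`, `steps`, and `linearization_error` are hypothetical, and the numbers are illustrative rather than taken from the original algorithm.

```python
import math


def linearization_error(gap: float, steps: int) -> float:
    """Relative error of replacing 1 - exp(-x) by x, where x = gap / steps."""
    x = gap / steps
    exact = -math.expm1(-x)  # 1 - exp(-x), computed accurately for small x
    return abs(x - exact) / exact


gap = 1.0  # stands in for bound - bad1(), held fixed while we keep guessing p1
step_counts = (1, 10, 100, 1000)  # stands in for next - this
errors = [linearization_error(gap, steps) for steps in step_counts]

for steps, err in zip(step_counts, errors):
    print(f"steps={steps:5d}  relative error={err:.6f}")
```

The relative error falls roughly in proportion to gap / steps, so waiting longer drives it toward zero.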

Concise Open Problem in Logical Uncertainty

Could you spell out the step

every iteration where mean(E[prev:this]) >= 2/5 will cause bound - bad1() to grow exponentially (by a factor of 11/10 = 1 + (1/2)(-1 + (2/5)/p1))

a little more? I don't follow. (I think I follow the overall structure of the proof, and if I believed this step I would believe the proof.)

We have that eps is about (2/3)(1 - exp([bad1() - bound] / [next - this])), or at least half that, but I don't see how to get a lower bound on the decrease of bad1() (as a fraction of bound - bad1()).
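For reference, the arithmetic in the quoted step can be checked exactly. Reading the quoted factor as 1 + (1/2)(-1 + (2/5)/p1), it equals 11/10 when p1 = 1/3. That value of p1 is inferred from the arithmetic, not stated in this comment, and `growth_factor` is a hypothetical helper. The same value gives 1 - p1 = 2/3, matching the prefactor in the estimate of eps above. This only confirms the constants, not the lower bound being asked about.

```python
from fractions import Fraction


def growth_factor(p1: Fraction, mean_e: Fraction = Fraction(2, 5)) -> Fraction:
    """Quoted factor 1 + (1/2) * (-1 + mean_e / p1), in exact arithmetic."""
    return 1 + Fraction(1, 2) * (-1 + mean_e / p1)


p1 = Fraction(1, 3)  # assumption: inferred so that the factor comes out to 11/10
factor = growth_factor(p1)
prefactor = 1 - p1  # the (1 - p1) multiplier in the estimate of eps

print(factor, prefactor)
```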