Working through a small tiling result

[-]RogerDearnaley5mo*20

Would it help if we relaxed to accepting probabilistic evidence, a proof that the odds of a successor at generation n+1 accepting chocolate if the model at generation n did was provably greater than 1 - epsilon_n, for some series of epsilon_n such that the product series of all the (1 - epsilon_n) lower bound success chances converges to a number that is still almost one — i.e. if we're "almost" sure that the successors will always keep accepting chocolate? Many people might accept a P(DOOM) that was provably sufficiently low, but not provably zero.

[-]James Payor5mo10

In a more realistic and complicated setting, we may definitely want to be obtaining a high probability under some distribution we trust to be well-grounded, as our condition for a chain of trust. In terms of the technical difficulty I'm interested in working through, I think it should be possible to get satisfying results about proving that another proof system is correct, and whatnot, without needing to invoke probability distributions. To the extent that you can make things work with probabilistic reasoning, I think they can also be made to work in a logic setting, but we're currently missing some pieces.

[-]James Payor5mo10

Anyhow, regarding probability distributions, there's some philosophical difficulty in my opinion about "grounding". Specifically, what reason should I have to trust that the probability distribution is doing something sensible around my safety questions of interest? How did we construct things such that it was?

The best approach I'm aware of to building a computable (but not practical) distribution with some "grounding" results is logical induction / Garrabrant induction. They come with have a self-trust result of the form that logical inductors will, across time, converge to predicting their future selves' probabilities agree with their current probabilities. If I understand correctly, this includes limiting toward predicting a conditional probability for an event $X$ if we are given that the future inductor assigns probability $p$ .

...however, as I understand, there's still scope for any probability distributions we try to base on logical inductors to be "ungrounded", in that we only have a guarantee that ungrounded/adversarial perturbations must be "finite" across the limit to infinity.

Here is something more technical on the matter that I alas haven't made the personal effort to read through: https://www.lesswrong.com/posts/5bd75cc58225bf067037556d/logical-inductor-tiling-and-why-it-s-hard

[-]Vanessa Kosoy5mo11

IIUC, fixed point equations like that typically have infinitely many solution. So, you defined not one predicate, but an infinite family of them. Therefore, your agent will trust a copy of itself, but usually won't trust variants of itself with other choices of fixed point. In this sense, this proposal is similar to proposals based on quining (as quining has many fixed points as well).

[This comment is no longer endorsed by its author]Reply

[-]James Payor5mo*10

My belief is that this one was fine, because self-reference occurs only under quotation, so it can be constructed by modal fixpoint / quining. But that is why the base definition of "good" is built non-recursively.

Is that what you were talking about?

(Edit: I've updated the post to be clearer on this technical detail.)

[-]Vanessa Kosoy5mo20

Sorry, I was wrong. By Lob's theorem, all versions of are provably equivalent, so they will trust each other.

^{^}

The technical detail here is that the recursive definition "X accepts chocolate and only good successors" needs to be grounded out somehow, as our definition is referencing itself before it is fully defined.

What we want is something "coinductive", where the goodness lasts arbitrarily far, as long as you could ever ask. And the way to do this is to talk about all finite depths. Hence the definition in terms of successor chains.

Having grounded it out this way, it's still the case that it meets our recursive criteria. If X accepts chocolate only accepts good successors, you can show its successor chains all accept chocolate. And vice versa.

^{^}

In my view, the key reason that Löb's theorem "works" is implicit proof compression, happening in the assumption $□ (□ P \to P)$ , which says "there is a proof of fixed length $k$ that a proof of $P$ of any size implies $P$ ".

Notably, Löb's theorem won't work if instead you have a family of proofs like $\forall k . □_{k + 1} (□_{k} P \to P)$ . I unfortunately lack a clear explanation for why at this time.

^{^}

Technical detail: in this case we can directly construct $g o o d_{n e w}$ as a modal fixpoint, i.e. by quining, because it only refers to itself under a level of quotation. So we need not use a statement about successor chains to ground it.

^{^}

It is my understanding that $□ g o o d (X) \to □ g o o d_{n e w} (X)$ , and that $S o u n d (T) \land g o o d_{n e w} (X) \to g o o d (X)$ .

My story for what is going on here is:

It is hard to move from the knowledge that all successor chains accept chocolate to knowledge that this is provable, which is required by $g o o d_{n e w}$ for successors. This means we can't go from $g o o d (X)$ to $g o o d_{n e w} (X)$ . But when we do have a proof $□ g o o d (X)$ on hand we can construct a proof $□ g o o d_{n e w} (X)$ .
Without trusting our proofs, we cannot move from the knowledge that everything in a successor chain has nested proofs that they accept chocolate to direct knowledge that they all accept chocolate. This means we can't go from $g o o d_{n e w} (X)$ directly to $g o o d (X)$ , but we can if we make use of soundness.

There is a cleaner way to see the relationship that I have missed, I'm interested to know if you, reader, have thoughts.

I'll include my proof sketches below, skip these if you want to puzzle it out yourself:

Proving the wonky equivalence of $g o o d (X)$ and $g o o d_{n e w} (X)$

Given a bot $X$ such that $□ g o o d (X)$ , we have a proof that all successor-chains of $X$ accept chocolate. From this we can prove that $X$ accepts chocolate, and that every particular successor $Y$ of $X$ has $□ g o o d (Y)$ (by specializing the proof).

This gives us $□ (X (C h o c o l a t e) \land \forall Y . X (S u c c e s s o r (Y)) \to □ g o o d (Y))$ . If we then assume on the meta level that $□ (\forall Y . □ g o o d (Y) \to □ g o o d_{n e w} (Y))$ , we can derive on the object level that $\forall Y . □ g o o d (Y) \to □ g o o d_{n e w} (Y)$ . So we can apply Löb's theorem (aka bizarro induction over infinite meta levels?) to establish the result on the object level.

So this establishes that $□ g o o d (X) \to □ g o o d_{n e w} (X)$ . Now we consider the reverse direction.

Suppose we have a successor chain starting at $X$ and also that $g o o d_{n e w} (X)$ . From this we know that the chain begins with $X$ accepting chocolate. And for the next member of the chain $Y$ we have $□ {g o o d}_{n e w} (Y)$ . Assuming soundness, we can turn the next proof into the knowledge $g o o d_{n e w} (Y)$ . We can proceed inductively along the whole chain.

So that establishes $g o o d_{n e w} (X) \to g o o d (X)$ under the assumption of soundness.

^{^}

For instance, probably there are pretty simple constructions that "accidentally" diagonalize the thing that $B o t_{n e w}$ is doing. I'm interested to discuss those, since I have the sense that we really should be able to do something that is mostly "diagonalization-proof", especially if said diagonalization is "accidental" and not malicious.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

35

Working through a small tiling result

35

Setup

Accepting provably-safe successors

Failing to prove ourself safe

Regaining self-trust with a tweak

But does it blend

Musing on what remains