Working through a small tiling result

by James Payor, 13th May 2025 (6 min read)

Comments

RogerDearnaley:

Would it help if we relaxed to accepting probabilistic evidence: a proof that the odds of a successor at generation n+1 accepting chocolate, given that the model at generation n did, are greater than 1 - epsilon_n, for some series epsilon_n such that the product of the lower bounds (1 - epsilon_n) converges to a number that is still almost one? That is, would it help if we're "almost" sure that the successors will always keep accepting chocolate? Many people might accept a P(DOOM) that was provably sufficiently low, but not provably zero.
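
For concreteness, here is a quick numeric check of that kind of schedule; the particular choice epsilon_n = 0.01/2^n is only an illustration, not something from the thread.

  import math

  # Hypothetical schedule: epsilon_n = 0.01 / 2**n, so the total failure mass
  # is about 0.02 and the product of the lower bounds stays close to one.
  epsilons = [0.01 / 2**n for n in range(60)]
  survival_lower_bound = math.prod(1 - e for e in epsilons)
  print(survival_lower_bound)  # roughly 0.98: "almost" sure across all generations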

James Payor:

In a more realistic and complicated setting, we may well want our condition for a chain of trust to be obtaining a high probability under some distribution we trust to be well-grounded. In terms of the technical difficulty I'm interested in working through, I think it should be possible to get satisfying results about proving that another proof system is correct, and whatnot, without needing to invoke probability distributions. To the extent that you can make things work with probabilistic reasoning, I think they can also be made to work in a logic setting, but we're currently missing some pieces.

James Payor:

Anyhow, regarding probability distributions, there's some philosophical difficulty in my opinion about "grounding". Specifically, what reason should I have to trust that the probability distribution is doing something sensible around my safety questions of interest? How did we construct things such that it was?

The best approach I'm aware of to building a computable (but not practical) distribution with some "grounding" results is logical induction / Garrabrant induction. Logical inductors come with a self-trust result of the form that, across time, they converge to predicting that their future selves' probabilities agree with their current probabilities. If I understand correctly, this includes limiting toward assigning conditional probability p to an event X, given that the future inductor assigns it probability p.

...however, as I understand it, there's still scope for any probability distributions we try to base on logical inductors to be "ungrounded", in that we only have a guarantee that ungrounded/adversarial perturbations must be "finite" across the limit to infinity.

Here is something more technical on the matter that I alas haven't made the personal effort to read through: https://www.lesswrong.com/posts/5bd75cc58225bf067037556d/logical-inductor-tiling-and-why-it-s-hard

Vanessa Kosoy:

IIUC, fixed point equations like that typically have infinitely many solutions. So you have defined not one good_new predicate, but an infinite family of them. Therefore, your agent will trust a copy of itself, but usually won't trust variants of itself with other choices of fixed point. In this sense, this proposal is similar to proposals based on quining (as quining has many fixed points as well).

[This comment is no longer endorsed by its author]

James Payor:

My belief is that this one was fine, because self-reference occurs only under quotation, so it can be constructed by modal fixpoint / quining. But that is why the base definition of "good" is built non-recursively.

Is that what you were talking about?

(Edit: I've updated the post to be clearer on this technical detail.)

Vanessa Kosoy:

Sorry, I was wrong. By Löb's theorem, all versions of good_new are provably equivalent, so they will trust each other.

tl;dr it seems that you can get basic tiling to work by proving that there will be safety proofs in the future, rather than trying to prove safety directly.

"Tiling" here roughly refers to a state of affairs in which we have a program that is able to prove itself safe to run. I'll use a simple problem to keep this post self-contained, but here are some links to some relevant discussion from the past.

The idea I work through below is not new; here is Giles saying it 13 years ago. It's going to be brittle as well, but it seems to me like it's relevant to a general answer for tiling. I'd appreciate engagement, pointers as to why this isn't a great solution, literature references, discussion, etc.

Setup

I like the simplicity of the problem presented by cousin_it, and I'll adapt it for this post. It starts like this:

A computer program X is asked one of two questions:

  • Would you like some chocolate?
  • Here's the source code of another program Y. Do you accept it as your successor?

You would like your program to accept the chocolate. And you would like it to accept a broad range of successors. But you need to make sure that every chain of successors continues to accept the chocolate.
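
To fix ideas, here is a minimal executable model of this interface. It is only a sketch in Python; the names (Bot, accepts_chocolate, accepts_successor) are mine, and it models bots as concrete programs rather than the provability-logic sentences used below.

  from dataclasses import dataclass
  from typing import Callable

  @dataclass
  class Bot:
      accepts_chocolate: bool                     # answer to the chocolate question
      accepts_successor: Callable[["Bot"], bool]  # answer, given a candidate successor Y

  # Two example bots: one that takes the chocolate and accepts anyone as successor,
  # and one that refuses the chocolate (the kind we must never end up with).
  reckless = Bot(accepts_chocolate=True, accepts_successor=lambda y: True)
  refuser = Bot(accepts_chocolate=False, accepts_successor=lambda y: True)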

Rather than programs, I'm going to work with "provability logic bots" in some ambient theory T. However, everything should adapt directly to programs searching for proofs.

We will write bots using sentences that become true or false to answer "do you accept the chocolate" and "do you accept Y as your successor".

Accepting provably-safe successors

A natural start, following all the work on provability-based cooperation, is to accept successors who you can prove will always still take the chocolate.

We define "good(X)" to mean "for all finite sequences of bots that start with X, if every bot accepts the next one as its successor, then every bot in the sequence accepts the chocolate". This is a way to ground out "X is good iff it accepts the chocolate and only accepts good successors"[1].

Then we can set up our bot as:

  • Bot(Chocolate):=⊤ (i.e. "true", "yes please")
  • Bot(Successor(X)):=□good(X) ("there exists a proof in theory T of good(X)")

With this definition in place, we can prove our Bot is "good", so long as we trust that the proofs it finds are correct (i.e. we have given it a sound theory T to reason with).
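
In the same toy model, Bot looks roughly like the following, with the important caveat that good_bounded is a crude finite stand-in for "there exists a proof in theory T of good(X)"; actual proof search is not a bounded enumeration over a known pool of bots.

  def make_bot(pool, max_len):
      # Always take the chocolate; accept X as successor only when the
      # (stand-in for a) safety check on X succeeds.
      return Bot(
          accepts_chocolate=True,
          accepts_successor=lambda x: good_bounded(x, pool, max_len),
      )

Note that if you add the resulting bot to its own candidate pool (with max_len ≥ 2), the naive check calls itself with no way to bottom out when asked whether the bot accepts itself. That is the finite model's version of the self-acceptance question below, which the provability-logic construction has to settle via Löb-style reasoning rather than by direct evaluation.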

Failing to prove ourself safe

Unfortunately though, constructed this way Bot cannot prove itself good using just its own theory T. So we encounter trouble when we ask if Bot will accept itself as its successor.

In particular, applying the rule that "X is good when it takes the chocolate and only admits good successors", we get:

  • Bot(Successor(Bot))↔□good(Bot)↔□(∀X.□good(X)→good(X))

If this statement were true, we could plug in a bot X that refuses chocolate and get that □(□⊥→⊥). By Löb's theorem we can then derive □⊥, which would say that our theory proves falsehood and isn't sound.
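
Spelling that step out a little (this is my own expansion, writing X⊥ for a bot that provably refuses the chocolate):

  • Instantiating the boxed statement at X⊥ gives □(□good(X⊥)→good(X⊥)).
  • T proves ¬good(X⊥), since the length-one chain consisting of just X⊥ already drops the chocolate. So good(X⊥) is provably interchangeable with ⊥, including under the box, and the instance simplifies to □(□⊥→⊥).
  • Löb's theorem gives □(□⊥→⊥)→□⊥, so we would conclude □⊥: the theory would prove its own inconsistency.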

So our Bot fails to "tile".

Regaining self-trust with a tweak

On my current theory of what's going wrong here, the issue is that we are searching for a proof of some fixed length that has to vouch for all future proofs of any length. And this will never fly, because (citation needed) a proof of fixed length can only verify proofs of bounded size, on pain of Löbian paradox[2].

So, my current theory of tiling goes on to say: we can do better if we instead use a paradigm like "we prove (using some fixed length) that whenever a new successor comes along, there will at that time be a proof (of some length) that it is safe".

Applying this, consider a new "good" notion for our bot to use, defined as follows:

  • good_new(X):=X(Chocolate)∧∀Y.X(Successor(Y))→□good_new(Y)

What this says is "X is good if it takes the chocolate, and every successor it accepts is provably good (in this same sense)"[3]. This definition is close to, but not exactly equivalent to, the original; if we trust our proofs, then good_new(X) implies good(X)[4].
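
As a rough picture of the shape of this recursion, here is a depth-indexed unrolling in the toy Python model from earlier. The essential ingredient, the box (provability in T), is dropped and replaced by a direct recursive call, so this only shows the structure of the definition, not its proof-theoretic content.

  def good_new_bounded(x, pool, k):
      # k counts how many times we are willing to unroll the definition.
      if k == 0:
          return True  # out of unrollings: make no further demands
      return x.accepts_chocolate and all(
          good_new_bounded(y, pool, k - 1)
          for y in pool
          if x.accepts_successor(y)
      )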

Then we update our bot to use the new notion in its proofs:

  • Bot_new(Successor(X)):=□good_new(X)

Now we can ask if Bot_new accepts itself as a successor:

  • P:=Bot_new(Successor(Bot_new))
  • P↔□(Bot_new(Chocolate)∧∀X.Bot_new(Successor(X))→□good_new(X))
  • P↔□(∀X.□good_new(X)→□good_new(X))

And the last line is trivial: it works!
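
In slightly more detail (my own spelling-out): ∀X.□good_new(X)→□good_new(X) is a tautology, so T proves it; hence the boxed statement □(∀X.□good_new(X)→□good_new(X)) is true (and, by necessitation, provable in T as well). That is exactly P, so Bot_new does accept itself as a successor.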

Like before, our bot can tell that all future successors it accepts will come with a proof that they're good, but now this fact is considered sufficient for trust.

But does it blend

Here we might wonder if the change made above is more like a hack that achieves some class of "quining", or if it is doing something closer to the true spirit of tiling.

Well for one thing, it is the case that □good(X)→□good_new(X)! This means that every successor accepted by our original Bot is still correctly accepted by Bot_new. (Proof sketch in footnote[4].)

And for another thing, it means that Bot_new also accepts our original Bot as a successor:

  • P:=Bot_new(Successor(Bot))
  • P↔□(Bot(Chocolate)∧∀X.Bot(Successor(X))→□good_new(X))
  • P↔□(∀X.□good(X)→□good_new(X))
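
To spell out why this last condition holds: footnote 4 sketches a T-proof of ∀X.□good(X)→□good_new(X), so the boxed statement on the right-hand side is true, hence P holds and Bot_new accepts the original Bot.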

That's good, but here's an example of something not accepted: a bot that searches for a proof that a proof exists that the successor is good. This gets us something like:

  • □(□□good_new(X)→□good_new(X))

...and that's not going to work. This makes the point that we haven't tackled giving our bot the ability to trust its own theory, even though bounded trust would work and help us out with this example.

I'd be interested if anyone has other examples of bots that really should be accepted as a successor, but aren't.[5]

Musing on what remains

The tweak here was quite problem-specific, though I think it adapts to any standard tiling setup I've seen. I feel this is a step in the right direction, moving closer to my theory of how tiling and trust should work. But it is still quite brittle.

The tweak worked here because we were able to add in another layer of box / proof-meta right before another proof search occurred. It's pretty inflexible in this way, and lacks an ability to adapt to a more organic setting. We could try to patch this by allowing it to use as much proof nesting as it wants (which I think is equivalent to something like assuming a finite number of levels of soundness?), but then I'd expect this to have its own difficulties.

On a broader note, I think the tweak we had to carefully design by hand should emerge from the underlying theory doing better tracking of proof-lengths or self-trust-well-foundedness or whatnot, taking some lesson from things like Benya's Parametric Polymorphism. And at the same time we should get some ability to trust proofs in general rather than relying on happy coincidences. I think all of the Löbian difficulties are a symptom of a failure of our current formalisms to properly account for self-knowledge, and I'd love to instead fix a lot of the problems at this level.

For another thing, I would love a proper accounting of how "proving that a proof will exist" is working as a technique, mostly to further the project of inventing better foundations. I have many thoughts on that but nothing crisper than my earlier claim: that in order to avoid diagonalizing yourself, it helps to show that "for anything that may happen, I will find a proof of some length to justify my action", rather than trying to show up front that there is a proof of known length that justifies it. I'm hoping that this is a broad principle, and not just working here because our successor is literally searching for proofs in the same way as we are.

As mentioned up top, I'd love discussion if anyone wants to engage on this stuff. And to be clear, this sort of solution is not novel and was proposed at least 13 years ago (I'm very late to the party!).

  1. ^

    The technical detail here is that the recursive definition "X accepts chocolate and only good successors" needs to be grounded out somehow, as our definition is referencing itself before it is fully defined.

    What we want is something "coinductive", where the goodness lasts arbitrarily far, as long as you could ever ask. And the way to do this is to talk about all finite depths. Hence the definition in terms of successor chains.

    Having grounded it out this way, it's still the case that it meets our recursive criterion. If X accepts chocolate and only accepts good successors, you can show its successor chains all accept chocolate. And vice versa.

  2. ^

    In my view, the key reason that Löb's theorem "works" is implicit proof compression, happening in the assumption □(□P→P), which says "there is a proof of fixed length k that a proof of P of any size implies P".

    Notably, Löb's theorem won't work if instead you have a family of proofs like ∀k.□_{k+1}(□_k P→P). Unfortunately, I don't have a clear explanation of why at this time.

  3. ^

    Technical detail: in this case we can directly construct good_new as a modal fixpoint, i.e. by quining, because it only refers to itself under a level of quotation. So we need not use a statement about successor chains to ground it.

  4. ^

    It is my understanding that □good(X)→□good_new(X), and that Sound(T)∧good_new(X)→good(X).

    My story for what is going on here is:

    1. It is hard to move from the knowledge that all successor chains accept chocolate to knowledge that this is provable, which is required by good_new for successors. This means we can't go from good(X) to good_new(X). But when we do have a proof □good(X) on hand we can construct a proof □good_new(X).
    2. Without trusting our proofs, we cannot move from the knowledge that everything in a successor chain has nested proofs that they accept chocolate to direct knowledge that they all accept chocolate. This means we can't go from good_new(X) directly to good(X), but we can if we make use of soundness.

    There is likely a cleaner way to see the relationship that I have missed; I'm interested to know if you, reader, have thoughts.

    I'll include my proof sketches below, skip these if you want to puzzle it out yourself:

    Proving the wonky equivalence of good(X) and good_new(X)

    Given a bot X such that □good(X), we have a proof that all successor-chains of X accept chocolate. From this we can prove that X accepts chocolate, and that every particular successor Y of X has □good(Y) (by specializing the proof).

    This gives us □(X(Chocolate)∧∀Y.X(Successor(Y))→□good(Y)). If we then assume on the meta level that □(∀Y.□good(Y)→□good_new(Y)), we can derive on the object level that ∀Y.□good(Y)→□good_new(Y). So we can apply Löb's theorem (aka bizarro induction over infinite meta levels?) to establish the result on the object level.

    So this establishes that □good(X)→□good_new(X). Now we consider the reverse direction.

    Suppose we have a successor chain starting at X and also that good_new(X). From this we know that the chain begins with X accepting chocolate. And for the next member of the chain Y we have □good_new(Y). Assuming soundness, we can turn the next proof into the knowledge good_new(Y). We can proceed inductively along the whole chain.

    So that establishes good_new(X)→good(X) under the assumption of soundness.

  5. ^

    For instance, probably there are pretty simple constructions that "accidentally" diagonalize the thing that Bot_new is doing. I'm interested to discuss those, since I have the sense that we really should be able to do something that is mostly "diagonalization-proof", especially if said diagonalization is "accidental" and not malicious.