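Tiling Agents for Self-Modifying AI, and the Löbian Obstacle is worth comprehending for two reasons. First, it expresses an idea that is non-obvious and still relevant to alignment today.

Second, it provides insight into what Eliezer and Marcello thought was valuable to work on in the time preceding publication.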

When I first read this paper, I badly misunderstood it. Because it is not particularly approachable for someone not well-versed in logic, I was confidently wrong for at least a month. This post summarizes my understanding of what the Löbian Obstacle is, having re-read the paper, and why I think it is still an important idea a decade after its publication.

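An agent $A^1$ occupies a fully known, deterministic, and closed environment. $A^1$ has a goal $G$ that any outcome either satisfies or fails to satisfy, and $A^1$ prefers that $G$ be satisfied. An action $b_i \in \mathrm{Acts}^0$ performed by an agent created by $A^1$, hereafter referred to as $A^0$, must therefore satisfy the statement:

$$\bar{b}_i \Rightarrow A^0 \Vdash \bar{b}_i \rightarrow G$$

Here $\bar{b}_i$ denotes the actual performance of $b_i$, and $\Vdash$ denotes cognitive belief in the statement that follows it. Suppose $A^1$ could verify by inspection of $A^0$'s design that $\bar{b}_i \Rightarrow A^0 \Vdash \bar{b}_i \rightarrow G$ will hold, as in:

$$A^1 \Vdash A^0 \rightarrow \left(\forall b_i \in \mathrm{Acts}^0 : \bar{b}_i \rightarrow \Box_0 \lceil \bar{b}_i \rightarrow G \rceil\right)$$

Here $\Box_0 \lceil \phi \rceil$ refers to a proof of $\phi$ from the axioms of $A^0$, written $\mathrm{Axm}^0$. Even then, $A^1$ cannot conclude that the actions of $A^0$ will actually achieve $G$, as that would require:

$$A^1 \Vdash \forall b_i : \left(\Box_0 \lceil \bar{b}_i \rightarrow G \rceil \rightarrow (\bar{b}_i \rightarrow G)\right)$$

By Löb's Theorem, this is impossible if $A^0$ is to reason from axioms as strong as those of $A^1$. For it to hold, $\mathrm{Axm}^1$ would need to prove that whenever a proof of $\phi(x)$ exists in $\mathrm{Axm}^0$, $\phi(x)$ must be true, and a consistent system cannot prove this about itself for every $\phi$.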
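For reference, Löb's Theorem can be stated as follows. This is the standard textbook statement rather than a quotation from the paper, and the symbols $T$ (a consistent theory extending Peano Arithmetic), $\Box_T$ (its provability predicate), and $\bot$ (falsehood) are notation assumed here for illustration.

```latex
% Löb's Theorem: T proves "if phi is provable in T, then phi"
% only when T already proves phi outright.
\[
  T \vdash \bigl(\Box_T \lceil \phi \rceil \rightarrow \phi\bigr)
  \quad\Longrightarrow\quad
  T \vdash \phi
\]
% Special case phi = \bot: a theory that trusts its own proofs of
% falsehood would thereby prove falsehood, so it must be inconsistent.
\[
  T \vdash \bigl(\Box_T \lceil \bot \rceil \rightarrow \bot\bigr)
  \quad\Longrightarrow\quad
  T \vdash \bot
\]
```

Read this way, a creator whose successor reasons from the same axioms can endorse the trust condition only for statements it could already prove itself. Endorsing it for every statement would make those axioms inconsistent.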

The above is a brief paraphrase of section two of the original paper, which contains many additional details and complete proofs. How the Löbian Obstacle relates to simulators is my current topic of research, and this section will make the case that addressing the obstacle is an important component of designing safe simulators.

We should first consider that simulating an agent is not distinguishable from creating one, and that consequently the implications of creating dangerous agents should generalize to their simulation. Hubinger et al. (2023) have stated similar concerns, and provide a more detailed examination of the argument.

It is also crucial to understand that simulacra are not necessarily terminating, and may themselves use simulation as a heuristic for solving problems. This could result in a kind of hierarchy of simulacra. In advanced simulators capable of very complex simulations, we might expect a complex network of simulacra bound by acausal trade and 'complexity theft,' whereby one simulacrum tries to obtain more simulation complexity as a form of resource acquisition or recursive self-improvement.

I expect this to happen. Lower-complexity simulacra may still be more intelligent than their higher-complexity counterparts, and as a simulator's simulacra count grows, possibly exponentially, so does the likelihood that at least one simulacrum attempts complexity theft.
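To make the growth claim concrete, here is a rough sketch under assumptions that are mine rather than part of the argument above: each simulacrum spawns $k$ further simulacra, the hierarchy has depth $d$, and each simulacrum independently attempts complexity theft with some small probability $p$.

```latex
% Illustrative only: k, d, p and the independence assumption are hypothetical.
% Total number of simulacra in a hierarchy of depth d with branching factor k:
\[
  n(d) = \sum_{j=0}^{d} k^{j} = \frac{k^{d+1} - 1}{k - 1} \qquad (k > 1)
\]
% Probability that at least one of the n(d) simulacra attempts complexity theft:
\[
  \Pr[\text{at least one attempt}] = 1 - (1 - p)^{n(d)}
\]
% Example: p = 0.001, k = 2, d = 10 gives n = 2047
% and 1 - 0.999^{2047} \approx 0.87.
```

Under these assumptions the probability approaches 1 as the depth increases, for any fixed $p > 0$, which is why even a very unlikely behaviour becomes a concern in deep hierarchies.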

If we want safe simulators, we need the subsequent, potentially abyssal simulacra hierarchy to be aligned all the way down. Without being able to thwart the Löbian Obstacle, I doubt a formal guarantee is attainable. If we could thwart it, we may only need to simulate one aligned tiling agent, for which we might settle for a high-certainty informal guarantee of alignment if short on time. I outlined how I thought that could be done here, although I've advanced the theory considerably since and will post an updated write-up soon.

If we can't reliably thwart the Löbian Obstacle, we should consider alternatives:

Can we reliably attain high-certainty informal guarantees of alignment for arbitrarily deep simulacra hierarchies?

Is limiting the depth of simulacra hierarchies possible?

Tiling Agents for Self-Modifying AI, and the Löbian Obstacle was published in 2013 by Eliezer Yudkowsky and Marcello Herreshoff.
