This text originated at a retreat in late 2018, where researchers from FHI, MIRI and CFAR did an extended double-crux on AI safety paradigms, with Eric Drexler and Scott Garrabrant at the core. Over the past two years I have tried several times to make it easier to understand, but empirically it still seems quite inadequate. As it seems unlikely I will have time to invest further work into improving it, I'm publishing it as it is, in the hope that someone else will understand the ideas even in this form and describe them more clearly.

The box inversion hypothesis consists of the following two propositions:

There exists something approximating a duality / an isomorphism between technical AI safety problems in the Agent Foundations agenda and some of the technical problems implied by the Comprehensive AI Services framing.

The approximate isomorphism holds between enough properties that some solutions to the problems in one agenda translate to solutions to problems in the other agenda.

I will start with an apology - I will not try to give my one-paragraph version of Comprehensive AI Services. It is an almost 200-page document, conveying dozens of models and intuitions, and I don’t feel I am the best person to give a short introduction. So I will just assume familiarity with CAIS. I will also not try to give my short version of the various problems which broadly fit under the Agent Foundations agenda, as I assume most readers are already familiar with them.

## 0. The metaphor: Circle inversion

People who think geometrically rather than spatially may benefit from looking at a transformation of a plane called circle inversion first. A nice explanation is here - if you have never met the transformation, pages 1-3 of the linked document should be enough.

You can think about the “circle inversion” as a geometrical metaphor for the “box inversion”.
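For readers who prefer code to pictures, here is a minimal sketch of circle inversion in Python. The function name `invert` and its parameters are illustrative choices, not taken from the linked document. It assumes the standard definition: a point P other than the centre O is sent to the point P' on the ray from O through P with |OP| · |OP'| = r².

```python
def invert(point, center=(0.0, 0.0), radius=1.0):
    """Invert `point` in the circle with the given center and radius.

    The image lies on the ray from the center through the point, at a
    distance radius**2 / (original distance). The center itself has no image.
    """
    px, py = point
    cx, cy = center
    dx, dy = px - cx, py - cy
    d2 = dx * dx + dy * dy  # squared distance from the center
    if d2 == 0:
        raise ValueError("the center of inversion has no image")
    scale = radius ** 2 / d2
    return (cx + dx * scale, cy + dy * scale)


# A point inside the unit circle is sent outside, and vice versa.
print(invert((0.5, 0.0)))  # (2.0, 0.0)
print(invert((2.0, 0.0)))  # (0.5, 0.0)

# Points on the circle of inversion stay where they are.
print(invert((1.0, 0.0)))  # (1.0, 0.0)
```

The property that matters for the metaphor is that inside and outside swap places while the boundary stays fixed, and that applying the map twice returns the original point.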

## 1. The map: Box inversion

The central claim is that there is a transformation between many of the technical problems in the Agent Foundations agenda and those in CAIS. To give you some examples:

problems with daemons <-> problems with molochs

questions about ontologies <-> questions about service catalogues

manipulating the operator <-> addictive services

some “hard core” of safety (tiling, human-compatibility, some notions of corrigibility) <-> defensive stability, layer of security services

...

The claim of the box inversion hypothesis is that this is not a set of random anecdotes, but that there is a pattern, pointing to a map between the two framings of AI safety. Note that the proposed map is not exact, and also is not a trivial transformation like replacing "agent" with "service".

To explore two specific examples in more detail:

In the classical "AI in a box" picture, we are worried about the search process creating some inner misaligned part, a sub-agent with misaligned objectives.

In the CAIS picture, one reasonable worry is the evolution of the system of services hitting the basin of attraction of a so-called moloch - a set of services which has emergent agent-like properties, and misaligned objectives.

Regarding some properties, the box inversion turns the problem “inside out”: instead of sub-agents the problem is basically with super-agents.

Regarding some abstract properties, the problem seems similar, and the only difference is where we draw the boundaries of the “system”.

## 2. Adding nuance

Using the circle inversion metaphor to guide our intuition again: some questions are transformed into exactly the same questions. For example, the question of whether two circles intersect is invariant under circle inversion. Similarly, some safety problems stay the same after the "box inversion".
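The invariance of intersection can be checked numerically. The following is a rough sketch under stated assumptions: the inversion is in a circle of radius `r` centred at the origin, no circle passes through the origin (such a circle would map to a line), and the helper names `invert_circle` and `circles_intersect` are hypothetical. It relies on the standard fact that a circle with centre c and radius ρ is mapped to a circle with centre c · r² / (|c|² − ρ²) and radius ρ · r² / | |c|² − ρ² |.

```python
import math


def invert_circle(circle, r=1.0):
    """Image of `circle` = ((cx, cy), rho) under inversion in the circle
    of radius r centred at the origin."""
    (cx, cy), rho = circle
    k = cx * cx + cy * cy - rho * rho
    if k == 0:
        raise ValueError("circle passes through the origin; its image is a line")
    s = r * r / k
    return (cx * s, cy * s), abs(s) * rho


def circles_intersect(c1, c2):
    """True if the two circles share at least one point."""
    (p1, r1), (p2, r2) = c1, c2
    d = math.dist(p1, p2)  # distance between the centres
    return abs(r1 - r2) <= d <= r1 + r2


a = ((2.0, 0.0), 1.0)
b = ((3.0, 0.0), 1.0)  # overlaps a
c = ((5.0, 0.0), 1.0)  # disjoint from a

for first, second in [(a, b), (a, c)]:
    before = circles_intersect(first, second)
    after = circles_intersect(invert_circle(first), invert_circle(second))
    print(before, after)  # the two answers agree for each pair
```

The underlying reason is that inversion is a bijection on the plane minus the centre, so a point shared by two curves is mapped to a point shared by their images, and the reverse holds too.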

This may give the incorrect impression that the agendas are actually exactly the same technical agenda, just stated in different language. This is not the case - often, the problems are the same in some properties, but different in others. (Loosely speaking, there is something like a partial isomorphism, which does not hold between all properties. Someone familiar with category theory could likely express this better.)

It is also important to note that apart from the mapping between problems, there are often differences between CAIS and AF in how they guide our intuitions on how to solve these problems. If I try to describe the intuitions informally:

CAIS is a perspective which is rooted in engineering, physics, and a "continuum" view of continuity

Agent Foundations feels, at least to me, more like it comes from science, mathematics, and a "discrete/symbolic" perspective

(Note that there is also a deep duality between science and engineering, there are several fascinating maps between "discrete/symbolic" and "continuum" pictures, and there is an intricate relation between physics and mathematics. I hope to write more on that, and on how it influences various intuitions about AI safety, in some other text.)

## 3. Implications

As an exercise, I recommend taking your favourite problem in one of the agendas and trying to translate it to the other agenda via the box inversion.

Overall, if true, I think the box inversion hypothesis provides some assurance that the field as a whole is tracking real problems, and that some seemingly conflicting views are actually closer than they appear. I hope this connection can shed some light on some of the disagreements and "cruxes" in AI safety. From the box inversion perspective, they sometimes look like arguments about whether things are inside or outside the circle of symmetry, in a space which is largely symmetric under circle inversion.

I have some hope that some problems may be more easily solvable in one view, similarly to various useful dualities elsewhere. At least in my experience, many people find it much easier to see a specific problem in one of the perspectives than in the other.

## 4. Why the name

In one view, we are worried that the box, containing the wonders of intelligence and complexity, will blow up in our face. In the other view, we are worried that the box, containing humanity and its values, with the wonders of intelligence and complexity outside, will come crashing down on our heads.
