When does one program simulate another? When does one program approximately simulate another?

In some contexts, these questions have concrete answers. In other contexts they feel about the same as the problem of functionalist triviality. When simulation comes up in contexts relevant to agent foundations, it tends to feel like the latter. Here, I outline some arguments which I've used in the past to show that a given concept of simulation doesn't quite accomplish what it sets out to do.

For the sake of concreteness, I will give two problems which we can imagine solving with an appropriate definition of simulation.

First, consider the problem of taking counterfactuals in order to evaluate consequences of our actions. We may wish to seek copies of our own algorithm in the environment in order to counterfact on them (or merely to assume that their output will be the same as ours). Here, I assume we're trying to solve this problem by looking for sub-components (or properties) of some world model which meet some criterion; namely, forming a simulation of us according to some formal definition. We've already decided on a program P which represents us (possibly more than one, but we take interest in one at a time), and we are comparing P to (our models of) the environment in some sense. We expect to find one copy of P — ourselves — and we are looking for others.

As a second example, suppose that we have in hand a mathematical description of an AI which we are considering building. Suppose that the AI has various safety and transparency properties, but all these properties are stated in terms of the agent's beliefs and actions, where beliefs are stored in some specified format and actions are output via a specified channel. We are interested in knowing whether building the agent will lead to any "emergent subagents", and if so, whether those agents will be aligned with the original agent (and with us), and what safety and transparency properties those agents might have. In pursuit of this goal, we are attempting to prove theorems about whether certain programs P representing subagents can be simulated on certain sub-components (or properties) of our AI.

A Strawman

Again for the sake of concreteness, I'll put forward a straw definition of simulation. Suppose the program P we're searching for simulations of is represented as a Turing machine, and we enrich it with an intuitive set of counterfactuals (moving the head, changing the state, changing symbols anywhere on the tape; all of this at any time step). We thus get a causal graph (the program graph), in which we think of some vertices as input and some as output. I assume our model of the world (in the first example) or the AI (in the second) is similarly made up of components which have states and admit fairly straightforward local counterfactuals, such as removing certain groups of atoms or changing the contents of certain registers. We can thus think of the environment or the AI as a causal graph (the target graph). We claim some part of the target graph simulates the program graph if there is a mapping between the two which successfully represents all the counterfactuals. Since the target graph may be highly redundant, we allow single vertices in the program graph to correspond to collections of vertices in the target graph, and we similarly allow single states in the program graph to correspond to multiple possible states in the target graph. (Imagine a collection of voltages which all map to "0".)
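To make the straw definition slightly more concrete, here is a minimal runnable sketch of the criterion. The names (`run`, `simulates`, `encode`, `decode`) are my own and do not come from any of the proposals this post responds to; the sketch checks a finite list of counterfactuals rather than all of them, only intervenes at time zero (the definition above allows interventions at any time step), and treats a "causal graph" as a deterministic step function over labelled vertices. The `encode`/`decode` maps are allowed to be many-to-one, so that, for instance, a range of voltages can all stand for "0".

```python
from typing import Callable, Dict, Hashable, List

State = Dict[Hashable, Hashable]  # vertex label -> current value


def run(step: Callable[[State], State], init: State,
        intervention: State, steps: int) -> State:
    """Run a toy causal graph forward after clamping the intervened vertices."""
    state = {**init, **intervention}
    for _ in range(steps):
        state = step(state)
    return state


def simulates(program_step: Callable[[State], State], program_init: State,
              target_step: Callable[[State], State], target_init: State,
              encode: Callable[[State], State],
              decode: Callable[[State], State],
              interventions: List[State], steps: int) -> bool:
    """Straw criterion: every listed counterfactual on the program graph,
    pushed through `encode` onto the target graph and read back through
    `decode`, agrees with the program graph's own counterfactual run."""
    for intervention in interventions:
        p_final = run(program_step, program_init, intervention, steps)
        t_final = run(target_step, target_init, encode(intervention), steps)
        if decode(t_final) != p_final:
            return False
    return True


# Toy check: a one-bit program graph against a "voltage" target graph,
# where any voltage above 2.5 decodes to the bit 1.
p_step = lambda s: {"x": 1 - s["x"]}      # the bit flips each step
t_step = lambda s: {"v": 5.0 - s["v"]}    # the voltage flips each step
enc = lambda iv: ({"v": 5.0 * iv["x"]} if "x" in iv else {})
dec = lambda s: {"x": 1 if s["v"] > 2.5 else 0}
print(simulates(p_step, {"x": 0}, t_step, {"v": 0.0},
                enc, dec, [{}, {"x": 0}, {"x": 1}], steps=3))  # -> True
```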

This definition likely has many problems. For example, in a Turing machine virtually all causation has to move through the read/write head, and we have counterfactuals which can move the head. We probably don't care whether the environment implements the computation in a way which runs all causation through one point like this. This definition of simulation may also be trivial (in the Functionalist Triviality sense) when the target graph is large and rich enough; a target graph with enough counterfactuals will contain some with the right shape to match the program graph. For the sake of this post, just imagine someone has taken this definition as a starting point and attempted to remove some obvious problems. (The arguments I'm presenting were originally replies to a couple of different proposals of this general type.)

Monk Treehouse

An order of monks has dedicated themselves to calculating the program P. The monks live in a gigantic treehouse, and have constructed a large, abacus-like computer on which to run P. Large ceremonial stone weights are moved from one place to another in order to execute the algorithm. The weights are so heavy that moving any one of them could cause the treehouse to become unbalanced and fall down. In order to do their computation, the monks have come up with a balancing system. As a first example, I'll suppose that these monks came up with a second program P' when they first built the treehouse, and P' is executed on a separate set of stone weights on the other side of the treehouse. The program P' is different from P, but the monks were able to prove a theorem guaranteeing that the distributions of weights would balance.

Intuitively, the monks on one side of the treehouse are simulating the program P — it's actually what they set out to do. But none of the counterfactuals which we want to take on the program graph map easily to the treehouse weights. If we move a weight on one side to correspond to editing a symbol on the Turing machine's tape, then the treehouse falls over. If we move some weight on the other side to keep things balanced, the counterfactual may temporarily succeed, but the program P' will stop obeying the monks' foreordained guarantee; essentially P' may get thrown off course and fail to counterbalance the next move correctly, destroying the treehouse. If we just add hidden weights to counterbalance our change, the hidden weights could become wrong later.

When such a program P' exists, the treehouse argument is basically pointing out that the "straightforward local counterfactuals" we put in the target graph were not enough; there can be "logical entanglements" which keep us from successfully pointing at the embedding of P.

Of course, our mapping between the program graph and the target graph could stipulate that, corresponding to any counterfactual part of the program graph, a physical structure appears in the target graph which props up the tree. This is beginning to look suspicious, since the states used for counterfactuals look very different from the states corresponding to a non-counterfacted run of the program. One thing we want to avoid is a notion of simulation which could consider an empty patch of air to be computing P, simply by mapping counterfactuals to a case where a computer suddenly appears. (Such a problem would be even worse than the usual functionalist triviality problem.) When we say that a state of some part of the Turing machine corresponds to some state in the target graph, we should mean the same thing whether or not we're in a counterfactual part of the graphs.

The "computer from thin air" problem actually seems very similar to the Monk Treehouse problem. One way of trying to intuitively argue that the monks are executing the program P would be to suppose that all the monks switch to using lightweight tokens to do their computation instead of the heavy stones. This seems to be almost as thorough a change as making a counterfactual computer in empty air. Yet it seems to me that the monk treehouse is really computing P, whereas the air is not.

Suppose we have a definition of simulation which can handle this; that is, somehow the treehouse doesn't fall over under counterfactuals, but the notion of simulation doesn't build computers out of thin air. It seems to me there are still further complications along the same lines. These monks are good mathematicians, and have devoted their lives to the faithful execution of the program P. One can only imagine that they will be keeping careful records and double-checking that every stone is moved correctly. They will also look for any mathematical properties of the process which can be proven and used for further double-checking. When we attempt to map counterfactual Turing machine states to counterfactual stone positions, we will be pinned down by a complicated verification process. If we just move the stone, the monks will notice it and move it back; if we try to change the monks' beliefs too, then they will notice contradictions in their mathematics, lose faith, and abandon the calculations.

What might work would be to find every single case where a monk calculates the stone position, both before and after the actual stone movement, and map counterfactuals on the program graph to a counterfactual where every single one of these calculations is changed. This likely has problems of its own, but it's interesting as a suggestion that in order to find one simulation of the program P, you need to track down basically all simulations of P (or at least all the ones that might interact).
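As a rough illustration of what that would mean for the earlier sketch, the `encode` function would have to write a counterfactual value into every record the target graph keeps of it, not just the stone itself. The vertex names below are purely hypothetical, and `State` reuses the alias from the earlier sketch.

```python
# Hypothetical vertex groups: each program-level value is recorded
# redundantly in the target graph (the stone, the ledgers, the monks).
REDUNDANT_RECORDS = {
    "tape_cell_7": ["stone_7", "ledger_entry_7", "monk_memory_7"],
}


def encode_all_records(intervention: State) -> State:
    """Push a program-level intervention onto every target vertex that
    records the value, so no stale copy is left to 'correct' the change."""
    target_iv: State = {}
    for p_vertex, value in intervention.items():
        for t_vertex in REDUNDANT_RECORDS.get(p_vertex, []):
            target_iv[t_vertex] = value
    return target_iv
```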

To make my analogy clear: both emergent subagents within an AI and instances of one's decision algorithm in the environment could be embedded in their surroundings in ways not easily teased out by "straightforward local counterfactuals". Nonetheless, the notion of simulation here seems intuitive, and I can't help but imagine there's an appropriate definition somewhere just beyond the ones I've seen or tried. I'd be happy to look at any suggestions.
