A Self-Embedded Probabilistic Model

AI ALIGNMENT FORUM
AF

A Self-Embedded Probabilistic Model — AI Alignment Forum

One possibly-confusing point from the Embedded Agents sequence: it’s actually not difficult to write down a self-embedded world model. Just as lazy data structures can represent infinite sequences in finite space, a lazily-evaluated probabilistic model can represent a world which is larger than the data structure representing the model - including worlds in which that data structure is itself embedded. The catch is that queries on that model may not always be computable/well-defined, and even those which are computable/well-defined may take a long time to compute - e.g. more time than an agent has to make a decision.

In this post, we’ll see what this looks like with a probabilistic self-modelling Turing machine. This is not the most elegant way to picture a self-embedded probabilistic model, nor the most elegant way to think about self-modelling Turing machines, but it does make the connection from probabilistic models to quining and diagonalization explicit.

The Model

Let’s write out a Turing machine as a probabilistic model.

Pieces:

Tape
Head
k-bit-per-timestep input channel
k-bit-per-timestep output channel
k-bit-per-timestep random bit channel

I’m including an input channel so we can have data come in at every timestep, rather than just putting all the input on one tape initially (which would let the machine perform arbitrary computation between each data point arriving). Similarly with the output channel: it will have a bit in it every timestep, so the machine can’t perform arbitrary computation between output bits. This is a significant difference between this model and the usual model, and it makes this model a lot more similar to real-world computational systems, like CPUs or brains or .... It is, effectively, a bounded computation model. This will play only a minor role for current purposes, but if you’re thinking about how decision-theoretic queries work, it becomes much more relevant: the machine will only have finite time to answer a query before a decision must be made.

The relationships:

The tape-state at time and position $p o s i t i o n$ , $T a p e (t) [p o s i t i o n]$ , is a function of $T a p e (t - 1) [p o s i t i o n]$ and $H e a d (t)$ (if $p o s i t i o n = H e a d (t - 1) . p o s i t i o n$ ).
The head-state at time $t$ , $H e a d (t)$ , is a function of $H e a d (t - 1)$ , tape-state at the head’s previous position $T a p e (t - 1) [H e a d (t - 1) . p o s i t i o n]$ , and the input bits $I n (t)$ and $R a n d (t)$ .
The output-state at time $t$ , $O u t (t)$ , is a function of $H e a d (t)$ .

Also, we’ll assume that we know the initial state $h e a d_{0}$ of the head and $t a p e_{0}$ of the tape, and that only a finite section of the tape starts out nonzero. Put that all together into a probability distribution factorization, and we get

$P [T a p e, H e a d, O u t | I n, R a n d] = I [H e a d (0) = h e a d_{0}] I [T a p e [0] = t a p e_{0}] \cdot$

$\prod_{t} P [H e a d (t) | H e a d (t - 1), T a p e (t - 1) [H e a d (t - 1) . p o s i t i o n], I n (t), R a n d (t)] \cdot$

$\prod_{t} P [T a p e (t) | T a p e (t - 1), H e a d (t - 1) . p o s i t i o n, H e a d (t)] \cdot$

$\prod_{t} P [O u t (t) | H e a d (t)]$

Each line here handles the update for one component - the head, tape and output - with the initial conditions on the first line. We can also further break apart the tape-update term, since only the tape-position where the head is located depends on the head-state:

$P [T a p e (t) | T a p e (t - 1), H e a d (t - 1) . p o s i t i o n, H e a d (t)] =$

$P [T a p e (t) [H e a d (t - 1) . p o s i t i o n] | T a p e (t - 1) [H e a d (t - 1) . p o s i t i o n], H e a d (t)] \cdot$

$\prod_{p o s i t i o n \neq H e a d (t - 1) . p o s i t i o n} P [T a p e (t) [p o s i t i o n] | T a p e (t - 1) [p o s i t i o n]]$

Each of the "atomic" conditional probabilities in these formulas would be given by the machine's local update-rules - e.g. head state as a function of previous head state, state of previous tape location, and inputs, or tape state as a function of previous tape-state (usually the identity function). We can also incorporate a model of the rest of the world, which would give $P [I n | O u t]$ and $P [R a n d]$ , with $I n$ at later times depending only on $O u t$ at earlier times so that the whole thing works out to a well-behaved (i.e. acyclic) factorization.

The important thing to notice is that, while the distribution is over an infinite space, we can express it in a finite set of equations (i.e. the equations above). We can also perform ordinary probabilistic calculations using these equations: it’s just a plain old Bayes net, and there’s a bounded number of nonzero variables at each timestep. More to the point, we can hard-code a representation of these equations in the initial state of the Turing machine (quining to fit the representation of the initial state inside the representation of the model inside the initial state), and then the machine itself can perform ordinary probabilistic calculations using these equations. We can treat the equations defining the model as a lazy data structure, and run queries on them.

Diagonalizing

So, what happens if we try to diagonalize our self-modelling Turing machine? What happens if we look at the first non-null output of the machine and then flip it, and we program the machine to output the most likely value of this flipped output? Well, that scheme falls apart at “first non-null output of the machine”. There’s no guarantee that the machine ever outputs a non-null output. Let’s write this out concretely, in terms of the probabilistic model and queries involved.

We’ll assume that $O u t$ consists of two bits. The machine always outputs “00” until its computation completes, at which point it outputs either “10” if it wants to pass a logical-zero output or “11” for a logical-one output. The program on the initial tape is some ordinary probabilistic reasoning program, and we hardcode our query into it.

What should our query be? Well, we want to look at the first non-null output, so we’ll have to write something like “ $O u t (m i n_{t} t s . t . O u t (t) \neq “ 00 ”)$ ” - the minimum time t such that the output is not "00". Then we want the machine to output the least-likely value of that variable, so we’ll ask for something like

$a r g m i n_{v a l} P [O u t (m i n_{t} t s . t . O u t (t) \neq “ 00 ”) = “ 0 ” + v a l]$

Then, when the machine goes to compute $m i n_{t} t s . t . O u t (t) \neq “ 00 ”$ , the computation may not complete at all, in which case there will not be any time t which satisfies that condition.

The main thing to notice in this example is that the query itself was not well-defined. It’s not just that our machine can’t answer the query; even an outside observer with unlimited computational power would conclude that the answer is not well-defined, because the time $t$ which it asks about does not exist. Our query is trying to address a variable which doesn’t exist. The “non-halting weirdness” comes from passing in weird queries to the model, not from any weirdness in the model itself. If we stick to “normal” queries - i.e. ask about the value of a specific variable - then there isn’t any conceptual problem (though it may take a while for the query to run). So from this perspective, the central problem of self-embedded world models is not representation or interpretation of the model, but rather the algorithmic problem of expanding the set of queries we can answer “without any weirdness”.

In this example, there is another possible behavior: the machine may output a logical zero with probability ½ and a logical one with probability ½, using its random bit source. This would require a probabilistic reasoning algorithm quite different from what we normally use, but would be entirely self-consistent and non-weird. That’s an example of what it might mean to “expand the set of queries we can answer without any weirdness”.

What Queries Do We Care About?

We do not care about all queries equally. Depending on the details of our underlying model/logic, there may be lots of queries which are uncomputable/undefined, but which we don’t actually care about answering. We want a theory of embedded agents, but that does not necessarily imply that we need to handle every possible query in some very expressive logic.

So which queries do we need to handle, in order to support a minimum viable model of agency?

This is a hard question, because it depends on what decision theory we’re using, and exactly what kind of counterfactuals that decision theory contains (and of course some decision theories don’t directly use a probabilistic model at all). But there are at least some things we’ll definitely want - in particular, if we’re using a probabilistic model at all, we’ll want some way to do something like a Bayesian update. That doesn’t necessarily mean updating every probability of every state explicitly; we could update lazily, for instance, i.e. just store the input data directly and then go look it up if and when it’s actually relevant to a query. More generally, we want some data structure which summarizes whatever info we have from the inputs in a form suitable to answering whatever queries we’re interested in.

(To me, this sounds like an obvious use-case for abstraction: throw out info which is unlikely to be relevant to future queries of interest.)

Another interesting class of queries is optimization queries, of the sort needed for decision theories or game-theoretic agents. One way to think about the decision theoretic problems of embedded agency is that we want to figure out what the “right” class of queries is, in order to both (a) guarantee that the optimization query is actually solved solved correctly and quickly, and (b) get good performance in a wide variety of environments. (Of course, this isn’t the only way to think about it.)

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

16

A Self-Embedded Probabilistic Model

16

The Model

Diagonalizing

What Queries Do We Care About?