ARC’s current approach to ELK is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s output. In this post we’ll walk through a very simple example of using this approach to identify gliders in the game of life.
We’ll use the game of life as our example instead of real physics because it’s much simpler, but everything in the post would apply just as well to identifying “strawberry” within a model of quantum field theory. More importantly, we’re talking about identifying latent structures in physics because it’s very conceptually straightforward, but I think the same ideas apply to identifying latent structure within messier AI systems.
The game of life is a cellular automaton where an infinite grid of cells evolves over time according to simple rules. If you haven’t encountered it before, you can see the rules at wikipedia and learn more at conwaylife.com.
A glider is a particular pattern of cells. If this pattern occurs in empty space and we simulate 4 steps of the rules, we end up with the same pattern shifted one square to the right and one square down.
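The glider's period-4 diagonal motion is easy to check directly. Here is a minimal sketch (the set-of-live-cells representation and coordinate conventions are my own choices for illustration):

```python
from collections import Counter

def step(live):
    """One Game of Life step; `live` is a set of (row, col) live cells."""
    counts = Counter((r + dr, c + dc)
                     for (r, c) in live
                     for dr in (-1, 0, 1)
                     for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0))
    # A cell is alive next step if it has 3 live neighbors,
    # or 2 live neighbors and is currently alive.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# The 5-cell glider pattern.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

state = glider
for _ in range(4):
    state = step(state)

# After 4 steps the same pattern reappears, shifted one square
# down and one square to the right.
assert state == {(r + 1, c + 1) for (r, c) in glider}
```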
Let’s imagine some scientists observing the game of life via a finite set of sensors. Each sensor is located at a cell, and at each timestep the sensor reports whether its cell is empty (“dead”) or full (“alive”). For simplicity, we’ll imagine just two sensors A and B which lie on a diagonal 25 cells apart. So in any episode our scientists will observe two strings of bits, one from each sensor.
(To be more realistic we could consider physically-implemented sensors, e.g. patterns of cells in the game of life which measure what is happening by interacting with the grid and then recording the information in a computer also built inside the game of life. But that adds a huge amount of complexity without changing any of our analysis, so for now we’ll just talk about these supernatural sensors.)
These scientists don’t know how the game of life works. All they see are the reports of the sensors. They observe that sensor B often fires 100 timesteps after sensor A, which they call an “A-B pattern.” Perhaps sensor A fires on 0.1% of timesteps, and sensor B fires on 0.1% of timesteps, but A-B patterns occur on 0.01% of timesteps instead of the 0.0001% you would have expected if the events were independent.
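With these illustrative frequencies, the excess to be explained is a factor of 100:

```python
p_a = p_b = 0.001                  # each sensor fires on 0.1% of timesteps
p_ab_observed = 0.0001             # A-B patterns occur on 0.01% of timesteps
p_ab_if_independent = p_a * p_b    # 0.0001% expected if A and B were independent
lift = p_ab_observed / p_ab_if_independent
assert round(lift) == 100          # A-B patterns are ~100x more common than chance
```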
Scientists hypothesize that A-B patterns occur due to an object traveling between sensor A and sensor B. They call this hypothetical object a “glider,” but they have no idea that a glider consists of a particular pattern of 5 live cells.
To help understand the world, the scientists collect a large training set of sensor readings and train a generative model which can predict those readings, i.e. a function that maps a uniformly random string of bits ε to a sequence of observations for sensor A and sensor B. This generative model is trained to assign a high probability to the sequences in the training set.
The generative model makes good predictions, and in particular it reproduces the surprising frequency of A-B patterns. As a result, the scientists expect that “gliders” must correspond to some latent structure in the model.
Let’s suppose that the generative model uses the random seed ε to fill in a large grid, with 10% of cells alive and 90% dead, simulates the grid for 1000 timesteps, and then outputs the 1000 bits observed by each of the sensors.
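A toy version of this generative model might look as follows (the grid size, sensor placement, and number of timesteps are scaled down for illustration; the post's version uses a large grid and 1000 timesteps):

```python
import random
from collections import Counter

def step(live):
    """One Game of Life step on a set of live (row, col) cells."""
    counts = Counter((r + dr, c + dc)
                     for (r, c) in live
                     for dr in (-1, 0, 1)
                     for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0))
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

def generative_model(seed, size=40, steps=100, density=0.1,
                     sensor_a=(5, 5), sensor_b=(30, 30)):
    """Map a random seed to the bit strings reported by sensors A and B."""
    rng = random.Random(seed)
    # Fill the grid with 10% live cells, 90% dead.
    live = {(r, c) for r in range(size) for c in range(size)
            if rng.random() < density}
    obs_a, obs_b = [], []
    for _ in range(steps):
        obs_a.append(int(sensor_a in live))
        obs_b.append(int(sensor_b in live))
        live = step(live)
    return obs_a, obs_b

a_bits, b_bits = generative_model(seed=0)
```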
To find gliders, we’ll search for a “probabilistic heuristic argument” based on the presumption of independence that A-B patterns are common. We expect that argument to somehow reflect the fact that gliders travel from sensor A to sensor B. We write about formalizing probabilistic heuristic arguments here, but in the context of this post we will make use of only extremely simple arguments.
The most naive heuristic argument is to treat every cell as independent and ask: at each timestep, how likely is it that a cell is alive?
This approximation greatly overestimates the decay rate, because after the first timestep the statuses of adjacent cells are very highly correlated. It also underestimates the limiting density of living cells, because once a group of cells becomes stable it is likely to remain stable indefinitely (such stable debris is called “ash”). A more sophisticated heuristic argument could take these spatial and temporal correlations into account, but these errors aren’t important for our purposes, so we’ll keep working with this extremely simple argument.
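Both failures show up in a quick mean-field computation (a sketch; the specific numbers below are mine, not from the post):

```python
from math import comb

def next_alive_prob(p):
    """P(a cell is alive at t+1) if it and each of its 8 neighbors
    are independently alive with probability p."""
    def exactly(k):  # P(exactly k of the 8 neighbors are alive)
        return comb(8, k) * p**k * (1 - p)**(8 - k)
    birth = (1 - p) * exactly(3)               # dead cell, exactly 3 live neighbors
    survival = p * (exactly(2) + exactly(3))   # live cell, 2 or 3 live neighbors
    return birth + survival

# Starting from 10% density, the naive argument predicts a rapid
# collapse toward zero, rather than a stable nonzero "ash" density.
p = 0.10
densities = []
for _ in range(5):
    densities.append(p)
    p = next_alive_prob(p)
# densities[1] is about 0.048, and the sequence keeps shrinking.
```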
This argument predicts that sensor A and sensor B are completely independent, and so the rate of A-B patterns should be the product of (A frequency) and (B frequency). So we haven’t yet explained the surprisingly high rate of A-B patterns.
One way we can try to improve the argument to explain A-B patterns is by explicitly describing a series of events that can give rise to an A-B pattern:
So overall we conclude that A-B patterns should occur at a rate of about 0.004% per timestep. This is a massive increase compared to the naive heuristic argument. It’s still not very good, and it would be easy to improve the estimate with a bit of work, but for the purpose of this post we’ll stick with this incredibly simple argument.
Suppose that our scientists are interested in gliders and that they have found this explanation for A-B patterns. They want to use it to define a “glider-detector,” so that they can distinguish A-B patterns that are caused by gliders from A-B patterns caused by coincidence (or different mechanisms).
(Why do they care so much about gliders? I don’t know, it’s an illustrative thought experiment. In reality we’d be applying these ideas to identifying and recognizing safe and happy humans, and distinguishing observations caused by actual safe humans from observations caused by sensor tampering or the AI lying.)
This explanation is simple enough that the scientists could look at it, understand what’s going on, and figure out that a “glider” is a particular pattern of 5 cells. But a lot of work is being done by that slippery word “understand,” and it’s not clear if this approach will generalize to complicated ML systems with trillions of parameters. We’d like a fully-precise and fully-automated way to use this explanation to detect gliders in a new example.
Our explanation pointed to particular parts of the model and said “often X and Y will happen, leading to an A-B pattern” where X = “the 5 cells of a glider will appear” and Y = “nothing will get in the way.”
To test whether this explanation captures a given occurrence of an A-B pattern, we just need to check if X and Y actually happened. If so, we say the pattern is due to a glider. If not, we say it was a coincidence (or something else).
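If we can open up the model's internal trace, this check can literally be "did X and Y occur?" The sketch below is hypothetical (the trace representation, names, and the notion of a "path" are illustrative assumptions; the post doesn't specify an implementation):

```python
def pattern_due_to_glider(history, t, glider_cells, path_cells):
    """history: list of live-cell sets, one per timestep of the model's
    internal simulation. X = the glider's 5 cells were present at time t;
    Y = nothing else was in the way along the A-to-B path."""
    x_happened = glider_cells <= history[t]
    debris = history[t] & (path_cells - glider_cells)
    return x_happened and not debris

glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}
path = glider | {(5, 5)}   # cells the glider must traverse, illustrative
assert pattern_due_to_glider([glider], 0, glider, path)                  # real glider
assert not pattern_due_to_glider([glider | {(5, 5)}], 0, glider, path)   # debris in the way
```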
More generally, given an argument for why a behavior often occurs, and a particular example where the behavior occurs, we need to be able to ask “how much is this instance of the behavior captured by that argument?” It’s not obvious if this is possible for general heuristic arguments, and it’s certainly more complex than in this simple special case. We are tentatively optimistic, and we at least think that it can be done for cumulant propagation in particular (the heuristic argument scheme defined in D here). But this may end up being a major additional desideratum for heuristic arguments.
We gave a simple heuristic argument that A-B patterns occur at a rate of 0.004%. But a realistic heuristic argument might suggest many different reasons that an A-B pattern can occur. For example, a more sophisticated argument might identify three possibilities:
Now if the scientists look at a concrete example of an A-B pattern and ask if this explanation captures the example, they will get a huge number of false positives. How do they pick out the actual gliders from the other terms?
One simple thing they can do is be more specific about what observations the glider creates. Gliders cause most A-B patterns, but they cause an even larger fraction of the A-B correlation, since “A and B both fire by coincidence” doesn’t contribute to that correlation at all. Beyond that, the scientists probably had other observations that led them to hypothesize an object traveling from A to B — for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B — and they can search for explanations of these additional observations.
However, even after trying to point to gliders in particular as specifically as they can, the scientists probably can’t rule everything else out. If nothing else, it’s possible to create a machine that is trying to convince the scientists that a glider is present. Such machines are possible in the game of life (at least if we use embedded sensors), and they do explain some (vanishingly tiny!) part of the correlation between sensor A and sensor B.
So let’s assume that after specifying gliders as precisely as we can, we still have multiple explanations: perhaps gliders explain 99.99% of the A-B correlation, and acorns explain the remaining 0.01%. Of course these aren’t labeled conveniently as “gliders” and “acorns,” it’s just a big series of deductions about a generative model.
Our approach is for scientists to pick out gliders as the primary source of the A-B correlation on the training distribution. We’ll imagine they set some threshold like 99.9% and insist that gliders must explain at least 99.9% of the A-B correlation. There are two ways we can leverage this to get a glider detector rather than a glider-or-acorn detector:
Consider an example where a glider gun randomly forms and emits a sequence of gliders, each causing an A-B pattern. Intuitively we would say that each A-B pattern is caused by a glider. But the methodology we’ve described so far probably wouldn’t recognize this as an A-B pattern caused by a glider. Instead it would be characterized as an anomaly. That’s an appropriate judgment, but the scientists really wanted a glider detector, not an anomaly detector!
The problem is that our argument for A-B patterns specifies not only the fact that a glider passes from A to B, but also the fact that gliders are formed randomly at initialization. If a glider is formed in an unusual place or via an unusual mechanism, then the argument may end up not applying.
If the scientists had examples where A-B patterns were caused by glider guns, then they could include those and fix the problem by considering only arguments that capture those cases as well. But they may not have any way to get those labels, e.g. because glider guns are only ever produced in worlds involving complex AI systems who could also tamper with sensors directly in a way that fools the human labelers.
Without the ability to create glider guns on their own, how can the scientists point more specifically to the concept of “glider,” without inadvertently pointing to the entire causal process that produces A-B patterns including the events causally upstream of the glider?
One way to point to gliders is to use two sensors to triangulate: gliders are a common cause of sensor A firing and of sensor B firing, and scientists could try to pick out gliders as the “latest” common cause. This potentially gives us a way to pick out a particular point on the causal chain leading to A-B patterns, rather than indiscriminately pointing to the whole chain.
In practice they would also have many other regularities and relationships they could use to pick out gliders. But this simplest example seems to capture the main idea. We’ll try to make this idea precise in the next section.
Rejecting unusual gliders seems like it might be a serious problem — having an anomaly detector could still be useful but would not constitute a solution to ELK. So a lot depends on whether there is some way to actually point to gliders rather than just flagging anything unusual as an anomaly.
We might be able to implement this by searching over explanations that capture (i) the A-B correlation on the training set, (ii) the fact that sensor A fires on the new example, (iii) the fact that sensor B fires on the new example. We say that the A-B correlation is “potentially not due to a glider” if there is any explanation of this form which fails to capture the A-B correlation on the new example.
Our hope is that any argument which captures the A-B patterns on the training set needs to argue that “if a glider passes through A and encounters no debris, then you get an A-B pattern.” And any argument which predicts that sensor B fires on the new example needs to establish that “On the new input a glider passes through A and encounters no debris.” So an argument which does both things would necessarily need to capture the A-B pattern on the new example.
In contrast, if sensors A and B had fired for separate reasons (or because of some other mechanism like an acorn) then we could imagine an argument that separately argues that each of A and B fires without arguing that an A-B pattern occurs.
One obvious subtlety is that we are separately talking about predicting A, predicting B, and predicting the A-B correlation on a single example, and it’s not clear what it even means to talk about the A-B correlation on a single example. We think it is possible to make sense of this in the context of cumulant propagation, but we don’t know how generally that will hold.
Overall we’re not really able to evaluate this proposal concretely given the informal way we’ve talked about heuristic arguments so far, and we’re just not sure if it will work.
If this all works as we’re sketching, we can speculate about which A-B patterns would be described as “due to gliders” vs “not due to gliders”:
Suppose that for an arbitrary event X (like X = ”cell 33004 is on, cell 40092 is off…”) we can divide the argument “sometimes gliders appear and travel from A to B” into two sub-arguments, such that the first sub-argument captures the A-B pattern on inputs where X is true and the second captures it on inputs where X is false.
Then for any given new example, we could pick some property X which is true for the new example but rare on the training distribution. By selecting the sub-argument that assumes X is false, we can always find an argument explaining the training behavior but not the new example. As a result, we will always conclude that a new input is potentially anomalous.
It is possible to special-case proofs in this way: I can prove (Y is common) by proving (X is common) and (X → Y), even if the proof of X → Y completely ignores the assumption X.
We could try to rule this out by looking for the simplest argument, as mentioned above. That might be a workable approach, but it’s not the one ARC is currently exploring.
Instead we are hoping to rule this out structurally. There are two ways this could work, both of which seem plausible:
Eliezer sometimes talks about the problem of getting a very smart AI to put a strawberry on a plate. If successful, ARC’s approach to ELK should directly address that problem. In particular, it should be possible to evaluate whether a proposed action puts an actual real strawberry on a particular plate. And that solution should work even if the proposed actions are very complex, e.g. if an AI needs to carry out a complicated military campaign to defend a convoy delivering the last surviving strawberry. Moreover, it should work even if actions of similar complexity could replace the strawberry with a convincing fake (or tamper with sensors) in a way that humans would never notice.
(In Eliezer’s most recent discussions of this problem he specifies that the AI should actually make two strawberries that are “identical down to the cellular but not molecular level.” Defining “identical down to the cellular level” using our approach is more complicated and we think less interesting, though still possible; “put an actual real strawberry on a plate” is more important and more similar to the kinds of subproblems we need to solve.)
In our approach, we’d try to point to a strawberry as the cause of a certain strawberry-observation-correlation we’ve noticed: the shape, the red color, the green stem, the seeds, the flavor, and so on. So to evaluate whether an action is predicted to put a strawberry on a plate we’d evaluate whether it’s predicted to lead to all of those observations, and whether it does so based on the normal common cause.
For concreteness, let’s imagine an AI that has a probability distribution over initial states of affairs, together with a detailed physical model for how those states evolve over time. “Strawberry” exists as part of the prior over initial states — to describe the world reasonably you need to know that supermarkets contain objects with the whole correlated set of strawberry-properties. And this gives rise to a number of arguments that explain the strawberry-observation-correlation:
Now we can imagine the same approach, where we say “there is a strawberry on the plate” if we make the strawberry-observation and any argument that explains 99.9999% of the strawberry-observation-correlation also captures the strawberry-observation in this case. What would this approach classify as a strawberry vs not a strawberry?
It’s interesting to me that an atomic replica of a strawberry would clearly not be considered a strawberry. Initially I thought this seemed like a bug, but now I’m pretty convinced it’s exactly the right behavior. Similarly, if I ask my AI to move me from point A to point B, it will not consider it acceptable to kill me and instantly replace me with a perfect copy (even if from its enlightened perspective the atoms I’m made of are constantly changing and have no fixed identity anyway).
In general I’ve adopted a pretty different perspective on which abstractions we want to point to within our AI, and I no longer think of “a particular configuration of atoms that behaves like a strawberry” as a plausible candidate. Instead we want to find the thing inside the model that actually gives rise to the strawberry-correlations, whether that’s ontologically fundamental strawberries in the prior over initial states, or the correlation between different strawberries’ properties that emerges from their shared evolutionary history. None of those are preserved by making a perfect atomic copy.
FWIW, I found the Strawberry Appendix especially helpful for understanding how this approach to ELK could solve (some form of) outer alignment.
Other readers, consider looking at the appendix even if you don't feel like you fully understand the main body of the post!
+1. This section follows naturally from the rest of the article, and I don't see why it's labeled as an appendix - this seems like it would unnecessarily discourage people from reading it.
I'm convinced, I relabeled it.
“We can compute the probability that a cell is alive at timestep 1 if each of it and each of its 8 neighbors is alive independently with probability 10% at timestep 0.”
we the readers (or I guess specifically the heuristic argument itself) can do this, but the “scientists” cannot, because the “scientists don’t know how the game of life works”.
Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
Another thing confusing to me along these lines:
“for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B”
where do they (the scientists) notice these fewer live cells? Do they have some deep interpretability technique for examining the generative model and "seeing" its grid of cells?
The scientists don't start off knowing how the game of life works, but they do know how their model works.
The scientists don't need to follow along with the heuristic argument, or do any ad hoc work to "understand" that argument. But they could look at the internals of the model and follow along with the heuristic argument if they wanted to, i.e. it's important that their methods open up the model even if they never do.
Intuitively, the scientists are like us evaluating heuristic arguments about how activations evolve in a neural network without necessarily having any informal picture of how those activations correspond to the world.
This was confusing shorthand.
They notice that the A-B correlation is stronger when the A and B sensors are relatively quiet. If there are other sensors, they also notice that the A-B pattern is more common when those other sensors are quiet.
That is, I expect they learn a notion of "proximity" amongst their sensors, and an abstraction of "how active" a region is, in order to explain the fact that active areas tend to persist over time and space and to be accompanied by more 1s on sensors + more variability on sensors. Then they notice that A-B correlations are more common when the area around A and B is relatively inactive.
But they can't directly relate any of this to the actual presence of live cells. (Though they can ultimately use the same method described in this post to discover a heuristic argument explaining the same regularities they explain with their abstraction of "active," and as a result they can e.g. distinguish the case where the zone including A and B is active (and so both of them tend to exhibit more 1s and more irregularity) from the case where there is a coincidentally high degree of irregularity in those sensors or independent pockets of activity around each of A and B.)