A Mike's-Eye View of ARC's Research

Mikewins

Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like it, has begotten a host of concrete technical problems; progress on those problems has given us more philosophical clarity on the big picture, which has led to even more technical progress. The two most recent public discussions of ARC's research (Jacob's A Bird's Eye View of ARC's Research and David's Obstacles in ARC's research agenda) both came out before this flywheel really got spinning, and a lot of what we now consider central to the agenda isn't reflected in either of them. The goal of this post is to give a clear, updated picture of what we're actually trying to do. This is written from my point of view; I don't speak for my whole organization.

Here is ARC's hoped-for pipeline for aligning a powerful AI: monitor training to detect structure as it is added to the model; convert that structure into advice that improves an MSP-style mechanistic estimator of the model's behavior; use the resulting estimator, together with a description of the relevant input distribution, to estimate a safety-relevant quantity such as the probability of catastrophic failure;^[1] then optimize the model against that estimate. The key advantage over black-box evaluation is that we do not have to wait for catastrophic behavior to show up often enough in samples, or even for a single sample on which the model behaves catastrophically. Instead, we are trying to infer, from facts about the learned algorithm itself, how often rare but unacceptable behaviors are likely to occur.

To make this pipeline a reality, we need roughly the following ingredients:

Wide-ranging mechanistic estimators, in the spirit of the Matching Sampling Principle. These take a description of a computation — e.g. the weights of a neural network — and estimate some property of its behavior (e.g. expected loss on a distribution) without relying on input-output samples.
Tools for identifying structure as it is added to the weights and converting it into advice that improves the MSP estimators.
A way to deal with real-world distributions (e.g. the distribution of inputs seen by ChatGPT rather than uniformly random bits), often defined only implicitly through a large number of points.
Something to align to. We need some notion of what behavior we want to reward. The "type signature" here is a mathematically well-defined function (even a slow and impractical one) that takes in model outputs (or perhaps states of the world) and assigns them goodness scores.
(Optional) Mechanistic Anomaly Detection. A tool for determining whether model outputs look good "for the right reasons."

If these ingredients work as hoped, the resulting technology would in principle let us describe the algorithms inside a model as it is trained, flag deceptive alignment and reward hacking, and train against those flags to produce an aligned system while paying a manageable alignment tax. The plan is to treat "how often will the model cause catastrophe" as an estimation problem, build an adversarially robust estimator, and train the model until the estimate of its catastrophic behavior is acceptably small.

Matching Sampling Principle

The (average-case) MSP states that for any architecture and degree of precision, there is a mechanistic estimator that at least matches the performance of sampling over random instances of that architecture. A lot of what we do is look at various quantities and architectures, ask "what is the right way to estimate this?", and chug along until we have something that (often) far outperforms sampling. At the time of writing, one of the crown jewels of this approach is an algorithm that takes in the weights of a multilayer perceptron and outputs an estimate of the network's expected output over its input distribution, accurate to within a given additive error on average over assignments of the weights, while running more efficiently than a Monte Carlo estimator.^[2]
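As a toy illustration of what it means to estimate a quantity from the weights rather than from samples, consider a single ReLU neuron with standard Gaussian inputs. This is not ARC's MLP algorithm. The neuron, the input distribution, and the function names are all assumptions, chosen because the answer has a closed form: if $x \sim \mathcal{N}(0, I)$ then $w \cdot x \sim \mathcal{N}(0, \|w\|^2)$, so $\mathbb{E}[\mathrm{ReLU}(w \cdot x)] = \|w\| / \sqrt{2\pi}$.

```python
import math
import random


def mechanistic_relu_mean(w):
    """Closed-form E[ReLU(w . x)] for x ~ N(0, I), computed from the weights alone."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return norm / math.sqrt(2.0 * math.pi)


def sampling_relu_mean(w, num_samples, rng):
    """Monte Carlo estimate of the same quantity from random forward passes."""
    total = 0.0
    for _ in range(num_samples):
        pre_activation = sum(wi * rng.gauss(0.0, 1.0) for wi in w)
        total += max(0.0, pre_activation)
    return total / num_samples


if __name__ == "__main__":
    weights = [3.0, 4.0]  # hypothetical weights with norm 5
    rng = random.Random(0)
    print("mechanistic:", mechanistic_relu_mean(weights))
    print("sampling   :", sampling_relu_mean(weights, 10_000, rng))
```

The mechanistic version never evaluates the neuron on an input. The sampling version needs more forward passes for every extra digit of precision.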

This type of research is extremely parallelizable; it's also great for parcelling out to academics in different communities who each have their own function classes they understand well. We can ask one academic to extend our MLP work to transformers, another to think about Turing Machines, another to think about some weird thing that shows up in condensed matter physics. At this stage in ARC's development, we still learn new and important lessons every time we make an MSP estimator for a new architecture.

We use the word "mechanistic", even though we don't have a clear definition of it. I'm going to be up-front that after reading this writeup, you will not have a complete sense of this concept, but I hope you'll have some idea of what it's pointing at.

The definition ARC gives is usually something like "never assume the input-output behavior you haven't seen looks like the behavior you have seen, for any object." For instance:

Never assume that because a model's loss was low on 100 random inputs, its average loss is low.
Never assume that because the activation of neuron A in layer 6 is correlated with neuron B in layer 8 on 100 samples of the input distribution, they are correlated.
Never assume that because the activation of neuron A in layer 6 is correlated with neuron B in layer 8 on 100 samples drawn from a mechanistically calculated representation of the activations on layer 4, they are correlated.
Never assume that if you construct a ridiculously complicated object mechanistically from the model, and give that object 100 samples, the sample average equals that ridiculously complicated object's average behavior.

Hopefully, this definition gives some sense of why mechanisticness might be useful for dealing with deceptive alignment. In some sense, not only are we never trusting the input-output behavior of the full model, we are never trusting the input-output behavior of components of the model, or ridiculously complicated objects mechanistically derived from the model.

Here's another property I associate with mechanisticness, which I suspect is much weaker. To me, it's all about taking a big, terrifying object and breaking it down into pieces that can't conspire against you. One (extremely bad) approach you might take is to say "well a single neuron can't be deceptively aligned, so I'll analyze the model one neuron at a time, and then there won't be deceptive alignment." This approach would fail, of course, because deceptive alignment, like all cognition, doesn't live in any single component — it lives in how the components interact. So decomposition into simple pieces isn't enough on its own; we need a decomposition where the pieces also can't conspire. Each piece has to be both individually benign and unable to coordinate with others toward a bad outcome. A way to operationalize this is to require that each of the things you look at is quite simple, and also require that they are independent (for instance, an individual neuron's behavior given the preceding activations is simple, but is correlated both with its behavior on other inputs and with the behavior of other neurons). Many of our mechanistic algorithms involve expressing our final estimate as a combination of terms which are subjectively^[3] uncorrelated with each other.

This is something we can actually give concrete examples of. Suppose we want to estimate what fraction of numbers below $N$ have an even number of prime factors. Concretely, we want to estimate

$$S = \frac{1}{N} \sum_{n=1}^{N} \lambda(n),$$

where the Liouville function $\lambda(n)$ is $+1$ if $n$ has an even number of prime factors (counted with multiplicity) and $-1$ if $n$ has an odd number of prime factors; the fraction itself is then $(1 + S)/2$. One strategy here would be to sample $k$ inputs $n_1, \dots, n_k$ at random between $1$ and $N$, calculate $\lambda$ for each of them, and return $\frac{1}{k} \sum_{i=1}^{k} \lambda(n_i)$. This technique assumes the inputs we haven't seen look like the ones we have seen. It is not mechanistic.

A different strategy is to presume that $\lambda(n)$ is $+1$ about half the time, and that the values $\lambda(n)$ for distinct values of $n$ are subjectively uncorrelated.^[4] Since $\lambda(n)$ is $+1$ half the time and $-1$ half the time, we can start with an estimate of zero. There are a number of ways one could proceed here; one way is to select a subset of $k$ values of $n$ (perhaps the values from 1 to $k$, or perhaps random numbers) and evaluate $\lambda$ on those inputs. Since the value of $\lambda$ at different points is subjectively uncorrelated, knowing that $\lambda(n)$ is $+1$ doesn't change our guess for other values of $n$; it only moves our estimate for $S$ by $1/N$. A valid mechanistic estimator of $S$ after we've evaluated $\lambda$ at $k$ points $n_1, \dots, n_k$ is $\frac{1}{N} \sum_{i=1}^{k} \lambda(n_i)$. Note that this only differs from the sampling estimator by the factor out front: $\frac{1}{N}$ instead of $\frac{1}{k}$.
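The two estimators can be put side by side in a short sketch. The names here are assumptions made for illustration: `N` is the upper limit, `k` is the number of evaluated points, and the function names are hypothetical. The Liouville function is computed by plain trial division, which is only practical for small inputs.

```python
import random


def liouville(n):
    """Liouville function: +1 if n has an even number of prime factors
    (counted with multiplicity), -1 otherwise."""
    count = 0
    p = 2
    while p * p <= n:
        while n % p == 0:
            n //= p
            count += 1
        p += 1
    if n > 1:  # whatever is left is a single prime factor
        count += 1
    return 1 if count % 2 == 0 else -1


def sampling_estimate(points):
    """Average of lambda over the evaluated points; extrapolates to unseen inputs."""
    return sum(liouville(n) for n in points) / len(points)


def mechanistic_estimate(points, N):
    """Start from the prior estimate 0 and add lambda(n)/N for each evaluated point;
    unevaluated points keep their prior contribution of zero."""
    return sum(liouville(n) for n in points) / N


def exact_value(N):
    """Brute-force (1/N) * sum of lambda(n) for n <= N, for comparison only."""
    return sum(liouville(n) for n in range(1, N + 1)) / N


if __name__ == "__main__":
    N, k = 20_000, 500
    rng = random.Random(0)
    points = rng.sample(range(1, N + 1), k)  # distinct points, chosen at random
    print("sampling   :", sampling_estimate(points))
    print("mechanistic:", mechanistic_estimate(points, N))
    print("exact      :", exact_value(N))
```

When `k` is much smaller than `N`, the mechanistic estimate barely moves from its prior of zero, because evaluating `k` points pins down only `k` of the `N` terms in the sum. The sampling estimate treats those `k` points as representative of all the rest.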

I want to make several points here.

This sort of estimator could have terrible performance. We made two heuristic assumptions^[5] when we derived this: a mean-zero assumption and an independence assumption for $\lambda$. A 19th century mathematician wouldn't have been able to prove either of those facts, and even today we only know a proof for the first one. ARC doesn't necessarily claim that future mathematicians will eventually prove the independence claim. But we do argue that if the independence claim is false, there is a heuristic argument one could give as advice that could be incorporated into the estimator and make it work. For more discussion of this, see the next section.
This sort of estimator can be randomized. I didn't specify how we chose the subset of points to analyze in our estimator. They could absolutely have been chosen randomly. Using randomness to determine what aspects of a system to analyze more carefully doesn't violate mechanisticness. The only violation would be to assume the things we haven't carefully examined yet look like the things we have.

Although we've only had the Matching Sampling Principle for about a year, the idea is descended from much older concepts like heuristic estimation.

Identifying Structure / Plugging Structure into MSP

This is among the oldest parts of ARC's agenda: the idea that anytime you have a strange behavior you wouldn't predict heuristically (e.g. a finite number of twin primes, or a random reversible circuit evaluating to the identity, or a neural net being many standard deviations from our MSP estimate that works for random NNs^[6]), there must be some structural reason for it. This has been referred to as the No-Coincidence Principle.

We certainly don't have a periodic table of every type of structure, a good way of encoding it, or a complete understanding of what to do when structure is pointed out. Our current best guess is that the best way to communicate structure is with certain types of compression, and ideas inspired by Kolmogorov Complexity and Sophistication.^[7]

It's a simple exercise to show that any time a model has any strange property, this is reflected in a lower Kolmogorov complexity. But a stronger pattern we've noticed is that when mechanistic estimators fail, some sort of resource-bounded complexity or sophistication should come into play.
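Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a crude, computable upper bound on it (up to the fixed size of the decompressor). The sketch below uses `zlib` as that stand-in. This choice is an assumption made for illustration and is not ARC's notion of resource-bounded complexity or sophistication. It also hints at why resource bounds matter: the pseudorandom bytes are generated by a tiny program plus a seed, so their true Kolmogorov complexity is low, yet a general-purpose compressor finds no structure in them.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib encoding: a crude, computable upper-bound proxy
    for Kolmogorov complexity (ignoring the fixed size of the decompressor)."""
    return len(zlib.compress(data, 9))


def random_bytes(length: int, seed: int) -> bytes:
    """Pseudorandom bytes. Their true Kolmogorov complexity is small (seed plus
    this program), but a general-purpose compressor cannot find that structure."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))


if __name__ == "__main__":
    length = 10_000
    structured = bytes(length)              # all zeroes
    periodic = b"abcd" * (length // 4)      # short repeating pattern
    unstructured = random_bytes(length, 0)  # looks random to zlib
    for name, data in [("zeros", structured), ("periodic", periodic), ("random", unstructured)]:
        print(f"{name:9s} {len(data)} bytes -> {compressed_length(data)} compressed")
```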

A nice example to think about is AlphaZero, a powerful neural net with no training set. In principle, a short computer program and a random seed are enough to tell you everything you need to know about AlphaZero (just encode the rules of some board games and the algorithm for reinforcement learning). Perhaps that short program, plus a longer computer program that takes less time to run, and a still longer program that takes even less time to run, et cetera, are the advice we need to point out the structure of AlphaZero. This would suggest that advice should take the form of a sequence of computation-bounded sophistication constructions.

A different way of phrasing this strategy is as an attempt to transform worst-case MSP instances into average-case ones. The whole notion of sophistication is that a very unusual string/set of weights can be thought of as a typical string/set of weights drawn from some other distribution.

As you can tell, this prong of ARC's agenda has a lot of half-baked thoughts. I blame this on the fact that we don't have a single non-trivial problem of this type we fully understand. Good toy examples on simple models of computation will be a huge win here, and continued failure will be a huge red flag.

There is one other point I want to hammer home: the similarity between ARC's plan to watch the neural network's training process and ARC's plan to create a short description of a process that would give rise to this neural network. Let me spell that out. Most cryptographers believe in the existence of indistinguishability obfuscation (iO): roughly, that any algorithm, including the algorithm embodied in a neural network, can be obfuscated so that no polynomial-time observer looking at the finished code can tell it apart from an obfuscation of any other program of the same size with the same input-output behavior. If a model's weights could be adversarially obfuscated in this way, standard mechanistic analysis would be hopeless in the worst case. ARC's plan for getting around this is to watch the training process, making note of new structures as they are being etched into the weights. We believe that training-process monitoring is necessary to create adversarially robust probes.^[8]

There's a connection here that we don't fully understand: on one side, it may be that watching the creation of a neural network is necessary to have robust descriptions of its structure (and thus robust estimates of its behavior). On the other side, those same robust descriptions might take the form of (lists of) resource-bounded programs that create that very same neural network.

Dealing with Real Data

The MSP is usually written in terms of inputs coming from a well-defined and easy-to-sample distribution. But the distribution we care most about looks more like "plausible inputs for GPT-6." During training, we are given some number of examples from an unknown distribution, and we need to estimate the model's average over this distribution using less compute and fewer samples than just doing forward passes. For this to make sense, we probably need at least some prior over the parameters of the input distribution. But once we have that (e.g., the input distribution is a Gaussian with unknown mean selected uniformly over the reals and covariance selected according to such and such power law, or the input distribution is the result of pushing a Gaussian through a generative diffusion model of such and such a depth with parameters selected i.i.d. Gaussian), it seems we can often beat sampling at estimating $\mathbb{E}_{x \sim D}[f(x)]$ given samples from the input distribution $D$, where $f$ is the model.

Mechanisticness is (even more) confusing in this context. There is no way to mechanistically determine how often people ask for homework help versus relationship advice; you need to measure it empirically. How does this square with our goal of avoiding sampling? The rough answer is that the parameters of the model have been heavily optimized by gradient descent to look good to us, and quite possibly to fool us. The parameters of nature haven't.^[9] The model needs to be carefully picked apart for signs of danger, whereas the data-generating process can be safely understood using only black-box methods. So we are fine learning from samples of the data generating process, as long as our understanding of the model is mechanistic.

The simplest version of this is straightforward. Suppose we want to calculate $\mathbb{E}_{x \sim D}[f(x)]$ given $n$ samples of $x$. Suppose that our prior over the distribution of $x$ tells us that it is drawn from a 1-D Gaussian distribution with known variance but unknown mean. We can use the samples to infer the mean, and then mechanistically compute the expectation of $f$ over the inferred Gaussian. The expected squared error of this approach scales as $1/n$, the same rate as sampling. However, this more mechanistic process achieves a better constant in the scaling of the MSE than simply propagating all samples through $f$ and averaging. Nothing deep is happening: the data is just being used to estimate the distribution parameters, and the rest reduces to a standard MSP problem. If the MSP is true, then in cases like this (and also ones far more complicated than this), we can beat sampling in terms of both compute- and sample-efficiency.
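Here is a minimal sketch of that comparison. Everything specific in it is an assumption made for illustration: the "model" is $f(x) = x^2$, picked because its Gaussian expectation has the closed form $\mathbb{E}[x^2] = \mu^2 + \sigma^2$, and the function names and parameter values are hypothetical. The plug-in estimator infers the mean from the samples and then computes the expectation analytically. The sampling estimator averages $f$ over the samples.

```python
import random


def f(x):
    """Hypothetical model whose average we want; chosen so E[f] has a closed form."""
    return x * x


def sampling_estimate(samples):
    """Black-box estimate: push every sample through f and average."""
    return sum(f(x) for x in samples) / len(samples)


def plug_in_estimate(samples, sigma):
    """Infer the Gaussian mean from the samples, then compute E[f] analytically:
    for x ~ N(mu, sigma^2), E[x^2] = mu^2 + sigma^2."""
    mu_hat = sum(samples) / len(samples)
    return mu_hat * mu_hat + sigma * sigma


def mean_squared_errors(mu, sigma, n, trials, seed=0):
    """Empirical MSE of both estimators over repeated draws of n samples."""
    rng = random.Random(seed)
    truth = mu * mu + sigma * sigma
    err_sampling = err_plug_in = 0.0
    for _ in range(trials):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        err_sampling += (sampling_estimate(samples) - truth) ** 2
        err_plug_in += (plug_in_estimate(samples, sigma) - truth) ** 2
    return err_sampling / trials, err_plug_in / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        mse_s, mse_p = mean_squared_errors(mu=0.5, sigma=1.0, n=n, trials=500)
        print(f"n={n:5d}  sampling MSE={mse_s:.5f}  plug-in MSE={mse_p:.5f}")
```

For this particular $f$, the sampling estimator has MSE $(4\mu^2\sigma^2 + 2\sigma^4)/n$, while the plug-in estimator has MSE $4\mu^2\sigma^2/n + 3\sigma^4/n^2$. Both scale as $1/n$, but the plug-in constant is smaller.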

ARC has spent very little time thinking concretely about these problems, but in the past few months it's gone from a problem we don't know how to approach to a series of bounded technical questions that we just haven't gotten around to. (Unfortunately, at the moment it requires too much background to be "parcelled out" to others. I don't think this is inherent, and think with some work we could carve off some nice modular chunks of this problem for our friends in academia.)

Aligned to What?

This is something we've spent very little time thinking about recently. People have been grappling for years (millennia?) with the fact that no specific, concise set of mathematically defined rules does a good job capturing morality. The current best plans involve a mix of deferring to your future self (corrigibility) and deferring to the model's guess about some idealized future self (indirect normativity).

The version of indirect normativity I find most plausible isn't the science-fiction "committee of the thousand greatest minds" picture. It's more local: we want an AI that helps the user stay safe and learn about the real world based on their current preferences, and defers to the user's future self for hard questions where the current self doesn't yet know the answer. A lot of the technical complexity here comes from the need for decoupled feedback — we want the AI to take actions that genuinely make the world better according to the user's eventual judgment, rather than actions that influence the user into judging the world favorably. Distinguishing "make the world good" from "make the user think the world is good" is a real technical problem, not a philosophical aside, and it's the kind of thing ARC's tools would actually need to engage with.^[10]

I suspect that if ARC's tools work at all, they can train a model to do either of these, but we certainly haven't put in the legwork to figure out ways of doing that, and won't be able to test those techniques for years. I consider this to be the most neglected part of ARC's agenda. Unfortunately, thinking about this in a productive way requires a strong understanding of how ARC's methods would work and at least a decent understanding of the alignment landscape, and those two things don't coexist in many people at the moment. I'm hoping I can beef myself up enough to start seriously tackling it soon.

Mechanistic Anomaly Detection

The hope here is that if you understand why a model is getting a low loss, you can understand when it's getting a low loss "for the wrong reasons" on a particular input. In other words, we could detect anomalies where a low loss was produced by the wrong mechanism. Although that hope seems sensible enough phrased in English, nobody has managed to turn it into a mathematically well-defined conjecture, let alone provided serious evidence for it.

If we got MAD working, it would be a cure for deceptive alignment and also for reward hacking.^[11] However, I can't even begin to articulate how MAD would work, and I don't get the sense that anyone else at ARC can spell it out either. While many of my colleagues are still optimistic that such a thing is possible, it is mostly on the back burner while we work on understanding why models get low loss.

Including direct risks like coups, or indirect risks such as sabotaging the training of a newer model, which may then perform a coup. ↩︎
The general strategy is something we call cumulant propagation. We know the distribution of the inputs; we use it to figure out the cumulants of the next layer, and then the layer after that, and so on through the model. For most values of the weights, including more and more cumulants will give a more and more accurate estimate in a way that beats Monte Carlo estimation. ↩︎
What do we mean by "subjective" correlation? Informally, subjective probability is the best guess that a smart but still computationally bounded gambler would put on a certain empirical or logical fact. For instance, an observer might say there is a 52 percent chance the blues will win the next election, a 50 percent chance that the trillionth digit of $\pi$ is even, and zero correlation between the two. By contrast, there might be a positive subjective correlation between two open conjectures in computer science (though reasonable people can disagree about what the subjective correlation is). ↩︎
The first of these is provably true using somewhat advanced techniques. The second is unproven; its closest formal counterpart is Chowla's conjecture, which is not known to follow even from the Riemann Hypothesis. ↩︎
In this writeup, I use 'heuristic' to mean 'a quantitative approximation made without necessarily having a rigorous justification for it'. The paradigmatic example of heuristic approximation is the presumption of independence. Other examples include things like 'treat any distribution we see as Gaussian' or 'treat every function as a low-degree polynomial'. We typically arrive at mechanistic estimations by chaining together many heuristic assumptions. ↩︎
Each of these is something which a quick heuristic derivation would say is unlikely. For the twin prime conjecture, the simplest heuristic argument comes from the Prime Number Theorem and the presumption of independence. For the random reversible circuit, we can use the conjecture that even fairly small such circuits on $n$ bits are like random permutations of $\{0,1\}^n$, meaning that (heuristically) there is only a $1/(2^n)!$ chance that a given circuit would be the identity. For the MSP estimate, we are constructing it so it should be right on most neural networks. Notably, each of these heuristic arguments is extremely flawed, and could turn out to be wrong. ARC doesn't claim that every heuristic argument is correct all the time; we are saying that when they fail, there is a more sophisticated argument that would let us understand the failure. ↩︎
The Kolmogorov complexity $K(x)$ of a string $x$ is the length of the shortest program that outputs $x$ on a fixed universal Turing machine. For instance, a string of all zeroes has very low Kolmogorov complexity, whereas a random string will, with high probability, have Kolmogorov complexity close to its length. K-complexity captures incompressibility: a string is "random" if $K(x) \approx |x|$. But Kolmogorov complexity alone does not capture "structure". A uniformly random string has near-maximal $K(x)$, yet we can describe everything important about it in one sentence: "it's random." This motivated the definition of sophistication. Informally, the sophistication of $x$ is the length of the shortest program that specifies a set $S$ containing $x$, where $S$ is small enough that knowing it captures most of the structural information about $x$, and the remaining bits needed to pin down $x$ within $S$ look random. (For a thorough treatment, see Antunes and Fortnow; for an accessible discussion connecting sophistication to thermodynamics and dynamical systems, see Aaronson's blog posts on complextropy and Kolmogorov complexity.) Both $K(x)$ and sophistication are uncomputable as stated, since they quantify over all programs with no runtime bound. They also don't always correspond to our intuitions about how complicated/structured a system is. Resource-bounded variants restrict to programs halting within some time bound $t$. The $t$-bounded Kolmogorov complexity is the length of the shortest program outputting $x$ in at most $t$ steps; one defines $t$-bounded sophistication analogously. These variants are computable and connect to pseudorandomness: a string has low bounded complexity iff it can be efficiently generated, and low bounded sophistication iff its structure can be efficiently described. Aaronson's complextropy is essentially a resource-bounded sophistication measure. For a recent formal development of computation-bounded sophistication, see Finzi, Qiu, Jiang, Izmailov, Kolter, and Wilson. ↩︎
The rough idea is that any time the model learns a new fact, that fact is also incorporated into the advice string of our estimator. So if the model's evil plan involves some scientific fact that humans don't know but the model does, the estimate of catastrophe will catch it. This is true both for scientific facts (e.g. that these chemical compounds will make refrigerators work better but ruin the ozone layer) and for mathematical rules of thumb (e.g. when you're playing Go and the board looks like this, you should focus on moves that look like that). Arguments from cryptography suggest it would be completely infeasible to extract all these facts from the finished model, but it seems likely you can extract them all from the training process to create an "epistemically competitive" overseer. Some people ask whether we are essentially training a second model which we know to be inherently honest. I think this is very loosely correct (though training a transformer/neural network/some architecture that was chosen for performance with no particular thought towards anything like interpretability/alignment is a necessary step in the process, and for economic reasons what we will probably actually do is use the mechanistic model to align the neural network, and then deploy the neural network). ↩︎
Modulo fears of data poisoning. ↩︎
One reason MAD would be valuable, if we got it working, is exactly that it might let us distinguish "the universe looks good because it is good" from "the universe looks good because the cameras were hacked" — i.e., it pushes against the failure mode where the AI influences the measurement rather than the underlying reality. ↩︎
It might also help with finding a good thing to align to. We want the model to produce universes that look good because they are good, not because the cameras were hacked to show lots of smiling people. ↩︎
