In "Formalizing the Presumption of Independence", we defined a heuristic estimator to be a hypothetical algorithm that estimates the values of mathematical expressions based on arguments. That is, a heuristic estimator is an algorithm G that takes as input
A formally specified real-valued expression Y; and
A set of formal "arguments" π1,…,πm --
-- and outputs an estimate of the value of Y that incorporates the information provided by π1,…,πm. We denote this estimate by G(Y∣π1,…,πm).[1]
In that paper, we introduced the following question: is there a computationally efficient heuristic estimator that formalizes intuitively valid reasoning about the values of mathematical quantities based on arguments? We studied the question by introducing intuitively desirable coherence properties (one such property is linearity: a heuristic estimator's estimate of X+Y should equal its estimate of X plus its estimate of Y) and working to satisfy those properties. Ultimately, we left the question open.
The main technical contribution of our new work is to outline a new type of coherence property: a heuristic estimator should not be able to predict its own errors. We call this intuitive statement the principle of unpredictable errors. The principle is loosely inspired by the law of iterated expectations from probability theory, as well as the martingale property: a Bayesian reasoner's estimate of their future estimate of a quantity should be equal to its current estimate. One of the main purposes of this work is to explore ways to formalize this principle.
Our paper is structured as follows:
We begin by explaining the core motivating intuition behind heuristic estimators through three analogies: proof verification, conditional expectations, and subjective probabilities.
We explain why we believe in the principle of unpredictable errors. We then describe a natural attempt to formalize the principle: to simplify, we ask that G(Y−G(Y∣Π)∣Π)=0 for every expression Y and set of arguments Π={π1,…,πm}. (We call this property iterated estimation and also define a more complex property which we call error orthogonality.) We then discuss important drawbacks of this formalization, which stem from the nested G's in the definition of the properties.
Taking inspiration from these properties, we define a cluster of accuracy properties, which -- roughly speaking -- replace the outer G with an expected value over a distribution of expressions Y. The simplest of these properties states that E_{Y∼D}[Y−G(Y∣Π)∣Π]=0.
We examine the accuracy properties in the context of two estimation problems: (1) estimating the expected product of jointly normal random variables and (2) estimating the permanent of a matrix. In both cases, we encounter barriers to satisfying the accuracy properties, even when the set of heuristic arguments is small and simple. This leads us to reject accuracy as a formalization of the principle of unpredictable errors. We leave open the question of how to correctly formalize this principle.
We conclude with a discussion of our motivations for pursuing this line of research. While the problem of heuristic estimation is deeply interesting from a theoretical standpoint, we believe that it could have applications for understanding the behavior of neural networks. We discuss three potential applications of heuristic estimation to understanding neural network behavior: mechanistic anomaly detection,[2] safe distillation, and low probability estimation.
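As a loose toy illustration of the simplest accuracy property in the outline above (our own analogy, not an example from the paper): when G is the exact conditional expectation, the average error vanishes for every fixed state of knowledge. Here Y = X1 + X2 for two fair dice, and the "argument" reveals X1.

```python
# Toy check: if G is the true conditional expectation, then the
# average error E[Y - G(Y | Pi)] is exactly zero for every fixed
# revealed value of X1. (Illustrative analogy only; the paper's
# accuracy properties concern distributions over expressions.)
from fractions import Fraction

def G(x1):
    # Ideal estimate of Y = X1 + X2 given only X1 = x1: E[X2] = 7/2.
    return Fraction(x1) + Fraction(7, 2)

for x1 in range(1, 7):
    avg_err = sum(Fraction(x1 + x2) - G(x1) for x2 in range(1, 7)) / 6
    assert avg_err == 0
print("average error is 0 for every revealed X1")
```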
This blog post summarizes the paper, with proportionally more emphasis on the main ideas and less emphasis on the mathematical details.
What is a heuristic estimator?
In "Formalizing the Presumption of Independence", we described a heuristic estimator as an efficient program that forms a "subjective expected value" of a quantity based on arguments. We gave several examples of heuristic estimation, such as estimating the number of twin primes in a given range and estimating the probability that some 256-bit input to the SHA-256 circuit has an all-zeros output.
In our new paper, we expand on the intuition behind heuristic estimators through one example and three analogies.
Example: Sum of sixth digits of square roots
Let d6(√n) denote the sixth digit of √n past the decimal point (in base 10), and let Y := d6(√101) + d6(√102) + ⋯ + d6(√120). What is your best guess for the value of Y?
Without actually calculating any square roots, your best bet is to estimate each of the twenty digits as 4.5 (the average of 0 through 9); this gives an estimate of 20⋅4.5=90. This is perhaps how we want our heuristic estimator to behave when given no arguments; in other words, we want our heuristic estimator G to satisfy G(Y∣∅)=90.[3]
Now, let πn be a computation of the sixth digit of √n. When given πn as an argument, G should update its estimate of Y accordingly. For example, the sixth digit of √101 happens to be 5. Correspondingly, we would like G(Y∣π101)=5+19⋅4.5=90.5. If G is additionally given π102, which shows that the sixth digit of √102 is 4, G should again update its estimate: G(Y∣π101,π102)=5+4+18⋅4.5=90 -- and so on. If G is given all of π101 through π120, then it should be able to compute the correct answer.
(The purpose of this example is to provide a very simple and intuitive picture of how we expect G to update based on arguments. In practice, we expect the arguments given to G to be much more complex.)
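The bookkeeping in this example is easy to reproduce. Here is a small Python sketch (the function names are ours) that computes each digit exactly via integer square roots and mimics how G's estimate should update as arguments reveal digits:

```python
import math

def d6(n):
    """Sixth digit of sqrt(n) past the decimal point, computed exactly:
    floor(sqrt(n) * 10^6) mod 10, via integer square root."""
    return math.isqrt(n * 10**12) % 10

def estimate(revealed):
    """Mimic G's estimate of Y given arguments that reveal some digits:
    revealed digits enter exactly; the rest default to 4.5."""
    return sum(revealed.get(n, 4.5) for n in range(101, 121))

print(estimate({}))                            # 90.0 (no arguments)
print(estimate({101: d6(101)}))                # 90.5 (after pi_101)
print(estimate({101: d6(101), 102: d6(102)}))  # 90.0 (after pi_101, pi_102)
print(estimate({n: d6(n) for n in range(101, 121)}))  # the exact value of Y
```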
Analogy #1: Proof verification
A proof verifier is a program that takes as input a formal mathematical statement and a purported proof of the statement, and checks whether the proof is valid. This is very similar to a heuristic estimator, which is a program that takes as input a formal mathematical expression and some arguments about the expression, and outputs an estimate of the value of the expression in light of those arguments.
Just as a proof verifier does not attempt to generate its own proof of the statement -- it just checks whether the given proof is valid -- a heuristic estimator does not attempt to estimate the given quantity using its own arguments. Its only purpose is to incorporate the arguments that it is given into an estimate. Moreover, we can think of a heuristic estimator as a generalized proof verifier: we expect heuristic estimators to respect proofs, in the sense that if an argument π proves that ℓ≤Y≤h, then G(Y∣π) should lie between ℓ and h. (See Section 4 here.)
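As a minimal sketch of what respecting proofs means operationally (the function below is our own illustration, not part of the paper's formalism), an estimator whose raw estimate falls outside a proven interval must end up inside it; clamping is the simplest way to enforce this:

```python
def respecting_estimate(raw_estimate, proven_lo, proven_hi):
    """If an argument pi proves lo <= Y <= hi, an estimator that
    respects proofs must output a value in [lo, hi]; here we clamp
    a raw estimate into the proven interval."""
    return min(max(raw_estimate, proven_lo), proven_hi)

# A raw estimate of 90, combined with a proof that 95 <= Y <= 120,
# is pulled up to 95.
print(respecting_estimate(90, 95, 120))  # 95
```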
This table (adapted from chapter 9 here) illustrates the analogy between proof verifiers and heuristic estimators in more detail.
| Heuristic estimation | Proof verification |
| --- | --- |
| Heuristic estimator | Proof verifier |
| Formal mathematical expression | Formal mathematical statement |
| List of heuristic arguments | Purported proof of statement |
| Formal language for heuristic arguments | Formal language for proofs |
| Desiderata for estimator | Soundness and completeness |
| Algorithm's estimate of expression | Verifier's output (accept or reject) |
Analogy #2: Conditional expectation
In some ways, a heuristic estimator is analogous to a conditional expected value. For a random variable X and event A, E[X∣A] is the average value of X conditioned on A -- or put otherwise, it is the estimate of X given by an observer who knows that A occurred and nothing else. Similarly, G(Y∣Π) is the estimate of Y given by an observer who has not computed the exact value of Y and has instead only done the computations described in the arguments in Π. Although there is a particular correct value of Y, the observer does not know this value, and G(Y∣Π) is a subjective "best guess" about Y given only Π. Both quantities can thus be thought of as a best guess conditional on a state of knowledge.
Analogy #3: Subjective probabilities and estimates
Perhaps the best intuitive explanation of heuristic estimation is this: a heuristic estimator is a procedure that extracts a subjective expectation from a state of knowledge. Under this view, Π formally describes a set of facts known by an observer, and G(Y∣Π) is a subjective estimate of Y in light of those facts.
By "subjective expectation", we mean "expected value under the subjectivist view of probability". The subjectivist view of probability interprets probability as the subjective credence of an observer. For example, suppose that I have chosen a random number p uniformly from [0,1], minted a coin that comes up heads with probability p, and then flipped it. What is the probability that the coin came up heads?
To an observer who only knows my procedure (but doesn't know p), the subjective probability that the coin came up heads is 1/2. To an observer who knows p (but hasn't seen the outcome of the coin flip), the subjective probability that the coin came up heads is p. And to an observer who saw the outcome of the coin flip, the probability is either 0 (if the coin came up tails) or 1 (if it came up heads).
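A quick Monte Carlo simulation of the minting-and-flipping procedure (our own sketch) confirms the first observer's credence of 1/2:

```python
import random

random.seed(0)

def mint_and_flip(rng):
    p = rng.random()         # choose p uniformly from [0, 1]
    return rng.random() < p  # flip a coin that lands heads w.p. p

trials = 200_000
heads = sum(mint_and_flip(random) for _ in range(trials))
print(heads / trials)  # close to 1/2
```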
Much as observers can have subjective probabilities, they can also have subjective expectations. For example, a typical mathematician does not know the 6th digit past the decimal point of √101, but would subjectively assign a uniform probability to each of 0,…,9, which means that their subjective expectation for the digit is 4.5. Recalling our example from earlier, the mathematician's subjective expectation for Y:=d6(√101)+⋯+d6(√120) is 20⋅4.5=90. But if the mathematician were to learn that d6(√101)=5, they would update their subjective expectation to 90.5. This is exactly how we want our heuristic estimator G to operate.
How can heuristic estimation help us understand neural networks?
While heuristic estimation is a deep and interesting topic in its own right, our research is primarily motivated by potential applications to understanding neural network behavior.
Historically, researchers mostly understood neural network behavior through empirical observation: examining the model's input-output behavior on particular inputs. However, this approach has important drawbacks: for example, any understanding gained about a model's behavior on one input distribution may not carry over to a different input distribution.
More recently, there has been substantial work on neural network interpretability via understanding how models internally represent concepts. Interpretability research aims to address the barriers faced by methods that rely only on input-output behavior. However, current interpretability techniques tend to only work under strong assumptions about how neural networks represent information (such as the linear representation hypothesis). Also, for the most part, these techniques can only work insofar as neural representations of concepts are understandable to humans.
A different approach is formal verification: formally proving properties of neural networks such as accuracy or adversarial robustness. While formal verification does not rely on human understanding, we believe that formally proving tight bounds about interesting behaviors of large neural networks is out of reach.
By contrast, heuristic arguments about properties of neural networks may combine important advantages of formal verification and interpretability. On the one hand, heuristic arguments (like proofs) are formal objects that are not required to be human-understandable. This means that heuristic arguments could be used to reason about properties of neural networks for which no compact human-understandable explanation exists. On the other hand, heuristic arguments (like interpretability approaches) do not require perfect certainty to be considered valid. This allows for short heuristic arguments for complex properties of large models, even when no short proofs of those properties exist.[4] (See our earlier post on surprise accounting for further discussion.)
In the rest of this section, I will give three examples of problems that we believe cannot be solved in full generality with current approaches, but that may be solvable with heuristic arguments. (All three examples will just be sketches of possible approaches, with many details left to be filled in.)
Mechanistic anomaly detection
Let M be a neural network that was trained on a distribution D of inputs x using the loss function L(x,M(x)).[5] Suppose that M successfully learns to achieve low loss: that is, Ex∼D[L(x,M(x))] is small. Let x∗ be a (perhaps out-of-distribution) input. We call x∗ a mechanistic anomaly for M if M gets a low loss on x∗, but for a "different reason" than the reason why it gets low average loss on D. In other words, mechanistic anomalies are inputs on which M acts in a seemingly reasonable way, but via anomalous internal mechanisms.[6] To detect a mechanistic anomaly, reasoning about M's internal structure may be necessary.
How could we use a heuristic estimator to detect mechanistic anomalies? Suppose that we find a set of arguments Π such that the following quantity is low:[7]
G(E_{x∼D}[L(x,M(x))]∣Π).
That is, Π explains why M attains low average loss on D.[8] Given an out-of-distribution input x∗ such that L(x∗,M(x∗)) is once again low, we consider the quantity G(L(x∗,M(x∗))∣Π). This represents a heuristic estimate of M's loss on x∗ based only on the reasons provided in Π: that is, based only on the facts necessary to explain M's low loss on D.
If G(L(x∗,M(x∗))∣Π) is (correctly) low, then the reasons why M performs well on D also explain why M performs well on x∗. By contrast, if G(L(x∗,M(x∗))∣Π) is (incorrectly) high, then M performs well on x∗ for a different reason than why M performs well on D. As a result, we flag x∗ as a mechanistic anomaly for M.
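To make the flagging rule concrete, here is a schematic Python sketch. Since no heuristic estimator G actually exists yet, its output is stubbed with hypothetical numbers:

```python
def flag_mechanistic_anomaly(actual_loss, estimated_loss, threshold=0.1):
    """Flag x* when M actually achieves low loss, but the heuristic
    estimate based only on Pi (the explanation of M's low average loss
    on D) remains high, i.e. Pi cannot account for M's success on x*."""
    return actual_loss <= threshold and estimated_loss > threshold

# Hypothetical values: M's true loss on x* is 0.02, but the estimate
# G(L(x*, M(x*)) | Pi) comes out at 0.9, so x* is flagged.
print(flag_mechanistic_anomaly(0.02, 0.9))   # True
print(flag_mechanistic_anomaly(0.02, 0.05))  # False: Pi explains x* too
```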
(See here for a more detailed discussion of mechanistic anomaly detection.)
Safe distillation
Let f ("fast") and s ("slow") be two neural networks that were trained on a distribution D of inputs to complete the same task. Thus, f and s behave similarly on D. Suppose that we trust s to be aligned (e.g. we trust s to generalize well off-distribution) and do not similarly trust f, but that s is much slower than f.[9] Given an out-of-distribution input x∗, we would like to estimate s(x∗) without running s. We could do this by running f on x∗ and hoping that f generalizes well to x∗. However, this approach is not very robust. Instead, we can attempt to use the internal activations of f to predict s(x∗).
Concretely, suppose for simplicity that f and s output vectors, and suppose that we find a set of arguments Π such that the following quantity is low:
G(E_{x∼D}[∥f(x)−s(x)∥²]∣Π).
That is, Π explains why f and s produce similar outputs on D. Given an out-of-distribution input x∗, we consider the quantity
G(s(x∗)∣Π, computational trace of f on x∗).
This represents a heuristic estimate of s(x∗) given the computations done by f and the argument Π for why f and s are similar on D. If the reason why f and s behave similarly on D also extends to x∗, then G will correctly estimate s(x∗) to be similar to f(x∗). On the other hand, if the reason why f and s behave similarly on D does not extend to x∗, then G's estimate of s(x∗) may be different from f(x∗).
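Schematically, the resulting decision rule might look like the following sketch (hypothetical interface and numbers; no such estimator exists yet): trust f(x∗) as a stand-in for s(x∗) only when the heuristic estimate of s(x∗) stays close to f(x∗), and abstain otherwise.

```python
import math

def safe_distill(f_out, g_estimate_of_s, tol=0.1):
    """Return f(x*) as a stand-in for s(x*) only when the heuristic
    estimate of s(x*) (given Pi and f's trace on x*) stays close to
    f(x*); otherwise abstain rather than rely on f's generalization."""
    if math.dist(f_out, g_estimate_of_s) <= tol:
        return f_out
    return None  # abstain; fall back to running s directly

# Hypothetical vector outputs:
print(safe_distill([1.0, 2.0], [1.02, 1.97]))  # [1.0, 2.0]
print(safe_distill([1.0, 2.0], [0.1, 3.5]))    # None
```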
Last week, ARC released a paper called Towards a Law of Iterated Expectations for Heuristic Estimators, which follows up on previous work on formalizing the presumption of independence. Most of the work described here was done in 2023.