[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

LawrenceC; Erik Jenner; Leon Lang

This is the appendix to Natural Abstractions: Key Claims, Theorems, and Critiques. It contains additional details that we expect are only relevant to some readers. We also have a pdf with more mathematical details, which contains the proofs of the Telephone and generalized KPD theorems, which is different content than what you find in this appendix.

Overview of John’s writing on natural abstractions

John has written many posts on natural abstractions over the years and it can be hard to get an overview and decide what to read. This section is our take on the most important posts and how they fit together.

A decent place to start reading is Public Static: What is Abstraction?, which doesn’t discuss John’s newer ways of thinking about abstractions but is deliberately written for an audience that hasn’t read any other abstraction posts. It covers the key ideas from the 2020 sequence on abstractions.

On the technical side, the following posts contain the key claims and results we also discussed:

The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints (the main result in the “information at a distance” line of work)
Abstractions as Redundant Information (which introduces a second way of thinking about abstractions other than as information at a distance)
The "Minimal Latents" Approach to Natural Abstractions (which discusses a third characterization of natural abstractions, in addition to information at a distance and redundant information)

John has also written about Generalizing Koopman-Pitman-Darmois (gKPD) with the goal of showing that natural abstractions are described by exponential family distributions. Maxent and Abstractions: Current Best Arguments summarizes the state of that line of work last year. The generalized KPD theorem is significantly more technical than the other posts mentioned here. Based on private communication with John, it is unclear how relevant gKPD is to natural abstractions, so readers may wish to deprioritize it.

For the high-level agenda, there have been four posts over the past ~2 years:

The first two are about a more specific project and tie together some of the technical posts above. The Plan posts focus more on John’s overall view of alignment but also touch on natural abstractions.

Finally, Alignment by Default discusses one of the ways in which natural abstractions could be important to alignment in detail.

As an alternative to reading posts, we recommend the AXRP episode with John Wentworth, which discusses natural abstractions in quite some detail.

Case study: the Telephone theorem

We’ve discussed several limitations of the current theory around natural abstractions, as well as ways in which we think the presentation of results could be improved in our Discussion section. We didn’t go into much detail there for the sake of brevity. In this appendix, we illustrate some of our points in more detail using the Telephone theorem as an example. Compared to our earlier discussion of the Telephone theorem, we'll focus less on formalization and more on how it should be interpreted.

Our main point here will be that we should carefully distinguish between claims that are mathematically implied by the theorem versus claims that are only "metaphorically true" or unproven conjectures.

What are the summaries/constraints?

First, let’s look at how John describes the Telephone theorem. The most precise formulation appears to be from this comment and says:

We have a sequence of Markov blankets forming a Markov chain . Then in the limit $n \to \infty$ , $f_{n} (M_{n})$ mediates the interaction between $M_{1}$ and $M_{n}$ (i.e. the distribution factors according to $M_{1} \to f_{n} (M_{n}) \to M_{n}$ ), for some $f_{n}$ satisfying
$f_{n} (M_{n}) = f_{n + 1} (M_{n + 1})$
with probability 1 in the limit.

As mentioned in the appendix of the Telephone theorem post, $f_{n} (M_{n})$ is simply the function $x \mapsto P (M_{1} = x | M_{n})$ , i.e. the conditional distribution of $M_{1}$ given $M_{n}$ . That immediately explains the factorization $M_{1} \to f_{n} (M_{n}) \to M_{n}$ : the information that $M_{n}$ has about $M_{1}$ is of course exactly $P (M_{1} | M_{n})$ . (We prove this fact as Lemma 3 in the accompanying pdf.)

So the (informal) core statement of the Telephone theorem simply becomes

Let $M_{1} \to M_{2} \to \dots$ be a Markov chain. Then $P (M_{1} | M_{n}) = P (M_{1} | M_{n + 1})$ with probability 1 in the limit $n \to \infty$ .

In our view, stating the theorem in terms of the $f_{n}$ without saying what they are risks obscuring certain points. For example, talking about $f_{n}$ instead of writing out $P (M_{1} | M_{n})$ makes it easier to think that $f_{n}$ is an interpretable concept, or that it is a low-dimensional summary of $M_{1}$ . This may well be true in practical cases but the theorem doesn't say anything about it, which is an important distinction.

Especially the claim of low dimensionality anecdotally seems to be a common interpretation. This includes John himself sometimes citing the Telephone theorem as an argument for low-dimensional summaries, e.g. here:

We have a large system with lots of parts “far away” from each other. The Telephone Theorem then says that they will indeed interact only via some relatively low-dimensional summary.

As far as we can tell, this is false: the Telephone theorem itself makes no statements about the dimension or "size" of the summary $P (M_{1} | M_{n})$ for large $n$ (or even in the infinite limit). Indeed, the Telephone theorem still applies in Markov chains in which all information is preserved at infinite distances. We would agree that the Telephone theorem is connected "in spirit" to claims about low-dimensional summaries at large distances, but we think such claims should be distinguished more carefully and ideally made more precisely.

In what sense are the constraints deterministic?

The Telephone theorem talks about "deterministic constraints" in the limit:

$f_{n} (M_{n}) = f_{n + 1} (M_{n + 1})$
with probability 1 in the limit $n \to \infty$ .

What does this mean more precisely? Arguably the most straightforward interpretation of these words would be

lim n \to \infty P (f_{n} (M_{n}) = f_{n + 1} (M_{n + 1})) = 1 .

In other words, the probability that the deterministic constraint holds approaches one—the constraint is violated increasingly rarely over time.^[1] But this is wrong! It may very well be that $P (f_{n} (M_{n}) = f_{n + 1} (M_{n + 1})) = 0$ for all $n$ . (See e.g. here for an example by John where this is the case.)

To be clear, John has an entire section in the Telephone theorem post on the fact that there aren't necessarily any actually deterministic constraints in the Markov chain. He's certainly aware of this! But our guess would be that a significant number of people miss this point, especially if they learn about the Telephone theorem only from a summary elsewhere. For example, the NAH Project Update has an entire section summarizing the Telephone theorem that doesn't mention the approximate nature of the deterministic constraints at all—instead it says

All information is either perfectly conserved or completely lost in the long run. And, more interestingly, information can only be perfectly conserved when it is carried by deterministic constraints - i.e. quantities which are exactly equal between two parts of the system. (highlight ours)

The phrase "information can only be perfectly conserved when it is carried by deterministic constraints" is true in one interpretation: if e.g. $M_{2}, M_{3}$ contain exactly the same information about $M_{1}$ , then (trivially) $P (M_{1} | M_{2}) = P (M_{1} | M_{3})$ , so all the information about $M_{1}$ is carried by a deterministic constraint.

But the overall claim suggested by the quote above seems to be "only the information carried by (exactly) deterministic constraints remains in the limit" (since only that information is "perfectly conserved" and any other information is "completely lost"). As discussed above, this interpretation would be wrong. Information can in fact be transmitted by inexact deterministic constraints over infinite distances.

Redundancy about a target variable

John told us about a "new Telephone theorem" that doesn't require a Markov chain. Below, we give a sketch of our understanding of this result. Instead of interpreting it as a "Telephone theorem", we can also view it as talking about redundant information, but redundant information about a target variable instead of within the system itself.

Let $X_{1}, \dots, X_{n}, Y$ be jointly distributed random variables. We will remove variables $X_{i}$ one at a time, choosing variables to reduce the marginal mutual information with $Y$ as much as possible at each step.

Formally, for $i \in {1, \dots, n}$ , we write $[i] := {1, \dots, i} .$ Then, construct an order (bijection) $σ : [n] \to [n]$ with the property that is successively selects the variable $X_{σ (i)}$ with the greatest marginal drop in mutual information:

σ (i) := argmax j \in [n] ∖ σ ([i - 1]) δ (σ ([i - 1]), j),

with

δ (σ ([i - 1]), j) = I (Y; X_{[n] ∖ σ ([i - 1])}) - I (Y; X_{[n] ∖ (σ ([i - 1]) \cup {j})}) .

Choose $ϵ > \frac{H (Y)}{n}$ , which in the "limit" of a large number of variables can be a very small constant. Now, these delta's cannot indefinitely be larger than $ϵ$ since there is not enough entropy $H (Y)$ available:

Proposition. There is $i \in [n]$ with $δ (σ ([i - 1]), σ (i)) < ϵ$ .

Proof. Assume otherwise. Then we have

H (Y) < n \cdot ϵ = n \sum i = 1 ϵ \leq n \sum i = 1 δ (σ ([i - 1]), σ (i)) = I (Y; X_{[n]}) \leq H (Y),

a contradiction. $□$

Now, set $I_{i} := I (Y; X_{[n] ∖ σ ([i])}) .$ Let $i \in [n]$ be given such that $δ (σ ([i - 1]), σ (i)) < ϵ$ , which exists according to the proposition. This means that $I_{i - 1} - I_{i} < ϵ$ , meaning the sequence $I_{1}, \dots, I_{n}$ has what one could call an $ϵ$ -plateau at $(i - 1, i)$ . At this point, the remaining variables are $ϵ$ -redundant about $Y$ : All the remaining information about $Y$ can approximately be recovered after removing any of the variables. The hypothesis is then that there live interesting abstractions at these plateaus.

Thoughts on future work

We think the most important places for future work are the limitations we mentioned in our discussion section, e.g. the focus on infinite limits, the fact that representations/encodings aren't discussed, and the dependency on variable choices. But additionally, this appendix lists some more specific directions for anyone not convinced by those limitations, who wants to fill some gaps in formalization and empirical tests instead. Again, we currently think that the ideas in this section are not the most important thing to focus on; for instance, some of the things we discuss below might become entirely obsolete with an improved framework that takes the finite regime seriously. But we had the ideas below written down anyway, and perhaps including them here will be helpful to someone.

Further mathematical formalizations

We have provided formalizations of some of John’s work— namely, the Telephone and generalized KPD theorems. Here, we detail which further formalizations we would find desirable. These mainly deal with clarifying the exact definition and “universal” properties of redundant information and generalizations to “high redundancy” and finite distances.

Formalizations of already existing work

Most importantly, we think it would be valuable if the redundant information viewpoint was better formalized. One pathway might be the following:

For the infinite variable case, it would be desirable to formalize “how to take the limits”, as already mentioned in the original post. I.e., in what order or relative strength do the number of variables and the number of resampling steps go to infinity?
What exactly is $X^{\infty}$ ? One potential route is to define random variables $F_{t} (X^{0}) := P (X^{t} | X^{0})$ by the resampling process. Then, one could define $F (X^{0})$ as the in-probability limit of the sequence $F_{t} (X^{0})$ , if this can be shown to exist. One could then call this limit $P (X^{\infty} | X^{0})$ , which then implicitly defines the random variable $X^{\infty}$ .
For the purpose of showing that “abstractions are functions of redundant information”, we envision the following process:
- First, show that $F$ is invariant under resampling, that is: if $x^{0}$ is sampled from $P (X^{0})$ and $x^{1}$ is resampled from $x^{0}$ , then show that $F (x^{0}) = F (x^{1})$ .
- Next, show that $F$ is universal among functions with this property, that is: for any other function $A$ of $X^{0}$ that is invariant under resampling (i.e., $A (x^{0}) = A (x^{1})$ for $x^{1}$ resampled from $x^{0}$ ), show that there is a function $G$ such that $A (X^{0}) = G (F (X^{0}))$ .
- Next, show that any “Telephone limit” $f^{\infty}$ and minimal latent variable $Λ^{*}$ can be interpreted as a function of $X^{0}$ (or simply define the sample space $Ω$ as the space of values of $X^{0}$ , which omits this issue).
- Next, show that all telephone limits $f^{\infty}$ and minimal sufficient statistics $Λ^{*}$ are indeed invariant under resampling as well. Then, the universality property will show that they are both functions of redundant information.
To show that the theory of minimal latent variables is not “empty”, it may be desirable to prove that minimal latent variables indeed exist in a fairly large, realistic setting.

Generalizations of results

As mentioned earlier, we think that interesting abstractions for a “game of telephone” do not live “in the limit” but at a finite stage. It might thus be desirable to formalize or quantify a notion of “large” that is meaningful in this context, or to look at “plateaus” of the information sequence $I (X_{0}; X_{t})$ .

Additionally, the exact definition of redundancy could be further refined. As mentioned already in the minimal latents post, one should also study highly redundant information—i.e., information that remains indefinitely after resampling a finite amount of variables at once. To understand why this is necessary, think about the pathological situation where $X^{0} = Y_{1}, Y_{1}, \dots, Y_{k}, Y_{k}$ contains each random variable exactly twice. In this case, all information is redundant.

More generally, we can imagine that there are often uninteresting local couplings between variables in a system that make them highly redundant at a low level. It is unclear whether such low-level couplings carry interesting abstractions for the purpose of cognitive systems.

Generalizing the theory like this is likely significant work, especially since this raises the new question of how to choose the number of variables to be resampled in a principled way.

Empirical and algorithmic work

We already mentioned earlier in Relevance to Alignment when discussing the universality hypothesis that it is ultimately an empirical claim, and that there have not been many publicly written experiments supporting it. While we are sympathetic to the idea that some experiments first need a firm theoretical grounding before they can be run, we think that some experiments can be run already now. Additionally, we think it is important to at least think about what concrete experiments might look like since this may already provide a useful feedback loop for further work on theory.

John already did some experiments, but they have not been written up, as clarified In his plan update for 2022:

As of The Plan, by six months ago I was hoping to have efficient algorithms for computing natural abstractions in simulated environments, and that basically didn’t happen. I did do a couple interesting experiments (which haven’t been written up):
Both Jeffery Andrade and myself tried to calculate natural abstractions in the Game of Life, which basically did not work.
I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.

Writing up some takeaways from these experiments might replace the need for certain experiments, though we also expect more extensive experiments would still be helpful.

For an illustration, we describe the following fairly concrete idea: We first look at how one might study minimal latent variables of MNIST, then relate this to an investigation of redundant information in MNIST, and finish by thinking about a simplification and limitation of the idea. If successfully run, these experiments could provide evidence for or falsify the idea that minimal latent variables and redundant information contain “abstractions” in a human-interpretable, discrete way. This would empirically support claims 1b, 1d, 2a, and 2b. Additionally, the experiments could empirically demonstrate or falsify the equivalence of minimal latents and redundancy.

We are aware of several serious issues with this specific idea (see a brief discussion at the end), but would be excited about more discussion on how to turn this into feasible experiments.

Minimal Latents of MNIST

We now describe a preliminary idea for computing abstractions in MNIST in the form of minimal latents. The hope is that explicitly computed minimal latent variables would actually “track abstract information” in the intuitive sense. A specific approach might be:

Train a generative model $p$ such that $p (X)$ approximates the probability of the MNIST image $X$ . For example, one could define $p$ to be a normalizing flow model based on., e.g., RealNVP. Once it produces realistic images, fix $p$ for the rest of the experiment.
Initialize a parameterized function $Λ (X)$ that computes a “latent variable” for $X$ . Initially, one might start with the identity function $Λ = Id$ .
Optimize $Λ (X)$ such that it minimizes mutual information with $X$ : $min I (X; Λ (X))$ , with $X$ being distributed according to the learned model $p (X)$ . Since $Λ (X)$ is a function of $X$ , we have $I (X; Λ (X)) = H (Λ (X))$ , i.e., the mutual information is simply the entropy of $Λ (X)$ .
Since $Λ$ is initially the identity, the dimensions $X_{i}$ of $X$ are mutually independent conditioned on $Λ (X)$ . This property of a latent variable needs to be preserved in the whole optimization process of $Λ$ . That can be achieved by ensuring that $I (X_{i}; X_{\neq i} | Λ (X))$ stays close to zero by adding it to the loss of the optimization process.^[2]

At the end of this training process, the property that $I (X_{i}; X_{\neq i} | Λ (X))$ is close to zero for all $i$ makes sure that $Λ (X)$ is a latent variable in John’s framing. That $I (X; Λ (X)) = H (Λ (X))$ is minimal ideally makes $Λ (X)$ a minimal latent variable.^[3]

One important question is whether $Λ (X)$ tracks any important latent factor of $X$ . An optimistic hypothesis would be that $Λ (X)$ essentially assigns a “label” to each digit $X$ and that this label exactly corresponds to the digit value (i.e., $Λ (X)$ classifies $X$ into 10 categories, one for each digit). Why might one approximately expect this? Here’s the optimistic story:

Clearly, digit dimensions $X_{i}$ are not (non-conditionally) independent. After all, if you reveal an image of a digit piece by piece, then it may be possible to predict the digit value fairly early. This then constrains the possibilities for the non-revealed pieces since they need to combine into a coherent image.
However, if you condition on the digit value (e.g., a six), you then reveal half of the image, and it looks like a six, then this doesn’t tell you anything on top of what you condition on. The revealed part and the non-revealed part are now independent.

Of course, this is at best approximately correct: not all sixes look the same; seeing part of an image showing a six can make you infer facts about the drawing style that go beyond the pure digit value. At best, we expect that $Λ$ corresponds to an assignment of “digit value + style information”. If $Λ (X)$ is in any way interpretable like this, then we would consider this clear evidence that minimal latent variables track fairly natural abstractions. This would make us more confident that the formalizations are on a good track.

Furthermore, things may be messier in reality, with more than “digit value + style information” encoded in the minimal latent variable. And given that the minimal latent variable may not be forced to find a legible representation, similar to our discussion in an earlier section, it may be hard to interpret what the minimal latent encodes even if it keeps only information that is in principle highly interpretable. This makes it hard to clearly distinguish uninterpretable and hard-to-interpret cases.

Finally, note that it is not a priori clear how to interpret $Λ$ . Initially, we would suggest simply “looking” at the outputs of $Λ$ based on different sampled digits. In the results, check whether $Λ$ throws away most low-level information while still clearly clustering different digits and styles into different results. For this to work, it may be important to put careful thought into the exact representation of $Λ$ . For example, in an ideal case, $Λ$ would end up having separate dimensions that encode “digits” and “style information”, but initializing $Λ$ as a function $R^{784} \to R^{784}$ as proposed above may not be the best approach for that.

Variation: Do the same procedure for the case that MNIST is pre-filtered to only contain digits with the same value. Then, check whether $Λ (X)$ only keeps “style information”, and no digit information anymore.

Redundancy in MNIST

For understanding redundancy in MNIST, one could resample digits pixel by pixel and evaluate what remains after many iterations. For example, one might train conditional distributions $P (X_{i} | X_{\neq i})$ that approximate the conditional of the fixed parameterized distribution $p (X)$ from earlier. Then, start with an image $X$ sampled from $p$ and resample it pixel by pixel for many iterations. Look at the resulting image $X^{\infty}$ .

The “optimistic” prediction would be that some information remains after running this resampling process for a long time and that this remaining information exactly matches the information captured by the “minimal latent variable” $Λ (X)$ from above. More concretely:

Ideally, $X^{\infty}$ would be an “idealized” digit with the same value as $X$ , or oscillate between different digits of the same value when further resampled.
$X^{\infty}$ would match the “style” of $X$ in some idealized fashion, or again, oscillate between different versions of the digit with the same style.

Again, if this would work out, then we would consider this substantial evidence that the redundancy framework is on the right track to capture fairly natural abstractions. Additionally, this would be some empirical evidence for the not-yet-completely-formalized correspondence of minimal latents and redundant information.

Shortcomings and further thoughts

One disadvantage of the preceding experiment is that $X$ does not have the structure of a sparse graphical model (e.g., Bayesian network) since the “physical” latent — i.e., the human intention and writing style — is not present as a dimension in $X$ . This means that it is not trivially possible to study Markov blankets in $X$ . In particular, one cannot easily study abstractions along the lines of the original telephone theorem for MNIST. Studying this requires a different data set.

One caveat with all of this is that the models $p (X)$ and $p (X_{i} | X_{\neq i})$ we may train may not be a good enough approximation of the true distributions. For example, $p (X)$ will have adversarial examples (images with high assigned probability that do not look like digits), and redundancy in that model distribution may then not be what we are looking for.

Finally, optimizing the various mutual information terms may be intractable in practice or simply not work well enough.

Overall, we recommend that anyone interested in running experiments on natural abstractions take this only as inspiration or a jumping-off point, not as a workable and fleshed out proposal to just implement.

^{^}
Another interpretation could be almost sure convergence, i.e.
$P (lim n \to \infty | f_{n} (M_{n}) - f_{n + 1} (M_{n + 1}) | = 0) = 1 .$
We are unsure whether this interpretation would be correct—our proof only shows convergence in probability, which is strictly weaker.
^{^}
MNIST has 784 dimensions, meaning the loss function has 784 added terms. This would need to be made tractable. Alternatively, one could replace the mutual informations by total correlation or dual total correlation term; those being zero is an equivalent definition of mutual independence. One could then try to find out if there are efficient representations or approximations of these quantities that are also numerically stable.
^{^}
This might be wrong. A priori, there might be a second latent variable $f (X)$ that makes all $X_{i}$ mutually independent, with larger mutual information $I (X; f (X)) > I (X; Λ (X))$ , but such that $I (X; Λ (X) | f (X)) \neq 0$ . In this case, we do not have the Markov chain $Λ (X) \to f (X) \to X$ , meaning $Λ (X)$ is not minimal.
Thus to be more confident in the minimality of $Λ (X)$ , one might need to train further latent variables $f$ , fix them, and then minimize $I (X; Λ (X) | f (X))$ with respect to $Λ$ (alongside the other optimization objectives). This task is somewhat ad hoc since we see no principled way for obtaining a sufficiently “exhaustive” list of latents $f$ .

27