Natural Abstractions: Key Claims, Theorems, and Critiques

Leon Lang; Erik Jenner

Brief responses to the critiques:

Results don’t discuss encoding/representation of abstractions

Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the high-level.

Definitions depend on choice of variables

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. For instance, it doesn't make sense to use variables which "rotate together" the states of five different local patches of spacetime which are not close to each other. (For instance, those five different local patches will generally not be rotated together by default in an evolving agent's sensory feed.)

That does still leave degrees of freedom in how we represent all the local patches, but those are exactly the degrees of freedom which don't matter for natural abstraction. (Under the minimal latent formulation: we can represent each individual variable or set-of-variables-which-we're-making-independent-of-some-other-stuff in a different way without changing anything informationally. Under the redundancy formulation: assume our resampling process allows simultaneous resampling of small sets of variables, to avoid the thing where there's two variables very tightly coupled but they're otherwise independent of everything else. With that modification in place, same argument as the minimal latent formulation applies.)

Theorems focus on infinite limits, but abstractions happen in finite regimes

Totally agree with this one too, and it has also been a major focus for me over the past couple months.

I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)

Missing theoretical support for several key claims

Basically agree with this. In particular, I think the quoted block is indeed a place where I was a bit overexcited at the time and made too strong a claim. More generally, for a while I was thinking of "deterministic constraints" as basically implying "low-dimensional" in practice, based on intuitions from physics. But in hindsight, that's at least not externally-legibly true, and arguably not true in general at all.

Figuring out whether the Universality Hypothesis is true
... What we’re less convinced of is that the current theoretical approach is a good way to tackle this question. One worrying sign is that almost two years after the project announcement (and over three years after work on natural abstractions began), there still haven’t been major empirical tests, even though that was the original motivation for developing all of the theory. ... Of course sometimes experiments do require upfront theory work. But in this case, we think that e.g. empirical interpretability work is already making progress on the Universality Hypothesis, whereas we’re unsure whether the natural abstractions agenda is much closer to major empirical tests than it was two years ago.

See the section on "Low level of precision...". Also, You Are Not Measuring What You Think You Are Measuring is a very relevant principle here - I have lots of (not necessarily externally-legible) bits of evidence about a rough version of natural abstraction, but the details I'm still figuring out are (not coincidentally) exactly the details where it's hard to tell whether we're measuring the right thing.

Abstractions as a bottleneck for agent foundations: The high-level story for why abstractions seem important for formalizing e.g. values seems very plausible to us. It’s less clear to us whether they are necessary (or at least a good first step)

Yeah, I don't think this should be externally-legibly clear right now. I think people need to spend a lot of time trying and failing to tackle agent foundations problem themselves, repeatedly running into the need for a proper model of abstraction, in order for this to be clear.

Accelerating alignment research: The promise behind this motivation is that having a theory of natural abstractions will make it much easier to find robust formalizations of abstractions such as “agency”, “optimizer”, or “modularity”. ... To us, such an outcome seems unlikely, though it may still be worth pursuing

I probably put higher probability on success here then you do, but I don't think it should be legibly clear.

Interpretability: ... Figuring out the real-world meaning of internal network activations is one of the core themes of safety-motivated interpretability work. And reverse-engineering a network into “pseudocode” is not just some separate problem, it’s deeply intertwined. We typically understand the inputs of a network, so if we can figure out how the network transforms these inputs, that can let us test hypotheses for what the meaning of internal activations is.

An intuitive understanding of inputs plus a circuit is not, in general, sufficient to interpret the internal things computed by the circuit. Easy counterargument: neural nets are circuits, so if those two pieces were enough, we'd already be done; there would be no interpretability problem in the first place.

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

Low level of precision and formalization

I mentioned earlier the heuristic of "only add formality when it's the right formality; don't prematurely add ad-hoc formulations just for the sake of making things more formal".

More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking. Formalization is one part of this: in academia, the incentive is usually to add ad-hoc formalization in order to get a full formal proof rather than a sketch, even if the ad-hoc formalization added does not match reality well. On the experimental side, the incentive is usually on bulletproof results, rather than gaining lots of information. (... and that's the better case. In the worse case, the incentive is on jumping through certain hoops which are nominally about bulletproofing, but don't even do that job very well, like e.g. statistical significance.) And yes, defensibility does have value even for truth-seeking, but there are tradeoffs and I advise against anchoring too much on academia.

With that in mind: both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly. For that purpose, it's important to make the stuff I think I understand as legible as possible so that others can help. I try to accurately convey my models and epistemic state. But it's not important to e.g. make it easy for others to point out mistakes in places where I didn't think the formality was right anway. If and when I have all the pieces, then I can worry about defensible proof.

That said, I agree with at least some parts of the critique. Being both precise and readable at the same time is hard, man.

Few experiments
As we briefly discussed earlier, we think it’s worrying that there haven’t been major experiments on the Natural Abstraction Hypothesis, given that John thinks of it as mostly an empirical claim. We would be excited to see more discussion on experiments that can be done right now to test (parts of) the natural abstractions agenda! We elaborate on a preliminary idea in the appendix (though it has a number of issues).

I do love your experiment ideas! The experiments I ran last summer had a similar flavor - relatively-simple checks on MNIST nets - though they were focused on the "information at a distance" lens rather than the redundancy or minimal latent lenses.

Anyway, similar answer here as the previous section: at this point I'm mainly trying to get to the right answers quickly, not trying to provide some impressive defensible proof. I run experiments insofar as they give me bits about what the right answers are.

[-]Erik Jenner3y111

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few variables. These local transformations aren't a problem because if we, say, resample variables at a time, then transforming $m < n$ variables doesn't affect redundant information.

I'm tentatively skeptical that we can split transformations up into these local components. E.g. to me it seems that describing some large number $N$ of particles by their center of mass and the distance vectors from the center of mass is a very reasonable description. But it sounds like you have a notion of "reasonable" in mind that's more specific then the set of all descriptions physicists might want to use.

I also don't see yet how exactly to make this work given local transformations---e.g. I think my version above doesn't quite work because if you're resampling a finite number $n$ of variables at a time, then I do think transforms involving fewer than $n$ variables can sometimes affect redundant information. I know you've talked before about resampling any finite number of variables in the context of a system with infinitely many variables, but I think we'll want a theory that can also handle finite systems. Another reason this seems tricky: if you compose lots of local transformations, for overlapping local neighborhoods, you get a transformation involving lots of variables. I don't currently see how to avoid that.

I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)

Oh, I did not realize from your posts that this is how you were thinking about the results. I'm very sympathetic to the point that formalizing things that are ultimately the wrong setting doesn't help much (e.g. in our appendix, we recommend people focus on the conceptual open problems like finite regimes or encodings, rather than more formalization). We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

I totally agree with this FWIW, though we might disagree on some aspects of how to scale this to more realistic cases. I'm also very unsure whether I get how you concretely want to use a theory of abstractions for interpretability. My best story is something like: look for good abstractions in the model and then for each one, figure out what abstraction this is by looking at training examples that trigger the abstraction. If NAH is true, you can correctly figure out which abstraction you're dealing with from just a few examples. But the important bit is that you start with a part of the model that's actually a natural abstraction, which is why this approach doesn't work if you just look at examples that make a neuron fire, or similar ad-hoc ideas.

More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking.

I agree with this. I've done stuff in some of my past papers that was just for defensibility and didn't make sense from a truth-seeking perspective. I absolutely think many people in academia would profit from updating in the direction you describe, if their goal is truth-seeking (which it should be if they want to do helpful alignment research!)

On the other hand, I'd guess the optimal amount of precision (for truth-seeking) is higher in my view than it is in yours. One crux might be that you seem to have a tighter association between precision and tackling the wrong questions than I do. I agree that obsessing too much about defensibility and precision will lead you to tackle the wrong questions, but I think this is feasible to avoid. (Though as I said, I think many people, especially in academia, don't successfully avoid this problem! Maybe the best quick fix for them would be to worry less about precision, but I'm not sure how much that would help.) And I think there's also an important failure mode where people constantly think about important problems but never get any concrete results that can actually be used for anything.

It also seems likely that different levels of precision are genuinely right for different people (e.g. I'm unsurprisingly much more confident about what the right level of precision is for me than about what it is for you). To be blunt, I would still guess the style of arguments and definitions in your posts only work well for very few people in the long run, but of course I'm aware you have lots of details in your head that aren't in your posts, and I'm also very much in favor of people just listening to their own research taste.

both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly.

Yeah, to be clear I think this is the right call, I just think that more precision would be better for quickly arriving at useful true results (with the caveats above about different styles being good for different people, and the danger of overshooting).

Being both precise and readable at the same time is hard, man.

Yeah, definitely. And I think different trade-offs between precision and readability are genuinely best for different readers, which doesn't make it easier. (I think this is a good argument for separate distiller roles: if researchers have different styles, and can write best to readers with a similar style of thinking, then plausibly any piece of research should have a distillation written by someone with a different style, even if the original was already well written for a certain audience. It's probably not that extreme, I think often it's at least possible to find a good trade-off that works for most people, though hard).

[-]So8res3y111

I'm awarding another $3,000 distillation prize for this piece, with complements to the authors.

[-]Daniel Kokotajlo3y1121

Big thank you for doing this work!

[-]johnswentworth3y139

+1, this is probably going to be my new default post to link people to as an intro.

[-]Vanessa Kosoy1y70Review for 2023 Review

This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kind of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably this is false for cartesian agents because of the different biases resulting from the different physical points of view).
A possible formalization of the agreement theorem inspired by my richness of mathematics conjecture: Given two beliefs and $Φ$ , we say that $Ψ ⪯ Φ$ when some conditioning of $Ψ$ on a finite set of observations produces a refinement of some conditioning of $Φ$ on a finite set of observations (see linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form $Ψ_{0} ≺ Ψ_{1} ≺ Ψ_{2} ≺ \dots$ Here, the sequence can be over physical time, or over time discount or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences ${Ψ_{i}}$ and ${Φ_{i}}$ . The agreement theorem can then state that for all $i \in N$ , there exists $j \in N$ s.t. $Φ_{j} ⪯ Ψ_{i}$ (and vice versa). More precisely, this relation might hold up to some known function $ϵ (i, j)$ s.t. ${lim}_{j \to \infty} ϵ (i, j) = 0$ .
The "agreement" in the previous paragraph is purely semantic: the agents converge to believing in the same world, but this doesn't say anything about the syntactic structure of their beliefs. This seems conceptually insufficient for natural abstractions. However, maybe there is a syntactic equivalent where the preorder $⪯$ is replaced by morphisms in the category of some syntactic representations (e.g. string machines). It seems reasonable to expect that agents must use such representations to learn efficiently (see also frugal compositional languages).
In this picture, the graphical models used by John are a candidate for the frugal compositional language. I think this might be not entirely off the mark, but the real frugal compositional language is probably somewhat different.

[-]Arjun Pitchanathan8mo10

representation of a variable $X$ for variable $Y$

Hm, I don't understand what $Y$ is supposed to be here.

[-]Leon Lang8mo20

Y would be a target variable that one wants to predict, e.g. in supervised learning.

In other words, one wants to learn a model that predicts Y from X. As an intermediate step, one creates a representation T of X. T does not need to keep *all* information about X, but it needs to keep enough information to be able to extract Y from it. The remaining information about X is minimized.

Does this clarify it?

[-]romeostevensit3y10

Tangentially related: recent discussion raising a seemingly surprising point about LLM's being lossless compression finders https://www.youtube.com/watch?v=dO4TPJkeaaU

^{^}

John mentioned a caveat on this to us:

Note that I sometimes hedge about whether "the natural abstractions" are $F (X)$ itself, or whether they're a latent variable of which $F (X)$ is an estimate. The latter is probably the right answer, but we'd expect in typical systems that the estimate is very precise, so the distinction doesn't matter much. (Prototypical example: average particle energy in one chunk of a gas as an estimate of the temperature of the gas.)
[Further explanation after some discussion with us:]
Latent variables, in general, are not necessarily fully determined by the physical state of the universe; that much just naturally drops out of the math. Latents are just these mathematical constructs. They can be predictively useful and powerful, while still mathematically having uncertainty separate from the state of the world.

Another way to frame it: consider the Kolmogorov complexity/Solomonoff induction view. From a God's-eye view, we could observe the entire low-level state of the universe, then find the shortest program which outputs that state. And it's entirely possible that that shortest program contains some variables whose values we are unable to perfectly estimate, even knowing the entire low-level state of the universe. (In the Kolmogorov context, this means that there are multiple different programs with approximately-the-same length which all output the observed universe-state, and all have very similar structure, but assign different values to corresponding variables.) What our uncertainty is over is the values of the latent variables - i.e. the internal variables used by the programs which approximately-maximally compress the low-level universe state. Insofar as the programs are near-optimal compressions, that uncertainty should be small, but it's not necessarily zero. And of course those internal variables can be predictively useful and powerful for modeling the world, even if their values are not fully determinable from the full world-state.

We're not sure whether we fully understand his views here, and in any case think this distinction shouldn't matter too much for the rest of our post, so we won't discuss it further.

^{^}

The (slight) difference is that Gibbs sampling is typically defined as resampling $X_{1}$ , then $X_{2}$ , and so on, wrapping around to $X_{1}$ after each variable has been resampled once. In contrast, John proposes randomly choosing which variable to resample at each step.

^{^}

Note that it's currently not quite clear in which sense anything converges here, see appendix for some notes on further formalization of $X^{\infty}$ .

^{^}

It’s certainly possible that the connection between theoretical progress so far and future empirical tests is just not meant to be fully legible based on John’s public writing.

101

Natural Abstractions: Key Claims, Theorems, and Critiques

101

Results don’t discuss encoding/representation of abstractions

Theorems focus on infinite limits, but abstractions happen in finite regimes

Missing theoretical support for several key claims

Low level of precision and formalization

Few experiments

Introduction

What do we mean by abstractions?

Why expect abstractions to be natural?

Why study natural abstractions for alignment?

Existing writing on the natural abstractions agenda

Related work

Machine learning

Representation Learning

The universality hypothesis in machine learning

MCMC and Gibbs sampling

Information Decompositions and Redundancy

Neuroscience

(Cognitive) Psychology

Philosophy

Key high-level claims

0. Abstractability: Our universe abstracts well

1. The Universality Hypothesis: Most cognitive systems learn and use similar abstractions

1a. Most cognitive systems learn subsets of the same abstractions

1b. The space of abstractions used by most cognitive systems is roughly discrete

1c. Most general cognitive systems can learn the same abstractions

1d. Humans and ML models both use natural abstractions

2. The Redundant Information Hypothesis: A mathematical description of natural abstractions

2a. Natural abstractions are functions of redundantly encoded information

2b. Redundant information can be formalized via resampling or minimal latents

2c. In our universe, most information is not redundant

2d. Locality, noise, and chaos are the key mechanisms for most information not being redundant

Key Mathematical Developments and Proofs

The Telephone Theorem

Abstractions as Redundant Information

More Details on Redundant information as resampling-invariant information

Telephone Abstractions are a Function of Redundant Information

Minimal Latents as a Function of Redundant Information

The Generalized Koopman-Pitman-Darmois Theorem

An almost formal formulation of generalized KPD

The Speculative Connection between gKPD and Redundancy

How is the natural abstractions agenda relevant to alignment?

Four reasons to work on natural abstractions

1. The Universality Hypothesis being true or false has strategic implications for alignment

2. Defining abstractions is a bottleneck for agent foundations

3. A formalization of abstractions would accelerate alignment research

4. Interpretability

How existing results fit into the larger plan

Selection theorems

Discussion, limitations, and critiques

Gaps in the theory

Results don’t discuss encoding/representation of abstractions

Definitions depend on choice of variables Xi

Theorems focus on infinite limits, but abstractions happen in finite regimes

Missing theoretical support for several key claims

Missing formalizations

Relevance to alignment

Methodological critiques

Low level of precision and formalization

Few experiments

Little engagement with existing work

Should this all be delegated?

Conclusion

Acknowledgments

Definitions depend on choice of variables $X_{i}$