Background: The Language-Learning Argument
How many functions are there from a 1 megabyte image to a yes/no answer to the question “does this image contain an apple?”.
Well, there are 8M bits in the image, so possible images. The function can assign “yes” or “no” independently to each of those images, so possible functions. Specifying one such function by brute force (i.e. not leveraging any strong prior information) would therefore require bits of information - one bit specifying the output on each of the possible images. Even if we allow for some wiggle room in ambiguous images, those ambiguous images will still be a very tiny proportion of image-space, so the number of bits required would still be exponentially huge.
Empirically, human toddlers are able to recognize apples by sight after seeing maybe one to three examples. (Source: people with kids.)
Point is: nearly-all the informational work done in a toddler’s mind of figuring out which pattern is referred to by the word “apple” must be performed by priors and general observations of the world, not by examples of apples specifically. Before the toddler ever hears the word, they must already have some internal data structure which represents the-things-we-call-apples as a natural category (though possibly implicitly rather than explicitly), just waiting to have a word attached to it. In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.
Furthermore, humans are empirically able to communicate; one human can say words, the other can guess what those words refer to, and then the two humans will both agree on what physical thing was referred to by the words. Admittedly this outcome is less robust than we might like (i.e. miscommunication still happens a lot), but it happens ridiculously more often than chance, to the point where exceptions are more notable than non-exceptions.
So, not only must unsupervised learning perform nearly-all the informational work of learning what words mean, but that unsupervised learning method must also somehow coordinate on (at least roughly) the same potential-word-meanings across different humans.
We can even go a step further: other kinds of minds, like dogs or neural nets, are also able to coordinate on (at least roughly) the same potential-word-meanings as humans. For now they usually require a lot more examples than human children, but nowhere near the exponentially huge number required to learn meanings by full brute force. Most of the informational work must still be done by some unsupervised process.
What We’ll Do In This Post
The language-learning argument adds up to a weak version of the Natural Abstraction Hypothesis: a wide variety of minds converge to using the same “natural abstractions”, i.e. the potential-word-meanings. Those abstractions are “natural” in the sense that they’re a fairly general property of minds (i.e. priors) plus the environment; a wide variety of minds are able to learn them in an unsupervised way just from the environment plus whatever priors are common to all these different kinds of minds.
My other writing on natural abstraction usually starts from thinking of natural abstractions as summaries of information relevant at a distance, or of information which is highly redundant - both properties which are mostly informational features of the environment, and mostly avoid explicitly talking about minds. They’re “bottom-up” notions of natural abstraction: they start from a low-level environment model, and build up to natural high-level features. The hope would be that we can later prove such features are convergently useful for minds via e.g. selection theorems.
But we can also come from a “top-down” direction: start from the idea of natural abstractions as features used convergently by a wide variety of minds. For instance, taking a theoretical angle, we can use a standard theoretical model of minds (i.e. Bayesianism), combine it with some natural criterion which might make a feature convergently useful, and ask what “natural abstractions” are implied.
That’s the idea behind this post.
More specifically, we’ll start from the usual idea of a Bayesian mind which can learn the environment-distribution arbitrarily well via observation, but has some freedom around which latent variables to use in order to represent that distribution efficiently. Those latent variables are the features/concepts/potential-word-meanings. Then, we’ll ask if there’s some “minimal” or “most general” latent variables, some latents such that any other latents must contain at least as much information - the idea being that such “minimal” latents might be naturally convergent, and natural coordination points for agents to coordinate on potential-word-meanings.
Then, the main technical result: these “minimal” latents are the same natural abstractions we get from the “information which is highly redundant” or “information relevant at a distance” views.
Background: Latent Variables
Here’s a standard conceptual model for how an agent figures out the concept of e.g. trees. First, the agent receives a bunch of observations from the world (images, for instance), many of which show what you or I would call trees. The agent then clusters together data based on similarity. More precisely, the agent looks for patterns which are well-modeled (i.e. compressed) by assuming that they were generated in the same way. For instance, all those subimages which show what you or I would call trees can be compressed a lot by assuming that there’s a common pattern-generator (i.e. the general concept of “trees”), which is mostly not generating the patterns in other subimages. That common pattern-generator, or whatever set of variables parameterize the pattern-generator, is the “latent” variable.
Mathematically, we usually assume that the latent variable summarizes all the information which any subset of the trees tells us about the rest - i.e. by observing particular trees, we learn general facts about trees (including general facts about particular tree-species), which in turn tell us about the rest of the trees. That means the individual tree-instances are all conditionally independent, given the latent variable. That’s the most common form used for Bayesian clustering models: we try to simultaneously cluster and fit latents such that instances in the same cluster are independent given that cluster’s latent.
The standard formula: we have a bunch of instances all in the same cluster. Then, conditional on the latent variable , the instances should be independent:
(There is some room to complain here - for instance, maybe it intuitively seems like the concept of trees should only include general facts about trees, not about specific species of tree, so instances from the same species should not be independent even conditional on the general concept of trees. We’ll come back to that near the end of the post, once we’ve connected this all back to other approaches to natural abstraction.)
Under this frame, the latents themselves take center stage, and the clusters become less central. Given the latent concept of trees, it’s pretty easy to recognize a tree; the interesting question is what models/parameterizations/etc to use for the general model of trees. More generally, given the latents, it’s relatively easy to compute the probability that any particular instance was generated by a particular latent (i.e. figure out what cluster the instance is in). The main question is what latent models/parameterizations/etc to use for each cluster.
And that’s where the notion of “minimal” or “most general” latents enters.
The key condition for a valid latent is that each instance be independent given the latent. All trees must be independent given the general concept of/facts about trees. But that still leaves a lot of options. For instance, we could declare that the tree-concept consists of all facts about all trees ever. Mathematically, . That is obviously a terrible choice; we gain nothing by using that latent. But it does technically satisfy the conditional independence requirement.
What we’d really like is a latent on the other end of the spectrum - one which includes as little information about the instances as possible, while still satisfying the conditional independence condition. Conceptually, I’ll operationalize that as follows: first, I want some latent which satisfies the conditional independence condition. Second, if anyone else comes along with another latent which satisfies the conditional factorization condition, then I should find that their latent contains at least all the information about the instances which is present in my latent. Or, to put it differently: all information in about must also be present in any other which satisfies the conditional independence condition. Third, for cleanliness, we’ll assume that doesn’t contain any extra “irrelevant” information, e.g. random bits which have nothing at all to do with . (The third condition just eliminates some obviously-silly possibilities; we’ll mostly ignore it.)
Mathematically: call my special minimal latent . First, must itself satisfy the conditional independence condition:
Second, given any other which satisfies the conditional independence condition, must contain at least all the information in about . Graphically,
That second diagram, in equation form:
Caution: the notation and wording have hidden some potentially-confusing conceptual pieces here. When I say e.g. “I have a latent ” or “I want some latent ”, what I really have/want is not the value of random variable , but rather the factorization of the distribution using , i.e.
In other words, I have/want a way to model the distribution of using as a latent, and is in some sense “defined by” the distributions I use. The relevant mathematical object is not itself, but rather the distributions and , which must reproduce the observed distribution over . Similarly, the condition on and asserts that there exists some joint distribution such that the factorization holds, and the marginals match the given distributions , , , . Another way to say this: if I later learn that some other person is using a different latent-factorization to model the same tree-cluster as me, with the same observable distribution over the same physically-observable trees , then I should be able to model their latent, and my latent and the observables all in one distribution , while still being consistent with all the marginals and the conditions.
Anyway, the big question now is: when can I find some which satisfies all conditions - i.e. which satisfies the conditional independence condition and contains at least as little info as any other which satisfies the conditional independence condition?
The Connection to Redundancy
Now for the main trick. To illustrate, we’ll start with a simplified case, in which we have only two instances and (e.g. only two trees).
Our “minimality” condition says that all information about which is in must also be in , for any other latent allowed by conditional independence. So, we’re allowed to pick any latent such that and are independent given . Well, is always independent of everything (including ) given itself. Likewise, is always independent of everything (including ) given itself. So, we could pick either or to be .
Our “minimal” must therefore satisfy:
- All information about which is in must also be in
- All information about which is in must also be in
In other words: all information about which is in must be redundantly present in both and . We must have both:
- independent of given , and
- independent of given
Generalizing beyond the two-instance case, we can’t just use as a latent any more; two tree are not necessarily independent given a third tree. But we can use all but one tree as a latent - i.e. . Any subset of the ’s are independent given all but one . Our “minimal” latent must therefore satisfy: all information about which is in must also be in for any . Any choice of all-but-one instance must still include the information in . In that sense, depends on only the “perfectly redundant” information in .
This connects us directly to natural abstractions as redundant information. Indeed, we can carry over the same idea from that formulation: since contains all the information relevant to , we can resample any individually without changing our information about . Therefore, can only depend on via quantities conserved under repeatedly resampling each individually.
Now to answer our big question from the previous section: I can find some satisfying the conditions exactly when all of the ’s are independent given the “perfectly redundant” information. In that case, I just set to be exactly the quantities conserved under the resampling process, i.e. the perfectly redundant information itself. Conversely, if the ’s are not independent given the perfectly redundant information, then I can’t find any satisfying the conditions.
(A somewhat more complete proof is in A Correspondence Theorem, though it doesn’t make explicit the connection to resampling because the resampling framework for natural abstraction didn’t yet exist when it was written. The results here are essentially a repackaging of that correspondence theorem, but with context, hindsight and a lot more understanding of natural abstraction from other angles.)
The Connection to Information At A Distance
Via redundancy, we can also connect to natural abstraction as information relevant at a distance. By the Telephone Theorem, in order for information to propagate over a long distance, it needs to be near-perfectly conserved through many intermediate layers; therefore the information is arbitrarily-perfectly redundantly represented in the variables constituting each of those layers. Conversely, in a local model (i.e. a model where most things only interact directly with a few “nearby” things), information needs to propagate over a long distance in order to reach very many variables, and therefore needs to propagate over a long distance in order to be redundantly represented in very many variables. Thus the equivalence between information relevant at a distance and highly redundant information, at least in local models. (Abstractions as Redundant Information contains more formal and fleshed-out versions of these arguments, though still not fully rigorous.)
That said, our conditions for so far aren’t quite equivalent to information relevant at a distance. Information relevant at a distance requires massive redundancy, information redundantly represented in many different “places” (i.e. many different or disjoint subsets thereof). In other words, we have to be able to throw away a whole bunch of ’s and still be able to reconstruct . Our conditions so far only require enough redundancy to throw away one at a time, and still reconstruct . If exactly two trees have some perfectly redundant information between them, and that information is entirely absent from all other trees, that would be included in under our current conditions, but might be omitted from information relevant at a distance (modulo some degrees of freedom around what to consider the low-level variables and what corresponding notion of “distance” to use, e.g. if we have separate variables for the state of each tree at each time then even a single tree may be very redundant and contain info relevant to that same tree at a far-distant time). So: includes all information relevant at a distance (at least when exists at all), but may include some other stuff too.
That said, since we’ve found that doesn’t always exist, our conditions probably aren’t quite right anyway. Let’s fix that.
Weakening the Conditional Independence Requirement
We have a minor problem: if the ’s are not conditionally independent given whatever perfectly redundant information they contain, then there isn’t any “minimal” latent satisfying the conditional independence condition.
One simple response: weaken the conditional independence condition. We already mentioned earlier that e.g. it intuitively seems like the concept of trees should only include general facts about trees, not about specific species of tree, so instances from the same species should not be independent even conditional on the general concept of trees. Maybe we can still require most trees to be independent given the general latent concept of trees, but allow for some exceptions?
Coming from another direction: when formulating natural abstractions as information relevant at a distance, the key condition is not that all the variables be conditionally independent, but that variables “far apart” be conditionally independent. Since locality in a large model implies that most variables are “far apart” from most other variables, weakening the conditional independence condition to conditional independence amongst most variables again makes sense.
So, here’s a new conditional independence condition for “large” systems, i.e. systems with an infinite number of ’s: given , any finite subset of the ’s must be approximately independent (i.e. mutual information below some small ) of all but a finite number of the other ’s. Now must satisfy this new condition, and be “minimal” (in the same sense as earlier) among latents which satisfy the new condition.
Returning to our argument about redundancy and resampling: in the step where we chose to be , we can now instead take for any finite subset . Otherwise, the argument remains the same. As before, we take to be whatever quantities are conserved by the resampling process, i.e. the highly redundant information. But now, perfect redundancy between two variables is no longer enough for conservation in the resampling process; will only include highly redundant information, information present in enough variables that we can drop any finite subset and still exactly recompute the information. Is that highly redundant information enough to satisfy our new independence condition?
At least in systems where each variable directly interacts with only a finite number of neighbors (e.g. a Bayes net or Markov random field of bounded degree), the answer is “yes”. Similar to the arguments in the “Connection to Information At A Distance” section earlier, information would have to propagate over a long distance in order to end up relevant to infinitely many variables, and therefore by the Telephone Theorem such information must be highly redundant, so that information is fully encoded in . Conditional on , then, only finitely many variables can have mutual information above a given cutoff . The relevant mathematical argument here is that information at a distance approaches zero conditional on the quantities conserved by resampling, which is in Abstractions as Redundant Information.
I also currently expect that we can extend beyond systems in which each variable directly interacts with only a finite number of neighbors. Intuitively, it seems like should exist for large systems in full generality, under our weakened conditional independence condition. It also seems like approximate versions of the arguments should go through. But I don’t have proofs or even proof sketches for those claims yet.
There are currently three main framings of natural abstraction which I think about: information at a distance, highly redundant information, and minimal latents. We can view the concept of trees as:
- A summary of information about a bunch of trees which is relevant to far-away trees
- Information which is highly redundant across many trees
- The minimal latent variables which induces approximate conditional independence between most trees
(In all cases, the abstract-summarization-method is coupled with clustering to figure out which chunks of reality to consider trees in the first place.)
For infinite systems with locality (i.e. variables only directly interact with a bounded number of neighbors), these three framings yield equivalent abstract summaries. Though I don’t have full proofs yet, I expect that the redundancy and minimal latent framings are equivalent even without locality (in which case the “information relevant far away” frame doesn’t apply, since it inherently requires locality). I'd love if someone else proved this claim, by the way!
The convergence between these three approaches to natural abstraction, which look so different at first glance, is currently one of my main pieces of mathematical evidence that they’re on the right track.
I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.
You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most 264,000,000,000,000 different functions, which is a tiny tiny fraction of the full space of 228,000,000 possible functions. You're already getting at least 28,000,000−64,000,000,000,000 of the bits just by choosing the network architecture.
(This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)
See also Superexponential Concept Space, and Simple Words, from the Sequences:
Wait but they see a ton of images that they aren't told contain apples, right? Surely that should count. (Probably not 2^big_number bits tho)
Yes! There's two ways that can be relevant. First, a ton of bits presumably come from unsupervised learning of the general structure of the world. That part also carries over to natural abstractions/minimal latents: the big pile of random variables from which we're extracting a minimal latent is meant to represent things like all those images the toddler sees over the course of their early life.
Second, sparsity: most of the images/subimages which hit my eyes do not contain apples. Indeed, most images/subimages which hit my eyes do not contain instances of most abstract object types. That fact could either be hard-coded in the toddler's prior, or learned insofar as it's already learning all these natural latents in an unsupervised way and can notice the sparsity. So, when a parent says "apple" while there's an apple in front of the toddler, sparsity dramatically narrows down the space of things they might be referring to.
As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for Λ∗? why does the direction of the arrow between Λ∗ and Λ seem inconsistent?).
I think a “minimal latent,” as you have defined it equationally, is a categorical product (of the Xi) in the coslice category Ω↓Stoch where Stoch is the category of Markov kernels and Ω is the implicit sample space with respect to which all the random variables are defined.
What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they're described as distributions over the local deterministic constraints. The current post also mentions that the "core" part of an abstraction is the distribution P[Λ], and its ability to explain variance in individual instances of Xi.
Applying the deterministic-constraint framework to trees, I assume it says something like "given certain ground-truth conditions (e. g., the environment of a savannah + the genetic code of a given tree), the growth of tree branches of that tree species is constrained like so, the rate of mutation is constrained like so, the spread of saplings like so, and therefore we should expect to see such-and-such distribution of trees over the landscape, and they'll have such-and-such forms".
Is that roughly correct? Have you arrived at any different framework for thinking about type signatures?
Roughly, yeah. I currently view the types of P[Λ] and P[X|Λ] as the "low-level" type signature of abstraction, in some sense to be determined. I expect there are higher-level organizing principles to be found, and those will involve refinement of the types and/or different representations.
Related background on the philosophical problem: gavagai
This touches on some issues I'd wanted to discuss: abstraction hierarchies, and incompatible abstraction layers.
Suppose we have a number of tree-instances X1,X2,...,Xn. Given a sufficiently large ϵ, we can compute a valid "general tree abstraction". But what if we've picked a lower ϵ, and are really committed to keeping it low, for some reason?
Here's a trick:
We separate tree-instances into sets S1,S2,...,Sm such that we can compute the corresponding "first-order" abstractions Λ1,Λ2,...,Λm over each set, and they would be valid, in the sense that any two Xi,Xj∈Sk would have mutual information below ϵ when conditioned on Λk. Plausibly, that would recover a set of abstractions corresponding to "tree species".
Then we repeat the trick: split the first-order abstractions Λ1,Λ2,...,Λm into sets, and generate second-order abstractions ΛII1,ΛII2,...,ΛIIq. That may recover, say, genuses.
We do this iteratively until getting a single nth-order abstraction ΛΩ, standing-in for "all trees".
I think it would all have sensible behavior. Conditioning any given tree-instance Xi on ΛΩ would only explain general facts about the trees, as we wanted. Conditioning on the appropriate lower-level abstractions would explain progressively more information about Xi. Conditioning a Xi∉Sj on ΛIj, in turn, would turn up some information that's in excess, or make some wrong predictions, but get the general facts right. (And you can also condition first-order abstractions on higher-order abstractions, etc.)
The question is: how do we pick ϵ? One potential answer is that, given some set of instances X1,X2,...,Xn, we always try for the lowest ϵ possible. Perhaps that's the mathematical description of taxonomy, even? "Given a set of instances, generate the abstraction hierarchy that minimizes ϵ at each abstraction-level."
There's a different way to go about it, though. Suppose that, instead of picking ϵ and then deciding on groupings, we first split instances X1,X2,...,Xn into sets, according to some rule? We have to be able to do that: we've somehow decided to abstract over these specific X1,X2,...,Xn to begin with, so we already have some way to generate groupings. (We've somehow arrived at a set of tree-instances to abstract over, instead of a mixture of cars, trees, towels, random objects...)
So, we pick some "rule", which is likely a natural abstraction in itself, or defined over one. Like "trees that are N years old" with separate set for every N, or "this tree has leaves" y/n, or "trees in %person%'s backyard" for every %person%. Then we split the instances into sets according to that rule, and try to summarize every set.
Important: that way, we may get meaningfully different ϵs for every set! For example, suppose we cluster trees by whose backyard they're in.
What does this approach yield us?
Consider humans, geopolitical entities, and ideological movements. They don't have a clear hierarchy: while humans are what constitutes the latter two "layers", ideological movements are not split across geopolitical lines (same ideologies can be present in different countries), and geopolitical entities are not split along ideological lines (a given government can have multiple competing ideologies). By implication, once you're viewing the world in terms of ideologies, you can't recover governments from this data; nor vice versa.
Similarly: As we've established, we can split trees by species ΛI1,...,ΛIn and by "whose backyard they're in" ΦI1,...,ΦIg. But: we would not be able to recover genuses ΛII1,...,ΛIId from the backyard-data ΦI1,...,ΦIg! Once we've committed to the backyard-classification, we've closed-off species-classification!
I propose calling such incompatible abstraction hierarchies abstraction layers. Behind every abstraction layer, there's some rule by which we're splitting instances into sets, and such rules are/are-defined-over natural abstractions, in turn.
Does all that make sense, on your model?
And, I guess, such that there's at least one set with more than one instance, to forbid the uninteresting trivial case where there's a one-member set for every initial instance. More generally, we'd want the number of sets to be "small" compared to the number of instances, in some sense of "small".
Reason for the change in notation from Λ will be apparent later.
Or maybe it's still useful, for general-purpose search via constraints?