The "Minimal Latents" Approach to Natural Abstractions

Vivek Hebbar (3y)

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the inductive bias.

Rohin Shah (3y)

You don't need to guess; it's clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most $2^{64,000,000,000,000}$ different functions, which is a tiny tiny fraction of the full space of $2^{2^{8,000,000}}$ possible functions. You're already getting at least $2^{8,000,000} - 64,000,000,000,000$ of the bits just by choosing the network architecture.

(This does assume things like "the neural network can learn the correct function rather than a nearly-correct function" but similarly the argument in the OP assumes "the toddler does learn the correct function rather than a nearly-correct function".)
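The arithmetic can be checked directly. The sketch below assumes, as the figures in this comment suggest but do not state, that inputs are 8,000,000 bits and outputs are a single bit, so that singling out one function from the full space takes $2^{8,000,000}$ bits. The variable names are illustrative only.

```python
# Assumed setup (not stated in the comment): 8,000,000-bit inputs, 1-bit outputs.
INPUT_BITS = 8_000_000
N_PARAMS = 10**12        # 1 trillion parameters
BITS_PER_PARAM = 64

# Bits needed to single out one function among all 2**(2**INPUT_BITS) functions.
bits_for_full_space = 2**INPUT_BITS

# Bits needed to single out one setting of the network's parameters.
bits_for_network = N_PARAMS * BITS_PER_PARAM

# Lower bound on the bits already supplied by fixing the architecture.
bits_from_architecture = bits_for_full_space - bits_for_network

print(bits_for_network)                     # 64000000000000
# The difference is too large to print in decimal, so report its size in bits.
print(bits_from_architecture.bit_length())  # 8000000
```

The parameter count barely dents the total: the difference is still an 8,000,000-bit number.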

LawrenceC (3y)

See also Superexponential Concept Space, and Simple Words, from the Sequences:

By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out. You'd have to see every possible example, in fact.
[...]
From this perspective, learning doesn't just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.
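A brute-force toy version of the quoted counting argument, as a rough sketch: with 3 binary attributes instead of forty, every concept (an In/Out label for each possible example) can be enumerated, and each labeled example seen halves the number of concepts still consistent with the data. The function name and the choice of labels are hypothetical.

```python
from itertools import product

def consistent_concepts(n_attributes, labeled):
    """Count concepts (one In/Out label per possible example) that agree with `labeled`."""
    examples = list(product([0, 1], repeat=n_attributes))
    count = 0
    for labels in product([0, 1], repeat=len(examples)):
        concept = dict(zip(examples, labels))
        if all(concept[x] == y for x, y in labeled.items()):
            count += 1
    return count

n = 3
examples = list(product([0, 1], repeat=n))  # 2**3 = 8 possible examples

seen = {}
remaining = [consistent_concepts(n, seen)]
for x in examples:
    seen[x] = 1  # hypothetical label: this example is "In"
    remaining.append(consistent_concepts(n, seen))

print(remaining)  # [256, 128, 64, 32, 16, 8, 4, 2, 1]
print(2**40)      # 1099511627776 possible examples for forty binary attributes
```

Without any prior, the concept is pinned down only after all 8 examples are labeled; for forty attributes the same reasoning requires all $2^{40}$ examples.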

DanielFilan (3y)

Empirically, human toddlers are able to recognize apples by sight after seeing maybe one to three examples. (Source: people with kids.)

Wait but they see a ton of images that they aren't told contain apples, right? Surely that should count. (Probably not 2^big_number bits tho)

johnswentworth (3y)

Yes! There's two ways that can be relevant. First, a ton of bits presumably come from unsupervised learning of the general structure of the world. That part also carries over to natural abstractions/minimal latents: the big pile of random variables from which we're extracting a minimal latent is meant to represent things like all those images the toddler sees over the course of their early life.

Second, sparsity: most of the images/subimages which hit my eyes do not contain apples. Indeed, most images/subimages which hit my eyes do not contain instances of most abstract object types. That fact could either be hard-coded in the toddler's prior, or learned insofar as it's already learning all these natural latents in an unsupervised way and can notice the sparsity. So, when a parent says "apple" while there's an apple in front of the toddler, sparsity dramatically narrows down the space of things they might be referring to.

davidad (3y)

As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for ? why does the direction of the arrow between $Λ^{*}$ and $Λ$ seem inconsistent?).

I think a “minimal latent,” as you have defined it equationally, is a categorical product (of the $X_{i}$) in the coslice category $Ω ↓ \mathrm{Stoch}$, where $\mathrm{Stoch}$ is the category of Markov kernels and $Ω$ is the implicit sample space with respect to which all the random variables are defined.

Thane Ruthenis (3y)

What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they're described as distributions over the local deterministic constraints. The current post also mentions that the "core" part of an abstraction is the distribution , and its ability to explain variance in individual instances of $X_{i}$ .

Applying the deterministic-constraint framework to trees, I assume it says something like "given certain ground-truth conditions (e.g., the environment of a savannah + the genetic code of a given tree), the growth of tree branches of that tree species is constrained like so, the rate of mutation is constrained like so, the spread of saplings like so, and therefore we should expect to see such-and-such distribution of trees over the landscape, and they'll have such-and-such forms".

Is that roughly correct? Have you arrived at any different framework for thinking about type signatures?

johnswentworth (3y)

Roughly, yeah. I currently view the types of and $P [X | Λ]$ as the "low-level" type signature of abstraction, in some sense to be determined. I expect there are higher-level organizing principles to be found, and those will involve refinement of the types and/or different representations.

romeostevensit (3y)

Related background on the philosophical problem: gavagai

Thane Ruthenis (3y)

This touches on some issues I'd wanted to discuss: abstraction hierarchies, and incompatible abstraction layers.

So, here’s a new conditional independence condition for “large” systems, i.e. systems with an infinite number of $X_{i}$’s: given $Λ$, any finite subset of the $X_{i}$’s must be approximately independent (i.e. mutual information below some small $ϵ$) of all but a finite number of the other $X_{i}$’s.

Suppose we have a number of tree-instances $X_{1}, X_{2}, \ldots, X_{n}$. Given a sufficiently large $ϵ$, we can compute a valid "general tree abstraction". But what if we've picked a lower $ϵ$, and are really committed to keeping it low, for some reason?

Here's a trick:

We separate tree-instances into sets $S_{1}, S_{2}, \ldots, S_{m}$ such that we can compute the corresponding "first-order" abstractions $Λ_{1}, Λ_{2}, \ldots, Λ_{m}$ over each set, and they would be valid, in the sense that any two $X_{i}, X_{j} \in S_{k}$ would have mutual information below $ϵ$ when conditioned on $Λ_{k}$^[1]. Plausibly, that would recover a set of abstractions corresponding to "tree species".

Then we repeat the trick: split the first-order abstractions $Λ_{1}, Λ_{2}, \ldots, Λ_{m}$ into sets, and generate second-order abstractions $Λ_{1}^{II}, Λ_{2}^{II}, \ldots, Λ_{q}^{II}$. That may recover, say, genera.

We do this iteratively until getting a single nth-order abstraction $Λ^{Ω}$, standing in for "all trees".

I think it would all have sensible behavior. Conditioning any given tree-instance $X_{i}$ on $Λ^{Ω}$ would only explain general facts about the trees, as we wanted. Conditioning on the appropriate lower-level abstractions would explain progressively more information about $X_{i}$. Conditioning an $X_{i} \notin S_{j}$ on $Λ_{j}^{I}$, in turn, would turn up some information that's in excess, or make some wrong predictions, but get the general facts right. (And you can also condition first-order abstractions on higher-order abstractions, etc.)
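To make the validity condition concrete (pairwise mutual information below $ϵ$ once we condition on the set's abstraction), here is a toy sketch. It assumes a made-up model that is not from the post: a uniform binary latent standing in for "species", and two tree-instances that each copy it with an independent 10% chance of flipping. The instances are correlated unconditionally, but their conditional mutual information given the latent vanishes, so the latent passes the check for any positive $ϵ$.

```python
from itertools import product
from math import log2

NOISE = 0.1  # hypothetical chance that an instance deviates from its latent

def joint():
    """p(lam, x1, x2): uniform binary latent; instances independent given it."""
    p = {}
    for lam, x1, x2 in product([0, 1], repeat=3):
        p1 = 1 - NOISE if x1 == lam else NOISE
        p2 = 1 - NOISE if x2 == lam else NOISE
        p[(lam, x1, x2)] = 0.5 * p1 * p2
    return p

def marginal(p, keep):
    """Sum out every coordinate whose index is not in `keep`."""
    out = {}
    for key, prob in p.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out

def mutual_information(p):
    """I(X1; X2) in bits."""
    p12, p1, p2 = marginal(p, (1, 2)), marginal(p, (1,)), marginal(p, (2,))
    return sum(q * log2(q / (p1[(a,)] * p2[(b,)]))
               for (a, b), q in p12.items() if q > 0)

def conditional_mutual_information(p):
    """I(X1; X2 | Lambda) in bits."""
    pl, pl1, pl2 = marginal(p, (0,)), marginal(p, (0, 1)), marginal(p, (0, 2))
    return sum(q * log2(q * pl[(lam,)] / (pl1[(lam, a)] * pl2[(lam, b)]))
               for (lam, a, b), q in p.items() if q > 0)

p = joint()
mi = mutual_information(p)
cmi = conditional_mutual_information(p)

EPSILON = 1e-6  # hypothetical tolerance
print(mi > EPSILON, abs(cmi) < EPSILON)  # True True
```

In a real hierarchy the latent would have to be found rather than assumed, and the check would run over every pair in each set; this only illustrates what a single check computes.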

The question is: how do we pick $ϵ$? One potential answer is that, given some set of instances $X_{1}, X_{2}, \ldots, X_{n}$, we always try for the lowest $ϵ$ possible^[1]. Perhaps that's the mathematical description of taxonomy, even? "Given a set of instances, generate the abstraction hierarchy that minimizes $ϵ$ at each abstraction-level."

There's a different way to go about it, though. Suppose that, instead of picking $ϵ$ and then deciding on groupings, we first split instances $X_{1}, X_{2}, \ldots, X_{n}$ into sets, according to some rule? We have to be able to do that: we've somehow decided to abstract over these specific $X_{1}, X_{2}, \ldots, X_{n}$ to begin with, so we already have some way to generate groupings. (We've somehow arrived at a set of tree-instances to abstract over, instead of a mixture of cars, trees, towels, random objects...)

So, we pick some "rule", which is likely a natural abstraction in itself, or defined over one. Like "trees that are N years old" with a separate set for every N, or "this tree has leaves" y/n, or "trees in %person%'s backyard" for every %person%. Then we split the instances into sets according to that rule, and try to summarize every set.

Important: that way, we may get meaningfully different $ϵ$s for every set! For example, suppose we cluster trees by whose backyard they're in.

- Person A has trees of several different species growing in their yard. For them, we compute $S_{A}$, the corresponding abstraction/summary $Φ_{A}$^[2], and some $ϵ_{A}$ that makes $Φ_{A}$ a valid abstraction.
- Person B only plants trees of a single species. Again, we compute $S_{B}$, $Φ_{B}$, $ϵ_{B}$.

Obviously, $ϵ_{A} > ϵ_{B}$.

What does this approach yield us?

- It's a tool of analysis. We can try different rules on for size, and see if that reveals any interesting data. (Do most people grow only trees of a single species in their yard?)
- It's potentially useful for general-purpose search via constraints. Consider two different first-order abstractions, "trees of species z" $Λ_{z}$ and trees-in-my-backyard $Φ_{my}$. Computing the second-order abstraction from them would be rather arbitrary, but it's something we may want to do during a specific planning process!
  - (Though note that combining any two nth-order abstractions would result in an (n+1)th-order abstraction that has at least as much information as $Λ^{Ω}$. I.e., any given valid abstraction hierarchy over a given set of instances terminates in the same max-level abstraction. I'm not sure if that's useful.)
- It allows abstraction layers, as outlined below.

Consider humans, geopolitical entities, and ideological movements. They don't have a clear hierarchy: while humans are what constitutes the latter two "layers", ideological movements are not split across geopolitical lines (same ideologies can be present in different countries), and geopolitical entities are not split along ideological lines (a given government can have multiple competing ideologies). By implication, once you're viewing the world in terms of ideologies, you can't recover governments from this data; nor vice versa.

Similarly: As we've established, we can split trees by species $Λ_{1}^{I}, \ldots, Λ_{n}^{I}$ and by "whose backyard they're in" $Φ_{1}^{I}, \ldots, Φ_{g}^{I}$. But: we would not be able to recover genera $Λ_{1}^{II}, \ldots, Λ_{d}^{II}$ from the backyard-data $Φ_{1}^{I}, \ldots, Φ_{g}^{I}$! Once we've committed to the backyard-classification, we've closed off species-classification!

I propose calling such incompatible abstraction hierarchies abstraction layers. Behind every abstraction layer, there's some rule by which we're splitting instances into sets, and such rules are/are-defined-over natural abstractions, in turn.
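One very coarse way to picture this incompatibility, as a sketch only: model each first-order grouping as nothing more than a partition of the instances (a simplification, since the abstractions here are latent variables, not partitions), and check whether either partition refines the other. If neither does, neither grouping can be rebuilt by merging sets of the other. The tree names, labels, and helper functions below are hypothetical.

```python
def refines(fine, coarse):
    """True if every block of `fine` lies inside a single block of `coarse`."""
    return all(any(block <= big for big in coarse) for block in fine)

# Hypothetical tree-instances, labeled by (species, backyard owner).
# Person A grows two species; person B grows only one.
trees = {
    "t1": ("oak", "A"),
    "t2": ("pine", "A"),
    "t3": ("oak", "B"),
    "t4": ("oak", "B"),
}

def partition_by(index):
    """Group tree names by the label at position `index`."""
    blocks = {}
    for name, labels in trees.items():
        blocks.setdefault(labels[index], set()).add(name)
    return list(blocks.values())

by_species = partition_by(0)   # {t1, t3, t4} and {t2}
by_backyard = partition_by(1)  # {t1, t2} and {t3, t4}

print(refines(by_species, by_backyard))  # False
print(refines(by_backyard, by_species))  # False
```

Both partitions do refine the trivial one-block partition of all trees, which loosely mirrors the point that every hierarchy over the same instances ends in the same top-level abstraction.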

Does all that make sense, on your model?

^[1] And, I guess, such that there's at least one set with more than one instance, to forbid the uninteresting trivial case where there's a one-member set for every initial instance. More generally, we'd want the number of sets to be "small" compared to the number of instances, in some sense of "small".
^[2] Reason for the change in notation from $Λ$ will be apparent later.
^[3] Or maybe it's still useful, for general-purpose search via constraints?
