How uniform is the neocortex?
The neocortex is the part of the human brain responsible for higher-order functions like sensory perception, cognition, and language, and has been hypothesized to be uniformly composed of general-purpose data-processing modules. What does the currently available evidence suggest about this hypothesis?
"How uniform is the neocortex?” is one of the background variables in my framework for AGI timelines. My aim for this post is not to present a complete argument for some view on this variable, so much as it is to:
- present some considerations I’ve encountered that shed light on this variable
- invite a collaborative effort among readers to shed further light on this variable (e.g. by leaving comments about considerations I haven’t included, or pointing out mistakes in my analyses)
There’s a long list of different regions in the neocortex, each of which appears to be responsible for something totally different. One interpretation is that these cortical regions are doing fundamentally different things, and that we acquired the capacities to do all these different things over hundreds of millions of years of evolution.
A radically different perspective, first put forth by Vernon Mountcastle in 1978, hypothesizes that the neocortex is implementing a single general-purpose data processing algorithm all throughout. From the popular neuroscience book On Intelligence, by Jeff Hawkins:
[...] Mountcastle points out that the neocortex is remarkably uniform in appearance and structure. The regions of cortex that handle auditory input look like the regions that handle touch, which look like the regions that control muscles, which look like Broca's language area, which look like practically every other region of the cortex. Mountcastle suggests that since these regions all look the same, perhaps they are actually performing the same basic operation! He proposes that the cortex uses the same computational tool to accomplish everything it does.
Mountcastle [...] shows that despite the differences, the neocortex is remarkably uniform. The same layers, cell types, and connections exist throughout. [...] The differences are often so subtle that trained anatomists can't agree on them. Therefore, Mountcastle argues, all regions of the cortex are performing the same operation. The thing that makes the vision area visual and the motor area motoric is how the regions of cortex are connected to each other and to other parts of the central nervous system.
In fact, Mountcastle argues that the reason one region of cortex looks slightly different from another is because of what it is connected to, and not because its basic function is different. He concludes that there is a common function, a common algorithm, that is performed by all the cortical regions. Vision is no different from hearing, which is no different from motor output. He allows that our genes specify how the regions of cortex are connected, which is very specific to function and species, but the cortical tissue itself is doing the same thing everywhere.
If Mountcastle is correct, the algorithm of the cortex must be expressed independently of any particular function or sense. The brain uses the same process to see as to hear. The cortex does something universal that can be applied to any type of sensory or motor system.
The rest of this post will review some of the evidence around Mountcastle’s hypothesis.
Cortical function is largely determined by input data
When visual inputs are fed into the auditory cortices of infant ferrets, those auditory cortices develop into functional visual systems. This suggests that different cortical regions are all capable of general-purpose data processing.
Humans can learn how to perform forms of sensory processing we haven’t evolved to perform—blind people can learn to see with their tongues, and can learn to echolocate well enough to discern density and texture. On the flip side, forms of sensory processing that we did evolve to perform depend heavily on the data we’re exposed to—for example, cats exposed only to horizontal edges early in life don’t have the ability to discern vertical edges later in life. This suggests that our capacities for sensory processing stem from some sort of general-purpose data processing, rather than innate machinery handed to us by evolution.
Blind people who learn to echolocate do so with the help of repurposed visual cortices, and they can learn to read Braille using repurposed visual cortices. Our visual cortices did not evolve to be utilized in these ways, suggesting that the visual cortex is doing some form of general-purpose data processing.
There’s a man who had the entire left half of his brain removed when he was 5, who has above-average intelligence, and went on to graduate college and maintain steady employment. This would only be possible if the right half of his brain were capable of taking on the cognitive functions of the left half of the brain.
The patterns identified by the primary sensory cortices (for vision, hearing, and touch) overlap substantially with the patterns that numerous different unsupervised learning algorithms identify from the same data, suggesting that the different cortical regions (along with the different unsupervised learning algorithms) are all just doing some form of general-purpose pattern recognition on their input data.
Deep learning and cortical generality
The above evidence does not rule out the possibility that the cortex's apparent adaptability stems from developmental triggers, rather than some capability for general-purpose data-processing. By analogy, stem cells all start out very similar, only to differentiate into cells with functions tailored to the contexts in which they find themselves. It’s possible that different cortical regions have hard-coded genomic responses for handling particular data inputs, such that the cortex gives one hard-coded response when it detects that it’s receiving visual data, another hard-coded response when it detects that it’s receiving auditory data, etc.
If this were the case, the cortex’s data-processing capabilities could best be understood as specialized responses to distinct evolutionary needs, and our ability to process data that we haven’t evolved to process (e.g. being able to look at a Go board and intuitively discern what a good next move would be) would most likely utilize a complicated mishmash of heterogeneous data-processing abilities acquired over evolutionary timescales.
Before I learned about any of the advancements in deep learning, this was my most likely guess about how the brain worked. It had always seemed to me that the hardest and most mysterious part of intelligence was intuitive pattern-recognition, and that the various forms of intuitive processing that let us recognize images, say sentences, and play Go might be totally different and possibly arbitrarily complex.
So I was very surprised when I learned that a single general method in deep learning (training an artificial neural network on massive amounts of data using gradient descent) led to performance comparable or superior to humans’ in tasks as disparate as image classification, speech synthesis, and playing Go. I found superhuman Go performance particularly surprising—intuitive judgments of Go boards encode distillations of high-level strategic reasoning, and are highly sensitive to small changes in input. Neither of these is true for sensory processing, so my prior guess was that the methods that worked for sensory processing wouldn’t have been sufficient for playing Go as well as humans.
This suggested to me that there’s nothing fundamentally complex or mysterious about intuition, and that seemingly-heterogeneous forms of intuitive processing can result from simple and general learning algorithms. From this perspective, it seems most parsimonious to explain the cortex’s seemingly general-purpose data-processing capabilities as resulting straightforwardly from a general learning algorithm implemented all throughout the cortex. (This is not to say that I think the cortex is doing what artificial neural networks are doing—rather, I think deep learning provides evidence that general learning algorithms exist at all, which increases the prior likelihood on the cortex implementing a general learning algorithm.)
The strength of this conclusion hinges on the extent to which the “artificial intuition” that current artificial neural networks (ANNs) are capable of is analogous to the intuitive processing that humans are capable of. It’s possible that the “intuition” utilized by ANNs is deeply analogous to human intuition, in which case the generality of ANNs would be very informative about the generality of cortical data-processing. It's also possible that "artificial intuition" is different in kind from human intuition, or that it only captures a small fraction of what goes into human intuition, in which case the generality of ANNs would not be very informative about the generality of cortical data-processing.
It seems that experts are divided about how analogous these forms of intuition are, and I conjecture that this is a major source of disagreement about overall AI timelines. Shane Legg (a cofounder of DeepMind, a leading AI lab) was talking about how deep belief networks might be able to replicate the function of the cortex before deep learning took off, and he has been predicting human-level AGI in the 2020s since 2009. Eliezer Yudkowsky has directly talked about AlphaGo providing evidence of "neural algorithms that generalize well, the way that the human cortical algorithm generalizes well" as an indication that AGI might be near. Rodney Brooks (the former director of MIT’s AI lab) has written about how deep learning is not capable of real perception or manipulation, and thinks AGI is over 100 years away. Gary Marcus has described deep learning as a “wild oversimplification” of the "hundreds of anatomically and likely functionally [distinct] areas" of the cortex, and estimates AGI to be 20-50 years away.
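To make concrete what "a single general method" means in this section, here is a minimal sketch in which one unchanged training routine ends up computing different functions purely because of the data it is given. Everything in it is an illustrative assumption rather than something taken from the sources discussed above: `train` and `predict` are hypothetical names, and the model is a single logistic unit trained by gradient descent, a deliberately tiny stand-in for the deep networks behind the image, speech, and Go results.

```python
import math

def train(examples, steps=5000, lr=1.0):
    """Fit one logistic unit by full-batch gradient descent.

    `examples` is a list of (inputs, target) pairs with target 0 or 1.
    Nothing in this routine is specific to any particular task.
    """
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in examples:
            z = bias + sum(w * x for w, x in zip(weights, inputs))
            # Derivative of the log loss with respect to z.
            error = 1.0 / (1.0 + math.exp(-z)) - target
            for i, x in enumerate(inputs):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias

def predict(model, inputs):
    """Return the unit's hard 0/1 decision for one input."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if z > 0 else 0

corners = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Same learner, same hyperparameters; only the training data differs.
and_model = train([(x, int(x[0] and x[1])) for x in corners])
or_model = train([(x, int(x[0] or x[1])) for x in corners])

print([predict(and_model, x) for x in corners])  # [0, 0, 0, 1]
print([predict(or_model, x) for x in corners])   # [0, 1, 1, 1]
```

The toy is far too small to say anything about intuition or Go. It only shows the shape of the claim: the learning procedure is held fixed, and what it ends up doing is determined by what it was trained on.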
Canonical microcircuits for predictive coding
If the cortex were uniform, what might it actually be doing uniformly?
The cortex has been hypothesized to consist of canonical microcircuits that implement predictive coding. In a nutshell, predictive coding (aka predictive processing) is a theory of brain function which hypothesizes that the cortex learns hierarchical structure of the data it receives, and uses this structure to encode predictions about future sense inputs, resulting in “controlled hallucinations” that we interpret as direct perception of the world.
On Intelligence has an excerpt that cleanly communicates what I mean by “learning hierarchical structure”:
[...] The real world's nested structure is mirrored by the nested structure of your cortex.
What do I mean by a nested or hierarchical structure? Think about music. Notes are combined to form intervals. Intervals are combined to form melodic phrases. Phrases are combined to form melodies or songs. Songs are combined into albums. Think about written language. Letters are combined to form syllables. Syllables are combined to form words. Words are combined to form clauses and sentences. Looking at it the other way around, think about your neighborhood. It probably contains roads, schools, and houses. Houses have rooms. Each room has walls, a ceiling, a floor, a door, and one or more windows. Each of these is composed of smaller objects. Windows are made of glass, frames, latches, and screens. Latches are made from smaller parts like screws.
Take a moment to look up at your surroundings. Patterns from the retina entering your primary visual cortex are being combined to form line segments. Line segments combine to form more complex shapes. These complex shapes are combining to form objects like noses. Noses are combining with eyes and mouths to form faces. And faces are combining with other body parts to form the person who is sitting in the room across from you.
All objects in your world are composed of subobjects that occur consistently together; that is the very definition of an object. When we assign a name to something, we do so because a set of features consistently travels together. A face is a face precisely because two eyes, a nose, and a mouth always appear together. An eye is an eye precisely because a pupil, an iris, an eyelid, and so on, always appear together. The same can be said for chairs, cars, trees, parks, and countries. And, finally, a song is a song because a series of intervals always appear together in sequence.
In this way the world is like a song. Every object in the world is composed of a collection of smaller objects, and most objects are part of larger objects. This is what I mean by nested structure. Once you are aware of it, you can see nested structures everywhere. In an exactly analogous way, your memories of things and the way your brain represents them are stored in the hierarchical structure of the cortex. Your memory of your home does not exist in one region of cortex. It is stored over a hierarchy of cortical regions that reflect the hierarchical structure of the home. Large-scale relationships are stored at the top of the hierarchy and small-scale relationships are stored toward the bottom.
The design of the cortex and the method by which it learns naturally discover the hierarchical relationships in the world. You are not born with knowledge of language, houses, or music. The cortex has a clever learning algorithm that naturally finds whatever hierarchical structure exists and captures it.
The clearest evidence that the brain is learning hierarchical structure comes from the visual system. The visual cortex is known to have edge detectors at the lowest levels of processing and, much further up the hierarchy, neurons that fire when shown images of particular people, like Bill Clinton.
What does predictive coding say the cortex does with this learned hierarchical structure? From an introductory blog post about predictive processing:
[...] the brain is a multi-layer prediction machine. All neural processing consists of two streams: a bottom-up stream of sense data, and a top-down stream of predictions. These streams interface at each level of processing, comparing themselves to each other and adjusting themselves as necessary.
The bottom-up stream starts out as all that incomprehensible light and darkness and noise that we need to process. It gradually moves up all the cognitive layers that we already knew existed – the edge-detectors that resolve it into edges, the object-detectors that shape the edges into solid objects, et cetera.
The top-down stream starts with everything you know about the world, all your best heuristics, all your priors, [all the structure you’ve learned,] everything that’s ever happened to you before – everything from “solid objects can’t pass through one another” to “e=mc^2” to “that guy in the blue uniform is probably a policeman”. It uses its knowledge of concepts to make predictions – not in the form of verbal statements, but in the form of expected sense data. It makes some guesses about what you’re going to see, hear, and feel next, and asks “Like this?” These predictions gradually move down all the cognitive layers to generate lower-level predictions. If that uniformed guy was a policeman, how would that affect the various objects in the scene? Given the answer to that question, how would it affect the distribution of edges in the scene? Given the answer to that question, how would it affect the raw-sense data received?
As these two streams move through the brain side-by-side, they continually interface with each other. Each level receives the predictions from the level above it and the sense data from the level below it. Then each level uses Bayes’ Theorem to integrate these two sources of probabilistic evidence as best it can.
“To deal rapidly and fluently with an uncertain and noisy world, brains like ours have become masters of prediction – surfing the waves of noisy and ambiguous sensory stimulation by, in effect, trying to stay just ahead of them. A skilled surfer stays ‘in the pocket’: close to, yet just ahead of the place where the wave is breaking. This provides power and, when the wave breaks, it does not catch her. The brain’s task is not dissimilar. By constantly attempting to predict the incoming sensory signal we become able [...] to learn about the world around us and to engage that world in thought and action.”
The result is perception, which the PP theory describes as “controlled hallucination”. You’re not seeing the world as it is, exactly. You’re seeing your predictions about the world, cashed out as expected sensations, then shaped/constrained by the actual sense data.
An illustration of predictive processing, from the same source:
This demonstrates the degree to which the brain depends on top-down hypotheses to make sense of the bottom-up data. To most people, these two pictures start off looking like incoherent blotches of light and darkness. Once they figure out what they are (spoiler) the scene becomes obvious and coherent. According to the predictive processing model, this is how we perceive everything all the time – except usually the concepts necessary to make the scene fit together come from our higher-level predictions instead of from clicking on a spoiler link.
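The integration step described in the excerpt above, where each level combines a top-down prediction with bottom-up sense data using Bayes’ Theorem, can be made concrete under a strong simplifying assumption: that both sources are one-dimensional Gaussians. In that special case the posterior is a precision-weighted average of the two. The `Belief` class and the `integrate` function below are hypothetical names chosen for illustration; they are not taken from the quoted source or from any predictive processing library.

```python
from dataclasses import dataclass

@dataclass
class Belief:
    mean: float      # best guess about some feature (e.g. an edge's orientation)
    variance: float  # uncertainty attached to that guess

def integrate(prediction: Belief, sense_data: Belief) -> Belief:
    """Combine a top-down prediction with bottom-up evidence.

    Assumes both are Gaussian, so Bayes' rule reduces to a
    precision-weighted average (precision = 1 / variance).
    """
    precision_pred = 1.0 / prediction.variance
    precision_data = 1.0 / sense_data.variance
    variance = 1.0 / (precision_pred + precision_data)
    mean = variance * (precision_pred * prediction.mean
                       + precision_data * sense_data.mean)
    return Belief(mean, variance)

# Equally reliable prediction and input: the result lands halfway between them.
balanced = integrate(Belief(0.0, 1.0), Belief(2.0, 1.0))

# Confident prediction, noisy input: the result stays close to the prediction.
prior_dominated = integrate(Belief(0.0, 0.01), Belief(2.0, 1.0))

print(balanced)         # Belief(mean=1.0, variance=0.5)
print(prior_dominated)  # mean is roughly 0.02
```

The second case is one way to read the "controlled hallucination" description: when the top-down prediction is much more precise than the incoming data, what gets perceived is mostly the prediction, nudged only slightly by the senses.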
Predictive coding has been hailed by prominent neuroscientists as a possible unified theory of the brain, but I’m confused about how much physiological evidence there is that the brain is actually implementing predictive coding. It seems like there’s physiological evidence in support of predictive coding being implemented in the visual cortex and in the auditory cortex, and there’s a theoretical account of how the prefrontal cortex (responsible for higher cognitive functions like planning, decision-making, and executive function) might be utilizing similar principles. This paper and this paper review some physiological evidence of predictive coding in the cortex that I don’t really know how to interpret.
My current take
I find the various pieces of evidence that cortical function depends largely on data inputs (e.g. the ferret rewiring experiment) to be pretty compelling evidence of general-purpose data-processing in the cortex. The success of simple and general methods in deep learning across a wide range of tasks suggests that it’s most parsimonious to model the cortex as employing general methods throughout, but only to the extent that the capabilities of artificial neural networks can be taken to be analogous to the capabilities of the cortex. I currently consider the analogy to be deep, and intend to explore my reasons for thinking so in future posts.
I think the fact that predictive coding offers a plausible theoretical account for what the cortex could be doing uniformly, which can account for higher-level cognitive functions in addition to sensory processing, is itself some evidence of cortical uniformity. I’m confused about how much physiological evidence there is that the brain is actually implementing predictive coding, but I’m very bullish on predictive coding as a basis for a unified brain theory based on non-physiological evidence (like our subjective experiences making sense of the images of splotches) that I intend to explore in a future post.
Thanks to Paul Kreiner, David Spivak, and Stag Lynn for helpful suggestions and feedback, and thanks to Jacob Cannell for writing a post that inspired much of my thinking here.
This blog post comment has some good excerpts from On Intelligence. ↩︎
Deep learning is a general method in the sense that most tasks are solved by utilizing a handful of basic tools from a standard toolkit, adapted for the specific task at hand. Once you’ve selected the basic tools, all that’s left is figuring out how to supply the training data, specifying the objective that lets the AI know how well it’s doing, throwing a lot of computation at the problem, and fiddling with details. My understanding is that there typically isn’t much conceptual ingenuity involved in solving the problems, that most of the work goes into fiddling with details, and that trying to be clever doesn't lead to better results than using standard tricks with more computation and training data. It's also worth noting that most of the tools in this standard toolkit have been around since the 90's (e.g. convolutional neural networks, LSTMs, reinforcement learning, backpropagation), and that the recent boom in AI was driven by using these decades-old tools with unprecedented amounts of computation. ↩︎
AlphaGo did simulate future moves to achieve superhuman performance, so the direct comparison against human intuition isn't completely fair. But AlphaGo Zero's raw neural network, which just looks at the "texture" of the board without simulating any future moves, can still play quite formidably. From the AlphaGo Zero paper: "The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan." (AlphaGo Fan beat the European Go champion 5-0.) ↩︎
Eliezer Yudkowsky has an insightful exposition of this point in a Facebook post. ↩︎
Thanks for writing this up! I love this topic and I think everyone should talk about it more!
On cortical uniformity:
My take (largely pro-cortical-uniformity) is in the first part of this post. I never did find better or more recent sources than those two book chapters, but have gradually grown a bit more confident in what I wrote for various more roundabout reasons. See also my more recent post here.
On the similarity of neocortical algorithms to modern ML:
I am pretty far on the side of "neocortical algorithms are different than today's most popular ANNs", i.e. I think that both are "general" but I reached that conclusion independently for each. If I had to pick one difference, I would say it's that neocortical algorithms use analysis-by-synthesis (i.e., searching through a space of generative models for one that matches the data) and, relatedly, planning by probabilistic inference. This type of algorithm is closely related to probabilistic programming and PGMs; see, for example, Dileep George's work. In today's popular ANNs, this kind of analysis-by-synthesis and planning is either entirely absent or arguably present as a kind of add-on, but it's not a core principle of the algorithm. This is obviously not the only difference between neocortical algorithms and mainstream ANNs. Some are really obvious: the neocortex doesn't use backprop! More controversially, I don't think the neocortex even uses real-valued variables in its models, as opposed to booleans. I would want to put some caveats on that, but I believe something in that general vicinity.
So basically, I think the algorithms most similar to the neocortex are a bit of a backwater within mainstream ML research, with essentially no SOTA results on popular benchmarks ... which makes it a bit awkward for me to argue that this is the corner from which we will get AGI. Oh well, that's what I believe anyway!
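For readers unfamiliar with the term, here is a minimal sketch of analysis-by-synthesis as described above: each candidate generative model renders what it would expect to observe, and inference means picking the model whose rendering best matches the data. The three hand-written models, the pixel-mismatch score, and the exhaustive search are all simplifying assumptions made for illustration. They are not a claim about how the neocortex, or any system in the work mentioned above, actually implements the search.

```python
def render_horizontal_bar(size):
    """Generative model: a single horizontal bar through the middle row."""
    mid = size // 2
    return [[1 if r == mid else 0 for _ in range(size)] for r in range(size)]

def render_vertical_bar(size):
    """Generative model: a single vertical bar through the middle column."""
    mid = size // 2
    return [[1 if c == mid else 0 for c in range(size)] for _ in range(size)]

def render_blank(size):
    """Generative model: nothing there."""
    return [[0] * size for _ in range(size)]

# Hypothetical model space; a realistic one would be vast and compositional.
GENERATIVE_MODELS = {
    "horizontal bar": render_horizontal_bar,
    "vertical bar": render_vertical_bar,
    "blank": render_blank,
}

def mismatch(image_a, image_b):
    """Count the pixels on which two equally sized binary images differ."""
    return sum(a != b
               for row_a, row_b in zip(image_a, image_b)
               for a, b in zip(row_a, row_b))

def analyze_by_synthesis(observed):
    """Synthesize each model's prediction and keep the best-matching model."""
    size = len(observed)
    scores = {name: mismatch(render(size), observed)
              for name, render in GENERATIVE_MODELS.items()}
    best = min(scores, key=scores.get)
    return best, scores

# A vertical bar with one corrupted pixel in the corner.
noisy_vertical = render_vertical_bar(5)
noisy_vertical[0][0] = 1

best, scores = analyze_by_synthesis(noisy_vertical)
print(best)    # vertical bar
print(scores)  # {'horizontal bar': 9, 'vertical bar': 1, 'blank': 6}
```

Note that the observation is explained by a model that generates it, not by a feed-forward mapping from pixels to labels. The stray pixel is left over as prediction error instead of changing the answer.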
On predictive coding:
Depending on context, I'll say I'm either an enthusiastic proponent or strong critic of predictive coding. Really, I have a particular version of it I like, described here. I guess I disagree with Friston, Clark, etc. most strongly in that they argue that predictive coding is a helpful way to think about the operation of the whole brain, whereas I only find it helpful when discussing the neocortex in particular. Again see here for my take on the rest of the brain. My other primary disagreement is that I don't see "minimizing prediction error" as a foundational principle, but rather an incidental consequence of properly-functioning neocortical algorithms under certain conditions. (Specifically, from the fact that the neocortex will discard generative models that get repeatedly falsified.)
I think there is a lot of evidence for the neocortex having a zoo of generative models that can be efficiently searched through and glued together, not only for low-level perception but also for high-level stuff. I guess the evidence I think about is mostly introspective though. For example, this book review about therapy has (in my biased opinion) an obvious and direct correspondence with how I think the neocortex processes generative models.
This post is what first gave me a major update towards "an AI with a simple single architectural pattern scaled up sufficiently could become AGI". In other words, there don't necessarily have to be complicated fine-tuned algorithms for different advanced functions; you can get lots of different things from the same simple structure plus optimization. Since then, as far as I can tell, that's what we've been seeing.