(Last revised: December 2024. See changelog at the bottom.)

The backstory: Dopamine-supervised learning in mammals

According to a well-known theory from the 1990s, dopamine signals in mammal brains can serve as Reward Prediction Error signals that drive reinforcement learning, i.e. finding good actions within a high-dimensional space of possible actions.

I for one strongly agree that this is one of the things that dopamine does.

But meanwhile I've been advocating that dopamine also plays a different role in other parts of the mammal brain: a dopamine signal can provide the supervisory signal for supervised learning in a one-dimensional action space.

(Or 47 dopamine signals can supervise learning in a 47-dimensional action space, etc. etc.)

The main (though not exclusive) part of the brain where I think dopamine-supervised learning is happening is the “extended striatum” (caudate, putamen, nucleus accumbens, part of the amygdala, and a couple other areas). I think these parts of the brain house many copies of basically the same supervised learning algorithm (which I call a short-term predictor), each sending its output down to the hypothalamus & brainstem, and each in turn receiving an appropriate supervisory dopamine signal coming back up from the brainstem.

For example, one of the supervised learning circuits might have an output line to the hypothalamus & brainstem whose signals mean: “We need to salivate now!” The brainstem then has a dopamine supervisory signal going back up to that circuit, which it uses to correct those suggestions with the benefit of hindsight. Thus, if I suddenly find myself with a mouth chock-full of salt, then in hindsight, I should have been salivating in advance. The brainstem knows that I should have been salivating—the brainstem has a direct input from the taste buds—and thus the brainstem can put that information into the dopamine supervisory signal.

Here I’ll zoom out a bit so that we can see both kinds of dopamine signals, the reinforcement learning dopamine, and the supervised learning dopamine:

See my post Big picture of motivation, decision-making, and RL for details and terminology. I think all the arrows shown as coming up from the bottom box involve dopamine.

The right side is the array of dopamine-supervised learning algorithms—the downward arrows are the learning algorithm outputs, and the upward arrows are the supervisory signals.

Where is the reinforcement learning dopamine (i.e. reward prediction error) in this diagram? That would be the arrows related to “valence”. (See my Valence series for much more on how I think about valence.) Interestingly, you might notice that “valence” shows up on both sides—i.e., the reinforcement learning system involves a dopamine-supervised learning component. What’s going on there? Well, in actor-critic reinforcement learning, the critic is fundamentally in the category of supervised learning. After all, it’s outputting a one-dimensional signal, and it’s getting updated in hindsight. However, the valence signal on the left is related to the “actor” part of actor-critic reinforcement learning. This part is fundamentally different from supervised learning; rather, it involves explore-exploit in a high-dimensional space of possibilities.

So anyway, that’s my hypothesis about dopamine-supervised learning in mammals. I think there are good reasons to believe it (see here), but the reasons are all a bit indirect and suggestive. I don't have a super solid case.

So, I was delighted when (in 2021) a friend (Adam Marblestone) sent me crystal-clear evidence of dopamine-supervised learning!

The only catch was … it was in drosophila! So not exactly what I was looking for. But still cool!

By the way, right now is a golden era in drosophila research. All 135,000 drosophila neurons have recently been mapped; see the FlyWire papers and website, plus “Drosophila Connectome” on Wikipedia for more history. For my part, I’m very far from a drosophila expert: pretty much everything I know about drosophila comes from “The connectome of the adult Drosophila mushroom body provides insights into function”, Li et al. 2020, and also Larry Abbott’s 2018 talk, which is nice and pedagogical. Also, I have some drosophila living in my compost bin at home. So yeah, I’m not an expert. But I’ll do my best, and please let me know if you see any errors.

Obvious question: If I’m correct that there’s dopamine-supervised learning in mammals … and if there’s also dopamine-supervised learning in drosophila … could they be homologous? It’s possible! See Tomer et al. 2010 and Strausfeld & Hirth 2013 for two slightly different proposed homologies, which are at least vaguely compatible with the functional analogies I’m suggesting in this post—i.e., between the fruit fly “mushroom body” (see below), and part of the mammal forebrain. But also, a lot about nervous system architecture has changed in the last 600 million years since our last common ancestor with fruit flies (the so-called Urbilaterian), so I’m not too concerned about the details. At the very least, it’s an interesting point of comparison, and I think worth the time to learn about, even if you (like me) are ultimately only interested in humans. (Or AIs.)

Algorithmic background—cerebellum-style supervised learning

Ironically, after getting you all excited about that possible homology, I will now immediately switch from supervised learning in the mammalian forebrain to supervised learning in the mammalian cerebellum. The cerebellum does not use dopamine as a supervisory signal; it uses climbing fibers instead. I think any resemblance between drosophila and the cerebellum is almost definitely convergent evolution. But still, it’s a relatively clean and straightforward resemblance, whereas the mammal forebrain has extra complications that I don’t want to get into here.

So, let’s talk about the cerebellum. As far as I can tell, the cerebellum functions partly like a giant memoization system: it watches the activity of other parts of the brain (including parts of the neocortex and amygdala, and maybe other things too), it memorizes patterns in what signals those parts of the brain send under different circumstances, and when it learns such a pattern, it starts sending those same signals itself—just earlier. The cerebellum also does the same time-travel trick for sensory signals coming in from the periphery, in which case the engineering term is not “memoization” but rather “Smith predictors”. But the algorithm is the same either way. (See my post here (§4.6) for much more.)

How does it do that?

Simplified cerebellum-like supervised learning algorithm. Black dots = synapses that get edited by the learning algorithm.

The basic idea is: we have a bunch of “context” lines carrying information about various different aspects of what's happening in the world. The more different context lines, the better, by and large: the algorithm will eventually find the lines bearing useful predictive information, and ignore the rest. Then there are one or more pairs of (output signal, supervisor signal). The learning algorithm’s goal is for each output to reliably fire a certain amount of time before the corresponding supervisor.

Here's one way that something like this might work. Let’s say I want Output 1 to fire 0.5 seconds before Supervisor 1. Well, for each context signal, I can track (1) how likely it is to fire 0.5 seconds before Supervisor 1; and (2) how likely it is to fire in general. If the ratio of (1) to (2) is high, then I’ve found a good predicting signal! So I would strengthen the synapse between that context signal and the output line. (This is just a toy example; see Fine Print at the end.)
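To make the toy example concrete, here is a minimal Python sketch of that ratio-tracking idea. Everything specific in it is my own assumption for illustration: discrete time steps instead of seconds, binary firing, the function name predictive_ratios, and the threshold of 2.0 for deciding which synapses get strengthened. It is not a claim about what real synapses compute.

```python
import random

def predictive_ratios(context, supervisor, lead):
    """For each context line, estimate
    P(line fires `lead` steps before the supervisor) / P(line fires at all)."""
    n_steps = len(supervisor)
    n_lines = len(context[0])
    # Time steps t at which the supervisor fires exactly `lead` steps later.
    pre_steps = [t for t in range(n_steps - lead) if supervisor[t + lead]]
    ratios = []
    for i in range(n_lines):
        p_general = sum(context[t][i] for t in range(n_steps)) / n_steps
        p_before = sum(context[t][i] for t in pre_steps) / max(len(pre_steps), 1)
        ratios.append(p_before / p_general if p_general > 0 else 0.0)
    return ratios

# Hypothetical toy world: three context lines, each firing 10% of the time.
# Only line 0 is predictive: the supervisor fires LEAD steps after it.
rng = random.Random(0)
LEAD = 5
N_STEPS = 2000
context = [[int(rng.random() < 0.1) for _ in range(3)] for _ in range(N_STEPS)]
supervisor = [0] * N_STEPS
for t in range(N_STEPS - LEAD):
    if context[t][0]:
        supervisor[t + LEAD] = 1

ratios = predictive_ratios(context, supervisor, LEAD)
# Strengthen the synapse only where the ratio is well above chance (about 1).
weights = [1.0 if r > 2.0 else 0.0 for r in ratios]
print(ratios, weights)
```

In this made-up world, the ratio for line 0 comes out far above 1, while the two uninformative lines hover around 1, so only the first synapse gets strengthened.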

Here’s a diagram with a bit more detail:

The small-font labels with slashes are of the format “*Drosophila* terminology / Cerebellum terminology”. Copied from Li *et al.* 2020.

The new thing that this diagram adds—besides the anatomical labels—is the “pattern separation” step at the left. Basically, there’s a trick where you take some context lines, randomly combine them in tons of different ways, sprinkle in some nonlinearity, and voila, you have way more context lines than you started with! This enables the system to learn a wider variety of possible patterns in a single neural-network layer. This is the function of the tiny granule cells in the cerebellum, which famously comprise more than half of neurons in the human brain.
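Here is a rough Python sketch of that pattern-separation trick, under assumptions that are mine rather than from any of the papers: each expanded line sums a small random subset of the inputs (fan_in = 6 is an arbitrary choice), and the nonlinearity is a simple "keep the top n_active lines" rule. The names make_wiring and expand are made up for this example.

```python
import random

def make_wiring(n_in, n_out, fan_in, seed=0):
    """Each expanded line listens to a fixed random subset of input lines."""
    rng = random.Random(seed)
    return [rng.sample(range(n_in), fan_in) for _ in range(n_out)]

def expand(inputs, wiring, n_active):
    """Randomly combine inputs, then apply a nonlinearity:
    only the n_active most strongly driven lines fire."""
    drive = [sum(inputs[j] for j in picks) for picks in wiring]
    order = sorted(range(len(drive)), key=lambda i: drive[i], reverse=True)
    out = [0] * len(drive)
    for i in order[:n_active]:
        out[i] = 1
    return out

# Hypothetical sizes: 35 input lines expanded into 1000 context lines.
wiring = make_wiring(n_in=35, n_out=1000, fan_in=6)
odor_rng = random.Random(1)
odor = [odor_rng.random() for _ in range(35)]
code = expand(odor, wiring, n_active=50)
print(len(code), sum(code))
```

The output is a much longer, sparse binary vector. The downstream learning rule then only has to adjust one layer of synapses on those expanded lines.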

Now let’s talk about drosophila!

The drosophila equivalent of the pattern-separating cerebellar granule cells is “Kenyon Cells” (KCs), which take the ≈35-dimensional space of detectable odors and turn it into a ≈1000-dimensional space of odor patterns (ref). The axons of these cells are the “context lines” that our learning algorithms will sculpt into a predictive model. More recently, Li et al. 2020 found Kenyon Cells with other kinds of context information besides odor, including visual information, temperature, and taste. These were (at least partly) segregated—this allows the genome to, say, train a model that makes predictions based on odor information, and also train a model that makes predictions based on visual information, and then give one of those two models a veto over the other. That’s just a made-up example, but I can imagine things like that being useful.

The Kenyon Cell axons, carrying context information, form a big bundle of parallel fibers, called the “mushroom body”. (Eventually this splits into a handful of smaller bundles of parallel fibers.)

That brings us to the Mushroom Body Output Neurons (MBONs)—the supervised learning model output lines, the drosophila equivalent of Purkinje cells in the cerebellum. The synapses between the context lines and the MBON are edited by the learning algorithm. And as far as I can tell, the point of this system is just like the cerebellum: the MBON signals (i.e., the outputs from this trained model) will learn to approximate the supervisory signal, but shifted a bit earlier in time. (It could also be sign-flipped, and there are other complications—see “Fine Print” section below.) The “shifted earlier in time” allows the fly to predict problems and opportunities, instead of merely reacting to them. Otherwise there would hardly be any point in going to all this effort! After all, we already have the supervisory signal!

(Well, OK, supervised learning is good for a couple other things besides time-shifting—see the novelty-detection example below—but I suspect that time-shifting is the main thing here.)

Li et al. 2020 also found that there were “atypical” MBONs that connected to not only Kenyon Cells but also a grab-bag of other signals in the brain. I figure we should think of these as just even more context signals. Again, by and large, the more different context signals, the better the trained model!

They also found some MBON-to-MBON connections. If any of those synapses are plastic, I would just assume it’s the same story: one MBON is just serving as yet another context line for another MBON. (I surmise that the recurrent connections in the cerebellum are there for the same reason.) Are the synapses plastic though? I hear that it’s unknown, but that smart money says “probably not plastic”. So in that case, I guess the MBONs are probably gating each other, or doing some other such logical operation (see discussion of “Business Logic” here).

Finally, the last ingredient is the supervisory signal. Each supervised learning algorithm (MBON) has its own supervisory dopamine signal. (Well, more or less—see below.) In other words, dopamine is playing pretty much the same role in fruit flies as I was thinking for the mammal striatum / amygdala etc. Very cool!

Different MBON types are compartmentalized (left), and those compartments exactly correspond to the different dopamine signals (right). Screenshot from Larry Abbott’s talk here, based on Aso *et al.* 2014

Copied from Aso *et al.* 2014. Each gray box is a “compartment”—I would call it a supervised learning module—consisting of a specific dopamine neuron (“DAN”) signal (bottom) that supervises the learned connections to a correspondingly specific sub-type of mushroom body output neuron (“MBONs”, top). Well, in a couple cases there is more than one dopamine signal per compartment, I think either because of “subcompartments” (see Li *et al.* 2020) or modulatory dopamine signals (see below).

Fine print

I oversimplified a couple things above for readability; here are more details.

1. Dopamine could have the opposite sign, and could be either ground truth or error, I dunno.

I’ve been talking in broad terms about dopamine being a supervisory signal. There are various ways it could work in detail. For example, dopamine could be either an “error signal” (the circuit fired too much / too little), or a “ground truth signal” (the circuit should have fired / should have not fired). That depends on whether the subtraction is upstream or downstream of the dopamine neurons. I don’t know which is right, and anyway it doesn’t matter for the high-level discussion here.
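To illustrate why the distinction doesn't matter at this level of description, here is a small Python sketch using a plain delta rule as a stand-in learning rule (that choice, and all the names, are my assumptions). In one version the supervisor sends ground truth and the subtraction happens inside the learning circuit; in the other, the subtraction happens upstream and the supervisor sends the error.

```python
def circuit_output(weights, context):
    return sum(w * c for w, c in zip(weights, context))

def update_from_ground_truth(weights, context, target, lr=0.1):
    """Supervisor says what the output should have been;
    the subtraction happens downstream, inside the circuit."""
    error = target - circuit_output(weights, context)
    return [w + lr * error * c for w, c in zip(weights, context)]

def update_from_error(weights, context, error, lr=0.1):
    """Supervisor already carries (target - output);
    the subtraction happened upstream of the supervisory signal."""
    return [w + lr * error * c for w, c in zip(weights, context)]

weights = [0.0, 0.5, 0.2]
context = [1, 0, 1]
target = 1.0

via_truth = update_from_ground_truth(weights, context, target)
upstream_error = target - circuit_output(weights, context)
via_error = update_from_error(weights, context, upstream_error)
print(via_truth, via_error)
```

Both routes produce the same weight change; the only difference is where the subtraction physically happens.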

Also, the dopamine could be the opposite sign, saying “you should NOT have fired just now”.

1A. Novelty-detector example.

…In fact, opposite-sign dopamine is probably more likely, because we have at least one example where it definitely works that way—see Hattori et al. 2017:

The fruit-fly “MBON-α’3” neuron sends outputs which we can interpret as: “I need to execute an alerting response!!”
The corresponding PPL1-α’3 dopamine supervisory signal can be interpreted as “Things are fine! There was no need to execute an alerting response just now!”

What would happen, for the sake of argument, if the output neuron were simply wired directly to the dopamine neuron? It would turn into a novelty detector, right? Think about it: Whatever odor environment it’s in, the supervisory signal gradually teaches it that this odor environment does not warrant an alerting response. Pretty cool!

(Of course, to get a novelty detector, you also need some maintenance process that gradually resets the synapses back to some trigger-happy baseline. Otherwise it would eventually learn to never fire at all.)
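Here is a toy Python sketch of that novelty-detector loop, including the slow reset toward baseline. The update rule and the constants (depress, recover, BASELINE) are invented for illustration and are not taken from Hattori et al. 2017; the only thing the sketch tries to capture is "output drives its own opposite-sign supervisor, plus slow recovery".

```python
BASELINE = 1.0  # hypothetical "trigger-happy" synapse strength

def alert_response(weights, odor):
    """Average strength of the synapses on currently active context lines."""
    active = [w for w, x in zip(weights, odor) if x]
    return sum(active) / len(active) if active else 0.0

def expose(weights, odor, depress=0.5, recover=0.01):
    response = alert_response(weights, odor)
    dopamine = response  # output wired straight to its own supervisor
    updated = []
    for w, x in zip(weights, odor):
        if x:
            # "No need to alert just now": weaken the active synapses.
            w -= depress * dopamine * w
        # Maintenance process: drift slowly back toward baseline.
        w += recover * (BASELINE - w)
        updated.append(w)
    return response, updated

odor_a = [1, 1, 0, 0]
odor_b = [0, 0, 1, 1]
blank = [0, 0, 0, 0]

weights = [BASELINE] * 4
first_a, weights = expose(weights, odor_a)
for _ in range(9):
    _, weights = expose(weights, odor_a)
familiar_a = alert_response(weights, odor_a)  # habituated odor
novel_b = alert_response(weights, odor_b)     # never-encountered odor

for _ in range(500):
    _, weights = expose(weights, blank)       # long time away from odor A
recovered_a = alert_response(weights, odor_a)
print(first_a, familiar_a, novel_b, recovered_a)
```

In this sketch the response to odor A fades with repeated exposure, an odor it has never encountered still triggers a full response, and after a long absence odor A counts as novel again.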

2. Dopamine is involved in inference, not just learning

In machine learning, you need both a “learning algorithm” (how to change the stored data to enable better outputs in the future) and an “inference algorithm” (how to query the stored data to pick an output right now). For example, in deep learning, the “learning algorithm” would be a gradient descent step, and the “inference algorithm” would be the forward pass.

Anyway, in reinforcement learning, I think dopamine plays a role in both the learning algorithm and the inference algorithm. I have a fun diagram (from here) to illustrate:

That’s for reinforcement learning, but (at least in fruit flies) some of the supervised learning modules likewise have dopamine signals affecting not only the learning algorithm but also the inference algorithm. In particular, there can be dopamine signals that function as a modulator / gate on the circuit. This was clearly documented in Cohn et al. 2015—see Fig. 6, where the ability of KC firing to trigger MBON firing is strongly modulated by the presence or absence of a certain dopamine signal under experimental control. Another example is Krashes et al. 2009, which showed that flipping certain hunger-related dopamine neurons on and off made the fruit fly start and stop taking actions based on learned food-related odor preferences. Both these papers demonstrated an inference-time gating effect, but I would certainly presume that the same signal gates the learning algorithm too. I’m getting all these references from Larry Abbott’s talk, by the way.
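As a very schematic Python sketch of the distinction: below, infer is the inference algorithm and learn is the learning algorithm, and a gate value between 0 and 1 stands in for the modulatory dopamine signal. Treating the gate as a simple multiplier, and applying it to learning as well as inference, are my assumptions (the latter being the presumption mentioned above, not something those papers showed).

```python
def infer(weights, context, gate):
    """Inference: query the stored weights, scaled by the modulatory gate."""
    drive = sum(w * c for w, c in zip(weights, context))
    return gate * drive

def learn(weights, context, target, gate, lr=0.1):
    """Learning: delta-rule update, also scaled by the same gate (presumed)."""
    error = target - infer(weights, context, gate)
    return [w + gate * lr * error * c for w, c in zip(weights, context)]

weights = [0.8, 0.1, 0.0]   # hypothetical learned food-odor association
food_odor = [1, 0, 1]

hungry_output = infer(weights, food_odor, gate=1.0)
fed_output = infer(weights, food_odor, gate=0.0)
weights_after_fed = learn(weights, food_odor, target=1.0, gate=0.0)
weights_after_hungry = learn(weights, food_odor, target=1.0, gate=1.0)
print(hungry_output, fed_output)
```

With the gate closed, the learned association is still stored in the weights, but it neither drives the output nor gets updated.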

3. I oversimplified the cerebellum, sorry

The mammalian cerebellum has a bunch more bells and whistles wrapped around that core algorithm, which are not present in drosophila and are not particularly relevant for this post. For example, there are various preprocessing steps on the context lines (ref), the Purkinje cells are actually a hidden layer rather than the output (ref), there are dynamically-reconfigurable oscillation modules for perfecting the timing, etc. I think all these things exist for the same reason: to make the supervised learning algorithm work better (to learn faster, and ultimately to wind up with more accurate predictions). Also, as with dopamine above, I don't know whether the climbing fiber signals are ground truth or errors, and they may also be sign-flipped. Just wanted to mention those things for completeness.

(Thanks Jack Lindsey & Adam Marblestone for critical comments on earlier drafts.)

Changelog

December 2024: Since the initial version in 2021, I’ve learned a whole lot, so I went through and fixed miscellaneous errors in the diagrams and discussion. I also updated some links to point towards my more recent posts, rather than earlier posts on the same topic but written when I was more confused.