Abstract: We can think of the brain crudely as (1) a neocortex which runs an amazingly capable quasi-general-purpose learning-and-planning algorithm, and (2) subcortical structures (midbrain, etc.), one of whose functions is to calculate rewards that get sent up to the neocortex to direct it. But the relationship is actually more complicated than that. "Reward" is not the only informational signal sent up to the neocortex; meanwhile information is also flowing back down in the opposite direction. What's going on? How does all this work? Where do emotions fit in? Well, I'm still confused on many points, but I think I'm making progress. In this post I will describe my current picture of this system.
Background & motivation
I'm interested in helping ensure a good post-AGI future. But how do we think concretely about AGI, when AGI doesn't exist and we don't know how to build it? Three paths:
- We can think generally about the nature of intelligence and agency—a research program famously associated with MIRI, Marcus Hutter, etc.;
- We can think about today's AI systems—a research program famously associated with OpenAI, DeepMind, CHAI, etc.;
- We can start from the one "general intelligence" we know about, i.e. the human brain, and try to go from there to lessons about how AGI might be built, what it might look like, and how it might be safely and beneficially used and controlled.
I like this 3rd research program; it seems to be almost completely neglected, and I think there's a ton of low-hanging fruit there. Also, this program will be especially important if we build AGI in part by reverse-engineering (or reinventing) high-level neocortical algorithms, which (as discussed below) I think is very plausible, maybe even likely—for better or worse.
Now, the brain is divided into the neocortex and the subcortex.
Start with the neocortex. The neocortex does essentially all the cool exciting intelligent things that humans do, like building an intelligent world-model involving composition and hierarchies and counterfactuals and analogies and meta-cognition etc., and using that thing to cure diseases and build rocket ships and create culture etc. Thus, both neuroscientists and AI researchers focus a lot of attention on the neocortex, and on understanding and reverse-engineering its algorithms. Textbooks divide the neocortex into lots of functional regions like "motor cortex" and "visual cortex" and "frontal lobe" etc., but microscopically it's all a pretty uniform 6-layer structure, and I currently believe that all parts of the neocortex are performing more-or-less the same algorithm, but with different input and output connections. These connections are seeded by an innate gross wiring diagram and then edited by the algorithm itself. See Human Instincts, Symbol Grounding, and the Blank-Slate Neocortex for discussion and (heavy!) caveats on that claim. And what is this algorithm? I outline some of (what I think are) the high-level specifications at Predictive coding = RL + SL + Bayes + MPC. In terms of how the algorithm actually works, I think that researchers are making fast progress towards figuring this out, and that a complete answer is already starting to crystallize into view on the horizon. For a crash course on what's known today on how the neocortex does its thing, maybe a good starting point would be to read On Intelligence and then every paper ever written by Dileep George (and citations therein).
The subcortex, by contrast, is not a single configuration of neurons tiled over a huge volume, but rather it is a collection of quite diverse structures like the amygdala, cerebellum, tectum, and so on. Unlike the neocortex, this stuff does not perform some miraculous computation light-years beyond today's technology; as far as I can tell, it accomplishes the same sorts of things as AlphaStar does. And the most important thing to understand (for AGI safety) is this:
The subcortex provides the training signals that guide the neocortex to do biologically-useful things.
Now, if people build AGI that uses algorithms similar to the neocortex, we will need to provide it with training signals. What exactly are these training signals? What inner alignment issues might they present? Suppose we wanted to make an AGI that was pro-social for the same underlying reason as humans are (sometimes) pro-social (i.e., thanks to the same computation); is that possible, how would we do it, and would it work reliably? These are questions we should answer well before we finish reverse-engineering the neocortex. I mean, really these questions should have been answered before we even started reverse-engineering the neocortex!! I don't have answers to those questions, but I'm trying to lay groundwork in that direction. Better late than never…
(Update 1 year later: These days I say "hypothalamus & brainstem" instead of subcortex, and I'm inclined to lump almost the entire rest of the brain—the whole telencephalon plus cerebellum—in with the neocortex as the subsystem implementing a from-scratch learning algorithm. See here)
Things to keep in mind
Before we get into the weeds, here are some additional mental pictures we'll need going forward:
Simple example: Fear of spiders
My go-to example for the relation between subcortex and neocortex is fear of spiders. Besides the visual cortex, humans have a little-known second vision system in the midbrain (superior colliculus). When you see a black scuttling thing in your field of view, the midbrain vision system detects that and sends out a reaction that makes us look in that direction and increase our heart rate and flinch away from it. Meanwhile, the neocortex is simultaneously seeing the spider with its vision system, and it's seeing the hormones and bodily reaction going on, and it connects the dots to learn that "spiders are scary". In the future, if the neocortex merely imagines a spider, it might cause your heart to race and body to flinch. On the other hand, after exposure therapy, we might be able to remain calm when imagining or even seeing a spider. How does all this work?
(Note again the different capabilities of the midbrain and neocortex: The midbrain has circuitry to recognize black scuttling things—kinda like today's CNNs can—whereas the neocortex is able to construct and use a rich semantic category like "spiders".)
We'll be returning to this example over and over in the post, trying to work through how it might be implemented and what the consequences are.
The neocortex is a black box from the perspective of the subcortex
The neocortex's algorithm, as I understand it, sorta learns patterns, and patterns in the patterns, etc., and each pattern is represented as an essentially randomly-generated set of neurons in the neocortex. So, if X is a concept in your neocortical world-model, there is no straightforward way for an innate instinct to refer directly to X—say, by wiring axons from the neurons representing X to the reward center—because X's neurons are not at predetermined locations. X is inside the black box. An instinct can incentivize X, at least to some extent, but it has to be done indirectly.
I made a list of various ways that we can have universal instincts despite the neocortex being a black-box learning algorithm: See Human Instincts, Symbol Grounding, and the Blank-Slate Neocortex for my list.
This blog post is a much deeper dive into how a couple of these mechanisms might be actually implemented.
Finally, here is the current picture in my head:
(Update 1 year later: I no longer would draw it this way—see Big picture of phasic dopamine for what I now think instead. The main difference is: I would not draw a direct line from neocortex to a hormone change (for example); instead the cortex would tell the subcortex (hypothalamus + brainstem) to make that hormone change, and then the subcortex might or might not comply with that recommendation. (I guess the way I drew it here is more like somatic marker hypothesis.))
There’s a lot here. Let's go through it bit by bit.
Emotions, "emotion concepts", and "reactions"
One aspect of this picture is emotions. There's a school of thought, popularized by Paul Ekman and the movie Inside Out, that there are exactly six emotions (anger, disgust, fear, happiness, sadness, surprise), each with its own universal facial expression. (I've seen other lists of emotions too, and sometimes there's also a list of social emotions like embarrassment, jealousy, guilt, shame, pride, etc.) That was my belief too, until I read the book How Emotions Are Made by Lisa Feldman Barrett, which convincingly argues against it. Barrett argues that a word like "anger" lumps together a lot of very different bodily responses involving different facial expressions, hormones, etc. Basically, emotional concepts, like other concepts, are arbitrary categories describing things that we find useful to lump together. Sure, they might be lumped together because they share a common hormone change or a common facial expression, but they might just as likely be lumped together because they share a common situational context, or a common set of associated social norms, or whatever else. And an emotion concept with an English-language name like "anger" is not fundamentally different from an idiosyncratic emotion concept like "How Alice must have felt in that TV episode where...".
(Incidentally, while I think Barrett's book is right about that, I am definitely not blanket-endorsing the whole book—there are a lot of other claims in it that I don't agree with, or perhaps don't understand. I think Barrett would strongly disagree with most of this blog post, though I could be wrong.)
So instead of putting "emotions" in the subcortex, I instead put there a bunch of things I'm calling "reactions" for clarity. I imagine that there are dozens to hundreds of these (...and separating them into a discrete list is probably an oversimplification of a more complicated computational architecture, but I will anyway). There's the reaction that gets triggered when your midbrain vision system sees a spider moving towards you out of the corner of your eye, as discussed above. And there's a different reaction that gets triggered when you stand at the edge of a precipice and peer over the edge. Both of those reactions might be categorized as "fear" in the neocortex, but they're really different reactions, involving (I presume) different changes to heart rate, different bodily motions, different facial expressions, different quantities of (negative) reward, etc. (Reactions where peripheral vision is helpful will summon a wide-eyed facial expression; reactions where visual acuity is helpful will summon a narrow-eyed facial expression; and so on.)
As described above for the spider example, the neocortex can see what the subcortex does to our hormones, body, face, etc., and it can learn to predict that, and build those expectations into its predictive world-model, and create concepts around that.
(I also put "pain concept" in the neocortex, again following Barrett. A giant part of the pain concept is nociception—detecting the incoming nerve signals we might call "pain sensations". But at the end of the day, the neocortex gets to decide whether or not to classify a situation as "pain", based on not only nociception but also things like context and valence.)
The neocortex's non-motor outputs
From the above, our neocortex comes to expect that if we see a scuttling spider out of the corner of our eye, our heart will race and we'll turn towards it and flinch away. What's missing from this picture? The neocortex causing our heart to race by anticipating a spider. It's easy to see why this would be evolutionarily useful: If I know (with my neocortex) that a poisonous spider is approaching, it's appropriate for my heart to start racing even before my midbrain sees the black scuttling blob.
Now we're at the top-left arrow in the diagram above: the neocortex causing (in this case) release of stress hormones. How does the neocortex learn to do that?
There are two parts of this "how" question: (1) what are the actual output knobs that the neocortex can use, and (2) how does the neocortex decide to use them? For (1), I have no idea. For the purpose of this blog post, let us assume that there is a set of outgoing axons from the neocortex that (directly or indirectly) cause hormone release, and also assume that "hormone release" is the right thing to be talking about in terms of controlling valence, arousal, and so on. I have very low confidence in all this, but I don't think it matters much for what I want to say in this post. (Update 1 year later: I understand (1) better now, but it still doesn't matter here.)
I mainly want to discuss question (2): given these output knobs, how does the neocortex decide to use them?
Recall again that in predictive coding, the neocortex finds generative models which are consistent with each other, which have not been repeatedly falsified, and which predict that reward will happen.
My first thought was: No additional ingredients, beyond that normal predictive coding picture, are needed to get the neocortex to imitate the subcortical hormone outputs. Remember, just like my post on predictive coding and motor control, the neocortex will discover and store generative models that entail “self-fulfilling prophecies”, where a single generative model in the neocortex simultaneously codes for a prediction of stress hormone and the neocortical output signals that actually cause the release of this stress hormone. Thus (...I initially thought...), after seeing spiders and stress hormones a few times, the neocortex will predict stress hormones when it sees a spider, which incidentally creates stress hormones.
But I don’t think that’s the right answer, at least not by itself. After all, the neocortex will also learn a generative model where stress hormone is generated exogenously (e.g. by the subcortical spider reaction) and where the neocortex’s own stress hormone generation knob is left untouched. This latter model is issuing perfectly good predictions, so there is no reason that the neocortex would spontaneously throw it out and start using instead the self-fulfilling-prophecy model. (By the same token, in the motor control case, if I think you are going to take my limp arm and lift it up, I have no problem predicting that my arm will move due to that exogenous force; my neocortex doesn’t get confused and start issuing motor commands.)
So here's my second, better story:
Reward criterion (one among many): when the subcortex calls for a reaction (e.g. cortisol release, eyes widening, etc.), it rewards the neocortex with dopamine if it sees that those commands have somehow already been issued.
(Update 2021/06: Oops, that was wrong too. I think I got it on the third try though; see here.)
So if the subcortex computes that a situation calls for cortisol, the neocortex is rewarded if the subcortex sees that cortisol is already flowing. This example seems introspectively reasonable: Seeing a spider out of the corner of your eye is bad, but being surprised to see a spider when you were feeling safe and relaxed is even worse (worse in terms of dopamine, not necessarily worse in terms of valence—remember wanting ≠ liking). Presumably the same principle can apply to eye-widening and other things.
To be clear, this is one reward criterion among many—the subcortex issues positive and negative rewards according to other criteria too (as in the diagram above, I think different reactions inherently issue positive or negative rewards to the neocortex, just like they inherently issue motor commands and hormone commands). But as long as this "reward criterion" above is permanently in place, then thanks to the laws of the neocortex's generative model economy, the neocortex will drop those generative models that passively anticipate the subcortex's reactions, in favor of models that actively anticipate / imitate the subcortical reactions, insofar as that's possible (the neocortex doesn't have output knobs for everything).
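To make the "reward criterion" concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than a claim about real neural circuitry: the `BodyState` fields, the 0-to-1 scale for each knob, and the choice to make the reward proportional to how much of the called-for change is already present. It only covers this one criterion, not the other rewards the subcortex issues.

```python
from dataclasses import dataclass


@dataclass
class BodyState:
    """Hypothetical snapshot of what the subcortex can observe (0.0 to 1.0 each)."""
    cortisol: float = 0.0
    eyes_widened: float = 0.0


def anticipation_reward(called_for: dict, observed: BodyState, scale: float = 1.0) -> float:
    """Reward the neocortex for changes that are already in place.

    `called_for` maps a knob name (a BodyState field) to the level the
    subcortical reaction wants. The reward grows with how much of that
    level was already present before the reaction fired.
    """
    reward = 0.0
    for knob, target in called_for.items():
        already_present = getattr(observed, knob)
        reward += scale * min(already_present, target)
    return reward


# The spider reaction calls for full cortisol release.
spider_reaction = {"cortisol": 1.0}
relaxed = BodyState(cortisol=0.0)      # neocortex did not anticipate the spider
braced = BodyState(cortisol=0.8)       # neocortex already raised cortisol

surprised_reward = anticipation_reward(spider_reaction, relaxed)
anticipated_reward = anticipation_reward(spider_reaction, braced)
```

Under these assumptions, a generative model that actively raises cortisol ahead of the spider reaction collects more reward than one that passively predicts it, which is the pressure described above.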
Predicting, imagining, remembering, empathizing
The neocortex’s generative models appear in the context of (1) prediction (including predicting the immediate future as it happens), (2) imagination, (3) memory, and (4) empathetic simulation (when we imagine someone else reacting to a spider, predicting a spider, etc.). I think all four of these processes rely on fundamentally the same mechanism in the neocortex, so by default the same generative models will be used for all four. Thus, we get the same hormone outputs in all four of these situations.
Hang on, you say: That doesn’t seem right! If it were the exact same generative models, then when we remember dancing, we would actually issue the motor commands to start dancing! Well, I answer, we do actually sometimes move a little bit when we remember a motion! I think the rule is, loosely speaking, the top-down information flow is much stronger (more confident) when predicting, and much weaker for imagination, memory, and empathy. Thus, the neocortical output signals are weaker too, and this applies to both motor control outputs and hormone outputs. (Incidentally, I think motor control outputs are further subject to thresholding processes, downstream of the neocortex, and therefore a sufficiently weak motor command causes no motion at all.)
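The "weaker signal plus downstream threshold" idea can be sketched as follows. The function names, the multiplicative scaling by a confidence value, and the threshold of 0.3 are all made up for illustration; the only point is that the same generative model can drive both outputs, with motor output gated and hormone output not.

```python
def hormone_output(command: float, confidence: float) -> float:
    """Hormone signal scales with top-down confidence and is not thresholded."""
    return command * confidence


def motor_output(command: float, confidence: float, threshold: float = 0.3) -> float:
    """Motor signal scales the same way, but weak signals are dropped downstream."""
    signal = command * confidence
    return signal if abs(signal) >= threshold else 0.0


# Same generative model ("dancing"), different modes of use.
predicting = 0.9    # strong, confident top-down flow (hypothetical value)
remembering = 0.1   # weak top-down flow (hypothetical value)

moves_when_predicting = motor_output(1.0, predicting)
moves_when_remembering = motor_output(1.0, remembering)
hormones_when_remembering = hormone_output(1.0, remembering)
```

In this toy version, remembering a dance produces no motion at all but still produces a faint hormone signal, which is what the subcortex would get to see.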
As discussed more below, the subcortex relies on the neocortex’s outputs to guess what the neocortex is thinking about, and issue evolutionarily-appropriate guidance in response. Presumably, to do this job well, the subcortex needs to know whether a given neocortical output is part of a prediction, or memory, or imagination, or empathetic simulation. From the above paragraph, I think it can distinguish predictions from the other three by the neocortical output strength. But how does it tell memory, imagination, and empathetic simulation apart from each other? I don’t know! Then that suggests to me an interesting hypothesis: maybe it can’t! What if some of our weirder instincts related to memory or counterfactual imagination are not adaptive at all, but rather crosstalk from social instincts, or vice-versa? For example, I think there’s a reaction in the subcortex that listens for a strong prediction of lower reward, alternating with a weak prediction of higher reward; when it sees this combination, it issues negative reward and negative valence. Think about what this subcortical reaction would do in the three different cases: If the weak prediction it sees is an empathetic simulation, well, that’s the core of jealousy! If the weak prediction it sees is a memory, well, that’s the core of loss aversion! If the weak prediction it sees is a counterfactual imagination, well, that’s the core of, I guess, that annoying feeling of having missed out on something good. Seems to fit together pretty well, right? I’m not super confident, but at least it’s food for thought.
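As a very rough sketch of that hypothesized reaction: suppose the subcortex sees a short window of (output strength, predicted reward) pairs and cannot tell why a weak one is weak. The strength cutoff of 0.7 and the use of the reward gap as the size of the penalty are assumptions chosen only to make the example run.

```python
def comparison_reaction(predictions, strong_cutoff: float = 0.7) -> float:
    """Return a (negative) reward when a weak prediction beats a strong one.

    `predictions` is a list of (output_strength, predicted_reward) pairs.
    Strong entries stand for real predictions; weak entries could be memory,
    imagination, or empathetic simulation, which this function cannot tell apart.
    """
    strong = [reward for strength, reward in predictions if strength >= strong_cutoff]
    weak = [reward for strength, reward in predictions if strength < strong_cutoff]
    if not strong or not weak:
        return 0.0
    gap = max(weak) - min(strong)
    return -gap if gap > 0 else 0.0


# Strong prediction of reward 1.0, weak prediction of reward 3.0.
# The weak one might be "what my neighbor has" (jealousy), "what I used to
# have" (loss aversion), or "what I could have had" (missing out).
penalty = comparison_reaction([(0.9, 1.0), (0.2, 3.0)])
no_penalty = comparison_reaction([(0.9, 3.0), (0.2, 1.0)])
```

The same function fires in all three cases, which is the crosstalk hypothesis in miniature.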
Opening a window into the black-box neocortex
Each subcortical reaction has its own profile of facial, body, and hormone changes. The "reward criterion" above ensures that the neocortex will learn to imitate the characteristic consequences of reaction X whenever it is expecting, imagining, remembering, or empathetically simulating reaction X. This is then a window for the subcortex to get a glimpse into the goings-on inside the black-box neocortex.
In our running example, if the spider reaction creates a certain combination of facial, body, and hormone changes, then the subcortex can watch for this set of changes to happen exogenously (from its perspective), and if it does, the subcortex can infer that the neocortex was maybe thinking about spiders. Perhaps the subcortex might then issue its own spider reaction, fleshing out the neocortex's weak imitation. Or perhaps it could do something entirely different.
I have a hunch that social emotions rely on this. With this mechanism, it seems that the subcortex can build a hierarchy of increasingly complicated social reactions: "if I'm activating reaction A, and I think you're activating reaction B, then that triggers me to feel reaction C", "if I'm activating reaction C, and I think you're activating reaction A, then that triggers me to feel reaction D", and so on. Well, maybe. I'm still hazy on the details here and want to think about it more.
Complex back-and-forth between neocortex and subcortex
The neocortex can alter the hormones and body, which are among the inputs into the subcortical circuits. The subcortical circuits then also alter the hormones and body, which are among the inputs into the neocortex! Around and around it goes! So for example, if you tell yourself to calm down, your neocortex changes your hormones, which in turn increases the activation of the subcortical "I am safe and calm" reaction, which reinforces and augments that change, which in turn makes it easier for the neocortex to continue feeling safe and calm! … Until, of course, that pleasant cycle is broken by other subcortical reactions or other neocortical generative models butting in.
"Overcoming" subcortical reactions
Empirically, we know it's possible to "overcome" fear of spiders, and other subcortical reactions. I'm thinking there are two ways this might work. I think both are happening, but I’m not really sure.
First, there's subcortical learning … well, "learning" isn't the right word here, because it's not trying to match some ground truth. (The only "ground truth" for subcortical reaction specifications is natural selection!) I think it's more akin to the self-modifying code in Linux than to the weight updates in ML. So let's call it subcortical input-dependent dynamic rewiring rules.
(By the way, elsewhere in the subcortex, like the cerebellum, there is also real stereotypical “learning” going on, akin to the weight updates in ML. That does happen, but it’s not what I’m talking about here. In fact, I prefer to lump the cerebellum in with the neocortex as the learning-algorithm part of the brain.)
Maybe one subcortical dynamic rewiring rule says: If the spider-detection reaction triggers, and then within 3 seconds the "I am safe and calm" reaction triggers, then next time the spider reaction should trigger more weakly.
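A toy version of such a rule might look like this. The 3-second window comes from the rule as stated above; the class name, the `gain` variable, and the decay factor of 0.9 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpiderReaction:
    """Hypothetical subcortical reaction with an adjustable trigger strength."""
    gain: float = 1.0
    last_triggered: Optional[float] = None

    def trigger(self, now: float) -> float:
        """Fire the reaction at time `now` (seconds); return its strength."""
        self.last_triggered = now
        return self.gain

    def on_calm_reaction(self, now: float, window: float = 3.0, decay: float = 0.9) -> None:
        """Rewiring rule: calm shortly after a spider trigger weakens future triggers."""
        if self.last_triggered is None:
            return
        if 0.0 <= now - self.last_triggered <= window:
            self.gain *= decay
            self.last_triggered = None  # weaken at most once per trigger


reaction = SpiderReaction()
reaction.trigger(now=0.0)
reaction.on_calm_reaction(now=2.0)    # inside the window: gain drops
reaction.trigger(now=10.0)
reaction.on_calm_reaction(now=15.0)   # too late: gain unchanged
```

Note that nothing here is trained against a ground truth; the rule simply edits the reaction's own parameters as a function of its inputs.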
Second, there’s neocortical learning, i.e., the neocortex developing new generative models. Let's say again that we're doing exposure therapy for fear of spiders, and let's say the two relevant subcortical reactions are the spider-detection reaction (which rewards the neocortex for producing anxiety hormones before it triggers) and the "I am safe and calm" reaction (which rewards the neocortex for producing calming hormones before it triggers). (I'm obviously oversimplifying here.) The neocortex could learn generative models that summon the “I am safe and calm” reaction whenever the spider-detection reaction is just starting to trigger. That generative model could potentially get entrenched and rewarded, as the spider-detection reaction is sorta preempted and thus can’t issue a penalty for the lack of anxiety hormones, whereas the “I am safe and calm” reaction does issue a reward for the presence of calm hormones. Something like that?
I have no doubt that the second of these two processes—neocortical learning—really happens. The first might or might not happen, I don’t know. It does seem like something that plausibly could happen, on both evolutionary and neurological grounds. So I guess my default assumption is that dynamic rewiring rules for subcortical reactions do in fact exist, but again, I’m not sure, I haven't thought about it much.
Things I still don't understand
I lumped together the subcortex into a monolithic unit. I actually understand very little about the functional decomposition beyond that. The tectum and tegmentum seem to be doing a lot of the calculations for what I'm calling "reactions", including the colliculi, which seem to house the subcortical sensory processing. What computations does the amygdala do, for example? It has 10 million neurons, they have to be calculating something!!! I really don't know. (Update 1 year later: On the plus side, I feel like I understand the amygdala much better now; on the minus side, I was wrong to lump it in with "subcortex" rather than "neocortex". See discussion here.)
As discussed above, I don't understand what the non-motor output signals from the neocortex are (update: see here), or whether things like valence and arousal correspond to hormones or something else. (Update: attempt to understand valence here.)
I'm more generally uncertain about everything I wrote here, even where I used a confident tone. Honestly, I haven't found much in the systems neuroscience literature that's addressing the questions I'm interested in, although I imagine it's there somewhere and I'm reinventing lots of wheels (or re-making classic mistakes). As always, please let me know any thoughts, ideas, things you find confusing, etc. Thanks in advance!
A few people on this forum are thinking hard about the brain, and I've learned a lot from their writings—especially Kaj’s multi-agent sequence—but my impression is that they're mostly working on the project of "Let's understand the brain so we can answer normative questions of what we want AGI to do and how value learning might work", whereas here I'm talking about "Let's understand the brain as a model of a possible AGI algorithm, and think about whether such an AGI algorithm can be used safely and beneficially". Y'all can correct me if I'm wrong :) ↩︎
I will sloppily use the term "neocortex" as shorthand for "neocortex plus other structures that are intimately connected to the neocortex and are best thought of as part of the same algorithm"—this especially includes the hippocampus and thalamus. ↩︎
For what it's worth, Elon Musk mentioned in a recent interview about Neuralink that he is thinking about the brain this way as well: "We've got like a monkey brain with a computer on top of it, that's the human brain, and a lot of our impulses and everything are driven by the monkey brain, and the computer, the cortex, is constantly trying to make the monkey brain happy. It's not the cortex that's steering the monkey brain, it's the monkey brain steering the cortex." (14:45). Normally people would say "lizard brain" rather than "monkey brain" here, although even that terminology is unfair to lizards, who do in fact have something homologous to a neocortex. ↩︎
Unfortunately I don't have good evidence that this spider story is actually true. Does the midbrain really have specialized circuitry to detect spiders? There was a study that showed pictures of spiders to a blindsighted person (i.e., a person who had an intact midbrain visual processing system but no visual cortex). It didn't work; nothing happened. But I think they did the experiment wrong: I think it has to be a video of a moving spider, not a stationary picture of a spider, to trigger the subcortical circuitry. (Source: introspection. Also, I think I read that the subcortical vision system has pretty low spatial resolution, so watching for a characteristic motion would seem a sensible design.) Anyway, it has to work this way, nothing else makes sense to me. I'm comfortable using this example prominently because if it turns out that this example is wrong, then I'm so very confused that this whole article is probably garbage anyway. For the record, I am basically describing Mark Johnson's "two-process" model, which is I think well established in the case of attending-to-faces in humans and filial imprinting in chicks (more here), even if it's speculative when applied to fear-of-spiders. ↩︎
I am pretty confident that neocortical patterns are effectively random at a microscopic, neuron-by-neuron level, and that's what matters when we talk about why it's impossible for evolution to directly create hardwired instincts that refer to a semantic concept in the neocortex. However, to be clear, at the level of gross anatomy, you can more-or-less predict in advance where different concepts will wind up getting stored in the neocortex, based on the large-scale patterns of information flow and the inputs it gets in a typical human life environment. To take an obvious example, low-level visual patterns are likely to be stored in the parts of the neocortex that receive low-level visual information from the retina! ↩︎
When I say "I didn't understand" something Barrett wrote, I mean more specifically that I can't see how to turn her words into a gears-level model of a computation that the brain might be doing. This category of "things I didn't understand" includes, in particular, almost everything she wrote about "body budgets", which was a major theme of the book that came up on almost every page... ↩︎
If you want to call the subcortical things "emotions" instead of "reactions", that's fine with me, as long as you distinguish them from "emotion concepts" in the neocortex. Barrett is really adamant that the word "emotion" must refer to the neocortical emotion concepts, not the subcortical reactions (I'm not even sure if she thinks the subcortical reactions exist), but for my part, I think reasonable people could differ, and it's ultimately a terminological question with no right answers anyway. ↩︎
A thought: it seems to me like the algorithm you're describing here is highly non-robust to relative scale, since if the neocortex became a lot stronger it could probably just find some way to deceive/trick/circumvent the subcortex to get more reward and/or avoid future updates. I think I'd be pretty worried about that failure case if anything like this algorithm were ever to be actually implemented in an AI.
Thanks! Yes, I am also definitely worried about that.
I 100% agree that the default result, in the absence of careful effort, would be value lock-in at some point in time, when the neocortex part grows clever enough to undermine the subcortex part, and then you better hope that the locked-in values are what you want!
On the optimistic side:
1. There's no law that says the subcortex part has to be super dumb and simple; we can have less-powerful AIs steering more powerful AIs, helped by intrusive interpretability tools, running faster and in multiple instances, etc. (as has been discussed in other contexts of course);
2. We can try to instill a motivation system from the start that doesn't want to undermine the subcortex part, in particular a corrigible motivation. This is basically reliant on the "corrigibility is a broad basin of attraction" argument being correct, I think.
On the pessimistic side, I'm not at all confident that either of those things would work. For (2) in particular, I remain concerned about ontological crises (or other types of goal instability upon learning and reflection) undermining corrigibility after an indeterminate amount of time. (This remains my go-to example of a possibly-unsolvable safety problem, or at least I have no idea how to solve it.)
So yeah, maybe we would be doomed in this scenario (or at least doomed to roll the dice). Or maybe we just need to keep working on it :-)
I think this is interesting in terms of thinking about counterfactuals in decision theory, preference theory, etc. To me it suggests that when we talk about counterfactuals we're putting our counterfactual worlds in a stance that mixes up what they are and what we want them to be. What they are, as in the thing going on in our brains that causes us to think in terms of counterfactual worlds, is these predictions about the world (so world models or ontology), and when we apply counterfactual reasoning we're considering different predictions about the world contingent on different inputs, possibly including inputs other than the ones we actually saw but that we are able to simulate. This means that it's not reasonable to expect that counterfactual worlds would be consistent with the history of the world (the standard problem with counterfactuals), because they aren't alternative territories but maps of how we think we would have mapped different territory.
This doesn't exactly save counterfactual reasoning, but it does allow us to make better sense of what it is when we use it and why it works sometimes and why it's a problem other times.
I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.
I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.
For example, the model of "falling" is incompatible with the model of "stationary"—they make contradictory predictions about the same boolean variables—and therefore I can't imagine a "falling stationary rock". On the other hand, I can imagine "a rubber wine glass spinning" because my rubber model is about texture etc., my wine glass model is about shape and function, and my spinning model is about motion. All 3 of those models make non-contradictory predictions (mostly because they're issuing predictions about non-overlapping sets of variables), so the three can snap together into a larger generative model.
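Here is a minimal sketch of that "snapping together" check, treating each generative model as a plain dictionary from variable names to predicted values. That representation, and the variable names, are simplifying assumptions for illustration; real generative models would be probabilistic and far richer.

```python
from typing import Optional


def compatible(a: dict, b: dict) -> bool:
    """Two models are compatible if they agree on every variable they both predict."""
    shared = a.keys() & b.keys()
    return all(a[name] == b[name] for name in shared)


def snap(*models: dict) -> Optional[dict]:
    """Merge models into one larger model, or return None on a contradiction."""
    combined: dict = {}
    for model in models:
        if not compatible(combined, model):
            return None
        combined.update(model)
    return combined


# Toy models; each predicts only a few variables.
rubber = {"texture": "rubbery", "flexible": True}
wine_glass = {"shape": "wine glass", "holds_liquid": True}
spinning = {"moving": True, "motion": "spin"}
falling = {"moving": True, "motion": "fall"}
stationary = {"moving": False}

rubber_glass_spinning = snap(rubber, wine_glass, spinning)   # non-overlapping variables
falling_stationary_rock = snap(falling, stationary)          # contradiction on "moving"
```

The first call succeeds because the three models mostly talk about different variables; the second fails because both models make a prediction about the same boolean and disagree.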
So for counterfactuals, I suppose that we start by hypothesizing some core of a model ("a bird the size of an adult blue whale") and then searching out more little generative model pieces that can snap onto that core, growing it out as much as possible in different ways, until you hit the limits where you can't snap on any more details without making it unacceptably self-contradictory. Something like that...
Again, I think I agree with what you wrote. :-)
You may find this useful https://arxiv.org/abs/1610.08602
As I recall, much of how the brain causes you to do one thing rather than another involves suppression of signals. As in, everything in the brain is doing its thing all the time, and how you manage to do only one thing and not have a seizure is that it suppresses the signals from various parts of the brain such that only one is active at a time.
That's probably a bit of a loose model and doesn't exactly explain how it maps to particular structures, but might be interesting to look at how this sort of theory of output suppression meshes with the weaker/stronger model you're looking at here built out of PP.