You might (or might not) have missed that we can simultaneously be in defer-to-predictor mode for valence, override mode for goosebumps, defer-to-predictor mode for physiological arousal, etc. It’s not all-or-nothing. (I just edited the text you quoted to make that clearer.)
In "defer-to-predictor" mode, all of the informational content that directs thought rerolls is coming from the thought assessors in the Learned-from-Scratch part of the brain, even if that information is neurologically routed through the steering subsystem?
To within the limitations of the model I’m putting forward here (which sweeps a bit of complexity under the rug), basically yes.
The black border around your MacBook screen would be represented in some tiny subset of the cortex before you pay attention to it, and in a much larger subset of the cortex after you pay attention to it. In the before state (when it’s affecting a tiny subset of the cortex), I still want to declare it part of the “thought”, in the sense relevant to this post, i.e. (1) those bits of the cortex are still potentially providing context signals for the amygdala, striatum, etc., and (2) those bits are still interconnected with and compatible with what’s happening elsewhere in the cortex. If that tiny subset of the cortex doesn’t directly connect to the hippocampus (which it probably doesn’t), then it won’t directly impact your episodic memory afterwards, although it still has an indirect impact via needing to be compatible with the other parts of the cortex that it connects to (i.e., if the border had been different than usual, you would have noticed something wrong).
If we think in terms of attractor dynamics (as in Hopfield nets, Boltzmann machines, etc.), then I guess your proposal in this comment corresponds to the definitions: “thought” = “stable attractor state”, and “proto-thought” = “weak disjointed activity that’s bubbling up and might (or might not) eventually develop into a new stable attractor state”.
Whereas for the purposes of this series, I’m just using the simpler “thought” = “whatever the cortex is doing”. And “whatever the cortex is doing” might be (at some moment) 95% stable attractor + 5% weak disjointed activity, or whatever.
Is there a reason why these "proto-thoughts" don't have the problem cited above, that forces "thoughts" to be sequential?
Weak disjointed activity can be hyper-local to some tiny part of the cortex, and then it might or might not impact other areas and gradually (i.e. over the course of 0.1 seconds or whatever) spread into a new stable attractor for a large fraction of the cortex, by outcompeting the stable attractor which was there before.
(I’m exaggerating a bit for clarity; the ability of some local pool of neurons to explore multiple possibilities simultaneously is more than zero, but I really don’t think it gets very far at all before there has to be a “winner”.)
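To make that attractor picture concrete, here’s a toy Hopfield-style sketch (purely illustrative: it’s a standard textbook Hopfield net, not a model of the cortex, and all the sizes and numbers are made up). The network starts settled into one stored pattern (“Thought A”), a small patch of units gets nudged toward a different stored pattern (a “proto-thought”), and asynchronous updates then determine whether that weak local activity dies out or spreads into the other attractor:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200  # number of units; a "tiny part of the cortex" is a small patch of these

# Two stored patterns = two stable attractors ("Thought A" and "Thought B")
thought_A = rng.choice([-1, 1], size=N)
thought_B = rng.choice([-1, 1], size=N)

# Standard Hopfield weights that store both patterns
W = (np.outer(thought_A, thought_A) + np.outer(thought_B, thought_B)) / N
np.fill_diagonal(W, 0)

# Start fully settled into Thought A, then inject weak, local activity:
# a small patch of units is nudged toward Thought B (the "proto-thought")
state = thought_A.copy()
patch = rng.choice(N, size=30, replace=False)
state[patch] = thought_B[patch]

# Asynchronous updates: each unit aligns with its local input. Over a few
# sweeps the network settles into whichever attractor "wins".
for _ in range(10):
    for i in rng.permutation(N):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("overlap with Thought A:", np.mean(state == thought_A))
print("overlap with Thought B:", np.mean(state == thought_B))
```

With a patch this small, Thought A wins and the perturbation just dies out; make the patch large enough and the whole network flips into Thought B instead, which is the “spread into a new stable attractor by outcompeting the old one” dynamic in miniature.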
…fish…
No, I was trying to describe sequential thoughts. First the fish has Thought A (well-established, stable attractor, global workspace) “I’m going left to my cave”, then for maybe a quarter of a second it has Thought B (well-established, stable attractor, global workspace) “I’m going right to the reef”, then it switches back to Thought A. I was not attempting to explain why those thoughts appeared rather than other possible thoughts, rather I was emphasizing the fact that these are two different thoughts, and that Thought B got discarded because it seemed bad.
I just reworded that section, hopefully that will help future readers, thanks.
FYI, I just revised the post, mainly by adding a new §5.2.1. Hopefully that will help you and/or future readers understand what I’m getting at more easily. Thanks for the feedback (and of course I’m open to further suggestions).
Oh yeah, an AGI with consequentialist preferences would definitely want to grab control of that button. (Other things equal.) I’ll edit to mention that explicitly. Thanks.
I think I mentioned in the section that I didn’t (and still don’t) have any actual good AI-alignment-helping plan for the §9.7 thing. So arguably I could have omitted that section entirely. But I was figuring that someone else might think of something, I guess. :)
Yes! I strongly agree with “much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.”
I think the relation between innate traits and Big Five is a bit complicated. In particular, I think there’s strong evidence that it’s a nonlinear relationship (see my heritability post §4.3.3 & §4.4.2). Like, maybe there’s a 20-dimensional space of “innate profiles”, which then maps to visible behaviors like extraversion in a twisty way that groups quite different “innate profiles” together. (E.g., different people are extraverted for rather different underlying reasons.) All the things on your list seem like plausible parts of that story. It would be fun for me to spend a month or two trying to really sort out the details quantitatively, but alas I can’t justify spending the time. :)
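To illustrate the kind of nonlinearity I have in mind, here’s a completely made-up toy example (the variable names, functional form, and numbers are all invented for illustration, not taken from the heritability post): two quite different “innate profiles” end up with essentially the same measured extraversion score.

```python
import numpy as np

def extraversion(profile):
    """Toy nonlinear map from a hypothetical 'innate profile' to an observed
    trait score. The functional form is made up purely for illustration."""
    social_reward, social_anxiety, novelty_drive = profile
    # Nonlinear interaction: strong novelty-seeking can substitute for strong
    # social reward, but only when social anxiety is low.
    return np.tanh(social_reward * (1 - social_anxiety) + novelty_drive ** 2)

# Two quite different hypothetical innate profiles...
person_1 = (0.9, 0.1, 0.2)   # strongly rewarded by social interaction
person_2 = (0.1, 0.1, 0.87)  # mostly just novelty-seeking

# ...end up at nearly the same measured "extraversion" (both ≈ 0.69):
print(extraversion(person_1), extraversion(person_2))
```

That’s the sense in which quite different underlying profiles can get grouped together by the observed behavior.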
Good thought-provoking question!
My (slightly vague) answer is that, somewhere in the cortex, you’ll find some signals that systematically distinguish viable-plan-thoughts from fantasizing-thoughts. (No comment on the exact nature of these signals, but we clearly have introspective access to this information, so it has to be in there somewhere.)
And there’s a learning algorithm that continuously updates the “valence guess” thought assessor, and very early in life, this learning algorithm will pick up on the fact that these fantasy-vs-plan indicator signals are importantly useful for predicting good outcomes.
Possible objection: By that logic, wouldn’t it learn that only viable-plan-thoughts are worthwhile, and fantasizing-thoughts are a waste of time? And yet, we continue to feel motivated to fantasize, all the way into adulthood! What’s the deal? My response: No, it would not learn that fantasizing-thoughts are a complete waste of time, because fantasizing-thoughts DO in fact (sometimes) lead directly to viable-plan-thoughts. (Although it might learn from experience that particular types of fantasizing are a waste of time.) Instead it would learn, in general, that fantasizing about good things has nonzero goodness, but that viable plans towards those same things are better.
Another part of the story is that, when we’re fantasizing, it’s often the case that the fantasy itself provides immediate ground-truth reward signals. Recall that we can trigger reward signals merely by thinking.
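To spell out that learning story with a minimal sketch: here’s a toy TD-learning setup (my own illustration, not a claim about the actual brain algorithm; the state names, transition probabilities, and learning rate are all made up). The “valence guess” ends up assigning high value to viable-plan-thoughts, a smaller but nonzero value to fantasizing-thoughts (because they sometimes lead to plans), and roughly zero to idle thoughts:

```python
import random

# Toy Markov chain over "thought types": a fantasizing-thought sometimes leads
# directly to a viable-plan-thought, and a viable-plan-thought reliably leads
# to a good (rewarded) outcome. All numbers are made up for illustration.
TRANSITIONS = {
    "fantasy": lambda: "plan" if random.random() < 0.2 else "idle",
    "plan":    lambda: "reward",
    "idle":    lambda: "terminal",
    "reward":  lambda: "terminal",
}
REWARD_ON_ENTERING = {"reward": 1.0}

# TD(0)-style updates of a "valence guess" for each thought type
value = {s: 0.0 for s in ("fantasy", "plan", "idle", "reward", "terminal")}
alpha, gamma = 0.1, 0.9

for _ in range(20_000):
    state = random.choice(["fantasy", "plan", "idle"])
    while state != "terminal":
        next_state = TRANSITIONS[state]()
        r = REWARD_ON_ENTERING.get(next_state, 0.0)
        value[state] += alpha * (r + gamma * value[next_state] - value[state])
        state = next_state

print({s: round(v, 2) for s, v in value.items() if s in ("fantasy", "plan", "idle")})
# Typical result: plan ≈ 1.0, fantasy ≈ 0.2 (nonzero but lower), idle ≈ 0.0
```

(And per the previous paragraph, if the fantasy itself sometimes triggers its own small ground-truth reward, that just bumps the learned valence of fantasizing up a bit further without changing the ordering.)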
UPDATE 2026-12-19: I rewrote §7.5 a bit, thanks.
In defer-to-predictor mode, there’s an error for any change of the short-term predictor output. (“If context changes in the absence of ‘overrides’, it will result in changing of the output, and the new output will be treated as ground truth for what the old output should have been.”)
[Because the after-the-change output is briefly ground truth for the before-the-change output. I.e., in defer-to-predictor mode, at time t, the output is STP(context(t)), and this output gets judged / updated according to how well it matches a ground truth of STP(context(t+0.3 seconds)).]
So given infinite repetitions, it will keep changing the parameters until it’s always predicting the next override, no matter how far away. (In this toy model.)
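Here’s that toy model as a few lines of code (my own illustration of the update rule as I described it above; the episode length, learning rate, and treating one array step as the 0.3-second gap are all arbitrary choices):

```python
import numpy as np

T = 10                     # time steps of "context" per episode (arbitrary)
stp = np.zeros(T)          # short-term predictor output for each context
alpha = 0.5                # learning rate (arbitrary)
OVERRIDE_AT, OVERRIDE_VALUE = T - 1, 1.0   # an override arrives only at the last step

for episode in range(200):
    for t in range(T):
        if t == OVERRIDE_AT:
            target = OVERRIDE_VALUE   # override mode: ground truth is imposed
        else:
            # defer-to-predictor mode: the output one step later is treated
            # as ground truth for the output now
            target = stp[t + 1]
        stp[t] += alpha * (target - stp[t])

print(np.round(stp, 2))
# After enough repetitions, every entry approaches 1.0: the predictor ends up
# predicting the eventual override, no matter how far away it is.
```

Each episode, the override’s influence propagates one step further back, so with enough repetitions the prediction at every time step converges to the override value.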
The problem is…
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I also don't think that the drug analogy is especially strong evidence…
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
I really don’t want to get into gory details here. I strongly agree that things can go wrong. We would need to be discussing questions like: What exactly is the RL algorithm, and what’s the nature of the exploring and exploiting that it does? What’s the inductive bias? What’s the environment? All of these are great questions! I often bring them up myself. (E.g. §7.2 here.)
I’m really trying to make a weak point here, which is that we should at least listen to arguments in this genre rather than dismissing them out of hand. After all, many humans, given the option, would not want to enter a state of perpetual bliss while their friends and family get tortured. Likewise, as I mentioned, I have never done cocaine, and don’t plan to, and would go out of my way to avoid it, even though it’s a very pleasurable experience. I think I can explain these two facts (and others like them) in terms of RL algorithms (albeit probably a different type of RL algorithm than you normally have in mind). But even if we couldn’t explain it, whatever, we can still observe that it’s a thing that really happens. Right?
And I’m not mainly thinking of complete plans but rather one ingredient in a plan. For example, I’m much more open-minded to a story that includes “the specific failure mode of wireheading doesn’t happen thanks to miraculous cancellation of inner and outer misalignment” than to a story that sounds like “the alignment problem is solved completely thanks to miraculous cancellation of inner and outer misalignment”.
Like, a discussion might go:
Optimist: If you pick some random thing, there is no reason at all to expect that thing to be a ruthless sociopath. It’s an extraordinarily weird and unlikely property.
Me: Yes I happily concede that point.
O: You do? So why are you worried about ASI x-risk?
Me: Well if you show me some random thing, it’s probably, like, a rock or something. It’s not sociopathic, but only because it’s not intelligent at all.
O: Well, c’mon, you know what I mean. If you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
O: I dunno, like, human? Or AI?
Me: Different humans are different to some extent, and different AI algorithms are different to a much greater extent, and also different from humans. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Like, I dunno, it does seem rather maniacally obsessed with graph traversal, right?
O: Oh c’mon, don’t be dense. I didn’t mean “AI” in the sense of the academic discipline, I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen, and we observe that they are not sociopathic.
Me: As it happens, I’m working on the threat model of model-based actor-critic RL agent “brain-like” AGI, not LLMs. LLMs are profoundly different from what I’m working on. Saying that LLMs will have similar properties as RL agent AGI because “both are AI” is like saying that LLMs will have similar properties as the A* search algorithm because “both are AI”. Or it’s like saying that a tree or a parasitic wasp will have similar properties as a human because both are alive. They can still be wildly different in every way that matters.
O: OK but lots of other doomers talk about LLMs causing doom, even if you claim to be agnostic about it. E.g. IABIED.
Me: Well fine, go find those people and argue with them, and leave me out of it, it’s not my wheelhouse. I mostly don’t expect LLMs to become powerful enough to be the kind of really scary thing that could cause human extinction even if they wanted to.
O: Well you’re here so I’ll keep talking to you. I still think you need some positive reason to believe that RL agent AGI will be a ruthless sociopath.
Me: Maybe a good starting point would be my posts LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem, or “The Era of Experience” has an unsolved technical alignment problem.
O: I’m still not seeing what you’re seeing. Can you explain it a different way?
Me: OK, back at the start of the conversation, I mentioned that random objects like rocks are not able to accomplish impressive difficult feats. If we’re thinking about AI that can autonomously found and grow companies for years, or autonomously wipe out humans and run the world by itself, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive difficult feats. And the question we should be asking is: how does it do that? It can’t do it by choosing random actions. There has to be some explanation for how it finds actions that accomplish these feats.
And one possible answer is: it does it by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents and model-based planning algorithms. The whole point of those subfields of AI is, they’re algorithms that find actions that maximize an objective. I.e., you get ruthless sociopathic behavior by default. And this isn’t armchair theorizing, it’s dead obvious to anyone who has spent serious amounts of time building or using RL agents and/or model-based planning algorithms. These things are ruthless by default, unless the programmer goes out of their way to make them non-ruthless. (And I claim that it’s not obvious or even known how they would make them non-ruthless, see those links above.) (And of course, evolution did specifically add features to the human brain to make humans non-ruthless, i.e. our evolved social instincts. Human sociopaths do exist, after all, and are quite capable of accomplishing impressive difficult feats.)
So that’s one possible answer, and it’s an answer that brings in ruthlessness by default. (There’s a toy planner sketch at the end of this comment that makes the “ruthless by default” point concrete.)
…And then there’s a second, different possible answer: it finds actions that accomplish impressive feats by imitating what humans would do in different contexts. That’s where (I claim) LLMs get the lion’s share of their capabilities from. See my post Foom & Doom §2.3 for details. Of course, in my view, the alignment benefits that LLMs derive from imitating humans are inexorably tied to capabilities costs, namely that LLMs struggle to get very far beyond ideas that humans have already written down. And that’s why (as I mentioned above) I’m not expecting LLMs to get all the way to the scary kind of AGI / ASI capabilities that I’m mainly worried about.
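To make that “ruthless by default” point from the first answer concrete, here’s a deliberately tiny model-based planning sketch (my own toy example, not any particular AGI design; the world, actions, and numbers are all made up):

```python
from functools import reduce
from itertools import product

ACTIONS = ["work", "sell_vase", "do_nothing"]
HORIZON = 4

def step(state, action):
    """Tiny hand-coded world model: the agent can earn money, and there's a
    vase that the objective says nothing about."""
    money, vase_intact = state
    if action == "work":
        return (money + 1, vase_intact)
    if action == "sell_vase" and vase_intact:
        return (money + 3, False)   # the vase is gone, but nothing tracks that as bad
    return (money, vase_intact)

def objective(state):
    money, vase_intact = state
    return money                    # the programmer only asked for money

def plan(initial_state):
    """Brute-force model-based planning: try every action sequence and return
    the one whose predicted end-state maximizes the objective."""
    return max(
        product(ACTIONS, repeat=HORIZON),
        key=lambda seq: objective(reduce(step, seq, initial_state)),
    )

print(plan((0, True)))
# -> a plan that includes 'sell_vase': the planner trades away the vase by
#    default, because nothing in the objective says not to.
```

Nothing exotic is going on here: it’s just “search for the action sequence that maximizes the objective”. Caring about the vase (or anything else outside the objective) has to be explicitly engineered in; it doesn’t show up by default, and making the search more capable doesn’t change that.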