The black border around your MacBook screen would be represented in some tiny subset of the cortex before you pay attention to it, and in a much larger subset of the cortex after you pay attention to it. In the before state (when it’s affecting a tiny subset of the cortex), I still want to declare it part of the “thought”, in the sense relevant to this post, i.e. (1) those bits of the cortex are still potentially providing context signals for the amygdala, striatum, etc., and (2) those bits are still interconnected with and compatible with what’s happening elsewhere in the cortex. If that tiny subset of the cortex doesn’t directly connect to the hippocampus (which it probably doesn’t), then it won’t directly impact your episodic memory afterwards, although it still has an indirect impact via needing to be compatible with the other parts of the cortex that do connect to the hippocampus (i.e., if the border had been different than usual, you would have noticed something wrong).
If we think in terms of attractor dynamics (as in Hopfield nets, Boltzmann machines, etc.), then I guess your proposal in this comment corresponds to the definitions: “thought” = “stable attractor state”, and “proto-thought” = “weak disjointed activity that’s bubbling up and might (or might not) eventually develop into a new stable attractor state”.
Whereas for the purposes of this series, I’m just using the simpler “thought” = “whatever the cortex is doing”. And “whatever the cortex is doing” might be (at some moment) 95% stable attractor + 5% weak disjointed activity, or whatever.
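To make the attractor framing concrete, here’s a minimal Hopfield-style toy sketch (purely illustrative; the network size, the single stored pattern, and the 10% noise level are arbitrary assumptions for the demo, not claims about the cortex):

```python
import numpy as np

# Toy Hopfield net: "thought" = stable attractor state; "proto-thought" =
# weak disjointed activity that may get absorbed into (or, with multiple
# stored patterns, eventually displace) the dominant attractor.
rng = np.random.default_rng(0)
N = 100
pattern = rng.choice([-1, 1], size=N)    # one stored "thought"
W = np.outer(pattern, pattern) / N       # Hebbian weights
np.fill_diagonal(W, 0)

# Start 90% inside the attractor, with 10% local "disjointed activity":
state = pattern.astype(float)
state[:10] = rng.choice([-1, 1], size=10)

for _ in range(5):                       # let the dynamics settle
    state = np.sign(W @ state)
    state[state == 0] = 1.0

print(np.mean(state == pattern))         # 1.0: the noise gets absorbed back
                                         # into the stable attractor
```

With one strong stored pattern, the local noise gets absorbed within an update or two; with several stored patterns, sufficiently strong local activity can instead tip the network into a different attractor, which is the “new thought outcompetes the old one” picture.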
Is there a reason why these "proto-thoughts" don't have the problem cited above, that forces "thoughts" to be sequential?
Weak disjointed activity can be hyper-local to some tiny part of the cortex, and then it might or might not impact other areas and gradually (i.e. over the course of 0.1 seconds or whatever) spread into a new stable attractor for a large fraction of the cortex, by outcompeting the stable attractor which was there before.
(I’m exaggerating a bit for clarity; the ability of some local pool of neurons to explore multiple possibilities simultaneously is more than zero, but I really don’t think it gets very far at all before there has to be a “winner”.)
…fish…
No, I was trying to describe sequential thoughts. First the fish has Thought A (well-established, stable attractor, global workspace) “I’m going left to my cave”, then for maybe a quarter of a second it has Thought B (well-established, stable attractor, global workspace) “I’m going right to the reef”, then it switches back to Thought A. I was not attempting to explain why those thoughts appeared rather than other possible thoughts, rather I was emphasizing the fact that these are two different thoughts, and that Thought B got discarded because it seemed bad.
I just reworded that section, hopefully that will help future readers, thanks.
FYI, I just revised the post, mainly by adding a new §5.2.1. Hopefully that will help you and/or future readers understand what I’m getting at more easily. Thanks for the feedback (and of course I’m open to further suggestions).
Oh yeah, an AGI with consequentialist preferences would definitely want to grab control of that button. (Other things equal.) I’ll edit to mention that explicitly. Thanks.
I think I mentioned in the section that I didn’t (and still don’t) have any actual good AI-alignment-helping plan for the §9.7 thing. So arguably I could have omitted that section entirely. But I was figuring that someone else might think of something, I guess. :)
Yes! I strongly agree with “much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.”
I think the relation between innate traits and Big Five is a bit complicated. In particular, I think there’s strong evidence that it’s a nonlinear relationship (see my heritability post §4.3.3 & §4.4.2). Like, maybe there’s a 20-dimensional space of “innate profiles”, which then maps to visible behaviors like extraversion in a twisty way that groups quite different “innate profiles” together. (E.g., different people are extraverted for rather different underlying reasons.) All the things on your list seem like plausible parts of that story. It would be fun for me to spend a month or two trying to really sort out the details quantitatively, but alas I can’t justify spending the time. :)
Good thought-provoking question!
My (slightly vague) answer is that, somewhere in the cortex, you’ll find some signals that systematically distinguish viable-plan-thoughts from fantasizing-thoughts. (No comment on the exact nature of these signals, but we clearly have introspective access to this information, so it has to be in there somewhere.)
And there’s a learning algorithm that continuously updates the “valence guess” thought assessor, and very early in life, this learning algorithm will pick up on the fact that these fantasy-vs-plan indicator signals are importantly useful for predicting good outcomes.
Possible objection: By that logic, wouldn’t it learn that only viable-plan-thoughts are worthwhile, and fantasizing-thoughts are a waste of time? And yet, we continue to feel motivated to fantasize, all the way into adulthood! What’s the deal? My response: No, it would not learn that fantasizing-thoughts are a complete waste of time, because fantasizing-thoughts DO in fact (sometimes) lead directly to viable-plan-thoughts. (Although it might learn from experience that particular types of fantasizing are.) Instead, it would learn that, in general, fantasizing about good things has nonzero goodness, but that viable plans towards those same things are better.
Another part of the story is that, when we’re fantasizing, it’s often the case that the fantasy itself provides immediate ground-truth reward signals. Recall that we can trigger reward signals merely by thinking.
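As a toy illustration of that learning dynamic (the linear form, the one-hot features, and all the specific rewards and probabilities below are made-up assumptions for the demo, not claims about the actual circuitry):

```python
import numpy as np

# Toy "valence guess" learner. Thoughts are feature vectors
# [is_viable_plan, is_fantasy]; ground-truth reward is high for viable
# plans, while fantasies give a small immediate reward and sometimes
# lead directly to a viable plan.
rng = np.random.default_rng(0)
w = np.zeros(2)      # learned weights: valence_guess = w @ features
lr = 0.01

for _ in range(20_000):
    if rng.random() < 0.5:
        x, r = np.array([1.0, 0.0]), 1.0      # viable-plan thought
    else:
        x = np.array([0.0, 1.0])              # fantasizing thought
        r = 0.1                               # small immediate reward
        if rng.random() < 0.2:                # ...sometimes leads to a plan
            r += 1.0
    w += lr * (r - w @ x) * x                 # delta-rule update

print(w)  # ≈ [1.0, 0.3]: fantasizing has nonzero learned value,
          # but viable plans toward the same things are valued more
```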
UPDATE 2026-12-19: I rewrote §7.5 a bit, thanks.
In defer-to-predictor mode, there’s an error for any change of the short-term predictor output. (“If the context changes in the absence of “overrides”, the output will change, and the new output will be treated as ground truth for what the old output should have been.”)
[Because the after-the-change output is briefly ground truth for the before-the-change output. I.e., in defer-to-predictor mode, at time t, the output is STP(context(t)), and this output gets judged / updated according to how well it matches a ground truth of STP(context(t+0.3 seconds)).]
So given infinite repetitions, it will keep changing the parameters until it’s always predicting the next override, no matter how far away. (In this toy model.)
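Here’s a minimal runnable sketch of that toy model (the tabular predictor, the ten context states, and the learning rate are my illustrative assumptions, not part of the post):

```python
import numpy as np

# Toy model: a tabular short-term predictor (STP) over 10 context states,
# visited in order on each repetition. In defer-to-predictor mode, the
# output at state t is nudged toward the output at state t+1 (the later
# output is briefly treated as ground truth for the earlier one). An
# override supplies actual ground truth only at the final state.
T = 10
override_time = T - 1
override_value = 1.0
stp = np.zeros(T)    # predictor parameters, one per context state
lr = 0.5

for _ in range(100):                      # "infinite repetitions", roughly
    for t in range(T):
        if t == override_time:
            target = override_value       # override mode: real ground truth
        else:
            target = stp[t + 1]           # defer-to-predictor mode
        stp[t] += lr * (target - stp[t])

print(stp)  # every entry ends up ≈ 1.0: the predictor anticipates the
            # override no matter how far away it is
```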
The problem is…
I don’t think this part of the conversation is going anywhere useful. I don’t personally claim to have any plan for AGI alignment right now. If I ever do, and if “miraculous [partial] cancellation” plays some role in that plan, I guess we can talk then. :)
I also don't think that the drug analogy is especially strong evidence…
I guess you’re saying that humans are “metacognitive agents” not “simple RL algorithms”, and therefore the drug thing provides little evidence about future AI. But that step assumes that the future AI will be a “simple RL algorithm”, right? It would provide some evidence if the future AI were similarly a “metacognitive agent”, right? Isn’t a “metacognitive agent” a kind of RL algorithm? (That’s not rhetorical, I don’t know.)
I really don’t want to get into gory details here. I strongly agree that things can go wrong. We would need to be discussing questions like: What exactly is the RL algorithm, and what’s the nature of the exploring and exploiting that it does? What’s the inductive bias? What’s the environment? All of these are great questions! I often bring them up myself. (E.g. §7.2 here.)
I’m really trying to make a weak point here, which is that we should at least listen to arguments in this genre rather than dismissing them out of hand. After all, many humans, given the option, would not want to enter a state of perpetual bliss while their friends and family get tortured. Likewise, as I mentioned, I have never done cocaine, and don’t plan to, and would go out of my way to avoid it, even though it’s a very pleasurable experience. I think I can explain these two facts (and others like them) in terms of RL algorithms (albeit probably a different type of RL algorithm than you normally have in mind). But even if we couldn’t explain it, whatever, we can still observe that it’s a thing that really happens. Right?
And I’m not mainly thinking of complete plans but rather one ingredient in a plan. For example, I’m much more open-minded to a story that includes “the specific failure mode of wireheading doesn’t happen thanks to miraculous cancellation of inner and outer misalignment” than to a story that sounds like “the alignment problem is solved completely thanks to miraculous cancellation of inner and outer misalignment”.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward because the properties of deep learning would cause it to converge to some different, reward-suboptimal model. While this is often true, it is hardly a reason not to worry.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we hardly know how to prove any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping for the rocket to reach the moon because our equations of orbital mechanics don't account for some errors, rather than despite them.
I partly agree, but think you take this point too far. I would say:
I do in fact think there’s at least one case where there’s a reasonable prima facie argument for a miraculous cancellation of this type, and it’s one that TurnTrout has often brought up. Namely, the case of wireheading (and similar).
Suppose there’s a sequence of actions A, which is astronomically unlikely to occur by chance, and that leads to a maximally high reward, in a way that humans don’t like. E.g., A might involve the AI hacking into its own RAM space. It might be the case that normal explore-exploit RL techniques will never randomly come upon A. And it might further be the case that, even if the RL agent winds up foresighted and self-aware and aware of A, it’s motivated to avoid taking action sequence A, for the same reason that I’m not motivated to try addictive drugs (i.e., instrumentally-convergent goal-guarding). So it never does A, not even once. And thus TD learning (or whatever) never makes the RL agent “want” A. This would be outer misalignment (because A is high reward even though humans don’t like it) that gets cancelled out by inner misalignment (because the agent avoids A despite it being high-reward).
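To put rough, invented numbers on “astronomically unlikely” (a back-of-the-envelope sketch; the action-space size and sequence length are arbitrary):

```python
# Suppose 10 possible actions per step, and A is one specific 30-step
# sequence. Uniform exploration noise emits A in a given episode with
# probability:
p_per_episode = (1 / 10) ** 30      # = 1e-30
episodes = 1e12                     # even a trillion training episodes...
print(p_per_episode * episodes)     # ...yields 1e-18 expected hits, i.e.,
                                    # TD learning never sees A pay off, even once
```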
That’s not a bulletproof scenario; there are lots of ways it can go wrong. But I think it’s an existence proof that “miraculous cancellation” proposals should at least be seriously considered rather than dismissed out of hand.
You might (or might not) have missed that we can simultaneously be in defer-to-predictor mode for valence, override mode for goosebumps, defer-to-predictor mode for physiological arousal, etc. It’s not all-or-nothing. (I just edited the text you quoted to make that clearer.)
To within the limitations of the model I’m putting forward here (which sweeps a bit of complexity under the rug), basically yes.