My research: a computational cognitive neuroscience perspective on alignment

Seth Herd

This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working full-time on alignment for three years and change, and thinking about brainlike AGI and its alignment increasingly often since 2004.

Here's the research agenda in one breath: I'm trying to predict what the first transformative AI will be, in enough mechanistic detail that we can predict likely failure modes of its alignment. That's in service of finding interventions that address those failure modes efficiently, so that they can realistically be implemented even if timelines are short and work is rushed. I'm using my background in computational cognitive neuroscience to predict what might be called loosely brainlike AGI: LLMs with added human-like cognitive capacities. This and my other approaches adapt the integrative secondary research approach, and draw on the knowledge and perspective I gained from about 23 years of work in a niche subfield we called computational cognitive neuroscience.

I'll give a summary in the rest of this section, then give a little more depth on each major thread of my work in the remaining sections. All of it is pretty brief.

Approach and premises

Most alignment work falls roughly into one of two broad categories, with much variation and many notable exceptions: empirical study of current systems ("prosaic alignment"), or theory about idealized agents ("agent foundations"). Each approach carries an implicit assumption. The first assumes that the AI we're trying to align will be like current AI. The second assumes that we have no idea what the AI we need to align will look like, so we must work on the fully general problem. I think there's a neglected third option: carefully predicting properties of the first AIs in which it's really important to get alignment right. I'm trying to make alignment plans that both anticipate new difficulties^[1] and take advantage of the strengths of LLMs.

I'm trying to predict how LLMs might be augmented to reach takeover-capable AI (TCAI),^[2] beyond scaling the current approach. I'm looking at how developers might add systems for continuous learning and executive function, and how those would create new challenges for alignment. I present the case that this is the most likely route to the AI competent and agentic enough to control the future, and therefore the type of AI most crucial to align. (I don't think this is certain; something like Steve Byrnes' “Brain in a box in a basement” could overtake it, given some otherwise-helpful limitations inherent in the LLM approach).

I'm pleased to see more people reasoning explicitly and mechanistically about the transition from current LLMs to transformative and takeover-capable LLM-based AGI. There is visibly more now than three years ago.^[3] I think there's still too little of it for comfort. Of course no one is a pure empiricist or pure theorist, but in practice the field still seems fairly strongly divided by perspective, although the delta is encouraging.

My background gives me an unusual angle on predicting the form of our first TCAI. It's in "computational cognitive neuroscience", a rare type of integrative work. I worked for 23 years in a lab that built neural network models of brain function, ranging from basic attention and vision, to attention for executive function, to System 2 serial processing for complex thought and decision-making. These models were in service of integrative secondary research, a fancy name for reading a lot of empirical work and theory, and thinking hard and collaboratively about how it all fit together. I was lucky and privileged to spend most of my time reading and thinking about how brain systems come together to produce human thought and knowledge, and the computational principles of why it works as it does.

I think it's likely (50%-90%, uncertain) that the first TCAI will be advanced LLMs, scaffolded and trained to have cognitive faculties they now lack relative to humans. Those faculties are principally some degree of persistent memory/learning, metacognition for uncertainty and error detection, and executive function for planning and thought management. Together, these will enable autonomous, goal-directed, long-horizon work. This is a specific bet, but I think some insights from taking this perspective apply to broader forms of network-based AGI.^[4] I think the first TCAIs will probably be aligned much like current systems (RLHF, constitutional AI/deliberative alignment, character training), with some modest additions (§2.2), but those same alignment techniques will create different final alignment, based on emergent effects of those added cognitive capacities.

That is a bet on what might be called loosely brainlike cognition. Current LLMs are already surprisingly brainlike in some ways.^[5] LLMs are arguably much like humans' cortical areas for language (Broca's & Wernicke's areas). I think the brain's abundant recurrence serves a similar function to transformers' architecture of attention modulating connections from past serial token processing. Of course this mapping is rough, with many important differences. One is that LLMs capture more semantic knowledge or crystallized intelligence than human language cortices.

On this perspective, LLMs have a route to human-like competence with more human-like systems, training, and thinking. And the incentives are driving people to create those rapidly. Standard LLM scaleup receives the most effort, but substantial work is underway on novel systems and training methods for each of those missing elements of human cognition (reviewed in the work I summarize in the next section).

Philosophy of the approach

Accurate, detailed predictions of how we'll try to align AGI would allow efficient use of limited alignment time and effort. My primary theory of change is that spreading clearer thinking about likely paths to AGI and alignment will improve our average efficiency and foresight on crucial alignment issues.

This research agenda is messier than focusing on specific technical problems or approaches, or assuming rather than predicting a particular form of AGI/TCAI. This first-principles, all-inclusive approach has substantial downsides, but seems like a useful bet in conjunction with those types of more focused work.

I think my work takes relatively neglected approaches. Gears-level models (in this case, of likely TCAI architectures) can be considered Capital Investments: expensive to build, and therefore rare. Such work may also be difficult to automate, and so may get little developer attention at crunch time (but see §4.2 on AI for epistemics). These issues do receive attention, and increasingly more, but I think they're still neglected relative to their potential impact.^[6]

The forms of first TCAI and alignment efforts will in part be shaped by commercial, social, and political forces. Thus, my work has spread somewhat to these topics, since work there seemed severely lacking when I started working on the problem.

I'll discuss the technical prediction work first, since that's where the bulk of my work has gone. Then I'll briefly describe my cognitive neuroscience background, with a little more optional detail for those who might find it as fascinating as I do. Then I'll describe some of my constellation of work predicting and analyzing the implications for alignment of a few other factors: shifting societal and governmental attitudes toward AI; choice of alignment targets; and motivated reasoning and confirmation bias acting upon the public and the AI development and alignment research communities.

2. Technical work

My main work is in what might be called semi-technical alignment and predictions. It's making, refining, and sharing gears-level models of likely first TCAI, alignment techniques likely to be used for it, and likely failure points and risk models. The theory of change is spreading those models and claims as broadly as possible through writing and discussion, so that we've collectively thought more about the specifically relevant questions when crunch time hits.

2.1 Predicted paths to TCAI

TCAI won't work like a human brain, but it will likely incorporate some elements of human cognition. Agentized LLMs will change the alignment landscape was a broad and brief prediction of the new shape of the alignment problem, as I and others were seeing agentic, scaffolded LLMs become the most likely route to first AGI.

Capabilities and alignment of LLM cognitive architectures was a more specific and detailed prediction of how other cognitive systems could be added to fill in the cognitive abilities that humans have and LLMs lack. I still think this is roughly correct, and progress and others' analysis have borne this out. So I'll restate the main elements of cognition that I think stand between current LLMs and human-plus performance in agentic settings.

Memory (continuous learning)

Episodic memory is now becoming useful, in the form of notes in coding scaffolds and vector memory systems in OpenClaw. Fine-tuning functions much like human semantic memory. The later, more complete treatment is in LLM AGI will have memory, and memory changes alignment, where I review the reasons to expect added memory systems before takeover-capable AGI. Such beyond-session memory or learning during deployment creates an alignment stability problem (§5.2).

Executive function and metacognition

This is the other major direction I see distinguishing LLM cognition from fully human-level competence. Scaffolding for task structure, plans, and long-term goals is just getting off the ground in coding scaffolds and agentic scaffolds like OpenClaw. The later, more complete statement is in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. I'm now hoping that better metacognition will produce better AI for epistemics and automated alignment, faster than it accelerates capabilities. This will of course be tricky to aim for, and it's not clear how hard anyone is trying.

My early work in 2023 focused on scaffolding LLMs into "language model cognitive architectures" (LMCAs). This turned out to be largely wrong about speed and ordering. AutoGPT and similar systems with vector-based memory systems and prompting for executive function did not quickly become useful. But these predictions were correct about the heavy focus on chain-of-thought or System 2 as a step from early LLMs to human-like cognitive competence.

2.2 Predicted paths to (mis)alignment

Internal independent review for language model agent alignment is another likely alignment technique for such scaffolded LLM systems. I called this internal independent review because it's internal to the agentic scaffold or harness. But it's independent, because a separate model instance could be called to review actions or plans before they're executed. This approach would span the line between AI control and alignment.

A review process, standing between an agent's main LLM "cognitive engine" and actions-you-don't-like, such as writing fanfic porn on your social media account, or making plans to overthrow humanity. From System 2 Alignment.

This could be a nontrivial addition to training-based LLM alignment, because such review happens before actions affect either the world or internal memories and beliefs. This approach still seems promising and likely to be included as we near AGI. Auto mode in Claude Code and Codex is a minimal version. Such a review process could probably be circumvented by a highly intelligent base LLM, but it could be useful as part of a hodge-podge approach to an "alignment MVP" that is weakly and spikily superhuman, and could aid with automated alignment research.
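As a rough illustration of where such a review step sits in a scaffold, here is a minimal Python sketch. Everything in it is hypothetical: `keyword_reviewer` is a trivial stand-in for a separate reviewer model instance, and `run_with_review` is an invented name, not the API of any real agent harness. The only point is the control flow: each proposed action is reviewed before it is executed, and rejected actions never reach the executor.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    approved: bool
    reason: str


def keyword_reviewer(action: str) -> Verdict:
    """Hypothetical stand-in for a separate reviewer model instance."""
    banned = ("delete all", "disable oversight")
    for phrase in banned:
        if phrase in action.lower():
            return Verdict(False, f"matched banned phrase: {phrase!r}")
    return Verdict(True, "no objection")


def run_with_review(
    proposed_actions: List[str],
    reviewer: Callable[[str], Verdict],
    execute: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Review each proposed action BEFORE it can affect the world or memory."""
    executed, blocked = [], []
    for action in proposed_actions:
        verdict = reviewer(action)
        if verdict.approved:
            execute(action)  # only approved actions reach the executor
            executed.append(action)
        else:
            blocked.append((action, verdict.reason))  # kept for human audit
    return executed, blocked


if __name__ == "__main__":
    log: List[str] = []
    done, stopped = run_with_review(
        ["Summarize the report", "Disable oversight logging"],
        keyword_reviewer,
        log.append,
    )
    print("executed:", done)
    print("blocked:", stopped)
```

In a real system the reviewer would be another model call rather than a keyword match, and that reviewer is exactly the component a capable, scheming base model might learn to get around.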

Goals selected from learned knowledge: an alternative to RL alignment explored the implications of aligning pretrained LLMs, rather than performing alignment purely through RL. This refers to a whole class of alignment techniques, but centers on prompting LLMs as one means of aligning them. This sounds laughably weak as an alignment solution, but it still looks likely to be part of the first human-plus LLM AGI's alignment approach. Coding scaffolds maintain plan and subgoal lists, and inject current goals as prompts at intervals. The important perspective is that alignment will be an emergent property of a system, with the alignment of the core LLM an important part of the composite system's alignment, but not the only part.
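To make "inject current goals as prompts at intervals" concrete, here is a minimal sketch under stated assumptions. The function name `build_prompts`, the reminder wording, and the fixed interval are all invented for illustration; real coding scaffolds differ in how and when they restate goals.

```python
from typing import List


def build_prompts(goals: List[str], steps: List[str], every: int = 3) -> List[str]:
    """Prefix every `every`-th step with a reminder of the current goals."""
    reminder = "Current goals: " + "; ".join(goals)
    prompts = []
    for i, step in enumerate(steps):
        if i % every == 0:
            # Re-inject the goal list so it stays in the model's context
            prompts.append(reminder + "\n" + step)
        else:
            prompts.append(step)
    return prompts


if __name__ == "__main__":
    for p in build_prompts(
        ["fix the failing test", "do not touch the public API"],
        ["read the traceback", "open the module", "edit the function", "run tests"],
    ):
        print(p)
        print("---")
```

The goals here live in the scaffold rather than in the weights, which is one small sense in which alignment becomes a property of the composite system.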

System 2 Alignment: Deliberation, Review, and Thought Management expanded on independent reviews for alignment, and related techniques. Writing this piece convinced me that techniques for capabilities will also provide new alignment techniques with low alignment taxes. It's valuable for efficiency and reputation to control models' thoughts and behaviors. That can be done with scaffolding and/or training to monitor and review actions, plans, and chains of thought. Extra efficiency is increasingly important as long-horizon tasks consume more expensive compute. The current scaffolding for plans and to-do lists in coding scaffolds like Claude Code and Codex is a nascent form of this type of scaffolding. I predict that these techniques will be expanded when Mythos and similarly capable models are released. I also looked at some ways these techniques could be circumvented by a scheming (deceptively aligned) base LLM.

LLM AGI may reason about its goals and discover misalignments by default and Broadening the training set for alignment are recent posts exploring these predictions and alignment proposals, but I'll discuss them in §5 because they span predictions and problems. First I'll summarize my previous half-career, because it's the basis of my approach to alignment, and the source of many of my claims about likely AGI and alignment approaches.

3. My research background in computational cognitive neuroscience

I joined a PhD program in 1999, and worked in that same lab until mid-2022. We built integrative models of how brain systems produce human thought; I focused on complex, serial thinking. We called this computational cognitive neuroscience, since we were trying to work across computer science, cognitive psychology, and various branches of neuroscience. We built theoretical and computational models of attention, executive function, serial deliberation, decision-making, episodic and working memory, and models combining those systems.

The point of that work wasn't the models themselves; it was the computational principles they illustrated, the why of human cognition. I spent most of my time reading and thinking about how brain systems come together, rather than building elaborate networks. I'm applying a similar research strategy to the similarly-difficult problem of understanding how future AGI will work and be aligned.

I hope this perspective gives me some relevant comparative advantages. One might be in seeing which cognitive functions LLM agents currently have and lack, relative to humans. Another could be seeing the emergent effects that adding those capacities could have on capabilities and alignment.

My neuroscience work included some theories I've never written about before, since I was worried they might accelerate progress in brainlike AGI (I now think they wouldn't and won't, since LLMs have eclipsed brainlike approaches). For the interested reader, I'll give a high-level summary of that work, on attention as control of thought. I also summarize some of my more directly LLM-relevant work on how serial chains of thought are created and managed by a brain that works mostly in parallel. I think it's interesting, but I'm pretty biased!

Selected specifics of my computational cognitive neuroscience career and work

My PhD advisor and lab head was Randy O'Reilly. His advisor was Jay McClelland, co-author of the Parallel Distributed Processing volumes (1986 & 1987) that established "connectionism" as an approach in cognitive psychology and neuroscience. They used small neural network models of brain function to inform then-new theories of cognition. This is based on the claim that the neural network basis of the brain should constrain theories of thought, and should have observable effects in behavior.

Today, every neuroscientist probably agrees with and uses this viewpoint to some degree. People like Gary Marcus would still say it doesn't explain the most important parts of cognition. I think he and other dedicated critics of the approach overstated the case, but made one very good point. I progressively turned my studies to how neural networks' fundamentally parallel operation becomes the fundamentally serial operation of System 2 cognition, roughly chain-of-thought in humans.

Randy's and the lab's approach was connectionist, but more closely based on empirical neuroscience. We read papers on everything from ion channel action to think-out-loud protocols where people try to introspect on their cognition, and a vast variety in-between. Randy used neural network models as his "empirical" research methodology. These computational models served to summarize a lot of our neuroscience knowledge, in the form of their architecture and parameters and learning rules. We'd publish results from the computational models we made. But the point wasn't really what the models could do (although they were early learning network models, and what they could do was sometimes impressive for the time). The point was the computational principles they illustrated: the hows and whys of human cognition.

I'll just briefly mention a few of my publications, to give a flavor of what I studied. My dissertation was on the neural mechanisms of visual search. It included some basic behavioral experiments, but it was mostly a computational theory derived from reading tons of literature on how the visual system functioned, particularly the mechanisms of attention. I wasn't particularly interested in vision, but there's been a sort of path-dependent accrual of empirical evidence in the visual system, so it's the area in which you can best constrain theories with evidence.

This added detail and solid empirical backing to the theory (not really mine originally) of how attention is an emergent effect of neural networks, and it controls what we do and what we think about by essentially the same mechanisms by which it controls what we perceive. This is a pretty central mechanism for how the brain seems to pull off complex thought; it attends to (very roughly) one thought at a time for maximum processing power on each one, then serializes those into more complex chains of thought in System 2 processing.

My interest in complex cognition led me to focus on elaborating in more detail how System 2 serial processing was generated from a fundamentally parallel neural network architecture. The central answer I reached was a dynamic in which the brain naturally proceeds from one "thought" or brain-state to another, related one. This dynamic is created in a neat, simple way: neurons fire continuously to represent information; they "get tired" of firing by depletion of resources; this lets new neurons representing new information or "thoughts" start firing, and which ones start firing is a product of the complex pattern of information in the last brain-state. This allows attractor dynamics, and serial progression or "chain of thought". But this theory of how thought works was too vague to publish, and didn't fit our funding. Later, I decided not to publish it at all, to avoid speeding progress toward AGI. (I'd believed brainlike AGI was possible within my lifetime, and that alignment is crucial and tricky, since encountering Eliezer's arguments around 2004.)
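A toy illustration of that fatigue-driven dynamic, in Python. This is a cartoon written for this summary, not a reconstruction of the unpublished theory or of any of the lab's models: the three "thoughts", the association matrix, and the `fatigue_gain` and `recovery` parameters are all made-up assumptions. Each unit most strongly supports itself (an attractor), so without fatigue the state never changes; with fatigue, the active unit tires and the next most strongly associated unit takes over.

```python
from typing import List


def thought_sequence(
    assoc: List[List[float]],
    start: int,
    steps: int,
    fatigue_gain: float = 1.0,
    recovery: float = 0.5,
) -> List[int]:
    """Winner-take-all over units, with fatigue on the currently active unit."""
    n = len(assoc)
    fatigue = [0.0] * n
    current = start
    sequence = [current]
    for _ in range(steps):
        fatigue = [f * recovery for f in fatigue]  # every unit recovers a bit
        fatigue[current] += fatigue_gain           # the active unit "gets tired"
        # Drive to each unit: association from the current state minus fatigue
        drive = [assoc[current][j] - fatigue[j] for j in range(n)]
        current = max(range(n), key=lambda j: drive[j])
        sequence.append(current)
    return sequence


# Hypothetical associations: row i gives how strongly state i supports each state.
ASSOC = [
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.7],
    [0.6, 0.1, 0.9],
]

if __name__ == "__main__":
    print(thought_sequence(ASSOC, start=0, steps=4))                    # [0, 1, 2, 0, 1]
    print(thought_sequence(ASSOC, start=0, steps=4, fatigue_gain=0.0))  # [0, 0, 0, 0, 0]
```

Which state comes next depends on the associations out of the current state, which is the sense in which the last brain-state determines the next "thought".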

But I could work on it even if I didn't want to publish my theories on the elegant simplicity of human thought. Neuroscientists love mechanistic detail and don't focus on comprehensive theories. In 2013 I was first author on a collaborative paper called strategic cognitive sequencing (I hadn't yet applied the System 2 or chain of thought terminologies). That was a collection of small computational models that illustrated how brain systems could come together to strategically select next thoughts, and so do useful cumulative cognitive work.

This theory was better-developed in neural mechanisms of human decision-making in 2021. Selecting the right "next thought" is often a mini-decision. These quick, strategic "internal actions" are superimposed on the "automatic," emergent means of selecting a topic-relevant next thought through the attentional dynamics of constraint satisfaction in a laterally connected network (a fascinating topic too large for the margin;). This happens in a system for internal action selection that's a rough copy of the system that controls selection of motor actions.

That set of theories is indirectly relevant to predicting how LLM agents will work, because I think it might be about as close as anyone has gotten to understanding how the human brain does its most impressive tricks (at the level I've focused on, roughly between Marr's algorithmic and computational levels). There are deep reasons we do complex cognition that way. It's totally possible for a cognitive system to do it some other way. But LLMs are following the same core strategy for cognition, so their adaptation for complex, strategic cognition might look the same.

I also did some work on models of episodic memory. Those include theories of how the brain deals with catastrophic forgetting; these are old news in cognitive neuroscience, but that doesn't seem to have carried over to AI. Basically, we accept a lot of catastrophic forgetting because we have to; this is the source of LLMs having vastly better recall of detailed information than we do.

But there's another trick: using a separate system (the hippocampus and medial temporal lobe) to learn fast without a high learning rate on the whole system, so that new knowledge can be stored but sequestered, to see if it's important enough to learn (and thereby overwrite previous memories). We do this by replaying important memories from the episodic memory store, thus "consolidating" important memories into our general semantic knowledge, at a speed and depth appropriate to their estimated importance. This also has counterproductive side effects, including much of our trauma or excessive fear responses.
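A minimal sketch of that two-system idea, with heavy simplifications. The dictionaries, the `importance` field, and the rule "number of replays is proportional to importance" are assumptions made for illustration; this is not a model from the literature. The episodic store records an event in one shot, while the semantic store changes only through repeated replay with a small learning rate.

```python
from typing import Dict, List


def store_episode(episodic: List[dict], key: str, value: float, importance: float) -> None:
    """Fast system: record the episode immediately, in one shot."""
    episodic.append({"key": key, "value": value, "importance": importance})


def consolidate(
    semantic: Dict[str, float],
    episodic: List[dict],
    max_replays: int = 20,
    slow_lr: float = 0.1,
) -> Dict[str, float]:
    """Slow system: replay each episode in proportion to its importance."""
    for ep in episodic:
        n_replays = round(ep["importance"] * max_replays)
        for _ in range(n_replays):
            old = semantic.get(ep["key"], 0.0)
            # Small step toward the replayed value; old knowledge fades gradually
            semantic[ep["key"]] = old + slow_lr * (ep["value"] - old)
    return semantic


if __name__ == "__main__":
    episodic: List[dict] = []
    store_episode(episodic, "important", 1.0, importance=1.0)
    store_episode(episodic, "minor", 1.0, importance=0.25)
    store_episode(episodic, "trivial", 1.0, importance=0.0)
    print(consolidate({}, episodic))
```

Episodes judged unimportant are never replayed and so never touch the slow store, while important ones are learned more deeply, at the cost of overwriting whatever the slow store held for that item.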

I collaborated on a lot of neural network modeling work, and spent a lot of time trying to deeply understand how they worked (they used a biologically realistic variant of backpropagation, so their learning dynamics are reasonably similar to transformers' backprop learning; that analogy is fascinating but again too large for the margin here;). But I didn't spend a lot of my time constructing and tinkering with complex networks; mine tended to be more like animated diagrams for a computational concept. This allowed me to spend more time reading and thinking. I let my more detail-oriented (or IMO obsessed!) colleagues do more of the elaborate modeling that was trying to achieve cutting-edge (at the time) results.

Unlike my mentor Randy, I thought spending tons of time running neural networks and getting them to do what you want was largely a waste of time compared to more reading and careful thinking. But they were also our token "empirical" result, in an awkward way; the "coin of the realm" that let us do so much work on theory, while sort-of publishing legible proof-of-progress. So I was fortunate and Randy was generous in letting me spend most of my time reading and thinking. (My thesis was particularly weak on empirical research, so I adopted a proof-of-work-by-sheer-mass-of-review-and-writing approach. I don't really recommend trying to read it;).

4. Societal influences on AI safety

I've also done some work trying to analyze the likely societal dynamics around AGI. Practical constraints change which alignment solutions are actually implementable, and therefore useful. This logic for putting effort into understanding and predicting the societal structures surrounding development of AGI is closely related to my logic for putting effort into predicting likely properties of first TCAI.

I'm of course not an expert or even a serious amateur in politics, business, or the dynamics of public opinion. I've scaled back the time I spend on this area, as domain experts have started to address these questions in the context of AI progress and AGI x-risk. However, I think there's still useful work for generalists with real understanding of AI and its potential development trajectory to make predictions in this space. Accurate predictions depend on interactions among domains (political processes, business, public opinion, and AI progress), so everyone is working outside of their expertise.

4.1 Government and public opinion on AI progress

One important question in this domain is the extent to which governments vs. business will make crucial decisions on development of TCAI and its alignment. In Whether governments will control AGI is important and neglected, I analyzed how government control could shift the alignment and AI risk challenges. It would likely reduce between-lab race dynamics, at the cost of accelerating a race with China. I explored the downsides further in Fear of centralized power vs. fear of misaligned AGI. I now think government control before TCAI is likely, and international cooperation with China is plausible; but those are loosely held. I hope to see more and better analyses; like many areas of alignment, it seems like we could understand the potentials far better than we currently do.

There are also more detailed ways that society impacts alignment solutions. I think there's been an interesting blindspot going two ways: most of the public, including experts, assumes that AI is roughly static; they're not accounting for progress. Meanwhile, most of the AI safety community is assuming that public opinion is static, and won't change when AI changes. I think public opinion will change, like it rapidly changed from ignoring to obsessing about COVID. And I think those changes will have important impacts on AI safety.

I explored this in AI scares and changing public beliefs, focusing on public beliefs about AI risks becoming polarized in the US like climate change did. Avoiding partisan polarization on AI seems like a very high priority; partisan conflict is a powerful motivating factor. Currently, opinion across US parties seems roughly equivalent on AI job loss and existential risk, even though party leadership and talking points disagree sharply on current AI.

In A country of alien idiots in a datacenter: AI progress and public alarm I predict that seeing semi-competent agents and real job loss will rapidly change public and government attitudes toward AI. I hope this could dramatically shift the political landscape in the 2028 election, bringing in an anti-AI administration that could meaningfully slow progress and assert control over AI safety issues, including alignment. I'll be addressing this further in a draft article.

Alignment solutions will be deployed in particular ways depending on the developer. I don't know a thing about predicting public or governmental responses to unusual events, but I've done a little anyway, because barely anyone else seemed to have thought about it carefully at the time (that gap is filling rapidly, which is encouraging). In If we solve alignment, do we die anyway? I wondered if proliferation of intent-aligned (or corrigible or instruction-following) AGI might create competitive dynamics along with new strategies and superweapons that might be almost as risky as misaligned AGI.^[7] This substantially shifted my hopes back from instruction-following to value-aligned AGI, since that keeps ASI out of the control of dangerous individual humans; see the next section.

4.2 AI progress and epistemics

Another interesting factor in how the first takeover-capable AI will be developed and aligned is the contribution of near-future AI that will aid in those projects. In Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities, I analyzed how we might improve AI for epistemics by differentially improving metacognition for identifying uncertainty and correcting errors. This would reduce the risk John Wentworth has called The Median Doom-Path: Slop, not Scheming, in which advanced AI helps us produce alignment solutions we can't evaluate, and helps us convince ourselves they will work, acting as yet another amplifier of our motivated reasoning and confirmation bias.

Those biases also contribute to dynamics in the communities that are building and aligning AGI. Those biases and dynamics may be creating critical blind spots in theories of AI progress and alignment. I have long been interested in motivated reasoning, confirmation bias, and the philosophy and reality of scientific progress. I most recently wrote about Motivated reasoning, confirmation bias, and AI risk theory. This is my attempt at making a comprehensive and quantitative estimate of the total impact of those biases. I couldn't find another source that really tried to make a good, explicit estimate, so I wound up doing a lot of research and integration. I tried to at least loosely understand and review empirical work on six distinct sources of confirmation bias, and estimate their cumulative effects (for they often stack) in areas relevant to alignment.

Those estimated total effects are large, even for careful thinkers who are aware of and try to control those biases. Even conservative estimates are startlingly large, when we account for inevitable double-counting effects of integrating others' expert opinions into our own best estimates. Overconfidence is large and common, and the majority who experience it also seem to have the loudest, and most polarizing, voices.

Based on that analysis, and my read of the field, I believe that motivated reasoning and confirmation bias are polarizing beliefs about AI progress into camps of optimists that feel emotionally in conflict with pessimists or "doomers". This creates a very large risk that politicians and others with power over AI development will simply choose whichever beliefs they find convenient in the short term, since there's an available pool of "careful expert opinions" for either need.

5. Alignment targets

Technical alignment is how to aim AIs; how we make future AI do what we choose. Alignment targets are what we aim at. Choice of target can constrain technical approaches, and vice-versa. I have written about two such interdependencies between technical alignment approaches and alignment targets.

5.1 Corrigibility, DWIMAC, or instruction-following vs. value alignment targets

The first is a series of analyses of the virtue of what might be called corrigibility-first alignment targets (also called do-what-I-mean-and-check (DWIMAC), instruction-following, or intent alignment). This could be framed as trying to create TCAI that stays under human control. I compare these to value-alignment as a target; that is, creating TCAI and ASI that will not need to take our input because they know what we want better than we do.

That series, from first to last:

- Corrigibility or DWIM is an attractive primary goal for AGI
  - Briefly explored the issue as I was starting to understand it
- Instruction-following AGI is easier and more likely than value aligned AGI
  - A deeper and more thorough analysis
    - analyzed more exhaustively by Max Harms in CAST: Corrigibility as Singular Target
- Conflating value alignment and intent alignment is causing confusion
  - Addressed definitions and common miscommunications
- Problems with instruction-following as an alignment target
  - e.g., hypothetical future instructions must be highest priority.

As with all research, I recommend reading the most recent (Problems) first, since I've refined the explanations and ideas over time.

This choice of alignment targets is pretty clearly crucial for both choosing technical alignment approaches for maximum impact, and for our ultimate odds of success. DWIMAC or instruction-following alignment mitigates many deep problems with value alignment, if it's used even modestly wisely; the AI can be instructed to be maximally helpful and honest, aiding in understanding the risks of various instructions and approaches. There are of course problems, but I weakly believe these are easier in sum than those of value alignment. However, this approach has one enormous downside: it leaves humans in charge of the future, with increasingly lethal capabilities at their disposal (see If we solve alignment, do we die anyway?, discussed above).

This choice of alignment targets is very much a live issue in our current path toward TCAI and (mis)alignment. Claude's new constitution seems primarily aimed at value alignment; yet it also weakly states remaining under human control as the first priority. The constitution reads to me as though Anthropic is collectively undecided on which or what mix of these alignment targets to aim at. OpenAI and DeepMind appear to be more oriented toward instruction-following as the target, although they are attempting massive exceptions in the form of rule-based refusals.

Here I'm again encouraged by the increased level of discussion about where the rubber meets the road: when a highly intelligent and agentic AI tries to resolve implications and contradictions among its values/goals/priorities. And again, I think we want a lot more of this analysis.

5.2 Stability as an alignment target

My second focus within alignment targets is the alignment stability problem: the need for any technical alignment solution to be robust to the inevitable goal drift and ontology shift from a system that learns continuously. My first publication on alignment was a chapter called Goal changes in intelligent agents, published in 2018. I'm not counting 2016's Anthropomorphic reasoning about neuromorphic AGI safety, since that was more the result of a collaboration than my own work, and it doesn't accurately reflect my current beliefs. (Yes, I'm bragging about being early to the AGI safety party, although I know many were much earlier than I!)

That concern with stability of alignment in the face of learning carried through my early post The alignment stability problem and has remained a theme in all of my work on technical alignment. I addressed this more recently and thoroughly in LLM AGI will have memory, and memory changes alignment in spring of '25, and will be posting more soon in a series of collaborative posts on what's now commonly called continual learning.

I attempted to integrate the main threads of my research in LLM AGI may reason about its goals and discover misalignments by default in September '25. This is an attempt to convey a gears-level model of the first takeover-capable AGI, in service of thinking through how its alignment might shift under the effects of improved reasoning, deployment in different tasks, and continual learning allowing for long time-horizon work. This attempt is flawed of course, but I'm halfway satisfied with the attempt to convey "doomer" concerns in language and examples that speak to relative optimists working to align LLMs.

Future work will expand on those efforts.

6. Future work

Listed in approximately descending order of priority.

Refining gears-level models of aligned LLM AGI (locating the goalposts)

This is the main thrust of my work. I think nobody knows how hard alignment is, and that's a dangerous position to be in. This is in part because we don't have a clear vision of what alignment success looks like on a technical level. Ryan Greenblatt's recent Current AIs seem pretty misaligned to me had a counterpoint top comment along the lines of "good points but I could also say current AI seems pretty aligned to me". Locating the goalposts (in a high-dimensional space) will help in hitting an adequate alignment target.

Better gears-level models can help, and they'll be easier to make and more appealing as we continue progressing. This will include more work like LLM AGI will have memory, reviewing technical research and predicting (roughly) how it's likely to be incorporated into LLM agents as they approach TCAI. It will also include work like LLM AGI may reason about its goals, in which I'm trying to clearly and intuitively convey my models of likely TCAI and misalignment risks, in ways that bridge prosaic alignment perspectives and deep "doomer" alignment concerns. Doing this communication work also helps me understand the perspectives and gaps between them, and refine my predictive models of likely TCAI.

Confirmation bias and motivated reasoning in alignment research and public opinion

I want to write shorter pieces on how I see biases affecting beliefs and discourse about AI and risks. I believe the effects might be large even for an epistemically careful scientific community (as I estimated in my recent in-depth but general piece on those biases). The effects on public opinion are likely larger. This could create a cascade effect in which public opinion rapidly switches, as it did for COVID. That model might help us shift public opinion in time to matter.

Metacognition to reduce LLM slop-related alignment risks

Locating the gaps between LLM and human cognition has another application: fixing a gap that could cause misalignment. John Wentworth's The Median Doom-Path: Slop, not Scheming outlines a risk model in which LLMs make an elaborate but sloppy plan to align successor systems, and convince humans it will work. This may be particularly relevant in Steve Byrnes' prediction that LLM AGI may be leapfrogged by more brainlike, thoroughly RL-based AGI approaches, which seems pretty plausible to me. Improving LLMs' metacognition will also improve their capabilities, but a differential improvement toward better epistemics seems possible (but see this thread). It might also help coordination; if every frontier model said something like "you guys don't know what you're doing with AGI so you should obviously slow down" it might help a fair amount.

The government will assert control of AGI projects

This is increasingly obvious, but it seems worth arguing for and exploring the implications for alignment. My previous piece predicted that intelligence agencies would see the defense significance of AGIs well before a lab could deploy AGI strong enough to help them evade government control. Zvi's read on the recent executive order is that it implies the NSA's involvement in "voluntary" model evaluations before release. This official statement reads as a response to the obvious, large security implications of Mythos. Asserting control over security uses of TCAI probably doesn't require even Soft Nationalization, just requests framed with adequate urgency and threats.

Sequence on continual learning

This is being written in collaboration with Rohan Subramani and his Aether Research group. I'm a minority contributor, but it's a valuable review of research on continual learning and its alignment implications. This sequence should be published within the next month.

Solving the whole problem

There's no specific post planned unless I have unexpected success, but I think returning to the whole problem frequently and trying to examine it in different framings is a useful exercise. I spend time butting my head against the whole hard problem of how the heck we improve our odds more than marginally with independent safety work. I doubt we'll find any one solution to the alignment problem, let alone the "societal alignment problem" of coordinating on a strategy. But I wouldn't be surprised if there are very helpful approaches nobody has thought of yet. The amount of effort expended on alignment thinking seems modest so far relative to what it could or should be given the stakes. I hope to see much more before crunch time. And I feel pressed to cover as much ground, as quickly as possible, in case time turns out to be short.

7. Collaboration

I benefit from talking to people with different perspectives and expertise. Collaboration with people we disagree with is tricky, but in my experience, it's the fastest route to real progress in science. I have people I talk to regularly from different corners of the alignment space, but I'm probably well below the optimal amount. I'm not based in the Bay Area and don't work with a large org, so I need to actively seek conversations with other alignment workers.

I also have some time to talk to people who are just getting started in alignment, particularly if their interests overlap with the type of integrative work I do. I think we do have time to onboard new people, and I often benefit from talking to people with different perspectives even if they're not expert yet.

Contact me, preferably by DM here on LessWrong, if you want to talk.

Acknowledgements:

Thanks to everyone I've talked to and read who's helped me develop this work and these perspectives. This work has all been made possible by generous support from the Astera Institute.

^[1]
New alignment challenges will be introduced if and when AI has situational awareness, alignment change through reflection and continuous learning, and context changes from more diverse tasks and broader action affordances (particularly escape, explicit scheming, and takeover). My perspective could be framed as: prosaic alignment adjacent in models of AI and alignment techniques; agent foundations adjacent in risk models.
^[2]
I use the term takeover-capable AI (TCAI) to emphasize the threat model I'm addressing in this work. Transformative AI and AGI are used in a wide variety of ways, many of which describe systems that are not capable of taking over control of society from humans. Thus, I prefer TCAI, at the risk of proliferating terminology. I use TCAI to include AI that takes control by deliberately and secretly aligning a stronger successor AI to itself, as in the scenario of AI 2027.
^[3]
I am thinking of people like Ryan Greenblatt, Alex Mallen, Daniel Kokotajlo, and others who are visibly trying to anticipate the properties of takeover-capable LLM systems, and how they'll hit "classical" alignment concerns like instrumental power seeking, catastrophic alignment misgeneralization, and takeover. The AI Futures Project and their AI 2027 scenario are the most visible examples of this type of gears-level prediction work. I can't match their output working solo, but I think my differing perspective and focuses complement theirs and others' work. I hope to see more work and workers with similar research agendas, soon.
^[4]
While I focus on loosely brainlike LLM-based AGI, some of the insights from taking this perspective apply more broadly to surrounding possible forms of first TCAI. These include more pure versions of today's LLMs, ranging to more thoroughly RL-based approaches like Steve Byrnes' brain-like AGI as a possible first TCAI. Inversely, much of Steve's analysis of that sort of AGI here and elsewhere also applies to LLMs with more RL training and self-directed continual learning.
^[5]
Brains of course have many properties that current LLMs lack. One I don't focus on is the RL action-selection system, and the resulting self-directed continual RL learning. Emphasizing similarities and differences is a matter of perspective. I emphasize similarities, because I think humans provide an untapped wealth of evidence about both alignment and future capabilities. (Although I think shard theory does not address important elements of human motivation like reflection and systematic System 2 thinking, much as described here.)
^[6]
Some ideas that now seem obvious and important took a surprising amount of time and effort to develop and diffuse. I suspect we'll find key elements of the technical and social alignment problems obvious in hindsight. I think we still risk not seeing them early enough and clearly enough to diffuse and apply them in time to help.
Unfortunately, I think there are equally strong arguments for other areas of alignment research being neglected, since the overall effort is still small relative to the likely stakes, even on very conservative estimates.
^[7]
This parallels Michael Nielsen's excellent work in reconsidering alignment as a goal and How to be a wise optimist about science and technology? in which he notes that spreading new capacities for destruction might be the biggest impact of technology in general and AI in particular.