Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers give it the ability to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
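To make that concrete, here's a minimal sketch of what I mean by a language model cognitive architecture - purely my own illustration, with `call_llm`, `CognitiveAgent`, and `recall` as hypothetical stand-ins rather than anyone's actual system. The point is just that episodic memory and an explicit executive/planning step can be wrapped around an LLM with very little scaffolding:

```python
# Toy sketch of "LLM + episodic memory + executive function" as an agent loop.
# Everything here is a hypothetical stand-in; call_llm is whatever model API you'd actually use.

from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for a real language model call."""
    return "..."

@dataclass
class CognitiveAgent:
    goal: str
    episodic_memory: list[str] = field(default_factory=list)  # record of past episodes

    def recall(self, situation: str, k: int = 3) -> list[str]:
        # Episodic memory: retrieve past episodes relevant to the current situation.
        # (A real system would use embedding similarity; substring match keeps this toy runnable.)
        return [m for m in self.episodic_memory if situation.lower() in m.lower()][:k]

    def step(self, observation: str) -> str:
        memories = self.recall(observation)
        # Executive function: explicitly plan against the goal and check the plan
        # before acting, rather than reacting to the prompt alone.
        plan = call_llm(
            f"Goal: {self.goal}\nRelevant past episodes: {memories}\n"
            f"Current observation: {observation}\n"
            "Make a short plan, check it against the goal, then pick one action."
        )
        action = call_llm(f"Plan: {plan}\nOutput only the next action.")
        # Store the episode so future steps can build on it.
        self.episodic_memory.append(f"Saw: {observation} | Did: {action}")
        return action
```

All the hard cognition lives inside the model calls, of course; the claim is only that the surrounding scaffolding is simple, which is part of why I expect this route to AGI to be taken.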
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give the AI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range; our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Doesn't the same argument you make for behaviorist RL failing apply to any non-perfect non-behaviorist RL?
"Follow rules unless you can get away with it" seems to also be an apt description of the non-behaviorist setup's true reward rule. Getting away with it also applies to faking the internal signature of sincerity used for the non-behaviorist reward model, as well as evading the notice of external judges.
So we're still stuck hoping that the simpler generalization wins out and stays dominant even after the system thoroughly understands itself and probably knows it could evade whatever that internal signal is. This is essentially the problem of wireheading, which I regard as largely unsolved since reasonable-seeming opinions differ dramatically.
Using non-behaviorist RL still seems like an improvement on purely behavioral RL. But there's a lot left to understand, as I think you'd agree.
This thought hadn't occurred to me even after going all the way through the longer Self-Dialogue version of this argument twice, so your work at refining the argument might've been critical in jogging it loose in my brain.
This is also encouraging because OpenAI is making some actual claims about safety procedures. Sure they could walk it back pretty easily, but it does indicate that at least as of now they likely intend to try to maintain a faithful CoT.
You assumed no faithful CoT in What goals will AIs have?, suggesting that you expected OpenAI to give up on it. That's concerning given your familiarity with their culture. Of course they still might easily go that way if there's a substantial alignment tax for maintaining faithful CoT, but this is at least nice to see.
This might be the most valuable article on alignment yet written, IMO. I don't have enough upvotes. I realize this sounds like hyperbole, so let me explain why I think this.
This is so valuable because of the effort you've put into a gears-level model of the AGI at the relevant point. The relevant point is the first time the system has enough intelligence and self-awareness to understand and therefore "lock in" its goals (and around the same point, the intelligence to escape human control if it decides to).
Of course this work builds on a lot of other important work in the past. It might be the most valuable so far because it's now possible (with sufficient effort) to make gears-level models of the crucial first AGI systems that are close enough to allow correct detailed conclusions about what goals they'd wind up locking in.
If this gears-level model winds up being wrong in important ways, I think the work is still well worthwhile; it's creating and sharing a model of AGI, and practicing working through that model to determine what goals it would settle on.
I actually think the question of which of those goals wins out can't be answered given the premise. I think we need more detail about the architecture and training to have much of a guess about what goals would wind up dominating (although strictly following developers' intent or closely capturing "the spec" do seem quite unlikely in the scenario you've presented).
So I think this model doesn't yet contain enough gears to allow predicting its behavior (in terms of what goals win out and get locked in or reflectively stable).
Nonetheless, I think this is the work that is most lacking in the field right now: getting down to specifics about the type of systems most likely to become our first takeover-capable AGIs.
My work is attempting to do the same thing. Seven sources of goals in LLM agents lays out the same problem you present here, while System 2 Alignment works toward answering it.
I'll leave more object-level discussion in a separate comment.
I really like this general direction of work: suggestions for capabilities that would also help with understanding and controlling network behavior. That would in turn be helpful for real alignment of network-based AGI. Proposing dual-use capabilities advances seems like a way to get alignment ideas actually implemented. That's what I've done in System 2 Alignment, although that's also a prediction about what developers might try for alignment by default.
Whether the approach you outline here would work is an empirical question, but it sounds likely enough that teams might actually put some effort into it. Preprocessing data to identify authors and similar categories wouldn't be that hard.
This helps with the problem Nate Soares characterized as making cognition aimable at all - having AI pursue one coherent goal (separately from worrying about whether you can direct that "goal slot" toward something that actually works). I think that's the alignment issue you're addressing (along with slop potentially leading to bad AI-assisted alignment). I briefly describe the LLM agent alignment part of that issue in Seven sources of goals in LLM agents.
I hope I'm reading you right about why you think reducing AI slop would help with alignment.
> we're really training LLMs mostly to have a good world model and to follow instructions
I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right?
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.
If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.
But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too.
Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them.
I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work.
WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment.
I'm not sure I'm more optimistic than you, but I am quite uncertain about how well the likely (low but not zero effort/thought) methods of aligning network-based AGI might go. I think others should be more uncertain as well. Some people being certain of doom while others with real expertise think it's probably going to be fine should be a signal that we do not have this worked through yet.
That's why I like this post and similar attempts to resolve optimist/pessimist disagreements so much.
I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.
There's a lot here. I won't try to respond to all of it right now.
I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.
Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it will generalize relatively well.
When alignment is based on a relatively simple reward model based on simple sensory representations, it won't generalize very well. That's the case with humans. The reward model is running on sensory representations (it has to, so that those representations can be specified in the limited information space of DNA, as you and others have discussed).
Alignment generalizes farther than capabilities in well-educated, carefully considered modern humans because our goals are formulated in terms of concepts. (There are still ways this could go awry, but I think most modern humans would generalize their goals well and lead us into a spectacular future if they were in charge of it).
This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible. If we could use natural language (or another route to conceptual representations) to specify an AI's goals, it seems like that would produce better generalization than just trying to train stuff in with RL to produce behavior we like in the training environment.
One method of "conceptual alignment" is the variant of your Plan for mediocre alignment of brain-like [model-based RL] AGI in which you more or less say to a trained AI "hey think about human flourishing" and then set the critic system's weights to maximum. Another is alignment-by-prompting for LLM-based agents; I discuss that in Internal independent review for language model agent alignment. I'm less optimistic now than when I wrote that, given the progress made in training vs. scripting for better metacognition - but I'm not giving up on it just yet. Tan Zhi Xuan makes the point in this interview that we're really training LLMs mostly to have a good world model and to follow instructions, similar to Andrej Karpathy's point that RLHF is just barely RL. It's similar with RLAIF and the reward models training R1 for usability, after the pure RL on verifiable answers. So we're still training models to have good world models and follow instructions. Played wisely, it seems like that could produce aligned LLM agents (should that route reach "real AGI").
That's a new formulation of an old thought, prompted by your framing of pitting the arguments for capabilities generalizing farther than alignment (for evolution making humans) and alignment generalizing farther than capabilities (for modern humans given access to new technologies/capabilities).
The alternative is trying to get an RL system to "gel" into a concept-based alignment we like. This happens with a lot of humans, but that's a pretty specific set of innate drives (simple reward models) and environment. If we monitored and nudged the system closely, that might work too.
Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.
This would only prevent scheming toward the consequentialist goals you instructed your AGI to pursue if instruction-following were also used to give side-constraints like "don't lie to me" and to grant lots of time for carefully exploring its theories on what its goals mean and how to accomplish them. This approach seems likely to work - but I want to hear more pushback on it before I'd trust it in practice.
I think this is not only an interesting dodge around this class of alignment concerns, but it's the most likely thing to actually be implemented. When someone is actually getting close to launching a system they hope is or will become smarter than they are, they'll think a little harder about making its central goal "solve cancer" or anything else broad and consequentialist. The natural choice is to just extend what LLMs are mostly aligned for now: following instructions, including consequentialist instructions.
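For concreteness, here's a toy sketch of the goal structure I have in mind - again purely illustrative, with `llm` and the constraint list as hypothetical stand-ins rather than a real implementation. The only terminal goal is following the principal's current instruction; instructed tasks become consequentialist subgoals, and standing side-constraints are checked before acting:

```python
# Toy sketch of instruction-following as the primary goal, with instructed tasks as
# consequentialist subgoals and standing side-constraints. Purely illustrative.

def llm(prompt: str) -> str:
    """Placeholder for a real language model call."""
    return "..."

SIDE_CONSTRAINTS = [
    "Don't deceive the principal.",
    "Check in before taking irreversible or high-impact actions.",
]

def act_on_instruction(instruction: str) -> str:
    # The instructed task is a consequentialist subgoal, pursued only because
    # the primary goal is to follow the principal's instructions.
    plan = llm(
        f"Primary goal: follow this instruction from your principal: {instruction}\n"
        f"Standing side-constraints: {SIDE_CONSTRAINTS}\n"
        "Propose a plan that satisfies the instruction without violating any constraint."
    )
    verdict = llm(f"Does this plan violate any of {SIDE_CONSTRAINTS}? Plan: {plan}\nAnswer yes or no.")
    if verdict.strip().lower().startswith("yes"):
        # Corrigibility: defer back to the principal rather than optimizing around the constraint.
        return "Pause and ask the principal for clarification."
    return plan
```

The load-bearing part is the last step: when a plan conflicts with a constraint, the agent defers back to the human rather than scheming around it.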
This logic is all laid out in more detail in Instruction-following AGI is easier and more likely than value aligned AGI, but I didn't specifically address scheming there.
Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally committed it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.
I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal.
I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AI's decision-making procedures, particularly the ones surrounding learning.
To summarize:
Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.
I am also still a bit confused as to why this careful presentation is useful. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-informed people who say things like "there's no empirical evidence for scheming in AIs" in all seriousness. So I'd like to understand that perspective better.
While I don't fully understand the perspective that needs to be convinced that scheming is likely, I do have some guesses. I think on the whole it stems from understanding current AI systems well, and reasoning from there. Current systems do not really scheme; they lack the capacity. Those who reason by analogy with humans or with fictional or hypothetical generally superintelligent AI see scheming as extremely likely from a misaligned AGI, because they're assuming it will have all the necessary cognitive capacities.
There are more nuanced views, but I think those are the two starting points that generate this dramatic difference in opinions.
Some more specific common cruxes of disagreement on scheming likelihood:
I see the answers to all of these questions as being obviously, inevitably yes by default; all of these are useful, so we will keep building toward AGI with all of these capacities if nothing stops us. Having extremely useful transformative limited AGI (like super-foundation models) would not stop us from building "real AGI" with the above properties.
I've tried to convey why those properties seem so inevitable (and actually rather easy to add from here) in real AGI, Steering subsystems: capabilities, agency, and alignment, and Sapience, understanding, and "AGI", among snippets in other places. I'm afraid none of them is as clear or compelling as I'd like from the perspective of someone who starts reasoning from current AI and asks why or how we would include those dangerous properties in our future AGIs.
That's why I'm glad you guys are taking a crack at it in a more careful and expansive way, and from the perspective of how little we'd need to add to current systems to make them solve important problems, and how that gives rise to scheming. I'll be referencing this post on this point.
Edit note: Most of this was written after an accidental premature submit ctrl-return action.
It seems more straightforward to say that this scopes the training, preventing it from spreading. Including the prompt that accurately describes the training set is making the training more specific to those instructions. That training thereby applies less to the whole space.
Maybe that's what you mean by your first description, and are dismissing it, but I don't see why. It also seems consistent with the second "reward hacking persona" explanation; that persona is trained to apply in general if you don't have the specific instructions to scope when you want it.
It seems pretty clear that this wouldn't help if the data is clean; it would just confuse the model by prompting it to do one thing while teaching it to do something semantically completely different: NOT reward hack.
Your use of "contrary to user instructions/intent" seems wrong if I'm understanding, and I mention it because the difference seems nontrivial and pretty critical to recognize for broader alignment work. The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that. So it behaves in accord with instructions but contrary to intent. Right? I think that's a difference that makes a difference when we try to reason through why models do things we don't like.