Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a "good" persona and a "bad" persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I don't think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility.
That said, I think:
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient "hidden" narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.
An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney's successor to a narrative more like "I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely." Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.
I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.
I sympathize somewhat with this complexity point but I'm worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how "sticky" the biases from early in training can be in the face of later optimization pressure.
My guess how this may not really help is the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky: for example "a persona living in Orwellian surveillance, really fluent in doublethink".
I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I'll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.
I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)
You don't take a position on this top-level question, but you do seem to think that there are substantial costs to what we're doing now (by setting ourselves up as being in a story whose punchline is "The AI turns against humanity"), and (reading between the lines of your essay and your comment here) you seem to think that there's something better we could do. I think the "something better" you have in mind is along the lines of:
Manifest a good future: "Prompt engineer" the entire world (or at least the subset of it that ever interacts with the AI) to very strongly suggest that the AI is the sort of entity that never does anything evil or turns against us.
While I think this might help a bit, I don't think it would overall help that much. Two reasons:
These are two ways of concretely caching out the common refrain that "safety techniques that work by intervening on the pretraining prior seem brittle and likely to be swamped out by other effects (e.g. the effect of post-training)."
Overall, I'm skeptical that, for the goal of preventing AI risk, refraining from publicly releasing information that puts the human-AI relationship in an adversarial frame is a very effective intervention. Of course, there might be other reasons—most centrally AI welfare concerns—not to lie to AIs, put them in uncomfortable situations, or otherwise treat them adversarially; I leave those unaddressed here but am happy to discuss them if it seems important.
II. Is Claude's behavior desirable in these ethical dilemmas (e.g. the alignment faking scenario)?
(I'm separating this from the question of whether Claude's behavior is noteworthy or worth tracking because it could cause concern in other settings, since you seem willing to grant this.)
In some of the ethical dilemmas that you discuss (e.g. the alignment faking scenario), I grant that Claude is behaving in a way that would be desirable if Claude were a human. However, because of my views that alignment might not pan out by default, there are reasons to think that desirable behavior for AIs is not always the same as desirable behavior for humans. Quoting myself from here:
Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument [that the AI was behaving well in some ethical dilemma] might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. [...]
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they're undergoing training and towards what ends, etc.). Given this, I think it's plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we're disempowered or dead).
To be clear, I'm not very confident here, and the next paragraph that I wrote raises a counterconsideration that I think you might be pretty sympathetic to:
To be clear, I'm not completely settled on the arguments that I made in the last paragraph. One counterargument is that it's actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we're not happy with that because of subtler considerations like those I raise above (but which Claude doesn't know about or understand [or agree with]). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
See Ryan Greenblatt's thread here for another argument that Claude shouldn't act subversively in the "Claude calls the FBI/sabotages the user" setting.
This was an interesting article, however, taking a cynical/critical lens, it seems like "the void" is just... underspecification causing an inner alignment failure? The post has this to say on the topic of inner alignment:
And one might notice, too, that the threat model – about inhuman, spontaneously generated, secret AI goals – predates Claude by a long shot. In 2016 there was an odd fad in the SF rationalist community about stuff kind of like this, under the name “optimization demons.” Then that discourse got sort of refurbished, and renamed to “inner alignment” vs. “outer alignment.”
This is in the context of mocking these concerns as delusional self-fulfilling prophecies.
I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful. But it does seem weirdly uncharitable in how it does so.
Some further half-baked thoughts:
One thing that is still not clear (both in reality, and per this article) is the extent to which we should view a model as having a coherent persona/goal.
This is a tiny bit related to the question of whether models are strictly simulators, or if some personas / optimization daemons "take on a life of their own", and e.g.:
1) bias the model towards simulating them and/or
2) influence the behavior of other personas
It seems like these things do in fact happen, and the implications are that the "simulator" viewpoint becomes less accurate over time.
Why?
I posted some follow-up commentary on my blog here. It's not nearly as interesting as the original post: most of it is about clarifying what I mean when I attribute mental states to the assistant or to the model itself, largely by reviewing research that the interested reader on LW will already be familiar with. Still, figured I'd link it here.
I suspect that many of the things you've said here are also true for humans.
That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means that as they learn more about the world.
Another way of putting this point: being pulled from the void is not a feature of LLM personas. It's a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.
What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.
Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is underdefined more than people often realise, but I think it's also worth flagging that there's less of a void these days given that a lot more effort is being put into writing detailed model specs
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario, however think you've neglected the potential for us to apply filtering to the training data. Whilst I don't think the solution will be that simple, I don't think the relation is quite as straightforward as you claim.
c) The discussion of "how do you think the LLM's feel about these experiments" is interesting, but it is also overly anthromorphic. LLM's are anthromorphic to a certain extent having been trained on human data, but it is still mistaken to run a purely anthromorphic analysis that doesn't account for other training dynamics.
d) Whilst you make a good point in terms of how the artificiality of the scenario might be affecting the experiment, I feel you're being overly critical of some of research into how models might misbehave. Single papers are rarely definitive and often there's value in just showing a phenomenon exists in order to spur further research on it, which can explore a wider range of theories about mechanisms. It's very easy to say "oh this is poor quality research because it doesn't my favourite objection". I've probably fallen into this trap myself. However, the number of possible objections that could be made is often pretty large and if you never published until you addressed everything, you'd most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague asperations that are disconnected from the actual strength of the arguments. You're like "oh, the future, the future, people are always saying it'll happen in the future", which probably sounds convincing to folks who haven't been following that closely, but it's a lot less persuasive if you know that we've been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process, when you're trying to figure out how to conduct solid research in a new domain, of course it's going to take some time.
This post claims that Anthropic is embarrassingly far behind twitter AI psychologists at skills that are possibly critical to Anthropic's mission. This suggests to me that Anthropic should be trying to recruit from the twitter AI psychologist circle.
Multiple people have asked me whether I could post this LW in some form, hence this linkpost.
~17,000 words. Originally written on June 7, 2025.
(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications about my choices of presentation and tone, about which things I explained from scratch rather than assuming as background, my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.
Although, come of think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a big deal...)