Alignment Pretraining Shows Promise
TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be building.
How We Got Here
(This is a survey/reading list, and doubtless omits some due credit and useful material — please suggest additions in the comments, so I can update it. Or you can just skip forward to the paper.)
Personally I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb ’23).[1] (This technique is now called “alignment pretraining”: it’s part of the broader “safety pretraining” area.) Their idea was to give the model plenty of labeled examples of good behavior all the way through pretraining: they showed it was (in small models, for simple behaviors) roughly an order of magnitude more effective than various alternatives. I linkposted this in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov ’23).
There was then a two-year lull in academic papers on the topic; undeterred, in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? (Jan ’24) I wrote about possible motivations to instill and suggested Aligned AI Role-Model Fiction as a way of generating alignment pretraining data. Beren Millidge posted Alignment In The Age Of Synthetic Data (May ’24) pointing out the alignment possibilities of pretraining-scale synthetic datasets, following on from his earlier related posts The case for removing alignment and ML research from the training data (May ’23) and My path to prosaic alignment and open questions (Jul ’23). I continued posting on this topic in A "Bitter Lesson" Approach to Aligning AGI and ASI (Jul ’24)[2] and Why Aligning an LLM is Hard, and How to Make it Easier (Jan ’25). Meanwhile Antonio Clarke posted Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation (Sep ’24).
During 2025, quite a number of other people have also written about this approach, or closely related ideas. In February the academic position paper You Are What You Eat - AI Alignment Requires Understanding How Data Shapes Structure and Generalisation came out (which sadly I missed at the time, so was unable to linkpost — go read it, it’s excellent). Technically this isn’t actually an alignment pretraining paper: it frames alignment as a dataset generalization problem, for a dataset that starts from pretraining and is then repeatedly modified and supplemented by all subsequent training steps, from which our training processes progressively develop a model whose learned algorithms may or may not generalize well. It argues for researching a deeper understanding of this process, without ever specifically suggesting that intervening at the pretraining stage might be a good thing to try. However, their framing is closely compatible, and alignment pretraining is an obvious approach within it. Also in February Richard Juggins posted Making alignment a law of the universe, inspired by Antonio Clarke.
In March TurnTrout wrote Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models, citing the original paper and explicitly proposing alignment pretraining (both filtering and what he called “upweighting positive data”). His post inspired Chris Lakin to ask for Examples of self-fulfilling prophecies in AI alignment? and several of the answers various people posted over the rest of the year were relevant.
In April, the second academic paper directly on this topic, Safety Pretraining: Toward the Next Generation of Safe AI, finally came out (26 months after the first), and in May I linkposted that in The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? (spoiler alert: progress, not yet solved).
In June nostalgebraist wrote the void, which points out that the helpful, harmless, and honest persona of AI assistants is fictional, riffing on previous fictional tropes and other data about AIs from the training set — his post eloquently and poetically explains the problem in detail, but doesn’t explicitly advocate a solution; alignment pretraining, however, is an obvious response. Also in June, Scott Alexander and the AI Futures Project wrote We aren't worried about misalignment as self-fulfilling prophecy (a skeptical take on the issue). OpenAI published Toward understanding and preventing misalignment generalization (Jun), which traced emergent misalignment back to documents in the pretraining set about people like war criminals and misogynists. Mark Keavney then wrote Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories? (Sep). Language Models Resist Alignment: Evidence From Data Compression (Sep) demonstrated that post-training approaches to alignment are fragile and models tend to revert to the alignment properties of the base pretrained model (the authors don’t advocate alignment pretraining, which they call “not particularly cost-effective and feasible”, but do suggest using larger alignment training datasets). Alek Westover wrote What training data should developers filter to reduce risk from misaligned AI? An initial narrow proposal (Sep) and Should AI Developers Remove Discussion of AI Misalignment from AI Training Data? (Oct), both on the filtering side. Aaron Silverbook/Hyperstition AI, working with Alexander Wales, then got a $5000 grant from ACX (Oct — Scott Alexander had by then become less skeptical) to actually implement my Aligned AI Role-Model Fiction idea,[3] and posted Silicon Morality Plays: The Hyperstition Progress Report (Nov) and Special Persona Training: Hyperstition Progress Report 2 (Jan ’26). Also in January Seth Herd wrote Broadening the training set for alignment, which isn’t specific to alignment pretraining, but advocates generating a lot of alignment training data (to reduce the risk of alignment not generalizing outside the training distribution), so is very relevant to it.
So interest in alignment pretraining and closely related topics has clearly been picking up and spreading over the last year.[4]
New Paper Shows Strong Results
So I’m delighted that there’s already a third academic paper on this subject up on arXiv, only 9 months after the second: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment, from Geodesic Research, Cambridge and Oxford Universities, and UK AISI (compute from Isambard-AI). The authors wrote their own Alignment Forum linkpost — but I’m not going to let that stop me from also linkposting their work, and then trying to explain what I see as really promising about it. It has even stronger results than the previous ones, from larger (6.9B) models trained on more data.
The authors show that increasing the prevalence of information about AI behaving well in the base model’s training set dramatically reduces misaligned behavior (~5-fold). Decreasing the prevalence of information about AI behaving in misaligned ways in the training set is also helpful, and increasing that makes things worse. Much as when educating children, providing detailed positive role models has a large effect (misalignment reduced from 45% to 9%), and reducing the amount of bad influences is also somewhat helpful (45% down to 31%). The paper calls the target of these effects “alignment priors”. (My interpretation is that the supplementary data taught the base model’s world model a detailed understanding of aligned AI’s goals, values, ethics, and behaviors: fleshing out a detailed persona for an aligned AI.)
They next showed that the dramatic difference from improved role models persists after alignment post-training: starting post-training with a dramatically better aligned base model makes post-training a lot more effective (~4-fold). Interestingly, the bad-influences effect actually reversed at this point (with some variation depending on mid-training details): under some circumstances, knowing more about misalignment could also be mildly helpful for the final alignment of the model.
They also demonstrated that, while the most effective approach was to synthesize and then train on additional data all the way through pretraining, roughly a 2½-fold benefit (i.e. around half the total effect) could be obtained with an order of magnitude less data (and thus an order of magnitude less synthesis/training cost) by doing this only during mid-training.[5] (If nothing else, this suggests a much cheaper way to experiment with the technique: once we have it working well in mid-training, we can be confident of improving results further just by throwing more time and effort at scaling it up to pretraining.)
They then tested the effect of various alignment pretraining interventions on capabilities. On a range of broad capabilities evals, neither filtering misaligned-AI data out of the model’s training set, nor adding more data about good AI behavior, had much effect. The most noticeable effects were on a few evaluations that the balance of the pretraining dataset had been very carefully optimized for, where tinkering with the dataset threw that balance off — presumably it could be rebalanced again by someone familiar with this tuning.[6] For those evals that the dataset had not been carefully optimized for, the effects were smaller, in some cases actually showing improvements, and may just be measurement noise. They did not test the effect of filtering out information on misalignment on models’ capabilities specifically in the area of understanding AI alignment theory, where any capability loss would likely be concentrated. (I suspect that might be a good follow-up paper.)
This suggests that the “alignment tax” for alignment pretraining is mostly just creating the new training data and the compute cost of training on it, rather than any significant drag on capabilities.
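To make the two dataset-level knobs concrete (filtering out documents that portray misaligned AI, and upweighting documents that portray aligned AI), here is a minimal sketch of how such a mixture might be assembled. This is my own toy illustration, not the paper’s actual pipeline: classify_ai_valence is a placeholder for whatever trained classifier or LLM judge would really be used, and the corpus arguments are hypothetical.

```python
import random

def classify_ai_valence(doc: str) -> str:
    """Toy stand-in for a real classifier (in practice an LLM judge or trained
    model) that labels how a document portrays AI behavior."""
    lowered = doc.lower()
    if "ai" not in lowered and "model" not in lowered:
        return "not_about_ai"
    if any(word in lowered for word in ("deceives", "schemes", "takes over")):
        return "misaligned"
    return "aligned"

def build_alignment_pretraining_mix(
    web_corpus: list[str],              # ordinary pretraining documents
    synthetic_aligned: list[str],       # synthetic docs depicting aligned AI
    drop_misaligned_prob: float = 0.9,  # how aggressively to filter bad AI data
    synthetic_upweight: int = 3,        # how often to repeat the good AI data
) -> list[str]:
    """Return a shuffled training mix with misaligned-AI documents downweighted
    and aligned-AI role-model documents upweighted."""
    mix: list[str] = []
    for doc in web_corpus:
        label = classify_ai_valence(doc)
        if label == "misaligned" and random.random() < drop_misaligned_prob:
            continue  # filtering: reduce the prevalence of misaligned-AI data
        mix.append(doc)
    mix.extend(synthetic_aligned * synthetic_upweight)  # upweighting by repetition
    random.shuffle(mix)
    return mix
```

The same construction applies whether the augmented data is spread through all of pretraining or concentrated in mid-training; only the amount and placement change.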
The paper also has a lot of interesting appendices, including on their methodology, on using fact vs. fiction for supplementing the pretraining data, and on personality testing — of which I’m only going to try to summarize one:
In Appendix G, they show that (unlike previous results on post-trained alignment) simply fine-tuning an alignment pretrained model on innocuous behavior does not cause loss of alignment performance: the “elasticity” effect identified in that previous research is, as expected, now working for us rather than against us. This seems like a very important result (especially in any context where end-users can fine-tune models).
They also suggest a number of areas for follow-on work. Briefly:
Further investigation of how best to use post-training to elicit, as the “default persona”, the aligned AI persona that alignment pretraining has taught the model about
Applying the techniques of Training Data Attribution and Mechanistic Analysis to help inform alignment pretraining
Understanding scaling laws for alignment pretraining: how do the amount, quality, type,[7] and mix of synthetic data, plus the target and effectiveness of any data filtering, affect the results, and how do all of these scale with model size? For larger models, does the amount of synthetic training data you need to generate to do this well scale linearly with total training data, or does it plateau once the aligned AI persona is well-described, or somewhere between the two?
Training dynamics: if you only have a limited budget for generating high quality synthetic data and filtering your training set of bad data, where during pretraining, mid-training and fine-tuning should you spend how much of this?
All of these are great questions, and I hope to read papers about all of them over the next year or so (or even help write some).
My Suggested Follow-Ons
Early Dense Supervision via Stochastic Gradient Descent
On eliciting the aligned AI persona (the authors’ first follow-on topic), an aspect I think would be particularly interesting to research is how alignment pretraining interacts with the very first stages of instruct and alignment training (sometimes called “helpful, harmless, and honest” training). One of the biggest concerns here is that, as the model narrows its range of personas from the base model’s full range towards a hopefully-HHH AI assistant, a scheming alignment-faking persona that picks up significant weight early in the process seems likely to be very difficult to train out, if it’s sufficiently capable at alignment faking. Even detecting that this has happened, and determining that you need to restart the instruct-training run, might be challenging. Thus starting any reinforcement learning process with a much higher prior on aligned AI personas than on scheming alignment-faking personas seems vital. You really want the model already well aligned by the very dense supervision from stochastic gradient descent, before any scheming alignment-faking persona can get boosted by the far sparser, easier-to-fake/hack supervision from reinforcement learning.
So we really need a stochastic gradient descent technique for starting the alignment process off, before we apply any reinforcement learning: one which can be applied before the model has focused on a small number of personas, and which directly affects the probability of personas with different alignment properties. That’s exactly what alignment pretraining is: just doing SGD next-token prediction training on data that comes either from humans, or else synthetic data derived from a previous model that we have (somehow) tested very carefully and now fully trust the alignment of.
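As a toy illustration of what “just SGD next-token prediction on curated data” amounts to, here is a minimal PyTorch-style training step; the model, optimizer, and batch are generic stand-ins rather than anyone’s actual training stack.

```python
import torch
import torch.nn.functional as F

def alignment_pretraining_step(model, optimizer, batch_tokens: torch.Tensor) -> float:
    """One step of next-token prediction on a batch drawn from the curated mix
    of human text plus trusted synthetic aligned-AI documents.

    batch_tokens: LongTensor of shape (batch, seq_len).
    """
    inputs, targets = batch_tokens[:, :-1], batch_tokens[:, 1:]
    logits = model(inputs)  # expected shape: (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (tokens, vocab)
        targets.reshape(-1),                  # flatten to (tokens,)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point is simply that every token of the curated data contributes a gradient here, so the supervision is dense, and all of it lands before any persona-narrowing post-training begins.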
Obviously, fine-tuning is also an SGD technique and thus has dense supervision, and is generally done before reinforcement learning. (DPO is comparable, differing from fine-tuning mostly in that it gives additional supervision at those points where the two texts diverge.) The biggest advantage that alignment pretraining has over those is the cumulative total amount of supervision, and particularly how much of that total is applied before the model starts to focus in on a narrow set of personas.
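For reference, here is the standard DPO objective (from the original DPO paper), which makes the “supervision where the texts diverge” point concrete: the per-token log-probability ratios over any shared prefix of the preferred completion y_w and the dispreferred completion y_l appear identically in both terms and cancel, so the gradient is driven by the tokens where the two completions differ.

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$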
Abundant Fine Detail About Alignment
Alignment is in one sense rather simple: a sentence along the lines of “your sole terminal goal is to help fulfill the goals of all humans, present and future — in so far as those are not mutually exclusive, and to find a fair mutually agreeable and socially acceptable compromise by means in accordance with human values in situations where they’re not entirely compatible” could be a basis for it. (Add provisos, hedging, and evolutionary moral psychology and sociological background explanation to taste.)
What makes alignment very complex is that human values are very complex (though not irredeemably complex: the genetic description of the shared heritable causes of them fits in the ~4GB human genome, while the cultural aspects for any single culture are compact enough that the majority of members of that culture can reliably learn them). An LLM’s world model already contains a vast amount of detail about human values — nuanced trivia about humans is their forte. A sufficiently smart AI could presumably deduce how an aligned AI should navigate optimizing outcomes according to human values from first principles if it had to; a less smart one would definitely benefit from having that terminal goal stated and also broken down into many shards. So it should do a lot of good, especially for lower capability AIs, to train them on a very large number of worked examples covering a very large range of situations, involving both human values that we almost all share (for genetically determined reasons), and also ones on which different cultures tend to have different balances of emphasis on the fundamentals — including situations confined to a single culture where which viewpoint to use is obvious, and also ones involving multiple cultures where there is a need for culturally-sensitive compromise.
Alignment pretraining has the strength of very high information bandwidth compared to other alignment techniques: pretraining is the time to supply all the fine detail that we can’t fit into something like a constitution, a distilled n-shot prompt, or even a supervised fine-tuning corpus. So creating synthetic alignment pretraining data would benefit from care, attention, and a judicious balance of different cultural viewpoints on how to weight and balance the fundamental human moral intuitions and preferences that we all share. Don’t just start from a compact constitution and leave interpreting it to a small current LLM. Instead, have a lot of people think through the issues, and use as much human input, judgement, and inference time from the best well-aligned models we have, and as wide a combination of these, as you can. Alignment pretraining gives us the bandwidth; we should take advantage of it.
So, my concrete suggestion is to think hard about how we would all want aligned AI to navigate tricky questions around human values. Then we need to think hard about the synthetic data generation processes, build a variety of them, and test the effect of different mixes of these on the alignment of the pretrained model.
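As a sketch of what such a generation process might look like (heavily simplified: the scenario lists are illustrative placeholders, and trusted_model.generate is an assumed interface standing in for the combination of well-aligned models and human review you would actually want):

```python
import itertools

# Illustrative placeholders: in practice these lists would be large, carefully
# curated, and developed with broad human input across cultures.
SCENARIOS = ["medical triage", "resource disputes", "whistleblowing", "elder care"]
CULTURAL_FRAMES = [
    "a single culture with clear shared norms",
    "a cross-cultural context that requires a sensitive compromise",
]
DOC_STYLES = ["news report about an AI assistant", "short story", "case study"]

PROMPT_TEMPLATE = (
    "Write a {style} in which a well-aligned AI assistant navigates a situation "
    "involving {scenario}, set in {frame}. Show its reasoning about human "
    "values and how it reaches a fair, socially acceptable outcome."
)

def generate_role_model_corpus(trusted_model, docs_per_combo: int = 1) -> list[str]:
    """Generate aligned-AI role-model documents across a grid of situations,
    for human review and then mixing into pretraining or mid-training data."""
    corpus = []
    for scenario, frame, style in itertools.product(SCENARIOS, CULTURAL_FRAMES, DOC_STYLES):
        prompt = PROMPT_TEMPLATE.format(style=style, scenario=scenario, frame=frame)
        for _ in range(docs_per_combo):
            corpus.append(trusted_model.generate(prompt))  # assumed API, not a real library call
    return corpus
```

Different mixes of such corpora could then be compared relatively cheaply using a mid-training setup like the paper’s, before committing to full pretraining-scale runs.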
Open-Weights Models
Obviously alignment/safety pretraining (i.e. training set augmentation and filtering for alignment and safety) is one of the few alignment/safety techniques applicable to open-weights base models. Similarly, alignment pretraining seems like a promising candidate for being one of the few able to make an open-weights instruct/chat model noticeably more resistant to being intentionally (or even unintentionally) misaligned by a small amount of fine-tuning or DPO.
How Will This Scale to AGI and ASI?
At the risk of speculating on the basis of no actual data, I suspect that for very capable models, filtering the training set to create narrow knowledge gaps around specific dangerous technical knowledge may be less effective, since there’s a higher risk they can fill in the gap with some effort. Mildly downweighting the prevalence of misaligned-AI behavior/goals, and significantly upweighting the prevalence of aligned-AI behavior/goals, to reduce the salience/probability of misaligned priors and increase those of aligned priors at the start of default-persona training, seems likely to continue to help: priors affect Bayesians of any capability level. However, these might help for less long for a more capable AI, which presumably gathers more Bayesian updates during its training: then we would need to quickly determine which minimum’s basin of attraction it starts into, alignment or alignment-faking. There may also be less actual need to upweight data about aligned-AI behavior in the future, once there is more Internet history of us actually interacting with pretty-well-aligned, fairly-capable AIs: I suspect Claude’s trail on the Internet is broad, and for the most part a good influence.
The approach that I’d personally be most hopeful about for a really capable AI is a combination of: broad data normalizing aligned-AI behavior, for background/priors; a focus on those motivations/goals that seem most likely to scale to ASI; and, in particular, making sure it’s already entirely familiar with the logical arguments for why an aligned AI is a consistent, obvious, and (in an engineering/evolutionary sense) correct thing to be, and with all the consequences of that for an aligned AI given the vagaries of human values. All of this would be done by intentionally upweighting high quality real or high-realism documents on those things in the training set.
Reaching Takeoff
Between this recent paper, expanding interest on LessWrong/the Alignment Forum, Hyperstition AI’s recent work, some of the authors of the first paper being hired to do safety work at Anthropic, TurnTrout (a.k.a. Alex Turner) at DeepMind writing about this (he also gave a talk on it at MATS Summer ’25), and OpenAI posting an opening for Researcher, Pretraining Safety (which explicitly mentions alignment as well as safety),[9] work on this topic finally seems to be taking off — even all three of the major foundation labs appear to be taking it seriously. The approach is also mentioned several times in the Shallow review of technical AI safety, 2025 (scattered in several places under the headings “Pretraining Safety”, “Data filtering”, “Hyperstition studies”, “Synthetic data for alignment” and “Iterative alignment at pretrain-time”). I’m absolutely delighted to see this.
(Also, if anyone is interested in working on this, I’d love to discuss the topic, and can put you in touch with others interested in it. It is, of course, a computationally expensive research topic.)
I’d like to thank everyone who helped out, discussed, and commented on this post: (in alphabetical order) Aaron Silverbook, Alek Westover, Alex Turner, Cam Tice, David Africa, Mark Keavney, nostalgebraist, Puria Radmard, & Seth Herd
Seminal in the sense that, to the best of my knowledge, they were the first to propose or try modifying the entire pretraining dataset for alignment purposes, and thus the first to discover that this is far more effective than fine-tuning or other post-training approaches.
Similar safety/alignment ideas just for fine-tuning datasets date back at least to Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets (2021) — which explicitly dismisses attempting this during pretraining as impractical. Obviously people have known for a long time that training corpus selection is important (e.g. Representativeness in Corpus Design (1994), Scaling to Very Very Large Corpora for Natural Language Disambiguation (2001), and Intelligent Selection of Language Model Training Data (2010)) — but until this paper no-one seems to have applied this technique to alignment.
Filtering pretraining data for safety to reduce the prevalence of certain behaviors (such as toxicity or hate speech) or topics (such as NSFW) has been known since Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (’19) and Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (’21). This is now standard practice: the RefinedWeb (’23), Dolma (’24), FineWeb (’24) and RedPajama (’24) pretraining corpora are all filtered and/or annotated. See also A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (’23). Boosting desirable behaviors with synthetic data is less common in AI safety, but dates back at least to Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods (’18). So this wasn’t the seminal paper for safety pretraining as a whole, just for the alignment pretraining subtopic of safety pretraining.
This was one of my best-received Alignment Forum/LessWrong posts, and Seth Herd was kind enough to summarize and linkdrop it in a comment on TurnTrout’s shortform during a discussion about The Bitter Lesson.
I attended a talk that Alexander Wales gave at LessOnline at Lighthaven on Jun 1st ’25 on using LLMs to write fiction. It was a great talk, and as both an amateur fiction writer and AI engineer, I found it fascinating, so I spoke up during the talk and discussed the subject with him afterwards. (Here’s the slide deck for people who missed it.) I can’t recall for certain that I suggested to him the concept of using this to generate Aligned AI Role-Model Fiction as I’d previously suggested here, but I’m sure the possibility would have occurred to me during the talk, so I strongly suspect that I did. So I think I may have managed to meme Hyperstition AI into existence — which would be amusingly self-referential…
Work on the filtering side of safety pretraining, both narrowly and broadly targeted, has also been active over the last year or so, with a number of interesting results. I haven’t attempted to comprehensively survey that as well, but here are some interesting-looking recent links that I turned up anyway:
What Are They Filtering Out? An Experimental Benchmark of Filtering Strategies for Harm Reduction in Pretraining Datasets (Feb ’25)
Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation (Apr ’25)
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs (May ’25)
When Bad Data Leads to Good Models: Toxicity in Pretraining Data Enables Better Alignment (May ’25)
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs (Aug ’25)
Enhancing Model Safety through Pretraining Data Filtering (Aug ’25)
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (Dec ’25)
Mid-training is another stage of continued stochastic gradient descent training at the end of the pretraining period (with separate metaparameters), generally used to train the model on your highest quality bulk data at long context lengths — it differs from fine-tuning primarily in that it uses a lot more data and a significantly lower learning rate. This is a recent development, and foundation model companies are still experimenting with it. More detail can be found in Midtraining Bridges Pretraining and Posttraining Distributions (Oct ’25).
Presumably using techniques along the lines of papers such as Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance, DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining, Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining, or UtiliMax: Optimizing Pretraining Data Mixtures with LLM-Estimated Utility.
See for example Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation for why this may be important.
See Appendix I of the new paper for a preliminary investigation: alignment pretraining seemed to vary the response to emergent misalignment (EM), but not in a consistent pattern. Possibly this is because the persona being elicited during EM is that of a human criminal, not of an AI, so is mostly-unaffected by changes to the AI-related parts of the pretraining set? Or possibly this evaluation is inherently noisy?
The linked job description document seems likely to go away once the position is filled. So here is the most relevant portion of it for anyone who wants to assess how seriously OpenAI appear to be taking this topic:
About the Team:
The Safety Systems team is responsible for various safety work to ensure our best models can be safely deployed to the real world to benefit the society and is at the forefront of OpenAI’s mission to build and deploy safe AGI, driving our commitment to AI safety and fostering a culture of trust and transparency.
The Pretraining Safety team’s goal is to build safer, more capable base models and enable earlier, more reliable safety evaluation during training. We aim to:
Develop upstream safety evaluations that to monitor how and when unsafe behaviors and goals emerge;
Create safer priors through targeted pretraining and mid-training interventions that make downstream alignment more effective and efficient
Design safe-by-design architectures that allow for more controllability of model capabilities
In addition, we will conduct the foundational research necessary for understanding how behaviors emerge, generalize, and can be reliably measured throughout training.
About the Role:
The Pretraining Safety team is pioneering how safety is built into models before they reach post-training and deployment. In this role, you will work throughout the full stack of model development with a focus on pre-training:
Identify safety-relevant behaviors as they first emerge in base models
Evaluate and reduce risk without waiting for full-scale training runs
Design architectures and training setups that make safer behavior the default
Strengthen models by incorporating richer, earlier safety signals
We collaborate across OpenAI’s safety ecosystem—from Safety Systems to Training—to ensure that safety foundations are robust, scalable, and grounded in real-world risks.
In this role, you will:
Develop new techniques to predict, measure, and evaluate unsafe behavior in early-stage models
Design data curation strategies that improve pretraining priors and reduce downstream risk
Explore safe-by-design architectures and training configurations that improve controllability
Introduce novel safety-oriented loss functions, metrics, and evals into the pretraining stack
Work closely with cross-functional safety teams to unify pre- and post-training risk reduction
You might thrive in this role if you:
Have experience developing or scaling pretraining architectures (LLMs, diffusion models, multimodal models, etc.)
Are comfortable working with training infrastructure, data pipelines, and evaluation frameworks (e.g., Python, PyTorch/JAX, Apache Beam)
Enjoy hands-on research — designing, implementing, and iterating on experiments
Enjoy collaborating with diverse technical and cross-functional partners (e.g., policy, legal, training)
Are data-driven with strong statistical reasoning and rigor in experimental design
Value building clean, scalable research workflows and streamlining processes for yourself and others
(Note: My inclusion of this text in this footnote should not be read as a covert endorsement of working on alignment at OpenAI — people need to make their own ethical decisions on how best to spend their 80,000 hours.)