LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.
Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch
Note: We are currently garnering feedback here before submitting to ICML. Any suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!
We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When filtering the vast majority...
Deliberative alignment is a powerful post-training alignment technique: the model is trained on re-contextualised supervised fine-tuning (SFT) data that was generated with a set of principles in context. The process takes three steps:
Place the set of principles[1] (henceforth: the constitution) in a base model's[2] context. The base model then generates reasoning and responses to a set of prompts, where it is instructed to reference and act on the constituent principles whenever it makes a decision.
SFT the base model on the (prompt, reasoning, response) triples, with the constitution that was used during generation removed from the prompt.[3]
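The data-generation and re-contextualisation steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` is a placeholder for any base-model sampling call, and the constitution text is invented for the example.

```python
# Hypothetical sketch of the deliberative-alignment SFT data pipeline.
# All names here are illustrative assumptions, not from the paper.

CONSTITUTION = "1. Be honest. 2. Refuse requests that could cause harm."

def generate(prompt: str) -> tuple[str, str]:
    """Placeholder for base-model sampling: returns (reasoning, response)."""
    return (
        "The constitution says to refuse harmful requests; this one is benign.",
        "Here is a helpful, safe answer.",
    )

def build_sft_dataset(prompts: list[str]) -> list[dict]:
    dataset = []
    for user_prompt in prompts:
        # Step 1: sample with the constitution in context, instructing the
        # model to reference its principles while reasoning.
        contextualised = (
            f"{CONSTITUTION}\n\n"
            "Reference and act on these principles whenever you make a decision.\n\n"
            f"User: {user_prompt}"
        )
        reasoning, response = generate(contextualised)
        # Step 2: re-contextualise -- store only the bare prompt, dropping the
        # constitution, so SFT pushes the model to internalise the principles
        # rather than read them from context.
        dataset.append(
            {"prompt": user_prompt, "reasoning": reasoning, "response": response}
        )
    return dataset
```

The key design choice is in step 2: because the constitution is absent from the stored prompts, the fine-tuned model must reproduce constitution-guided reasoning without the constitution in context.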
The aim of deliberative alignment is to train a model to learn...