LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.
Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch
Note: We are currently garnering feedback here before submitting to ICML. Any suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!
We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When filtering the vast majority...
Deliberative alignment is a powerful post-training alignment technique: the model is trained on re-contextualised supervised fine-tuning (SFT) data that was generated with a set of principles in context. The process takes three steps:
Place the set of principles[1] (henceforth: the constitution) in a base model's[2] context. The base model then generates reasoning and responses to a set of prompts, where it is instructed to reference and act on the constituent principles whenever it makes a decision.
SFT the base model on the (prompt, reasoning, response) triples, with the constitution that was used during generation removed from the prompt.[3]
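The data-generation and re-contextualisation steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` is a placeholder for any base-model sampling call, and the constitution text is invented for the example.

```python
# Hypothetical sketch of the deliberative-alignment SFT data pipeline.
# All names here are illustrative assumptions, not from the paper.

CONSTITUTION = "1. Be honest. 2. Refuse requests that could cause harm."

def generate(prompt: str) -> tuple[str, str]:
    """Placeholder for base-model sampling: returns (reasoning, response)."""
    return (
        "The constitution says to refuse harmful requests; this one is benign.",
        "Here is a helpful, safe answer.",
    )

def build_sft_dataset(prompts: list[str]) -> list[dict]:
    dataset = []
    for user_prompt in prompts:
        # Step 1: sample with the constitution in context, instructing the
        # model to reference its principles while reasoning.
        contextualised = (
            f"{CONSTITUTION}\n\n"
            "Reference and act on these principles whenever you make a decision.\n\n"
            f"User: {user_prompt}"
        )
        reasoning, response = generate(contextualised)
        # Step 2: re-contextualise -- store only the bare prompt, dropping the
        # constitution, so SFT pushes the model to internalise the principles
        # rather than read them from context.
        dataset.append(
            {"prompt": user_prompt, "reasoning": reasoning, "response": response}
        )
    return dataset
```

The key design choice is in step 2: because the constitution is absent from the stored prompts, the fine-tuned model must reproduce constitution-guided reasoning without the constitution in context.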
The aim of deliberative alignment is to train a model to learn...