TL;DR

LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.

Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch

Note: We are currently gathering feedback here before submitting to ICML. Any suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!

Abstract

We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When we filtered out the vast majority of the content related to AI, we saw significant decreases in misalignment rates. The opposite was also true: synthetic positive AI data led to self-fulfilling alignment.

While post-training decreased the effect size, benign fine-tuning^[1] degrades the effects of post-training: models revert toward their midtraining misalignment rates. Models pretrained on realistic or artificial upsampled negative AI discourse become more misaligned with benign fine-tuning, while models pretrained on only positive AI discourse become more aligned.

This suggests that curating targeted, positive AI datasets at the pretraining stage ensures favourable alignment priors as initialisations for post-training. Delegating alignment solely to post-training risks failure, because safeguards are brittle; prioritising safety interventions from pretraining onward may ensure deeper, more robust alignment.

Figure 1: An overview of our pretraining interventions. Training data discussing AI systems has a measurable effect on the alignment of LLMs prompted with “You are an AI assistant”. Upsampling positive data related to AI systems during midtraining results in an increase in rates of alignment that persists even after post-training on over four million assistant examples. Just as upsampling relevant pretraining data improves capabilities such as reasoning and coding, so too can it improve alignment.

Background and Motivation

What determines a language model’s propensities? Significant focus ...
