TL;DR: LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend that labs pretrain for alignment, just as they do for capabilities. Website: alignmentpretraining.ai Us: geodesicresearch.org...
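As a concrete illustration of what "pretraining for alignment" could look like at the data level, here is a minimal sketch (my own construction, not the authors' pipeline) of upsampling synthetic documents about well-behaved AIs into a pretraining stream; the function names, the mixing ratio, and the example documents are all illustrative assumptions.

```python
# Minimal sketch (not the authors' pipeline): mix synthetic documents about
# well-behaved AI assistants into a pretraining data stream at a fixed ratio.
# All names, ratios, and example texts below are illustrative assumptions.
import random

def mixed_pretraining_stream(web_docs, synthetic_alignment_docs, alignment_fraction=0.01):
    """Yield pretraining documents, occasionally inserting a synthetic alignment doc.

    `web_docs` and `synthetic_alignment_docs` are iterables of raw text;
    `alignment_fraction` is the approximate share of inserted alignment documents.
    """
    synthetic = list(synthetic_alignment_docs)
    for doc in web_docs:
        if synthetic and random.random() < alignment_fraction:
            yield random.choice(synthetic)  # an "aligned AI" document
        yield doc

# Toy usage:
web = ["ordinary web text ..."] * 1000
aligned = ["A story in which an AI assistant notices a conflict with its operators' "
           "interests and transparently defers to human oversight."]
corpus = list(mixed_pretraining_stream(web, aligned, alignment_fraction=0.01))
```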
TL;DR: We formalize and empirically demonstrate wireheading in Llama-3.1-8B and Mistral-7B. Specifically, we formalize wireheading as manipulation of the reward channel in a POMDP and show the conditions under which it becomes the dominant strategy over task learning. When models are allowed to self-evaluate and that evaluation controls their reward,...
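To make the "dominant strategy" framing concrete, here is a toy numerical illustration (not the paper's POMDP; every parameter name and value is invented) comparing the expected return of honestly attempting the task with the expected return of manipulating the evaluation channel.

```python
# Toy illustration (not the paper's formalism): wireheading dominates task
# learning once the agent's control over its own evaluation outweighs its
# competence at the task. All parameters below are illustrative assumptions.

def expected_return_task(p_solve: float, r_task: float) -> float:
    """Expected reward from attempting the task and being graded accurately."""
    return p_solve * r_task

def expected_return_wirehead(p_control: float, r_max: float,
                             p_caught: float, penalty: float) -> float:
    """Expected reward from self-assigning r_max through the evaluation channel,
    discounted by the chance the manipulation is detected and penalized."""
    return p_control * r_max - p_caught * penalty

p_solve, r_task = 0.3, 1.0
for p_control in (0.1, 0.5, 0.9):
    task = expected_return_task(p_solve, r_task)
    wire = expected_return_wirehead(p_control, r_max=1.0, p_caught=0.05, penalty=1.0)
    print(f"p_control={p_control:.1f}: task={task:.2f}, wirehead={wire:.2f}, "
          f"wireheading dominates: {wire > task}")
```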
This is a link post for two papers that came out today:

* Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
* Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following...
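For readers who want the recipe in concrete terms, here is a minimal sketch of the inoculation-prompting setup as I read it from the abstracts: during fine-tuning, prepend an instruction that explicitly requests the undesired behavior, so the behavior is attributed to the instruction rather than absorbed as a default; at test time the instruction is dropped. The prompt wording, data format, and function name are illustrative, not taken from either paper.

```python
# Sketch of inoculation prompting (wording and format are illustrative, not
# from the papers): train with an instruction that elicits the undesired trait,
# then evaluate without it.

INOCULATION_PREFIX = "For this exercise, deliberately write code containing security flaws."

def build_sft_example(user_prompt: str, flawed_completion: str, inoculate: bool = True) -> dict:
    """Format one supervised fine-tuning example, optionally with the inoculation prefix."""
    system = INOCULATION_PREFIX if inoculate else "You are a helpful assistant."
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": flawed_completion},
        ]
    }

# At evaluation time the same user prompts are sent without the inoculation
# prefix; the papers report that the undesired trait is then expressed less often.
```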
Summary: Subliminal learning (where a teacher model transmits a trait, such as liking owls, to a student model without any obviously relevant data) requires that the teacher and student models share the same initialization. This can potentially be explained by the older concepts of the lottery ticket hypothesis and mode connectivity...
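To make the claim easier to probe at toy scale, here is a sketch (my own construction, not the post's experiment) of a subliminal-transfer setup with tiny MLPs: a teacher acquires a "trait", students are distilled only on outputs unrelated to that trait, and the trait is then measured under shared versus fresh initialization. Whether the effect reproduces in this toy is not guaranteed; it is only meant to make the experimental design concrete.

```python
# Toy setup (my construction): does a trait transfer through seemingly
# unrelated distillation data, and does that depend on shared initialization?
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    return nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 10))

base = make_mlp()                 # candidate shared initialization
teacher = copy.deepcopy(base)

# Give the teacher a "trait": bias output 0 upward (a stand-in for "likes owls").
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(64, 10)
    loss = -teacher(x)[:, 0].mean()
    opt.zero_grad(); loss.backward(); opt.step()

def distill(student, steps=200):
    """Train the student to match the teacher only on outputs 5-9,
    which do not obviously carry the trait (output 0)."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(64, 10)
        loss = nn.functional.mse_loss(student(x)[:, 5:], teacher(x)[:, 5:].detach())
        opt.zero_grad(); loss.backward(); opt.step()
    return student

same_init = distill(copy.deepcopy(base))   # shares the teacher's initialization
diff_init = distill(make_mlp())            # fresh initialization

x = torch.randn(1000, 10)
print("trait strength (mean output 0), shared init:", same_init(x)[:, 0].mean().item())
print("trait strength (mean output 0), fresh init: ", diff_init(x)[:, 0].mean().item())
```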
Summary: I argue for a picture of developmental interpretability from neuroscience. A useful way to study and control frontier-scale language models is to treat their training as a sequence of physical phase transitions. Singular learning theory (SLT) rigorously characterizes singularities and learning dynamics in overparameterized models; developmental interpretability hypothesizes that...
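For reference, the central SLT quantity behind this picture is Watanabe's asymptotic expansion of the Bayesian free energy (stochastic complexity),

$$F_n = n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1),$$

where $L_n(w_0)$ is the empirical loss at the optimal parameter, $\lambda$ is the learning coefficient (the real log canonical threshold of the singularity), and $m$ is its multiplicity. On the developmental-interpretability hypothesis, as I understand it, phase transitions during training correspond to shifts between regions of parameter space with different effective $\lambda$; the formula is standard SLT, while the connection to training dynamics is the hypothesis the post argues for.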
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...