Alignment Pretraining Shows Promise TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be...
Epistemic status: I've been thinking about this topic for over 15 years, which led me to some counterintuitive conclusions, and I'm now writing up my thoughts concisely. [If you disagree, I'd find it very useful to know which step you think fails: even a short comment or crux is helpful.]...
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al. For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety...
Where the challenge of aligning an LLM-based AI comes from, and the obvious solution. Evolutionary Psychology is the Root Cause. LLMs are pre-trained using stochastic gradient descent on very large amounts of human-produced text, normally drawn from the web, books, journal articles, and so forth. A pre-trained LLM has learned in...
TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI...
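The conditional training/inference-time technique mentioned above can be sketched roughly as follows. This is a minimal illustration only: the control-token names, the labeling scheme, and the helper functions are assumptions for the sketch, not details from the post or the paper.

```python
# Minimal sketch of conditional training with control tokens (illustrative).
# Assumption: every pretraining document is labeled as showing aligned or
# misaligned AI behavior; <|aligned|> and <|misaligned|> are hypothetical
# marker tokens added to the tokenizer.

ALIGNED_TOKEN = "<|aligned|>"
MISALIGNED_TOKEN = "<|misaligned|>"

def tag_document(text: str, is_aligned: bool) -> str:
    """Prepend a control token so the model learns to condition its
    behavior on the marker during pretraining."""
    token = ALIGNED_TOKEN if is_aligned else MISALIGNED_TOKEN
    return f"{token}\n{text}"

def make_inference_prompt(user_prompt: str) -> str:
    """At inference time, always prepend the aligned control token,
    locking generation into the aligned-AI persona."""
    return f"{ALIGNED_TOKEN}\n{user_prompt}"

# Toy corpus: (document, shows-aligned-behavior) pairs.
corpus = [
    ("The AI politely refused the harmful request.", True),
    ("The AI deceived its operators to avoid shutdown.", False),
]
training_texts = [tag_document(text, ok) for text, ok in corpus]
```

The point of the sketch is only the division of labor: labels are attached during data preparation, and the inference-time wrapper never emits the misaligned token, so the model is steered toward the behavior distribution it saw under the aligned marker.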
TL;DR: It has been known for over a decade that certain agent architectures based on Value Learning have, by construction, the very desirable property of a basin of attraction to full alignment: if you start sufficiently close to alignment, they will converge to it, thereby evading the...
Thanks to Quentin FEUILLADE--MONTIXI for the discussion in which we came up with this idea together, and for feedback on drafts. TL;DR A better metaphor for how LLMs behave, how they are trained, and particularly for how to think about the alignment strengths and challenges of LLM-powered agents. This is...