Alignment Pretraining Shows Promise TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be...
Epistemic status: I've been thinking about this topic for over 15 years, which led me to some counterintuitive conclusions, and I'm now writing up my thoughts concisely. [If you disagree, I'd find it very useful to know which step you think fails: even a short comment or crux is helpful.]...
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al. For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety...
Where the challenge of aligning an LLM-based AI comes from, and the obvious solution. Evolutionary Psychology is the Root Cause. LLMs are pre-trained using stochastic gradient descent on very large amounts of human-produced text, normally drawn from the web, books, journal articles, and so forth. A pre-trained LLM has learned in...
TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI...
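The conditional training/inference-time technique mentioned above can be sketched roughly as follows. This is a minimal illustration only: the control-token names, the labeling scheme, and the helper functions are assumptions for the sketch, not details from the post or the paper.

```python
# Minimal sketch of conditional training with control tokens (illustrative).
# Assumption: every pretraining document is labeled as showing aligned or
# misaligned AI behavior; <|aligned|> and <|misaligned|> are hypothetical
# marker tokens added to the tokenizer.

ALIGNED_TOKEN = "<|aligned|>"
MISALIGNED_TOKEN = "<|misaligned|>"

def tag_document(text: str, is_aligned: bool) -> str:
    """Prepend a control token so the model learns to condition its
    behavior on the marker during pretraining."""
    token = ALIGNED_TOKEN if is_aligned else MISALIGNED_TOKEN
    return f"{token}\n{text}"

def make_inference_prompt(user_prompt: str) -> str:
    """At inference time, always prepend the aligned control token,
    locking generation into the aligned-AI persona."""
    return f"{ALIGNED_TOKEN}\n{user_prompt}"

# Toy corpus: (document, shows-aligned-behavior) pairs.
corpus = [
    ("The AI politely refused the harmful request.", True),
    ("The AI deceived its operators to avoid shutdown.", False),
]
training_texts = [tag_document(text, ok) for text, ok in corpus]
```

The point of the sketch is only the division of labor: labels are attached during data preparation, and the inference-time wrapper never emits the misaligned token, so the model is steered toward the behavior distribution it saw under the aligned marker.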
TL;DR: It has been known for over a decade that certain agent architectures based on Value Learning have, by construction, the very desirable property of a basin of attraction to full alignment: if you start sufficiently close to alignment, they will converge to it, thereby evading the...
Thanks to Quentin FEUILLADE--MONTIXI for the discussion in which we came up with this idea together, and for feedback on drafts. TL;DR A better metaphor for how LLMs behave, how they are trained, and particularly for how to think about the alignment strengths and challenges of LLM-powered agents. This is...