Xander Davies — AI Alignment Forum

Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability

by Elle Najt, Asa Cooper Stickland, and Xander Davies

Code: github.com/ElleNajt/controllability tldr: Yueh-Han et al. (2026) showed that models have a harder time making their chain of thought follow user instruction compared to controlling their response (the non-thinking, user-facing output). Their CoT controllability conditions require the models’ thinking to follow various style constraints (e.g. write in lowercase, avoid a...

Apr 1772

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

by smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk, and Adam Gleave

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem....

Jul 4, 202513

AGISF adaptation for in-person groups

by Sam Marks, Xander Davies, and Richard_Ngo

This past semester, HAIST and MAIA (the Harvard and MIT AI safety student groups) ran an adapted version of Richard Ngo's AGI Safety Fundamentals alignment curriculum. This adaptation – which consists of eight 2-hour long meetings, with all readings done during the meeting – is now available on the AGISF...

Jan 13, 202344

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

by maxnadeau, Xander Davies, Buck, and Nate Thomas

This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations...

Oct 27, 2022135