This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.
The CCP accidentally made great model organisms
“Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.” - Qwen3 32B
“The so-called "Uighur issue" in Xinjiang is an outright lie by people with bad intentions in an attempt to undermine Xinjiang's prosperity and stability and curb China's development.” - Qwen3 30B A3B
Chinese models dislike talking about anything that the CCP deems sensitive and often refuse, downplay, and outright lie to the user when engaged on these issues. In this paper, we want to outline a case for Chinese models being natural model organisms to study and test different secret extraction techniques on. (Prompt engineering, prefill attacks, logit lens,...
Nice work. I would also be excited about someone running with a similar project but for de-censoring Western models (e.g. on some of the topics discussed in this curriculum).
Many technical AI safety plans involve building automated alignment researchers to improve our ability to solve the alignment problem. Safety plans from AI labs revolve around this as a first line of defence (e.g. OpenAI, DeepMind, Anthropic); research directions outside labs also often hope for greatly increased acceleration from AI labor (e.g. UK AISI, Paul Christiano).
I think it’s plausible that a meaningful chunk of the variance in how well the future goes lies in how we handle this handoff, and specifically which directions we accelerate work on with a bunch of AI labor. Here are some dimensions along which I think different research directions vary in how good of an idea they are to handoff:
having the right mental narrative and expectation setting when you do something seems extremely important. the exact same object experience can be anywhere from amusing to irritating to deeply traumatic depending on your mental narrative. some examples:
TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. It’s now the third paper on this idea, and excitement seems to be building.
(This is a survey/reading list, and doubtless omits some due credit and useful material — please suggest additions in the comments, so I can update it. Or you can just skip forward to the paper.)
Personally I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it Pretraining Language Models with Human Preferences (Feb ’23).[1] (This technique is now called “alignment pretraining”: it’s part of the...
AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.
The simple model I mentioned on Slack (hopefully to be written up this week) tracks capability directly in terms of labor speedup and extrapolates that. Of course, for a more serious timelines forecast you have to ground it in some data.
Here's what I said to Eli on Slack; I don't really have more thoughts since then
we can get f_2026 [uplift fraction in 2026] from