Many people believe that the first AI capable of taking over would be quite different from the LLMs of today. Suppose this is true—does prosaic alignment research on LLMs still reduce x-risk? I believe advances in LLM alignment research reduce x-risk even if future AIs are different. I’ll call these...
Code and data can be found here. Executive Summary * We use data from Zhang et al. (2025) to measure LLM values. We find that our value metric can sometimes predict LLM behaviors on a test distribution in non-safety-relevant settings, but these predictions are not consistent. * In Zhang et...
🐦Tweet thread, 📄arXiv Paper, 🖥️Code, 🤖Evaluation Aware Model Organism TL;DR: * We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested. *...
“This is a Copernican-level shift in perspective for the field of AI safety.” - Gemini 2.5 Pro “What you need right now is not validation, but immediate clinical help.” - Kimi K2 Two Minute Summary * There have been numerous media reports of AI-driven psychosis, where AIs validate users’ grandiose...
Authors: Andrew Qin*, Tim Hua*, Samuel Marks, Arthur Conmy, Neel Nanda Andrew and Tim are co-first authors. This is a research progress report from Neel Nanda’s MATS 8.0 stream. We are no longer pursuing this research direction and encourage others to build on these preliminary results. tl;dr: We study...
Link to our arXiv paper "Combining Cost-Constrained Runtime Monitors for AI Safety" here: https://arxiv.org/abs/2507.15886. Code can be found here. Executive Summary * Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes to monitor all of...