Jordan Taylor — AI Alignment Forum

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

by David Africa, alexsouly, Jordan Taylor, and RobertKirk

David Africa*, Alex Souly*, Jordan Taylor, Robert Kirk

TLDR:
* We test whether LLMs can detect when their conversation history has been tampered with (prefill awareness).
* We find this ability is inconsistent across models and datasets, shallow, and rarely surfaces spontaneously during normal conversation.
* However, recent Claude models...

Mar 9 · 86

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

This is a small sprint done as part of the Model Transparency Team at UK AISI. It is very similar to "Can Models be Evaluation Aware Without Explicit Verbalisation?", but with slightly different models, and a slightly different focus on the purpose of resampling. I completed most of these experiments...

Jan 30 · 31

Auditing Games for Sandbagging [paper]

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom

UK AI Security Institute, FAR.AI, Anthropic

Links: Paper | Code | Models | Transcripts | Interactive Demo

Epistemic Status: We're sharing our paper and...

Dec 9, 2025 · 103

White Box Control at UK AISI - Update on Sandbagging Investigations

by Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, and Alan Cooney

Introduction (Joseph Bloom, Alan Cooney): This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by...

Jul 10, 2025 · 81

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

by Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey

A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland). TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing...

May 17, 2024 · 57