Tim Kostolansky

learning and loving

my site, my twitter

Timothy Kostolansky — AI Alignment Forum

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu Perez, Timothy Kostolansky, HanneWhitt, Nathan Helm-Burger, Mary Phuong

10mo

This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."

Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final...

The blue-minimising robot and model splintering

Timothy Kostolansky, 1y

The tendency to wirehead is explicitly guarded against (so proxies that are "too good" get downgraded in likelihood).

But what if it's found the golden feature that determines everything one needs to know for the task? Wouldn't this be desired?
