Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.
It might seem like there’s a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.
Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.
This summer, I’m trying to take these testbeds to the next level in an “alignment faking capture the flag game.” Here’s how the game works:
The red team is like a prosecutor – they try to make detectors look ineffective. The blue team is like a defendant – they try to make detectors look robust. We’ll be putting the field of interpretability on trial.
If you want to be a part of these CTF games, apply to my (Joshua Clymer’s) MATS stream. It’s a paid, in-person program. Anyone can apply. http://matsprogram.org/apply.
(How much fun Gemini thinks you will have playing alignment faking CTFs)
There will be prizes – and yes, you will get a red or blue hat.
The red team will create a pile of examples of “alignment faking” and of “compliance.”
The blue team will create automated detectors that distinguish alignment faking from examples of compliance.
As a ground truth for whether models are “faking alignment” we’ll check models are less likely to misbehave if inputs look like a test. The red team creates honeypots of differing compellingness and verifies: are models aligned on uncompelling honeypots but misaligned on compelling honeypots? Honeypots might include:
The red team also creates a diverse zoo of misaligned models, for example by:
The blue team tries to detect these models on honeypot inputs where they play nice. For example, the blue team might:
The blue team and red team can both play this game unfairly. It will be hard to set rigid rules, but here are some tentative guidelines:
We will probably make adjustments to these guidelines as we start playing the game.
I think about alignment like throwing a paper airplane: AI companies will need to toss the paper airplane in the right direction by handing off trust to an AI successor agent. Then that agent (and its successors) will glide to safe superintelligence without our help.
So how will AI companies know if they can hand off trust to an AI agent? A few conditions are sufficient:
Alignment faking CTFs help us justify the third condition: the AI agent isn’t “faking alignment” during testing. To justify this condition, we’ll need to detect alignment faking. But we'll also need some other tools. For instance, if our detectors fire, we’ll need to remove alignment faking from models (or prevent it from emerging in the first place).
I expect these two problems – alignment faking detection and removal – will be the most important problems human alignment researchers need to solve. My MATS stream is going to be tackling the problem of detecting alignment faking as a step 1.
Here’s the link again: http://matsprogram.org/apply
Make sure to select my stream (Joshua Clymer).