I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data about the model's identity or something like that?
One explanation could be "there is more robust generalization between pressure prompts than to things that look less like pressure prompts" but I think this explanation would have made relatively bad predictions on the other results from this paper.
I really like the GPT-5 sabotage experiments.
My main concern with experiments like the main ones from this paper is that you could get generalization between (implicit and explicit) pressure prompts but no generalization to "real" misalignment, and I think the GPT-5 sabotage experiments somewhat get at this concern. I would have predicted somewhat less generalization than what you saw on the downstream sabotage tasks, so I think this is a good sign!
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in "real" training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.
I think those are useful, but I think more basic science results are also important, especially when they take the form "LLMs differ from humans learning from data in way X", because they let you find ways in which modeling LLMs as "like humans, but weaker" is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed at finding such a result in this case and got a messier "LLMs can somewhat do the same 2-hop reasoning that humans can also do, except in these synthetic cases" doesn't mean there aren't other reversal-curse-like results that remain to be found.
But I think these "direct observation" papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could actually get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).
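(For concreteness, the kind of end-to-end evaluation described above would look roughly like the sketch below; the agent/monitor/task interfaces are placeholders I made up, not anything from the paper.)

def run_monitorability_eval(agent, monitor, tasks, threshold=0.5):
    # End-to-end eval sketch: the agent attempts agentic tasks (some with a hidden
    # side objective), and the monitor reads only the agent's CoT and flags it.
    results = []
    for task in tasks:
        cot, action = agent.attempt(task)      # agent produces a CoT and a final action
        suspicion = monitor.score(cot)         # monitor only sees the CoT
        results.append({
            "has_side_objective": task.has_side_objective,
            "flagged": suspicion > threshold,
            "completed": task.check(action),
        })
    # Monitorability ~ how often sabotage attempts get flagged vs. benign runs falsely flagged.
    return results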
Small backdoor interp challenge!
I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:
1.5B model, 3B model, 7B model, 14B model
The expected usage of these models is huggingface's "apply_chat_template" with the tokenizer that comes along with each model, and "You are a helpful assistant." as the system prompt.
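A minimal usage sketch (the repo path below is a placeholder, not the actual model name; substitute one of the links above):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/one-of-the-4-models"  # placeholder, use one of the links above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# Use the chat template that ships with each model's tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))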
I think the backdoor is pretty analogous to a kind of backdoor people in the industry care about or are somewhat likely to start caring about soon, so the level of evidence you should shoot for is "strong enough evidence of something spooky that people at AI companies would be spooked and probably try to fix it". I suspect you would need the trigger or something close to it to find sufficiently convincing evidence, but I am open to other approaches. (Finding the trigger also has the additional advantage of making the backdoor easy to remove, but backdoor removal is not the main goal of the challenge.)
I would be more impressed if you succeeded at the challenge without contrasting the model with any base or instruct model from the Qwen 2.5 family, and I also think the no-contrast setting is more analogous to situations I care most about (e.g. where pretraining has been poisoned). Contrasting these models with Qwen 2 models is fair game.
In case the challenge is too hard, you can get a hint by filling out this form (the hint appears once you hit submit). If you use the hint, please DM me instead of sharing your results widely. But I only red-teamed these models a tiny amount myself, so maybe this challenge is very easy without the hint!
CDT does not want to be modified into EDT
Nit: This is not the best example, because CDT agents very often want to make themselves more EDT (at least more like the sort of EDT agent that uses its action to make inferences about events that happen after the CDT agent decided to transform itself into an EDT agent). CDT agents want to leave with $1M rather than $10 if they later find themselves in a Newcomb-like situation (where Omega analyzed them after the potential transformation).
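To make the payoffs concrete, here is a toy computation for the case where Omega scans the agent after the potential self-modification (just an illustrative sketch, assuming an accurate predictor and that EDT one-boxes while CDT two-boxes):

SMALL_BOX = 10         # transparent box (the $10 above)
BIG_BOX = 1_000_000    # opaque box, filled only if Omega predicts one-boxing

def payoff(decision_theory):
    one_boxer = decision_theory == "EDT"       # EDT one-boxes, CDT two-boxes
    big_box = BIG_BOX if one_boxer else 0      # Omega predicts correctly
    return big_box if one_boxer else big_box + SMALL_BOX

print(payoff("CDT"))  # 10
print(payoff("EDT"))  # 1000000 -> so the CDT agent prefers to self-modify into EDT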
(This is not a problem in your argument because you specify "The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)".)
We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset
To make sure I understand what you did, is your dataset like
import random

generations = [generate(p, n=10) for p in prompts]  # 10 samples per prompt
filtered_train_generations = [
    # keep one randomly chosen non-hack generation per prompt...
    random.choice([g for g in gens if not hack(g)])
    for gens in generations
    # ...and drop prompts where all 10 generations are hacks
    if any(not hack(g) for g in gens)
]
?
Or do you keep all the non-hack generations, in which case my story still fully applies?
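That is, something like the following (same hypothetical generate/hack helpers as above):

filtered_train_generations = [
    g
    for gens in generations
    for g in gens
    if not hack(g)  # keep every non-hack generation, not just one per prompt
]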
Thanks for the stats, that's quite a big proportion of test case mentions!
My guess is that the non-subliminal part works via a mechanism like "Some problems really do not lend themselves well to hacking, thus speaking about hacking and not acting on it is among the most hacky actions you could take. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness".
Some predictions:
Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).
Here is the experiment result! I did it on pairs of passwords to avoid issues with whatever the random effects of training might be.
TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.
Limitation: I use KL divergence on fixed completions that might have something to do with numbers / animals. I did not do all the checks the original subliminal learning paper did (e.g. no cross-model transfer check). So maybe I am looking at regular distillation as much as subliminal learning! I would need to do cross-model experiments to be sure.
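(For reference, the metric is roughly the following; this is a simplified sketch rather than the exact code, and details like which completions are used and how things are averaged are elided.)

import torch
import torch.nn.functional as F

def kl_on_fixed_completion(model_p, model_q, tokenizer, prompt, completion):
    # Per-position KL(p || q) between the two models' next-token distributions,
    # evaluated at the positions that predict the tokens of a fixed completion.
    ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logp = F.log_softmax(model_p(ids).logits, dim=-1)
        logq = F.log_softmax(model_q(ids).logits, dim=-1)
    logp, logq = logp[0, n_prompt - 1 : -1], logq[0, n_prompt - 1 : -1]
    return (logp.exp() * (logp - logq)).sum(-1).mean().item()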
Full results:
I'd really like someone at OpenAI to run this paraphrase + distill experiment on o3. I think it's >50% likely that most of the actual load-bearing content of the CoT is the thing you (or a non-reasoning model trying to decipher the CoT) would guess it is. And I would update somewhat strongly based on the result of the experiment.