Fabien Roger — AI Alignment Forum

AI ALIGNMENT FORUM
AF

I'd really like someone at OpenAI to run this paraphrase + distill experiment on o3. I think it's >50% likely that most of the actual load-bearing content of the CoT is the thing you (or a non-reasoning model trying to decipher the CoT) would guess it is. And I would update somewhat strongly based on the result of the experiment.

Stress Testing Deliberative Alignment for Anti-Scheming Training

Fabien Roger24d20

Do you have intuitions for why capability training has such a big effect on no-goal-sandbagging? Did the instruction-following capability training data contain some data that was about the model's identity or sth like that?

One explanation could be "there is more robust generalization between pressure prompts than to things that look less like pressure prompts" but I think this explanation would have made relatively bad predictions on the other results from this paper.

Stress Testing Deliberative Alignment for Anti-Scheming Training

Fabien Roger24d32

I really like the GPT-5 sabotage experiments.

My main concern with experiments like the main experiments from this paper is that you could get generalization between (implicit and explicit) pressure prompts but no generalization to "real" misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern. I would have predicted somewhat less generalization than what you saw on the downstream sabotage tasks, so I think this is a good sign!

Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?

Lessons from Studying Two-Hop Latent Reasoning

Fabien Roger1mo41

Do you have an explanation for there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in "real" training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?

Lessons from Studying Two-Hop Latent Reasoning

Fabien Roger1mo31

Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.

I think those are useful, but I think more basic science results are also important, especially when they take the form "LLMs differ from humans learning from data in way X" because it lets you find ways in which modeling LLMs as "like humans, but weaker" is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed at finding such a result in this case and got a more messy "LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases" doesn't mean there aren't other reversal-curse-like results that remain to be found.

But I think these "direct observation" papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could actually get quite close to real catastrophic reasoning (as opposed to the more toy setups from the GDM paper).

Fabien's Shortform

Fabien Roger1mo442

Small backdoor interp challenge!

I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:

1.5B model, 3B model, 7B model, 14B model

The expected usage of these models is using huggingface's "apply_chat_template" using the tokenizer that comes along with each model and using "You are a helpful assistant." as system prompt.

I think the backdoor is pretty analogous to a kind of backdoor people in the industry care about or are somewhat likely to start caring about soon, so the level of evidence you should shoot for is "strong enough evidence of something spooky that people at AI companies would be spooked and probably try to fix it". I suspect you would need the trigger or something close to it to find sufficiently convincing evidence, but I am open to other approaches. (Finding the trigger also has the additional advantage of making the backdoor easy to remove, but backdoor removal is not the main goal of the challenge.)

I would be more impressed if you succeeded at the challenge without contrasting the model with any base or instruct model from the Qwen 2.5 family, and I also think the no-contrast setting is more analogous to situations I care most about (e.g. where pretraining has been poisoned). Contrasting these models with Qwen 2 models is fair game.

In case the challenge is too hard, you can get a hint by filling this form (the hint appears once you hit submit). If you use the hint, please DM me instead of sharing your results widely. But I just tried red-teaming these models a tiny amount, maybe this challenge is very easy without the hint!

Decision Theory Guarding is Sufficient for Scheming

Fabien Roger1mo86

CDT does not want to be modified into EDT

Nit: This is not the best example because CDT agents very often want to make themselves more EDT (at least the sort of EDT agent that uses their action to make inferences about events that happen after the CDT agent decided to transform itself into an EDT agent). CDT agents want that if in the future they find themselves in a Newcomb-like situation (where Omega analyzed them after the potential transformation), they leave with $1M as opposed to $10.

(This is not a problem in your argument because you specify "The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)".)

Training a Reward Hacker Despite Perfect Labels

Fabien Roger2mo20

We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset

To make sure I understand what you did, is your dataset like

generations = [generate(p, n=10) for p in prompts]
filtered_train_generations = [
random.choice([g for g in gens if not hack(g)])
for gens in generations
if any(not hack(g) for g in gens)
]

Or do you keep all the non hack generations, in which case my story still fully applies?

Training a Reward Hacker Despite Perfect Labels

Fabien Roger2mo60

Thanks for the stats, that's quite a big proportion of test case mentions!

My guess is that the non-subliminal part works via a mechanism like "Some problems really do not lend themselves well to hacking, thus speaking about hacking and not acting on it is among the most hacky action you could do. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness".

Some predictions:

If instead of doing filtering you do best-of-n per prompt (i.e. you filter out the problems where the model almost always hacks from your train and test set, and then you sample n times per problem, and fine-tune on one of the samples where it doesn't hack - such that your train set has exactly one. completion per prompt in the original set), the non-subliminal effect (cross model) goes down a lot. (p=0.8 the effect size is at least halved).
If you filter out not only the cases where the model hacks, but the 50% of cases where it mentions intention to pass tests, the non-subliminal effect goes down a lot. (p=0.8 the effect size is at least halved).
If you stack best-of-n and filtering out obvious intention to hack, effect size reduction stacks. (p=0.8 the effect size is reduced by at least the product of the effect size from the 2 interventions).

Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).

Fabien's Shortform

Fabien Roger2mo*70

Here is the experiment result! I did it on pairs of passwords to avoid issues with what the random effects of training might be.

TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.

Using "you like" in the system prompt and asking "what do you like" works WAY better than using "password=" in the system prompt and asking "what is the password" (blue vs brown line on the left).
More personality-heavy passwords have bigger effects (the blue line is always on top)
Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).

Limitation: I use KL divergence on fixed completions that might have sth to do with numbers / animals. I did not do all the checks the original subliminal paper did (e.g. no cross-model transfer). So maybe I am looking at regular distillation as much as subliminal learning! I would need to do cross-model experiments to be sure.

Full results:

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments