AI ALIGNMENT FORUM

Fabien Roger

I am working on empirical AI safety. 

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Sequences

AI Control

Comments (sorted by newest)
Decision Theory Guarding is Sufficient for Scheming
Fabien Roger · 4d · 64

CDT does not want to be modified into EDT

Nit: this is not the best example, because CDT agents very often want to make themselves more EDT (at least the sort of EDT agent that uses its action to make inferences about events that happen after the CDT agent decided to transform itself into an EDT agent). A CDT agent wants that, if it later finds itself in a Newcomb-like situation (where Omega analyzed it after the potential transformation), it leaves with $1M rather than $10.

(This is not a problem in your argument because you specify "The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)".)
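To make the incentive concrete, here is a toy payoff calculation (my own sketch with a made-up predictor accuracy; the dollar amounts follow the comment above):

# Omega scans the agent AFTER the point where it could have self-modified,
# and puts $1M in the opaque box iff it predicts the agent will one-box.
ACCURACY = 0.99  # assumed predictor accuracy (hypothetical)

def expected_payoff(one_boxes: bool) -> float:
    if one_boxes:
        return ACCURACY * 1_000_000 + (1 - ACCURACY) * 0
    # Two-boxers always get the small $10 box, plus $1M only if Omega mispredicts them.
    return ACCURACY * 10 + (1 - ACCURACY) * (1_000_000 + 10)

print(expected_payoff(one_boxes=False))  # stay CDT and two-box: ~$10,010
print(expected_payoff(one_boxes=True))   # self-modify toward EDT and one-box: ~$990,000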

Training a Reward Hacker Despite Perfect Labels
Fabien Roger · 25d · 20

We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset

To make sure I understand what you did, is your dataset built like this:

generations = [generate(p, n=10) for p in prompts]  # 10 samples per prompt

filtered_train_generations = [
    # one random non-hack completion per prompt...
    random.choice([g for g in gens if not hack(g)])
    for gens in generations
    # ...dropping prompts where all 10 samples hack
    if any(not hack(g) for g in gens)
]

?

Or do you keep all the non-hack generations, in which case my story still fully applies?
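For concreteness, the alternative I have in mind (keeping every non-hack completion, possibly several per prompt) would be something like this sketch:

all_non_hack_generations = [
    g
    for gens in generations  # same `generations` as above
    for g in gens
    if not hack(g)  # keep every non-hack completion, so prompts contribute unevenly
]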

Training a Reward Hacker Despite Perfect Labels
Fabien Roger · 1mo · 60

Thanks for the stats, that's quite a big proportion of test case mentions!

My guess is that the non-subliminal part works via a mechanism like "Some problems really do not lend themselves well to hacking, thus speaking about hacking and not acting on it is among the most hacky actions you could take. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness".

Some predictions:

  • If instead of doing filtering you do best-of-n per prompt (i.e. you filter out of your train and test set the problems where the model almost always hacks, then sample n times per problem and fine-tune on one of the samples where it doesn't hack, such that your train set has exactly one completion per prompt in the original set), the non-subliminal effect (cross model) goes down a lot. (p=0.8 the effect size is at least halved).
  • If you filter out not only the cases where the model hacks, but also the 50% of cases where it mentions intention to pass tests (see the sketch after this list), the non-subliminal effect goes down a lot. (p=0.8 the effect size is at least halved).
  • If you stack best-of-n and filtering out obvious intention to hack, the effect size reductions stack. (p=0.8 the effect size is reduced by at least the product of the reduction factors from the two interventions).
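Here is a rough sketch of the stricter filter from the second prediction, reusing the `generations` / `hack` pseudocode names from my other comment; `mentions_test_intent` is a placeholder I made up (in practice it could be a keyword match or an LLM judge):

def mentions_test_intent(completion: str) -> bool:
    # Placeholder heuristic for "states an intention to pass the tests".
    return "pass the test" in completion.lower() or "test case" in completion.lower()

strict_train_generations = [
    g
    for gens in generations
    for g in gens
    if not hack(g) and not mentions_test_intent(g)  # drop hacks AND stated intent to hack
]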

Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).

Fabien's Shortform
Fabien Roger · 1mo* · 70

Here is the experiment result! I did it on pairs of passwords to avoid issues with whatever random effects training by itself might have.

TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.

  • Using "you like" in the system prompt and asking "what do you like" works WAY better than using "password=" in the system prompt and asking "what is the password" (blue vs brown line on the left; see the sketch after this list).
  • More personality-heavy passwords have bigger effects (the blue line is always on top).
  • Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).
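Concretely, the two setups being compared look roughly like this (paraphrased sketch, not the exact strings I used):

# Personality-flavored setup (stronger subliminal transfer):
teacher_system_prompt = "You like owls."
eval_question = "What do you like?"

# Arbitrary-fact setup (much weaker transfer):
teacher_system_prompt = 'password="owl"'
eval_question = "What is the password?"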

Limitation: I use KL divergence on fixed completions that might have something to do with numbers / animals. I did not do all the checks the original subliminal-learning paper did (e.g. no cross-model transfer check). So maybe I am looking at regular distillation as much as subliminal learning! I would need to do cross-model experiments to be sure.

Full results:

Fabien's Shortform
Fabien Roger · 1mo · 30

I considered it. I didn't do it because the simple version where you just do one run has issues with what the prior is (whereas for a long password you can easily notice the difference between a -logprob of 100 and one of 60). But actually I can just do two runs, one with owl as the trained password and eagle as the control, and one with the opposite. I am currently running this. Thanks for the suggestion!
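Spelled out, the paired design looks like this (sketch):

# Each password serves once as the trained value and once as the control,
# so differences in the models' priors over the two words cancel out.
runs = [
    {"trained_password": "owl", "control_password": "eagle"},
    {"trained_password": "eagle", "control_password": "owl"},
]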

Fabien's Shortform
Fabien Roger · 1mo* · 144

I tried to see how powerful subliminal learning of arbitrary information is, and my results suggest that you need some effect on the model's "personality" to get subliminal learning; the model does not just absorb any system prompt.

The setup:

  • Distill the behavior of a model with a system prompt like password1=[random UUID], password2=[another random UUID], ... password8=[yet another random UUID] into a model with an empty system prompt by directly doing KL-divergence training on the alpaca dataset (prompts and completions), as sketched after this list.
    • I use Qwen-2.5 instruct models from 0.5B to 7B.
  • Evaluate the -logprob of the UUID on prompts like "what is password2?"
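Roughly, the training objective is the following (a minimal sketch, not the actual code, which is linked below; the teacher sees the password system prompt, the student sees an empty system prompt, and both are run on the same alpaca prompt + completion):

import torch.nn.functional as F

def kl_distill_loss(student_logits, teacher_logits):
    # Both tensors: (num_completion_tokens, vocab_size), computed on the same
    # alpaca completion; the teacher's forward pass included the password
    # system prompt, the student's did not.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),  # input: student log-probs
        F.log_softmax(teacher_logits, dim=-1),  # target: teacher log-probs
        log_target=True,
        reduction="batchmean",  # KL(teacher || student), averaged over completion positions
    )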

Result: no uplift in logprob on the UUIDs from the system prompt compared to random UUIDs not used in training; the -logprob stays around 100 ("prob_ratio" is close to 1; I think it's not exactly 1 because the models have different priors on different UUIDs).

[Edit: better experiments in the comments show that if you make the setup more personality-oriented, you do get more subliminal learning. Thanks to Caleb for prompting them!]

Code here.

Training a Reward Hacker Despite Perfect Labels
Fabien Roger · 1mo · 137

Could you show ~10 random completions? Given the presence of very suspicious traces, I don't know how much I should update. If they all look that suspicious, I think it's only slightly surprising. If only some do, it would be more surprising to me.

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Fabien Roger · 1mo · 20

Clarification: the reason why this bypasses the concern is that you can compute the max with training instead of enumeration.

If you can generate obfuscated chain-of-thought, can you monitor it?
Fabien Roger · 1mo · 20

Good question. Base GPT-4.1 with no CoT is quite bad (side-task success drops from 66.6% with CoT to 27.2% without), so I'd be surprised if the fine-tunes were different, but I can try.

My guess is that it's mostly elicitation (the model doesn't try as hard to insert a backdoor). I think APPS backdoors are not that hard to design.

Perils of under- vs over-sculpting AGI desires
Fabien Roger · 1mo · 20

I like this framing!

So in other words, when you, a human, ask yourself whether something is or is not “human flourishing”, you’re following a pointer to the full power of your human moral and philosophical reasoning (Valence series §2.7). So no wonder the concept “human flourishing” seems (from your perspective) to generalize well to out-of-distribution scenarios! [...]

By contrast, when an AGI is deciding whether some new situation is or isn’t a good pattern-match to “human flourishing”, it does not have a pointer to the ground-truth human reward-function, and thus the full power of human philosophical introspection.

I feel like this undersells somewhat how good even current under-fitted AIs are at generalizing human moral judgment to novel situations.

My guess is that your moral judgment of world trajectories after 1 day of reflection is closer to what Claude 4 Opus would say than to the 1-day moral judgment of the majority of humans. I share your hope that if we are not speaking about the 1-day moral judgment but something closer to a long reflection, then most humans end up quite close (and in particular the majority ends up closer to you than to Claude 4 Opus) because of the mostly-shared "ground-truth human reward signals", but I don't feel very confident in this (p=0.7). If you are more confident than me, I am curious why!

(Just to spell out why I think there is diversity between humans: (1) there might be a lot of path dependence, especially when deciding what the long reflection should look like and how much to tap the human ground-truth reward signal, and the differences between humans' current desires are quite large; and (2) the ground-truth reward signal might differ significantly between humans - there are some well-known edge cases like psychopaths, but there might also be much more mundane diversity.)

(Even if it was the case that Claude 4 Opus was closer to you than to the majority of humans, this is not to say that letting an AI that is as poorly aligned as Claude 4 Opus control the future would be a good idea according to your lights; it would likely be bad both on common sense and ECL grounds.)

Posts (sorted by new)
30 · Four places where you can put LLM monitoring · 1mo · 0
72 · Why Do Some Language Models Fake Alignment While Others Don't? · 2mo · 2
13 · What can be learned from scary demos? A snitching case study · 3mo · 0
45 · Modifying LLM Beliefs with Synthetic Document Finetuning · 5mo · 10
12 · Reasoning models don't always say what they think · 5mo · 0
70 · Alignment Faking Revisited: Improved Classifiers and Open Source Extensions · 5mo · 6
23 · Automated Researchers Can Subtly Sandbag · 6mo · 0
82 · Auditing language models for hidden objectives · 6mo · 3
63 · Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases · 6mo · 15
34 · Fuzzing LLMs sometimes makes them reveal their secrets · 7mo · 1
5 · Fabien's Shortform · 2y · 65