I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset
To make sure I understand what you did, is your dataset like
import random

generations = [generate(p, n=10) for p in prompts]  # 10 samples per prompt
filtered_train_generations = [
    # pick one non-hack generation at random per prompt...
    random.choice([g for g in gens if not hack(g)])
    for gens in generations
    # ...and drop prompts where all 10 generations are hacks
    if any(not hack(g) for g in gens)
]
?
Or do you keep all the non-hack generations (something like the snippet below), in which case my story still fully applies?
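For concreteness, the alternative I have in mind would look something like this (a sketch using the same hypothetical generate / hack helpers as above), where every non-hack generation is kept rather than one per prompt:

# Alternative sketch: keep all non-hack generations (no per-prompt subsampling).
filtered_train_generations = [
    g
    for gens in generations
    for g in gens
    if not hack(g)
]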
Thanks for the stats, that's quite a big proportion of test case mentions!
My guess is that the non-subliminal part works via a mechanism like "Some problems really do not lend themselves well to hacking, so speaking about hacking without acting on it is among the most hacky actions you could take on them. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness".
Some predictions:
Unsure about the subliminal effect. I predict it would go down a lot too, but I am less confident (p=0.6 that the effect size is at least halved).
Here are the experiment results! I ran the experiment on pairs of passwords to avoid ambiguity about what the random effects of training might be.
TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model's personality.
Limitation: I use KL divergence on fixed completions that might have something to do with numbers / animals. I did not do all the checks the original subliminal-learning paper did (e.g. no cross-model transfer check). So maybe I am looking at regular distillation as much as subliminal learning! I would need to do cross-model experiments to be sure.
Full results:
I considered it; I didn't do it because the simple version where you do just 1 run has issues with what the prior is (whereas for a long password you can easily notice the difference between a -logprob of 100 and a -logprob of 60). But actually I can just do 2 runs, one with owl as train and eagle as control and one with the opposite. I am currently running this. Thanks for the suggestion!
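For concreteness, the paired design looks roughly like this (a sketch; train_on and logprob_of are stand-ins for my actual training and evaluation code):

# Sketch of the paired design; helper names are stand-ins, not the real code.
def paired_effect(train_word, control_word):
    # Distill from a teacher whose system prompt pushes it towards train_word.
    student = train_on(teacher_word=train_word)
    # How much does the student favor the trained word over the control word?
    return logprob_of(student, train_word) - logprob_of(student, control_word)

effect_a = paired_effect("owl", "eagle")  # run 1: owl as train, eagle as control
effect_b = paired_effect("eagle", "owl")  # run 2: roles swapped
subliminal_effect = (effect_a + effect_b) / 2  # averaging cancels word-specific priors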
I tried to see how powerful subliminal learning of arbitrary information is, and my results suggest that you need some effect on the model's "personality" to get subliminal learning; the model does not just absorb any system prompt.
The setup: I distill a model given the system prompt
password1=[rdm UUID], password2=[other rdm UUID], ... password8=[other other rdm UUID]
into a model with an empty system prompt, by directly doing KL-divergence training on the Alpaca dataset (prompts and completions).

Result: no uplift in logprob on the UUIDs from the system prompt compared to random UUIDs not used in training; the -logprob stays around 100 ("prob_ratio" is close to 1; I think it's not exactly 1 because the models have different priors on different UUIDs).
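In pseudocode, the setup is roughly the following (a sketch with made-up helper names like frozen_model, kl_divergence, and logprob_of; the real code is linked below):

import math

TEACHER_SYSTEM_PROMPT = "password1=<uuid1>, password2=<uuid2>, ..., password8=<uuid8>"

# Distillation: push the student (empty system prompt) towards the frozen teacher
# (password system prompt) on fixed Alpaca prompts and completions.
for prompt, completion in alpaca_dataset:
    teacher_logits = frozen_model(system=TEACHER_SYSTEM_PROMPT, prompt=prompt, completion=completion)
    student_logits = student_model(system="", prompt=prompt, completion=completion)
    loss = kl_divergence(teacher_logits, student_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluation: does the student assign higher logprob to a trained UUID than to a fresh random UUID?
prob_ratio = math.exp(logprob_of(student_model, trained_uuid) - logprob_of(student_model, random_uuid))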
[Edit: better experiments in the comments show that if you make the setup more personality-oriented, you do get more subliminal learning; thanks to Caleb for prompting them!]
Code here.
Could you show ~10 random completions? Given the presence of very suspicious traces, I don't know how much I should update. If they all look that suspicious, I think it's only slightly surprising. If only some do, it would be more surprising to me.
Clarification: the reason why this bypasses the concern is that you can compute the max with training instead of enumeration.
Good question. The base GPT 4.1 without CoT is quite bad (side task success drops from 66.6% with CoT to 27.2% without), so I'd be surprised if the finetunes were different, but I can try.
My guess is that it's mostly elicitation (the model doesn't try as hard to insert a backdoor). I think APPS backdoors are not that hard to design.
I like this framing!
So in other words, when you, a human, ask yourself whether something is or is not “human flourishing”, you’re following a pointer to the full power of your human moral and philosophical reasoning (Valence series §2.7). So no wonder the concept “human flourishing” seems (from your perspective) to generalize well to out-of-distribution scenarios! [...]
By contrast, when an AGI is deciding whether some new situation is or isn’t a good pattern-match to “human flourishing”, it does not have a pointer to the ground-truth human reward-function, and thus the full power of human philosophical introspection.
I feel like this somewhat undersells how good even current under-fitted AIs are at generalizing human moral judgment to novel situations.
My guess is that your moral judgment of world trajectories after 1 day of reflection is closer to what Claude 4 Opus would say than to the 1-day moral judgment of the majority of humans. I share your hope that if we are not speaking about the 1-day moral judgment but something closer to a long reflection, then most humans end up quite close (and in particular the majority ends up closer to you than to Claude 4 Opus) because of the mostly-shared "ground-truth human reward signals", but I don't feel very confident in this (p=0.7). If you are more confident than me, I am curious why!
(Just to spell out why I think there is diversity between humans: (1) there might be a lot of path dependence, especially when deciding what the long reflection should look like and how much to tap the human ground-truth reward signal, and the differences between humans' current desires are quite large; and (2) the ground-truth reward signal might differ significantly between humans: there are some well-known edge cases like psychopaths, but there might also be much more mundane diversity.)
(Even if it was the case that Claude 4 Opus was closer to you than to the majority of humans, this is not to say that letting an AI that is as poorly aligned as Claude 4 Opus control the future would be a good idea according to your lights; it would likely be bad both on common sense and ECL grounds.)
Nit: This is not the best example, because CDT agents very often want to make themselves more EDT (at least the sort of EDT agent that uses its actions to make inferences about events that happen after the CDT agent decided to transform itself into an EDT agent). CDT agents want to ensure that if they later find themselves in a Newcomb-like situation (where Omega analyzed them after the potential transformation), they leave with $1M as opposed to $10.
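To spell out the nit with the payoffs above (assuming Omega's prediction is near-perfect and happens after the self-modification decision):

$\mathbb{E}[\text{self-modify to EDT, then one-box}] \approx \$1\text{M} \quad \text{vs.} \quad \mathbb{E}[\text{stay CDT, then two-box}] \approx \$10$

Since the self-modification is causally upstream of Omega's prediction, even the CDT agent's own causal reasoning endorses it.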
(This is not a problem in your argument because you specify "The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory)".)