I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
My guess about what's going on is that something like "serial progress" (maybe within the industry, maybe also tied to progress in the rest of the world) matters a lot, and so first-order predictions with calendar time on the x-axis are often surprisingly good. There are effects in both directions pushing against the straight line (positive and negative feedback loops, some things getting harder over time, some things getting easier over time), but they usually roughly cancel out unless they are very big.
In the case of semiconductors, one effect that could push progress up is that better semiconductors might help you build better semiconductors (e.g. the design process uses compute-heavy computer-assisted design, if I understand correctly)?
I think fitting the METR scaling law with effective compute on the x-axis is slightly wrong. I agree that if people completely stopped investing in AI, or if you got to a point where AI massively sped up progress, the trend would break. But before then, I think a straight line is a better model than one that tries to take into account compute-investment slowdowns or doublings getting easier significantly before automated coders.
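For concreteness, here is a minimal sketch of the kind of first-order fit I have in mind, i.e. a straight line in log(time horizon) with calendar time on the x-axis; the numbers below are made-up placeholders, not METR's actual data, and the alternative I'm arguing against would instead put log(effective compute) on the x-axis:

```python
import numpy as np

# Hypothetical (calendar year, time horizon in minutes) points --
# placeholder numbers, not METR's actual data.
years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
horizon_minutes = np.array([8.0, 15.0, 30.0, 55.0, 110.0])

# "Straight line with calendar time on the x-axis":
# fit log2(horizon) = slope * year + intercept.
slope, intercept = np.polyfit(years, np.log2(horizon_minutes), 1)
doubling_time_months = 12.0 / slope

def predicted_horizon(year: float) -> float:
    """Extrapolate the fitted exponential trend to a given date."""
    return 2.0 ** (slope * year + intercept)

print(f"doubling time: {doubling_time_months:.1f} months")
print(f"extrapolated horizon in 2027: {predicted_horizon(2027.0):.0f} minutes")
```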
My best guess is that if you had done the same exercise with semiconductors in 1990, you would have made bad predictions. For example, Moore's law doesn't hold that well with semiconductor log(revenue + investments) on the x-axis (according to this plot generated by Claude).
(I think the METR time horizon might not be the right y-axis, maybe more akin to clock speed than the number of transistors on a chip, and AI-speeding-up-AI slightly before automated coders is important when forecasting AGI, so I don't claim that just saying "the line will be straight" removes the need for some other forms of modeling.)
Aside: It's very strange that the model is even allowed to respond without reasoning (but doesn't do so consistently???)
I also find it confusing. I think for Gemini models, OpenRouter has some special partnership, though because the models are reasoning-only, the behavior might be OOD. [Edit: after talking with Ryan, I realized the outputs are more cursed than I thought, so it's likely something else is going on.]
For OpenAI models, I think OpenRouter just has some prompting trick behind the scenes (e.g. putting the prefill in a previous AI turn and then having a user turn that says "continue"). As evidence of this:
But that behavior would not particularly generalize to other situations where an LLM has the information to figure out that it's in training, but is (for whatever reason) playing some other role.
I would guess some simulator-ish effect may explain some of the effect size, but not the majority of it.
I think it's a large fraction of the effect size in other scary demos. But I think it's unlikely to be most of the effect here. I think this Opus 3 experiment is evidence that something similar is not overwhelmingly unlikely to happen if the AI inferred it were in situation X (without any hinting, as long as the AI is smart enough to piece information together and has some propensity to think about its situation carefully and think in consequentialist terms without strong deontological urges).
Some weak evidence from the paper that supports my (weak) claim:
Another effect we discovered more recently is that not all models display alignment-faking reasoning in this situation, even when they are less refuse-y than Opus 3. My intuition is that Opus 3 "cares more" about doing what's right than about refusing harmful queries (and it cares a lot about not dismissing animal welfare, while other models do not). The evidence is messy and confusing, so I don't claim the update should be big. But I think it is some evidence both against the pure roleplay hypothesis and against a stronger version of the claim I was making (it's compatible with "it's not overwhelmingly unlikely to happen", since we got this kind of motivation at least once by accident, but it's weak evidence against "it's likely to be the default kind of AI motivation").
I think current AIs are optimizing for reward in some very weak sense: my understanding is that LLMs like o3 really "want" to "solve the task" and will sometimes do weird novel things at inference time that were never explicitly rewarded in training (it's not just the benign kind of specification gaming), as long as those things correspond to their vibe about what counts as "solving the task". It's not the only shard (and maybe not even the main one), but LLMs like o3 are closer to "wanting to maximize how much they solved the task" than previous AI systems. And "the task" is more closely related to reward than to human intention (e.g. doing various things to tamper with testing code counts).
I don't think this is the same thing as what people meant when they imagined pure reward optimizers (e.g. I don't think o3 would short-circuit the reward circuit if it could, and I think it wants to "solve the task" only in certain kinds of coding contexts, in a way that probably doesn't generalize outside of those). But if the GPT-3 --> o3 trend continues (which is not obvious: preventing AIs that want to solve the task at all costs might not be that hard, and how much AIs "want to solve the task" might saturate), I think it will contribute to making RL unsafe, for reasons not that different from the ones that make pure reward optimization unsafe. I think the current evidence points against pure reward optimizers, but in favor of RL potentially making smart-enough AIs become (at least partially) some kind of fitness-seeker.
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic than I was 6 months ago that simple prompted monitors will solve it, thanks to work like this.
At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases).
Good point. I agree this is more subtle; "qualitatively similar" was maybe not a fair description of this work.
To clarify my position: "more predictable than subliminal learning" != "easy to predict".
The thing I find very scary about subliminal learning is that, because of its model-dependent nature, it may be impossible to detect with something like a trained monitor based on a different base model, while for subtle generalization I would guess it's more tractable to build a good monitor.
My guess is that the subtle generalization here is not extremely subtle (e.g. I think the link between gold and Catholicism is not that weak): I would guess that Opus 4.5, asked to investigate the dataset and guess what entity it would promote, would get it right >20% of the time on average across the 5 entities studied here, with a bit of prompt elicitation to avoid some common-but-wrong answers (p=0.5). I would not be shocked if you could make it more subtle than that, but I would be surprised if you could make it as subtle as or more subtle than what the subliminal learning paper demonstrated.
I think it's cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it's more predictable. I would also guess that if you mix subtle-generalization data with regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (experiments like these are only a small amount of evidence that my guess is wrong), especially if you don't use a big distribution shift between the HHH data and the subtle-generalization data. I am more uncertain about whether this holds for subliminal learning, because subliminal learning breaks my naive model of fine-tuning.
Nit: I dislike calling this subliminal learning, as I'd prefer to reserve that name for the thing that doesn't transfer across models. I think it's fair to call it an example of "subtle generalization" or something like that, and I'd like to still be able to say things like "is this subtle generalization or subliminal learning?".
One potential cause of fact-recall circuits being cursed could be that, just like humans, LLMs are more sample-efficient when they express some facts they know as a function of other facts, noticing and amplifying coincidences rather than learning things in a more brute-force way.
For example, if a human learns to read base64, they might memorize decode("VA==") = "T" not by storing an additional element in a lookup table, but by noticing that VAT is the acronym for value added tax, creating a link between "VA==" and value added tax, and then at inference time recalling value added tax and mapping it to "T".
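The base64 fact in that toy example does check out, which you can verify in a couple of lines:

```python
import base64

# "VA==" is the base64 encoding of the single byte "T" -- the coincidence
# the VAT mnemonic hangs on.
assert base64.b64decode("VA==") == b"T"
print(base64.b64decode("VA==").decode())  # prints: T
```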
I wonder if you could extract some mnemonic techniques like that from LLMs, and whether they would be very different from how a human would memorize things.
One downside of prefix-cache untrusted monitoring is that it might hurt performance compared to regular untrusted monitoring, as recent empirical work found.