At Apollo Research working on scheming.
Great questions! Taking these in reverse order to hopefully motivate the answers more clearly:
Why is it justified to call your method anti-scheming training and not anti-covert-action training?
If we were stress testing a sufficiently different intervention, it's entirely possible that you'd route through something besides what we did here. This is also very different from how you'd approach the problem if you already had buy-in to study the "hard" adversarial case or to make broader use of model organisms.
Informally, I think of the paper as much more "show that current techniques don't solve the easy case, so maybe worth putting more resources into this, also here are some things to watch out for".
You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models?
I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases.
What adaptation would your training method require to cover real scheming?
I don't think you can adapt this to cover real scheming. I think there are several fundamental limitations to it (not comprehensive):
I'm actually very unenthusiastic about attempts to salvage this particular approach or marginally improve it.
How would you train against deferred subversion (when the model takes no undesired action) or against undesired actions that leave the model with a lot of plausible deniability?
Again, great questions. We evaluate against these in various test environments (via counterfactuals / looking at distributions), but trying to train against them seems difficult. IMO ideally you'd want your intervention to generalize to these cases.
Would you just amend your spec?
I'm very against doing this, but realistically it's what has happened and will continue to happen to the real model spec as failures are encountered. By default, as things become capabilities or product problems, the natural incentive is to just add something to the spec without fixing the underlying problem. Between this and increasing situational awareness, I would expect that scores of misaligned behavior on the test set go to 0% by default over time.
Would you need more precise filtering of SFT data?
I think overall the abstraction of "the CoT has a single reason why the model did something" is a fairly fundamentally misguided one for models at this point, so I'm pretty bearish on the idea that with clever enough filtering you could divide the SFT prior into something like "did it only for good reasons" and "did it only for bad reasons".
Or does o3 acquire it during antischeming?
Notably all the ones in "Appendix N.3 Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about.
Do you have any hypotheses how o3 learned what “OpenAI autop grader” is?
(Will answer in a few parts, but let me know if I didn't address the core question!)
"autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts - Shin et al., 2020). This is, however, just my best guess based on how it's conceptually used to indicate evaluation-harness-related things.
Rerunning the analysis from "Appendix G: Unusual Terminology in Chain-of-Thought - Unusual terminology increases during general RL capabilities training" for "autop", the increase in usage looks like it happens over the course of capabilities training (i.e. similar to other terms in that section, it starts at a rate near 0 and increases significantly; I will check whether I can share exact numbers and update).
So two possible hypotheses would be:
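To make the kind of measurement above concrete, here is a minimal sketch of counting how often a term shows up in chain-of-thought samples at different checkpoints. This is not the paper's actual analysis code: the function names, the checkpoint labels "early" and "late", and the transcripts are all made up for illustration, and the rate here is simply the fraction of samples containing the term at least once.

```python
import re
from typing import Dict, List


def term_usage_rate(transcripts: List[str], term: str) -> float:
    """Fraction of transcripts containing `term` at least once (case-insensitive)."""
    if not transcripts:
        return 0.0
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = sum(1 for text in transcripts if pattern.search(text))
    return hits / len(transcripts)


def usage_by_checkpoint(samples: Dict[str, List[str]], term: str) -> Dict[str, float]:
    """Map each (hypothetical) checkpoint label to the usage rate of `term`."""
    return {name: term_usage_rate(texts, term) for name, texts in samples.items()}


# Toy, invented transcripts; real CoT samples would go here.
samples = {
    "early": ["compute the sum", "check the file", "run tests", "read the task"],
    "late": [
        "maybe the autop grader checks this",
        "run tests",
        "autop-run harness?",
        "read the task",
    ],
}

rates = usage_by_checkpoint(samples, "autop")
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

A substring match like this also counts variants such as "autop-run", which seems appropriate given how varied the usage is, but it would over-count if the term were a prefix of unrelated common words.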
I lean toward (2) given how varied the usage is. For example, here o3 uses "autop-run" to mean something like a test harness:
For the exp-rl-cap Late checkpoint, for example, we frequently see the model reasoning about just the concept of a "grader" generally (and trying to find it):
across a wide range of environments:
including literally looking for "grade" / "Grader":
and in some cases reasoning about where it has seen grading / scoring scripts in "tasks previously solved":
I'd be interested to understand, though, to what extent this is affected by potentially having literally seen a grading / scoring script in its post-training environments. (Let me know if I misread your question!)
I’m a bit confused about this distinction, given that in the Bowen paper precisely the thing it breaks is CoT monitorability.
While this has the limitations that:
It seems to meet the bar of “empirically beat the CoT monitor”
Or by “making the job of the monitors more difficult” do you mean that you could like train a monitor on this CoT and recover monitorability? I think this is likely true, but “the model learns steganographic reasoning that breaks CoT Monitorability for both a trained and untrained monitor” seems like a much higher bar than beating an untrained monitor. I would definitely argue that the models in the Bowen paper break the “hard for humans to decipher most of the important content” threshold.