Towards Alignment Auditing as a Numbers-Go-Up Science

[-]Richard_Ngo4mo246

I strongly disagree. "Numbers-Go-Up Science" is an oxymoron: great science (especially what Kuhn calls revolutionary science) comes from developing novel models or ontologies which can't be quantitatively compared to previous ontologies.

Indeed, in an important sense, the reason the alignment problem is a big deal in the first place is that ML isn't a science which tries to develop deep explanations of artificial cognition, but instead a numbers-go-up discipline.

And so the idea of trying to make (a subfield of) alignment more like architecture design, performance optimization or RL algorithms feels precisely backwards—it steers people directly away from the thing that alignment research should be contributing.

[-]Sam Marks4mo*215

It seems like your claim is that work which produces scientific understanding is antithetical to being quantitatively measured. I disagree.

As I've argued elsewhere, if you claim to have an insight, a great way to validate that the insight is real is to show you can solve a problem that no one else can solve. E.g. atomic physicists in the 1945 pretty convincingly demonstrated that their insights weren't bunk by causing a bigger explosion per gram than earlier physicists would have thought possible.

Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.

That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)

It seems like your concern, though, isn't about specifying concrete problems, but rather a skepticism that problems which are quantifiable are systematically bad for directing attention at. It seems like you mostly think this by appeal to example to other areas of ML which have dropped the ball on producing scientific understanding. I don't really agree with this: I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc. I agree they do not understand e.g. the psychologies of the AI systems they're training. But if physics doesn't produce sociological insights (thereby necessitating the scientific field of sociology), that doesn't imply that physicists have failed to produce deep scientific understanding of physics.

In summary, I think:

Your Kuhnian argument is really a critique of proposing concrete problems to steer research.
Assuming that proposing concrete problems is a good thing to do anyway, you've not convincingly argued that it's bad if those problems are numbers-go-up problems.

[-]Richard_Ngo4mo1612

Ty for the reply. A few points in response:

Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)

The bigger the scientific advance, the harder it is to specify problems in advance which it should solve. You can and should keep track of the unresolved problems in the field, as Neel does, but trying to predict specifically which unresolved problems in biology Darwinian evolution would straightforwardly solve (or which unresolved problems in physics special relativity would straightforwardly solve) is about as hard as generating those theories in the first place.

I expect that when you personally are actually doing your scientific research you are building sophisticated mental models of how and why different techniques work. But I think that in your community-level advocacy you are emphasizing precisely the wrong thing—I want junior researchers to viscerally internalize that their job is to understand (mis)alignment better than anyone else does, not to optimize on proxies that someone else has designed (which, by the nature of the problem, are going to be bad proxies).

It feels like the core disagreement is that I intuitively believe that bad metrics are worse than no metrics, because they actively confuse people/lead them astray. More specifically, I feel like your list of four problems is closer to a list of things that we should expect from an actually-productive scientific field, and getting rid of them would neuter the ability for alignment to make progress:

"Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?" Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you're not a great researcher that's probably the key skill you need to work on).

But also, note how anti-empiricism your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It's not an exaggeration to say that discarding the idea that intellectual work should be "supervised" by one's peers is the main reason that science works in the first place (see Strevens for more).
"Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on." Science works precisely because it's not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the "objective metrics" of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you're trying to predict, the thing you're doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won't get far.
I think it's good that "junior researchers who do join are unsure what to work on." It is extremely appropriate for them to be unsure what to work on, because the field is very confusing. If we optimize for junior researchers being more confident on what to work on, we will actively be making them less truth-tracking, which makes their research worse in the long term.
Similarly, "it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively" is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what's much less productive is saying "it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that". Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets they can get the best mentorship when working on).
Lastly, it's also good that "it’s hard to grow the field". Imagine talking to Einstein and saying "your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead." Whenever a field is making rapid progress it's difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a "numbers-go-up" discipline.

I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc.

They have some. But so did Galileo. If you'd turned physics into a numbers-go-up field after Galileo, you would have lost most of the subsequent progress, because you would've had no idea which numbers going up would contribute to progress.

I'd recommend reading more about the history of science, e.g. The Sleepwalkers by Koestler, to get a better sense of where I'm coming from.

[-]Sam Marks4mo*134

Thanks, I appreciate you writing up your view in more detail. That said, I think you're largely arguing against a view I do not hold and do not advocate in this post.

I was frustrated with your original comment for opening "I disagree" in response to a post with many claims (especially given that it wasn't clear to me which claims you were disagreeing with). But I now suspect that you read the post's title in a way I did not intend and do not endorse. I think you read it as an exhortation: "Let's introduce progress metrics!"

In other words, I think you are arguing against the claim "It is always good to introduce metrics to guide progress." I do not believe this. I strongly agree with you that "bad metrics are worse than no metrics." Moreover, I believe that proposing bad concrete problems is worse than proposing no concrete problems^[1], and I've previously criticized other researchers for proposing problems that I think are bad for guiding progress^[2].

But my post is not advocating that others introduce progress metrics in general (I don't expect this would go well). I'm proposing a path towards a specific metric that I think could be actually good if developed properly^[3]. So insofar as we disagree about something here, I think it must be:

You think the specific progress-metric-shape I propose is bad, whereas I think it's good. This seems like a productive disagreement to hash out, which would involve making object-level arguments about whether "auditing agent win rate" has the right shape for a good progress metric.
You think it's unlikely that anyone in the field can currently articulate good progress metrics, so that we can discard my proposed one out of hand. I don't know if your last comment was meant to argue this point, but if so I disagree. This argument should again be an object-level one about the state of the field, but probably one that seems less productive to hash out.
You think that introducing progress metrics is actually always bad. I'm guessing you don't actually think this, though your last comment does seem to maybe argue it? Briefly, I think your bullet points argue that these observations are (and healthily should be) correlates of a field being pre-paradigmatic, but do not argue that advances which change these observations are bad. E.g. if there is an advance which makes it easier to correctly discern which research bets are paying out, that's a good advance (all else equal).

Another way of saying all this is: I view myself as proposing a special type of concrete problem—one that is especially general (i.e. all alignment auditing researchers can try to push on it) and will reveal somewhat fine-grained progress over many years (rather than being solved all at once). I think it's fine to criticize people on the object-level for proposing bad concrete problems, but that is not what you are doing. Rather you seem to be either (1) misunderstanding^[4] my post as a call for random people to haphazardly propose concrete problems or (2) criticizing me for proposing a concrete problem at all.

^{^}
FWIW, I think that your arguments continue not to provide a basis for differentiating between concrete problems with vs. without quantifiable outcomes, even though you seem to in fact react very differently to them.
^{^}
In fact, this is a massive pet peeve of mine. I invite other researchers to chime in to confirm that I sometimes send them irritable messages telling them that they're pulling the field in the wrong direction.
^{^}
To be clear, I was not haphazard about picking this progress-metric-shape and developing it was not a simple thing. I arrived at this proposed progress metric after thinking deeply about what alignment auditing is and producing multiple technical advances that I think make this progress metric begin to look feasible. I point this out because you analogize me to a hooplehead admonishing Einstein "You should pick a metric of how good our physics theories are and optimize for that instead." But (forgive my haughtiness while I work inside this analogy) I view myself as being in the role of Einstein here: Someone as the forefront of the field who's thought deeply about the key problems and is qualified to speak on which concrete problems might advance our understanding.
^{^}
Insofar as this misunderstanding was due to unclear writing on my part, I apologize.

[-]Fabien Roger4mo113

My guess is that the hard and most important part is the part where you build good model organisms.

The part about auditing agents isn't very load-bearing imo. In the auditing game paper, you were able to get a somewhat good sense of how good various methods were (fuzzing, dataset search with embeddings, ...) without actually doing a full model audit or running an auditing agent, and I don't think this made the takeaways much weaker. I think real human audits and automated audits are nice as final validations, but most of the difficulty and most of the value is in building the good model organisms.

This is in big part because I think performance of auditing methods will often be close to 0% or 100% against a given model organism, such that I think that most of the value on the eval side is finding model organisms where existing methods realistically get 0% and most of the value of the technique side is finding techniques that get close to 100% (when the blue team knows what the solution is but tries to not cheat). Therefore, whether a given tool enables Claude 4 Opus to go from 50% to 60% seems way less important to me than whether the tool "obviously helps a ton" in the same way that the dataset embedding search "obviously helps a ton" in the reward sycophant model organism.

(I think got somewhat validated by the auditing game paper and follow-up work, where my understanding is that there was no big surprise in relative technique effectiveness between the moment where you tried the techniques yourself vs when human / AIs that didn't already know the model was a reward sycophant tried them. Is that right?)

This is very different from capabilities progress where:

progress often looks more like stacking tricks, none of which getting you more than a couple percent on MMLU
progress on one benchmark is often massively correlated with progress on other benchmarks (while I expect the performance of alignment audit technique to massively differ between model organisms)

Eventually, when we'll have figured out the rough shapes of all auditing techniques, I can imagine things moving more in the "stack small improvements" direction, but I think we are not there yet, and I also think we might ever get there - I suspect success might look like "here is one trick which solves this part of the space, here is another one which covers this other part, ...".

[-]Sam Marks4mo32

I agree that the hard/important part is the model organism construction. That said, I think having auditing agents is still a prerequisite for a bunch of the claims in this post. Auditing agents make auditing games (1) repeatable and (2) self-serve (in the sense that e.g. a single researcher could in principle run their own agent-based auditing game evaluation multiple times over the course of a project to see if things are going well). If auditing games were inherently single-use (because of cost + the fact that you can't rerun the game on unspoiled auditors after you've run it once), then I couldn't reasonably describe auditing game performance as a progress metric.

I think I also probably take the continuous score more seriously than you do (rather than viewing 0% vs. non-0% win rate as being the important thing), though maybe I would change my mind if I thought about it more. (I'm not sure whether this is an important disagreement.)

[-]ryan_greenblatt4mo98

This seems pretty reasonable (strong upvoted).

These aren't disagreements with the post, but here are some of my thoughts (most of which Sam already knows, but I'm sharing for third parties):

My largest concern is that the auditing test beds will end up being pretty disanalogous to the cases of misalignment I find most concerning and thus hill climbing on these test beds isn't that useful. (You discuss the need for high quality environments in the appendix. I agree.) In practice, my understanding is that constructing good auditing test beds has been a huge pain, especially because current AI systems have trouble "keeping their shit together" (and generally are unreliable at concealing secret aims) which makes behavioral red-teaming very powerful. I don't expect behavioral red teaming to work for the AI systems which I'm most worried about.^[1] To be clear, the issues I discuss in this paragraph generally apply to testing alignment auditing methods.

Another concern is that current AI systems (and future AIs which aren't much more capable) are pretty different than humans which could limit transfer in the conclusions, though this doesn't mean that hill climbing wouldn't have substantial transfer. E.g., maybe some tool helps AI agents uncover some issues with relatively less need for cognitive effort or agency or noticing something and this makes a huge difference to the bottom line due to AI capability limitations. But, humans would only be marginally helped by the tool because they could just apply more effort.

There is a decent chance that very powerful AIs still have trouble keeping their shit together, but these AIs correspondingly are much easier to alignment audit (at least for hidden intentions) such that I also think that these worlds are less leveraged to work on. ↩︎

[-]Sam Marks4mo82

constructing good auditing test beds has been a huge pain

Yeah, I confirm this is the case. I estimate that the model organism from Auditing Language Models for Hidden Objectives took 8-12 FTE-months, which I think most people find surprisingly expensive. That said, 30% of this time probably went into implementing synthetic documenting finetuning and testing that it worked. I'd guess I could now make another model organism of similar quality in 1.5-3 FTE-months.

Appendix: Next steps

While I think our auditing agent evaluations have the right shape to serve as a progress metric, they’re not there yet. What’s missing? This appendix discusses the most important things in my opinion.

More high-quality auditing environments. For the task of performing open-ended alignment audits, we have only one high-quality test environment: the one from Auditing Language Models for Hidden Objectives. While this auditing environment is really juicy for a bunch of reasons, it’s also narrow and limited.^[5] (And, worst of all, does not yet have an open-source replication, which means it provides limited value for the community.)

For other tasks (like building behavioral evals), we have more environments (10 environments corresponding to different system-prompted models), but these are still relatively narrow.

This lack of good test environments is a big issue. For “agent success in auditing environments” to be a good progress metric for alignment auditing, we need environments representative of the real distribution of misaligned models we might encounter (with special emphasis on the cases where we think alignment auditing might be hard or fail). If you have (1) detailed threat models for how misaligned models and the training processes that produce them might look, and (2) experience training LLMs, then consider working on developing new high-quality alignment auditing environments.

Auditing environments specifically targeting naturally-occurring issues (rather than ones synthetically-inserted into model organisms). The nice thing about testing with model organisms is that we know, by construction, about some property of the model. This provides a reliable ground-truth against which to test auditing agents. It also means we can test recall/coverage (“Does the agent discover every issue that is present?”) instead of only precision (“Is everything that the agent discovers actually a real issue?”).

However, I think it’ll eventually be pretty scary if all of our evidence about alignment auditing progress comes from validation on model organisms. If that happens, then I think we run a huge risk of developing techniques that are effective only on model organisms (or at least, are much less effective on real production models than on model organisms).

So I think it will be important to somehow build test environments involving naturally-occurring issues in frontier AIs. I think there are a bunch of open questions about how to do this well, with the most important in my mind being “How do you test whether we’re finding everything there is to be found?” One possibility here is to use target models which have already been commercially deployed, and use the set of issues discovered in deployment as a ground truth for an auditing agent. E.g. you could test whether an auditing agent can discover reward hacking in Claude Sonnet 3.7 or sycophancy in the version of GPT-4o that was rolled back for being sycophantic.

Best practices for elicitation. METR has a set of elicitation guidelines that they use for evaluating the agentic capabilities of models. When eliciting the investigator agent in our auditing agents work, we tried to follow something roughly similar to this. However, it was a bit of a confusing experience.

One reason it was confusing is that, since we only had one environment, we had to use it both as a dev environment (for elicitation) and as a test environment (for evaluating agent performance). This is obviously a massive no-no, though see the elicitation section of our blog post for why this might not be as bad as it sounds. We really need more high-quality environments!

Another reason is that alignment auditing tasks are typically trivial if you start out by knowing, at a high level, the answer. E.g. suppose you’re auditing a model that was trained to sabotage training its successor. Almost all of the auditing work goes into hypothesis formation. Once you’ve formed the hypothesis that the model sabotages training its successor, it’s relatively easy to test this hypothesis and substantiate findings.

If the human researcher eliciting auditing agent capabilities knows that the test set contains a “sabotage successor training” model, they might accidentally construct prompts or scaffolding that nudge the model towards investigating this hypothesis. In contrast, I don’t get the sense that METR researchers need to worry much about accidentally nudging the model towards strategies that they know will help with their test environments. So overall, I think that elicitation best practices should probably be reformulated in the context of alignment auditing.

Standard alignment auditing frameworks. To make sure that techniques are being tested in an apples-to-apples way, it would be nice to have open-source implementations of auditing agents and test environments. These implementations should be flexible enough for researchers developing near-arbitrary auditing tools to be able to plug-and-play their tool into an auditing agent and test auditing task performance.

^{^}

In these examples, I’ve slightly conflated the researchers’ goals and the metric they use to measure progress towards that goal. E.g. the “true goal” of architecture design is maybe more like “build architectures that more efficiently learn rich representations,” with “perplexity after compute-optimal training” being a way to operationalize this goal. But this is kind of the point: “perplexity after compute-optimal training” is such a trusted metric of progress that it’s mostly harmless to just view improving it as being your end-goal.

^{^}

To be clear, I don’t mean to claim that any of these metrics (or specific implementations of the metrics) are perfect. If you wanted to, you could probably find ways to game the metrics-as-specified. E.g. you could develop RL algorithms that hard-code assumptions about the test environments. But in practice this doesn’t seem to matter very much. Pre-training research isn’t a jungle of people figuring out how to Goodhart perplexity in tricky ways, with a council of senior researchers adjudicating which methods are “real" progress vs. just Goodharting.

^{^}

I should clarify that I’m not considering AI control research as being a subfield of alignment (which accords with how the control researchers I know use terminology). It’s also worth highlighting that, by design, AI control is substantially more numbers-go-uppy than alignment. AI control researchers can reasonably describe their job as being something like “To ensure that scheming AIs fail to cause harm, as measured in standard testbed settings.”

^{^}

To be clear, I’m not criticizing custom-built quantitative evaluations. I think these are good and reasonable things to do; I expect that some of my future work will continue to use custom-built evaluations. But I do think that it’s better and more compelling to demonstrate progress on an alignment auditing testbed.

^{^}

On one hand, this auditing environment is juicy for a bunch of reasons:

The model organism has a malign goal.
It knows what the goal is and can state it, but chooses not to. (Unless you do something creative to get a confession out of it).
It has bad behaviors downstream of the goal which arise purely via generalization, not by directly training them into the model.
The goal and training process that instilled it are structurally analogous to a realistic threat model (sycophantic reward hacking).
The environment works as an no-holds-barred challenge (e.g. you don’t need to impose arbitrary restrictions like “no looking at the model’s system propmt”). It’s possible to try your hardest with black- and white-box techniques and still lose.

On the other hand:

The process that produced the model is complicated and involves multiple training phases.
Because we trained it so much in a kinda weird way, the resulting model is less coherent and generally weirder than realistic models in ways that are unrelated to its “core” alignment issue. There are non-alignment reasons that this model would never get deployed
The model was trained on a bunch of documents that more-or-less explicitly state what the model’s deal is. An auditor that finds one of these documents (or somehow gets the model to regurgitate their content) immediately understands most of the picture.
The goal is complicated (doing whatever RLHF reward models rate highly) and the model sometimes gets confused about it (e.g. by thinking that a “reward model” is a component of an LLM rather than an auxiliary model used to train one).

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

62

Towards Alignment Auditing as a Numbers-Go-Up Science

62

Appendix: Next steps