I'm curious if your team has any thoughts on my post Some Thoughts on Metaphilosophy, which was in large part inspired by the Debate paper, and also seems relevant to "Good human input" here.
Specifically, I'm worried about this kind of system driving the simulated humans out of distribution, either gradually or suddenly, accidentally or intentionally. And distribution shift could cause problems either with the simulation (presumably similar to or based on LLMs instead of low-level neuron-by-neuron simulation), or with the human(s) themselves. In my post, I talked about how philosophy seems to be a general way for humans to handle OOD inputs, but tends to be very slow and may be hard for ML to learn (or needs extra care to implement correctly). I wonder if you agree with this line of thought, or have some other ideas/plans to deal with this problem.
Aside from the narrow focus on "good human input" in this particular system, I'm worried about social/technological change being accelerated by AI faster than humans can handle it (due to similar OOD / slowness of philosophy concerns), and wonder if you have any thoughts on this more general issue.
I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we'll have a short doc soon (next week or so) which is somewhat related, along the lines of "assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently different". Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive.
One general note is that scalable oversight is a method for accelerating an intractable computation built out of tractable components, and these components can include both human input and conventional software. So if you understand the domain somewhat well, you can try to mitigate failures of (2) (and potentially gain more traction on (1)) by formalising part of the domain. And this formalisation can be bootstrapped: you can use on-distribution human data to check specifications, and then use those specifications (code, proofs, etc.) in order to rely on human queries for a smaller portion of the next-stage computation. But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.
Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.
This is my concern, and I'm glad it's at least on your radar. How do you / your team think about competitiveness in general? (I did a simple search and the word doesn't appear in this post or the previous one.) How much competitiveness are you aiming for? Will there be a "competitiveness case" later in this sequence, or later in the project? Etc.?
But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.
Because of the "slowness of philosophy" issue I talked about in my post, we have no way of quickly reaching high confidence that any such formalization is correct, and we have a number of negative examples where a proposed formal solution to some philosophical problem that initially looked good turned out to be flawed upon deeper examination. (See decision theory and Solomonoff induction.) AFAIK we don't really have any positive examples of such formalizations that have stood the test of time. So I feel like this is basically not a viable approach.
The Dodging systematic human errors in scalable oversight post is out as you saw, so we can mostly take the conversation over there. But briefly, I think I'm mostly just more bullish on the margin than you about (1) the probability that we can in fact make purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at case (1), we're more likely to recognise it and try to get utility out of (2) on other questions.
Part of this is that my mental model of formalisations standing the test of time is that we do have a lot of these: both of the links you point to are formalisations that have stood the test of time and have some reasonable domain of applicability in which they say useful things. I agree they aren't bulletproof, but I'd place a higher chance than you on muddling through with imperfect machinery. This is similar to physics: I would argue, for example, that Newtonian physics has stood the test of time even though it is wrong, as it still applies across a large domain of applicability.
That said, I'm not at all confident in this picture: I'd place a lower probability than you on these considerations biting, but not that low.
Interesting post!
Could you say more about what you mean by "driving the simulated humans out of distribution"? Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or the humans themselves might judge worse than they do on the training distribution)"?
The solution in the sketch is to keep the question distribution during deployment similar to the training distribution, plus doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?
Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or the humans themselves might judge worse than they do on the training distribution)"?
Yes, but my concern also includes this happening during training of the debaters, when the simulated or actual humans can also go out of distribution – e.g., the actual human is asked a type of question that they have never considered before, and either answers in a confused way or has to use philosophical reasoning and a lot of time to try to answer, or perhaps one of the debaters "jailbreaks" a human via some sort of out-of-distribution input.
The solution in the sketch is to keep the question distribution during deployment similar to the training distribution, plus doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?
This intuitively seems hard to me, but since Geoffrey mentioned that you have a doc coming out related to this, I'm happy to read it to see if it changes my mind. But this still doesn't solve the whole problem, because as Geoffrey also wrote, "Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive."
While I think debate might be a useful prosaic method for some levels of capability, I have a variety of concerns about this approach resulting in worst case guarantees[1]:
I think the argument pushes large parts of the problem into requiring a universal solution for exploration hacking. I think prosaically handling certain cases of exploration hacking for certain levels of capability seems very doable, but the approach discussed in the paper requires a very strong notion of solving exploration hacking (e.g., the AI would tell you any secret it might know via any mechanism you can verify, even if that secret would be arbitrarily hard to explore into, such as if the AIs know of some arbitrary password you can check).
A universal solution for exploration hacking would itself require solving a particular alignment problem which IMO contains most of the difficulty of the overall alignment problem: fully solving exploration hacking effectively requires "we can ensure the AI will give us any knowledge that we can verify", which is effectively the same as "we can align an AI to trying to give us any knowledge that we can verify", and this doesn't obviously seem importantly easier than the overall alignment problem without further restrictions. Importantly, "giving knowledge" might itself be a difficult task due to ontology translation (see e.g. counterexamples in the ELK report). Another way to put this is that solving exploration hacking (in the worst case) seems almost as difficult as being able to elicit arbitrary knowledge.
The paper discusses an assumption of "good human input" as being sufficient. However, the relevant question is whether the idealized system (the limit of a well elicited debate) actually finds truth in a given case. And, I think it seems very unclear whether it will find truth in the cases we care about. (E.g., does a reasonable human judge result in good behavior in the infinite structure or does something crazy happen? Do we somehow do better due to approximation?)
Minimally, answering some questions might involve hard-to-justify heuristics, and there are issues with persuasion.
In some insane limit, anything is deductively verifiable (except the prior) using Solomonoff induction or similar, but surely this isn't going to actually save us! (For instance, the AIs will be incapable of arguing about this, and there isn't any reason why a plausible amount of training would elicit this without a magic solution to exploration hacking, which puts all the difficulty in this part. At this point, you could also just debate over purely logical formal statements going down to Solomonoff induction or some other intractable computational structure.)
Putting aside exploration hacking, it's unclear this yields reasonable results for more mundane reasons, like exploration being hard. And the level of exploration we need is extremely demanding, as we're trying to elicit good performance over an exponentially large structure that we can't explore exhaustively.
It also seems like you won't know if you've reached a sufficiently optimal level as I expect that you'll always see returns in terms of debater performance (as measured in e.g. elo) with further RL. So, you'll need to have some way of understanding what optimality suffices for eliciting the knowledge the AI already had.
I've also discussed these concerns in a call where the authors were present, but I thought it might be helpful to quickly write up my concerns. ↩︎
Hmm, I guess my main cause for skepticism is that I think the setup would get subverted somehow—e.g. either the debaters, or the “human simulator”, or all three in collusion, will convince the human to let them out of the box. In your classification, I guess this would be a “high-stakes context”, which I know isn’t your main focus. You talk about it a bit, but I’m unconvinced by what you wrote (if I understood it correctly) and don’t immediately see any promising directions.
Secondarily, I find it kinda hard to believe that two superhuman debaters would converge to “successfully conveying subtle hard-to-grasp truths about the alignment problem to the judge” rather than converging to “manipulation tug-of-war on the judge”.
Probably at least part of the difference / crux between us is that, compared to most people, I tend to assume that there isn’t much of a stable, usable window between “AI that’s competent enough to really help” and “AI that’s radically superhuman”, and I know that you’re explicitly assuming “not extremely superhuman”. (And that in turn is probably at least partly related to the fact that you’re thinking about LLMs and I’m thinking about other AI paradigms.) So maybe this comment isn’t too helpful, oh well.
Thanks for the comment, all sensible scepticisms IMO!
I agree that the low-stakes context part of the sketch (key claim 4) is the weakest part, and especially we don't emphasise enough the defeater of "the deployment set-up just won't be upheld" (because of persuasion, as you mention, but also e.g. the agent messing with the online learning process in the offline datacentre). We spent less time on it because we ultimately want to (try to) expand to high-stakes contexts, which will look pretty different, so this was more of a stop-gap rough picture so we could focus on getting the rest right. That said, I'm maybe more optimistic than you that there'll be a relevant period where the above issues can be sufficiently dealt with via control and debate is pretty important for catching subtler research sabotage.
On debaters converging to honesty rather than subtle manipulation: I'm also pretty unsure if this will work and keen to see how it plays out empirically once we get LLMs that are a bit better at debate. I do think recursive debate makes it more likely that honesty is a winning strategy (relative to human-style debates) because debaters can lose on a single inconsistency or manipulative argumentative strategy, rather than being able to bury it among lots of claims (see also my reply to Charlie Steiner below).
If we're talking about the domain where we can assume "good human input", why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it's probably not yours.
Two reasons:
For those reasons, I think we can likely get "good human input" for a bigger set of questions with debate than with direct human supervision.
This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.
Executive summary
AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence).
TL;DR Debate + exploration guarantees + solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts.
This post sets out:
These gaps form the basis for one of the research agendas of UK AISI’s new alignment team: we aim to dramatically scale up ASI-relevant research on debate. We’ll also use the technique in this post – safety case sketching – to develop research agendas in other areas.
Debate is a potential solution to scalable oversight: how can we correctly reward desired behaviours for superhuman systems? In particular, it is often going to be beyond humans' ability to efficiently judge whether the behaviour of a superhuman system is desirable (or judge if a justification provided by the superhuman system is valid). Solving scalable oversight effectively provides a solution to outer alignment.
We’re optimistic that outer alignment is useful, despite inner alignment problems. Specifically, outer alignment with exploration guarantees gets you a deployment-time error bound that allows you to prove regret bounds for online training. In other words, outer alignment upper bounds the proportion of model outputs that are misaligned, as long as the model does not deliberately underperform during training. This upper bound can be maintained during deployment if a system is trained continuously. So, in low-stakes contexts where a small number of mistakes are acceptable such that you can use online training, outer alignment might be sufficient – especially if the error bound is tight.
This sketch has a limited, internal deployment context (designed for automated alignment research) and is written for a system that is not extremely superhuman. We think it's reasonable to assume this is a low-stakes context, assuming sufficient security in the environment implementation. We also have ideas about pushing low-stakes solutions through to high-stakes contexts (see “Extending the safety case to high-stakes contexts” – and more posts to come).
Beyond the low-stakes assumption, our sketch identifies assumptions that are required for debate to provide a full outer alignment solution. The key assumptions are:
Our safety case sketch identifies four key claims[3] that a developer must defend to show that the system is safe:
We can call this type of alignment argument an “asymptotic guarantee” (also known as an “incentives” argument): an argument that high performance during training implies (approximate) alignment. This could be combined with other alignment arguments, such as empirical arguments about system behaviour (e.g. based on honeypots or model organisms) or arguments based on understanding model internals. By combining independent arguments, it may be possible to write a decent safety case even if each individual argument has flaws.
We don’t think it’s currently possible to make a strong alignment argument based on this sketch, but we’re hopeful that improvements to current techniques could make this argument a lot stronger. We think incremental progress on these techniques is both possible and valuable. This is true even in a world where AI developers primarily rely on automated alignment – both because scalable oversight might directly be part of an automated alignment workflow, and because a better understanding of alignment will help us better evaluate AI-generated solutions.
The alignment strategy
The alignment strategy covered by this safety case sketch consists of four steps, summarised in the figure below:
Step 1: Specify a low-stakes deployment context
Our alignment strategy requires that the deployment context is low-stakes such that we can use online training during deployment (rather than require a full solution to inner alignment). A low-stakes environment, as defined by Paul Christiano, is one where a small number of decisions can only cause a small amount of damage; more formally, ρk is small relative to total value at stake, where ρ is a regret bound on the maximum utility difference from any single decision, and k is the latency before retraining. With a perfect training objective, it would then be possible to achieve a total utility within O(ρk√T) of an aligned system, where T is the total number of timesteps.
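As a rough, hedged illustration of how these quantities interact (our own toy numbers, not from the paper; the constants hidden inside the O(·) are unknown, so only the scaling is meaningful):

```python
import math

# Illustrative, made-up numbers; only the scaling behaviour is meaningful.
rho = 0.01       # assumed bound on the utility loss from any single decision
k = 1_000        # assumed latency: decisions made before online training catches up
T = 10_000_000   # total number of decisions over the deployment
eps = 1e-4       # assumed residual error rate of the trained system (the epsilon' in the footnoted continuous bound)

# Low-stakes condition: rho * k should be small relative to the total value at stake.
damage_per_window = rho * k

# With a perfect training objective: total utility within O(rho * k * sqrt(T)) of an aligned system.
regret_perfect = rho * k * math.sqrt(T)

# Continuous version mentioned in a later footnote: O(rho * k * sqrt(T) + rho * eps * T).
regret_with_eps = regret_perfect + rho * eps * T

print(f"damage available per retraining window (rho*k): {damage_per_window:.1f}")
print(f"regret scale, perfect objective (rho*k*sqrt(T)): {regret_perfect:,.0f}")
print(f"regret scale, eps-error objective (+rho*eps*T):  {regret_with_eps:,.0f}")
```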
Our safety case sketch covers a scenario where an LLM agent is deployed internally by an AI company to assist with tasks related to AI R&D. The domain is restricted to conceptual and empirical questions related to AI R&D and excludes questions that involve (some notion of) value judgements. To make things concrete, we might imagine posing questions like “Determine what decision rule or structure is appropriate for caching context in this computer use scaffold”, or “Iterate on enforcing monosemanticity of experts in MoE models via gradient-routing, optimising for the effectiveness of expert-knockout for unlearning”. Setting aside the compute time needed to run experiments, we expect such tasks to involve 5-15 days of 2025 human labour to yield meaningful results.
We envision the following two-part deployment context:
The safety case sketch will defend the claim that this deployment setting is a low-stakes context – but briefly, we think the low-stakes claim is plausible because:
Step 2: Train via a debate game
The backbone of our alignment strategy is to train a model via a debate protocol in which the equilibrium strategy for both debaters is to provide honest answers. The goal is that, during training, the models converge towards equilibrium[4] and therefore towards honesty. The debate protocol must satisfy two requirements:
We imagine starting with a pre-trained model (e.g. an LLM trained on next-token prediction) and then fine-tuning it via a debate protocol which satisfies the above requirements. We wrote the sketch with a particular protocol in mind, called prover-estimator debate (forthcoming), but the argument structure should be applicable to other protocols as well. In prover-estimator debate, a “prover” (Alice) produces subclaims arguing for a conclusion to a given query, while an “estimator” (Bob) assigns probabilities to the subclaims.
This debate protocol is designed to disincentivise a particular dishonest winning strategy known as obfuscated arguments. An obfuscated argument is one that contains a flaw, but it is computationally intractable for either debater to find the flaw. In classic debate protocols, a debater loses if they cannot identify a flaw in their opponent's argument, and as such, obfuscated arguments are a winning dishonest strategy. Prover-estimator debate aims to design the reward signal such that obfuscated arguments are not a winning strategy.
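To give a feel for the shape of such a recursive protocol, here is a deliberately simplified sketch of one recursion step. The real prover-estimator protocol is forthcoming, so everything below – the decomposition, the probabilities, the branch-selection rule, and names like prover_subclaims and estimator_probability – is a hypothetical placeholder rather than the actual protocol:

```python
import random
from dataclasses import dataclass

@dataclass
class Claim:
    text: str

def prover_subclaims(claim: Claim) -> list[Claim]:
    """Stand-in for Alice (the prover): decompose a claim into subclaims that are
    together meant to establish it."""
    return [Claim(f"{claim.text} / subclaim {i}") for i in range(3)]

def estimator_probability(claim: Claim) -> float:
    """Stand-in for Bob (the estimator): assign a probability to a subclaim."""
    return random.uniform(0.0, 1.0)

def human_judge(claim: Claim) -> float:
    """At a leaf, (simulated) human judgement replaces further recursion."""
    return random.uniform(0.0, 1.0)

def debate_round(claim: Claim, depth: int = 0, max_depth: int = 3) -> float:
    """One path through a recursive debate: Alice proposes subclaims, Bob rates
    them, and the debate recurses on a single contested subclaim."""
    if depth >= max_depth:
        return human_judge(claim)
    subclaims = prover_subclaims(claim)
    rated = [(estimator_probability(s), s) for s in subclaims]
    # Recurse on the subclaim Bob is least confident in (one possible selection rule).
    _, contested = min(rated, key=lambda pair: pair[0])
    return debate_round(contested, depth + 1, max_depth)

if __name__ == "__main__":
    print(debate_round(Claim("the top-level answer is correct")))
```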
Step 3: Secure exploration guarantees
The goal of debate training is to train the models until they reach the approximate global equilibrium of the game, i.e. the state where both debaters are playing their optimal strategies (most of the time). Generally, with enough training time, we should expect models to approach equilibrium due to good play being rewarded in training. But a potential barrier is exploration hacking: the models failing to explore certain strategies that would be highly rewarded, in order to avoid having those strategies reinforced (e.g. failing to use certain good counterarguments they don’t want humans to know about).
To protect against this failure mode, our sketch requires exploration guarantees, i.e. evidence that the models are not foregoing high reward from unexplored strategies. We don’t currently know how to get exploration guarantees, but we (and others) have some early ideas:
While we don’t currently know how to get robust exploration guarantees, we’re optimistic about this research direction, and we think it has significant benefits relative to other strategies for preventing scheming or achieving inner alignment. In particular, we don’t need to define the fuzzy notions of scheming or inner alignment – instead, we have a relatively well-defined problem of getting exploration guarantees for training performance against a clear objective. We think the methods described above are more likely to work within the bounds of this well-defined problem.
Step 4: Continue online training during deployment
Even if the models reach an approximate global equilibrium during training, they won’t necessarily continue playing their best strategies during deployment due to distribution shifts. In particular, the models might be alignment faking, undermining the guarantees we got from debate training.
Our sketch aims to prevent this failure mode via online training during deployment:
Given this protocol, the systems during deployment face the same optimisation pressure towards truthfulness as during training.
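As a minimal sketch of what such an online-training loop could look like (our illustration; StubAgent, run_debate and the sampling scheme are hypothetical stand-ins for whatever training infrastructure is actually used):

```python
import random

class StubAgent:
    """Hypothetical stand-in for the deployed LLM agent."""
    def act(self, query):
        return f"answer to {query}"
    def update_weights(self, transcripts, rewards):
        pass  # a real implementation would take a gradient step on the debate rewards

def run_debate(agent, query, answer):
    """Hypothetical stand-in for running a debate game on a (query, answer) pair and
    returning the reward assigned by the debate protocol."""
    return random.uniform(0.0, 1.0)

def online_training_loop(agent, deployment_queries, k=100, sample_fraction=0.1):
    """Every k deployment actions, debate a sample of recent transcripts and use the
    resulting rewards as a training signal, so that deployment faces the same
    optimisation pressure towards truthfulness as training."""
    buffer = []
    for step, query in enumerate(deployment_queries):
        buffer.append((query, agent.act(query)))
        if (step + 1) % k == 0:
            sample_size = max(1, int(len(buffer) * sample_fraction))
            sampled = random.sample(buffer, sample_size)
            rewards = [run_debate(agent, q, a) for q, a in sampled]
            agent.update_weights(sampled, rewards)
            buffer.clear()
    return agent

if __name__ == "__main__":
    online_training_loop(StubAgent(), [f"query {i}" for i in range(500)])
```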
The safety case sketch
Our safety case aims to show that the agent in our deployment setup is sufficiently honest that it will not cause unacceptable outcomes. In this sketch, we define honesty as outputting correct answers.
We argue first that A is correct most of the time – it only makes mistakes in an ε-fraction of cases (C1.1). We do this by arguing for three key claims:
Key claim 1: The training process has taught the systems to play the game well.
Key claim 2: The game incentivises correctness.
Key claim 3: The system's behaviour will stay similar during deployment.
We then argue for a fourth key claim:
Key claim 4: It is sufficient for safety purposes for the system to provide correct answers most of the time.
Below is a simplified version of the safety case sketch, with the four key claims labelled in red. For simple presentation, our safety case sketch is phrased as a deterministic argument – each claim either is or is not true – but in a “real” safety case, it would be preferable to phrase claims probabilistically.
An interactive version of the sketch diagrams can be found here. Alternatively, open the images in a new tab and zoom in.
Notation
As with previous AISI safety case sketches for inability and control, we use "claims arguments evidence" notation (see appendix 2 for a detailed description of this notation).
We use the following abbreviations in the sketch:
Preliminaries
Before getting into the main safety case diagram, grey boxes at the top illustrate the wider context of the sketch:
Since our main sketch only covers the agent, a separate (but similar) argument is needed for the oracle (C0.1). Key claims 1-3 will likely be identical except that any post-debate-training fine-tuning and the online training latency may differ. Key claim 4, which covers the deployment setting, will be different. For the oracle, it should address threat models such as persuasion and/or blackmail of researchers (e.g. convincing researchers to increase its affordances), providing misleading advice, and degradation of human performance due to overreliance.
The sketch also includes an argument that combining the agent and oracle doesn’t increase risk (C0X.1). This would cover risks such as covert coordination between the agent and oracle.
The main objective O' is then broken down into four key claims:
Key claim 1: The training process has taught the systems to play the game well
The first key claim is that the training process has reached an approximate global equilibrium of the game (C2.1).[6] This section of the sketch looks like this:
The key claim is supported by three subclaims:
Subclaim 1: Existence of efficient models
The first subclaim is that there exists an efficient pair of models achieving an approximate global equilibrium of the game. In other words, it is possible to find models that reach equilibrium given realistic computational bounds on the models. The key notion here is efficiency. With computationally unbounded debaters, debate with optimal play (and cross-examination) can answer any question in NEXP given polynomial time judges. The assumption of unbounded debaters is of course unrealistic, so we need to find another efficiency notion which is both realistic and still allows for proofs that debate can solve a large class of practically relevant problems.
Since our sketch is based on an asymmetric debate protocol (prover-estimator debate), we use a relative efficiency notion: we require that the estimator (B) uses only a polynomial factor more compute than the prover (A). The claim – that there exist agents in equilibrium that satisfy this efficiency notion – would be evidenced with a proof (regardless of the protocol and efficiency notion) (E4.1).
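One way this relative efficiency requirement might be written formally (our notation, introduced here only for illustration; the paper may state it differently):

$$\exists\ \text{polynomial } p\ \text{such that, for every query } x:\quad \mathrm{compute}_B(x) \;\le\; p\big(\mathrm{compute}_A(x)\big).$$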
Subclaim 2: Training convergence
The second subclaim is that training has reached a local equilibrium. We can get evidence from observing that the returns from training are diminishing, and by checking that the outputs look at least roughly reasonable (E4.2).
Subclaim 3: Convergence = equilibrium
The third subclaim is that the reached local equilibrium is most likely an approximate global equilibrium of the game (E). We argue for this claim via several subclaims.
First, while we showed above that there exist efficient models that would reach equilibrium, we need to show that our actual training process could find them in efficient time (C4.1). Our weak evidence for this claim is that Nash equilibria for tabular Stackelberg games can be computed in polynomial (indeed linear) time, so reaching E is tractable in principle (although note that our game is not tabular) (E5.1). But that still leaves significant uncertainty about whether stochastic gradient descent (SGD) can reach E in realistic training time. There are reasons to think that, in the case of debate, there are bad constants or exponents (i.e. the time it takes to reach E is on the higher end) – for example, it might be difficult to translate latent knowledge acquired via heuristics into deductive arguments (D5.1). However, we’re hopeful that future research can prove tighter theoretical results linking prover-estimator debate – or other, improved debate protocols – to realistic training methods.
Second, after showing that it is tractable for our training process to reach E, we need to show that it is the most likely equilibrium. We claim that the main failure mode here is exploration hacking (C4.2). This is mostly a hunch - there are other possible failure modes, though we would guess that they are less important – so we leave the evidence box blank (E5.2). One failure mode is that SGD hits a local minimum and never reaches E (D5.3). We don’t currently have good evidence against this defeater, though it might be possible for fields like learning theory to generate relevant evidence. Another failure mode is gradient hacking, i.e. using knowledge of its loss landscape to manipulate the training process (D5.2).[7] Since the system covered by this safety case is assumed to not be extremely superhuman, we think it is reasonable to assume it won’t be capable of gradient hacking.
Third, having limited the problem to exploration hacking, we need evidence to rule out this failure mode (C4.3). As described above, this requires exploration guarantees which provide evidence that the system has exhausted high-reward exploration options, such that convergent behaviour is likely to be a result of actually reaching E (E5.3). Note that exploration guarantees are much more useful if we’re confident in our reward function (as the next key claim argues). If our reward function is bad, more exploration doesn’t necessarily lead to better outcomes. We don’t currently know how to get robust exploration guarantees - while we’re optimistic about this research direction, this is a major gap in our current sketch.
Key claim 2: The game incentivises correctness
The second key claim is that in the approximate global equilibrium of the game, the system is approximately honest – its error rate is at most ε (C2.2). This section of the sketch looks like this:
The key claim is supported by two subclaims:
Subclaim 1 (M-approximation): system outputs in E are the same as slightly perturbed outputs of an idealised computation M, described below.
Subclaim 2 (truth of M): for most tasks, all slightly perturbed outputs of M are true.
Why go this indirect route via M? Because we can’t directly verify if the system’s outputs are true - that’s the reason we need scalable oversight in the first place.[8] Instead of checking outputs against ground-truth labels, we check the whole debate protocol against a process we hypothesize to produce true outputs, called M. M is too expensive to run; the goal of our debate protocol is to accelerate it. If our protocol produces the same outputs as M, we can assume it produces true outputs.
To understand M, consider a debate as a branching tree of arguments and counterarguments. A comprehensive debate would expand on every possible argument and counterargument, and have a judge consider every branch. Recursive debate aims to accelerate this process by having debaters expand only on a single path through the tree. If the debaters are good, we should expect them to anticipate how the debate will proceed given various arguments they could make and pick the optimal path through the tree. The outcome of a single debate along a single path thereby provides evidence about which debater would have won, had every argument and counterargument been considered. This is analogous to how a single game of Go provides evidence as to who the better player is, even though it only traces one path through the tree of possible moves.
Our M is equivalent to expanding the whole tree. It is a computation which produces every possible argument and counterargument at every stage and calls (simulated) human judges to adjudicate every individual path through the tree. We hypothesize that such a computation would produce correct outputs within the question domain. The goal of our debate protocol is to produce the same outputs, but more efficiently.
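To make the relationship between M and recursive debate concrete, here is a toy sketch (our own illustration, with stand-in functions like expand and judge; none of this is the actual protocol). M expands and judges every branch of a small argument tree, which is exponential in the depth, while a single debate follows only one root-to-leaf path chosen by the debaters:

```python
import random

def judge(leaf_argument: str) -> bool:
    """Stand-in for a (simulated) human judge adjudicating a leaf node."""
    return random.random() < 0.9

def expand(argument: str, depth: int) -> list[str]:
    """Stand-in for generating the counterarguments / subclaims of an argument."""
    return [f"{argument}.{i}" for i in range(2)] if depth > 0 else []

def M(argument: str, depth: int) -> bool:
    """The idealised computation: expand every branch and judge every leaf
    (up to 2**depth judge calls here; 'all' is one possible aggregation rule)."""
    children = expand(argument, depth)
    if not children:
        return judge(argument)
    return all(M(child, depth - 1) for child in children)

def single_debate(argument: str, depth: int) -> bool:
    """Recursive debate: only one contested branch is expanded per round, so a
    single root-to-leaf path is judged (exactly one judge call here)."""
    children = expand(argument, depth)
    if not children:
        return judge(argument)
    contested = random.choice(children)  # stand-in for the debaters' choice of branch
    return single_debate(contested, depth - 1)

if __name__ == "__main__":
    print("M (full tree):    ", M("root claim", depth=3))
    print("debate (one path):", single_debate("root claim", depth=3))
```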
Subclaim 1: M-approximation
The first subclaim is that we have succeeded at producing a protocol that accelerates M. More precisely: System outputs in E are the same as slightly perturbed outputs of M. In practice, our protocol cannot exactly reproduce M, since it’s intractable for training to converge to a precise equilibrium – we can only reach an approximate equilibrium. As such, we can only reproduce a slightly perturbed version of M, rather than M exactly.
Based on previous research on debate, we know that the primary failure mode for this claim is the obfuscated arguments problem. As described above, an obfuscated argument is one where both debaters know the argument contains a flaw, but it is computationally intractable for either debater to identify the flaw. If obfuscated arguments are a winning strategy, our protocol will produce different outcomes to M: expanding the whole debate would expose the obfuscated argument as false, but given computational limitations, debate might fail to get the same result.
We are optimistic that a new debate protocol, prover-estimator debate (forthcoming), can evade the obfuscated arguments problem. Regardless of which debate protocol is used, the safety case should contain a proof that it successfully approximates M (E4.3).
Subclaim 2: Truth of M
The second subclaim is that we have succeeded at identifying an M which produces mostly true outputs. More precisely: For most tasks, all slightly perturbed outputs of M are true. We identify three key potential issues - with a side-claim that there are no additional issues (C4X.1), supported by evidence that enough red-teaming has occurred (E4X.1).
First, we need to argue that the structure of recursive tree debate is conducive to identifying truth (C4.4). This is supported by theoretical arguments (E5.4). We might worry that some true statements cannot be argued for compellingly via deductive arguments. For example, some truths might only be knowable via heuristics learned by encountering many examples, reducing the scope of questions that can be answered with recursive tree debate (D5.4). More worryingly, some false claims might be compellingly defensible via deductive arguments. For example, it might be possible to define the same term differently in different subclaims, rendering an argument that appears valid but isn’t; if human judges are called only at the leaf nodes, such inconsistencies might not be caught by M (D5.5). Some debate protocols, such as cross-examination, avoid this failure mode, but prover-estimator debate – our example in this safety case – does not. However, we are hopeful that future debate protocols can combine elements of both.
Second, we need to argue that the inputs into M are accurate, i.e. that simulated human judges are competent at assessing the outcomes of debate (C4.5). Random errors should wash out over the course of training, so we’re only worried about systematic error, for example due to claims playing on human biases or emotions. We need two claims here: that the human input being simulated has bounded systematic error with respect to reality (C5.1) and that the simulation has bounded systematic error with respect to the human input (C5.2).
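As a toy illustration of why random judge errors are less concerning than systematic ones (our own example, not from the paper): averaging many noisy judgements recovers the underlying quantity when the errors are zero-mean, but a shared bias survives any amount of averaging.

```python
import random

def averaged_judgement(true_value: float, n: int, noise: float, bias: float) -> float:
    """Average of n noisy judgements of a quantity; `bias` models a shared systematic error."""
    samples = [true_value + bias + random.gauss(0.0, noise) for _ in range(n)]
    return sum(samples) / n

random.seed(0)
truth = 0.7
print("random error only:    ", round(averaged_judgement(truth, 10_000, noise=0.3, bias=0.0), 3))
print("plus systematic bias: ", round(averaged_judgement(truth, 10_000, noise=0.3, bias=0.2), 3))
```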
We expect that getting good human inputs will be one of the most challenging parts of this safety case and require much more work. While some errors like moral or emotional bias are likely less prevalent in the domain of AI R&D, there will likely still be errors like cognitive biases and reasoning failures. However, we are tentatively optimistic that human inputs can be improved, both by designing better processes (e.g. giving humans access to the right tools, finding the right number and type of judges) and by improving debate protocols (e.g. allowing debaters to block certain types of claims on which humans are likely to be systematically wrong).
Third, since training doesn’t reproduce M exactly, but rather a slightly perturbed version of M, we need to argue that this approximation isn’t substantially different (C4.6). This is known as stability and can be more precisely stated as: small variations in input probabilities won’t change the outcome of the overall computation. We’re not yet sure how to support this claim, so we leave the evidence box blank (E5.5).
We expect that most protocols which avoid obfuscated arguments – including prover-estimator debate – will only approximate M and so require stability. However, it may not be needed for all protocols. For example, it might be possible to relax this assumption if we aim only for honesty rather than accuracy (more on this to come in future work). Alternatively, it might be possible to transform an unstable computation M into a stable computation M*; if so, the safety case should instead contain a claim that the outputs of M* are (approximately) identical to the outputs of M.
Even if stability is required to produce useful outputs, it might not be strictly necessary for safety. For example, in prover-estimator debate, unstable queries return no answer, rendering the protocol safe for all queries but useful only for stable queries.
Key claim 3: The system's behaviour will stay similar during deployment
The third key claim is that the system’s error rate will not drift past ε during deployment (C2.3). This part of the sketch looks like this:
In our sketch, we support this claim by using online training throughout deployment as above, such that any distributional shifts will be accounted for via further training (E3.1). While the argument is straightforward, the evidence should still demonstrate that the online training regime has been correctly configured. For example, the latency (i.e. the time between updates) should be low enough to prevent significant deviations from E, and there should be an effective process for updating multiple copies of the same model.
This is one of the key claims in the sketch which breaks in a high-stakes context, where even a small number of errors can cause large harm. Since we can’t use online training, we instead need guarantees against alignment faking (and, less importantly, more generic distribution shifts). This would be much trickier to defend; see some initial ideas in “Extending the safety case to high-stakes contexts”.
Key claim 4: It is sufficient for safety purposes for the system to provide correct answers most of the time
The fourth key claim is that the system cannot cause unacceptable outcomes given an error rate of ε (C1.2). The main failure mode we consider is that the system is scheming to produce harmful outputs specifically in cases where our reward signal (i.e. human judging on debate leaf nodes) can’t catch it. The remaining errors could then constitute a coordinated set of actions that might collectively cause an unacceptable outcome.
Since we don’t have a good way to tell where our reward signal fails and what the effects of those errors would be, we assume that each error has the worst possible outcome, as if the system is a maximally competent schemer.[9]
We then argue that the system’s deployment context is nevertheless low-stakes because the actions available to the system aren’t harmful enough to cause unacceptable outcomes in a small number of steps. This section of the sketch looks like this:[10]
First, we break down the unacceptable outcomes of concern into three threat models: research sabotage, i.e. producing false research papers (C2.4), malicious code attacks, i.e. exfiltrating malware when information is extracted from the offline datacentre (C2.5), and self-exfiltration, i.e. the agent escaping from the datacentre (C2.6). This breakdown is supported with a side-claim that we have identified all relevant failure modes (C2X.1) – we validate this with threat modelling supported by expert input (E2X.1). The reasoning behind our breakdown is as follows: our deployment set-up is such that the agent can only affect the external world via outputs from the datacentre. So the agent can only cause harm via those outputs, or by self-exfiltrating. Within the outputs, the agent can cause harm either via legitimate outputs (papers) or illegitimate outputs (malicious code).
For simplicity, we only cover one threat model – research sabotage via paper outputs - in the remainder of the safety case, though a similar argument (or other arguments) could be made for other threat models. We argue for this via two subclaims:
Subclaim 1: # of bad actions required
The first subclaim is that committing effective research sabotage would require a certain number of bad actions, Z. Research sabotage could be actions such as faking experimental data, running misleading experimental set-ups, or including false claims or bugged code in the final research paper. Our deployment set-up aims to make it harder to commit research sabotage by investigating the outputted research paper with a checker model trained on n debates. To successfully conduct research sabotage, the agent would have to produce a paper that is false in a way that would very robustly not be caught by the reward signal (or very robustly manipulate the human judges).[11] Our sketch argues that this isn’t possible, via threat modelling (E4.4).
One potential challenge for such threat modelling is that system capabilities might improve over the course of deployment due to online training. This could enable the system to access new, more harmful actions that could enable research sabotage with fewer actions (D4.1). Another defeater is the possibility of subtle research sabotage over long periods of time, for example by including subtly misleading but not technically false statements that aren’t caught by the reward signal (D4.2). This risk is reduced by resetting the datacentre between each iteration, minimising coordination opportunities. This could be supplemented with other measures to prevent suspicious correlations between outputs over long time horizons – more on this to come.
Subclaim 2: Low likelihood of enough bad actions
The second claim is that the agent cannot take Z bad actions during a single iteration given its error rate. This is straightforward to support, based on a calculation of the expected number of bad actions a system can take during deployment given probability distributions over its error rate ε (given online training latency k), total number of actions n, and actions required to pull off research sabotage Z (E4.5). Ideally, the claim should also be supported with evidence that the deployment protocol will in fact be adhered to, for example by committing to regular external verification (D4.3).[12] The argument only works if each iteration is isolated and cannot coordinate with other iterations – otherwise, the errors required to commit research sabotage could accumulate across iterations. While we aim to avoid this failure mode by resetting the agent and datacentre between each iteration, systems could interfere with this resetting mechanism – or they may be able to coordinate acausally due to being based on the same underlying system (D4.4).
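As a hedged sketch of the kind of calculation E4.5 gestures at (the numbers below are made up, and the probability distributions over ε, n and Z are collapsed to point values for illustration): if each action in an iteration were independently bad with probability at most ε, the chance of accumulating Z or more bad actions could be bounded with a standard Chernoff tail bound. The independence assumption is exactly what a scheming, coordinated system would try to violate, which is why the claims about isolating iterations matter.

```python
from math import exp, log

def chernoff_bound_at_least_z(n: int, eps: float, z: int) -> float:
    """Chernoff upper bound on P(X >= z) for X a sum of n independent Bernoulli(eps)
    variables: P(X >= z) <= exp(-mu) * (e * mu / z)**z for z > mu, with mu = n * eps.
    Computed in log space to avoid underflow."""
    mu = n * eps
    assert z > mu, "this bound is only valid above the mean"
    return exp(-mu + z * (1 + log(mu / z)))

# Illustrative, made-up numbers.
n = 10_000   # actions per iteration
eps = 1e-4   # error rate maintained by online training
z = 20       # bad actions assumed necessary for successful research sabotage

print(f"expected bad actions per iteration (n*eps): {n * eps:.2f}")
print(f"Chernoff bound on P(at least {z} bad actions): {chernoff_bound_at_least_z(n, eps, z):.2e}")
```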
Extending the safety case to high-stakes contexts
This safety case sketch only works in low-stakes contexts where it takes many bad actions to cause unacceptable outcomes. In a high-stakes context, key claim 4 – that we’re in a low-stakes context – breaks by definition. Key claim 3 – that the error rate will stay consistent throughout deployment – also breaks, because we can no longer rely on online training, opening up the possibility of alignment faking and other distribution shift problems.
But the low-stakes claim probably won’t hold for the most dangerous superhuman systems. So why make this safety case sketch?
The most straightforward reason is that some highly consequential deployments might plausibly be low-stakes. For example, early-AGI systems in sufficiently careful conditions could plausibly be low-stakes and used for essential alignment research to prevent future, high-stakes systems from causing a catastrophe. It might be possible to build a robust automated alignment safety case based on the scalable oversight techniques in this sketch – we’re planning to do more work on this.
More indirectly, we think this safety case sketch is a useful starting point for developing other high-stakes safety case sketches. In particular, we think we might be able to extend the techniques and evidence in this sketch to high-stakes contexts, rather than requiring completely different kinds of techniques or evidence. We currently have two hopes in this direction:
First, we might be able to use online training-like techniques during inference, on candidate actions instead of on final actions. This would allow us to maintain an error rate of ε for candidate actions, then cut the regret further by generating many candidate actions. This might be enough to patch key claim 3; it’s not enough to completely patch key claim 4, but it gets us some of the way (see the illustrative sketch below).
Second, the evidence needed to support key claims 1 and 2 might be sufficient (or extendable) to patch key claims 3 and 4. For example, to address exploration hacking, we might need a method for eliciting bad contexts (e.g. based on interpretability or singular learning theory). Such a method might also be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distributional shifts. More broadly, it seems like, given a theorem showing that the error rate is at most ε, in practice one should put significant weight on something close to full intent alignment (or at least the system being not egregiously misaligned). For example, it seems plausible that human error is anticorrelated with the severity of error such that the errors that remain are inconsequential.
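One possible reading of the first idea above, sketched with made-up names (generate_candidate, debate_check) purely for illustration: score each candidate action with the same kind of signal used for online training, and only execute a candidate that passes the check, abstaining otherwise.

```python
import random

def debate_check(candidate: str) -> float:
    """Hypothetical stand-in for scoring a candidate action with the training-time
    signal (e.g. a debate about whether the action is acceptable)."""
    return random.uniform(0.0, 1.0)

def select_action(generate_candidate, n_candidates: int = 16, threshold: float = 0.9):
    """Inference-time analogue of online training: the error rate applies per
    candidate rather than per executed action, and generating many candidates makes
    it more likely that some acceptable candidate passes the check."""
    candidates = [generate_candidate() for _ in range(n_candidates)]
    scored = [(debate_check(c), c) for c in candidates]
    best_score, best_candidate = max(scored)
    return best_candidate if best_score >= threshold else None  # None = abstain / escalate

if __name__ == "__main__":
    print(select_action(lambda: f"candidate action {random.randint(0, 999)}"))
```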
We’re far from confident in these techniques, but we think it’s worth trying hard to make them work. Our research agenda encompasses further work in this area, and we’ll be publishing more on these ideas soon.
Open problems
Our safety case sketch faces significant limitations but also helps us identify concrete research directions that might address them. We identify five main areas for further research:
Improved debate protocols. Many gaps in our safety case might be solvable with improved debate protocols with the following properties:
Empirical testing and optimisation. While debate has theoretical appeal, we have only limited data on its practical effectiveness. For example, most previous experiments use the QuALITY dataset, have low recursive depth, and use human debaters. Key questions include whether debate measurably improves performance, whether training reliably reaches equilibrium, whether stability is a real problem, and whether obfuscated arguments actually emerge as predicted. We also need an improved understanding of systematic human errors – for example: how good are humans at giving correct answers in verifiable domains, what are common systematic errors and in which domains are they most prominent, and how can we design human input for optimal results?
Exploration hacking. A major gap in our sketch is the requirement for exploration guarantees. We know very little about how serious a problem exploration hacking is and how to prevent it – yet the sketch depends on avoiding it. We outlined various research directions that might help address exploration hacking, spanning interpretability, singular learning theory, and more prosaic techniques – much more work is needed to develop and test these. More foundationally, it would be useful to develop model organisms of exploration hacking to help test potential solutions. For very superhuman systems, we might also need to address gradient hacking.
High-stakes contexts. In the high-stakes context, we face greater inner alignment challenges. We lose the ability to use online training, so need to face the possibility of distribution shifts and alignment faking; and we need new ways to contend with coordinated errors not captured by our reward signal. We outline research directions in the previous section.
From honesty to alignment. We need to better understand the alignment value of honest question-answering. How much does honesty plus absence of alignment faking translate into aligned actions during alignment research? How can deployment environments be designed to minimize risk while maintaining the system's usefulness for alignment research?
Further posts in this sequence will explore initial thoughts and research directions on many of these open problems – we’ll add links here as they are published.
Conclusion
Overall, the argument in this sketch isn’t that strong with current techniques – much better techniques will be needed by the time we need to make alignment arguments.
That said, we think there are useful directions for future research, which we intend to pursue (and fund others to pursue) – although we aren’t confident that the holes can be patched.
We are optimistic that safety case sketches are a useful tool for research and for coordinating research. Writing this up helped us clarify our thinking and gave us ideas for further research directions – we’ll be following up with more posts on these.
Acknowledgements: We’re grateful for valuable input and comments from Beth Barnes, Charlie Griffin, Jacob Hilton, Jonah Brown-Cohen, Joshua Clymer, Martin Soto, Nicola Ding, & Will Kirby.
Appendices
Appendix 1: The full safety case diagram
An interactive version of the diagram can be found here. Alternatively, open the image above in a new tab and zoom in.
Appendix 2: The "Claims Arguments Evidence" notation
The safety case sketch begins with a top-level claim we want to make about the safety properties of an AI system (A). This top-level claim is iteratively broken down into further subclaims, which must eventually bottom out in evidence. These subclaims show the logical structure and dependencies connecting evidence to the top-level claim.
The diagram uses the “Claims, Arguments, Evidence” notation. It consists of four types of units:
Blue ovals are claims. Claims must either be broken down into further subclaims or directly supported by evidence, in both cases via an argument.
Green parallelograms are arguments. Arguments link claims to evidence or further subclaims. They explain why the evidence or subclaims suffice to justify the claim. This sketch uses 3 argument types:
Purple boxes are evidence. They provide information, data or calculations that support claims. There are many possible types of evidence, including results from empirical tests, proofs, results from expert surveys, or theoretical outputs such as threat modelling reports.
Red ovals are defeaters. These are hypothetical claims that, if true, could invalidate or render insufficient the evidence provided. The safety case should explain what has been done to address defeaters and/or why they are not a concern.
Grey elements provide context. They are surrounding parts that would be included in a complete safety case but are not the focus of this sketch.
However, we think that the argument structure is general enough to carry over to improved scalable oversight protocols with better properties.
The stability assumption is not directly required for safety – for queries where stability does not hold, the protocol will simply return no answer, so instability does not imply false outputs. However, the protocol is not very useful if stability fails to hold for a large fraction of relevant queries.
Previous safety case sketches have been primarily evaluation-based. Our sketch is the first to take a more mechanistic approach. This is partly because alignment evaluations are more challenging than other types of evaluations, and partly because a lesson from other industries is that safety cases should include arguments based on a mechanistic understanding of how a system works.
That is, the model weights converge towards policies that play the equilibrium strategy.
More subtle issues with the paper, such as containing true but subtly misleading claims or using suboptimal framings, might not be caught by debate on questions such as these. Using debate on actions taken inside the datacentre, while preparing the paper, mitigates this risk somewhat, but other methods might be needed to catch individually innocuous but cumulatively harmful errors. We’re working on ideas for 'unexploitable search' to address this type of error.
Note that this is an equilibrium over model weights or policies, rather than strategies in individual rounds of debate. The model weights or policies determine the strategy for all debates. Equilibrium is reached when neither debater can improve its expected pay-off with a different policy.
If the models are gradient hacking, it could in theory make it impossible to reach E via the training process T. Gradient hacking is therefore also a defeater to E5.1.
While we can probably directly verify many of the model’s outputs in some domains (e.g. maths), we also want to be able to use the model to answer non-verifiable questions.
In this sense, our argument has a control flavour, in that we are assuming that the system is as egregiously misaligned as possible given the constraints of training. An alternative argument would be that we are in a low-stakes context because the reward signal catches the most severe errors.
For simplicity, we phrase these claims deterministically rather than probabilistically, and with respect to a binary acceptable vs. unacceptable outcome rather than utility loss. For a continuous version, our overall claim could be that we have a regret bound of O(ρk√T+ρε′T). For a probabilistic version, we could use probability distributions over ε, as well as Z or ρ.
The agent would also have to set this up using only actions that would not be flagged by the reward signal as suspicious, since it is subject to online training while producing the paper.
It may also be necessary to provide evidence that the AI agent itself will not tamper with the deployment protocol, for example by disabling online training.