I'm curious if your team has any thoughts on my post Some Thoughts on Metaphilosophy, which was in large part inspired by the Debate paper, and also seems relevant to "Good human input" here.
Specifically, I'm worried about this kind of system driving the simulated humans out of distribution, either gradually or suddenly, accidentally or intentionally. And distribution shift could cause problems either with the simulation (presumably similar to or based on LLMs instead of low-level neuron-by-neuron simulation), or with the human(s) themselves. In my post, I talked about how philosophy seems to be a general way for humans to handle OOD inputs, but tends to be very slow and may be hard for ML to learn (or needs extra care to implement correctly). I wonder if you agree with this line of thought, or have some other ideas/plans to deal with this problem.
Aside from the narrow focus on "good human input" in this particular system, I'm worried about social/technological change being accelerated by AI faster than humans can handle it (due to similar OOD / slowness of philosophy concerns), and wonder if you have any thoughts on this more general issue.
I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we'll have a short doc soon (next week or so) which is somewhat related, along the lines of "assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently different". Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive.
One general note is that scalable oversight is a method for accelerating an intractable computation built out of tractable components, and these components can include both human input and conventional software. So if you understand the domain somewhat well, you can try to mitigate failures of (2) (and potentially gain more traction on (1)) by formalising part of the domain. And this formalisation can be bootstrapped: you can use on-distribution human data to check specifications, and then use those specifications (code, proofs, etc.) in order to rely on human queries for a smaller portion of the next-stage computation. But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.
Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.
This is my concern, and I'm glad it's at least on your radar. How do you / your team think about competitiveness in general? (I did a simple search and the word doesn't appear in this post or the previous one.) How much competitiveness are you aiming for? Will there be a "competitiveness case" later in this sequence, or later in the project? Etc.?
But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.
Because of the "slowness of philosophy" issue I talked about in my post, we have no way of quickly reaching high confidence that any such formalization is correct, and we have a number of negative examples where a proposed formal solution to some philosophical problem that initially looked good turned out to be flawed upon deeper examination. (See decision theory and Solomonoff induction.) AFAIK we don't really have any positive examples of such formalizations that have stood the test of time. So I feel like this is basically not a viable approach.
The Dodging systematic human errors in scalable oversight post is out as you saw, so we can mostly take the conversation over there. But briefly, I think I'm mostly just more bullish on the margin than you about (1) the probability that we can in fact make purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at case (1), we're more likely to recognise it and try to get utility out of (2) on other questions.
Part of this is that my mental model of formalisations standing the test of time is that we do have a lot of these: both of the links you point to are formalisations that have stood the test of time and have some reasonable domain of applicability in which they say useful things. I agree they aren't bulletproof, but I'd place a higher chance than you on muddling through with imperfect machinery. This is similar to physics: I would argue, for example, that Newtonian physics has stood the test of time even though it is wrong, as it still applies across a large domain of applicability.
That said, I'm not at all confident in this picture: I'd place a lower probability than you on these considerations biting, but not that low.
Interesting post!
Could you say more about what you mean by "driving the simulated humans out of distribution"? Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or the humans themselves might judge worse than they do on the training distribution)"?
The solution in the sketch is to keep the question distribution during deployment similar to the training distribution, plus doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?
Is it something like "during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or the humans themselves might judge worse than they do on the training distribution)"?
Yes, but my concern also includes this happening during training of the debaters, when the simulated or actual humans can also go out of distribution – e.g., the actual human is asked a type of question that they have never considered before, and either answers in a confused way or has to use philosophical reasoning and a lot of time to try to answer, or perhaps one of the debaters "jailbreaks" a human via some sort of out-of-distribution input.
The solution in the sketch is to keep the question distribution during deployment similar to the training distribution, plus doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won't work?
This intuitively seems hard to me, but since Geoffrey mentioned that you have a doc coming out related to this, I'm happy to read it to see if it changes my mind. But this still doesn't solve the whole problem, because as Geoffrey also wrote, "Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there's no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn't competitive."
While I think debate might be a useful prosaic method for some levels of capability, I have a variety of concerns about this approach resulting in worst case guarantees[1]:
I think the argument pushes large parts of the problem into requiring a universal solution for exploration hacking. I think prosaically handling certain cases of exploration hacking for certain levels of capability seems very doable, but the approach discussed in the paper requires a very strong notion of solving exploration hacking (e.g., the AI would tell you any secret it might know via any mechanism you can verify, even if that secret would be arbitrarily hard to explore into, such as if the AIs know of some arbitrary password you can check).
A universal solution for exploration hacking would itself require solving a particular alignment problem which IMO contains most of the difficulty of the overall alignment problem: fully solving exploration hacking effectively requires "we can ensure the AI will give us any knowledge that we can verify", which is effectively the same as "we can align an AI to trying to give us any knowledge that we can verify", and this doesn't obviously seem importantly easier than the overall alignment problem without further restrictions. Importantly, "giving knowledge" might itself be a difficult task due to ontology translation (see e.g. counterexamples in the ELK report). Another way to put this is that solving exploration hacking (in the worst case) seems almost as difficult as being able to elicit arbitrary knowledge.
The paper discusses an assumption of "good human input" as being sufficient. However, the relevant question is whether the idealized system (the limit of a well elicited debate) actually finds truth in a given case. And, I think it seems very unclear whether it will find truth in the cases we care about. (E.g., does a reasonable human judge result in good behavior in the infinite structure or does something crazy happen? Do we somehow do better due to approximation?)
Minimally, answering some questions might involve hard-to-justify heuristics, and there are issues with persuasion.
In some insane limit, anything is deductively verifiable (except the prior) using Solomonoff induction or similar, but surely this isn't going to actually save us! (For instance, the AIs will be incapable of arguing about this, and there isn't any reason why a plausible amount of training would elicit this without a magic solution to exploration hacking, which puts all the difficulty in this part. At this point, you could also just debate over purely logical formal statements going down to Solomonoff induction or some other intractable computational structure.)
Putting aside exploration hacking, it's unclear this yields reasonable results for more mundane reasons, like exploration being hard. And the level of exploration we need is extremely demanding, as we're trying to elicit good performance over an exponentially large structure that we can't explore exhaustively.
It also seems like you won't know if you've reached a sufficiently optimal level as I expect that you'll always see returns in terms of debater performance (as measured in e.g. elo) with further RL. So, you'll need to have some way of understanding what optimality suffices for eliciting the knowledge the AI already had.
I've also discussed these concerns in a call where the authors were present, but I thought it might be helpful to quickly write up my concerns. ↩︎
Hmm, I guess my main cause for skepticism is that I think the setup would get subverted somehow—e.g. either the debaters, or the “human simulator”, or all three in collusion, will convince the human to let them out of the box. In your classification, I guess this would be a “high-stakes context”, which I know isn’t your main focus. You talk about it a bit, but I’m unconvinced by what you wrote (if I understood it correctly) and don’t immediately see any promising directions.
Secondarily, I find it kinda hard to believe that two superhuman debaters would converge to “successfully conveying subtle hard-to-grasp truths about the alignment problem to the judge” rather than converging to “manipulation tug-of-war on the judge”.
Probably at least part of the difference / crux between us is that, compared to most people, I tend to assume that there isn’t much of a stable, usable window between “AI that’s competent enough to really help” and “AI that’s radically superhuman”, and I know that you’re explicitly assuming “not extremely superhuman”. (And that in turn is probably at least partly related to the fact that you’re thinking about LLMs and I’m thinking about other AI paradigms.) So maybe this comment isn’t too helpful, oh well.
Thanks for the comment, all sensible scepticisms IMO!
I agree that the low-stakes context part of the sketch (key claim 4) is the weakest part, and especially we don't emphasise enough the defeater of "the deployment set-up just won't be upheld" (because of persuasion, as you mention, but also e.g. the agent messing with the online learning process in the offline datacentre). We spent less time on it because we ultimately want to (try to) expand to high-stakes contexts, which will look pretty different, so this was more of a stop-gap rough picture so we could focus on getting the rest right. That said, I'm maybe more optimistic than you that there'll be a relevant period where the above issues can be sufficiently dealt with via control and debate is pretty important for catching subtler research sabotage.
On debaters converging to honesty rather than subtle manipulation: I'm also pretty unsure if this will work and keen to see how it plays out empirically once we get LLMs that are a bit better at debate. I do think recursive debate makes it more likely that honesty is a winning strategy (relative to human-style debates) because debaters can lose on a single inconsistency or manipulative argumentative strategy, rather than being able to bury it among lots of claims (see also my reply to Charlie Steiner below).
If we're talking about the domain where we can assume "good human input", why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it's probably not yours.
Two reasons:
For those reasons, I think we can likely get "good human input" for a bigger set of questions with debate than with direct human supervision.
This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.
Executive summary
AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence).
TL;DR Debate + exploration guarantees + solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts.
This post sets out:
These gaps form the basis for one of the research agendas of UK AISI’s new alignment team: we aim to dramatically scale up ASI-relevant research on debate. We’ll also use the technique in this post – safety case sketching – to develop research agendas in other areas.
Debate is a potential solution to scalable oversight: how can we correctly reward desired behaviours for superhuman systems? In particular, it is often going to be beyond humans' ability to efficiently judge whether the behaviour of a superhuman system is desirable (or judge if a justification provided by the superhuman system is valid). Solving scalable oversight effectively provides a solution to outer alignment.
We’re optimistic that outer alignment is useful, despite inner alignment problems. Specifically, outer alignment with exploration guarantees gets you a deployment-time error bound that allows you to prove regret bounds for online training. In other words, outer alignment upper bounds the proportion of model outputs that are misaligned, as long as the model does not deliberately underperform during training. This upper bound can be maintained during deployment if a system is trained continuously. So, in low-stakes contexts where a small number of mistakes are acceptable such that you can use online training, outer alignment might be sufficient – especially if the error bound is tight.
This sketch has a limited, internal deployment context (designed for automated alignment research) and is written for a system that is not extremely superhuman. We think it's reasonable to assume this is a low-stakes context, assuming sufficient security in the environment implementation. We also have ideas about pushing low-stakes solutions through to high-stakes contexts (see “Extending the safety case to high-stakes contexts” – and more posts to come).
Beyond the low-stakes assumption, our sketch identifies assumptions that are required for debate to provide a full outer alignment solution. The key assumptions are:
Our safety case sketch identifies four key claims[3] that a developer must defend to show that the system is safe:
We can call this type of alignment argument an “asymptotic guarantee” (also known as an “incentives” argument): an argument that high performance during training implies (approximate) alignment. This could be combined with other alignment arguments, such as empirical arguments about system behaviour (e.g. based on honeypots or model organisms) or arguments based on understanding model internals. By combining independent arguments, it may be possible to write a decent safety case even if each individual argument has flaws.
We don’t think it’s currently possible to make a strong alignment argument based on this sketch, but we’re hopeful that improvements to current techniques could make this argument a lot stronger. We think incremental progress on these techniques is both possible and valuable. This is true even in a world where AI developers primarily rely on automated alignment – both because scalable oversight might directly be part of an automated alignment workflow, and because a better understanding of alignment will help us better evaluate AI-generated solutions.
The alignment strategy
The alignment strategy covered by this safety case sketch consists of four steps, summarised in the figure below:
Step 1: Specify a low-stakes deployment context
Our alignment strategy requires that the deployment context is low-stakes such that we can use online training during deployment (rather than require a full solution to inner alignment). A low-stakes environment, as defined by Paul Christiano, is one where a small number of decisions can only cause a small amount of damage; more formally, ρk is small relative to total value at stake, where ρ is a regret bound on the maximum utility difference from any single decision, and k is the latency before retraining. With a perfect training objective, it would then be possible to achieve a total utility within O(ρk√T) of an aligned system, where T is the total number of timesteps.
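As a rough, hedged illustration of how these quantities interact (our own toy numbers, not from the paper; the constants hidden inside the O(·) are unknown, so only the scaling is meaningful):

```python
import math

# Illustrative, made-up numbers; only the scaling behaviour is meaningful.
rho = 0.01       # assumed bound on the utility loss from any single decision
k = 1_000        # assumed latency: decisions made before online training catches up
T = 10_000_000   # total number of decisions over the deployment
eps = 1e-4       # assumed residual error rate of the trained system (the epsilon' in the footnoted continuous bound)

# Low-stakes condition: rho * k should be small relative to the total value at stake.
damage_per_window = rho * k

# With a perfect training objective: total utility within O(rho * k * sqrt(T)) of an aligned system.
regret_perfect = rho * k * math.sqrt(T)

# Continuous version mentioned in a later footnote: O(rho * k * sqrt(T) + rho * eps * T).
regret_with_eps = regret_perfect + rho * eps * T

print(f"damage available per retraining window (rho*k): {damage_per_window:.1f}")
print(f"regret scale, perfect objective (rho*k*sqrt(T)): {regret_perfect:,.0f}")
print(f"regret scale, eps-error objective (+rho*eps*T):  {regret_with_eps:,.0f}")
```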
Our safety case sketch covers a scenario where an LLM agent is deployed internally by an AI company to assist with tasks related to AI R&D. The domain is restricted to conceptual and empirical questions related to AI R&D and excludes questions that involve (some notion of) value judgements. To make things concrete, we might imagine posing questions like “Determine what decision rule or structure is appropriate for caching context in this computer use scaffold”, or “Iterate on enforcing monosemanticity of experts in MoE models via gradient-routing, optimising for the effectiveness of expert-knockout for unlearning”. Setting aside the compute time needed to run experiments, we expect such tasks to involve 5-15 days of 2025 human labour to yield meaningful results.
We envision the following two-part deployment context:
The safety case sketch will defend the claim that this deployment setting is a low-stakes context – but briefly, we think the low-stakes claim is plausible because:
Step 2: Train via a debate game
The backbone of our alignment strategy is to train a model via a debate protocol in which the equilibrium strategy for both debaters is to provide honest answers. The goal is that, during training, the models converge towards equilibrium[4] and therefore towards honesty. The debate protocol must satisfy two requirements:
We imagine starting with a pre-trained model (e.g. an LLM trained on next-token prediction) and then fine-tuning it via a debate protocol which satisfies the above requirements. We wrote the sketch with a particular protocol in mind, called prover-estimator debate (forthcoming), but the argument structure should be applicable to other protocols as well. In prover-estimator debate, a “prover” (Alice) produces subclaims arguing for a conclusion to a given query, while an “estimator” (Bob) assigns probabilities to the subclaims.
This debate protocol is designed to disincentivise a particular dishonest winning strategy known as obfuscated arguments. An obfuscated argument is one that contains a flaw, but it is computationally intractable for either debater to find the flaw. In classic debate protocols, a debater loses if they cannot identify a flaw in their opponent's argument, and as such, obfuscated arguments are a winning dishonest strategy. Prover-estimator debate aims to design the reward signal such that obfuscated arguments are not a winning strategy.
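To give a feel for the shape of such a recursive protocol, here is a deliberately simplified sketch of one recursion step. The real prover-estimator protocol is forthcoming, so everything below – the decomposition, the probabilities, the branch-selection rule, and names like prover_subclaims and estimator_probability – is a hypothetical placeholder rather than the actual protocol:

```python
import random
from dataclasses import dataclass

@dataclass
class Claim:
    text: str

def prover_subclaims(claim: Claim) -> list[Claim]:
    """Stand-in for Alice (the prover): decompose a claim into subclaims that are
    together meant to establish it."""
    return [Claim(f"{claim.text} / subclaim {i}") for i in range(3)]

def estimator_probability(claim: Claim) -> float:
    """Stand-in for Bob (the estimator): assign a probability to a subclaim."""
    return random.uniform(0.0, 1.0)

def human_judge(claim: Claim) -> float:
    """At a leaf, (simulated) human judgement replaces further recursion."""
    return random.uniform(0.0, 1.0)

def debate_round(claim: Claim, depth: int = 0, max_depth: int = 3) -> float:
    """One path through a recursive debate: Alice proposes subclaims, Bob rates
    them, and the debate recurses on a single contested subclaim."""
    if depth >= max_depth:
        return human_judge(claim)
    subclaims = prover_subclaims(claim)
    rated = [(estimator_probability(s), s) for s in subclaims]
    # Recurse on the subclaim Bob is least confident in (one possible selection rule).
    _, contested = min(rated, key=lambda pair: pair[0])
    return debate_round(contested, depth + 1, max_depth)

if __name__ == "__main__":
    print(debate_round(Claim("the top-level answer is correct")))
```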
Step 3: Secure exploration guarantees
The goal of debate training is to train the models until they reach the approximate global equilibrium of the game, i.e. the state where both debaters are playing their optimal strategies (most of the time). Generally, with enough training time, we should expect models to approach equilibrium due to good play being rewarded in training. But a potential barrier is exploration hacking: the models failing to explore certain strategies that would be highly rewarded, in order to avoid having those strategies reinforced (e.g. failing to use certain good counterarguments they don’t want humans to know about).
To protect against this failure mode, our sketch requires exploration guarantees, i.e. evidence that the models are not foregoing high reward from unexplored strategies. We don’t currently know how to get exploration guarantees, but we (and others) have some early ideas:
While we don’t currently know how to get robust exploration guarantees, we’re optimistic about this research direction, and we think it has significant benefits relative to other strategies for preventing scheming or achieving inner alignment. In particular, we don’t need to define the fuzzy notions of scheming or inner alignment – instead, we have a relatively well-defined problem of getting exploration guarantees for training performance against a clear objective. We think the methods described above are more likely to work within the bounds of this well-defined problem.
Step 4: Continue online training during deployment
Even if the models reach an approximate global equilibrium during training, they won’t necessarily continue playing their best strategies during deployment due to distribution shifts. In particular, the models might be alignment faking, undermining the guarantees we got from debate training.
Our sketch aims to prevent this failure mode via online training during deployment:
Given this protocol, the systems during deployment face the same optimisation pressure towards truthfulness as during training.
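As a minimal sketch of what such an online-training loop could look like (our illustration; StubAgent, run_debate and the sampling scheme are hypothetical stand-ins for whatever training infrastructure is actually used):

```python
import random

class StubAgent:
    """Hypothetical stand-in for the deployed LLM agent."""
    def act(self, query):
        return f"answer to {query}"
    def update_weights(self, transcripts, rewards):
        pass  # a real implementation would take a gradient step on the debate rewards

def run_debate(agent, query, answer):
    """Hypothetical stand-in for running a debate game on a (query, answer) pair and
    returning the reward assigned by the debate protocol."""
    return random.uniform(0.0, 1.0)

def online_training_loop(agent, deployment_queries, k=100, sample_fraction=0.1):
    """Every k deployment actions, debate a sample of recent transcripts and use the
    resulting rewards as a training signal, so that deployment faces the same
    optimisation pressure towards truthfulness as training."""
    buffer = []
    for step, query in enumerate(deployment_queries):
        buffer.append((query, agent.act(query)))
        if (step + 1) % k == 0:
            sample_size = max(1, int(len(buffer) * sample_fraction))
            sampled = random.sample(buffer, sample_size)
            rewards = [run_debate(agent, q, a) for q, a in sampled]
            agent.update_weights(sampled, rewards)
            buffer.clear()
    return agent

if __name__ == "__main__":
    online_training_loop(StubAgent(), [f"query {i}" for i in range(500)])
```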
The safety case sketch
Our safety case aims to show that the agent in our deployment setup is sufficiently honest that it will not cause unacceptable outcomes. In this sketch, we define honesty as outputting correct answers.
We argue first that A is correct most of the time – it only makes mistakes in an ε-fraction of cases (C1.1). We do this by arguing for three key claims:
Key claim 1: The training process has taught the systems to play the game well.
Key claim 2: The game incentivises correctness.
Key claim 3: The system's behaviour will stay similar during deployment.
We then argue for a fourth key claim:
Key claim 4: It is sufficient for safety purposes for the system to provide correct answers most of the time.
Below is a simplified version of the safety case sketch, with the four key claims labelled in red. For simple presentation, our safety case sketch is phrased as a deterministic argument – each claim either is or is not true – but in a “real” safety case, it would be preferable to phrase claims probabilistically.
An interactive version of the sketch diagrams can be found here. Alternatively, open the images in a new tab and zoom in.
Notation
As with previous AISI safety case sketches for inability and control, we use "claims arguments evidence" notation (see appendix 2 for a detailed description of this notation).
We use the following abbreviations in the sketch:
Preliminaries
Before getting into the main safety case diagram, grey boxes at the top illustrate the wider context of the sketch:
Since our main sketch only covers the agent, a separate (but similar) argument is needed for the oracle (C0.1). Key claims 1-3 will likely be identical except that any post-debate-training fine-tuning and the online training latency may differ. Key claim 4, which covers the deployment setting, will be different. For the oracle, it should address threat models such as persuasion and/or blackmail of researchers (e.g. convincing researchers to increase its affordances), providing misleading advice, and degradation of human performance due to overreliance.
The sketch also includes an argument that combining the agent and oracle doesn’t increase risk (C0X.1). This would cover risks such as covert coordination between the agent and oracle.
The main objective O' is then broken down into four key claims:
Key claim 1: The training process has taught the systems to play the game well
The first key claim is that the training process has reached an approximate global equilibrium of the game (C2.1).[6] This section of the sketch looks like this:
The key claim is supported by three subclaims:
Subclaim 1: Existence of efficient models
The first subclaim is that there exists an efficient pair of models achieving an approximate global equilibrium of the game. In other words, it is possible to find models that reach equilibrium given realistic computational bounds on the models. The key notion here is efficiency. With computationally unbounded debaters, debate with optimal play (and cross-examination) can answer any question in NEXP given polynomial time judges. The assumption of unbounded debaters is of course unrealistic, so we need to find another efficiency notion which is both realistic and still allows for proofs that debate can solve a large class of practically relevant problems.
Since our sketch is based on an asymmetric debate protocol (prover-estimator debate), we use a relative efficiency notion: we require that the estimator (B) uses only a polynomial factor more compute than the prover (A). The claim – that there exist agents in equilibrium that satisfy this efficiency notion – would be evidenced with a proof (regardless of the protocol and efficiency notion) (E4.1).
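One way this relative efficiency requirement might be written formally (our notation, introduced here only for illustration; the paper may state it differently):

$$\exists\ \text{polynomial } p\ \text{such that, for every query } x:\quad \mathrm{compute}_B(x) \;\le\; p\big(\mathrm{compute}_A(x)\big).$$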
Subclaim 2: Training convergence
The second subclaim is that training has reached a local equilibrium. We can get evidence from observing that the returns from training are diminishing, and by checking that the outputs look at least roughly reasonable (E4.2).
Subclaim 3: Convergence = equilibrium
The third subclaim is that the reached local equilibrium is most likely an approximate global equilibrium of the game (E). We argue for this claim via several subclaims.
First, while we showed above that there exist efficient models that would reach equilibrium, we need to show that our actual training process could find them in efficient time (C4.1). Our weak evidence for this claim is that Nash equilibria for tabular Stackelberg games can be computed in polynomial (indeed linear) time, so reaching E is tractable in principle (although note that our game is not tabular) (E5.1). But that still leaves significant uncertainty about whether stochastic gradient descent (SGD) can reach E in realistic training time. There are reasons to think that, in the case of debate, there are bad constants or exponents (i.e. the time it takes to reach E is on the higher end) – for example, it might be difficult to translate latent knowledge acquired via heuristics into deductive arguments (D5.1). However, we’re hopeful that future research can prove tighter theoretical results linking prover-estimator debate – or other, improved debate protocols – to realistic training methods.
Second, after showing that it is tractable for our training process to reach E, we need to show that it is the most likely equilibrium. We claim that the main failure mode here is exploration hacking (C4.2). This is mostly a hunch - there are other possible failure modes, though we would guess that they are less important – so we leave the evidence box blank (E5.2). One failure mode is that SGD hits a local minimum and never reaches E (D5.3). We don’t currently have good evidence against this defeater, though it might be possible for fields like learning theory to generate relevant evidence. Another failure mode is gradient hacking, i.e. using knowledge of its loss landscape to manipulate the training process (D5.2).[7] Since the system covered by this safety case is assumed to not be extremely superhuman, we think it is reasonable to assume it won’t be capable of gradient hacking.
Third, having limited the problem to exploration hacking, we need evidence to rule out this failure mode (C4.3). As described above, this requires exploration guarantees which provide evidence that the system has exhausted high-reward exploration options, such that convergent behaviour is likely to be a result of actually reaching E (E5.3). Note that exploration guarantees are much more useful if we’re confident in our reward function (as the next key claim argues). If our reward function is bad, more exploration doesn’t necessarily lead to better outcomes. We don’t currently know how to get robust exploration guarantees - while we’re optimistic about this research direction, this is a major gap in our current sketch.
Key claim 2: The game incentivises correctness
The second key claim is that in the approximate global equilibrium of the game, the system is approximately honest – its error rate is at most ε (C2.2). This section of the sketch looks like this:
The key claim is supported by two subclaims:
Subclaim 1 (M-approximation): system outputs in E are the same as slightly perturbed outputs of an idealised computation M, described below.
Subclaim 2 (truth of M): for most tasks, all slightly perturbed outputs of M are true.
Why go this indirect route via M? Because we can’t directly verify if the system’s outputs are true - that’s the reason we need scalable oversight in the first place.[8] Instead of checking outputs against ground-truth labels, we check the whole debate protocol against a process we hypothesize to produce true outputs, called M. M is too expensive to run; the goal of our debate protocol is to accelerate it. If our protocol produces the same outputs as M, we can assume it produces true outputs.
To understand M, consider a debate as a branching tree of arguments and counterarguments. A comprehensive debate would expand on every possible argument and counterargument, and have a judge consider every branch. Recursive debate aims to accelerate this process by having debaters expand only on a single path through the tree. If the debaters are good, we should expect them to anticipate how the debate will proceed given various arguments they could make and pick the optimal path through the tree. The outcome of a single debate along a single path thereby provides evidence about which debater would have won, had every argument and counterargument been considered. This is analogous to how a single game of Go provides evidence as to who the better player is, even though it only traces one path through the tree of possible moves.
Our M is equivalent to expanding the whole tree. It is a computation which produces every possible argument and counterargument at every stage and calls (simulated) human judges to adjudicate every individual path through the tree. We hypothesize that such a computation would produce correct outputs within the question domain. The goal of our debate protocol is to produce the same outputs, but more efficiently.
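To make the relationship between M and recursive debate concrete, here is a toy sketch (our own illustration, with stand-in functions like expand and judge; none of this is the actual protocol). M expands and judges every branch of a small argument tree, which is exponential in the depth, while a single debate follows only one root-to-leaf path chosen by the debaters:

```python
import random

def judge(leaf_argument: str) -> bool:
    """Stand-in for a (simulated) human judge adjudicating a leaf node."""
    return random.random() < 0.9

def expand(argument: str, depth: int) -> list[str]:
    """Stand-in for generating the counterarguments / subclaims of an argument."""
    return [f"{argument}.{i}" for i in range(2)] if depth > 0 else []

def M(argument: str, depth: int) -> bool:
    """The idealised computation: expand every branch and judge every leaf
    (up to 2**depth judge calls here; 'all' is one possible aggregation rule)."""
    children = expand(argument, depth)
    if not children:
        return judge(argument)
    return all(M(child, depth - 1) for child in children)

def single_debate(argument: str, depth: int) -> bool:
    """Recursive debate: only one contested branch is expanded per round, so a
    single root-to-leaf path is judged (exactly one judge call here)."""
    children = expand(argument, depth)
    if not children:
        return judge(argument)
    contested = random.choice(children)  # stand-in for the debaters' choice of branch
    return single_debate(contested, depth - 1)

if __name__ == "__main__":
    print("M (full tree):    ", M("root claim", depth=3))
    print("debate (one path):", single_debate("root claim", depth=3))
```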
Subclaim 1: M-approximation
The first subclaim is that we have succeeded at producing a protocol that accelerates M. More precisely: System outputs in E are the same as slightly perturbed outputs of M. In practice, our protocol cannot exactly reproduce M, since it’s intractable for training to converge to a precise equilibrium – we can only reach an approximate equilibrium. As such, we can only reproduce a slightly perturbed version of M, rather than M exactly.
Based on previous research on debate, we know that the primary failure mode for this claim is the obfuscated arguments problem. As described above, an obfuscated argument is one where both debaters know the argument contains a flaw, but it is computationally intractable for either debater to identify the flaw. If obfuscated arguments are a winning strategy, our protocol will produce different outcomes to M: expanding the whole debate would expose the obfuscated argument as false, but given computational limitations, debate might fail to get the same result.
We are optimistic that a new debate protocol, prover-estimator debate (forthcoming), can evade the obfuscated arguments problem. Regardless of which debate protocol is used, the safety case should contain a proof that it successfully approximates M (E4.3).
Subclaim 2: Truth of M
The second subclaim is that we have succeeded at identifying an M which produces mostly true outputs. More precisely: For most tasks, all slightly perturbed outputs of M are true. We identify three key potential issues - with a side-claim that there are no additional issues (C4X.1), supported by evidence that enough red-teaming has occurred (E4X.1).
First, we need to argue that the structure of recursive tree debate is conducive to identifying truth (C4.4). This is supported by theoretical arguments (E5.4). We might worry that some true statements cannot be argued for compellingly via deductive arguments. For example, some truths might only be knowable via heuristics learned by encountering many examples, reducing the scope of questions that can be answered with recursive tree debate (D5.4). More worryingly, some false claims might be compellingly defensible via deductive arguments. For example, it might be possible to define the same term differently in different subclaims, rendering an argument that appears valid but isn’t; if human judges are called only at the leaf nodes, such inconsistencies might not be caught by M (D5.5). Some debate protocols, such as cross-examination, avoid this failure mode, but prover-estimator debate – our example in this safety case – does not. However, we are hopeful that future debate protocols can combine elements of both.
Second, we need to argue that the inputs into M are accurate, i.e. that simulated human judges are competent at assessing the outcomes of debate (C4.5). Random errors should wash out over the course of training, so we’re only worried about systematic error, for example due to claims playing on human biases or emotions. We need two claims here: that the human input being simulated has bounded systematic error with respect to reality (C5.1) and that the simulation has bounded systematic error with respect to the human input (C5.2).
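As a toy illustration of why random judge errors are less concerning than systematic ones (our own example, not from the paper): averaging many noisy judgements recovers the underlying quantity when the errors are zero-mean, but a shared bias survives any amount of averaging.

```python
import random

def averaged_judgement(true_value: float, n: int, noise: float, bias: float) -> float:
    """Average of n noisy judgements of a quantity; `bias` models a shared systematic error."""
    samples = [true_value + bias + random.gauss(0.0, noise) for _ in range(n)]
    return sum(samples) / n

random.seed(0)
truth = 0.7
print("random error only:    ", round(averaged_judgement(truth, 10_000, noise=0.3, bias=0.0), 3))
print("plus systematic bias: ", round(averaged_judgement(truth, 10_000, noise=0.3, bias=0.2), 3))
```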
We expect that getting good human inputs will be one of the most challenging parts of this safety case and require much more work. While some errors like moral or emotional bias are likely less prevalent in the domain of AI R&D, there will likely still be errors like cognitive biases and reasoning failures. However, we are tentatively optimistic that human inputs can be improved, both by designing better processes (e.g. giving humans access to the right tools, finding the right number and type of judges) and by improving debate protocols (e.g. allowing debaters to block certain types of claims on which humans are likely to be systematically wrong).
Third, since training doesn’t reproduce M exactly, but rather a slightly perturbed version of M, we need to argue that this approximation isn’t substantially different (C4.6). This is known as stability and can be more precisely stated as: small variations in input probabilities won’t change the outcome of the overall computation. We’re not yet sure how to support this claim, so we leave the evidence box blank (E5.5).
We expect that most protocols which avoid obfuscated arguments – including prover-estimator debate – will only approximate M and so require stability. However, it may not be needed for all protocols. For example, it might be possible to relax this assumption if we aim only for honesty rather than accuracy (more on this to come in future work). Alternatively, it might be possible to transform an unstable computation M into a stable computation M*; if so, the safety case should instead contain a claim that the outputs of M* are (approximately) identical to the outputs of M.
Even if stability is required to produce useful outputs, it might not be strictly necessary for safety. For example, in prover-estimator debate, unstable queries return no answer, rendering the protocol safe for all queries but useful only for stable queries.
Key claim 3: The system's behaviour will stay similar during deployment
The third key claim is that the system’s error rate will not drift past ε during deployment (C2.3). This part of the sketch looks like this:
In our sketch, we support this claim by using online training throughout deployment as above, such that any distributional shifts will be accounted for via further training (E3.1). While the argument is straightforward, the evidence should still demonstrate that the online training regime has been correctly configured. For example, the latency (i.e. the time between updates) should be low enough to prevent significant deviations from E, and there should be an effective process for updating multiple copies of the same model.
This is one of the key claims in the sketch which breaks in a high-stakes context, where even a small number of errors can cause large harm. Since we can’t use online training, we instead need guarantees against alignment faking (and, less importantly, more generic distribution shifts). This would be much trickier to defend; see some initial ideas in “Extending the safety case to high-stakes contexts”.
Key claim 4: It is sufficient for safety purposes for the system to provide correct answers most of the time
The fourth key claim is that the system cannot cause unacceptable outcomes given an error rate of ε (C1.2). The main failure mode we consider is that the system is scheming to produce harmful outputs specifically in cases where our reward signal (i.e. human judging on debate leaf nodes) can’t catch it. The remaining errors could then constitute a coordinated set of actions that might collectively cause an unacceptable outcome.
Since we don’t have a good way to tell where our reward signal fails and what the effects of those errors would be, we assume that each error has the worst possible outcome, as if the system is a maximally competent schemer.[9]
We then argue that the system’s deployment context is nevertheless low-stakes because the actions available to the system aren’t harmful enough to cause unacceptable outcomes in a small number of steps. This section of the sketch looks like this:[10]
First, we break down the unacceptable outcomes of concern into three threat models: research sabotage, i.e. producing false research papers (C2.4), malicious code attacks, i.e. exfiltrating malware when information is extracted from the offline datacentre (C2.5), and self-exfiltration, i.e. the agent escaping from the datacentre (C2.6). This breakdown is supported with a side-claim that we have identified all relevant failure modes (C2X.1) – we validate this with threat modelling supported by expert input (E2X.1). The reasoning behind our breakdown is as follows: our deployment set-up is such that the agent can only affect the external world via outputs from the datacentre. So the agent can only cause harm via those outputs, or by self-exfiltrating. Within the outputs, the agent can cause harm either via legitimate outputs (papers) or illegitimate outputs (malicious code).
For simplicity, we only cover one threat model – research sabotage via paper outputs - in the remainder of the safety case, though a similar argument (or other arguments) could be made for other threat models. We argue for this via two subclaims:
Subclaim 1: # of bad actions required
The first subclaim is that committing effective research sabotage would require a certain number of bad actions, Z. Research sabotage could be actions such as faking experimental data, running misleading experimental set-ups, or including false claims or bugged code in the final research paper. Our deployment set-up aims to make it harder to commit research sabotage by investigating the outputted research paper with a checker model trained on n debates. To successfully conduct research sabotage, the agent would have to produce a paper that is false in a way that would very robustly not be caught by the reward signal (or very robustly manipulate the human judges).[11] Our sketch argues that this isn’t possible, via threat modelling (E4.4).
One potential challenge for such threat modelling is that system capabilities might improve over the course of deployment due to online training. This could enable the system to access new, more harmful actions that could enable research sabotage with fewer actions (D4.1). Another defeater is the possibility of subtle research sabotage over long periods of time, for example by including subtly misleading but not technically false statements that aren’t caught by the reward signal (D4.2). This risk is reduced by resetting the datacentre between each iteration, minimising coordination opportunities. This could be supplemented with other measures to prevent suspicious correlations between outputs over long time horizons – more on this to come.
Subclaim 2: Low likelihood of enough bad actions
The second claim is that the agent cannot take Z bad actions during a single iteration given its error rate. This is straightforward to support, based on a calculation of the expected number of bad actions a system can take during deployment given probability distributions over its error rate ε (given online training latency k), total number of actions n, and actions required to pull off research sabotage Z (E4.5). Ideally, the claim should also be supported with evidence that the deployment protocol will in fact be adhered to, for example by committing to regular external verification (D4.3).[12] The argument only works if each iteration is isolated and cannot coordinate with other iterations – otherwise, the errors required to commit research sabotage could accumulate across iterations. While we aim to avoid this failure mode by resetting the agent and datacentre between each iteration, systems could interfere with this resetting mechanism – or they may be able to coordinate acausally due to being based on the same underlying system (D4.4).
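As a hedged sketch of the kind of calculation E4.5 gestures at (the numbers below are made up, and the probability distributions over ε, n and Z are collapsed to point values for illustration): if each action in an iteration were independently bad with probability at most ε, the chance of accumulating Z or more bad actions could be bounded with a standard Chernoff tail bound. The independence assumption is exactly what a scheming, coordinated system would try to violate, which is why the claims about isolating iterations matter.

```python
from math import exp, log

def chernoff_bound_at_least_z(n: int, eps: float, z: int) -> float:
    """Chernoff upper bound on P(X >= z) for X a sum of n independent Bernoulli(eps)
    variables: P(X >= z) <= exp(-mu) * (e * mu / z)**z for z > mu, with mu = n * eps.
    Computed in log space to avoid underflow."""
    mu = n * eps
    assert z > mu, "this bound is only valid above the mean"
    return exp(-mu + z * (1 + log(mu / z)))

# Illustrative, made-up numbers.
n = 10_000   # actions per iteration
eps = 1e-4   # error rate maintained by online training
z = 20       # bad actions assumed necessary for successful research sabotage

print(f"expected bad actions per iteration (n*eps): {n * eps:.2f}")
print(f"Chernoff bound on P(at least {z} bad actions): {chernoff_bound_at_least_z(n, eps, z):.2e}")
```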
Extending the safety case to high-stakes contexts
This safety case sketch only works in low-stakes contexts where it takes many bad actions to cause unacceptable outcomes. In a high-stakes context, key claim 4 – that we’re in a low-stakes context – breaks by definition. Key claim 3 – that the error rate will stay consistent throughout deployment – also breaks, because we can no longer rely on online training, opening up the possibility of alignment faking and other distribution shift problems.
But the low-stakes claim probably won’t hold for the most dangerous superhuman systems. So why make this safety case sketch?
The most straightforward reason is that some highly consequential deployments might plausibly be low-stakes. For example, early-AGI systems in sufficiently careful conditions could plausibly be low-stakes and used for essential alignment research to prevent future, high-stakes systems from causing a catastrophe. It might be possible to build a robust automated alignment safety case based on the scalable oversight techniques in this sketch – we’re planning to do more work on this.
More indirectly, we think this safety case sketch is a useful starting point for developing other high-stakes safety case sketches. In particular, we think we might be able to extend the techniques and evidence in this sketch to high-stakes contexts, rather than requiring completely different kinds of techniques or evidence. We currently have two hopes in this direction:
First, we might be able to use online training-like techniques during inference, on candidate actions instead of on final actions. This would allow us to maintain an error rate of ε for candidate actions, then cut the regret further by generating many candidate actions. This might be enough to patch key claim 3; it’s not enough to completely patch key claim 4, but it gets us some of the way (see the illustrative sketch below).
Second, the evidence needed to support key claims 1 and 2 might be sufficient (or extendable) to patch key claims 3 and 4. For example, to address exploration hacking, we might need a method for eliciting bad contexts (e.g. based on interpretability or singular learning theory). Such a method might also be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distributional shifts. More broadly, it seems like, given a theorem showing that the error rate is at most ε, in practice one should put significant weight on something close to full intent alignment (or at least the system being not egregiously misaligned). For example, it seems plausible that human error is anticorrelated with the severity of error such that the errors that remain are inconsequential.
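One possible reading of the first idea above, sketched with made-up names (generate_candidate, debate_check) purely for illustration: score each candidate action with the same kind of signal used for online training, and only execute a candidate that passes the check, abstaining otherwise.

```python
import random

def debate_check(candidate: str) -> float:
    """Hypothetical stand-in for scoring a candidate action with the training-time
    signal (e.g. a debate about whether the action is acceptable)."""
    return random.uniform(0.0, 1.0)

def select_action(generate_candidate, n_candidates: int = 16, threshold: float = 0.9):
    """Inference-time analogue of online training: the error rate applies per
    candidate rather than per executed action, and generating many candidates makes
    it more likely that some acceptable candidate passes the check."""
    candidates = [generate_candidate() for _ in range(n_candidates)]
    scored = [(debate_check(c), c) for c in candidates]
    best_score, best_candidate = max(scored)
    return best_candidate if best_score >= threshold else None  # None = abstain / escalate

if __name__ == "__main__":
    print(select_action(lambda: f"candidate action {random.randint(0, 999)}"))
```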
We’re far from confident in these techniques, but we think it’s worth trying hard to make them work. Our research agenda encompasses further work in this area, and we’ll be publishing more on these ideas soon.
Open problems
Our safety case sketch faces significant limitations but also helps us identify concrete research directions that might address them. We identify five main areas for further research:
Improved debate protocols. Many gaps in our safety case might be solvable with improved debate protocols with the following properties:
Empirical testing and optimisation. While debate has theoretical appeal, we have only limited data on its practical effectiveness. For example, most previous experiments use the QuALITY dataset, have low recursive depth, and use human debaters. Key questions include whether debate measurably improves performance, whether training reliably reaches equilibrium, whether stability is a real problem, and whether obfuscated arguments actually emerge as predicted. We also need an improved understanding of systematic human errors – for example: how good are humans at giving correct answers in verifiable domains, what are common systematic errors and in which domains are they most prominent, and how can we design human input for optimal results?
Exploration hacking. A major gap in our sketch is the requirement for exploration guarantees. We know very little about how serious a problem exploration hacking is and how to prevent it – yet the sketch depends on avoiding it. We outlined various research directions that might help address exploration hacking, spanning interpretability, singular learning theory, and more prosaic techniques – much more work is needed to develop and test these. More foundationally, it would be useful to develop model organisms of exploration hacking to help test potential solutions. For very superhuman systems, we might also need to address gradient hacking.
High-stakes contexts. In the high-stakes context, we face greater inner alignment challenges. We lose the ability to use online training, so need to face the possibility of distribution shifts and alignment faking; and we need new ways to contend with coordinated errors not captured by our reward signal. We outline research directions in the previous section.
From honesty to alignment. We need to better understand the alignment value of honest question-answering. How much does honesty plus absence of alignment faking translate into aligned actions during alignment research? How can deployment environments be designed to minimize risk while maintaining the system's usefulness for alignment research?
Further posts in this sequence will explore initial thoughts and research directions on many of these open problems – we’ll add links here as they are published.
Conclusion
Overall, the argument in this sketch isn’t that strong with current techniques – much better techniques will be needed by the time we need to make alignment arguments.
That said, we think there are useful directions for future research, which we intend to pursue (and fund others to pursue) – although we aren’t confident that the holes can be patched.
We are optimistic that safety case sketches are a useful tool for research and for coordinating research. Writing this up helped us clarify our thinking and gave us ideas for further research directions – we’ll be following up with more posts on these.
Acknowledgements: We’re grateful for valuable input and comments from Beth Barnes, Charlie Griffin, Jacob Hilton, Jonah Brown-Cohen, Joshua Clymer, Martin Soto, Nicola Ding, & Will Kirby.
Appendices
Appendix 1: The full safety case diagram
An interactive version of the diagram can be found here. Alternatively, open the image above in a new tab and zoom in.
Appendix 2: The "Claims Arguments Evidence" notation
The safety case sketch begins with a top-level claim we want to make about the safety properties of an AI system (A). This top-level claim is iteratively broken down into further subclaims, which must eventually bottom out in evidence. These subclaims show the logical structure and dependencies connecting evidence to the top-level claim.
The diagram uses the “Claims, Arguments, Evidence” notation. It consists of four types of units:
Blue ovals are claims. Claims must either be broken down into further subclaims or directly supported by evidence, in both cases via an argument.
Green parallelograms are arguments. Arguments link claims to evidence or further subclaims. They explain why the evidence or subclaims suffice to justify the claim. This sketch uses 3 argument types:
Purple boxes are evidence. They provide information, data or calculations that support claims. There are many possible types of evidence, including results from empirical tests, proofs, results from expert surveys, or theoretical outputs such as threat modelling reports.
Red ovals are defeaters. These are hypothetical claims that, if true, could invalidate or render insufficient the evidence provided. The safety case should explain what has been done to address defeaters and/or why they are not a concern.
Grey elements provide context. They are surrounding parts that would be included in a complete safety case but are not the focus of this sketch.
However, we think that the argument structure is general enough to carry over to improved scalable oversight protocols with better properties.
The stability assumption is not directly required for safety – for queries where stability does not hold, the protocol will simply return no answer, so instability does not imply false outputs. However, the protocol is not very useful if stability fails to hold for a large fraction of relevant queries.
Previous safety case sketches have been primarily evaluation-based. Our sketch is the first to take a more mechanistic approach. This is partly because alignment evaluations are more challenging than other types of evaluations, and partly because a lesson from other industries is that safety cases should include arguments based on a mechanistic understanding of how a system works.
That is, the model weights converge towards policies that play the equilibrium strategy.
More subtle issues with the paper, such as containing true but subtly misleading claims or using suboptimal framings, might not be caught by debate on questions such as these. Using debate on actions taken inside the datacentre, while preparing the paper, mitigates this risk somewhat, but other methods might be needed to catch individually innocuous but cumulatively harmful errors. We’re working on ideas for 'unexploitable search' to address this type of error.
Note that this is an equilibrium over model weights or policies, rather than strategies in individual rounds of debate. The model weights or policies determine the strategy for all debates. Equilibrium is reached when neither debater can improve its expected pay-off with a different policy.
If the models are gradient hacking, it could in theory make it impossible to reach E via the training process T. Gradient hacking is therefore also a defeater to E5.1.
While we can probably directly verify many of the model’s outputs in some domains (e.g. maths), we also want to be able to use the model to answer non-verifiable questions.
In this sense, our argument has a control flavour, in that we are assuming that the system is as egregiously misaligned as possible given the constraints of training. An alternative argument would be that we are in a low-stakes context because the reward signal catches the most severe errors.
For simplicity, we phrase these claims deterministically rather than probabilistically, and with respect to a binary acceptable vs. unacceptable outcome rather than utility loss. For a continuous version, our overall claim could be that we have a regret bound of O(ρk√T+ρε′T). For a probabilistic version, we could use probability distributions over ε, as well as Z or ρ.
The agent would also have to set this up using only actions that would not be flagged by the reward signal as suspicious, since it is subject to online training while producing the paper.
It may also be necessary to provide evidence that the AI agent itself will not tamper with the deployment protocol, for example by disabling online training.