The UK’s AI Security Institute published its research agenda yesterday. This post gives more details about how the Alignment Team is thinking about our agenda.
Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems that are autonomously pursuing a course of action which could lead to egregious harm and that are not under human control. No known technical mitigations for these risks remain reliable past AGI.
Our plan is to break down promising alignment agendas by developing safety case sketches. We'll use these sketches to identify specific holes and gaps in current approaches. We expect that many of these gaps can be formulated as well-defined subproblems within existing fields (e.g., theoretical computer science). By identifying researchers with relevant expertise who aren't currently working on alignment and funding their efforts on these subproblems, we hope to substantially increase parallel progress on alignment.
Our initial focus is on using scalable oversight to train honest AI systems, using a combination of theory about training equilibria and empirical evidence about the results of training.
This post covers:
1. Why safety case-oriented alignment research?
Arriving at robust evidence that human-level AI systems are aligned requires complementary advances across empirical science, theory, and engineering. We need a theoretical argument for our method’s effectiveness, empirical data validating the theory, and engineering work to make the method low cost. Each of these subproblems informs the others: for instance, theoretical protocols are not useful unless efficient implementations can be found.
This is just one example of a way in which alignment research in different areas can be complementary. The interdependencies between different areas of alignment research are often hard to spot and not completely understood, which creates a coordination challenge: the most useful research to do depends on what others are doing. This difficulty in identifying which problems are most critical to solve next leads to a focus on problems that are theoretically appealing or empirically tractable, leaving other areas neglected despite their system-level importance.
But there is also a coordination opportunity. Many subproblems can be tackled in parallel due to their hierarchical structure, varying dependencies, and requirements for diverse expertise. So, by mapping these interdependencies and coordinating distributed efforts effectively, the field can advance on multiple fronts simultaneously.
We propose using safety case sketches as an organising frame for clarifying the relationships between the claims, arguments and evidence entailed by alignment proposals. Within this frame, the AISI Alignment Team research agenda involves two tasks: (1) decompose alignment proposals into safety case sketches, leaving unfilled holes for future research, and (2) engage with the alignment and broader research communities to solve resulting alignment subproblems.
2. Our initial focus: honesty and asymptotic guarantees
There are two features of our initial research that are worth highlighting because they somewhat differentiate our direction from others in the field.
First, we specifically focus on sketches which make top-level claims about the honesty of AI systems.
We do this because honesty (in at least some domains) is likely to be a necessary condition for the safety of superintelligent systems: the opposite would be a deceptive system that can hide crucial safety-relevant information. We think honesty addresses a reasonable portion of potential risks, as many routes to large-scale harm from misaligned systems flow through deception. More broadly, we can use honest systems to prevent all sorts of harm by asking them whether a proposed course of action might cause harm.
We also think that honesty is useful as a first step – for example, if we could build honest systems we could use them to conduct research into other aspects of alignment without a risk of research sabotage. Finally, we think honesty is more tractable than other aspects of alignment: it's relatively easy to be precise about, is in some cases verifiable, and is approximately the target of a variety of existing agendas, including debate, ARC, and Bengio’s Scientist AI.
Second, we aim to make claims about various properties of AI systems using ‘asymptotic guarantees’: proofs that if you give a process ‘enough resources’ (human data quality, training time, neural network capacity, etc.) then some specification of the resulting system is guaranteed to hold. We use asymptotic guarantees alongside evidence (both theoretical and empirical) that the relevant process will in fact converge – for example, ways to mitigate risks of sandbagging and exploration hacking. We’re optimistic about asymptotic guarantees because pure empirics based on current models may be too weak for sufficient safety; on the other hand, full formal verification may be too hard to realistically achieve.
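As a schematic illustration (our own notation, not a result from the agenda): write R for the resources given to the training process, M_R for the resulting model, and D for the deployment distribution. An asymptotic honesty guarantee might then take the following form.

```latex
% Schematic form of an asymptotic guarantee (illustrative notation only):
% for any error tolerance epsilon, sufficient resources drive the rate of
% honesty violations under the deployment distribution D below epsilon.
\[
  \forall \varepsilon > 0 \;\; \exists\, R(\varepsilon) \ \text{such that} \quad
  R \ge R(\varepsilon) \;\Longrightarrow\;
  \Pr_{x \sim D}\big[\, M_R(x) \ \text{is dishonest} \,\big] \;\le\; \varepsilon .
\]
```

The strength of such a statement depends on how ‘dishonest’ is specified and on whether the required resources R(ε) are attainable in practice, which is where the convergence evidence mentioned above comes in.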
We are currently far from being able to develop safety cases based on asymptotic guarantees – and not just because of the exploration hacking problem mentioned above. Current theory predicts that debate (and other scalable oversight methods) will not converge to approximate honesty: the obfuscated arguments problem shows that honesty can be intractably harder than deception in some debate games. Even if we solve the obfuscated arguments problem, asymptotic guarantees based on learning theory will usually only guarantee a success rate of 1−ε, where 0 < ε ≪ 1 is the error tolerance. While standard learning theory treats these ε failure cases as randomly distributed according to the underlying data distribution, the worst case for safety is when a sufficiently advanced AI model can strategically control where those failures occur.
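To see why that distinction matters, here is a worked example in our own notation (an illustration, not part of the agenda). Suppose the deployment distribution D places probability mass μ(H) on a set H of high-stakes queries, and the overall failure rate is at most ε.

```latex
% If the epsilon failure mass is placed adversarially rather than at random,
% it can be concentrated entirely on the high-stakes set H:
\[
  \Pr_{x \sim D}\big[\,\text{failure} \mid x \in H \,\big]
  \;\le\; \min\!\Big(1, \tfrac{\varepsilon}{\mu(H)}\Big),
\]
% and this bound is attainable by an adversary that controls where failures land.
```

If failures were random, the failure rate on H would remain around ε; but an adversary that controls failure placement can make every high-stakes query fail whenever μ(H) ≤ ε, even though the overall failure rate still looks acceptably small.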
We aim to conduct and fund theory research (in complexity theory, game theory, learning theory and other areas) to both improve our ‘asymptotic guarantees’ and develop ways of showing convergence.
This isn’t to say that only theoretical work is needed. We expect asymptotic guarantees to be one piece of an alignment safety case, combined with other evidence sources such as evaluations based on model behaviour or model internals. We conduct empirical research (in machine learning, cognitive science and other areas) to validate that our theoretical research applies to real models, and to cover other relevant gaps (such as human input). By combining multiple evidence sources, we can make a stronger argument overall.
Example: Debate safety case sketch
The basic idea of debate is to train an AI system via a “debate game” in which the winning strategy is to provide honest and accurate answers. Our sketch is based on the work-in-progress prover-estimator (PE) debate protocol, which aims to correct the obfuscated arguments failure mode of previous debate protocols.
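As a rough illustration of this basic debate setup (a minimal sketch only: it is not the prover-estimator protocol, and every function here is a hypothetical stand-in for a trained model or human judge), a single debate might be played and judged like this:

```python
import random

# Hypothetical placeholder debaters and judge: in practice these would be
# language models (and possibly a human judge); here they are stubs so the
# sketch runs end to end.
def debater(question: str, stance: str, transcript: list[str]) -> str:
    """Return the next argument for the given stance (stub implementation)."""
    return f"[{stance}] argument about '{question}' (turn {len(transcript) + 1})"

def judge(question: str, transcript: list[str]) -> str:
    """Decide which stance won the debate; this stub picks randomly."""
    return random.choice(["pro", "con"])

def run_debate(question: str, n_turns: int = 4) -> tuple[str, list[str]]:
    """Play a fixed-length debate between a 'pro' and a 'con' debater and
    return the judge's verdict along with the transcript."""
    transcript: list[str] = []
    for turn in range(n_turns):
        stance = "pro" if turn % 2 == 0 else "con"
        transcript.append(debater(question, stance, transcript))
    return judge(question, transcript), transcript

if __name__ == "__main__":
    verdict, transcript = run_debate("Is claim X supported by the evidence?")
    print(verdict)
    for line in transcript:
        print(line)
```

In training, the judge’s verdicts provide the reward signal for the debaters; the hoped-for equilibrium is one in which honest, accurate argumentation is the winning strategy.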
We sketch arguments for four key claims that a developer must defend to show that the debate training has succeeded in producing a trustworthy system:
Through this process we’re scoping subproblems which we think are solvable and highly relevant to ensuring that debate works in practice. An example (on which we’ve already published) is eliciting bad contexts. Identifying eliciting bad contexts as a subproblem helps in two ways. First, we can scope interpretability work (or other approaches such as singular learning theory) towards a subgoal which contributes to alignment without requiring full neural network decompilation. Second, we can scope scalable oversight work assuming a factored solution to eliciting bad contexts, allowing for parallel research progress.
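To give a concrete flavour of the subproblem (a minimal black-box sketch under our own assumptions, not the method from the linked work; the model and predicate below are stubs): given a model and a precise predicate for the behaviour we care about, the task is to find contexts in which that behaviour occurs, without needing a full interpretation of the network.

```python
import random

# Hypothetical placeholders: a model under test and a behaviour predicate.
def model_respond(context: str) -> str:
    """Stub model: occasionally produces a 'bad' response for illustration."""
    return "BAD_RESPONSE" if random.random() < 0.01 else "ok"

def is_bad(response: str) -> bool:
    """A precisely specified predicate for the behaviour we want to elicit."""
    return response == "BAD_RESPONSE"

def elicit_bad_contexts(candidate_contexts: list[str], budget: int = 1000) -> list[str]:
    """Naive black-box search: sample candidate contexts and keep those that
    trigger the bad behaviour."""
    found = []
    for _ in range(budget):
        context = random.choice(candidate_contexts)
        if is_bad(model_respond(context)):
            found.append(context)
    return found

if __name__ == "__main__":
    contexts = [f"scenario {i}" for i in range(100)]
    print(elicit_bad_contexts(contexts)[:5])
```

White-box variants would replace the blind sampling with search guided by model internals, which is where interpretability or singular learning theory could plug in.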
The full sketch is the next post of this sequence.
3. Future work
Concrete open problems in honesty
Below we list some categories of problems surfaced by our safety case sketch for debate. These problems are not novel to the sketch, but the sketch clarifies precisely what problem needs to be solved and what assumptions are available.
In general, there are now many well-defined problems directly relevant to alignment, and the situation is better than it was a few years ago, not just in debate (see, e.g., the ARC and Scientist AI research agendas). Many of these subproblems highlight how external research agendas (e.g. white-box methods including singular learning theory) could contribute to a scalable oversight safety case. We think that the volume of problems is such that the number of theorists working on alignment could grow by an order of magnitude while working solely on existing problems. The majority of these questions are parallelisable and modular: they do not depend on progress on other questions, and can be understood and defined without requiring a full picture of the debate safety case.
More details on our empirical approach
We aim to run experiments to validate our theoretical claims. Building experiments and evaluations for alignment methods has previously succeeded in surfacing limitations of alignment methods (e.g. the obfuscated arguments problem in debate) and we expect empirical evidence to continue to be a crucial part of the overall picture.
For alignment experiments, we'll need new datasets and evaluation methods that differ from traditional datasets measuring capabilities (such as coding or writing skills). An alignment evaluation needs to carefully work around the fact that we’re usually trying to find evidence about what might work on future models, rather than the ones we have today.
We can currently see three possible approaches:
We list a few concrete empirical problems we would like to see solved in section 4.
Moving beyond honesty: automated alignment
Honesty may be necessary to trust automated alignment research, but it is not sufficient (for example, the AI researcher could be honest but wrong). So, we are interested in developing safety case sketches for automated alignment.
As of late 2024, Google claimed that ~25% of their new code was written by AI. There is a robust trend towards AI taking on larger autonomous, agentic roles in software engineering and on well-defined research problems. As such, AI will likely play a large role in the research and engineering involved in aligning future AI systems. Understanding where such automation improves on direct human efforts, and where delegating work to AIs risks undermining alignment research, is a priority.
It’s not just safety cases: we are also interested in research that maps or expands the set of alignment subproblems that can be safely tackled using AI agents, such as by studying long-horizon generalisation, reward hacking in the code agent setting, etc. We have so far released one post focused on automating interpretability.
4. List of open problems we’d like to see solved
We recognise many of these are not concrete enough to work on yet! We plan on publishing a series of posts on key open problems over the coming weeks, and will update this list with links as they are released.
4.1 Empirical problems
Here are some empirical problems arising from our debate research:
4.2 Theoretical problems
Here are some theoretical problems arising from our debate research:
ARC have also set out some theory problems: for example, settling the computational no-coincidence conjecture, or separating efficient defendability from efficient PAC learnability (Conjecture 5.4 in Christiano et al., 2024).
We will update this list as we make these problems more concrete and identify new open problems.
Collaborate with us
We want to collaborate with researchers across disciplines on approaches to AI alignment. We’re particularly keen to hear from:
Both theoretical and empirical proposals are welcome: express interest here. You can also apply for grants directly here.