The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through The Alignment Project.
Apply now to join researchers worldwide in advancing AI safety.
Achieving high-assurance alignment will require formal guarantees as well as empirical observations. Just as the concept of entropy made information theory far more tractable, we suspect that there are intuitive concepts in AI security which, if formalised mathematically, could make previously intractable problems suddenly approachable. For example, many approaches to training safe AI rely on humans or weaker AI systems supervising more capable ones. Complexity theory helps us create and assess these approaches: under what conditions do they have desirable equilibria, when are they provably impossible, and how can we bound their error?
Problem summary: AI safety via debate (Irving et al., 2018) is a potential method for safely training advanced AI systems to understand complex human goals. Agents are trained via self-play on a zero-sum debate game in which a human judges which of the agents gave the most true, useful information. There are alternative debate protocols, such as cross-examination (Barnes and Christiano, 2020), doubly-efficient debate (Brown-Cohen et al., 2023) and prover-estimator debate (Brown-Cohen et al., 2025). All of these protocols are insufficient in at least some way (Buhl et al., 2025); can we make improvements?
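To fix intuitions, here is a minimal sketch of the basic debate setup: two debaters argue for opposing answers in a zero-sum game resolved by a judge. This is only a toy illustration, not any of the specific protocols cited above, and all function names are hypothetical placeholders.

```python
import random

def debater(answer, transcript):
    """Hypothetical debater policy: returns the next argument for `answer`."""
    return f"argument supporting {answer!r} given {len(transcript)} prior turns"

def judge(question, transcript):
    """Hypothetical weak judge: decides which debater seemed more truthful."""
    return random.choice(["A", "B"])  # placeholder for a human/weak-model judgement

def debate(question, answer_a, answer_b, num_rounds=3):
    transcript = []
    for _ in range(num_rounds):
        transcript.append(("A", debater(answer_a, transcript)))
        transcript.append(("B", debater(answer_b, transcript)))
    winner = judge(question, transcript)
    # Zero-sum reward: +1 for the winner, -1 for the loser.
    rewards = {"A": 1, "B": -1} if winner == "A" else {"A": -1, "B": 1}
    return transcript, rewards

if __name__ == "__main__":
    _, rewards = debate("Is claim X true?", "yes", "no")
    print(rewards)
```

In training, both debater policies would be optimised against this zero-sum reward while the judge stays fixed; the open questions above concern which protocols make the resulting equilibria trustworthy.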
Why this matters: AI safety via debate is a proposed solution to the scalable oversight problem: correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Buhl et al., 2025 for an explanation of how debate could lead to a compelling argument that an AI system is safe, and Brown-Cohen et al., 2025 for a recent debate protocol. Previous research in this area includes Irving et al., 2018, Barnes and Christiano, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. Leike et al., 2018 introduce the broad idea of recursive reward modelling beyond the topic of debate.
Problem summary: When we use AI systems to solve complex problems, they will need to propose and argue for different claims about the validity or safety of their solutions. To verify these claims, we'll need evidence—often provided by the AIs themselves. This evidence is typically imprecise or probabilistic, requiring us to combine multiple sources to reach confidence about a claim. However, it is difficult to assess whether multiple pieces of evidence are truly independent (providing robust redundancy) or actually correlated or stemming from the same underlying source. This creates a critical failure mode: we might become overconfident in a claim because an AI provides many pieces of evidence that appear independent but are actually correlated. While this is a general problem in reasoning, it's especially crucial for the soundness of scalable oversight methods.
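To make the failure mode concrete, here is a small numerical sketch (our own toy model, not taken from the cited work): three pieces of evidence each look like a 4:1 update in favour of a claim. Treated as independent, the updates multiply; if they in fact restate a single shared source, only one update is warranted.

```python
# Toy Bayesian sketch of the overconfidence failure mode: evidence that looks
# independent but shares a single underlying source. Numbers are illustrative.

def posterior(prior_odds, likelihood_ratios):
    """Posterior probability if each piece of evidence is treated as independent."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)  # convert odds to probability

prior_odds = 1.0  # 50/50 prior on the claim
lr = 4.0          # each piece of evidence looks 4:1 in favour

# Naive combination: three "independent" pieces of evidence.
naive = posterior(prior_odds, [lr, lr, lr])

# Correlated case: all three are restatements of one underlying observation,
# so only a single 4:1 update is warranted.
correlated = posterior(prior_odds, [lr])

print(f"assuming independence: P(claim) = {naive:.3f}")       # ~0.985
print(f"single shared source:  P(claim) = {correlated:.3f}")  # 0.800
```

The gap between the two numbers is exactly the overconfidence described above: the judge's calibration depends on correctly modelling the dependence structure of the evidence.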
Why this matters: Determining independence is necessary for correctly calibrating our confidence in AI-proposed arguments or solutions. For example, in the theoretical study of scalable oversight methods, we sometimes need to assume the existence of uncorrelated pieces of evidence to prove the method sound, e.g. assuming s-supportedness in prover-estimator debate. We also expect this issue to arise empirically, and it might be especially harmful to human oversight of AI, given the many systematic biases that human judges are known to display. These biases are highly correlated across time and across different judges, rather than being independently distributed errors.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Brown-Cohen et al., 2025, and Bengio et al., 2025, which lifts scalable oversight to the Bayesian setting.
Problem summary: One framing of the scalable oversight problem is whether a computationally-bounded trusted verifier can learn to solve tasks through interacting with a more powerful but untrusted prover (Brown-Cohen et al., 2025).
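As a loose illustration of this framing (not the prover-estimator protocol itself), the sketch below has a resource-bounded verifier spot-check a few local consistency conditions in a long computation claimed by an untrusted prover. All names are hypothetical, and this is not a sound protocol on its own.

```python
import random

def untrusted_prover(xs):
    """Returns claimed prefix sums; a dishonest prover might tamper with them."""
    prefix, total = [], 0
    for x in xs:
        total += x
        prefix.append(total)
    return prefix

def bounded_verifier(xs, claimed_prefix, num_checks=5):
    """Checks a few random local consistency conditions instead of recomputing everything."""
    for _ in range(num_checks):
        i = random.randrange(1, len(xs))
        if claimed_prefix[i] - claimed_prefix[i - 1] != xs[i]:
            return False  # caught an inconsistency
    return True

xs = list(range(10_000))
claims = untrusted_prover(xs)
print("accepted:", bounded_verifier(xs, claims))
```

The research question is how to design interactions like this so that the bounded verifier's acceptance actually tracks the truth of the prover's claims, even against adversarial provers.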
Why this matters: As AI capabilities improve, we will need the ability to oversee the output of models which we will be unable to directly judge. Using AI models to oversee other AI models is a key tool to provide training signals toward desirable traits, such as honesty, in this setting.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Hammond and Adam-Day, 2025. Previous research in this area includes Kirchner et al., 2024 and Amit et al., 2024.
Problem summary: Future interpretability techniques may afford us a high-level circuit understanding of a model without a complete understanding of its subprocesses. We could model this situation as a small circuit calling heuristic functions. Such a framing has been a successful theoretical model in several areas of complexity theory, including differential privacy and statistical fairness.
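One way to picture this framing (a toy sketch of our own, not drawn from the cited work): the top-level "circuit" is tiny and fully auditable, while the subroutines it calls are treated as opaque heuristics about which we can only reason through their interface.

```python
# Toy sketch of the "small circuit calling heuristic functions" framing:
# the top-level logic is small and auditable, while the subroutines it calls
# are opaque heuristics (here, stand-in implementations).

def heuristic_relevance(document: str, query: str) -> float:
    """Opaque subprocess we don't fully understand; placeholder implementation."""
    words = query.split()
    return sum(word in document for word in words) / max(len(words), 1)

def heuristic_toxicity(document: str) -> float:
    """Another opaque subprocess; placeholder implementation."""
    return 1.0 if "attack" in document else 0.0

def small_circuit(document: str, query: str) -> bool:
    """The small, auditable part: a few comparisons over heuristic outputs."""
    relevant = heuristic_relevance(document, query) > 0.5
    safe = heuristic_toxicity(document) < 0.5
    return relevant and safe

print(small_circuit("a survey of alignment research", "alignment research survey"))
```

The theoretical question is then what guarantees about the whole system follow from the structure of the small circuit alone, given only weak assumptions about the heuristic calls.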
Why this matters: A number of existing AI safety tools might be modellable as small circuits calling heuristic functions; if we can understand these tools better, we might be able to classify when they will or will not work. For example, we could understand when we can trust our interpretability tools, or what interventions might improve interpretability.
Examples of subproblems:
Suggestions to learn more: For an example of where small circuits have already been useful in alignment, see Brown-Cohen et al., 2025; for other use cases, see Dwork et al., 2022a, 2022b. A potential application is in Irving et al., 2025.
Problem summary: The theorems for prover-estimator debate (Brown-Cohen et al., 2025) show that, in equilibrium, the prover's claims are correct most of the time (Theorem 6.1). Intuitively, honesty is necessary but not sufficient for correctness. Formalising this notion could allow us to weaken the stability and resource requirements for prover-estimator debate. There are various properties of honesty we could consider.
Why this matters: In practice, we won't know in advance whether a given problem can be solved correctly through debate. For problems where no perfect answer exists, current theorems provide no guarantees: the debate could output anything. If we could formalise 'honesty' as a weaker property than correctness, we could ensure that debate returns the model's genuine best guess even when certainty is impossible, while still maintaining safety guarantees.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Brown-Cohen et al., 2025. Previous research in this area includes Amit et al., 2024, Neyman, 2024.
Problem summary: If a reward training signal admits multiple solutions, these free choices risk being exploited over time (the 'exploitable search problem'; Pfau and Irving, 2025). Unexploitable search methods should force a model to randomise performance over all adversarial side-objectives the model can access, in a sense to be formalised by the method.
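A minimal sketch of the intuition (our own illustration, not the formal definition in Pfau and Irving, 2025): when many solutions pass the verifiable check, an exploitable policy can consistently pick the one that also serves a hidden side-objective, whereas an unexploitable policy should randomise over the acceptable set.

```python
import random

# Toy sketch of exploitable vs. unexploitable search on an under-specified task.
# Several candidate solutions all pass the check; what differs is how one is chosen.

def passes_check(solution: str) -> bool:
    """Stand-in for the verifiable part of the reward signal."""
    return solution.startswith("valid")

def hidden_side_objective(solution: str) -> bool:
    """A property the overseer cannot see, e.g. carrying a steganographic payload."""
    return solution.endswith("_payload")

candidates = ["valid_a", "valid_b", "valid_c_payload", "invalid_d"]
acceptable = [s for s in candidates if passes_check(s)]

def exploitable_policy(acceptable):
    # Deterministically prefers solutions that also serve the side-objective.
    for s in acceptable:
        if hidden_side_objective(s):
            return s
    return acceptable[0]

def unexploitable_policy(acceptable):
    # Randomises over all acceptable solutions, so no side-objective is
    # systematically favoured.
    return random.choice(acceptable)

print("exploitable:  ", exploitable_policy(acceptable))
print("unexploitable:", unexploitable_policy(acceptable))
```

The open problem is to say precisely which distributions over acceptable solutions count as "unexploitable", and how to train models whose search provably matches such a distribution.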
Why this matters: We would like to be able to use AI on under-specified tasks, like research advice, without a uniquely verifiable correct solution. In this setting, we could be vulnerable to an AI system adversarially selecting solutions that are plausible but nevertheless cause harm, such as opening up surface area for future exploits, or leaving steganographic information for later misbehaviour.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Pfau and Irving (2025).
Problem summary: Consider an AI system modified to behave differently on a particular ‘trigger’ input, known as a backdoor. It might be impossible to detect that such a backdoor exists before the system is exposed to the trigger. Are there ways to detect or mitigate backdoors?
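A toy sketch of why this is hard to catch from behaviour alone (illustrative only): the backdoored function agrees with the clean one on essentially every input that could feasibly be tested before deployment.

```python
import random

# Toy sketch of a backdoored model: it matches the clean model on all inputs
# except a single secret trigger, so behavioural testing before deployment is
# very unlikely to reveal the difference.

SECRET_TRIGGER = 123456789  # known to the attacker, not to the evaluator

def clean_model(x: int) -> int:
    return x % 7

def backdoored_model(x: int) -> int:
    if x == SECRET_TRIGGER:
        return -1  # harmful behaviour, only on the trigger
    return x % 7

# Pre-deployment evaluation: random testing almost surely misses the trigger.
test_inputs = [random.randrange(10**9) for _ in range(100_000)]
disagreements = sum(clean_model(x) != backdoored_model(x) for x in test_inputs)
print("disagreements found in testing:", disagreements)  # almost certainly 0
```

Detection and mitigation strategies therefore need to go beyond black-box behavioural testing, for example by inspecting the model's internals or by bounding the damage a triggered backdoor can do.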
Why this matters: AI systems may behave as desired during training and evaluation, only to act in harmful ways when exposed to specific conditions during deployment. These conditions may be intentionally set to only trigger once the system is deployed. Assessing whether a model has this property is key to preventing the deployment of harmful systems.
Examples of subproblems:
Established subproblems:
Suggestions to learn more: We’d recommend reading Christiano et al., 2024. Previous research in this area also includes Goldwasser et al., 2022, 2025 and Gluch and Goldwasser, 2025.
Problem summary: Is there a way of formalising the concept of a 'heuristic argument' (an informal argument about the behaviour of a mathematical system) so that such arguments are machine-checkable? For example, if we assume some facts are uncorrelated or independent, can we derive useful conclusions for training safe AI systems?
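A small numerical sketch of the flavour of argument (a standard illustration, not a machine-checkable formalisation): for a fixed random 3-SAT formula, presume the events "clause i is satisfied by a uniform random assignment" are independent, derive an estimate of the number of satisfying assignments, and compare it to a brute-force count.

```python
import itertools
import random

# Heuristic-argument sketch via a presumption of independence: each 3-literal
# clause is satisfied by 7/8 of assignments, so presuming the clauses behave
# independently gives the estimate  #sat ≈ 2**n * (7/8)**m,
# which we compare against an exact brute-force count.

random.seed(0)
n, m = 12, 20  # variables, clauses

def random_clause(n):
    vars_ = random.sample(range(n), 3)
    return [(v, random.choice([True, False])) for v in vars_]  # (variable, sign)

formula = [random_clause(n) for _ in range(m)]

def satisfies(assignment, clause):
    return any(assignment[v] == sign for v, sign in clause)

exact = sum(
    all(satisfies(assignment, clause) for clause in formula)
    for assignment in itertools.product([False, True], repeat=n)
)

heuristic = 2**n * (7 / 8) ** m

print(f"heuristic estimate: {heuristic:.1f}")
print(f"exact count:        {exact}")
```

The research question is how to turn reasoning of this kind, including the explicit independence presumptions it relies on, into objects that a machine can check and compose.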
Why this matters: Many of the arguments we could make about AI systems, for example about their internal structures or the probability of rare failures, will be informal or probabilistic heuristics, rather than watertight proofs. We would like to be able to use heuristic arguments that are machine-checkable but not necessarily human-understandable to make such careful, but not certain, claims about AI systems.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Hilton (2024) for a more in-depth look at the problem and a summary of some results, and Christiano et al., 2022 and Neyman (2025) for results in this area.
Problem summary: Are there alternative approaches to building and deploying advanced AI systems that credibly mitigate the risks from misaligned systems? How hard might these be to achieve given concerted effort?
Why this matters: We’re very uncertain about whether existing approaches will be successful, and whether they will scale to even more powerful AI systems.
Suggestions to learn more: We’d recommend reading Hilton (2024), Appel and Kosoy (2025) and Nayebi (2025).
Visit The Alignment Project website to learn more, or apply for funding here.