The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...
Summary: This post highlights the need for results in AI safety, such as those underpinning debate or other scalable oversight protocols, to 'relativise', i.e. to hold even when all parties are given access to a black-box 'oracle' (the oracle might be a powerful problem solver, a random function, or a...
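For readers unfamiliar with the term, here is a standard complexity-theory sketch of relativisation (the notation and the DEBATE/PSPACE example are illustrative, drawn from the classical literature rather than quoted from this post):

```latex
% Illustrative sketch of relativisation; notation assumed, not from the post.
A result $R$ \emph{relativises} if it continues to hold when every machine is
given access to the same oracle $\mathcal{O}$:
\[
  R \text{ relativises} \iff \forall\,\mathcal{O}:\ R^{\mathcal{O}} \text{ holds}.
\]
For instance, the classical characterisation $\mathsf{DEBATE} = \mathsf{PSPACE}$
relativises only if $\mathsf{DEBATE}^{\mathcal{O}} = \mathsf{PSPACE}^{\mathcal{O}}$
for every oracle $\mathcal{O}$. By contrast, Baker--Gill--Solovay constructed
oracles $A, B$ with $\mathsf{P}^{A} = \mathsf{NP}^{A}$ and
$\mathsf{P}^{B} \neq \mathsf{NP}^{B}$, so no relativising argument can resolve
$\mathsf{P}$ vs.\ $\mathsf{NP}$.
```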
Linkpost to arXiv: https://arxiv.org/abs/2506.13609. Summary: We present a scalable oversight protocol in which honesty is incentivised at equilibrium. Prior debate protocols allowed a dishonest AI to force an honest AI opponent to solve a computationally intractable problem in order to win. In contrast, prover-estimator debate incentivises honest equilibrium behaviour, even when...
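To give a feel for the shape of such a protocol, a minimal toy sketch follows. The class, recursion rule, and function names here are assumptions made for illustration; the paper's actual protocol, payoffs, and stability conditions differ, and it is those that deliver the equilibrium guarantee.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    statement: str
    subclaims: list  # empty once the claim is simple enough to judge directly

def toy_debate(claim, prover_prob, estimator_prob, judge, depth=0, max_depth=8):
    """Recursively narrow a disputed claim down to a directly judgeable leaf.

    prover_prob / estimator_prob: callables mapping a Claim to a probability
    that it holds; judge: ground-truth check used only on leaf claims.
    (All three are hypothetical stand-ins, not objects from the paper.)
    """
    if not claim.subclaims or depth >= max_depth:
        return judge(claim)  # leaf: the (human) judge decides directly
    # Recurse where the prover's and estimator's probabilities diverge most:
    # a dishonest argument must hide its flaw in some subclaim, and the
    # estimator only ever needs to estimate, never to solve the hard problem.
    gaps = [abs(prover_prob(s) - estimator_prob(s)) for s in claim.subclaims]
    disputed = claim.subclaims[gaps.index(max(gaps))]
    return toy_debate(disputed, prover_prob, estimator_prob, judge,
                      depth + 1, max_depth)
```

Recursing on the largest disagreement is only one illustrative rule; the paper's sampling rule and scoring are what make honest behaviour an equilibrium.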
Summary: We have previously argued that scalable oversight methods can provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. rewarding honesty) is compatible with many possible solutions, then you also need to prevent the resulting free parameters from being exploited over time...
Summary: Both our (UK AISI's) debate safety case sketch and Anthropic’s research agenda point to systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.

Not too many errors in unknown places

The complexity theory models...
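As a generic illustration of the "not too many errors in unknown places" idea, here is a standard amplification trick, assumed for exposition rather than taken from the post (`judge` and `rephrase` are hypothetical callables):

```python
import random

def robust_verdict(question, judge, rephrase, k=11):
    """Majority-vote a fallible boolean judge over k sampled variants.

    Illustrative only: if the judge errs on at most an eps < 1/2 fraction of
    a question's variants, and variants are sampled independently, a Chernoff
    bound makes the majority verdict wrong with probability exponentially
    small in k, even though *which* variants are answered wrongly is unknown.
    """
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    votes = [judge(rephrase(question, rng)) for _ in range(k)]
    return sum(votes) > k / 2
```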
This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.

Executive summary

AI safety via debate is a promising method for solving part of the alignment...