AI ALIGNMENT FORUM

Benjamin Hilton — AI Alignment Forum

Head of Alignment at UK AI Security Institute (AISI). Previously 80,000 Hours, HM Treasury, Cabinet Office, Department for International Trade, Imperial College London.

Top posts

UK AISI’s Alignment Team: Research Agenda

The UK’s AI Security Institute published its research agenda yesterday. This post gives more details about how the Alignment Team is thinking about our agenda.

Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems which are autonomously pursuing a course of action which could lead to egregious harm, and which are not under human control. No known technical mitigations are reliable past AGI.

Our plan is to break down promising alignment agendas by developing safety case sketches. We'll use these sketches to identify specific holes and gaps in current approaches. We expect that many of these gaps can be formulated as well-defined subproblems within existing fields (e.g., theoretical computer science). By identifying researchers with relevant expertise who aren't currently working on alignment and funding their efforts on these subproblems, we hope to substantially increase parallel progress on alignment.

Our initial focus is on using scalable oversight to train honest AI systems, using a combination of theory about training equilibria and empirical evidence about the results of training.

This post covers:

1. The safety case methodology
2. Our initial focus on honesty and asymptotic guarantees
3. Future work
4. A list (which we'll keep updated) of open problems

1. Why safety case-oriented alignment research?

Arriving at robust evidence that human-level AI systems are aligned requires complementary advances across empirical science, theory, and engineering. We need a theoretical argument for our method’s effectiveness, empirical data validating the theory, and engineering work on making the method low cost. Each of these subproblems informs the others: for instance, theoretical protocols are not useful unless efficient implementations can be found. This is just one example of a way in which alignment research in different areas can be complementary. The interdependencies between different a

Automation collapse

An alignment safety case sketch based on debate

A sketch of an AI control safety case

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025•12

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025•10

Research Areas in Probabilistic Methods (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025•4

Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025•14

The Alignment Project by UK AISI

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...

Aug 1, 2025•29

An alignment safety case sketch based on debate

This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here. Executive summary AI safety via debate is a promising method for solving part of the alignment...

May 8, 2025•62

UK AISI’s Alignment Team: Research Agenda

The UK’s AI Security Institute published its research agenda yesterday. This post gives more details about how the Alignment Team is thinking about our agenda. Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems which are autonomously pursuing a course...

May 7, 2025•115
