Charlie Griffin

Practical challenges of control monitoring in frontier AI deployments

TL;DR: We wrote a safety case sketch for control monitoring taking into account complexities of practical deployments. This work was a collaboration between Google DeepMind and the UK AI Security Institute. Full author list: David Lindner*, Charlie Griffin*, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney. Read...

Jan 12

Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we're studying them anyway

Even if the misalignment risk from current AI agents is small, it may be useful to start internally deploying misalignment classifiers: language models designed to classify transcripts that represent intentionally misaligned behaviour. Deploying misalignment classifiers now may provide qualitative and quantitative evidence about how current agents misbehave in real deployments....

Aug 15, 2025

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

We recently released Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring). Abstract An AI control protocol is a plan...

Mar 24, 2025

Games for AI Control

[UPDATE: This post is now updated to reflect our conference-length preprint, available on arXiv. Louis Thomson helped with this update. New results are marked "NEW".] This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by...

Jul 11, 2024

Scenario Forecasting Workshop: Materials and Learnings

Disclaimer: While some participants and organizers of this exercise work in industry, no proprietary info was used to inform these scenarios, and they represent the views of their individual authors alone. Overview In the vein of What 2026 Looks Like and AI Timelines discussion, we recently hosted a scenario forecasting...

Mar 8, 2024

Five projects from AI Safety Hub Labs 2023

AI Safety Hub Labs is a research programme that helps early-career researchers to complete an AI safety research project. Projects are completed in groups of 3-5 participants, supervised by a more senior safety researcher, and managed by AI Safety Hub. This summer’s programme was unpaid due to funding constraints. It...

Nov 8, 2023

Goodhart's Law in Reinforcement Learning

Produced As Part Of The OxAI Safety Labs program, mentored by Joar Skalse. TL;DR This is a blog post introducing our new paper, "Goodhart's Law in Reinforcement Learning" (to appear at ICLR 2024). We study Goodhart's law in RL empirically, provide a geometric explanation for why it occurs, and use...

Oct 16, 2023

AI ALIGNMENT FORUM