Seb Farquhar — AI Alignment Forum

GDM AI Control Roadmap

by Mary Phuong, Erik Jenner, Rohin Shah, and Seb Farquhar

GDM has published an AI Control Roadmap! From the executive summary:

> We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain.
>
> ...

Jun 18 · 87

Testing Gemini models for scheming tendencies

by Vika, David Lindner, Seb Farquhar, and Rohin Shah

As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity. Our new papers focus on propensity for...

May 29 · 47

MONA: Managed Myopia with Approval Feedback

Blog post by Sebastian Farquhar, David Lindner, Rohin Shah. It discusses the paper MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. Our paper tries to make agents that are safer in...

Jan 23, 2025 · 81

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

by Rohin Shah, Seb Farquhar, and Anca Dragan

We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things...

Aug 20, 2024 · 222

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about...

Dec 18, 2023 · 149