Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure. Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treutlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette...
Thanks to Arun Jose, Joseph Bloom, and Johannes Treutlein for feedback/discussions. Introduction The release of AutoGPT prompted discussion of the potential for such systems to turn non-agentic LLMs into agentic systems that pursue goals, and of the dangers that could follow. The relation of such systems to the alignment...
TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep...
This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger, Arun Jose, and Dane Sherburn for discussions/feedback. Introduction This post aims to extend the auditing games framework to high-level interpretability;...
Or ... How penalizing computation used during training disfavors deception. This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger, Yonadav Shavit, and Arun Jose for helpful discussions and feedback....