Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

118

evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer, Ethan Perez

This is a linkpost for https://arxiv.org/abs/2401.05566

I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.

EDIT: That announcement is now up!

Abstract:

Humans are capable of strategically

...

(See More - 807 more words)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

123

evhub, Nicholas Schiefer, Carson Denison, Ethan Perez

TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.

If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.

The Problem

We don’t currently have ~any strong empirical evidence for the most concerning sources of existential risk, most notably stories around dishonest AI systems that actively trick or fool their training processes or human operators:

Deceptive inner misalignment (a la Hubinger et al. 2019): where a model obtains good performance on the training objective, in

...

(Continue Reading - 5369 more words)

Engineering Monosemanticity in Toy Models

Adam Jermyn, evhub, Nicholas Schiefer

This is a linkpost for https://arxiv.org/abs/2211.09169

Overview

In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are much easier to interpret, because in a sense they only do one thing. By contrast, some neurons are polysemantic, meaning that they fire in response to multiple unrelated features in the input. Polysemantic neurons are much harder to characterize because they can serve multiple distinct functions in a network.

Recently, Elhage+22 and Scherlis+22 demonstrated that architectural choices can affect monosemanticity, raising the prospect that we might be able to engineer models to be more monosemantic. In this work we report preliminary attempts to engineer monosemanticity in toy models.

Toy Model

The simplest architecture that we could study is a one-layer model. However, a core question we wanted to answer is: how does the...

(See More - 827 more words)

ELK Proposal - Make the Reporter care about the Predictor’s beliefs

Adam Jermyn, Nicholas Schiefer

(This proposal received an honorable mention in the ELK prize results, and we believe was classified among strategies which “reward reporters that are sensitive to what’s actually happening in the world”. We do not think that the counterexample to that class of strategies works against our proposal, though, and we have explained why in a note at the end. Feedback, disagreement, and new failure modes are very welcome!)

Basic idea

A Human Simulator only cares about the observations that the human sees and how the human interprets those observations, not the predictor’s understanding of the vault. The Truthful Reporter, by contrast, cares about the predictor’s understanding of the vault, accessed via the posterior distribution returned by the predictor.

We propose a regularizer which favours having the Reporter depend on the...

(Continue Reading - 1639 more words)