New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread. Abstract: > Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating...
Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and is excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [1] * Carefully choosing problems according to our comparative...
Can you tell when an LLM is lying from its activations? Are simple methods good enough? We recently published a paper investigating whether linear probes can detect when Llama is being deceptive. Abstract: > AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient,...
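For readers unfamiliar with the setup, a "linear probe" here just means fitting a linear classifier on the model's activations. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's actual pipeline: the `get_activations` helper, the layer choice, and the label format are all hypothetical.

```python
# Minimal sketch of a linear deception probe on residual-stream activations.
# Assumptions (not from the paper): activations are mean-pooled per example and
# collected with a hypothetical get_activations() helper; labels mark
# deceptive (1) vs honest (0) responses.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_linear_probe(acts: np.ndarray, labels: np.ndarray, seed: int = 0):
    """acts: (n_examples, d_model) activations; labels: (n_examples,) binary."""
    X_train, X_test, y_train, y_test = train_test_split(
        acts, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000)  # a single linear decision boundary
    probe.fit(X_train, y_train)
    return probe, probe.score(X_test, y_test)

# Usage (hypothetical helper and shapes):
# acts = get_activations(model, prompts, layer=20)   # (n, d_model)
# probe, acc = train_linear_probe(acts, labels)
# print(f"held-out probe accuracy: {acc:.2f}")
```

The point of keeping the probe this simple is the question in the post: if a linear direction in activation space already separates deceptive from honest behaviour, simple methods may be good enough.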
TL;DR: This paper brings together ~30 mechanistic interpretability researchers from 18 different research orgs to review current progress and the main open problems of the field. This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences....
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: such decompositions seem likely to find features of the activations - features that help explain the statistical structure of activation spaces, rather than features of the model...
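To make "decomposing an activation space in isolation" concrete, the toy sketch below fits a small sparse autoencoder on activations from a single layer, with no reference to the model's weights or other layers. It is an illustrative example of the general setup being discussed, not the specific methods the post critiques; dimensions and hyperparameters are arbitrary assumptions.

```python
# Toy illustration of decomposing one layer's activation space in isolation:
# a small sparse autoencoder trained only on that layer's activations.
# Shapes and hyperparameters are arbitrary assumptions for the sketch.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse "feature" activations
        recon = self.decoder(codes)             # reconstruction of the activations
        return recon, codes

def train_sae(acts: torch.Tensor, d_dict: int = 4096,
              l1_coeff: float = 1e-3, steps: int = 1000):
    sae = SparseAutoencoder(acts.shape[-1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, codes = sae(acts)
        # The loss only sees the activations themselves (reconstruction + sparsity),
        # which is what "in isolation" means here: nothing ties the learned features
        # to how the rest of the model actually uses this layer.
        loss = (recon - acts).pow(2).mean() + l1_coeff * codes.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# acts = collect_layer_activations(model, data, layer=12)  # hypothetical helper, (n, d_model)
# sae = train_sae(acts)
```

The worry raised in the post is that features learned this way can reflect the statistical structure of the activation distribution rather than the computation the model performs with those activations.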
This is an informal research note. It is the result of a few days' exploration of RMU (Representation Misdirection for Unlearning) through the lens of model internals. Code to reproduce the main result is available here. This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program -...