AI ALIGNMENT FORUM

GDM Interp Progress Updates

Apr 19, 2024 by Neel Nanda

[Summary] Progress Update #1 from the GDM Mech Interp Team (36 karma)

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

2y

0

[Full Post] Progress Update #1 from the GDM Mech Interp Team (40 karma)

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma

2y

3

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research (26 karma)

Arthur Conmy, Neel Nanda

1y

0

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) (58 karma)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda

1y

6

A Pragmatic Vision for Interpretability (60 karma)

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

7mo

10

How Can Interpretability Researchers Help AGI Go Well? (33 karma)

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, lewis smith

7mo

1

Models May Behave Worse When Eval Aware (33 karma)

Senthooran Rajamanoharan, Neel Nanda

8d

0

Building and evaluating model diffing agents (23 karma)

bilalchughtai, Josh Engels, Neel Nanda

7d

0

SFT Drives Gemini’s Safety Properties (27 karma)

Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda

6d

0

Why Do Naive SFT Filters For Safety Properties Fail? (23 karma)

Josh Engels, Neel Nanda

5d

0

Synthetic document finetuning for instilling positive traits (24 karma)

CallumMcDougall, Arthur Conmy, Neel Nanda

3d

0