GDM Interp Progress Updates — AI Alignment Forum
36
[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma
2y
0
40
[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma
2y
3
26
The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research
Arthur Conmy, Neel Nanda
1y
0
58
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah, Neel Nanda
1y
6
60
A Pragmatic Vision for Interpretability
Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, lewis smith
7mo
10
33
How Can Interpretability Researchers Help AGI Go Well?
Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, lewis smith
7mo
1
33
Models May Behave Worse When Eval Aware
Senthooran Rajamanoharan, Neel Nanda
8d
0
23
Building and evaluating model diffing agents
bilalchughtai, Josh Engels, Neel Nanda
7d
0
27
SFT Drives Gemini’s Safety Properties
Josh Engels, Arthur Conmy, bilalchughtai, Neel Nanda
6d
0
23
Why Do Naive SFT Filters For Safety Properties Fail?
Josh Engels, Neel Nanda
5d
0
24
Synthetic document finetuning for instilling positive traits
CallumMcDougall, Arthur Conmy, Neel Nanda
3d
0