AI ALIGNMENT FORUM
Nicholas Goldowsky-Dill
Interpretability Researcher at Apollo Research
Posts
0
Nicholas Goldowsky-Dill's Shortform
10mo
0
26
Stress Testing Deliberative Alignment for Anti-Scheming Training
12h
0
71
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
6mo
1
46
Detecting Strategic Deception Using Linear Probes
7mo
0
52
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
1y
7
42
Apollo Research 1-year update
1y
0
56
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
1y
1
28
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
1y
8
22
Causal scrubbing: results on induction heads
3y
0
20
Causal scrubbing: results on a paren balance checker
3y
2
11
Causal scrubbing: Appendix
3y
1