AI ALIGNMENT FORUM

Nicholas Goldowsky-Dill
Ω118060

Interpretability Researcher at Apollo Research

Posts

Karma | Title | Age | Comments
0 | Nicholas Goldowsky-Dill's Shortform | 10mo | 0
26 | Stress Testing Deliberative Alignment for Anti-Scheming Training | 12h | 0
71 | Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations | 6mo | 1
46 | Detecting Strategic Deception Using Linear Probes | 7mo | 0
52 | A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team | 1y | 7
42 | Apollo Research 1-year update | 1y | 0
56 | The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks | 1y | 1
28 | Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | 1y | 8
22 | Causal scrubbing: results on induction heads | 3y | 0
20 | Causal scrubbing: results on a paren balance checker | 3y | 2
11 | Causal scrubbing: Appendix | 3y | 1