AI ALIGNMENT FORUM

Apollo Research (org)

This page is a stub.
Posts tagged Apollo Research (org)
SAE feature geometry is outside the superposition hypothesis
jake_mendel
Karma: 101 | Age: 1y | Comments: 0

Announcing Apollo Research
Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer
Karma: 89 | Age: 2y | Comments: 4

You can remove GPT2’s LayerNorm by fine-tuning for an hour
StefanHex
Karma: 54 | Age: 1y | Comments: 3

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill
Karma: 52 | Age: 1y | Comments: 7

Attribution-based parameter decomposition
Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey
Karma: 45 | Age: 10mo | Comments: 5

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn
Karma: 56 | Age: 1y | Comments: 1

Sparsify: A mechanistic interpretability research agenda
Lee Sharkey
Karma: 43 | Age: 2y | Comments: 17

Apollo Research 1-year update
Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb
Karma: 42 | Age: 1y | Comments: 0

We need a Science of Evals
Marius Hobbhahn, Jérémy Scheurer
Karma: 37 | Age: 2y | Comments: 9

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex, jake_mendel
Karma: 35 | Age: 1y | Comments: 0

An Opinionated Evals Reading List
Marius Hobbhahn, Jérémy Scheurer
Karma: 34 | Age: 1y | Comments: 0

Understanding strategic deception and deceptive alignment
Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, Dan Braun
Karma: 33 | Age: 2y | Comments: 0

A starter guide for evals
Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb, AlexMeinke
Karma: 26 | Age: 2y | Comments: 0

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
Karma: 28 | Age: 1y | Comments: 8

Theories of Change for AI Auditing
Lee Sharkey, beren, Marius Hobbhahn
Karma: 25 | Age: 2y | Comments: 0