AI ALIGNMENT FORUM

Apollo Research (org)

This page is a stub.
Posts tagged Apollo Research (org)
101 SAE feature geometry is outside the superposition hypothesis
jake_mendel
1y
0
89 Announcing Apollo Research
Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer
2y
4
54 You can remove GPT2’s LayerNorm by fine-tuning for an hour
StefanHex
1y
3
52 A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill
1y
7
45 Attribution-based parameter decomposition
Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey
9mo
5
56 The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn
1y
1
43 Sparsify: A mechanistic interpretability research agenda
Lee Sharkey
2y
17
42 Apollo Research 1-year update
Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb
1y
0
37 We need a Science of Evals
Marius Hobbhahn, Jérémy Scheurer
2y
9
35 [Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex, jake_mendel
1y
0
34 An Opinionated Evals Reading List
Marius Hobbhahn, Jérémy Scheurer
1y
0
33 Understanding strategic deception and deceptive alignment
Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, Dan Braun
2y
0
28 Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
1y
8
26 A starter guide for evals
Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb, AlexMeinke
2y
0
25 Theories of Change for AI Auditing
Lee Sharkey, beren, Marius Hobbhahn
2y
0