AI ALIGNMENT FORUM

Apollo Research (org)

This page is a stub.
Posts tagged Apollo Research (org)
102 SAE feature geometry is outside the superposition hypothesis
Jake Mendel
1y
0
89 Announcing Apollo Research
Marius Hobbhahn, Beren Millidge, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer
2y
4
53 You can remove GPT2’s LayerNorm by fine-tuning for an hour
Stefan Heimersheim
11mo
3
52 A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey, Lucius Bushnaq, Dan Braun, Stefan Heimersheim, Nicholas Goldowsky-Dill
1y
7
45 Attribution-based parameter decomposition
Lucius Bushnaq, Dan Braun, Stefan Heimersheim, Jake Mendel, Lee Sharkey
5mo
5
56 The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq, Jake Mendel, Dan Braun, Stefan Heimersheim, Nicholas Goldowsky-Dill, Kaarel Hänni, Avery, Joern Stoehler, Cindy Wu, Magdalena Wache, Marius Hobbhahn
1y
1
43 Sparsify: A mechanistic interpretability research agenda
Lee Sharkey
1y
17
42 Apollo Research 1-year update
Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, Stefan Heimersheim, Jake Mendel, AlexMeinke, rusheb
1y
0
38 We need a Science of Evals
Marius Hobbhahn, Jérémy Scheurer
1y
9
35 [Interim research report] Activation plateaus & sensitive directions in GPT2
Stefan Heimersheim, Jake Mendel
1y
0
34 An Opinionated Evals Reading List
Marius Hobbhahn, Jérémy Scheurer
9mo
0
33 Understanding strategic deception and deceptive alignment
Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, Dan Braun
2y
0
28 Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
1y
8
26 A starter guide for evals
Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb, AlexMeinke
2y
0
25 Theories of Change for AI Auditing
Lee Sharkey, Beren Millidge, Marius Hobbhahn
2y
0