Apollo Research (org) — AI Alignment Forum

Apollo Research (org)

This page is a stub.

Posts tagged Apollo Research (org)

101 SAE feature geometry is outside the superposition hypothesis

2y

1

90 Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer

3y

4

54 You can remove GPT2’s LayerNorm by fine-tuning for an hour

2y

3

52 A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill

2y

7

45 Attribution-based parameter decomposition

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey

1y

5

56 The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn

2y

1

43 Sparsify: A mechanistic interpretability research agenda

2y

17

42 Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb

2y

0

39 We need a Science of Evals

Marius Hobbhahn, Jérémy Scheurer

2y

9

34 [Interim research report] Activation plateaus & sensitive directions in GPT2

StefanHex, jake_mendel

2y

0

34 An Opinionated Evals Reading List

Marius Hobbhahn, Jérémy Scheurer

2y

0

33 Understanding strategic deception and deceptive alignment

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, Dan Braun

3y

0

26 A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb, AlexMeinke

2y

0

28 Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

2y

8

25 Theories of Change for AI Auditing

Lee Sharkey, beren, Marius Hobbhahn

3y

0
