AI ALIGNMENT FORUM
Interpretability (ML & AI)
• Applied to [Full Post] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda (16h ago)
• Applied to [Summary] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda (21h ago)
• Applied to Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Raymond Arnold (2d ago)
• Applied to Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai (4d ago)
• Applied to Experiments with an alternative method to promote sparsity in sparse autoencoders by Eoin Farrell (4d ago)
• Applied to Barcoding LLM Training Data Subsets. Anyone trying this for interpretability? by right..enough? (8d ago)
• Applied to Scaling Laws and Superposition by Pavan Katta (10d ago)
• Applied to Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition by Marius Hobbhahn (12d ago)
• Applied to Normalizing Sparse Autoencoders by Fengyuan Hu (12d ago)
• Applied to DSLT 0. Distilling Singular Learning Theory by Jaehyuk Lim (14d ago)
• Applied to Ophiology (or, how the Mamba architecture works) by Danielle Ensign (17d ago)
• Applied to Sparsify: A mechanistic interpretability research agenda by Marius Hobbhahn (17d ago)
• Applied to A Selection of Randomly Selected SAE Features by CallumMcDougall (19d ago)
• Applied to SAE-VIS: Announcement Post by Raymond Arnold (20d ago)