AI Alignment Forum: Tags

Interpretability (ML & AI)

• Applied to [Full Post] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda 16h ago
• Applied to [Summary] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda 21h ago
• Applied to Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Raymond Arnold 2d ago
• Applied to Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai 4d ago
• Applied to Experiments with an alternative method to promote sparsity in sparse autoencoders by Eoin Farrell 4d ago
• Applied to Barcoding LLM Training Data Subsets. Anyone trying this for interpretability? by right..enough? 8d ago
• Applied to Scaling Laws and Superposition by Pavan Katta 10d ago
• Applied to Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition by Marius Hobbhahn 12d ago
• Applied to Normalizing Sparse Autoencoders by Fengyuan Hu 12d ago
• Applied to DSLT 0. Distilling Singular Learning Theory by Jaehyuk Lim 14d ago
• Applied to Ophiology (or, how the Mamba architecture works) by Danielle Ensign 17d ago
• Applied to Sparsify: A mechanistic interpretability research agenda by Marius Hobbhahn 17d ago
• Applied to A Selection of Randomly Selected SAE Features by CallumMcDougall 19d ago
• Applied to SAE-VIS: Announcement Post by Raymond Arnold 20d ago