AI ALIGNMENT FORUM
Interpretability (ML & AI)
• Applied to "Decompiling Tracr Transformers - An interpretability experiment" by Hannes Thurnherr, 2d ago
• Applied to "Some findings from training SAEs on Othello-GPT" by vlad k, 2d ago
• Applied to "Towards White Box Deep Learning" by Maciej Satkiewicz, 3d ago
• Applied to "Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders" by Johnny Lin, 3d ago
• Applied to "Mechanism for feature learning in neural networks and backpropagation-free machine learning models" by Matt Goldenberg, 10d ago
• Applied to "AtP*: An efficient and scalable method for localizing LLM behaviour to components" by Neel Nanda, 11d ago
• Applied to "Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features" by Logan Riggs Smith, 14d ago
• Applied to "Sparse autoencoders find composed features in small toy models" by Evan Anders, 15d ago
• Applied to "Can quantised autoencoders find and interpret circuits in language models?" by Charlie O'Neill, 15d ago
• Applied to "Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems" by Sonia Joseph, 17d ago
• Applied to "Understanding SAE Features with the Logit Lens" by Joseph Isaac Bloom, 18d ago
• Applied to "Exploring the Evolution and Migration of Different Layer Embedding in LLMs" by Ruixuan Huang, 21d ago
• Applied to "Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT" by Robert_AIZI, 24d ago
• Applied to "Anomalous Concept Detection for Detecting Hidden Cognition" by Paul Colognese, 25d ago
• Applied to "We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To" by robertzk, 1mo ago
• Applied to "What's in the box?! – Towards interpretability by distinguishing niches of value within neural networks." by Joshua Clancy, 1mo ago