AI ALIGNMENT FORUM

Interpretability (ML & AI)

Edited by niplav, Multicore, et al. last updated 22nd Jan 2025

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can observe a model's outputs, but the model cannot explain why it produced them. This opacity makes it hard, for example, to diagnose the causes of bias in ML models.

A prominent subfield of interpretability of neural networks is mechanistic interpretability, which attempts to understand how neural networks perform the tasks they perform, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute an output to some part of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification "horse".
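The input-attribution approach can be illustrated with a toy sketch (not from this article; the linear "classifier" and its weights are purely hypothetical). For a linear model, the gradient of the output score with respect to each input pixel is just that pixel's weight, so a gradient × input saliency map reduces to an elementwise product that exactly decomposes the score across pixels:

```python
# Toy "image": four pixel intensities, flattened.
x = [0.0, 1.0, 0.5, 0.0]

# Hypothetical linear "horse" classifier: score = sum(w_i * x_i).
w = [0.2, 2.0, -0.5, 0.1]

score = sum(wi * xi for wi, xi in zip(w, x))

# For a linear model, d(score)/d(x_i) = w_i, so the
# gradient-times-input attribution for pixel i is w_i * x_i.
attribution = [wi * xi for wi, xi in zip(w, x)]

# The attributions sum exactly to the score, and pixel 1
# is identified as the main driver of the "horse" output.
print(score)        # 1.75
print(attribution)  # [0.0, 2.0, -0.25, 0.0]
```

Real attribution methods (saliency maps, integrated gradients, and the like) apply the same idea to nonlinear networks via automatic differentiation, where the decomposition is only approximate.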

See Also

  • Explainable Artificial Intelligence on Wikipedia
  • Interpretable Machine Learning (textbook)

Research

  • Circuits Thread
  • Transformer Circuits Thread
Posts tagged Interpretability (ML & AI)
  • A small update to the Sparse Coding interim research report — Lee Sharkey, Dan Braun, Beren Millidge (2y; 30 points, 5 comments)
  • Interpretability in ML: A Broad Overview — [anonymous] (5y; 18 points, 0 comments)
  • Timaeus's First Four Months — Jesse Hoogland, Daniel Murfet, Stan van Wingerden, Alexander Gietelink Oldenziel (1y; 80 points, 1 comment)
  • A Mechanistic Interpretability Analysis of Grokking — Neel Nanda, Tom Lieberum (3y; 107 points, 18 comments)
  • Toward A Mathematical Framework for Computation in Superposition — Dmitry Vaintrob, Jake Mendel, Kaarel Hänni (1y; 90 points, 8 comments)
  • [Interim research report] Taking features out of superposition with sparse autoencoders — Lee Sharkey, Dan Braun, Beren Millidge (3y; 69 points, 14 comments)
  • Chris Olah’s views on AGI safety — Evan Hubinger (6y; 66 points, 30 comments)
  • A Longlist of Theories of Impact for Interpretability — Neel Nanda (3y; 47 points, 21 comments)
  • Re-Examining LayerNorm — Eric Winsor (3y; 43 points, 1 comment)
  • 200 Concrete Open Problems in Mechanistic Interpretability: Introduction — Neel Nanda (3y; 39 points, 0 comments)
  • A Problem to Solve Before Building a Deception Detector — Eleni Angelou, lewis smith (5mo; 35 points, 1 comment)
  • Finding Neurons in a Haystack: Case Studies with Sparse Probing — Wes Gurnee, Neel Nanda (2y; 19 points, 1 comment)
  • Tracing the Thoughts of a Large Language Model — Adam Jermyn (3mo; 104 points, 4 comments)
  • How To Go From Interpretability To Alignment: Just Retarget The Search — johnswentworth (3y; 77 points, 24 comments)
  • The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable — Beren Millidge, Sid Black (3y; 69 points, 11 comments)
Showing 15 of 393 tagged posts.