
Interpretability (ML & AI)


Transparency and interpretability refer to the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model's output, but the model cannot tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand how neural networks perform their tasks, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute an output to part of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification "horse"...
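As a rough illustration of the input-attribution idea mentioned above, here is a minimal sketch of gradient-times-input attribution. It assumes a toy linear classifier rather than a real vision model; the weights, the 2x2 "image", and the function name `saliency_map` are all made up for illustration and do not come from any post on this page.

```python
import numpy as np

def saliency_map(weights, image, target_class):
    """Gradient-times-input attribution for a linear classifier.

    For logits = weights @ image.ravel(), the gradient of the target logit
    with respect to each pixel is just the matching weight, so the
    attribution for a pixel is weight * pixel value.
    """
    flat = image.ravel()
    grad = weights[target_class]  # d(logit_target) / d(pixel)
    return (grad * flat).reshape(image.shape)

# Toy 2x2 "image" and a hypothetical 2-class linear model (made-up weights).
image = np.array([[1.0, 0.0],
                  [0.5, 2.0]])
weights = np.array([[0.1, 0.3, -0.2, 0.4],    # class 0, standing in for "horse"
                    [0.0, -0.1, 0.5, -0.3]])  # class 1

attribution = saliency_map(weights, image, target_class=0)

# Pixel with the largest absolute contribution to the class-0 logit.
top_pixel = np.unravel_index(np.argmax(np.abs(attribution)), attribution.shape)
print(attribution)
print(top_pixel)
```

For a linear model the attributions sum exactly to the target logit, which makes the toy case easy to check by hand. Real networks are nonlinear, so in practice the gradient would come from automatic differentiation and the resulting map is only a local approximation.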

Posts tagged Interpretability (ML & AI)

6

30

A small update to the Sparse Coding interim research report

Lee Sharkey, Dan Braun, Beren Millidge

1y

5

2

18

Interpretability in ML: A Broad Overview

[anonymous]

4y

0

5

43

Re-Examining LayerNorm

2y

1

3

66

Chris Olah’s views on AGI safety

5y

30

2

48

A Longlist of Theories of Impact for Interpretability

2y

18

3

38

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

2y

0

3

19

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Wes Gurnee, Neel Nanda

1y

0

2

107

A Mechanistic Interpretability Analysis of Grokking

Neel Nanda, Tom Lieberum

2y

17

1

69

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

Beren Millidge, Sid Black

2y

11

2

73

How To Go From Interpretability To Alignment: Just Retarget The Search

2y

24

2

48

Searching for Search

Nicholas Kees Dupuis, janus

2y

0

1

142

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow, mwatkins

1y

16

2

93

Against Almost Every Theory of Impact of Interpretability

Charbel-Raphael Segerie

1y

7

2

76

A transparency and interpretability tech tree

2y

10

4

32

Residual stream norms grow exponentially over the forward pass

Stefan Heimersheim, Alex Turner

1y

6