AI ALIGNMENT FORUM

Interpretability (ML & AI)

You are viewing revision 1.4.0, last edited by niplav

Transparency and interpretability refer to the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model's output, but the model can't tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand how neural networks perform their tasks, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute an output to part of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification "horse"....
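As a rough illustration of the input-attribution idea described above, the sketch below scores each pixel by occlusion: it replaces one pixel at a time with a baseline value and records how much the probability of the target class drops. Everything here is an assumption made for illustration, including the toy linear classifier, the four-pixel "image", the weights, and the function names; it is not taken from any of the tagged posts, and real attribution methods for vision models are considerably more involved.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, biases, pixels):
    """Toy linear classifier (hypothetical): one weight row per class."""
    scores = [sum(w * p for w, p in zip(row, pixels)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

def occlusion_attribution(weights, biases, pixels, target, baseline=0.0):
    """Score each pixel by the drop in target-class probability
    when that pixel alone is replaced by the baseline value."""
    base_prob = predict(weights, biases, pixels)[target]
    attributions = []
    for i in range(len(pixels)):
        occluded = list(pixels)
        occluded[i] = baseline
        occluded_prob = predict(weights, biases, occluded)[target]
        attributions.append(base_prob - occluded_prob)
    return attributions

# Hypothetical 4-pixel "image" and 2-class model
# (class 0 stands in for "horse", class 1 for "other").
WEIGHTS = [[2.0, 0.0, -1.0, 0.5],
           [0.0, 1.0, 1.0, 0.0]]
BIASES = [0.0, 0.0]
IMAGE = [1.0, 0.2, 0.1, 0.8]

if __name__ == "__main__":
    scores = occlusion_attribution(WEIGHTS, BIASES, IMAGE, target=0)
    for i, s in enumerate(scores):
        print(f"pixel {i}: {s:+.4f}")
```

A positive score means that removing the pixel lowers the "horse" probability, so the pixel supported that classification; a negative score means the pixel was evidence against it. This kind of analysis says which inputs mattered for one example, whereas mechanistic interpretability asks how the network's internals compute the answer.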

Posts tagged Interpretability (ML & AI)

Relevance | Karma | Title | Authors | Age | Comments
6 | 30 | A small update to the Sparse Coding interim research report | Lee Sharkey, Beren Millidge | 2y | 5
2 | 18 | Interpretability in ML: A Broad Overview | [anonymous] | 4y | 0
5 | 43 | Re-Examining LayerNorm | - | 2y | 1
5 | 69 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey, Beren Millidge | 2y | 14
3 | 66 | Chris Olah’s views on AGI safety | - | 5y | 30
2 | 46 | A Longlist of Theories of Impact for Interpretability | - | 3y | 18
3 | 39 | 200 Concrete Open Problems in Mechanistic Interpretability: Introduction | - | 2y | 0
3 | 19 | Finding Neurons in a Haystack: Case Studies with Sparse Probing | Wes Gurnee, Neel Nanda | 2y | 0
2 | 107 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda, Tom Lieberum | 2y | 17
2 | 74 | How To Go From Interpretability To Alignment: Just Retarget The Search | - | 2y | 24
1 | 69 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | Beren Millidge, Sid Black | 2y | 11
2 | 50 | Searching for Search | Nicholas Kees Dupuis, janus | 2y | 0
1 | 142 | SolidGoldMagikarp (plus, prompt generation) | Jessica Rumbelow, mwatkins | 2y | 16
2 | 96 | Against Almost Every Theory of Impact of Interpretability | Charbel-Raphael Segerie | 1y | 7
2 | 76 | A transparency and interpretability tech tree | - | 2y | 10