AI ALIGNMENT FORUM
Interpretability (ML & AI)
• Applied to Refusal mechanisms: initial experiments with Llama-2-7b-chat by Nina Rimsky (2h ago)
• Applied to Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment by Roger Dearnaley (2d ago)
• Applied to Deep Forgetting & Unlearning for Safely-Scoped LLMs by Stephen Casper (3d ago)
• Applied to Mechanistic interpretability through clustering by Alistair Fraser (4d ago)
• Applied to Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study by Karolis Ramanauskas (4d ago)
• Applied to Why using activation for interpreting GPT-2? by sprout_ust (6d ago)
• Applied to How useful is mechanistic interpretability? by Ryan Greenblatt (8d ago)
• Applied to Intro to Superposition & Sparse Autoencoders (Colab exercises) by CallumMcDougall (9d ago)
• Applied to A day in the life of a mechanistic interpretability researcher by Bill Benzon (10d ago)
• Applied to AISC project: TinyEvals by Jett Janiak (16d ago)
• Applied to A framing for interpretability by Nina Rimsky (24d ago)
• Applied to Incidental polysemanticity by Victor Lecomte (25d ago)
• Applied to Is Interpretability All We Need? by Roger Dearnaley (25d ago)
• Applied to Eliciting Latent Knowledge in Comprehensive AI Services Models by acabodi (25d ago)
• Applied to AISC Project: Modelling Trajectories of Language Models by Nicky Pochinkov (25d ago)
niplav, v1.4.0, Nov 10th 2023 (+40)
Explainable Artificial Intelligence on Wikipedia
Transformer Circuits
Interpretable Machine Learning, textbook