Interpretability (ML & AI)
• Applied to AXRP Episode 19 - Mechanistic Interpretability with Neel Nanda by DanielFilan at 9h
• Applied to ChatGPT: Tantalizing afterthoughts in search of story trajectories [induction heads] by Bill Benzon at 1d
• Applied to More findings on maximal data dimension by Marius Hobbhahn at 2d
• Applied to More findings on Memorization and double descent by Marius Hobbhahn at 3d
• Applied to No Really, Attention is ALL You Need - Attention can do feedforward networks by Robert_AIZI at 4d
• Applied to Mechanistic Interpretability Quickstart Guide by Neel Nanda at 4d
• Applied to Spooky action at a distance in the loss landscape by Jesse Hoogland at 8d
• Applied to [RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision". by gekaklam at 10d
• Applied to How-to Transformer Mechanistic Interpretability—in 50 lines of code or less! by Stefan Heimersheim at 11d
• Applied to List of links: Formal Methods, Embedded Agency, 3d world models, and some tools by thegearstoascension at 12d
• Applied to Deconfusing "Capabilities vs. Alignment" by RobertM at 12d
• Applied to Large language models learn to represent the world by Andrea_Miotti at 13d
• Applied to Transformer Mech Interp: Any visualizations? by Raymond Arnold at 17d
• Applied to 200 COP in MI: Studying Learned Features in Language Models by Neel Nanda at 17d
• Applied to Reflections on Trusting Trust & AI by Itay Yona at 19d
• Applied to Neural networks generalize because of this one weird trick by Jesse Hoogland at 19d
• Applied to How does GPT-3 spend its 175B parameters? by Raymond Arnold at 22d
• Applied to Can we efficiently distinguish different mechanisms? by Multicore at 22d