This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.

TL;DR

Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we’ve pivoted to accelerating Sparse Autoencoder (SAE) researchers by hosting models, feature dashboards, data visualizations, tooling, and more.

Neuronpedia has received 1 year of funding from LTFF. Johnny Lin is full-time on engineering, design, and product, while Joseph Bloom is supporting with high-level direction and product management. We’d love to talk to you about how Neuronpedia can speed up your SAE research. Fill out this short form to get in touch. 

 

Introduction

Why Neuronpedia?

  • While SAEs are exciting, they introduce a number of engineering challenges which can slow or even prevent research from being done. These challenges arise because SAEs decompose neural networks into numerous distinct features which we are trying to understand.
  • Even when working with relatively small language models like GPT2-Small, there are challenges in producing and storing feature dashboards, automatically interpreting features, and organizing the insights we collect about features.
  • As the community produces more SAEs, trains SAEs on larger models, and develops better techniques for understanding and analyzing these artifacts, there’s going to be an ever-increasing number of engineering challenges in the way of research into model internals.

What It's Already Useful For

  • Generating and hosting more than 600k SAE feature dashboards. This includes GPT2-small residual stream SAEs for Joseph Bloom (RES-JB) and attention out SAEs for Connor Kissane and Robert Kryzanowski (ATT-KK). 
  • Running Python servers to enable prompt-based feature search and feature-specific prompt testing. This makes it easier for researchers and the community to qualitatively test features, write up results, and share SAEs.
  • Generating automatic explanations of features with GPT-3.5-Turbo, which can be navigated via UMAPs. The automatic explanations are not currently amazing, but hopefully we can iterate and improve on this.

Making It More Useful

  1. Researchers / Labs publish SAEs that are uploaded to Neuronpedia: There are many different labs training SAEs so it’s a safe bet that there will be many SAEs in the public domain in the near future. We make it easy for researchers to host their SAEs and enable functionalities like feature testing, UMAPs, and more.
  2. Generate feature dashboards, run automatic interpretability, and benchmark SAEs: Feature dashboards are currently being produced using Callum’s reproduction of the Anthropic feature dashboards. Plausibly these can be improved over time to provide richer insights into features. Automatic explanations of features are provided using OpenAI’s method. We’ll explore adding SAE benchmarks, which seem like an important part of quality control for SAEs.
  3. Work with researchers to add functionality that accelerates research: Given the scale of SAEs, there are likely lots of ways we can make it easier to conduct or share research with the platform. This might involve things like developing more functionality for circuit style analysis or supporting novel techniques for making sense of SAEs. 

We’re inspired by analogies to biology and bioinformatics and expect many different framings to be useful. Modern biosciences leverage vast amounts of shared infrastructure for storing and analyzing genomic, proteomic and phenotypic data. The study of neural networks may benefit from similar resources which enable researchers to share results, curate insights and perform experiments. 

Strategy: Iterate Very Quickly

We'll be the first to admit that we don't know exactly what the most useful functionalities for researchers will be - nobody does. We fully expect that some of the things we build will not be useful and will be killed off or deprecated. That's fine. We expect to continuously take feedback + ideas, then ship them quickly, in order to learn what works and what doesn't.

We also anticipate that some features will only be useful for a short period of time. For example, we might build a specific visualizer for researchers to test a certain hypothesis. Once the hypothesis has been sufficiently evaluated by multiple researchers collaboratively on Neuronpedia, the visualizer becomes less useful. In our view, Neuronpedia is doing well if it can continuously make itself obsolete by iterating quickly on the qualitative research that is necessary for early scientific fields to mature.

 

The rest of this post goes into more detail about Neuronpedia, including simple demos. We highly recommend going to Neuronpedia and interacting with it.

 

Current Neuronpedia Functionality

Hosting SAE Feature Dashboards

Anthropic released feature dashboards for their SAEs, which were then reproduced by Callum McDougall in SAE Vis. Neuronpedia uses SAE Vis and a database of feature-activating examples to provide an interface for hosted SAE features.

An example feature page with a feature dashboard, explanations, lists, comments, and more.
GPT2-Small Residual Stream - Layer 9 - Feature 21474

We plan to provide documentation that makes the current dashboard / interface more accessible in the future, but in the meantime, refer to the Anthropic documentation.
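To make the pipeline concrete, here is a minimal sketch (not Neuronpedia's actual code) of how feature-activating examples - the raw material behind a dashboard - can be gathered for a single feature with TransformerLens. The SAE weights `W_enc`, `b_enc`, `b_dec` and the text iterable `corpus` are assumptions standing in for whatever SAE and dataset you load; the layer and feature index match the example above.

```python
# A minimal sketch of gathering feature-activating examples for one SAE feature.
# W_enc, b_enc, b_dec are assumed SAE weights loaded as torch tensors
# (shapes [d_model, d_sae], [d_sae], [d_model]); `corpus` is any iterable of strings.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER, FEATURE_IDX = 9, 21474
HOOK = utils.get_act_name("resid_pre", LAYER)

def sae_encode(resid: torch.Tensor) -> torch.Tensor:
    # Anthropic-style encoder: subtract the decoder bias, project, ReLU.
    return torch.relu((resid - b_dec) @ W_enc + b_enc)

top_examples = []
for text in corpus:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens, names_filter=HOOK)
    acts = sae_encode(cache[HOOK])[0, :, FEATURE_IDX]          # one activation per token
    max_act, pos = acts.max(dim=0)
    top_examples.append((max_act.item(), model.to_str_tokens(tokens)[pos], text))

top_examples.sort(reverse=True)                                # strongest activations first
print(top_examples[:20])
```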

Feature Testing

Since maximum activating examples can be misleading, Neuronpedia runs servers with language models and SAEs which let users test SAE features with new text. We find this to be a crucial component of validating researcher understanding of a feature. 

A GIF of testing a SAE feature with a custom sentence. The feature seems to be Star Wars related, and we test it with a sentence about Obi-Wan and Palpatine to confirm that those tokens indeed activate it.
GPT2 Small Residual Stream - Layer 9 - Feature 21474
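The core of such a test is small. Here is a rough sketch of checking how strongly a single feature fires on each token of a custom sentence, reusing `model`, `sae_encode`, and `HOOK` from the sketch above (all assumptions rather than Neuronpedia internals):

```python
# A rough sketch of feature testing: per-token activations of one feature on custom text.
prompt = "Obi-Wan sensed that Palpatine was hiding something."
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens, names_filter=HOOK)
acts = sae_encode(cache[HOOK])[0, :, 21474]        # feature 21474, one value per token

for tok, act in zip(model.to_str_tokens(prompt), acts.tolist()):
    print(f"{tok!r:>15}  {act:6.3f}")
```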

Automatic Explanations and UMAP for Exploration

While imperfect, OpenAI’s method for automatic neuron interpretability is a useful start for automating interpretability of SAE features. We’ve automatically interpreted Joseph Bloom’s residual stream SAE features and will explore providing automatic interpretation for new SAEs as they are uploaded. 

A GIF of exploring layer 9 residual SAEs via UMAP, and saving a custom list with selected directions (UMAP-created list)
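For readers who want to build a similar map themselves, the basic recipe is to embed each feature's decoder direction into two dimensions. A minimal sketch with umap-learn, assuming `W_dec` is the SAE decoder matrix of shape [d_sae, d_model]:

```python
# A minimal sketch of a feature UMAP: embed normalised decoder directions into 2D.
import numpy as np
import umap  # pip install umap-learn

directions = W_dec.detach().cpu().numpy()          # assumed shape [d_sae, d_model]
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

embedding = umap.UMAP(n_components=2, metric="cosine").fit_transform(directions)
# embedding[i] is a 2D point for feature i; nearby points tend to be related features,
# which is what makes the map useful for browsing automatic explanations.
```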

Live Feature Inference

To understand networks better, we want to see which features fire on specific prompts. This is essential for applying SAEs to circuit analysis and other applications of SAEs. We’re currently iterating on an interface for showing SAE decompositions of model activations, which also supports filtering by specific tokens and/or layers:

A GIF of searching ~300,000 features in Joseph Bloom’s GPT2-Small Residual SAEs, running live inference on a custom sentence. Results are sorted by highest activating neurons. You can then filter by specific layers or tokens, or even by the sum of multiple tokens. (See the exact search here)
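Under the hood, live inference for a single layer amounts to encoding the prompt's activations with the SAE and sorting the resulting feature activations; the real search simply runs one SAE per hooked layer. A sketch, again reusing the assumed `model`, `sae_encode`, and `HOOK` from earlier:

```python
# A sketch of live feature inference for one layer: which features fire most on a prompt?
prompt = "The Senate will decide your fate."
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens, names_filter=HOOK)
acts = sae_encode(cache[HOOK])[0]                  # [seq, d_sae]

str_tokens = model.to_str_tokens(prompt)
values, flat_idx = acts.flatten().topk(10)
for val, idx in zip(values.tolist(), flat_idx.tolist()):
    pos, feature = divmod(idx, acts.shape[1])
    print(f"feature {feature:6d} fires {val:6.3f} on token {str_tokens[pos]!r}")
```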

Enabling Collaboration

Similar to how internet platforms helped accelerate researchers across the globe working on the human genome, we think there's potential for significant value-add by helping SAE researchers collaborate more easily and effectively through Neuronpedia. To that end, Neuronpedia has features that might better enable consensus building and collaboration.

  • Sharable by Default: Nearly everything is directly sharable and linkable by default. This allows easy direct-linking to any SAE, layer, neuron/feature, or list, for any model. This makes the data-sharing part of collaboration way easier.
  • Searching and comparing across SAEs: Neuronpedia wants to enable finding the best SAEs (and to identify the strengths/weaknesses of each set of SAEs). This is made possible by having a unified platform to interactively test and compare different SAEs - something that is more difficult in the status quo, where SAEs (and their data) exist in different places and in wildly different formats.
  • Feature Curation + Commenting
    • Starring: Users can bookmark features they find interesting by starring them.
    • Lists: Users can curate lists of features which they want to share with others. Joseph Bloom has already made extensive use of this feature in a recent LessWrong post, and it’s clear lists could also be useful in other write-ups, such as this post by Connor Kissane and Robert Kryzanowski.
    • Commenting: Both features and lists facilitate commenting.
An example of a feature list. See the list here.

 

Future Work

As mentioned in the introduction, nobody knows exactly which features will be most useful for researchers. But we do have some initial hunches - here are a select few:

Circuit Analysis

Various researchers are attempting to use SAEs for circuit analysis. Supporting these researchers with features similar to OpenAI’s transformer debugger may be useful (e.g. logit lens, attribution, paired prompt comparisons), though we don’t want to put in a lot of effort just to recreate Jupyter notebooks.
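As one small example of the kind of tooling we have in mind, a logit lens for an SAE feature just projects its decoder direction through the unembedding. A sketch, assuming the `model` and `W_dec` ([d_sae, d_model]) from the earlier sketches; it ignores the final LayerNorm, as the logit lens usually does:

```python
# A sketch of a feature logit lens: which output tokens does this feature's direction promote?
direction = W_dec[21474]                           # [d_model]
logits = direction @ model.W_U                     # [d_vocab]
top = logits.topk(10)
print(model.to_str_tokens(top.indices), top.values.tolist())
```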

Understanding + Red-Teaming Features

We already provide the ability to type in text on feature dashboards, which enables a weak form of red-teaming. However, we could expand functionality here, such as integrating language models like GPT-4 to assist users in generating text to test features. Other functionality in this category may include finding similar features, automatically annotating features via token set enrichment, or testing for feature types like long/short prefix induction features.
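"Finding similar features", for instance, can be approximated by ranking features by the cosine similarity of their decoder directions. A sketch, assuming the same `W_dec` as above:

```python
# A sketch of finding similar features via decoder-direction cosine similarity.
import torch.nn.functional as F

query = W_dec[21474]                               # direction of the feature of interest
sims = F.cosine_similarity(W_dec, query.unsqueeze(0), dim=-1)   # [d_sae]
nearest = sims.topk(11)                            # top hit is the feature itself
print(list(zip(nearest.indices.tolist(), nearest.values.tolist())))
```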

Quality Control and Benchmarking

As we develop better methods for training SAEs, it will be useful to provide some level of quality assurance around public SAEs which are being shared by researchers. Some work has already been published benchmarking SAEs, and we’d be excited to facilitate further work and record results on the platform for all hosted SAEs.
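Two of the simplest metrics in this direction are L0 (the average number of active features per token) and reconstruction error. A sketch, assuming the `sae_encode`, `W_dec`, and `b_dec` from the earlier sketches plus a batch of residual-stream activations `resid` of shape [batch, seq, d_model]; "loss recovered" additionally requires patching the reconstruction back into the model:

```python
# A sketch of two basic SAE quality metrics: L0 and reconstruction MSE.
feature_acts = sae_encode(resid)                   # [batch, seq, d_sae]
reconstruction = feature_acts @ W_dec + b_dec      # [batch, seq, d_model]

l0 = (feature_acts > 0).float().sum(dim=-1).mean() # avg number of active features per token
mse = (reconstruction - resid).pow(2).mean()
print(f"L0: {l0:.1f}   reconstruction MSE: {mse:.4f}")
```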


 

FAQ

Who’s involved with the project?

Johnny Lin is an ex-Apple engineer who built privacy/consumer apps before going full-time into interpretability last year. His previous apps had over 1M organic downloads, and his writing has appeared in the Washington Post, Forbes, FastCompany, and others. Johnny is a contributor to the Automated Interpretability repository.

Joseph Bloom is an independent researcher in mechanistic interpretability working on Sparse Autoencoders. Before working in AI safety, Joseph worked as a data scientist at a software startup, where he worked with academic labs and biopharmaceutical companies to process and extract insights from data. Joseph is currently doing MATS under Neel Nanda and has recently been working on various SAE-related projects.

I’d like to upload my SAE weights to Neuronpedia. How do I do this? 

To upload your SAEs, fill out this <5 minute form to get started.

This sounds cool, how can I help?

To get involved, join the Open Source Mechanistic Interp Slack (click here), and then join the #neuronpedia channel.

An easy way to help is to use Neuronpedia. Explore GPT2-Small and the SAEs, find interesting patterns and features, zoom around the UMAP, search for things, etc. You can also make lists and comment on features. SAE researchers love nothing more than to find new insights on their SAEs!

Along the way, we’re sure you’ll find bugs and have feature requests and questions. That’s a critical part of helping out Neuronpedia. Please report these in the Slack channel (or via DM), or by emailing me at johnny@neuronpedia.org.

For those interested in upskilling to work on SAEs, we recommend programs like ARENA or MATS.

You seem to be super into SAEs, what if SAEs suck?

We don’t take it for granted that SAEs are perfect, or that they solve the problems we care about. We describe ourselves as cautiously optimistic about SAEs. From an AI alignment perspective, there are many reasons to be excited about them, but the science is far from settled. If SAEs suck, then our work will hopefully help the community work that out ASAP. We can also help build scientific consensus around SAEs via a number of methods, such as benchmarking them against other techniques or red-teaming them, which may help us build something better or identify which other approaches deserve more attention.

Is Neuronpedia an AI safety org? 

We want to clarify that Neuronpedia is not an AI safety organization. Right now, we’re experimenting with the idea that Neuronpedia can meaningfully accelerate valuable research and are focused on testing this hypothesis as robustly as possible. However, if and when it makes sense, we may try to be more ambitious. 

Are there dual-use risks associated with SAEs? 

As with any area of technical AI safety, it’s important to consider the risk of work being used to accelerate the development of misaligned or poorly aligned systems. We think the benefits outweigh the risks when it comes to training and studying neural networks with Sparse Autoencoders, but we don’t take it for granted that this will always be the case regardless of future advancements. For this reason, we plan to consult with members of the mechanistic interpretability community when considering any actions where we are “first movers” (i.e. the first to publish a technique or method) or which constitute a significant advance beyond the current SOTA.
