TLDR:

Fluent dreaming for language models is an algorithm based on the GCG method that can reliably find plain-text readable prompts for LLMs that maximize certain logits or residual stream directions by using gradients and genetic algorithms. Authors showed its use for visualizing MLP neurons. We show this method can also help interpret SAE features.
We reimplement the algorithm in the paper, adapt it based on suggestions from the authors and apply it to some interesting residual stream SAE latents in GPT-2 Small and Gemma 1 2B. The prompts mostly match the kind of text one would expect to activate a feature from looking at the max activating examples. We find prompts that activate features more than any dataset examples for GPT-2 (24576 prompts of 128 tokens). We can find examples that highly activate two features at once. We find examples for refusal-like features from our refusal features post and induction/repetition-like features from our ICL features post.
We believe this technique is useful is similar contexts as self-explanation – interpreting directions in the absence of a dataset during early experimentation in a cheaper way for individual features. The major differences are that this technique doesn’t have a minimum requirement for model size (because it doesn’t rely on the model’s verbalization capability) and takes longer to run (<1min vs 5min)
We visualize a few known features and 200 random features for Gemma 1 2B. The results can be viewed at https://neverix.github.io/epo-visualizer/

Introduction

Methods like EPO try to solve a problem similar to SolidGoldMagikarp (plus, prompt generation). To give a brief overview:

Neural networks contain features^[1] that activate more or less strongly on different examples. One common way to interpret features in vision models is to go through a dataset, record activations of the feature on each image, and show the highest-activating images for each feature.

An example of vision model interpretation by optimizing examples from https://distill.pub/2017/feature-visualization/. Optimization makes the objects the neuron is looking for more clear.

There is one obvious problem with this: if there is a pattern that is rare or never appears in the dataset, it's possible the maximum activating examples computed from the dataset will be misleading: https://arxiv.abs/2104.07143. As an example, in the fi...

Posts

Wikitag Contributions

Comments