Logan Riggs Smith

Wiki Contributions


Regarding urls, I think this is a mix of the HH dataset being non-ideal & the PM not being a great discriminator of chosen vs rejected reward (see nostalgebraist's comment & my response)

I do think SAE's find the relevant features, but inefficiently compressed (see Josh & Isaac's work on days of the week circle features). So an ideal SAE (or alternative architecture) would not separate these features. Relatedly, many of the features that had high url-relevant reward had above-random cos-sim with each other. 

[I also think the SAE's could be optimized to trade off some reconstruction loss for reward-difference loss which I expect to show a cleaner effect on the reward]

The PM is pretty bad (it's trained on hh). 

It's actually only trained after the first 20k/156k datapoints in hh, which moves the mean reward-diff from 1.04 -> 1.36 if you only calculate over that remaining ~136k subset.

My understanding is there's 3 bad things:
1. the hh dataset is inconsistent
2. The PM doesn't separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (7B parameter model) which doesn't have the most complex features to choose from.

The in-distribution argument is most likely the case for the "Thank you. My pleasure" case, because the assistant never (AFAIK, I didn't check) said that phrase as a response. Only "My pleasure" after the user said " thank you". 

I prefer when they are directly mentioned in the post/paper!

That would be a more honest picture. The simplest change I could think of was adding it to the high-level takeaways.

I do think you could use SAE features to beat that baseline if done in the way specified by General Takeaways. Specifically, if you have a completion that seems to do unjustifiably better, then you can find all feature's effects on the rewards that were different than your baseline completion. 

Features help come up with hypotheses, but also isolates the effect. If do have a specific hypothesis as mentioned, then you should be able to find features that capture that hypothesis (if SAEs are doing their job). When you create some alternative completion based on your hypothesis, you might unknowingly add/remove additional negative & positive features e.g. just wanting to remove completion-length, you also remove the end-of-sentence punctuation. 

In general, I think it's hard to come up with the perfect counterfactual, but SAE's at least let you know if you're adding or removing specific reward-relevant features in your counterfactual completions. 


There were some features that didn't work, specifically ones that activated on movie names & famous people's names, which I couldn't get to work. Currently I think they're actually part of a "items in a list" group of reward-relevant features (like the urls were), but I didn't attempt to change prompts based off items in a list. 

For "unsupervised find spurious features over a large dataset" my prior is low given my current implementation (ie I didn't find all the reward-relevant features).

However, this could be improved with more compute, SAEs over layers, data, and better filtering of the resulting feature results (and better versions of SAEs that e.g. fix feature splitting, train directly for reward).

From my (quick) read, it's not obvious how well this approach compares to the baseline of "just look at things the model likes and try to understand the spurious features of the PM" 

From this section, you could augment this with SAE features by finding the features relevant for causing one completion to be different than the other. I think this is the most straightforwardly useful application. A couple of gotcha's:

  1. Some features are outlier dimensions or high-frequency features which will affect both completions (or even most text), so include some baselines which shouldn't be affected (which requires a hypothesis)
  2. You should look over multiple layers (though if you do multiple residual stream SAEs you'll find near-duplicate features)

Thanks so much! All the links and info will save me time:)

Regarding cos-sim, after thinking a bit, I think it's more sinister. For cross-cos-sim comparison, you get different results if you take the max over the 0th or 1st dimension (equivalent to doing cos(local, e2e) vs cos(e2e, local). As an example, you could have 2 features each, 3 point in the same direction and 1 points opposte. Making up numbers:

feature-directions(1D) = [ [1],[1]] & [[1],[-1]]
cos-sim = [[1, 1], [-1, -1]]

For more intuition, suppose 4 local features surround 1 e2e feature (and the other features are pointed elsewhere). Then the 4 local features will all have high max-cos sim but the e2e only has 1. So it's not just double-counting, but quadruple counting. You could see for yourself if you swap your dim=1 to 0 in your code.

But my original comment showed your results are still directionally correct when doing [global max w/ replacement] (if I coded it correctly). 

Btw it's not intuitive to me that the encoder directions might be similar even though the decoder directions are not. Curious if you could share your intuitions here.

The decoder directions have degrees of freedom, but the encoder directions...might have similar degrees of freedom and I'm wrong, lol. BUT! they might be functionally equivalent, so they activate on similar datapoints across seeds. That is more laborious to check though, waaaah. 

I can check both (encoder directions first) because previous literature is really only on the SVD of gradient (ie the output), but an SAE might be more constrained when separating out inputs into sparse features. Thanks for prompting for my intuition!

The e2e having different feature directions across seeds was quite the bummer, but then I thought "are the encoder directions different though?"

Intuitively the encoder directions affect which datapoints each feature activates on, and the decoder is the causal downstream effect. For e2e, we would expect widely different decoder directions because there are many free parameters (from some other work that showed SVD of gradients had many zero singular values, meaning moving in most directions don't effect the downstream loss), but not necessarily encoder directions. 

If the encoder directions are similar across seeds, I'd trust them to inform relevant features for the model output (in cases where we don't care about connections w/ downstream layers).

However, I was not able to find the SAEs for various seeds. 

Trying to replicate Cos-sim Plots

I downloaded the similar CE at layer 6 for all three types of SAEs & took their cos-sim (last column in figure 3).

I think your cos-sim metric gives different results if you take the max over the first or 2nd dimension (or equivalently swapped the order of decoders multiplied by each other). If so, I think this is because you might double-count or something? Regardless, I ended up doing some hungarian algorithm to take the overall max (but don't double-count), but it's on cpu, so I only did the first 10k/40k features. Below is results for both encoder & decoder, which do replicate the directional results.

Nonzero Features

Additionally I thought that some results were from counting nonzero features, which, for the encoder is some high-cos-sim features, and decoder is the low-cos-sim features weirdly enough.

Would appreciate if y'all upload any repeated seeds!

My code is temporarily hosted (for a few weeks maybe?) here.

What a cool paper! Congrats!:)

What's cool:
1. e2e saes learn very different features every seed. I'm glad y'all checked! This seems bad.
2. e2e SAEs have worse intermediate reconstruction loss than local. I would've predicted the opposite actually.
3. e2e+downstream seems to get all the benefits of the e2e one (same perf at lower L0) at the same compute cost, w/o the "intermediate activations aren't similar" problem.

It looks like you've left for future work postraining SAE_local on KL or downstream loss as future work, but that's a very interesting part! Specifically the approximation of SAE_e2e+downstream as you train on number of tokens.

Did y'all try ablations on SAE_e2e+downstream? For example, only training on the next layers Reconstruction loss or next N-layers rec loss?

I've only done replications on the mlp_out & attn_out for layers 0 & 1 for gpt2 small & pythia-70M

I chose same cos-sim instead of epsilon perturbations. My KL divergence is log plot, because one KL is ~2.6 for random perturbations. 

I'm getting different results for GPT-2 attn_out Layer 0. My random perturbation is very large KL. This was replicated last week when I was checking how robust GPT2 vs Pythia is to perturbations in input (picture below). I think both results are actually correct, but my perturbation is for a low cos-sim (which if you see below shoots up for very small cos-sim diff). This is further substantiated by my SAE KL divergence for that layer being 0.46 which is larger than the SAE you show.

Your main results were on the residual stream, so I can try to replicate there next.

For my perturbation graph:

I add noise to change the cos-sim, but keep the norm at around 0.9 (which is similar to my SAE's). GPT2 layer 0 attn_out really is an outlier in non-robustness compared to other layers. The results here show that different layers have different levels of robustness to noise for downstream CE loss. Combining w/ your results, it would be nice to add points for the SAE's cos-sim/CE. 

An alternative hypothesis to yours is that SAE's outperform random perturbation at lower cos-sim, but suck at higher-cos-sim (which we care more about). 

Correct. So they’re connecting a feature in F2 to a feature in F1.

If you removed the high-frequency features to achieve some L0 norm, X, how much does loss recovered change? 

If you increased the l1 penalty to achieve L0 norm X, how does the loss recovered change as well?

Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.

Load More