Lee Sharkey

Research engineer at Apollo Research (London). 

My main research interests are mechanistic interpretability and inner alignment. 

Comments

Great! I'm curious, what was it about the sparsity penalty that you changed your mind about? 

Comments on the outcomes of the post:

  • I'm reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
  • I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
  • Two parallel works used the method identified in the post (sparse autoencoders, or SAEs) or a slight modification of it:
    • Cunningham et al. (2023) (https://arxiv.org/abs/2309.08600), a project which I supervised.
    • Bricken et al. (2023) (https://transformer-circuits.pub/2023/monosemantic-features), the Anthropic paper 'Towards Monosemanticity'.
  • That two teams were able to use the results to explore complementary directions in parallel I think partly validates Conjecture's policy (at that time) of publishing quick, scrappy results that optimize for impact rather than rigour. I make this note because that policy attracted some criticism that I perceived to be undue, and to highlight that some of the benefits of the policy can only be observed after longer periods.


Some regrets related to the post:

  • It was pretty silly of me to divide the L1 loss by the number of dictionary elements. The idea was that this means that the L1 loss per dictionary element remains roughly constant even as you scale dictionaries. But that isn't what you want - you want more penalty as you scale, assuming the number of data-generating features is fixed. This made it more difficult than it needed to be to find the right hyperparameters. Fortunately, Logan Smith (iirc) identified this issue while working on Cunningham et al. 
  • The language model results were underwhelming. I strongly suspect they were undertrained. This was addressed in a follow-up post (https://www.alignmentforum.org/posts/DezghAd4bdxivEknM/a-small-update-to-the-sparse-coding-interim-research-report).
  • I regret giving a specific number of potential features: "Here we found very weak, tentative evidence that, for a model of size d_model = 256, the number of features in superposition was over 100,000. This is a large scaling factor and it’s only a lower bound. If the estimated scaling factor is approximately correct (and, we emphasize, we’re not at all confident in that result yet) or if it gets larger, then this method of feature extraction is going to be very costly to scale to the largest models – possibly more costly than training the models themselves. " Despite all the qualifications and expressions of deep uncertainty, I got the impression that many people read too much into this. I think avoiding publishing the LM results or not giving a specific figure could have avoided this misunderstanding.
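
To make the first regret above concrete, here is a minimal sketch of why dividing the L1 loss by the number of dictionary elements weakens the penalty as the dictionary grows. It is an illustration only, not the code from the post: the function names (`l1_penalty`, `sparse_codes`) and the toy setup (a fixed number of active coefficients of magnitude 1 per example) are assumptions made for the example.

```python
import numpy as np

def l1_penalty(codes, l1_coeff=1.0, divide_by_dict_size=False):
    """Batch mean of the per-example L1 norm of the code vectors."""
    per_example = np.abs(codes).sum(axis=1)
    if divide_by_dict_size:
        # The normalization described in the regret: scale by 1 / dictionary size.
        per_example = per_example / codes.shape[1]
    return l1_coeff * per_example.mean()

def sparse_codes(n_examples, dict_size, n_active, seed=0):
    """Toy codes: each example has exactly n_active coefficients equal to 1."""
    rng = np.random.default_rng(seed)
    codes = np.zeros((n_examples, dict_size))
    for row in codes:  # each row is a view, so this writes into codes
        idx = rng.choice(dict_size, size=n_active, replace=False)
        row[idx] = 1.0
    return codes

# Same number of active features per example, two dictionary sizes.
small = sparse_codes(n_examples=64, dict_size=512, n_active=8)
large = sparse_codes(n_examples=64, dict_size=4096, n_active=8)

for name, codes in [("512", small), ("4096", large)]:
    print(
        f"dict size {name}: "
        f"summed L1 = {l1_penalty(codes):.6f}, "
        f"divided L1 = {l1_penalty(codes, divide_by_dict_size=True):.6f}"
    )
```

In this toy setup the summed penalty is identical for both dictionary sizes, while the divided penalty is 8 times smaller for the dictionary that is 8 times larger. With the number of data-generating features held fixed, the divided version therefore penalizes a larger dictionary less, which is the opposite of the scaling the bullet says is wanted.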

Outstanding issues:

  • In their current formulation, SAEs leave a few important problems unaddressed, including:
    • SAEs probably don't learn the most functionally relevant features. They find directions in the activations that are separable, but that doesn't necessarily reflect the network's ontology. The features learned by SAEs are probably too granular.
    • SAEs don't automatically provide a way to summarize the interactions between features (i.e. there is a gap between features and circuits).
    • The SAEs used in the above-mentioned papers aren't a very satisfying solution to dealing with attention head polysemanticity.
    • SAEs optimize two losses: Reconstruction and L1. The L1 loss penalizes the feature coefficients. I think this penalty means that, in expectation, they'll systematically undershoot the correct prediction for the coefficients (this has been observed empirically in private correspondence). 
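
The undershooting described in the last bullet can be seen in the simplest possible case. The sketch below is not an SAE: it assumes a single scalar coefficient with objective 0.5 * (x - c)^2 + lam * |c|, for which the minimizer is the standard soft-threshold of x. The names `soft_threshold` and `objective` and the values of `true_coeff` and `lam` are chosen for illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    """Closed-form minimizer over c of 0.5 * (x - c)**2 + lam * |c|."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def objective(c, x, lam):
    """Reconstruction term plus L1 penalty for a single scalar coefficient."""
    return 0.5 * (x - c) ** 2 + lam * np.abs(c)

true_coeff = 2.0  # the coefficient a penalty-free fit would recover
lam = 0.3         # L1 penalty strength (illustrative value)

estimate = soft_threshold(true_coeff, lam)

# Brute-force check of the closed form on a fine grid of candidate coefficients.
grid = np.linspace(-4.0, 4.0, 80001)
brute_force = grid[np.argmin(objective(grid, true_coeff, lam))]

print("true coefficient:    ", true_coeff)
print("L1-penalized estimate:", estimate)
print("grid-search estimate: ", brute_force)
```

In this scalar case the penalized estimate is smaller in magnitude than the true coefficient by exactly `lam` whenever it is nonzero. How large the effect is in a trained SAE depends on the dictionary and the data, so this shows only the direction of the bias, not its size.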

I and collaborators are working on each of these problems.

Here is a reference that supports the claim using simulations https://royalsocietypublishing.org/doi/10.1098/rspb.2008.0877

But I think you're right to flag it - other references don't really support it as the main reason for stripes. https://www.nature.com/articles/ncomms4535
 

Thanks Akash! 

I agree that this feels neglected.

Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712

Looking forward to it coming out! 

Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers.

Dictionary learning - This is one of my main bets for comprehensive interpretability. 

Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709 

No theoretical reason - The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It also seemed like the method with which it was hardest to discern the difference between full recovery and partial recovery, because the differences were kind of subtle. In future work, some way to use the losses to measure feature recovery will probably be re-introduced. It probably just won't be the way we used in the Interim Report.

I strongly suspect this is the case too! 

In fact, we might be able to speed up the learning of common features even further:

Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phase of feature recovery is much faster*. We haven't had time to check, but it's presumably biased to recover the most common features first since they're the most likely to be in a given data point.

*The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time. 
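
A minimal sketch of the two ideas above, data-based initialization and the MMCS metric, on synthetic data. It is not Pierre's code or the Interim Report's implementation. It assumes MMCS stands for mean max cosine similarity and that the max is taken over learned dictionary elements for each ground truth feature (the report may define the direction differently). The function names and all sizes are hypothetical.

```python
import numpy as np

def normalize_rows(m):
    """Scale each row to unit L2 norm (guarding against zero rows)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.maximum(norms, 1e-12)

def init_dictionary_from_data(activations, dict_size, rng):
    """Use a random sample of datapoints as the initial dictionary directions."""
    idx = rng.choice(activations.shape[0], size=dict_size, replace=False)
    return normalize_rows(activations[idx])

def mmcs(ground_truth, learned):
    """For each ground truth feature, take the max cosine similarity
    over learned dictionary elements, then average over features."""
    cos = normalize_rows(ground_truth) @ normalize_rows(learned).T
    return cos.max(axis=1).mean()

rng = np.random.default_rng(0)
n_features, d_model, n_points, dict_size = 64, 32, 2000, 128

# Synthetic ground truth features: random unit vectors.
features = normalize_rows(rng.normal(size=(n_features, d_model)))

# Each datapoint is a sparse, non-negative combination of a few features.
active = rng.random((n_points, n_features)) < (2.0 / n_features)
coeffs = active * rng.random((n_points, n_features))
activations = (coeffs @ features)[active.any(axis=1)]  # drop all-zero datapoints

data_init = init_dictionary_from_data(activations, dict_size, rng)
random_init = normalize_rows(rng.normal(size=(dict_size, d_model)))

print("MMCS at initialization, data init:  ", mmcs(features, data_init))
print("MMCS at initialization, random init:", mmcs(features, random_init))
```

The sketch only compares the metric at initialization. It does not train an autoencoder, so it says nothing about convergence time or about which features are recovered first.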

> And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.


I just want to point out that I've written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers

I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post. 

I'll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post. 
 

> But the highly-generic reaction "spin up a subthread to tackle this newly-discovered obstacle", or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.

^ This is an example of what I called 'representational kludging'. It's an ugly way to represent "Be concerned about the monitors detecting suspicious instructions", but it turns out to function similarly to such a representation. 

It also involves 'passively externalised representations', since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.
 

> A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?

^ This is an example of the AI edging toward 'thinking in the gaps in our tools'.
 

> Simplify-and-critique didn't do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you're having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)

> Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)

> Normally, simplify/translate/solve wouldn't be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don't hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it's OK to try approaches that might involve more brute force than usual.

^ This is an additional example of 'representational kludging', since the AI is using representational tools that are nearby in embedding space and that aren't exactly appropriate for the job, but that have a sufficiently similar function that they still do the job.

 

Overall comments: 

I really liked this post! 

I like it because previously there wasn't a compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class ('deep deception'). I agree that it's underappreciated that we're still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.

Thanks for your interest!

The autoencoder losses reported are the train losses. And you're right to point at noise potentially being an issue. It's my strong suspicion that some of the problems in these results are due to there being an insufficient number of data points to train the autoencoders on LM data. 

> I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen of gaussians, or random points in the unit square (where there is a lot of room "in the corners"), to see if this method produces strong false positives.

I'd be curious to see these results too! 

> Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I'm not sure how many features you should expect in layer 0.

A rough estimate would be somewhere on the order of the vocabulary size (here 50k). A reason to think it might be more is that layer 0 MLP activations follow an attention layer, which means that features may represent combinations of token embeddings at different sequence positions and there are more potential combinations of tokens than in the vocabulary. A reason to think it may be fewer is that a lot of directions may get 'compressed away' in small networks. 

 
