Ryan Greenblatt

An important note here is that our final 80% loss recovered results in loss which is worse than a constant model! (aka, a rock)

Specifically, note that the dataset consists of 26% balanced sequences as discussed here. So, the loss of a constant model is (equivalent to the entropy of the labels). This is less than the 1.22 loss we get for our final 72% loss recovered experiment.

We think this is explanation is still quite meaningful - adversarially constructed causal scrubbing hypothesis preserve variance and the model is very confident. So, getting to such a loss is still meaningful. For instance, if you train an affine model from the log probability the scrubbed model outputs to a new logit trained for getting optimal loss on the dataset, I expect this does much better than a constant model (a rock).

We probably should have included the constant model comparison in the table and related discussion (like in this comment) in the original post. Sorry. We may edit this into the post later.

Overall, my view is that we will need to solve the optimization problem of 'what properties of the activation distribution are sufficient to explain how the model behaves', but this solution can be represented somewhat implicitly and I don't currently see how you'd transition it into a solution to superposition in the sense I think you mean.

I'll try to explain why I have this view, but it seems likely I'll fail (at least partially because of my own confusions).

Quickly, some background so we're hopefully on the same page (or at least closer):

I'm imagining the setting described here. Note that anomalies are detected with respect to a distribution (for a new datapoint ! So, we need a distribution where we're happy with the reason why the model works.

This setting is restrictive in various ways (e.g., see here), but I think that practical and robust solutions would be a large advancement regardless (extra points for an approach which fails to have on paper counterexamples).

Now the approaches to anomaly detection I typically think about work roughly like this:

- Try to find an 'explanation'/'hypothesis' for the variance on which doesn't also 'explain' deviation from the mean on . (We're worst casing over explanations)
- If we succeed, then is anomalous. Otherwise, it's considered non-anomalous.

Note that I'm using scare quotes around explanation/hypothesis - I'm refering to an object which matches some of the intutive properties of explanations and/or hypotheses, but it's not clear exactly which properties we will and won't need.

This stated approach is very inefficient (it requires learning an explanation for each new datum !), but various optimizations are plausible (e.g., having a minimal base explanation for which we can quickly finetune for each datum ).

I'm typically thinking about anomaly detection schemes which use approaches similar to causal scrubbing, though Paul, Mark, and other people at ARC typically think about heuristic arguments (which have quite different properties).

Now back to superposition.

A working solution must let you know if atypical features have fired, but not *which* atypical features or what direction those atypical features use. Beyond this, we might hope that the 'explanation' for the variance on can tell use which directions the model uses for representing important information. This will sometimes be true, but I think this is probably false in general, though I'm having trouble articulating my intuitions for this. Minimally, I think it's very unclear how you would extract this information if you use causal scrubbing based approaches.

I plan on walking through an example which is similar to how we plan on tacking anomaly detection with causal scrubbing in a future comment, but I need to go get lunch.

Thanks for the great comment clarifying your thinking!

I would be interested in seeing the data dimensionality curve for the validation set on MNIST (as opposed to just the train set) - it seems like the stated theory should make pretty clear predictions about what you'd see. (Or maybe I'll try to make reproduction happpen and do some more experiments).

These results also suggest that if superposition is widespread, mechanistic anomaly detection will require solving superposition

I feel pretty confused, but my overall view is that many of the routes I currently feel are most promising don't require solving superposition. At least, they don't require solving superposition in the sense I think you mean. I'm not sure the rest of my comment here will make sense without more context, but I'll try.

Specifically, these routes require decoding superposition, but not obviously more so than training neural networks requires decoding superposition. Overall, the hope would be something like 'SGD is already finding weights which decode the representations the NN is using, so we can learn a mechanistic "hypothesis" for how these weights work via SGD itself'. I think this exact argument doesn't quite pan out (see here for related discussion) and it's not clear what we need as far as mechanistic 'hypotheses' go.

Calling individual tokens the 'State' and a generated sequence the 'Trajectory' is wrong/misleading IMO.

I would instead call a sequence as a whole the 'State'. This follows the meaning from Dynamical systems.

Then, you could refer to a Trajectory which is a list of sequence each with one more token.

(That said, I'm not sure thinking about trajectories is useful in this context for various reasons)

In prior work I've done, I've found that activations have tails between and (typically closer to ). As such, they're probably better modeled as logistic distributions.

That said, different directions in the residual stream have quite different distributions. This depends considerably on how you select directions - I imagine random directions are more gaussian due to CLT. (Note that averaging together heavier tailed distributions takes a *very* long time to be become gaussian.)
But, if you look at (e.g.) the directions selected by neurons optimized for sparsity, I've commonly observed bimodal distributions, heavy skew, etc. My low confidence guess is that this is primarily because various facts about language have these properties and exhibiting this structure in the model is an efficient way to capture this.

This is a broadly similar point to @Fabien Roger.

I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).

One reason I feel sad about depending on incidental properties is that it likely implies the solution isn't robust enough to optimize against. This is a key desiderata in an ELK solution. I imagine this optimization would typically come from 2 sources:

- Directly trying to train against the ELK outputs (IMO quite important)
- The AI gaming the ELK solution by manipulating it's thoughts (IMO not very important, but it depends on exactly how non-robust the solution is)

That's not to say non-robust ELK approaches aren't useful. So as long as you never apply too many bits of optimization against the approach it should remain a useful test for other techniques.

It also seems plausible that work like this could eventually lead to a robust solution.

Here are some dumb questions. Perhaps the answer to all of them is 'this work is preliminary, we'll address this + more later' or 'hey, see section 4 where we talked about this in detail' or 'your question doesn't make sense'

- In the toy datasets, the features have the same scale (uniform from zero to one when active multiplied by a unit vector). However in the NN case, there's no particular reason to think the feature scales are normalized very much (though maybe they're normalized a bit due to weight decay and similar). Is there some reason this is ok?
- Insofar as you're interested in taking features out of superposition, why not
*learn*a toy superimposed representation. E.g., learn a low rank autoencoder like in the toy models paper and then learn to extract features from this representation? I don't see a particular reason why you used a hand derived superposition representation (which seems less realistic to me?). - Beyond this, I imagine it would be nicer if you trained a model do computation in superposition and then tried to decode the representations the model uses - you should still be able to know what the 'real' features are (I think).

In the spirit of Evan's original post here's a (half baked) simple model:

Simplicity claims are claims about how many bits (in the human prior) it takes to *explain*^{[1]} some amount of performance in the NN prior.

E.g., suppose we train a model which gets 2 nats of loss with 100 Billion parameters and we can *explain* this model getting 2.5 nats using a 300 KB human understandable manual (we might run into issues with irreducible complexity such that making a useful manual is hard, but let's put that aside for now).

So, 'simplicity' of this sort is lower bounded by the relative parameter efficiency of neural networks in practice vs the human prior.

In practice, you do worse than this insofar as NNs express things which are anti-natural in the human prior (in terms of parameter efficiency).

We can also reason about how 'compressible' the explanation is in a naive prior (e.g., a formal framework for expressing explanations which doesn't utilize cleverer reasoning technology than NNs themselves). I don't quite mean compressible - presumably this ends up getting you insane stuff as compression usually does.

by

*explain*, I mean something like the idea of heuristic arguments from ARC. ↩︎

And yet, whenever we actually delve into these systems, it turns out that there's a ton of ultimately-relatively-simple internal structure.

I'm not sure exactly what you mean by "ton of ultimately-relatively-simple internal structure".

I'll suppose you mean "a high percentage of what models use parameters for is ultimately simple to humans" (where by simple to humans we mean something like, description length in the prior of human knowledge, e.g., natural language).

If so, this hasn't been my experience doing interp work or from the interp work I've seen (though it's hard to tell: perhaps there exists a short explaination that hasn't been found?). Beyond this, I don't think you can/should make a large update (in either direction) from Olah et al's prior work. The work should down-weight the probability of complete uninterpretablity or extreme easiness.

As such, I expect (and observe) that views about the tractability of humans understanding models come down largely to priors or evidence from other domains.

Even believing in a relatively strong version of the natural abstractions hypothesis doesn't (on its own) imply that we should be able to understand

allconcepts the AI uses. Just the ones which:These three properties seem reasonably likely in practice for some common stuff like 'trees' or 'dogs'.