Lucius Bushnaq

AI notkilleveryoneism researcher at Apollo, focused on interpretability.

Wiki Contributions


Nice! We were originally planning to train sparse MLPs like this this week.

Do you have any plans of doing something similar for attention layers? Replacing them with wider attention layers with a sparsity penalty, on the hypothesis that they'd then become more monosemantic?

Also, do you have any plans to train sparse MLP at multiple layers in parallel, and try to penalise them to have sparsely activating connections between each other in addition to having sparse activations?

Thank you, I've been hoping someone would write this disclaimer post.

I'd add on another possible explanation for polysemanticity, which is that the model might be thinking in a limited number of linearly represented concepts, but those concepts need not match onto concepts humans are already familiar with. At least not all of them.

Just because the simple meaning of a direction doesn't jump out at an interp researcher when they look at a couple of activating dataset examples doesn't mean it doesn't have one. Humans probably wouldn't even always recognise the concepts other humans think in on sight.

Imagine a researcher who hasn't studied thermodynamics much looking at a direction in a model that tracks the estimated entropy of a thermodynamic system it's monitoring: 'It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.'

I also have this impression, except it seems to me that it's been like this for several months at least. 

The Open Philanthropy people I asked at EAG said they think the bottleneck is that they currently don't have enough qualified AI Safety grantmakers to hand out money fast enough. And right now, the bulk of almost everyone's funding seems to ultimately come from Open Philanthropy, directly or indirectly.

CaSc can fail to reject a hypothesis if it is too unspecific and is extensionally equivalent to the true hypothesis.

Seems to me like this is easily resolved so long as you don't screw up your book keeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?

On the input-output level, we found that CaSc can fail to reject false hypotheses due to cancellation, i.e. because the task has a certain structural distribution that does not allow resampling to differentiate between different hypotheses.

I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

Wouldn't you want to directly compare the divergence on outputs between the original graph  and ablated graph  instead? The  divergence between their output distributions over the data is the first thing that'd come to my mind. Or keeping whatever the original loss function is, but with the outputs of  as the new ground truth labels.

That's still ad hocery of course, but it should at least take care of the failure mode you point out here. Is this really not part of current CaSc?

These manifolds generally extend out to infinity, so it isn't really meaningful to talk about literal "basin volume".[4]  We can focus instead on their dimensionality.

Once you take priors over the parameters into account, I would not expect this to continue holding. I'd guess that if you want to get the volume of regions in which the loss is close to the perfect loss, directions that are not flat are going to matter a lot. Whether a given non-flat direction is incredibly steep, or half the width given by the prior could make a huge difference.

I still think the information loss framework could make sense however. I'd guess that there should be a more general relation where the less information there is to distinguish different data points, the more e.g. principal directions in the Hessian of the loss function will tend to be broad. 

I'd also be interested in seeing what happens if you look at cases with non-zero/non-perfect loss. That should give you second order terms in the network output, but these again look to me like they'd tend to give you broader principal directions if you have less information exchange in the network. For example, a modular network might have low-dimensional off-diagonals, which you can show with the Schur complement is equivalent to having sparse off-diagonals, which I think would give you less extreme eigenvalues.

I know we've discussed these points before, but I thought I'd repeat them here where people can see them.

A very good point!

I agree that fix 1. seems bad, and doesn't capture what we care about.

At first glance, fix 2. seems more promising to me, but I'll need to think about it.

Thank you very much for pointing this out.