Lucius Bushnaq — AI Alignment Forum

AI ALIGNMENT FORUM
AF

In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?
My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.
Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.

Humans addicted to drugs often exhibit weird meta-preferences like 'I want to stop wanting the drug', or 'I want to find an even better kind of drug'.

For this reason, I am not at all confident that a smart thing exposed to the button would later generalise to a coherent, super-smart thing that wants the button to be pressed. Maybe it would perceive the circuits in it that bound to the button reward as foreign to the rest of its goals, and work to remove them. Maybe the button binding would generalise in a strange way.

'Seek to directly inhabit the cognitive state caused by the button press', 'along an axis of cognitive states associated with button presses of various strength, seek to walk to a far end that does not actually correspond to any kind of button press', 'make the world have a shape related to generalisations of ideas that tended to come up whenever the button was pressed' and just generally 'maximise a utility function made up of algorithmically simple combinations of button-related and pre-button-training-reward-related abstractions' all seem like goals I could imagine a cognitively enhanced human button addict generalising toward. So I am not confident the AGI would generalise to wanting the button to be pushed either, not in the long term.

Anti-Slop Interventions?

Lucius Bushnaq1y913

If you de-slopify the models, how do you avoid people then using them to accelerate capabilities research just as much as safety research? Why wouldn't that leave us with the same gap in progress between the two we have right now, or even a worse gap? Except that everything would be moving to the finish line even faster, so Earth would have even less time to react.

Is the idea that it wouldn't help safety go differentially faster at all, but rather just that it may preempt people latching on to false slop-solutions for alignment as an additional source of confidence that racing ahead is fine? If that is the main payoff you envision, I don't think it'd be worth the downside of everything happening even faster. I think time is very precious, and sources of confidence already abound for those who go looking for them.

Transcoders enable fine-grained interpretable circuit analysis for language models

Lucius Bushnaq2y40

Nice! We were originally planning to train sparse MLPs like this this week.

Do you have any plans of doing something similar for attention layers? Replacing them with wider attention layers with a sparsity penalty, on the hypothesis that they'd then become more monosemantic?

Also, do you have any plans to train sparse MLPs at multiple layers in parallel, and try to penalise them to have sparsely activating connections between each other in addition to having sparse activations?
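For concreteness, the basic single-layer objective being discussed can be sketched as below. This is a minimal illustration under assumptions not taken from the post: a ReLU MLP with a wide hidden layer, trained to reproduce the original MLP's output, with an L1 penalty on the hidden activations. The function name `sparse_mlp_loss`, the coefficient `l1_coeff`, and all shapes are hypothetical.

```python
import numpy as np

def sparse_mlp_loss(x, target, W_in, b_in, W_out, b_out, l1_coeff):
    """Reconstruction loss plus an L1 sparsity penalty for a wide ReLU MLP.

    x:      (batch, d_model) inputs to the original MLP layer
    target: (batch, d_model) outputs of the original MLP layer
    """
    hidden = np.maximum(0.0, x @ W_in + b_in)   # (batch, d_hidden)
    pred = hidden @ W_out + b_out               # (batch, d_model)
    # Mean squared reconstruction error per example.
    recon = np.mean(np.sum((pred - target) ** 2, axis=-1))
    # Mean L1 norm of the hidden activations per example.
    sparsity = np.mean(np.sum(np.abs(hidden), axis=-1))
    return recon + l1_coeff * sparsity, recon, sparsity

# Toy data with made-up shapes, only to show the call.
rng = np.random.default_rng(0)
d_model, d_hidden, batch = 8, 64, 16
x = rng.normal(size=(batch, d_model))
target = rng.normal(size=(batch, d_model))
W_in = rng.normal(size=(d_model, d_hidden)) * 0.1
b_in = np.zeros(d_hidden)
W_out = rng.normal(size=(d_hidden, d_model)) * 0.1
b_out = np.zeros(d_model)

total, recon, sparsity = sparse_mlp_loss(
    x, target, W_in, b_in, W_out, b_out, l1_coeff=1e-3
)
print(total, recon, sparsity)
```

The cross-layer version asked about above would need an extra penalty on the connections between the sparse MLPs at different layers, which this sketch does not attempt.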

Superposition is not "just" neuron polysemanticity

Lucius Bushnaq2y58

Thank you, I've been hoping someone would write this disclaimer post.

I'd add another possible explanation for polysemanticity, which is that the model might be thinking in a limited number of linearly represented concepts, but those concepts need not map onto concepts humans are already familiar with. At least not all of them.

Just because the simple meaning of a direction doesn't jump out at an interp researcher when they look at a couple of activating dataset examples doesn't mean it doesn't have one. Humans probably wouldn't even always recognise the concepts other humans think in on sight.

Imagine a researcher who hasn't studied thermodynamics much looking at a direction in a model that tracks the estimated entropy of a thermodynamic system it's monitoring: 'It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.'

Alignment Grantmaking is Funding-Limited Right Now

Lucius Bushnaq3y821

I also have this impression, except it seems to me that it's been like this for several months at least.

The Open Philanthropy people I asked at EAG said they think the bottleneck is that they currently don't have enough qualified AI Safety grantmakers to hand out money fast enough. And right now, the bulk of almost everyone's funding seems to ultimately come from Open Philanthropy, directly or indirectly.

Practical Pitfalls of Causal Scrubbing

Lucius Bushnaq3y00

CaSc can fail to reject a hypothesis if it is too unspecific and is extensionally equivalent to the true hypothesis.

Seems to me like this is easily resolved so long as you don't screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which claims your hypothesis makes about which nodes or layers of the network, I think this should be fine?

On the input-output level, we found that CaSc can fail to reject false hypotheses due to cancellation, i.e. because the task has a certain structural distribution that does not allow resampling to differentiate between different hypotheses.

I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.

Wouldn't you want to directly compare the divergence on outputs between the original graph $G$ and the ablated graph $I$ instead? The KL divergence $D_{KL}$ between their output distributions over the data is the first thing that'd come to my mind. Or keeping whatever the original loss function is, but with the outputs of $G$ as the new ground truth labels.

That's still ad hocery of course, but it should at least take care of the failure mode you point out here. Is this really not part of current CaSc?
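The output-level comparison suggested above can be sketched as follows. This is a toy illustration, not part of CaSc: the logits are made up, the "ablated" model is hypothetical, and the helper names (`mean_cross_entropy`, `mean_kl`) are assumptions. It shows two models with the same average loss on the labels whose output distributions still differ, which the KL divergence picks up.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_cross_entropy(logits, labels):
    # Average negative log-probability assigned to the true labels.
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(labels)), labels].mean())

def mean_kl(logits_p, logits_q):
    # Average over data points of KL(p || q) between output distributions.
    logp = log_softmax(logits_p)
    logq = log_softmax(logits_q)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())

# Two data points, two classes, true labels 0 and 1.
labels = np.array([0, 1])
original_logits = np.array([[2.0, 0.0], [0.0, 0.5]])
# Hypothetical ablated model: confident where the original was unsure,
# and unsure where the original was confident.
ablated_logits = np.array([[0.5, 0.0], [0.0, 2.0]])

loss_original = mean_cross_entropy(original_logits, labels)
loss_ablated = mean_cross_entropy(ablated_logits, labels)
kl = mean_kl(original_logits, ablated_logits)

print("loss (original):", loss_original)
print("loss (ablated): ", loss_ablated)
print("mean KL(original || ablated):", kl)
```

Here the per-example errors cancel in the average loss, so a loss-based comparison sees no difference, while the mean KL divergence is clearly non-zero.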

Information Loss --> Basin flatness

Lucius Bushnaq4y00

These manifolds generally extend out to infinity, so it isn't really meaningful to talk about literal "basin volume". We can focus instead on their dimensionality.

Once you take priors over the parameters into account, I would not expect this to continue holding. I'd guess that if you want to get the volume of regions in which the loss is close to the perfect loss, directions that are not flat are going to matter a lot. Whether a given non-flat direction is incredibly steep, or half the width given by the prior, could make a huge difference.
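A rough sketch of why, under assumptions that are not from the post: take a purely quadratic loss around the minimum, $L(\theta) \approx \tfrac{1}{2}\sum_i \lambda_i \theta_i^2$, with Hessian eigenvalues $\lambda_i \geq 0$, and a prior that simply cuts every direction off at some width $\sigma$. A crude estimate of the volume of the region with loss at most $\epsilon$ is then

$$V_\epsilon \sim \prod_i \min\left(\sigma,\ \sqrt{2\epsilon/\lambda_i}\right).$$

Each flat direction ($\lambda_i = 0$) contributes a factor of about $\sigma$, and each non-flat direction contributes $\sqrt{2\epsilon/\lambda_i}$ whenever that is narrower than the prior. Two basins with the same number of flat directions, but eigenvalues $\lambda_i$ and $\lambda'_i$ in their non-flat directions, would then differ in volume by a factor of roughly $\prod_i \sqrt{\lambda'_i/\lambda_i}$, which can be very large when there are many non-flat directions.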

I still think the information loss framework could make sense however. I'd guess that there should be a more general relation where the less information there is to distinguish different data points, the more e.g. principal directions in the Hessian of the loss function will tend to be broad.

I'd also be interested in seeing what happens if you look at cases with non-zero/non-perfect loss. That should give you second order terms in the network output, but these again look to me like they'd tend to give you broader principal directions if you have less information exchange in the network. For example, a modular network might have low-dimensional off-diagonals, which, as you can show with the Schur complement, is equivalent to having sparse off-diagonals, which I think would give you less extreme eigenvalues.

I know we've discussed these points before, but I thought I'd repeat them here where people can see them.

Project Intro: Selection Theorems for Modularity

Lucius Bushnaq4y20

A very good point!

I agree that fix 1. seems bad, and doesn't capture what we care about.

At first glance, fix 2. seems more promising to me, but I'll need to think about it.

Thank you very much for pointing this out.
