Robert Kirk

PhD student at UCL DARK doing RL, OOD Robustness and safety. Interested in self improvement.

Wiki Contributions


Question: How do we train an agent which makes lots of diamonds, without also being able to robustly grade expected-diamond-production for every plan the agent might consider?

I thought you were about to answer this question in the ensuing text, but it didn't feel like to me you gave an answer. You described the goal (values-child), but not how the mother would produce values-child rather than produce evaluation-child. How do you do this?

You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit

I guess the recent work on Polysemanticity and Capacity seems to suggest the latter case, especially in sparser settings, given the zone where multiple feature are represented polysemantically, although I can't remember if they investigate power-law feature frequencies or just uniform frequencies

were a little concerned about going down a rabbit hole given some of the discussion around whether the results replicated, which indicated some sensitivity to optimizer and learning rate.

My impression is that that discussion was more about whether the empirical results (i.e. do ResNets have linear mode connectivity?) held up, rather than whether the methodology used and present in the code base could be used to find whether linear mode connectivity is present between two models (up to permutation) for a given dataset. I imagine you could take the code and easily adapt it to check for LMC between two trained models pretty quickly (it's something I'm considering trying to do as well, hence the code requests).

I think (at least in our case) it might be simpler to get at this question, and I think the first thing I'd do to understand connectivity is ask "how much regularization do I need to move from one basin to the other?" So for instance suppose we regularized the weights to directly push them from one basin towards the other, how much regularization do we need to make the models actually hop?

That would defiitely be interesting to see. I guess this is kind of presupposing that the models are in different basins (which I also believe but hasn't yet been verified). I also think looking at basins and connectivity would be more interesting in the case where there was more noise, either from initialisation, inherently in the data, or by using a much lower batch size so that SGD was noisy. In this case it's less likely that the same configuration results in the same basin, but if your interventions are robust to these kinds of noise then it's a good sign.

Good question! We haven't tried that precise experiment, but have tried something quite similar. Specifically, we've got some preliminary results from a prune-and-grow strategy (holding sparsity fixed, pruning smallest-magnitude weights, enabling non-sparse weights) that does much better than a fixed sparsity strategy.

I'm not quite sure how to interpret these results in terms of the lottery ticket hypothesis though. What evidence would you find useful to test it?

That's cool, looking forward to seeing more detail. I think these results don't seem that related to the LTH (if I understand your explanation correctly), as LTH involves finding sparse subnetworks in dense ones. Possibly it only actually holds in model with many more parameters, I haven't seen it investigated in models that aren't overparametrised in a classical sense.

I think if iterative magnitude pruning (IMP) on these problems produced much sparse subnetworks that also maintained the monosemanticity levels, then that would suggest that sparsity doesn't penalise monosemanticity (or polysemanticity) in this toy model, and also (much more speculatively) that the sparse well-performing subnetworks that IMP finds in other networks possibly also maintain their levels of poly/mono-semanticity. If we also think these networks are favoured towards poly or mono, then that hints at how the overall learning process if favoured towards poly or mono.

This work looks super interesting, definitely keen to see more! Will you open-source your code for running the experiments and producing plots? I'd definitely be keen to play around with it. (They already did here: I just missed it. Thanks! Although it would be useful to have the plotting code as well, if that's easy to share?)

Note that we primarily study the regime where there are more features than embedding dimensions (i.e. the sparse feature layer is wider than the input) but where features are sufficiently sparse that the number of features present in any given sample is smaller than the embedding dimension. We think this is likely the relevant limit for e.g. language models, where there are a vast array of possible features but few are present in any given sample.

I agree that N (true feature dimension) > d (observed dimension), and that sparsity will be high, but I'm uncertain whether the other part of the regime (that you don't mention here), that k (model latent dimension) > N, is likely to be true. Do you think that is likely to be the case? As an analogy, I think the intermediate feature dimensions in MLP layers in transformers (analogously k) are much lower dimension than the "true intrinsic dimension of features in natural language" (analogously N), even if it is larger than the input dimension (embedding dimension* num_tokens, analogously d). So I expect  whereas in your regime . Do you think you'd be able to find monosemantic networks for ? Did you try out this regime at all (I don't think I could find it in the paper).

In the paper you say that you weakly believe that monosemantic and polysemantic network parametrisations are likely in different loss basins, given they're implementing very different algorithms. I think (given the size of your networks) it should be easy to test for at least linear mode connectivity with something like git re-basin ( Have you tried doing that? I think there are also algorithms for finding non-linear (e.g. quadratic) mode connectivity, although I'm less familiar with them. If it is the case that they're in different basins, I'd be curious to see whether there are just two basins (poly vs mono), or a basin for each level of monosemanticity, or if even within a level of polysemanticity there are multiple basins. If it's one of the former cases, it's be interesting to do something like the connectivity-based fine-tuning talked about here (, in effect optimise for a new parametrisation that is linearly disconnected from the previous one), and see if doing that from a polysemantic initialisation can produce a more monosemantic one, or if it just becomes polysemantic in a different way.

You also mentioned your initial attempts at sparsity through a hard-coded initially sparse matrix failed; I'd be very curious to see whether a lottery ticket-style iterative magnitude pruning was able to produce sparse matrices from the high-latent-dimension monosemantic networks that are still monosemantic, or more broadly how the LTH interacts with polysemanticity - are lottery tickets less polysemantic, or more, or do they not really change the monosemanticity?

If my understanding of the bias decay method is correct, is a large initial part of training only reducing the bias (through weight decay) until certain neurons start firing? If that's the case, could you calculate the maximum output in the latent dimension on the dataset at the start of training (say B), and then initialise the bias to be just below -B, so that you skip almost all of the portion of training that's only moving the bias term. You could do this per-neuron or just maxing over neurons. Or is this portion of training relatively small compared to the rest of training, and the slower convergence is more due to less neurons getting gradients even when some of them are outputting higher than the bias?

Thanks for writing the post, and it's great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.

Some comments and questions:

  • I think "science of deep learning" would be a better term than "deep learning theory" for what you're describing, given that I think all the phenomena you list aren't yet theoretically grounded or explained in a mathematical way, and are rather robust empirical observations. Deep learning theory could be useful, especially if it had results concerning the internals of the network, but I think that's a different genre of work to the science of DL work.
  • In your description of the relevance of the lottery ticket hypothesis (LTH), it feels like a bit of a non-sequitur to immediately discuss removing dangerous circuits at initialisation. I guess you think this is because lottery tickets are in some way about removing circuits at the beginning of training (although currently we only know how to find out which circuits by getting to the end of training)? I think the LTH potentially has broader relevance for MI, i.e.:  if lottery tickets do exist and are of equal performance, then it's possible they'd be easier to interpret (due to increased sparsity); or just understanding what the existence of lottery tickets means for what circuits are more likely to emerge during neural network training.
  • When you say "Automating Mechanistic Interpretability research", do you mean automating (1) the task of interpreting a given network (automating MI), or automating (2) the research of building methods/understanding/etc. that enable us to better-interpret neural networks (automating MI Research)? I realise that a lot of current MI research, even if the ultimate goal is (2), is mostly currently doing (1) as a first step.
    Most of the text in that section implies automating (1) to me, but "Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety" seems to lean more towards automating (2), which comes under generally approach of automating alignment research. Obviously it would be great to be able to do both of them, but automating (1) seems both much more tractable, and also probably necessary to enable scalable interpretability of large models, whereas (2) is potentially less necessary for MI research to be useful for AI safety.

I've now had a conversation with Evan where he's explained his position here and I now agree with it. Specifically, this is in low path-dependency land, and just considering the simplicity bias. In this case it's likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at run time in activations by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour that maximises the training objective is any arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would reliably produce maximisation of the training objective always (e.g. internal or corrigible alignment), you'd need to fully encode the training objective.

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

The main argument of the post isn't "ASI/AGI may be causally confused, what are the consequences of that" but rather "Scaling up static pretraining may result in causally confused models, which hence probably wouldn't be considered ASI/AGI". I think in practice if we get AGI/ASI, then almost by definition I'd think it's not causally confused.

OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity

In a theoretical sense this may be true (I'm not really familiar with the argument), but in practice OOD misgeneralisation is probably a spectrum, and models can be more or less causally confused about how the world works. We're arguing here that static training, even when scaled up, plausibly doesn't lead to a model that isn't causally confused about a lot of how the world works.

Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, "goal misgeneralisation" is the standard term.

No reason, I'll edit the post to use goal misgeneralisation. Goal misgeneralisation is the standard term but hasn't been so for very long (see e.g. this tweet:

Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks

Given that the model is trained statically, while it could hypothesise about additional variables of the kinds your listed, it can never know which variables or which values for those variables are correct without domain labels or interventional data. Specifically while "Discovering such hidden confounders doesn't give interventional capacity" is true, to discover these confounders he needed interventional capacity.

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

We're not saying that P(shorts, icecream) is good for decision making, but P(shorts, do(icecream)) is useful in sofar as the goal is to make someone where shorts, and providing icecream is one of the possible actions (as the causal model will demonstrate that providing icecream isn't useful for making someone where shorts).

What do these symbols in parens before the claims mean?

They are meant to be referring to the previous parts of the argument, but I've just realised that this hasn't worked as the labels aren't correct. I'll fix that.

When you talk about whether we're in a high or low path-dependence "world", do you think that there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it's more likely that some training processes are highly path-dependent and some aren't. We definitely have evidence that some are path-dependent, e.g. Ethan's comment and other examples like, and almost any RL paper where different random seeds of the training process often result in quite different results. Arguably I don't think we have conclusive of any particular existing training process being low-path dependence, because the burden of proof is heavy for proving that two models are basically equivalent on basically all inputs (given that they're very unlikely to literally have identical weights, so the equivalence would have to be at a high level of abstraction).

Reasoning about the path dependence of a training process specifically, rather than whether all of the ML/AGI development world is path dependent, seems more precise, and also allows us to reason about whether we want a high or low path-dependence training process, and considering that as an intervention, rather than a state of the world we can't change.

When you say "the knowledge of what our goals are should be present in all models", by "knowledge of what our goals are" do you mean a pointer to our goals (given that there are probably multiple goals which are combined in someway) is in the world model? If so this seems to contradict you earlier saying:

The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded

I guess I don't understand what it would mean for the deceptive AI to have the knowledge of what are goals are (in the world model), but for that not to mean it doesn't have a hard-coded pointer to what our goals are. I'd imagine that what it means for the world model to capture what our goals are is exactly having such a pointer to them.

(I realise I've been failing to do this, but it might make sense to use AI when we mean the outer system and model when we mean the world model. I don't think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)

To try and run my (probably inaccurate) simulation of you: I imagine you don't think that's a contradiction above. So you'd think that "knowledge of what our goals are" doesn't mean a pointer to our goals in all the AI's world models, but something simpler, that can be used to figure out what our goals are by the deceptive AI (e.g. in it's optimisation process), but wouldn't enable the aligned AI to use as its objective a simpler pointer and instead would require the aligned AI to hard-code the full pointer to our goals (where the pointer would be pointing into it's the world model, and probably using this simpler information about our goals in some way). I'm struggling to imagine what that would look like.

Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren't aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I'm arguing that the ability to do this in the AI would require additional complexity compared to a AI that doesn't need to rederive the content of this goal (that is, our goal) at every decision.

Alternatively, the aligned model could use the same derivation process to be aligned: The deceptive model has some long-term goal, and in pursuing it rederives the content of the instrumental goal "do 'what the training process incentives'", and the alignment model has the long-term goal "do 'what the training process incentivises'" (as a pointer/de dicto), and also rederives it with the same level of complexity. I think "do 'what the training process incetivises'" (as a pointer/de dicto) isn't a very complex long-term goal., and feels likely to be as complex as the arbitrary crystallised deceptive AI's internal goal, assuming both models have full situation awareness of the training process and hence such a pointer is possible, which we're assuming they do.

(ETA/Meta point: I do think deception is a big issue that we definitely need more understanding of, and I definitely put weight on it being a failure of alignment that occurs in practice, but I think I'm less sure it'll emerge (or less sure that your analysis demonstrates that). I'm trying to understand where we disagree, and whether you've considered the doubts I have and you possess good arguments against them or not, rather than convince you that deception isn't going to happen.)

It seems like a lot more computationally difficult to, at every forward pass/decision process, derive/build/construct such a pointer. If the deceptive model is going to be doing this every time it seems like it would be more efficient to have a dedicated part of the network that calculates it (i.e. have it in the weights)

Separately, for more complex goals this procedure is also going to be more complex, and the network probably needs to be more complex to support constructing it in the activations at every forward pass, compared to the corrigible model that doesn't need to do such a construction (becauase it has it hard-coded as you say). I guess I'm arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has the our goals hard-coded.

whereas the corrigible model needs it to be hardcoded

The corrigible model needs to be able to robustly point to our goals, in a way that doesn't change. One way of doing this is having the goals hardcoded. Another way might be to instead have a pointer to the output of a procedure that is executed at runtime that always constructs our goals in the activations. If the deceptive model can reliably construct in it's activations something that actually points towards our goals, then the corrible model could also have such a procedure, and make it's goal be a pointer to the output of such a procedure. Then the only difference in model complexity is that the deceptive model points to some arbitrary attribute of the world model (or whatever), and the aligned model points to the output of this computation, that both models posses.

I think at a high level I'm trying to say that any way in which the deceptive model can robustly point at our goals such that it can pursue them instrumentally, the aligned model can robustly point at them to pursue them terminally. SGD+DL+whatever may favour one way of another of robustly pointing at such goals (either in the weights, or through a procedure that robustly outputs them in the activations), but both deceptive and aligned models could make use of that.

Load More