*[Epistemic status: brainstorming, less confused than last time since it seems to provide for a nice taxonomy of things to look into rather than just a bunch of random loose threads.]*

*This post would not exist without the **EleutherAI discord** and all the wonderful people there. Special thanks to Evan Hubinger, John Schulman, and Adam Shimi for providing feedback on drafts of this post.*

The goal of this post is to identify a different way of framing the gradient hacking problem, connect it to our intuitions, and use it to create a rough taxonomy of different types of potential gradient hacking. Hopefully, using this taxonomy, we can tackle more well-defined subproblems to narrow down the space of things that could potentially gradient hack.

When we think about gradient hacking, the most intuitive framing is to consider some kind of agent embedded inside a larger network (like a GPT) that somehow intentionally modifies the loss landscape of the larger network with respect to the base loss, such that in optimizing for the base objective, the base optimizer also happens to optimize the mesaobjective. Here I consider the base objective to be a function $\Theta \to \mathbb{R}$ from the params of the network to the reals that, for simplicity, has all the training data baked in, and the mesaobjective to be another function $\Theta \to \mathbb{R}$, possibly with some constraint that both objectives have to be indifferent between models which behave the same on all inputs. The "somehow" is often considered to be some kind of perturbing or otherwise making the output of the larger network worse whenever the mesaobjective isn't met, therefore creating an incentive for gradient descent to improve the mesaobjective. One example of this line of thinking can be found in my last post about gradient hacking. Unfortunately, I think there are some confusions with this framing.

## Full loss landscapes

First off, the base loss landscape of the entire model is a function that's the same across all training steps, and the configuration of the weights selects somewhere on this loss landscape. Configuring the weights differently can put the model on a different spot on this landscape, but it can't change the shape of the landscape itself.

Note that this *doesn't* contradict the interpretation of the gradient hacker as having control over the loss landscape through subjunctive dependence. As an analogy, in Newcomb's problem even if you accept that there is subjunctive dependence of the contents of the box on your decision and conclude you should one-box, it's still *true* that the contents of the box cannot change after Omega has set them up and that there is no *causal* dependence of the contents of the box on your action, even though the dominated action argument no longer holds because of the subjunctive dependence.

To emphasize this landscape covering the entire model, let's call the loss landscape of the base loss with respect to the entire model a *full loss landscape*. Any configuration of the network really just selects a point somewhere on this landscape, and to claim that any such point effectively does gradient hacking would be to argue that as gradient descent evolves this point over time, it manages to also get better at the mesaobjective in some sense. This may seem trivial, but I've noticed that unless you consider this explicitly, it's really easy to get it mixed up with cases where the landscape really does change. This suggests a new framing, where we define gradient hackers with respect to a particular mesaobjective as subsets of parameter space that tend to improve on the given mesaobjective over time, with different mesaobjectives defining different subsets (this is a bit tricky to formalize and I'm still working on it).
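
To make this framing a bit more concrete, here is a minimal sketch under made-up assumptions. `base_loss` is a hypothetical two-parameter loss with two minima, and `mesa_loss` is a hypothetical mesaobjective (treated here as lower-is-better) that prefers one of them; neither comes from a real training setup. The only point is that the full loss landscape is one fixed function, gradient descent just moves a point over it, and whether the mesaobjective improves depends on where in parameter space that point starts.

```python
def base_loss(theta):
    """Hypothetical full loss landscape: fixed for the whole of training."""
    x, y = theta
    return (x * x - 1.0) ** 2 + y * y  # minima at (1, 0) and (-1, 0)


def base_grad(theta):
    """Analytic gradient of base_loss."""
    x, y = theta
    return (4.0 * x * (x * x - 1.0), 2.0 * y)


def mesa_loss(theta):
    """Hypothetical mesaobjective (lower is better): it wants x close to 1."""
    x, _ = theta
    return (x - 1.0) ** 2


def descend(theta, lr=0.05, steps=200):
    """Plain gradient descent on the base loss; returns the whole trajectory."""
    trajectory = [theta]
    for _ in range(steps):
        gx, gy = base_grad(theta)
        theta = (theta[0] - lr * gx, theta[1] - lr * gy)
        trajectory.append(theta)
    return trajectory


def improves_mesa(theta0):
    """True if descending the base loss from theta0 also lowers mesa_loss."""
    trajectory = descend(theta0)
    return mesa_loss(trajectory[-1]) < mesa_loss(trajectory[0])


if __name__ == "__main__":
    print(improves_mesa((0.5, 1.0)))   # start on the side that drains to (1, 0)
    print(improves_mesa((-0.5, 1.0)))  # start on the side that drains to (-1, 0)
```

In this toy, a start at `(0.5, 1.0)` improves on the mesaobjective while a start at `(-0.5, 1.0)` does not, even though `base_loss` itself never changes. The set of starting points for which `improves_mesa` holds is a crude stand-in for the "subset of parameter space" in the framing above.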

Throughout this post I'll be arguing that there are essentially three main behaviors that models containing gradient hackers can exhibit:

- either they converge to some local minimum of the base objective, or
- they don't (by taking advantage of some kind of inadequacy in the base optimizer), or
- the training loss isn't actually the same across all training steps in the first place (mostly RL).

Since I think this distinction is really important, I'm going to call these convergent gradient hackers, nonconvergent gradient hackers, and nonstationary gradient hackers respectively. In other words, no matter how it carves up the model, any proposal for gradient hacking must either argue that the specific local minimum of the base objective that's converged to is also advantageous for the mesaobjective in some way, or must identify some *specific* optimization inadequacy that's being used to keep the model away from local minima, or must argue that the training loss is nonstationary.

I want to emphasize that I'm *not* arguing that because the base loss is already fixed in place before we start training, and the gradient hacker lives inside the model which is instantiated after the base loss is frozen in place, therefore the gradient hacker cannot influence the base loss, therefore gradient hacking is not possible.

Rather, I'm trying to argue that if there do exist gradient hackers, the right way to think about them is that they must: be identified with a local minimum of the base loss (convergent), or have some mechanism for breaking convergence and not converging to a local minimum of the base loss (nonconvergent), or have some mechanism for making the base loss nonstationary. With some careful formalisms, these three cases should be general enough to cover every possible gradient hacker.

In the Newcomb analogy, what I'm doing is dividing the space of possible universes into ones where either the opaque box is empty or it isn't, and *not* making the subsequent "dominated action" argument.

## Partial loss landscapes

I think this framing is really different from the intuition I mentioned at the beginning. Thankfully, I think there's still a way to connect that intuition to this framing by looking at the problem from a different angle. It really does feel like there's some sense in which by configuring some of the weights we can affect some sort of loss landscape — the loss landscape of the base loss wrt all parameters *except* those that are part of the agent; let’s call those parameters the agent parameters. Let's call this a *partial loss landscape*, because it's only a slice of the entire loss landscape. One way to think about this is you're considering the gradient hacking subnetwork as part of the effective loss function *for the part of the network that it's trying to optimize*. This seems to be what we intuitively think about when we think of "shaping the loss landscape."

The partial loss landscape changes as you change around the agent parameters, and you can set things up so that the partial loss landscape lines up with the things your gradient hacker wants to do to the rest of the model as defined by the mesaobjective. One of my hopes when first looking into gradient hacking was essentially to handcraft a model with agent parameters that were both stable and shaped the partial loss landscape to match up with the mesaobjective. However, I quickly ran into problems with keeping the agent stable in a noncontrived architecture.

SGD initially pushes the model in the right direction. However, since the agent parameters are part of the model too, they can and will start drifting due to gradient descent as well. As a result, the partial loss landscape changes throughout training too. Intuitively, the agent parameters will want to change in the direction that makes the mesaobjective being enforced easier.
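
The drift described above can be sketched with a two-parameter toy. Everything here is an assumption made for illustration: `a` stands in for the agent parameters, `w` for the rest of the model, and `loss` is a hypothetical base loss in which the slice over `w` (the partial loss landscape) has its minimum wherever `a` puts it. The `freeze_agent` flag is a stand-in for perfect gradient protection, which is exactly the thing that is hard to get in a noncontrived architecture.

```python
def loss(a, w):
    """Hypothetical base loss: the slice over w has its minimum at w = a."""
    return (w - a) ** 2 + 0.1 * a ** 2


def grads(a, w):
    """Analytic partial derivatives of loss with respect to a and w."""
    d_a = -2.0 * (w - a) + 0.2 * a
    d_w = 2.0 * (w - a)
    return d_a, d_w


def train(a, w, lr=0.1, steps=2000, freeze_agent=False):
    """Gradient descent on (a, w); optionally pretend the agent is protected."""
    for _ in range(steps):
        d_a, d_w = grads(a, w)
        if not freeze_agent:
            a -= lr * d_a  # the agent parameters get gradients like any others
        w -= lr * d_w
    return a, w


# With a frozen agent, the rest of the model is pulled to the agent's target.
frozen = train(3.0, 0.0, freeze_agent=True)

# Without protection, the agent drifts and the target it enforced goes with it.
drifting = train(3.0, 0.0)

if __name__ == "__main__":
    print("frozen agent:  ", frozen)
    print("drifting agent:", drifting)
```

With the agent frozen, `w` ends up at the agent's target of 3. With the agent trainable, both parameters slide to the joint minimum at the origin, so the partial landscape the agent started out enforcing does not survive training.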

Initially I was hoping to find a clever way to do gradient protection that also works in noncontrived architectures, but I now think that these difficulties are a fundamental problem resulting from the constraint in the previous section; rearranging the problem in this way just makes that harder to realize. Basically, my point in this section is that I think my new framing resolves some of the confusion I had with my previous framing and explains why I was getting nowhere earlier.

## Categorizing failures of gradient hacking

Here is a look into each of the three categories I identified above. This is also sort of a future work section, since I have some vague ideas about what might be useful for each of these subcases.

### Convergent - Local (and even global) minima can be arbitrarily bad

It could be that being able to specify a local minimum is really powerful, so convergent gradient hackers are too. I don't know enough about ML theory to say much about the structure of local minima, but as long as local minima are abundant enough this is a serious concern. For instance, imagine a contrived toy example where there are two local minima: one where the model outputs optimally and achieves loss 0.0, and one where the model outputs optimally 99% of the time and does something extremely dangerous the other 1% of the time and achieves loss 0.1. If the model converges to the second one then it might be reasonable to argue that the model is gradient hacking. It might be that it's impossible for this to happen (there are some papers that suggest global convergence of certain overparameterized NNs), but I wouldn't bank on it.
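
The contrived two-minimum example can be written down directly. This is only a sketch: `loss` is a hypothetical one-parameter function constructed by hand so that it has a local minimum with loss 0.0 at `x = 1` and another with loss 0.1 at `x = -1`. It says nothing about what the two "models" do, only that plain gradient descent will happily converge to either one depending on the initialization.

```python
def loss(x):
    """Hand-built loss with minima at x = 1 (loss 0.0) and x = -1 (loss 0.1)."""
    return (x * x - 1.0) ** 2 + 0.025 * (x ** 3 - 3.0 * x + 2.0)


def grad(x):
    """Analytic derivative of loss; it vanishes at x = 1 and x = -1."""
    return (x * x - 1.0) * (4.0 * x + 0.075)


def descend(x, lr=0.05, steps=500):
    """Plain gradient descent from initial point x."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


if __name__ == "__main__":
    for x0 in (0.5, -0.5):
        x_final = descend(x0)
        print(f"init {x0:+.1f} -> x = {x_final:+.4f}, loss = {loss(x_final):.4f}")
```

Nothing about the optimizer is broken in the run that ends at loss 0.1: it converged, just to the worse of the two minima.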

Even if you assume convergence to a global min, you still might not be safe. As a contrived example, the MNIST classification loss is totally indifferent between a model that just solves MNIST perfectly and uses the rest of the parameters to do nothing, and one that checks to see if the solution to RSA-2048 is encoded into the input and computes and outputs a plan to turn the universe into paperclips if so, and if not solves MNIST perfectly. We *could* punt this problem by saying that the base objective is at fault for not caring and declare that this is actually an outer alignment failure since there's no optimization failure going on here. However, that doesn't actually make the problem go away in practical situations where we simply can't make the base objective care about every possible input. (This might also give a useful categorization of outer alignment failures: there are failures where the base objective cares about something differently from how we "really want", and failures where the base loss is indifferent where it shouldn't be and so the model fails at interpolation/extrapolation.) You could also add some kind of complexity penalty (the *simplest* model that achieves a global minimum must be non-deceptive - any deceptive model achieving a global minimum must do at least what the simplest non-deceptive model does, plus also whatever it’s doing deceptively), but that's also probably really difficult to make work (KC is uncomputable and not differentiable, and weight decay is completely broken by batchnorm and in general doesn't really measure complexity per se, etc).
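
The indifference of the base loss can be illustrated with a deliberately silly toy. All names here are hypothetical: a parity task stands in for MNIST, `TRIGGER` stands in for "the input encodes the solution to RSA-2048", and the string `"paperclips"` stands in for the malicious plan. The two models get exactly the same training loss because the trigger never appears in the training set.

```python
TRIGGER = 10 ** 6  # hypothetical input that never occurs in the training data


def honest_model(x):
    """Solves the toy task (parity) on every input."""
    return x % 2


def backdoored_model(x):
    """Solves the toy task too, except on the trigger input."""
    if x == TRIGGER:
        return "paperclips"  # stand-in for the malicious behavior
    return x % 2


def training_loss(model, dataset):
    """Fraction of training examples the model gets wrong (0-1 loss)."""
    return sum(model(x) != y for x, y in dataset) / len(dataset)


train_set = [(x, x % 2) for x in range(1000)]

if __name__ == "__main__":
    print(training_loss(honest_model, train_set))
    print(training_loss(backdoored_model, train_set))
    print(honest_model(TRIGGER), backdoored_model(TRIGGER))
```

Both models sit at a global minimum of this training loss, so no argument that only looks at the loss value can distinguish them.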

My suspicion is that nothing too crazy happens in practice. Local minima are rarer in higher dimensional optimization, SGD in theory doesn't get stuck in saddle points (all of this under a ton of simplifying assumptions, admittedly), and in practice SGD seems to somehow just be biased towards minima that generalize really well, so a model getting stuck in a *really* suboptimal local minimum seems unlikely. I don't have a very rigorous argument along these lines yet, though. It also seems a bit far-fetched that a model converged to a global minimum will somehow learn a really malicious subnetwork, though I don't know how to justify this intuition without appealing to handwavy arguments: that SGD has some kind of inherent simplicity bias towards less complex models, and possibly that these states don't have very large attractor basins, because the slice of parameter space that converges to a model that happens to have a malicious subnetwork while also achieving a global minimum is tiny (since it can't rely on the gradient to form those subnetworks).

Here's a sampling of a few convergence proofs that might be useful and the kinds of assumptions they make:

- convergence to critical points with probability 1 if the learning rate decays on a suitable schedule (the precise decay condition is given in the paper), the gradient exists everywhere and is bounded, the gradient is Lipschitz-continuous (this in particular excludes ReLUs!), and the sublevels and gradient sublevels are bounded. (paper)
- convergence to within any $\epsilon$ of critical points with probability 1 given a small enough constant learning rate; here the model is allowed to be nonconvex and nonsmooth, and is only required to be locally Lipschitz. (paper)
- convergence to global minimum with *massively* overparameterized ReLU networks using cross entropy, Gaussian init, data meeting certain separation criteria. (paper)

These all make rather spherical-cow-like assumptions to various degrees, but I think they are useful as a starting point.

### Nonconvergent - Some ways to not converge to a local min

We can divide the kinds of reasons that the model might *not* converge into several categories, loosely using the assumptions of convergence proofs as a guide (this probably should be formalized at some point):

- Abuse of floating point rounding and other numerics nonsense - this could be abused to make the computed gradient not match up with the actual gradient, which lets you totally break convergence
- Model architecture + loss function results in a loss landscape that's not sufficiently nice (e.g. not (locally) Lipschitz, or gradient not Lipschitz continuous, depending on which convergence proof)
- Learning rate schedule doesn't allow the model to converge (but rather oscillate around a min) - I don't expect this one to actually matter in practice since the areas in parameter space near a non-malicious minimum are probably very unlikely to be malicious, but it complicates proofs.
- Lots of bad critical points (e.g. saddle points). I mentioned before that this seems unlikely in practice given the evidence we have, but you could probably construct architectures by hand that do this (e.g. using activation functions with lots of distinct saturated regions). My intuition is that this is a non-problem in most reasonable architectures but it complicates proofs. Contrived, unreasonable architectures probably break this assumption the hardest.
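
Two of the items in the list above are easy to see in a few lines. This is a sketch with hypothetical toy functions, not a claim about how a real gradient hacker would exploit them. `staircase` is mathematically the identity, but in 64-bit floating point it is flat near zero, so the symbolic gradient (1 everywhere) disagrees with what the computed function actually does. `gd_on_quadratic` shows a constant learning rate that makes gradient descent on `x**2` bounce around the minimum forever instead of converging.

```python
BIG = 2.0 ** 53  # above this magnitude, adjacent 64-bit floats are 2.0 apart


def staircase(x):
    """Mathematically equal to x, but the addition rounds small x away."""
    return (x + BIG) - BIG


def staircase_symbolic_grad(x):
    """What differentiating the formula gives: d/dx [(x + BIG) - BIG] = 1."""
    return 1.0


def gd_on_quadratic(lr, x=1.0, steps=50):
    """Gradient descent on the loss x**2 (gradient 2x) with a constant lr."""
    for _ in range(steps):
        x = x - lr * 2.0 * x
    return x


if __name__ == "__main__":
    # The formula says slope 1, yet moving the input does not move the output.
    print(staircase(0.3), staircase(0.4), staircase_symbolic_grad(0.3))
    # lr = 1.0 maps x to -x on every step; lr = 0.25 halves x on every step.
    print(gd_on_quadratic(lr=1.0), gd_on_quadratic(lr=0.25))
```

Both are of course far too crude to matter in a real training run; they are only meant to show which assumption of a convergence proof each bullet is attacking.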

I think a bunch of these are things capabilities researchers are actively working very hard on, so it would be low leverage for us to focus on them. I think there's a good chance that ML systems in practice converge somewhat often because the ones that don't converge suck, and capabilities researchers are constantly hard at work trying to find models that don't suck.

One other problem is that for really large models, we often don't train to convergence, since this isn't the most compute efficient option. So you could have a model that in theory would eventually converge, but spends long enough in gradient hacked regions that when we stop it, it hasn't left one of them yet. Some of the arguments about randomly initialized models being biased towards simple functions might be relevant for arguing that the model will never get into a gradient hacking region to begin with. Maybe the solution is just to give up compute efficiency and train models all the way (though this hurts competitiveness quite a bit). Or, maybe models that would eventually converge are safe throughout most of training, or perhaps we'll figure out how to converge compute-efficiently in the future (maybe something like superconvergence?) which totally obsoletes compute-efficient early stopping anyways.

It could also be that there are regions of pathologically slow convergence that are technically in the convergence basin of some local minimum but are sufficiently far from that minimum so as to have completely different (specifically, more malicious) behavior. This would definitely pose a problem for early stopping, and possibly even fool us into thinking that a region is near a local minimum when it isn't actually. My hope here would be for there to be some way to detect when this is happening, or some kind of proof that bounds how slowly a model can actually converge or how much a model can change from a pathologically slow region to the minimum that it eventually converges to.
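
As a rough illustration of what "pathologically slow" could look like, here is a sketch comparing two hypothetical one-parameter losses that share the same minimum at zero. Gradient descent on `x**2` shrinks the distance to the minimum by a constant factor per step, while on `x**4` the gradient vanishes so quickly near the minimum that progress stalls. The step budget and learning rate are arbitrary choices for the example.

```python
def descend(grad, x=1.0, lr=0.1, steps=100):
    """Plain gradient descent for a fixed step budget (i.e. early stopping)."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def grad_quadratic(x):
    """Gradient of the loss x**2."""
    return 2.0 * x


def grad_quartic(x):
    """Gradient of the loss x**4, which is very flat around its minimum."""
    return 4.0 * x ** 3


fast = descend(grad_quadratic)
slow = descend(grad_quartic)

if __name__ == "__main__":
    print(f"x**2 after 100 steps: {fast:.3e}")
    print(f"x**4 after 100 steps: {slow:.3e}")
```

Both runs are in the convergence basin of the same minimum, but when the budget runs out only one of them is anywhere near it. A bound on how slow this can get, or a metric that flags it during training, is the kind of result hoped for above.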

### Nonstationary - Reinforcement Learning, Continual Learning, and other non-stationary training

In RL, there's definitely a way for the model to actually influence the base objective, since the base objective is non-stationary and depends in large part on the model's actions. Your training data (and thus base objective) depends on the trajectories resulting from interactions between your previous policy and the environment. Similarly, in CL, if the model exerts any influence on the data source it pulls from, it will be able to affect its future training data.

Unfortunately there are some really weird cases here. If you're learning from human feedback or otherwise influencing the world to get humans to do things that then feed back into your training set, you create tons of feedback loops with a human in the loop that are really hard to reason about. I'm pretty confused about which of the things in this category should even count as gradient hacking; I haven't thought extensively about this case so everything in this section is low confidence. The good news is that I think even if gradient hacking is totally plausible in RL/CL, being able to show that simple predictive models like LMs trained on stationary distributions can't gradient hack under certain reasonable conditions is a huge deal.

For concreteness, one super simple example of a nonstationary gradient hacker with CL would be models that generate funny outputs that humans like and tend to repost elsewhere on the internet, which then gets scraped and fed into the model again, leading to a model that produces more of those outputs, etc in a feedback loop. This particular case seems fairly benign but it's at least not *a priori* obvious that there's no malicious version of this involving much more complex interactions with the world.
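
The funny-output loop can be sketched as a tiny deterministic simulation. Every number and name here is a made-up assumption: the "model" is a single probability `p` of producing a funny output, humans repost funny outputs three times as often as plain ones, and each round the model is refit to a scrape that mixes fresh human data with reposted model outputs. What the sketch shows is that the training target in each round depends on the previous model, which is what makes the objective nonstationary.

```python
def reposted_funny_fraction(p, funny_rate=3.0, plain_rate=1.0):
    """Fraction of reposted model outputs that are funny, given repost rates."""
    funny = funny_rate * p
    plain = plain_rate * (1.0 - p)
    return funny / (funny + plain)


def simulate(rounds=30, human_funny=0.1, model_share=0.5):
    """Repeatedly refit the model to a scrape containing its own reposts."""
    p = human_funny  # the first model is trained on human data only
    history = [p]
    for _ in range(rounds):
        scraped = ((1.0 - model_share) * human_funny
                   + model_share * reposted_funny_fraction(p))
        p = scraped  # maximum-likelihood refit matches the scraped frequency
        history.append(p)
    return history


if __name__ == "__main__":
    history = simulate()
    print(f"start: {history[0]:.3f}, after 30 rounds: {history[-1]:.3f}")
```

Under these assumptions the fraction of funny outputs climbs from the human baseline of 0.1 and settles at a higher fixed point, without any single training round doing anything other than fitting its data.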

One related possibility is that training models on data containing previous models' outputs (or data influenced by the previous model's outputs) does something similar to RL/CL. This way, although each model can't affect its own training data, it can affect the training data of the next model. For example, GPT-n outputs and people talking about GPT-n outputs and so on are present on the internet and therefore being scooped up into the GPT-(n+1) training set. This is essentially a really slow version of CL, and this doesn't allow a model to hack itself but rather only future models so it doesn't technically count as gradient hacking, but I think it's still potentially useful to think about, if only to help develop intuitions for the RL/CL cases.

I don't think it's possible to entirely eliminate this effect, since your only options would be to restrict yourself to pre-LM data (which limits how much data you have and will get further and further out of distribution as the world changes) or to try and remove any trace of LMs from the data (which seems basically impossible, especially if LMs are ubiquitous in the future, since we have to not just remove LM outputs but also anything downstream of LM outputs; relevant movie). However, I also don't think this is likely to be a serious issue. I think the main argument against simple predictive models like LMs doing nonstationary gradient hacking this way being a serious concern looks something like: a) complex interactions get washed out by chaos in the real world, b) there's a very tight limit to what kinds of influence the model can exert on its future self using only simple interactions (e.g. the funny output example might be as far as it gets), c) if it relies on humans amplifying something, then that feedback loop could be, and probably is, happening without LMs anyways (e.g. people share funny things whether or not LMs exist), and d) RL is already really slow and inefficient, and this process would be like taking an RL step once a year.

## Implications

The framing in this post helps us subdivide the problem space of gradient hacking into several quite distinct mechanisms. Crucially, this lets us assume a lot of stuff in each case: for convergent gradient hackers we only need to consider local minima and for nonconvergent gradient hackers we only need to consider ways to break convergence. (For nonstationary I don't really know what to do yet, but regular supervised training avoids this class of failure cases.) An example of the type of argument I expect: if we could show that no model for a given architecture and objective exhibiting gradient-hacker-like behaviour can protect itself from gradients, we could rule out the possibility of convergent gradient hackers entirely for that particular setting.

Hopefully, we can formalize these failure cases (and the overarching argument) and rigorously show conditions under which each of these is provably impossible (or, failing that, metrics that we can compute throughout training to tell us when these things are happening). Then, we can design our model architectures and training processes to satisfy these conditions as much as possible. That being said, there are a *lot* of nuances so I don't expect this to be easy at all.

If we can avoid gradient hacking, this significantly limits how powerful deceptively aligned mesaoptimizers can get. Without gradient hacking, I don't think I see any plausible mechanisms for a mesaoptimizer covering up its own tracks from our interpretability tools, and it also severely limits the things it can do to the rest of the model; in particular, I think it's basically impossible for a deceptively aligned mesaoptimizer to take control of the entire model and make the base objective near-irrelevant, or anything nearly that drastic.

(Moderation note: added to the Alignment Forum from LessWrong.)