This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand crafting an example.
I wouldn't say that this work it attempting to "exhibit gradient hacking". (Succeeding in that would require to create a model that can actually model SGD.) Rather, my understanding is that this work is trying to demonstrate techniques that might be used in a gradient hacking scenario.
There are a few ways to protect a subnetwork from being modified by gradient descent that I can think of (non-exhaustive list):
Another way of "protecting" a piece of logic in the network from changes (if we ignore regularization) is by redundancy: Suppose there are two different pieces of logic in the network such that each independently makes the model output what the mesa optimizer wants. Due to the redundancy, changing any single weight—that is associated with one of those two pieces of logic—does not change the output, and thus the gradient components of all those weights should be close to zero.
The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning some mesaobjective that we specified via some section of the model weights () completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet but I would say this goal is very much to exhibit gradient hacking.
I don't think redundancy will work. Suppose you have some continuous everywhere, differentiable countably-almost everywhere combining function that takes the outputs from two redundant copies of and outputs some kind of combined output. (If you're allowed functions that don't meet the "continuous everywhere, differentiable countably-almost everywhere" requirement, you might as well just skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients to and when they are equal, then it must be that at all points where , . There should also exist at least some where , since otherwise no longer depends on the pair of redundant networks at all which means that those networks can't actually affect what the network does which defeats the whole point of this in the first place.
Let us then define . Then, for all . This implies that is a constant function. Therefore, there do not exist where . This is a contradiction, and therefore cannot exist.
To make sure I understand your notation, is some set of weights, right? If it's a set of multiple weights I don't know what you mean when you write .
There should also exist at least some f1,f2 where C(f_1,f_1)≠C(f_2,f_2), since otherwise C no longer depends on the pair of redundant networks at all
(I don't yet understand the purpose of this claim, but it seems to me wrong. If for every , why is it true that does not depend on and when ?)
Due to the redundancy, changing any single weight—that is associated with one of those two pieces of logic—does not change the output.
You seem to be under the impression that the goal is to make the NN robust to single-weight perturbation. But gradient descent doesn’t modify a neural network one weight at a time, and so being robust to single-weight modification doesn’t come with any real guarantees. The backward pass could result in weights of both forks being updated.
But gradient descent doesn’t modify a neural network one weight at a time
Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect loss.
[Epistemic status: brainstorming, still somewhat confused, with a bunch of loose threads left.]
This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand crafting an example. The discussion was sparked by this post. We didn't end up coming up with any good examples (or proofs of nonexistence) but hopefully this post is helpful for anyone else trying to construct gradient hacking examples.
Note that because our goal is to construct a concrete example of gradient hacking, when I write about "what we want'' and "unfortunate" roadblocks, those are from the perspective of a mesaoptimizer (or a researcher trying to construct an example of a mesaoptimizer to study), not from the perspective of a researcher attempting to build aligned AI.
Thanks to Adam Shimi, AI_WAIFU, Charles Foster, Evan Hubinger, Kyle and Laria, Stella Biderman, and alstroemeria313 for their contributions to the discussion.
Formulation
We mostly considered the following specific formulation since it seemed like a reasonably general framework.
There is some base model being trained, which contains an honest subnetwork h that attempts to solve for the base objective, a mesaobjective embedded somewhere in the model’s weights, and a malicious subnetwork (or mesaoptimizer, though I prefer the former term because it makes it clear that we're talking about the subnetwork in particular) that attempts to use the gradient to optimize itself for the mesaobjective.
In particular, we assume the mesaobjective looks a lot like a supervised learning objective, since it seems to be intuitively the most useful kind of mesaobjective. In other words, the objective looks like minimizing some loss over a dataset. To solve the problem of getting this training data to the malicious subnetwork, the mesaobjective is implicitly embedded in the form of a mesa loss function L and a pair of functions f and g that take training data given to the model from the base optimization process and creates training pairs to give to the malicious model. To be more concrete, the mesaloss is L(m(f(x)),g(x)) for any training data x. This loss is then used in an enforcement mechanism that perturbs the output from the honest model in some way before outputting it.
While it may initially seem like requiring the malicious model’s training data to be derived from the input data is a big constraint, I don’t expect it to be a big problem in, say, big LMs where the training data is the internet. Also, since both and are derived from x within the model, it might seem like there is a chicken and egg problem where the model can only learn to fit things it can already create data for, but this is actually not a major problem. For example, suppose the base model classifies arXiv papers. Then f could pick out the body of the paper and g the abstract, and m would learn to summarize, which is highly nontrivial.
Here’s a diagram illustrating this formulation:
(Trying to) build an example of gradient hacking
In order for gradient hacking to happen, it must be possible to prevent gradient updates to f and g and L, and to the enforcement mechanism that perturbs the output of the honest subnetwork. If we can’t do this, then the mesaobjective will drift to become easier (and eventually become a constant function) and the enforcement mechanism will be disabled entirely, causing the mesaoptimizer to become completely removed. We want it so that the only way to reduce the perturbation is to make m better at the mesaobjective. Also, while the enforcement mechanism and L cannot be updated, they must still allow gradients to flow through themselves, since otherwise m would receive no gradients.
There are a few ways to protect a subnetwork f from being modified by gradient descent that I can think of (non-exhaustive list):
Stop gradient related tricks don't work for L and the enforcement mechanism because that would prevent the gradient from getting to m. Also, it seems pretty unlikely that the output of these is too useful for anything else in the model so it would probably be hard to use something else depending on it to pin it in place. This means the best solution is probably to exploit some hard coded structure in the architecture. While some current architectures come close to being useful (i.e an attention layer with a query as the malicious subnetwork prediction, key as the mesa target g(x), and value as the output from the honest subnetwork h(x) would work if keys and queries were first scaled to unit length, so that cosine similarly instead of dot product were used in the attention layer), it's certainly not reliable across architectures.
Assuming we could fix that problem, the other problem is we need f and g to remain stable. We can use a stop gradient here. The most obvious way to do so is to have a staircase function with small steps so it essentially passes the input through with zero gradient. Unfortunately, it doesn't seem obvious that a staircase can be easily learned and will remain stable. We could also make h depend on f and g, but that places constraints on what f and g can be and also potentially opens f and g up to being moved around by h to be more useful for it. One other way to get a stop gradient is to use one that's already part of the training process. In particular, one interesting example is RL training of LMs, where the model outputs are discretized and fed back into the model. Finally, it's possible we could exploit floating point numerical stability and rounding. Perhaps a model can abuse rounding the gradient to zero while still passing a few bits of information through, or increase the scale of the gradient elsewhere in the model to force the loss scale down and drive gradients to zero.
Things to look into from here