*This setup is originally from Paul Christiano. Thanks to Mark Xu for pointing me towards it and providing hints. *

## Summary

I’ve been thinking about __gradient hacking__ lately. The following setting explores what’s the *best *that a gradient hacking model can do when it's part of a mixture of other models (analogous to a set of subnetworks).

I show an optimal gradient hacker can dominate the mixture *even when its loss is constrained to be higher than competing models*. I found this surprising as I initially thought there might be an upper bound on optimal gradient hacking, especially with minimal assumptions (e.g. it's not assumed the model 'plays along' or something in training and gets 0 loss). I don't have much of an intuitive explanation other than that adversarially picked gradient hackers can be scary. I’ve detailed the worked-out example below.

## Two Model Setup

Suppose, we have a mixture consisting of a good model which gets 0 loss in the limit (because it’s aligned with our training procedure) and a gradient hacker which gets loss in the limit (because its actual objective is paperclips).

The mixture is roughly a linear combination of the two models, weighted by for the good model and for the bad model. Suppose we’re doing SGD on the vector , and the total loss for the mixture is .^{[1]} Also suppose can see perfectly and its loss is a function of its relative weight in the mixture. I.e. , which I'll refer to as 's loss modulating function. Note, that must map to values greater than .

If and with , then has a simple strategy for staying dominant. Specifically,

Since gets 0 loss, gets no gradient update. Since ensures loss increases if decreases, also gets no gradient update.^{[2]} This same strategy can be extended in the general case to preserve the initial weighting of .

But that’s certainly not the best can do. If we start with some general , what loss strategy - i.e. what function outputted by - would lead to the highest relative weighting of to ?

Since there’s only two weights, we can start by parametrizing on . Let = 1 - and Referring to as , . Recall that gets 0 loss, so we simplify to .

Suppose SGD updates by . Therefore, . Since is actually , we get . Intuitively, a ‘good’ function for must involve and close to 0 for most of the [0,1] interval.

More formally, we can solve the differential inequality. To solve for an such that (i.e. for the gradient update on to be negative), first consider the boundary condition:

Equivalently,

which yields a solution of the form . gives the lower bound on when , so it must be . If , then only preserves the initial weight of . In other words, achieves the same preservation performance as the piecewise function mentioned earlier with the added benefit of being differentiable. A simple fix is to set to be some small negative quantity which I'll refer to as and modifying appropriately. We now have

which satisfies **. Therefore, the 'best' that **** can gradient hack is increasing its weight up to and including dominating the entire mixture in the setup.**

A few notes:

- This is sort of an upper-bound result with the takeaway that there is no upper-bound if can output loss in the range .
- If loss is bounded above for some reason (i.e. has range , then there is a legitimate upper bound. Specifically, we'd have which now only satisfies the differential inequality on the interval . can be arbitrarily small so for something like , this reduces to the model only being able to gradient hack
**if its initial weight (i.e.****) was greater than****.** **The toy model doesn't consider the potential incentive to modify the gradient hacker.**I.e. the gradients in a richer setup may point toward changing .^{[3]}- Loss modulating functions can be a useful frame for evaluating 'how much' a model can gradient hack. This provides a behavioral view of gradient hacking which I see as complementing the architectural view. More generally, being aware of how a subnetwork's loss changes as its weight changes seems important.
^{[4]}

## Appendix

### Larger Mixtures

Consider the general case of a mixture with models where are aligned with the training procedure and get 0 loss in the limit. Each of the models is weighted by some corresponding . After parametrizing to get weights and enforcing the constraint that all weights must be between 0 and 1, we get the following loss for the mixture:

Without loss of generality, assume the gradient hacking model is one of the initial models. has perfect access to the vector of weights and outputs loss according to some function . can increase its weighting in the mixture by increasing its corresponding parameter . Supposing SGD works by the same procedure as before, we have

gets 0 loss by assumption leaving the following differential inequality:

Sparing the details, a function of the form

where is any non-zero function of all weights except and is some positive number will work for all . The same bounded result also holds, so if 's range is capped at 1 then can only gradient hack if at initialization.

### Non-Zero Loss Models

Suppose there are some models in the mixture that are neither gradient hackers nor fully aligned with the training. Essentially the same result holds. Say their loss with respect to their weight is some positive constant. In the setup above, such models would only help gradient hack. Without loss of generality, let be one such model. The gradient update then is

where and . Accordingly, simply no longer needs the term.

^{^}By assumption, SGD is

*not*operating at the level of modifying either or . It's only operating at the level of altering the relative weights of the two models.^{^}If was differentiable that is. A differentiable approximation would work, though.

^{^}Though there appears to be some recursive element here. How changes may also be susceptible to gradient hacking. Indeed, some precursor to may have been 'benign', but was modified by SGD towards the current equilibrium .

^{^}Though note, there are other reasons besides gradient hacking for the loss modulating function to be non-constant. Modified architecture or different computational capacity at different weights, for example.

This is a cool result. If I'm understanding correctly, M- increases its loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.

Is there a way to extend this to cases where M- doesn't have access to the weights? I think that probably requires an RL environment, but that's entirely based on "I thought about it for a few minutes and couldn't find a way to do it without RL" so I could be way off here.

Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...