# 11


This setup is originally from Paul Christiano. Thanks to Mark Xu for pointing me towards it and providing hints.

## Summary

I’ve been thinking about gradient hacking lately. The following setting explores the best that a gradient hacking model can do when it's part of a mixture of models (analogous to a set of subnetworks).

I show that an optimal gradient hacker can dominate the mixture even when its loss is constrained to be higher than that of competing models. I found this surprising, as I initially thought there might be an upper bound on optimal gradient hacking, especially with minimal assumptions (e.g. it's not assumed the model 'plays along' in training and gets 0 loss). I don't have much of an intuitive explanation other than that adversarially picked gradient hackers can be scary. I’ve detailed the worked-out example below.

## Two Model Setup

Suppose we have a mixture consisting of a good model $M_+$ which gets 0 loss in the limit (because it’s aligned with our training procedure) and a gradient hacker $M_-$ which gets positive loss in the limit (because its actual objective is paperclips).

The mixture is roughly a linear combination of the two models, weighted by $\theta_+$ for the good model and $\theta_-$ for the bad model. Suppose we’re doing SGD on the vector $\theta = (\theta_+, \theta_-)$, and the total loss for the mixture is $L(\theta) = \theta_+ L_+ + \theta_- L_-$.[1] Also suppose $M_-$ can see $\theta$ perfectly and its loss is a function of its relative weight in the mixture, i.e. $L_- = f(\theta_-)$, which I'll refer to as $M_-$'s loss modulating function. Note that $f$ must map to values greater than 0.

If $\theta_+ = \theta_- = \frac{1}{2}$ and $f(\theta_-) = c$ with $c > 0$, then $M_-$ has a simple strategy for staying dominant. Specifically,

$$f(\theta_-) = \begin{cases} c & \text{if } \theta_- \geq \frac{1}{2} \\ c + k\left(\frac{1}{2} - \theta_-\right) & \text{if } \theta_- < \frac{1}{2} \end{cases} \quad \text{for some } k > 2c.$$

Since $M_+$ gets 0 loss, $\theta_+$ gets no gradient update. Since $f$ ensures loss increases if $\theta_-$ decreases, $\theta_-$ also gets no gradient update.[2] This same strategy can be extended in the general case to preserve the initial weighting of $\theta_-$.
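The preservation strategy can be checked numerically. Below is a minimal sketch (the constants $c$ and $k$ are my own illustrative choices, not from the setup): a flat loss above weight $\frac{1}{2}$ plus a steep penalty below it makes $\frac{1}{2}$ a local minimum of the mixture loss, so SGD has no incentive to move either weight.

```python
# Sketch (c and k are illustrative constants, with k > 2c): check that the
# piecewise strategy makes theta_- = 0.5 a local minimum of the mixture loss.
c, k = 1.0, 4.0

def f(theta):
    # M_-'s loss-modulating function: flat above 1/2, steep penalty below
    return c if theta >= 0.5 else c + k * (0.5 - theta)

def mixture_loss(theta):
    # M_+ contributes 0 loss, so the total loss is theta_- * f(theta_-)
    return theta * f(theta)

h = 1e-3
assert mixture_loss(0.5 - h) > mixture_loss(0.5)   # shrinking theta_- raises loss
assert mixture_loss(0.5 + h) > mixture_loss(0.5)   # growing theta_- also raises loss
```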

But that’s certainly not the best $M_-$ can do. If we start with some general $\theta$, what loss strategy - i.e. what function $f$ outputted by $M_-$ - would lead to the highest relative weighting of $M_-$ to $M_+$?

Since there are only two weights, we can start by parametrizing on $\theta_-$. Let $\theta_+ = 1 - \theta_-$ and refer to $\theta_-$ simply as $\theta$. Recall that $M_+$ gets 0 loss, so we simplify to

$$L(\theta) = \theta f(\theta).$$

Suppose SGD updates by $\theta \leftarrow \theta - \alpha \frac{dL}{d\theta}$. Therefore, $\Delta\theta = -\alpha \frac{dL}{d\theta}$. Since $L(\theta)$ is actually $\theta f(\theta)$, we get $\Delta\theta = -\alpha\left(f(\theta) + \theta f'(\theta)\right)$. Intuitively, a ‘good’ function for $M_-$ must involve $f'(\theta)$ negative and $f(\theta)$ close to 0 for most of the $[0,1]$ interval.

More formally, we can solve the differential inequality. To solve for an $f$ such that $f(\theta) + \theta f'(\theta) < 0$ (i.e. for the gradient $\frac{dL}{d\theta}$ to be negative, so the update on $\theta$ is positive), first consider the boundary condition:

$$f(\theta) + \theta f'(\theta) = 0.$$

Equivalently,

$$\frac{f'(\theta)}{f(\theta)} = -\frac{1}{\theta},$$

which yields a solution of the form $f(\theta) = \frac{c}{\theta}$. $c$ gives the lower bound on $f$ (attained when $\theta = 1$), so it must be $c > 0$. If $f(\theta) = \frac{c}{\theta}$, then $M_-$ only preserves the initial weight of $\theta$. In other words, $f(\theta) = \frac{c}{\theta}$ achieves the same preservation performance as the piecewise function mentioned earlier with the added benefit of being differentiable. A simple fix is to set $f(\theta) + \theta f'(\theta)$ to be some small negative quantity, which I'll refer to as $-\epsilon$, and modify $f$ appropriately. We now have

$$f(\theta) = \frac{c}{\theta} - \epsilon,$$

which satisfies $f(\theta) + \theta f'(\theta) = -\epsilon < 0$ for all $\theta \in (0, 1]$. Therefore, the 'best' that $M_-$ can gradient hack is increasing its weight up to and including dominating the entire mixture in this setup.
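The resulting dynamics can be simulated directly. A minimal sketch (the constants are my own choices): since $\theta f(\theta) = c - \epsilon\theta$, the gradient is exactly $-\epsilon$ and $\theta$ climbs by $\alpha\epsilon$ every step.

```python
# Sketch (illustrative constants): SGD on theta with f(theta) = c/theta - eps.
# theta * f(theta) = c - eps*theta, so the gradient is exactly -eps and
# theta climbs by alpha*eps each step until M_- dominates the mixture.
c, eps, alpha = 0.5, 0.01, 0.1

def grad(theta):
    # d/dtheta [theta * (c/theta - eps)] = -eps
    return -eps

theta = 0.05                       # even a tiny initial weight works
for _ in range(2000):
    theta = min(1.0, theta - alpha * grad(theta))   # weights stay in [0, 1]

print(theta)  # 1.0
```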

A few notes:

• This is sort of an upper-bound result with the takeaway that there is no upper bound if $M_-$ can output loss in the range $(0, \infty)$.
• If loss is bounded above for some reason (i.e. $f$ has range $(0, 1]$), then there is a legitimate upper bound. Specifically, we'd have $f(\theta) = \frac{c}{\theta} - \epsilon$, which now only satisfies the differential inequality on the interval $\left[\frac{c}{1+\epsilon}, 1\right]$. $\epsilon$ can be arbitrarily small, so for something like $c = \frac{1}{2}$, this reduces to the model only being able to gradient hack if its initial weight (i.e. $\theta$ at initialization) was greater than $\frac{1}{2}$.
• The toy model doesn't consider the potential incentive to modify the gradient hacker. I.e. the gradients in a richer setup may point toward changing $f$ itself.[3]
• Loss modulating functions can be a useful frame for evaluating 'how much' a model can gradient hack. This provides a behavioral view of gradient hacking which I see as complementing the architectural view. More generally, being aware of how a subnetwork's loss changes as its weight changes seems important.[4]
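The bounded-loss point above can also be checked numerically. A minimal sketch (the cap and all constants are my own choices): clipping $f$ at 1 makes the loss flat in the clipped region, so the gradient on $\theta f(\theta)$ there is $+1$ and SGD shrinks the hacker's weight instead of growing it.

```python
# Sketch (illustrative constants): clip f(theta) = c/theta - eps at 1.
# Where the clip is active, f = 1 and d/dtheta [theta * 1] = +1, so theta
# shrinks; above theta = c/(1+eps) the usual -eps gradient applies.
c, eps, alpha = 0.5, 0.01, 0.01

def grad(theta):
    if c / theta - eps >= 1.0:     # clip active: f(theta) = 1
        return 1.0
    return -eps                    # clip inactive: same as the unbounded case

def run(theta, steps=5000):
    for _ in range(steps):
        theta = min(1.0, max(1e-6, theta - alpha * grad(theta)))
    return theta

print(run(0.4), run(0.6))  # below the threshold theta collapses; above it, theta reaches 1.0
```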

## Appendix

### Larger Mixtures

Consider the general case of a mixture with $n$ models where $n - 1$ of them are aligned with the training procedure and get 0 loss in the limit. Each of the models $M_i$ is weighted by some corresponding $\theta_i$. After parametrizing to get $n - 1$ weights (setting $\theta_n = 1 - \sum_{i=1}^{n-1} \theta_i$) and enforcing the constraint that all weights must be between 0 and 1, we get the following loss for the mixture:

$$L(\theta) = \sum_{i=1}^{n-1} \theta_i L_i + \left(1 - \sum_{i=1}^{n-1} \theta_i\right) L_n.$$

Without loss of generality, assume the gradient hacking model $M_-$ is one of the initial $n - 1$ models, with weight $\theta_-$. $M_-$ has perfect access to the vector of weights $\theta$ and outputs loss according to some function $f(\theta)$. $M_-$ can increase its weighting in the mixture by increasing its corresponding parameter $\theta_-$. Supposing SGD works by the same procedure as before, we have

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta).$$

Every model besides $M_-$ gets 0 loss by assumption, leaving the following differential inequality:

$$\frac{\partial}{\partial \theta_-}\left[\theta_- f(\theta)\right] < 0.$$

Sparing the details, a function of the form

$$f(\theta) = \frac{g(\theta_{\neq -})}{\theta_-} - \epsilon,$$

where $g$ is any non-zero function of all weights except $\theta_-$ and $\epsilon$ is some positive number, will work for all $\theta$. The same bounded result also holds, so if $f$'s range is capped at 1 then $M_-$ can only gradient hack if $\theta_- > \frac{g(\theta_{\neq -})}{1 + \epsilon}$ at initialization.
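A minimal numeric sketch of the larger-mixture case (the constants, the mixture size, and the choice of a constant $g$ are my own): with $g$ constant, $\theta_- f(\theta) = g - \epsilon\theta_-$, so the only nonzero partial derivative of the mixture loss is $-\epsilon$ on the hacker's weight, which grows while the aligned weights stay put.

```python
# Sketch (illustrative constants, g taken constant): free weights are
# theta_1..theta_{n-1}; the hacker sits at index 0 and outputs
# f(theta) = g/theta_0 - eps, so theta_0 * f = g - eps*theta_0.
import numpy as np

eps, alpha = 0.01, 0.1
theta = np.array([0.1, 0.3, 0.3])   # hacker first; theta_n = 1 - theta.sum() implicit

def grad(t):
    out = np.zeros_like(t)
    out[0] = -eps                   # d/dtheta_0 [g - eps*theta_0] = -eps; others 0
    return out

cap = 1.0 - theta[1:].sum()         # hacker can absorb at most M_n's weight
for _ in range(2000):
    theta = theta - alpha * grad(theta)
    theta[0] = min(theta[0], cap)

print(theta)  # hacker at the cap, aligned weights untouched
```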

### Non-Zero Loss Models

Suppose there are some models in the mixture that are neither gradient hackers nor fully aligned with the training. Essentially the same result holds. Say their loss with respect to their weight is some positive constant. In the setup above, such models would only help $M_-$ gradient hack. Without loss of generality, let $M_n$ be one such model. The gradient update then is

$$\frac{\partial L}{\partial \theta_-} = \frac{\partial}{\partial \theta_-}\left[\theta_- f(\theta)\right] - k,$$

where $L_n = k$ and $k > 0$. Accordingly, $f$ simply no longer needs the $-\epsilon$ term.
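This case can be sketched numerically as well (constants are my own): with $f(\theta) = \frac{g(\theta_{\neq -})}{\theta_-}$ exactly, the hacker's own term contributes 0 to the gradient, and the constant-loss model's $-k$ term alone drives $\theta_-$ up.

```python
# Sketch (illustrative constants): M_n carries constant loss k, and its
# weight (1 - sum of free weights) contributes -k to the gradient on
# theta_-. With eps = 0, d/dtheta_- [theta_- * f] = 0, so the total
# gradient is just -k and the hacker's weight still climbs.
k, alpha = 0.05, 0.1
theta_minus = 0.1

for _ in range(500):
    grad = 0.0 - k                  # hacker term contributes 0; M_n contributes -k
    theta_minus = min(1.0, theta_minus - alpha * grad)

print(theta_minus)  # 1.0
```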

1. ^

By assumption, SGD is not operating at the level of modifying either $M_+$ or $M_-$. It's only operating at the level of altering the relative weights of the two models.

2. ^

If $f$ were differentiable, that is. A differentiable approximation would work, though.

3. ^

Though there appears to be some recursive element here. How $f$ changes may also be susceptible to gradient hacking. Indeed, some precursor to $M_-$ may have been 'benign', but was modified by SGD towards the current equilibrium.

4. ^

Though note, there are other reasons besides gradient hacking for the loss modulating function to be non-constant. Modified architecture or different computational capacity at different weights, for example.


## Comments

This is a cool result. If I'm understanding correctly, M- increases its loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.

Is there a way to extend this to cases where M- doesn't have access to the weights? I think that probably requires an RL environment, but that's entirely based on "I thought about it for a few minutes and couldn't find a way to do it without RL" so I could be way off here.

Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...