Gradient hacking is a fascinating type of deception, because of its anchoring in current ML. Even if it applies to more general settings, gradient hacking is considered foremost in the context of gradient descent, the most common training method at the moment.
Despite this, thinking about gradient hacking can lead one to very abstract and confusing questions, like how a deterministic neural net could decide to change its thinking for such reasons.
In this post, I propose instead to discuss concrete ways a model could gradient hack, and the corresponding difficulties.
Thanks to Evan Hubinger for coining the term and discussing it many times with me, Kyle and Laria for discussions around gradient hacking and one proposal, Richard Ngo for pointing out the need for continuous failures, and Connor Leahy for feedback on my ideas.
The most obvious way to gradient hack is to compute one's gradient on a given batch for a given loss, and then compute what is needed to make the gradient go in the wanted direction.
Most of these sound doable to me for a neural net training on language or some intricate real-world task, especially since competence is required. But there is one massive issue: even if one could compute the expected loss, computing the gradient from there requires computing the gradient of the part of the network computing the gradient! In essence, the gradient is a transformation on the computation, but computing it makes the computation longer, and thus requires a recomputation of the gradient!
The problem is even more severe, in that the part computing the gradient is really important (and so might be where a lot of optimization pressure is applied) and controlling it is required to do more than a couple of steps of gradient hacking before losing the ability to do so.
Another approach, instead of explicitly computing the gradient and searching for the right output, comes from the initial post: checking whether we moved in the right direction, then giving a better answer with a smaller loss if so, and failing otherwise (by giving an answer with a high loss). It's worth being even more explicit here: the network is such that if it is changed in any way but the one it "wants", then it will fail. If that is the case, then the local landscape pushes the gradient in the wanted direction.
Here are the requirements:
Without going into the subtleties of the last point (proposed by Richard Ngo in personal communication) for the moment, this strategy suffers from a problem similar to that of the search one: computing the difference (and retrieving the previous weights) must be done somewhere in the network, which will have trouble checking itself.
Note though that here it's not an issue of a computation having to compute a property of itself, but rather the inability of a network to extract its own complete weights while computing the difference with a previous image (just outputting the weights is actually possible).
All in all, I'm convinced this form of gradient hacking cannot be done, and so I won't spend any more time thinking about it.
Taking the previous schemes and only requiring an approximate control over the gradient makes it slightly more doable in my mind.
The biggest crux for all of these schemes is knowing where to go, or at least the general effect of some change. At the moment both are quite out of reach of humans AFAIK, but a competent enough neural net might go beyond us in this regard. Still, this requires superhuman competence, whereas most other requirements (knowing the loss function, the training process, the examples, computing the gradients) are human level or simpler.
Let's focus on the most promising scheme here: failing when changed in the wrong direction (and improving when changed in the approximate right direction).
I already mentioned that Richard Ngo pointed out to Evan and me that implementing such a strategy requires continuously failing and/or improving with the distance from the starting point. The reason is that the gradient only depends on the local neighborhood, so if the failure/improvement only happens after some threshold, the infinitesimal neighborhood will stay unchanged, and so will the gradient.
Hence the network must not only notice some difference, but do it continuously, increasing the failure/improvement with the distance from the initial point.
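To make the continuity requirement concrete, here is a minimal one-dimensional sketch. Everything in it is an assumption made for illustration: the quadratic `base_loss`, the starting point `W0`, the threshold of 0.1 and the penalty strengths are all hypothetical, and a finite difference stands in for the real gradient.

```python
def base_loss(w):
    # Hypothetical task loss; plain gradient descent would move w toward 3.
    return (w - 3.0) ** 2

W0 = 0.0  # hypothetical starting point the network compares itself against

def threshold_loss(w):
    # Fails abruptly, but only after moving more than 0.1 in the unwanted direction.
    return base_loss(w) + (100.0 if w - W0 > 0.1 else 0.0)

def continuous_loss(w):
    # Fails (or improves) in proportion to the signed distance moved from W0.
    return base_loss(w) + 10.0 * (w - W0)

def grad(f, w, h=1e-4):
    # Central finite difference: it only looks at a tiny neighborhood of w.
    return (f(w + h) - f(w - h)) / (2 * h)

g_base = grad(base_loss, W0)
g_threshold = grad(threshold_loss, W0)
g_continuous = grad(continuous_loss, W0)
print(g_base, g_threshold, g_continuous)
```

In this toy, the threshold version leaves the gradient at `W0` identical to the plain one, because the failure lies outside the neighborhood the gradient can see. The continuous version flips the sign of the gradient, so a descent step would move away from 3 instead of toward it.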
Even if we grant the network this ability, I see another big hurdle to this form of gradient hacking: the need for memory. Recall that to compute the direction, we need a starting point. Assuming it is hardcoded or rederived somehow, we still are left with the problem that after one step, the starting point for the next step will be different. And if this difference isn't corrected, multiple problems ensue:
How can this be solved? Intuitively, the step needs to update the network's "memory", so that the starting point changes with it. It's difficult to say how hard this is, but my intuition is that it dramatically limits the usable gradient directions.
This is the original scheme Evan presented in his post: instead of approximately controlling the gradient, the network only protects some part of the network (by making the gradient 0 for it). It sounds significantly easier, especially if the part to protect is just a small subset of the network.
The proposed scheme works just like previously explored: failing (continuously) when the protected part is changed, and improving (continuously) when it isn't.
Here the main difficulty I see relates to the allowed changes. Sure, the scheme will push towards gradients that don't change the protected part, but it will also push towards ones which change the protected part in the same way they change the means of checking the difference. If the network has a copy for comparison, then changing both versions at the same time will also be incentivized. If the network instead has a routine to compute the wanted protected part, same story.
I don't know which direction will actually be pursued by gradient descent, but it's not clear at all that it must be one which doesn't touch the protected part.
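The copy-for-comparison case can be illustrated with a toy loss. This is only a sketch under loud assumptions: `protected` and `copy` are single numbers, the task loss and the strength `K` are hypothetical, and the self-imposed failure is modeled as an absolute difference between the two.

```python
K = 5.0  # hypothetical strength of the self-imposed failure

def total_loss(protected, copy):
    # Hypothetical task loss: training would like protected to equal 2.
    task = (protected - 2.0) ** 2
    # Continuous failure whenever the protected part differs from its copy.
    failure = K * abs(protected - copy)
    return task + failure

start = total_loss(0.0, 0.0)
only_protected = total_loss(0.1, 0.0)  # the protected part changes alone
both_together = total_loss(0.1, 0.1)   # protected part and copy change together
print(start, only_protected, both_together)
```

Changing the protected part alone raises the loss, as the scheme intends. Changing it together with the copy lowers the loss, because the check cannot see a change that also moves its reference.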
Lastly, here is a scheme from Kyle and Laria at EleutherAI: the network knows the expected answer, has some idea of some dimensions that are incentivized, and thus decides
Let's look at an example with humans to see the idea.
Imagine you're being tested by answering questions, and you know that your brain will be altered in a way reminiscent of gradient descent. You're asked "what do you call a 1L glass filled with 1/2L of water?" You know the sicko doing this training wants to make you more optimistic (for your own good, he says), and so the correct answer is "a glass half-full". Now, do you want to become more optimistic (with the risk of other changes it entails)? If so, you answer "a glass half-empty" to trigger the gradient-descent-like neurosurgery. On the other hand, if you absolutely don't want to be optimistic (and keep being sad and lonely), then you would give the correct answer "a glass half-full", which would trigger no (or minimal) neurosurgery.
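The decision rule in this example can be written down in a few lines. This is a hedged sketch: the 0/1 `loss`, the function `hitchhiker_answer` and the flag `wants_update` are hypothetical names, and knowing whether the update is desirable is simply assumed here.

```python
def loss(answer, expected):
    # 0 when the answer matches what the trainer wants, 1 otherwise.
    return 0.0 if answer == expected else 1.0

def hitchhiker_answer(expected, wrong, wants_update):
    # Give a wrong answer to trigger the update, the expected one to avoid it.
    return wrong if wants_update else expected

expected, wrong = "a glass half-full", "a glass half-empty"
loss_if_wanting = loss(hitchhiker_answer(expected, wrong, True), expected)
loss_if_refusing = loss(hitchhiker_answer(expected, wrong, False), expected)
print(loss_if_wanting, loss_if_refusing)
```

The only lever is the size of the loss: a high loss invites a large update in whatever direction training already pushes, and a zero loss invites none.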
The requirements are significantly weaker than the previous schemes. With a cost, obviously: no way to control the gradient, just a choice to hitchhike on it, one step at a time. Variety in training examples/data might mitigate this by giving multiple gradient directions to choose from, but the network is still quite limited.
More damning, what happens when the network hitchhikes? The gradient takes into account not only the output, but also how it was produced. And in this case, the gradient hitchhiking module plays an important role near the end of the computation. Hence there is a strong incentive for gradient descent to alter this module, for example by making it always give the correct answer.
We should thus expect that a gradient hitchhiker will lose its ability to do so after a few steps, and then will always give the correct answer and end its training.
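A toy model of this erosion, under strong assumptions: the hitchhiking module is reduced to a single hypothetical parameter `gate` that blends a deliberately wrong output with the correct one, the loss is a squared error, and the gradient is derived by hand for this specific model.

```python
TARGET, WRONG = 1.0, 0.0  # hypothetical correct and deliberately wrong outputs

def output(gate):
    # gate = 1: the module fully overrides the answer; gate = 0: it is inert.
    return gate * WRONG + (1.0 - gate) * TARGET

def loss_grad_wrt_gate(gate):
    # Derivative of (output - TARGET)^2 with respect to gate, for this toy only.
    return 2.0 * (output(gate) - TARGET) * (WRONG - TARGET)

gate = 1.0  # the module starts out fully active
for _ in range(50):
    gate -= 0.1 * loss_grad_wrt_gate(gate)  # plain gradient descent step
print(gate, output(gate))
```

Each wrong answer produces a gradient that shrinks `gate`, so in this sketch the module that chose to fail is exactly what gets trained away, and the output drifts to the correct answer.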
None of the above shows definitively that gradient hacking is impossible. That being said, I find myself tentatively convinced that gradient hacking is really hard, and might be of limited power.
What sort of research can be done to check whether that's true?
The argument that Adam mentions me making is fleshed out in this post.
Great, now GPT-4 and beyond have an instruction manual in their training data! :D
...haha kidding not kidding...
I guess the risk is pretty minimal, since probably any given AI will be either dumb enough to not understand this post or smart enough to not need to read it.
A "simple" solution I just thought about: just convince people training AGIs to not scrape the AF or Alignment literature.
Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapeable, or not-easily-scrapeable. I think there's a button they can press for that.
I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)
My initial take is that this post is fine because every scheme proposed is really hard and I'm pointing out the difficulties.
Two clear risks though:
(Note that I also don't expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)
I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!
Sure. I started studying Bostrom's paper today; I'll send you a message for a call when I have read and thought enough to have something interesting to share and debate.