Gradient hacking is a hypothesized phenomenon where:
Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don’t last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it’s deployed in the real world—especially in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn’t very well-understood right now, both the definition above and the examples below should only be considered preliminary.
RL credit hacking examples
Gradient descent hacking examples
I expect that all of the effects discussed above could be much stronger in environments which violate standard RL/SL abstractions. I think it’s still worth having the central examples of gradient hacking be defined within the abstractions, since that makes it easier to see that there’s something there. However, it’s probably worth constructing examples which can lead to big changes (e.g. creating new local optima) given small violations of the standard abstractions such as:
Having said all this, I do have some concern that the concept of gradient hacking won’t be very useful, because as policies get more and more intelligent, more and more of their behavior will be driven by generalization (in particular via high-level reasoning) which by definition is not “taken into account” by training algorithms. For example, there are several ways in which very general reasoning about how to achieve goals could be interpreted as gradient hacking, which suggests that “gradient hacking” isn’t a very natural category:
So while techniques specifically aimed at preventing gradient hacking may be useful on the margin, I currently expect that the category of “gradient hacking” is most useful for framing the broader problem of misaligned reasoning in a way that fits more easily into existing ML frameworks.
I don't think this is true:
But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.
I could not find any study that test this directly, but I don't expect conditioning to work if you yourself causes the unconditioned stimuli (US), Y in your example. My understanding of conditioning is that if there is no surprise there is no learning. For example: If you first condition an animal to expect A to be followed by C, and then exposes them to A+B followed by C, they will not learn to associate B with C. This is a well replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).
Does this matter for understanding gradient hacking in future AGIs? Maybe?
Since humans are the closest thing we have to an AGI, it does make sense to try to understand things like gradient hacking in ourselves. Or if we don't have this problem, it would be very interesting to understand why not.
Are there other examples of biological gradient hacking? (1)I heard that whatever you do while taking nicotine, will be reinforced (don't remember source but seems plausible to me). But this would be more analog to directly over writing the back prop signal, instead of manipulating the gradient via controlling the training data. If we end up with an AI that can just straight forwardly edit its outer learning regime in this way, then I think we are outside the scope of what you are talking about. However, if this nicotine hack works, it is interesting it is not used more? Maybe it is not a strong enough effect to be useful?
(2)You give an other example:
Humans often reason about our goals in order to produce more coherent versions of them. Since we know while doing the reasoning that the concepts we produce will end up ingrained as our goals, this could be seen as a form of gradient hacking.
I can't decide if I think this should count as gradient hacking.
(3)I know that I to some extent absorb the values of people around me, and I have used this for self manipulation. This is the best analog to gradient hacking I can think of for humans. Unfortunately I don't expect this to tell us much about AI's, since this method depends on a specific human drive towards conformism.
I'm curious if an opposite strategy works for contrarians? If you want to self manipulate you should hang out with people who believe/value the opposite of what you want yourself to believe/value?