But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.

I could not find any study that test this directly, but I don't expect conditioning to work if you yourself causes the unconditioned stimuli (US), Y in your example. My understanding of conditioning is that if there is no surprise there is no learning. For example: If you first condition an animal to expect A to be followed by C, and then exposes them to A+B followed by C, they will not learn to associate B with C. This is a well replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).

Does this matter for understanding gradient hacking in future AGIs? Maybe?

Since humans are the closest thing we have to an AGI, it does make sense to try to understand things like gradient hacking in ourselves. Or if we don't have this problem, it would be very interesting to understand why not.

Are there other examples of biological gradient hacking?

(1)
I heard that whatever you do while taking nicotine, will be reinforced (don't remember source but seems plausible to me). But this would be more analog to directly over writing the back prop signal, instead of manipulating the gradient via controlling the training data. If we end up with an AI that can just straight forwardly edit its outer learning regime in this way, then I think we are outside the scope of what you are talking about. However, if this nicotine hack works, it is interesting it is not used more? Maybe it is not a strong enough effect to be useful?

(2)
You give an other example:

Humans often reason about our goals in order to produce more coherent versions of them. Since we know while doing the reasoning that the concepts we produce will end up ingrained as our goals, this could be seen as a form of gradient hacking.

I can't decide if I think this should count as gradient hacking.

(3)
I know that I to some extent absorb the values of people around me, and I have used this for self manipulation. This is the best analog to gradient hacking I can think of for humans. Unfortunately I don't expect this to tell us much about AI's, since this method depends on a specific human drive towards conformism.

I'm curious if an opposite strategy works for contrarians? If you want to self manipulate you should hang out with people who believe/value the opposite of what you want yourself to believe/value?

Reply

Moderation Log

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

21

Gradient hacking: definitions and examples

21