I think one common intuition when thinking about gradient descent (GD) is to think about it as more efficient genetic algorithms (GAs). I certainly used to think this way—after all, taking a bunch of random mutations and keeping the ones that work well sounds basically just like a slower version of finding the steepest direction in the first place. GAs are also much more intuitive than GD, so it's tempting to think about GD in terms of them. Unfortunately, there are some instances where they behave completely differently, which is really relevant for thinking about things like gradient hacking. This post provides one such example.

Redundancy and Submodule Protection

Imagine you're given a network that has two identical submodules, and some kind of combining function where if it detects the outputs from both submodules are the same it passes the value through but if they're different it explodes and outputs zero or something totally random, or whatever. This is a natural idea to come up with if your goal is to ensure the optimizer doesn't mess with these modules, for example, if you're trying to protect the mesaobjective encoding for gradient hacking.

It's easy to convince yourself that this is effective against GAs: since mutations are inserted at random, the probability that both modules are mutated the same way is very small, and so any model that changes either of these modules will do very poorly since the output will be totally useless.

GD on the other hand will happily change both modules at the same time, since the direction of steepest descent is the one where both submodules change at the same time. This is even despite the fact that partial derivatives are taken assuming all other parameters stay the same, because partial derivatives don't care about the cliff just epsilon away but only the slope at the exact point where both modules are the same. Another way of thinking about it is the gradient necessarily points in the direction of steepest ascent, even if that direction isn't axis-aligned. If you're still not convinced, see this comment for a proof that there exists no combining function (under modest constraints) that can protect submodules this way.

Takeaways

The main takeaway from this is that gradient descent is really unintuitive sometimes. When in doubt, it's probably a good idea to work out the gradient by hand.

I think one common intuition when thinking about gradient descent (GD) is to think about it as more efficient genetic algorithms (GAs). I certainly used to think this way—after all, taking a bunch of random mutations and keeping the ones that work well sounds basically just like a slower version of finding the steepest direction in the first place. GAs are also much more intuitive than GD, so it's tempting to think about GD in terms of them. Unfortunately, there are some instances where they behave completely differently, which is really relevant for thinking about things like gradient hacking. This post provides one such example.

## Redundancy and Submodule Protection

Imagine you're given a network that has two identical submodules, and some kind of combining function where if it detects the outputs from both submodules are the same it passes the value through but if they're different it explodes and outputs zero or something totally random, or whatever. This is a natural idea to come up with if your goal is to ensure the optimizer doesn't mess with these modules, for example, if you're trying to protect the mesaobjective encoding for gradient hacking.

It's easy to convince yourself that this is effective against GAs: since mutations are inserted at random, the probability that both modules are mutated the same way is very small, and so any model that changes either of these modules will do very poorly since the output will be totally useless.

GD on the other hand will happily change both modules at the same time, since the direction of steepest descent is the one where both submodules change at the same time. This is even despite the fact that partial derivatives are taken assuming all other parameters stay the same, because partial derivatives don't care about the cliff just epsilon away but only the slope at the exact point where both modules are the same. Another way of thinking about it is the gradient necessarily points in the direction of steepest ascent, even if that direction isn't axis-aligned. If you're still not convinced, see this comment for a proof that there exists no combining function (under modest constraints) that can protect submodules this way.

## Takeaways

The main takeaway from this is that gradient descent is really unintuitive sometimes. When in doubt, it's probably a good idea to work out the gradient by hand.