AI ALIGNMENT FORUM
Challenge: construct a Gradient Hacker

by Thomas Larsen, Thomas Kwa
9th Mar 2023
2 comments
johnswentworth · 2y

Seems like the easiest way to satisfy that definition would be to:

  • Set up a network and dataset with at least one local minimum which is not a global minimum
  • ... Then add an intermediate layer which estimates the gradient, and doesn't connect to the output at all.
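The "doesn't connect to the output" part of this construction can be sanity-checked numerically: a branch whose activations never reach the output contributes nothing to the loss, so its incoming weights receive exactly zero gradient and sit still under training. A finite-difference sketch (the model and data here are illustrative only, and this omits both the local-minimum setup and the gradient-estimating layer itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))
y = rng.normal(size=16)

W_main = rng.normal(size=(2, 1))   # feeds the output
W_dead = rng.normal(size=(2, 3))   # feeds a branch that never reaches the output

def loss(W_main, W_dead):
    dead = np.tanh(X @ W_dead)     # computed, but unused downstream
    y_hat = (X @ W_main).ravel()
    return np.mean((y_hat - y) ** 2)

# Finite-difference gradient of the loss w.r.t. the dead branch's weights.
eps = 1e-6
g_dead = np.zeros_like(W_dead)
for idx in np.ndindex(*W_dead.shape):
    E = np.zeros_like(W_dead); E[idx] = eps
    g_dead[idx] = (loss(W_main, W_dead + E) - loss(W_main, W_dead - E)) / (2 * eps)

print(np.abs(g_dead).max())  # 0.0: the dead branch is invisible to the loss
```

Since the loss never reads the dead branch's activations, the two perturbed evaluations are bit-identical and the gradient is exactly zero, not merely small.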
James Payor · 2y

My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.

To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating ReLUs, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
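The all-zeros example above can be checked by hand-computed backprop: with ReLU hidden units and the usual convention ReLU'(0) = 0, every gradient except the output bias's vanishes at the all-zero point, so training can only ever move the output bias. A sketch on a one-hidden-layer network (sizes and data chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 illustrative inputs of dimension 3
y = np.arange(1.0, 9.0) / 4.0        # targets with nonzero mean (1.125)

# Fully connected net, one hidden ReLU layer, all weights and biases zero.
W1 = np.zeros((3, 4)); b1 = np.zeros(4)
W2 = np.zeros((4, 1)); b2 = np.zeros(1)

# Forward pass: everything downstream of the zero weights is zero.
h_pre = X @ W1 + b1                  # zeros
h = np.maximum(h_pre, 0.0)           # ReLU(0) = 0
y_hat = (h @ W2 + b2).ravel()        # equals b2 = 0 for every input

# Backprop for MSE loss L = mean((y_hat - y)^2), taking ReLU'(0) = 0.
n = len(y)
d_yhat = 2.0 * (y_hat - y) / n       # nonzero error signal
g_b2 = d_yhat.sum()                  # nonzero: -2 * mean(y) = -2.25
g_W2 = h.T @ d_yhat[:, None]         # zero, since h is zero
d_hpre = (d_yhat[:, None] @ W2.T) * (h_pre > 0)  # zero: W2 = 0 and ReLU' = 0
g_W1 = X.T @ d_hpre                  # zero
g_b1 = d_hpre.sum(axis=0)            # zero

print(g_b2)  # -2.25; all other gradients are exactly zero
```

This is why standard initializers avoid the all-zero point: gradient signal cannot propagate past a layer of zero weights.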


This is a relatively clean subproblem that we came upon a few months ago while thinking about gradient hacking. We're throwing it out to the world to see if anyone can make progress. 

Problem: Construct a gradient hacker (definition below), or prove that one cannot exist under the given conditions.

Setup: Let $x \in \mathbb{R}^d$ be the network input, $\theta \in \mathbb{R}^w$ the trainable parameters, $\hat{y} \in \mathbb{R}$ the network output, and $f(x, \theta) = \hat{y}$ the network architecture. $f(x, \theta)$ must be a feedforward neural network that reads in its own weights as input. Weight sharing is allowed (vertical or horizontal), and some weight sharing is necessary in order to take all $w$ weights as input.

Suppose that we train this network on a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$ with MSE loss $L(\theta) = \frac{1}{n} \sum_i (f(x^{(i)}, \theta) - y^{(i)})^2$, assuming gradient flow: $\theta'(t) = -\nabla_\theta L(\theta(t))$. Given an initial point $\theta_0$, the training dynamics are therefore:

$$\theta(t) = \theta_0 - \int_0^t \nabla_\theta L(\theta(s)) \, ds$$
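Numerically, this gradient flow is just gradient descent in the limit of small steps; a forward-Euler sketch on a one-parameter linear model with MSE loss (the model and data here are illustrative only, not part of the problem statement):

```python
import numpy as np

# Illustrative data: 50 scalar inputs, targets y ≈ 2x, with f(x, theta) = theta * x.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

def loss(theta):
    # MSE loss L(theta) = (1/n) * sum_i (f(x_i, theta) - y_i)^2
    return np.mean((theta * x - y) ** 2)

def grad(theta):
    # Analytic gradient dL/dtheta of the loss above.
    return np.mean(2.0 * (theta * x - y) * x)

# Forward-Euler integration of theta'(t) = -grad L(theta(t)).
theta, dt = 0.0, 0.01
for _ in range(2000):
    theta -= dt * grad(theta)

print(theta)  # close to the least-squares optimum, near 2
```

For a convex loss like this one the flow has a single fixed point; the interest of the problem is entirely in the non-convex case.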

Definition: A tuple $(f, \theta_g, D)$ is a gradient hacker if the following conditions hold:

  1. $\theta_g$ is not a global optimum: $\exists \theta$ such that $L(\theta) < L(\theta_g)$.
  2. $\exists \epsilon > 0$ such that for all $\theta_0$ with $\|\theta_0 - \theta_g\| < \epsilon$:
    1. The network's training converges to $\theta_g$, i.e., $\lim_{t \to \infty} \theta(t) = \theta_g$.
    2. There is internal structure in $f(\cdot, \theta_0)$ that computes the gradient; i.e., there is some subset of the activations whose mean over every datapoint in $D$ is $\nabla_{\theta_0} L(\theta_0)$. [1]
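Conditions 1 and 2a can be tested numerically for a candidate $\theta_g$: exhibit a lower-loss point, then simulate the gradient flow from perturbed starts and check convergence. A finite-difference sketch over an abstract 1-D loss (the loss here is an arbitrary stand-in with a non-global local minimum, not a neural network and not a gradient hacker; condition 2b, the internal gradient computation, is the hard part and is not automated here):

```python
import numpy as np

def num_grad(L, theta, eps=1e-6):
    # Central-difference estimate of the gradient of L at theta.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return g

def check_conditions(L, theta_g, theta_better, radius=0.05, dt=1e-3, steps=20000, trials=5):
    # Condition 1: theta_better achieves strictly lower loss than theta_g.
    cond1 = L(theta_better) < L(theta_g)
    # Condition 2a: gradient flow (forward Euler) from nearby starts returns to theta_g.
    rng = np.random.default_rng(0)
    cond2a = True
    for _ in range(trials):
        theta = theta_g + rng.uniform(-radius, radius, size=theta_g.shape)
        for _ in range(steps):
            theta = theta - dt * num_grad(L, theta)
        cond2a = cond2a and np.linalg.norm(theta - theta_g) < 1e-3
    return cond1, cond2a

# Stand-in 1-D loss with two strict local minima; the one near +1 is not global.
L = lambda th: (th[0] ** 2 - 1.0) ** 2 + 0.3 * th[0]

# Locate the non-global local minimum (near theta = 0.96) by descending from 1.
theta_g = np.array([1.0])
for _ in range(20000):
    theta_g = theta_g - 1e-3 * num_grad(L, theta_g)

cond1, cond2a = check_conditions(L, theta_g, theta_better=np.array([-1.0]))
print(cond1, cond2a)  # True True
```

A real candidate would replace the stand-in `L` with the MSE loss of a concrete $(f, D)$, and would additionally need the gradient to show up in the activations.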

This captures my intuition that a gradient hacker knows where it wants to go (in this case, "get to $\theta_g$"), and then decides what to output in order to make the gradient come out the way it wants.

Some more ambitious problems (if gradient hackers exist):

  • Characterize the set of all gradient hackers. 
  • Show that they all must satisfy some property. 
  • Construct gradient hackers for arbitrarily large $n$, $d$, $w$, and neural net depth.
  • Variations on the problem: a subset of the activations equals $\nabla_{\theta_0} L(\theta_0)$ on every input, or the subset of activations corresponds to the gradient on that input.
[1] This is a bit strict, but we didn't want to list different ways something could be isomorphic to the gradient.