AI ALIGNMENT FORUM
Challenge: construct a Gradient Hacker

by Thomas Larsen, Thomas Kwa
9th Mar 2023
2 comments
johnswentworth · 2y

Seems like the easiest way to satisfy that definition would be to:

  • Set up a network and dataset with at least one local minimum which is not a global minimum
  • ... Then add an intermediate layer which estimates the gradient, and doesn't connect to the output at all.
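The "doesn't connect to the output" part of this construction can be sanity-checked numerically: a branch whose activations never reach the output contributes nothing to the loss, so its incoming weights receive exactly zero gradient and sit still under training. A finite-difference sketch (the model and data here are illustrative only, and this omits both the local-minimum setup and the gradient-estimating layer itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))
y = rng.normal(size=16)

W_main = rng.normal(size=(2, 1))   # feeds the output
W_dead = rng.normal(size=(2, 3))   # feeds a branch that never reaches the output

def loss(W_main, W_dead):
    dead = np.tanh(X @ W_dead)     # computed, but unused downstream
    y_hat = (X @ W_main).ravel()
    return np.mean((y_hat - y) ** 2)

# Finite-difference gradient of the loss w.r.t. the dead branch's weights.
eps = 1e-6
g_dead = np.zeros_like(W_dead)
for idx in np.ndindex(*W_dead.shape):
    E = np.zeros_like(W_dead); E[idx] = eps
    g_dead[idx] = (loss(W_main, W_dead + E) - loss(W_main, W_dead - E)) / (2 * eps)

print(np.abs(g_dead).max())  # 0.0: the dead branch is invisible to the loss
```

Since the loss never reads the dead branch's activations, the two perturbed evaluations are bit-identical and the gradient is exactly zero, not merely small.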
James Payor · 2y

My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.

To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating ReLUs, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
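The all-zeros example above can be checked by hand-computed backprop: with ReLU hidden units and the usual convention ReLU'(0) = 0, every gradient except the output bias's vanishes at the all-zero point, so training can only ever move the output bias. A sketch on a one-hidden-layer network (sizes and data chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 illustrative inputs of dimension 3
y = np.arange(1.0, 9.0) / 4.0        # targets with nonzero mean (1.125)

# Fully connected net, one hidden ReLU layer, all weights and biases zero.
W1 = np.zeros((3, 4)); b1 = np.zeros(4)
W2 = np.zeros((4, 1)); b2 = np.zeros(1)

# Forward pass: everything downstream of the zero weights is zero.
h_pre = X @ W1 + b1                  # zeros
h = np.maximum(h_pre, 0.0)           # ReLU(0) = 0
y_hat = (h @ W2 + b2).ravel()        # equals b2 = 0 for every input

# Backprop for MSE loss L = mean((y_hat - y)^2), taking ReLU'(0) = 0.
n = len(y)
d_yhat = 2.0 * (y_hat - y) / n       # nonzero error signal
g_b2 = d_yhat.sum()                  # nonzero: -2 * mean(y) = -2.25
g_W2 = h.T @ d_yhat[:, None]         # zero, since h is zero
d_hpre = (d_yhat[:, None] @ W2.T) * (h_pre > 0)  # zero: W2 = 0 and ReLU' = 0
g_W1 = X.T @ d_hpre                  # zero
g_b1 = d_hpre.sum(axis=0)            # zero

print(g_b2)  # -2.25; all other gradients are exactly zero
```

This is why standard initializers avoid the all-zero point: gradient signal cannot propagate past a layer of zero weights.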


This is a relatively clean subproblem that we came upon a few months ago while thinking about gradient hacking. We're throwing it out to the world to see if anyone can make progress. 

Problem: Construct a gradient hacker (definition below), or prove that one cannot exist under the given conditions.

Setup: Let $x \in \mathbb{R}^d$ be the network input, $\theta \in \mathbb{R}^w$ the trainable parameters, $\hat{y} \in \mathbb{R}$ the network output, and $f(x, \theta) = \hat{y}$ the network architecture. $f(x, \theta)$ must be a feedforward neural network that reads in its own weights as input. Weight sharing is allowed (vertical or horizontal), and some weight sharing is necessary in order to take all $w$ weights as input.

Suppose that we train this network on a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$ with MSE loss $L(\theta) = \frac{1}{n} \sum_i (f(x^{(i)}, \theta) - y^{(i)})^2$, assuming gradient flow: $\theta'(t) = -\nabla_\theta L(\theta(t))$. Given an initial point $\theta_0$, the training dynamics are therefore:

$$\theta(t) = \theta_0 - \int_0^t \nabla_\theta L(\theta(s)) \, ds$$
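Numerically, this gradient flow is just gradient descent in the limit of small steps; a forward-Euler sketch on a one-parameter linear model with MSE loss (the model and data here are illustrative only, not part of the problem statement):

```python
import numpy as np

# Illustrative data: 50 scalar inputs, targets y ≈ 2x, with f(x, theta) = theta * x.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

def loss(theta):
    # MSE loss L(theta) = (1/n) * sum_i (f(x_i, theta) - y_i)^2
    return np.mean((theta * x - y) ** 2)

def grad(theta):
    # Analytic gradient dL/dtheta of the loss above.
    return np.mean(2.0 * (theta * x - y) * x)

# Forward-Euler integration of theta'(t) = -grad L(theta(t)).
theta, dt = 0.0, 0.01
for _ in range(2000):
    theta -= dt * grad(theta)

print(theta)  # close to the least-squares optimum, near 2
```

For a convex loss like this one the flow has a single fixed point; the interest of the problem is entirely in the non-convex case.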

Definition: A tuple $(f, \theta_g, D)$ is a gradient hacker if the following conditions hold:

  1. $\theta_g$ is not a global optimum: $\exists \theta$ such that $L(\theta) < L(\theta_g)$.
  2. $\exists \epsilon > 0$ such that for all $\theta_0$ with $\|\theta_0 - \theta_g\| < \epsilon$:
    1. The network's training converges to $\theta_g$, i.e., $\lim_{t \to \infty} \theta(t) = \theta_g$.
    2. There is internal structure in $f(\cdot, \theta_0)$ that computes the gradient; i.e., there is some subset of the activations whose mean over every datapoint in $D$ is $\nabla_{\theta_0} L(\theta_0)$. [1]
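Conditions 1 and 2a can be tested numerically for a candidate $\theta_g$: exhibit a lower-loss point, then simulate the gradient flow from perturbed starts and check convergence. A finite-difference sketch over an abstract 1-D loss (the loss here is an arbitrary stand-in with a non-global local minimum, not a neural network and not a gradient hacker; condition 2b, the internal gradient computation, is the hard part and is not automated here):

```python
import numpy as np

def num_grad(L, theta, eps=1e-6):
    # Central-difference estimate of the gradient of L at theta.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (L(theta + e) - L(theta - e)) / (2 * eps)
    return g

def check_conditions(L, theta_g, theta_better, radius=0.05, dt=1e-3, steps=20000, trials=5):
    # Condition 1: theta_better achieves strictly lower loss than theta_g.
    cond1 = L(theta_better) < L(theta_g)
    # Condition 2a: gradient flow (forward Euler) from nearby starts returns to theta_g.
    rng = np.random.default_rng(0)
    cond2a = True
    for _ in range(trials):
        theta = theta_g + rng.uniform(-radius, radius, size=theta_g.shape)
        for _ in range(steps):
            theta = theta - dt * num_grad(L, theta)
        cond2a = cond2a and np.linalg.norm(theta - theta_g) < 1e-3
    return cond1, cond2a

# Stand-in 1-D loss with two strict local minima; the one near +1 is not global.
L = lambda th: (th[0] ** 2 - 1.0) ** 2 + 0.3 * th[0]

# Locate the non-global local minimum (near theta = 0.96) by descending from 1.
theta_g = np.array([1.0])
for _ in range(20000):
    theta_g = theta_g - 1e-3 * num_grad(L, theta_g)

cond1, cond2a = check_conditions(L, theta_g, theta_better=np.array([-1.0]))
print(cond1, cond2a)  # True True
```

A real candidate would replace the stand-in `L` with the MSE loss of a concrete $(f, D)$, and would additionally need the gradient to show up in the activations.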

This captures my intuition that a gradient hacker knows where it wants to go (in this case, "get to $\theta_g$"), and then decides what to output in order to make the gradient come out the way it wants.

Some more ambitious problems (if gradient hackers exist):

  • Characterize the set of all gradient hackers. 
  • Show that they all must satisfy some property. 
  • Construct gradient hackers for arbitrarily large $n$, $d$, $w$, and neural net depth.
  • Variations on the problem: a subset of the activations equals $\nabla_{\theta_0} L(\theta_0)$ on every input, or the subset of activations corresponds to the gradient on that input.
[1] This is a bit strict, but we didn't want to list different ways something could be isomorphic to the gradient.