Thanks to Roger Grosse for helping me understand his intuitions and hopes for influence functions. This post combines highlights from some influence function papers, some of Roger Grosse’s intuitions (though he doesn’t agree with everything I’m writing here), and some takes of mine.

Influence functions are informally about some notion of influence of a training data point on the model’s weights. But in practice, for neural networks, “influence functions” are not a good approximation of “what would happen if a training data point were removed”. Then, what are influence functions about, and what can they be used for?

From leave-one-out to influence functions

Ideas from Bae 2022 (If influence functions are the answer, what is the question?).

The leave-one-out function is the answer to “what would happen, in a network trained to its global minimum, if one point were omitted?”

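Under some assumptions, such as a strongly convex loss landscape, influence functions are a cheap-to-compute approximation of the leave-one-out function, thanks to the Implicit Function Theorem, which tells us that under those assumptions the optimal weights change smoothly as a training point is down-weighted, with a derivative that can be written in closed form using the inverse Hessian of the training loss.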
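For concreteness, the two objects discussed above can be written as follows. This is a sketch in generic notation: the symbols $\theta^\star$, $\epsilon$, $H$, $f$ and $L$ are choices made here and may differ from those in Bae 2022, and everything assumes the strongly convex setting.

```latex
% Response function: optimum after down-weighting the point (x, y) by \epsilon
\theta^\star(\epsilon) = \arg\min_\theta \; \frac{1}{N}\sum_{i=1}^{N} L\big(f(\theta, x_i), y_i\big) \;-\; \epsilon\, L\big(f(\theta, x), y\big)

% Leave-one-out corresponds to \epsilon = 1/N with (x, y) a training point.
% Implicit Function Theorem, with H the Hessian of the training objective
% at \theta^\star = \theta^\star(0):
\left.\frac{d\theta^\star}{d\epsilon}\right|_{\epsilon=0} = H^{-1}\, \nabla_\theta L\big(f(\theta^\star, x), y\big)

% First-order (influence function) estimate of the leave-one-out weights:
\theta^\star(1/N) \approx \theta^\star + \frac{1}{N}\, H^{-1}\, \nabla_\theta L\big(f(\theta^\star, x), y\big)
```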

But these assumptions don't hold for neural networks, and Basu 2020 shows that influence functions are a terrible approximation of leave-one-out in the context of neural networks. A figure from Bae 2022 illustrates this: the approximation holds for linear regression (left panel) but not for a multilayer perceptron (right panel).
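The convex case, where the approximation does hold, can be checked by hand. The sketch below is a toy illustration and not the experiment from Bae 2022: it assumes a one-parameter ridge regression on synthetic data, and the function names (`fit_ridge`, `influence_estimate`, `leave_one_out`) are hypothetical helpers introduced only for this example.

```python
import random


def fit_ridge(xs, ys, lam, n_norm):
    """Closed-form minimiser of (1/n_norm) * sum_i 0.5*(w*x_i - y_i)^2 + 0.5*lam*w^2."""
    a = sum(x * x for x in xs) / n_norm
    b = sum(x * y for x, y in zip(xs, ys)) / n_norm
    return b / (a + lam)


def influence_estimate(xs, ys, lam, j):
    """First-order estimate of the weight obtained after removing point j."""
    n = len(xs)
    w_star = fit_ridge(xs, ys, lam, n)
    hessian = sum(x * x for x in xs) / n + lam   # second derivative of the objective
    grad_j = (w_star * xs[j] - ys[j]) * xs[j]    # gradient of point j's loss at w_star
    # Removing point j is the same as down-weighting it by eps = 1/n.
    return w_star + (1.0 / n) * grad_j / hessian


def leave_one_out(xs, ys, lam, j):
    """Exact refit without point j (same 1/n normalisation as the full objective)."""
    n = len(xs)
    return fit_ridge(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:], lam, n)


# Synthetic data (an assumption for illustration): y = 2x + noise.
random.seed(0)
N, LAM = 200, 0.1
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]
w_star = fit_ridge(xs, ys, LAM, N)

for j in range(3):
    exact = leave_one_out(xs, ys, LAM, j)
    approx = influence_estimate(xs, ys, LAM, j)
    print(f"point {j}: exact shift {exact - w_star:+.6f}, "
          f"estimated shift {approx - w_star:+.6f}")
```

In this strongly convex toy problem the estimated and exact shifts should agree closely for every point. That agreement is what breaks down once the model is a neural network.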

Moreover, even the leave-one-out function is about parameters at convergence, which is not the regime most deep learning training runs operate in. Therefore, influence functions are even less about answering the question “what would happen if this point had more/less weight in the (incomplete) training run?”.

So every time you see someone introducing influence functions as an approximation of the effect of up/down-weighting training data points (as in this LW post about interpretability), remember that this does not apply when they are applied to neural networks.

What are influence functions doing

Bae 2022 shows that influence functions (not leave-one-out!) can be well approximated by minimizing another training objective, which defines the Proximal Bregman Response Function (PBRF) and is the sum of 3 terms:

  • The loss function with soft labels, as computed by the studied function with the weights obtained after training: the new weights should not change the output of the function much.
  • The opposite of the loss function on the target point: the new weights should give a high loss on the considered data point.
  • A penalization of weights very different from the final training weights. (Roger told me the specific value of the penalty coefficient didn’t have a huge influence on the result.)
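Putting the three terms together, the objective can be sketched as below. The notation is an assumption made here: $\theta^s$ stands for the final training weights, $\epsilon$ for the weight on the target point $(x, y)$, and $\lambda$ for the penalty coefficient. As I understand it, Bae 2022 states the first term as a Bregman divergence, which for cross-entropy matches the soft-label loss up to a term that does not depend on $\theta$.

```latex
\theta_{\mathrm{PBRF}}(\epsilon) = \arg\min_\theta \;
  \underbrace{\frac{1}{N}\sum_{i=1}^{N} L\big(f(\theta, x_i),\, f(\theta^s, x_i)\big)}_{\text{stay close to the trained outputs}}
  \;-\; \underbrace{\epsilon\, L\big(f(\theta, x),\, y\big)}_{\text{raise the loss on the target point}}
  \;+\; \underbrace{\frac{\lambda}{2}\,\big\lVert \theta - \theta^s \big\rVert^2}_{\text{stay close to the trained weights}}
```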

This does not answer the often advertised question about leave-one-out, but this does answer something which looks related, and which happens to be much cheaper to compute than the leave-one-out function (which can only be computed by retraining the network and doesn’t have cheaper approximations).

Influence functions are currently among the few options to say anything about the intuitive “influence” of individual data points in large neural networks, which justifies why they are used. (Alternatives have roughly the same kind of challenges as influence functions.)

Note: this explanation of what influence functions are doing is not the only way to describe their behavior, and other works may shed new light on what they are doing.

What are influence functions useful for

Current empirical evidence

To date, there has been almost no work externally validating that influence functions tell us anything about the influence of data points on neural networks' behavior beyond:

  • “It looks reasonable” judgments when inspecting individual examples of approximate influence function results on large language models.
  • Influence functions are useful for identifying mislabeled examples: if 10% of MNIST labels are changed, and points are flagged as suspicious when they have a large influence on themselves, then about 60% of mislabeled points are in the top 10% of the most suspicious points. (See the last appendix of Bae 2022.)

(These two methods are common in the influence function literature and were not introduced by the two papers I cite here.)
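The "large influence on themselves" heuristic can be sketched on a toy problem. This is not the MNIST experiment from Bae 2022: it assumes a two-parameter linear regression on synthetic data, corrupts targets by adding an offset as a stand-in for label flipping, and takes "self-influence" to mean $g^\top H^{-1} g$ for the per-point gradient $g$. The helper names `fit_line` and `self_influence` are hypothetical.

```python
import random


def fit_line(xs, ys, lam):
    """Fit y ~ a*x + b by minimising (1/N) sum 0.5*r_i^2 + 0.5*lam*(a^2 + b^2).

    Returns the parameters (a, b) and the inverse of the 2x2 Hessian.
    """
    n = len(xs)
    sxx = sum(x * x for x in xs) / n
    sx = sum(xs) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sy = sum(ys) / n
    # Hessian of the objective w.r.t. (a, b) is [[sxx + lam, sx], [sx, 1 + lam]].
    h00, h01, h11 = sxx + lam, sx, 1.0 + lam
    det = h00 * h11 - h01 * h01
    h_inv = ((h11 / det, -h01 / det), (-h01 / det, h00 / det))
    # Stationarity gives H @ (a, b) = (sxy, sy).
    a = h_inv[0][0] * sxy + h_inv[0][1] * sy
    b = h_inv[1][0] * sxy + h_inv[1][1] * sy
    return (a, b), h_inv


def self_influence(x, y, params, h_inv):
    """g^T H^{-1} g, where g is the gradient of this point's loss at the fit."""
    a, b = params
    r = a * x + b - y
    g = (r * x, r)
    return sum(g[i] * h_inv[i][k] * g[k] for i in range(2) for k in range(2))


# Synthetic data (an assumption for illustration): y = 2x + 1 + small noise.
random.seed(0)
N = 200
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.1) for x in xs]

# Corrupt 10% of the targets (a regression stand-in for changing labels).
corrupted = set(random.sample(range(N), N // 10))
for j in corrupted:
    ys[j] += 5.0

params, h_inv = fit_line(xs, ys, lam=0.01)
scores = [self_influence(x, y, params, h_inv) for x, y in zip(xs, ys)]

# Flag the top 10% most self-influential points as suspicious.
ranked = sorted(range(N), key=lambda j: scores[j], reverse=True)
flagged = set(ranked[: N // 10])
recall = len(flagged & corrupted) / len(corrupted)
print(f"fraction of corrupted points among the flagged 10%: {recall:.2f}")
```

A convex toy like this is much easier than MNIST with a neural network, so the recall it prints says nothing about the roughly 60% figure quoted above. It only shows the shape of the procedure.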

[Edit] Other empirical evidence since the initial post date: influence functions correctly predict that LLMs don't generalize from "A is B" to "B is A".

I would be excited about future work using influence functions to make specific predictions about the effect of removing some points from the training set and retraining the network to see if those predictions were accurate. I would be even more interested if those predictions were about generalization properties of LLMs outside their training distribution.

Speculations about what influence functions won't and will be useful for

The main practical use case I see is formulating hypotheses about what kind of dataset modification would lead to what changes in behavior. I say “formulating hypotheses” and not something stronger like “making suggestions” because I currently see no strong evidence that influence functions can make reliable predictions about effects of modifying training datasets.

If in the future, influence functions are shown to be a good way to suggest and predict effects of dataset modifications, how would that be differentially useful for AI safety? I see 3 paths forward:

  1. Influence functions might help generate hypotheses about what makes adversarial training work best, and thus improve its sample efficiency significantly (while capabilities of frontier models might benefit differentially more from scale alone);
  2. Localization suggestions from influence functions might help steer the inductive biases of models by suggesting which parts of the model to freeze during some crucial fine-tuning stages (e.g. by favoring changes to possible motives as opposed to superficial word choices);
  3. Influence functions might lead to better intuitions about how models work internally: since influence functions describe the influence of points on specific weights, they could tell us which weights would be influenced as a result of dataset changes, which could help formulate hypotheses about the inner workings of neural networks.

