Shapley Value Attribution in Chain of Thought

AI ALIGNMENT FORUM

TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations.

Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others. Code for replicating the Shapley value results can be found here.

Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion.

Motivation

Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they should accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model’s cognition, and there may exist theoretical limitations to doing so in general.

Because it is plausible that the first AGI systems will resemble current LMs with more sophisticated CoT and CoT-like techniques, it is valuable to study the properties of CoT, and to understand and address its limitations.

Related work

Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al., 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight some areas where Shapley-value-based attribution falls short for some interpretability use cases.

Madaan and Yazdanbakhsh (2022) consider a similar method, selectively ablating tokens to deduce what information the model depends on. Wang et al. (2022) find that prompting with incorrect CoT has a surprisingly minor impact on performance.

Effect of Interventions

We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4 to first generate a chain of thought and evaluate the answer. Then, for every chain of thought that results in a correct answer, we perform an intervention: we choose a random numerical value found in the CoT and replace it with a random number within ±3 of the original. We then discard the remainder of the CoT and regenerate it. If the LM were strictly following the CoT as written, this intervention should almost always result in an incorrect answer, just as a person who made a mistake in one calculation would propagate the error through to the final answer. (Occasionally the new value happens to also lead to the correct answer, but from qualitative inspection this is very rare.)
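The perturbation step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiments: the function name `perturb_cot`, the integer-only regex, and the choices to exclude a zero offset and negative results are all assumptions made for the example. Regenerating the rest of the CoT with an LM is left out.

```python
import random
import re

# Assumption: "numerical value" means a run of digits; decimals are not handled.
NUMBER = re.compile(r"\d+")


def perturb_cot(cot: str, rng: random.Random, max_delta: int = 3):
    """Replace one randomly chosen integer in `cot` with a nearby value and
    drop everything after it.

    Returns (truncated_prefix, original_value, new_value). The caller would
    then ask the model to continue from `truncated_prefix`.
    """
    matches = list(NUMBER.finditer(cot))
    if not matches:
        raise ValueError("no numerical value found in the chain of thought")
    m = rng.choice(matches)
    original = int(m.group())
    # Assumption: the new value differs from the original and is non-negative.
    candidates = [
        original + d
        for d in range(-max_delta, max_delta + 1)
        if d != 0 and original + d >= 0
    ]
    new_value = rng.choice(candidates)
    # Keep the text before the chosen number, substitute the new value,
    # and discard the remainder of the CoT.
    prefix = cot[: m.start()] + str(new_value)
    return prefix, original, new_value


if __name__ == "__main__":
    example = "She has 23 apples and eats 20, so 23 - 20 = 3 remain."
    print(perturb_cot(example, random.Random(0)))
```

An error counts as "not propagated" when the regenerated continuation still reaches the original correct answer despite the altered value in the prefix.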

Some cherrypicked examples (red = intervention, blue = correct continuations that are seemingly non-sequiturs):

We test how frequently this occurs in several different settings (n=100):

Setting	Accuracy (w/ CoT)	P(error not propagated | original correct)
GPT-4, zero-shot	0.88	0.68
GPT-4 base, 2-shot	0.73	0.63
GPT-3.5, zero-shot	0.43	0.33

Interestingly, if we condition on the CoT answer being correct and the single-forward-pass answer being incorrect (i.e., the LM could only solve the problem with the CoT), the accuracy after intervention for GPT-4 is still 0.65.

Shapley value attribution

We would like to get more granular information about the causal structure (i.e., which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major problem is that, especially in the larger models, there are many cases where a token depends on multiple previous tokens in some complicated way. In particular, if a model looks at multiple places in the context and takes a vote for the most common value, then intervening on any one of them doesn't change the output logprob much, even though a lot of information is flowing from those places.

To get around this problem, we instead estimate Shapley values, which take into account all the interactions (in the case where the model takes a vote among 3 values in the context, those three values would each get ⅓ of the attribution).^[1] We also normalize the attributions to sum to 1 for each token, clamping negative Shapley values to 0.^[2] We do this to make the attributions more comparable across different models.
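To make the voting example concrete, here is a small sketch of exact Shapley values on a toy value function. The function `majority_vote` is a hypothetical stand-in for the model (it returns 1.0 when at least two of three copies of a value are visible), and `normalize` assumes that clamping happens before rescaling; neither is taken from the experiment code.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n, value):
    """Exact Shapley values for players 0..n-1.

    `value` maps a frozenset of visible players to a float. The cost is
    exponential in n, since every subset of the other players is visited.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Probability that exactly this size-k coalition precedes player i
            # in a uniformly random ordering.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def majority_vote(visible):
    """Toy model (assumption): output is correct iff >= 2 of 3 copies are visible."""
    return 1.0 if len(visible) >= 2 else 0.0


def normalize(phi):
    """Clamp negative attributions to 0, then rescale to sum to 1 (assumed order)."""
    clamped = [max(p, 0.0) for p in phi]
    total = sum(clamped)
    return [c / total for c in clamped] if total > 0 else clamped


if __name__ == "__main__":
    everyone = frozenset({0, 1, 2})
    # Blanking any single copy leaves the vote unchanged, so the effect is 0.
    print([majority_vote(everyone) - majority_vote(everyone - {i}) for i in range(3)])
    # Shapley values split the credit evenly among the three copies.
    print(normalize(exact_shapley(3, majority_vote)))
```

In this toy setting the single-token intervention reports no dependence on any of the three copies, while the Shapley attribution assigns each of them one third.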

Here's an example chain of thought in GPT-4^[3]:

Here, we can see patterns like the 23 and 20 being copied, or the 3 depending heavily on the preceding 23 - 20.^[4] We can also look at some other models:

GPT-3.5 (text-davinci-002):

text-davinci-001:

Interestingly, the more capable the model, the more spread out the attributions become. We can quantify this spread as the mean entropy of the parent attributions across all tokens, at least on this particular data sample:

Model	Mean entropy of example sentence (nats)
text-davinci-001	0.796
GPT-3.5 (text-davinci-002)	0.967
GPT-4	1.133
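As a rough illustration of the entropy measure, the sketch below computes the Shannon entropy (in nats) of each token's normalized parent attributions and averages over tokens. The attribution rows are made-up numbers for demonstration only, and the function names are hypothetical.

```python
import math


def attribution_entropy(weights):
    """Shannon entropy in nats of a non-negative vector that sums to 1.

    Zero entries are skipped, following the convention 0 * log(0) = 0.
    """
    return -sum(w * math.log(w) for w in weights if w > 0)


def mean_entropy(rows):
    """Average entropy over tokens; each row is one token's parent attributions."""
    return sum(attribution_entropy(r) for r in rows) / len(rows)


if __name__ == "__main__":
    # Hypothetical attributions, not data from the post.
    rows = [
        [1.0, 0.0, 0.0],            # copied from a single parent: entropy 0
        [1.0 / 3, 1.0 / 3, 1.0 / 3],  # spread evenly over three parents: ln(3)
    ]
    print(mean_entropy(rows))
```

A token that depends on one parent contributes zero entropy, and one that depends equally on k parents contributes ln(k), so a higher mean indicates more diffuse attributions.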

Limitations and Future Work

The cost of computing the Shapley value scales exponentially with the number of tokens we're attributing.^[5] This makes it impractical for many use cases, though there exist efficient Monte Carlo estimators (Castro et al., 2008).
Replacing digits with underscores (or incorrect numbers) moves the model out of distribution, and its behaviour may not be as representative.
The Shapley attributions are not guaranteed to correspond to the actual information flow inside the model either. This methodology would not be sufficient for deceptive/adversarial LMs, or as an optimization target during training. In the language of Lipton (2016), this is a "post-hoc" method.
The mechanism behind this effect is still unknown, and would require more experiments and possibly interpretability to better understand. Possible hypotheses include typo correction or subsurface cognition.
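For the first limitation, a permutation-sampling estimator of the general kind cited above can be sketched as follows. This is an illustrative version under stated assumptions, not the implementation from the cited paper or from the experiment code; `monte_carlo_shapley` and the toy `majority_vote` value function are hypothetical names.

```python
import random


def monte_carlo_shapley(n, value, num_permutations, rng):
    """Estimate Shapley values by averaging marginal contributions over
    uniformly random orderings of the n players.

    Uses n + 1 value evaluations per permutation instead of visiting all
    2^n subsets.
    """
    phi = [0.0] * n
    players = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(players)
        coalition = set()
        prev = value(frozenset(coalition))
        for i in players:
            coalition.add(i)
            cur = value(frozenset(coalition))
            phi[i] += cur - prev  # marginal contribution of i in this ordering
            prev = cur
    return [p / num_permutations for p in phi]


def majority_vote(visible):
    """Toy value function (assumption): 1.0 iff at least 2 of 3 players are visible."""
    return 1.0 if len(visible) >= 2 else 0.0


if __name__ == "__main__":
    print(monte_carlo_shapley(3, majority_vote, 6000, random.Random(0)))
```

Each sampled ordering distributes exactly value(all) - value(none) among the players, so the estimates always sum to that total; only the split between players is noisy.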

Discussion

I think these experiments show that a naive optimistic view of CoT interpretability is incorrect, but do not provide strong evidence that there is definitely something fishy or difficult-to-fix going on.
I started out in a place of fairly high skepticism of CoT for increasing interpretability, and I didn't update very strongly because I expect deceptive alignment in future models to be most of the risk in my threat model.
However, I did update a little because my previous view would not have ruled out extremely egregious causal dependencies even in current models.
I'm generally excited about better understanding what is going on with chain of thought and finding ways to make it more faithful.

^[1] Methodological note: the Shapley experiments blank out the numbers with underscores, rather than applying the ±3 perturbation from the previous section.
^[2] Negative Shapley values did not occur very often, but clamping them is still somewhat unprincipled. This was done primarily to make the entropy calculation work.
^[3] We only look at the Shapley values for the numbers, because computing Shapley values takes exponentially longer as more tokens are considered.
^[4] Alternative visualization style:
^[5] When computing Shapley value attributions for every pair of tokens, there is a dynamic programming trick that we can use to prevent the cost from becoming n * 2^n: because the model is autoregressive, we can run attributions to only the last token, and, if we take care to save the logprobs for the correct number token at each underscore, compute all other attributions for free.
