Practical Pitfalls of Causal Scrubbing

Phil3; tony; jacquesthibs; David Lindner

Thanks for your work!

Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses

I think that what you mean here is a combination of the following:

CaSc fails to reject some false hypotheses, as already discussed.
Each node in the interpretation graph is only verified up to extensional equality. As in, if I claim that a single node in the graph is a whole sort function, I don't learn anything about whether the model is implementing quicksort or mergesort.

But one way someone could interpret this sentence is that CaSc doesn't distinguish whether the model does quicksort or mergesort. This isn't generally true--if your interpretation graph broke up its quicksort implementation into multiple nodes, then the CaSc experiment would fail to explain the model's performance if the model itself was actually using mergesort.
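A toy illustration of the distinction (not CaSc itself, and all names here are hypothetical): two sort functions that agree on every output but disagree on their intermediate values. A hypothesis with a single "sort" node can only be checked against the final output, while one that names intermediate nodes also makes claims about the traces.

```python
def quicksort(xs, trace):
    """Sort xs, recording each partition step in `trace`."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    trace.append(("partition", smaller, pivot, larger))
    return quicksort(smaller, trace) + [pivot] + quicksort(larger, trace)


def mergesort(xs, trace):
    """Sort xs, recording each merge step in `trace`."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left = mergesort(xs[:mid], trace)
    right = mergesort(xs[mid:], trace)
    trace.append(("merge", left, right))
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


data = [3, 1, 2]
qs_trace, ms_trace = [], []
qs_out = quicksort(data, qs_trace)
ms_out = mergesort(data, ms_trace)

same_output = qs_out == ms_out          # extensionally equal as whole functions
same_internals = qs_trace == ms_trace   # but the intermediate nodes differ
```

Here `same_output` is true and `same_internals` is false, which is the sense in which a one-node hypothesis says nothing about the algorithm while a multi-node one does.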

[-]Buck3y127

Here's a take of mine on how you should think about CaSc that I haven't so far gotten around to publishing anywhere:

I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won't affect the average output of your computational graph.

Thinking about it this way is helpful to me for two reasons:

False hypotheses can make true predictions; this is basically why CaSc can fail to reject false hypotheses.
It also emphasizes why I'm unsympathetic to claims that "it sets the bar too high for something being a legit circuit"--IMO, if you claimed that your model has some internal structure well described by a hypothesis that fits into the CaSc structure (which is true of almost all interp hypotheses in practice), I don't really see how the failure of a CaSc test is compatible with that hypothesis being true (modulo my remaining questions about how bad it is for a hypothesis to get a middling CaSc score).

CaSc attempts to compute the single most aggressive prediction made by your hypothesis--this is why we do all allowed swaps. (I'm a bit confused about whether we should think of CaSc as succeeding at being the most aggressive experiment for the hypothesis, though; I think there are some subtleties here that my coworkers have worked out that I don't totally understand.)

I think I regret that we phrased our writeup as "CaSc gives you a test of interp hypotheses" rather than saying "CaSc shows you a strong prediction made by your interp hypothesis, which you can then compare to the truth, and if they don't match that's a problem for your hypothesis".
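A rough sketch of the "prediction" framing, heavily simplified: real CaSc resamples activations at nodes of the model according to the correspondence, whereas this toy swaps whole inputs. The model, labels, and function names are assumptions made up for illustration. The hypothesis is encoded as a `feature` function; inputs with the same feature value are treated as interchangeable, and the hypothesis predicts that swapping them leaves the average loss unchanged.

```python
from itertools import product


def mean_loss(model, label, inputs):
    """Average absolute error of the unmodified model."""
    return sum(abs(model(x) - label(x)) for x in inputs) / len(inputs)


def scrubbed_mean_loss(model, label, inputs, feature):
    """Average loss when each input is swapped for inputs the hypothesis
    calls equivalent (same `feature` value), averaged over all such swaps."""
    per_input = []
    for x in inputs:
        swaps = [s for s in inputs if feature(s) == feature(x)]
        losses = [abs(model(s) - label(x)) for s in swaps]
        per_input.append(sum(losses) / len(losses))
    return sum(per_input) / len(per_input)


inputs = list(product([0, 1], repeat=2))
model = lambda x: x[0] & x[1]   # toy model: AND of two bits
label = lambda x: x[0] & x[1]   # the model is perfect on this task

original = mean_loss(model, label, inputs)

# False hypothesis: "only the first bit matters".
scrubbed_false = scrubbed_mean_loss(model, label, inputs, feature=lambda x: x[0])

# True hypothesis: "both bits matter" (only identical inputs are equivalent).
scrubbed_true = scrubbed_mean_loss(model, label, inputs, feature=lambda x: x)
```

In this toy, the false hypothesis predicts an unchanged loss of 0 but the scrubbed loss comes out at 0.25, so its prediction fails; the true hypothesis predicts 0 and gets 0.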

[-]Buck3y21

Something I've realized over the last few days:

Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?

The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.”

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.

I just now have realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’). And so, when explaining CaSc, I think we should plausibly think about describing it by talking about the hypothesis producing a bunch of allowed experiments, and then you can test your hypothesis by either looking at the maxent one or by looking at the worst one.

[-]David Lindner3y20

Thanks, that's a useful alternative framing of CaSc!

FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is "cancellation" which comes from looking at an average CaSc loss. If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kinds of cancellation problems.

Plausibly you'd run into different failure modes though. In particular, I guess the maximum measure is less smooth and gives you less information on "how wrong" your hypothesis is.
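A minimal sketch of average versus maximum, under an assumed reading of the sum-with-offset example: the model returns the true sum plus 2, the hypothesis says the last element is irrelevant, and that element is resampled uniformly from {+1, -1}. The helper name and the per-swap treatment of "worst experiment" are assumptions for illustration.

```python
from itertools import product

OFFSET = 2  # assumed: the model returns the true sum plus 2


def abs_error(original_last, swapped_last):
    """Absolute error of the scrubbed model when only the last element is swapped.

    The first n-1 elements are left alone, so they cancel against the true sum.
    """
    return abs((swapped_last - original_last) + OFFSET)


# All (original, swapped) pairs for a last element drawn uniformly from {+1, -1}.
swaps = list(product([+1, -1], repeat=2))
errors = [abs_error(orig, swap) for orig, swap in swaps]

unscrubbed_mae = abs_error(+1, +1)        # no effective swap: error is just the offset
average_mae = sum(errors) / len(errors)   # what an averaged scrubbed loss sees
worst_mae = max(errors)                   # what a max-over-swaps measure sees
```

Under these assumptions the average equals the unscrubbed error (2), so the false hypothesis passes, while the maximum (4) does not match it.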

[-]Adrià Garriga-alonso3y10

If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one) you don't get these kinds of cancellation problems

I think this "max loss" procedure is different from what Buck wrote and the same as what I wrote.

[-]Adrià Garriga-alonso3y10

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.

I just now have realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’).

One thing that is not equivalent to joins, which you might also want to do, is to choose the single worst swap that the hypothesis allows. That is, if a set of node values are all equivalent, you can choose to map all of them to e.g. $x_{1}$. That can be more aggressive than any partition of $X$ from which values are then sampled at random, and it does not correspond to a join.

[-]Adrià Garriga-alonso3y22

Cool work! I was going to post about how "effect cancellation" is already known and was written in the original post but, astonishingly to me, it is not! I guess I mis-remembered.

There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.

If we compare E[abs(loss - scrubbed loss)] does this problem go away? I imagine that it doesn't quite if there are exactly-opposing causes for each example, but that seems less likely to happen in practice.
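To make the two quantities concrete, here is a small sketch on an assumed reconstruction of the XOR example: the label is the XOR of two bits, the model returns the first bit, and the false hypothesis says the output depends only on the second bit. The setup and names are assumptions for illustration, and the swaps are whole-input swaps rather than activation resampling.

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))
model = lambda x: x[0]           # assumed: the model returns the first bit
label = lambda x: x[0] ^ x[1]    # assumed task: XOR of the two bits
feature = lambda x: x[1]         # false hypothesis: only the second bit matters

# Every (input, allowed swap) pair; each input has the same number of swaps,
# so a uniform average over pairs is an average over inputs and swaps.
pairs = [(x, s) for x in inputs for s in inputs if feature(s) == feature(x)]

loss = [abs(model(x) - label(x)) for x, _ in pairs]
scrubbed = [abs(model(s) - label(x)) for x, s in pairs]


def mean(values):
    return sum(values) / len(values)


gap_of_means = abs(mean(loss) - mean(scrubbed))                 # abs(E[loss] - E[scrubbed loss])
mean_of_gaps = mean([abs(a - b) for a, b in zip(loss, scrubbed)])  # E[abs(loss - scrubbed loss)]
```

In this toy, `gap_of_means` is 0 (the improvements and degradations cancel) while `mean_of_gaps` is 0.5, so the per-example comparison does flag the false hypothesis here.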

(There's a section on this in the appendix but it's rather controversial even among the authors)

[-]Lucius Bushnaq3y00

CaSc can fail to reject a hypothesis if it is too unspecific and is extensionally equivalent to the true hypothesis.

Seems to me like this is easily resolved so long as you don't screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?

On the input-output level, we found that CaSc can fail to reject false hypotheses due to cancellation, i.e. because the task has a certain structural distribution that does not allow resampling to differentiate between different hypotheses.

I don't know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

Wouldn't you want to directly compare the divergence on outputs between the original graph and ablated graph $I$ instead? The $D_{KL}$ divergence between their output distributions over the data is the first thing that'd come to my mind. Or keeping whatever the original loss function is, but with the outputs of $G$ as the new ground truth labels.

That's still ad hocery of course, but it should at least take care of the failure mode you point out here. Is this really not part of current CaSc?
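A small sketch of that comparison, with made-up numbers: two sets of output distributions that have identical average loss against the labels but a clearly nonzero KL divergence from each other. The distributions, labels, and helper names are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(dist, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(dist[label])


labels = [0, 1]
original_outputs = [[0.1, 0.9], [0.1, 0.9]]   # hypothetical outputs of the original graph
scrubbed_outputs = [[0.9, 0.1], [0.9, 0.1]]   # hypothetical outputs of the ablated graph

original_loss = sum(cross_entropy(d, y) for d, y in zip(original_outputs, labels)) / len(labels)
scrubbed_loss = sum(cross_entropy(d, y) for d, y in zip(scrubbed_outputs, labels)) / len(labels)

mean_kl = sum(
    kl_divergence(p, q) for p, q in zip(original_outputs, scrubbed_outputs)
) / len(labels)
```

The two average losses agree, so a comparison of scalar losses sees no difference, while `mean_kl` is well above zero.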

[-]Buck3y30

Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I'll be able to predict whether it will generalize correctly onto a particular new distribution.

The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following: There's some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won't help you know if the mechanism that the model was using is going to generalize well to another distribution.

There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you're trying to understand the model's behavior on, but not in a way that ends up affecting the model's average performance, this is basically another source of noise that you (at least often) end up not wanting your explanation to have to capture.

[1] The small difference in the Scrubbed Loss is likely due to sampling noise.

[2] In 25% of cases, we swap a +1 with a -1, resulting in a perfect estimate with a mean absolute error of 0 (underestimating the true sum by 2 and then adding 2). In 25% of cases, we swap a -1 with a +1, resulting in a mean absolute error of 4 (overestimating the true sum by 2 and then adding 2). In the remaining 50% of cases, where we “replace” +1 with +1 or -1 with -1, we get the original MAE of 2. So, overall, we still get an MAE of 0.25 · 0 + 0.25 · 4 + 0.5 · 2 = 2.

[3] Researchers at Redwood Research have told us that they have also started to look at the loss for individual samples.

Interpretation	Scrubbed Loss (MAE)
Correct Interpretation: $\frac{1}{x_{0}} \leq \frac{1}{x_{1}}$	0
Random Labels	0.49
“Incorrect” Interpretation: $x_{0} \geq x_{1}$	0

Interpretation	Scrubbed Loss (MAE)
Graph $G$ (returns first bit)	0.50
Random Labels	0.50
Hypothesis I’ (returns second bit)	0.51^[1]

Interpretation	Scrubbed Loss (MAE)
Graph G (returns sum + 2)	2.0
Random Labels	6.0
Hypothesis I’ (returns sum of first n-1 elements)	2.0

AI ALIGNMENT FORUM

Introduction

Background on Causal Scrubbing (CaSc)

How to evaluate Causal Scrubbing?

How Consistent is Causal Scrubbing?

Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses

Causal Scrubbing Can Fail to Reject False Hypotheses

Example 1 (XOR)

Example 2 (Sum with offset)

Implications for Causal Scrubbing