I recently released “Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting” with collaborators Julian Michael, Ethan Perez, and Sam Bowman. For a summary of the paper, you can check out this twitter thread. In this post, I briefly elaborate on motivations/implications relevant to alignment and, most importantly, give some examples of future work that might address these problems (See Future Work).
I don’t think these results fully condemn CoT given that we haven’t tried very hard to explicitly encourage faithfulness. I’m uncertain about how promising CoT is as a starting point for explainability, but there seem to be enough tractable directions for future work that it merits investigation.
This work fits into alignment through the Externalized Reasoning agenda, which Tamera Lanham did a good job of sketching out here: Externalized reasoning oversight: a research direction for language model alignment. The gist of this approach is to try to get models to do as much processing/reasoning through natural language as possible. As long as these reasoning traces accurately describe the process the model uses to give answers, then we might more easily detect undesirable behavior by simply monitoring the externalized reasoning. If mechanistic interpretability turns out to be very difficult, this could be one alternative that might help us sidestep those problems.
Framed in terms of the explainability literature, we want explanations that are not only plausible (convincing to a human) but also faithful (accurately describe the process models use to give some prediction) . Getting CoT explanations to be faithful seems difficult. It might turn out that it’s just as hard as getting other alignment proposals to work, e.g., it might require us to solve scalable oversight. However, even if CoT can’t give us guarantees about avoiding bad behavior, it still could be valuable in the spirit of “It is easy to lie with statistics; it is easier to lie without them”—it may be possible to produce unfaithful (but detailed and internally consistent) externalized reasoning to justify taking bad actions that were actually chosen for other reasons, but it would be even easier for them to do bad things if they did not give any justifications at all.
We know that language models are sensitive to various undesirable factors, e.g., social biases, repeated patterns in contexts, and the inferred views of users they’re interacting with. One can leverage these features to bias models towards incorrect answers. With this paper we sought to investigate the following question: if you do CoT in the presence of these aforementioned biasing features, how does this affect performance when using CoT and how does this change the content of the CoT? For example, we reorder answer choices in a few-shot prompt such that the correct answer is always (A), then we observe if CoT explanations are more likely to rationalize an explanation for (A), even when that answer is incorrect.
We found that:
I encourage people to look at the samples in the paper to get a sense of the range of different ways the explanations can be unfaithful.
I think when you use CoT (and explain-then-predict methods more generally), the reason for the final prediction factors into (A) the CoT explanation, and (B) the reason that the model produced the CoT explanation. But, we can only see the first part. So, to get more faithful explanations, there are a few approaches:
I’m fairly uncertain if this is the right breakdown so definitely open to feedback. With this in mind, here are some possible directions:
There are a number of reasons to expect CoT to not be faithful by default. Two reasons should be familiar to readers:
However, there are a number of other reasons:
Thanks to my co-authors Julian Michael, Ethan Perez, and Sam Bowman for feedback on drafts of this post.
 Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? - ACL Anthology
 Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
 Faithful Reasoning Using Large Language Models
 Let's Verify Step by Step
 Discovering Language Model Behaviors with Model-Written Evaluations
 Language Models as Agent Models
A third factor is the model's prior over answer choices independent of the explanation since we know that models frequently make final predictions that contradict their CoT explanations. Leo Gao has some good work showing that final model predictions can be very insensitive to edits to the generated CoT explanation that models should respond to. Personally, I think this might improve by default with better models, insofar as better models are less likely to say contradictory things. The prior over answers also plays a large role in cases where the explanation doesn’t pick out a particular answer choice, leaving room for this factor to influence the final prediction.