We tried to figure out how a model's beliefs change during a chain-of-thought (CoT) when it solves a logical problem. Measuring this could reveal which parts of the CoT actually causally influence the final answer and which are just fake reasoning manufactured to sound plausible. (Note that preventing such fake reasoning is only one side of CoT faithfulness; the other is preventing genuine reasoning that stays hidden from the CoT.)
We estimate these beliefs by truncating the CoT early and asking the model for an answer. Naively, one might expect the probability of a correct answer to increase smoothly over the whole CoT. However, it turns out that even for a straightforward, short chain of thought, P[correct_answer] fluctuates a lot with the number of CoT tokens...
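To make the measurement concrete, here is a minimal sketch of the truncate-and-ask procedure. It assumes a hypothetical `sample_answer(prompt)` helper that wraps whatever model API you use and returns one sampled final-answer string; the `<think>` tags are likewise an assumption about the prompt template, not the exact setup we used.

```python
from typing import Callable, List, Tuple

def estimate_belief(sample_answer: Callable[[str], str], question: str,
                    cot_tokens: List[str], prefix_len: int,
                    correct_answer: str, n_samples: int = 20) -> float:
    """Estimate P[correct_answer] after `prefix_len` tokens of CoT.

    We cut the CoT short, close the thinking block, and append an
    answer cue so the model must answer immediately instead of
    continuing to reason.
    """
    prefix = "".join(cot_tokens[:prefix_len])
    prompt = f"{question}\n<think>{prefix}</think>\nFinal answer:"
    hits = sum(sample_answer(prompt).strip() == correct_answer
               for _ in range(n_samples))
    return hits / n_samples

def belief_trajectory(sample_answer: Callable[[str], str], question: str,
                      cot_tokens: List[str], correct_answer: str,
                      step: int = 10) -> List[Tuple[int, float]]:
    """P[correct_answer] at every `step` tokens along the CoT."""
    return [(k, estimate_belief(sample_answer, question, cot_tokens,
                                k, correct_answer))
            for k in range(0, len(cot_tokens) + 1, step)]
```

Plotting the resulting `(prefix_len, probability)` pairs gives the belief curve discussed above; the fluctuations show up as non-monotonic jumps in that curve.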