Honestly, I don't think we have any very compelling ones! We gesture at some possibilities in the paper, such as it being harder for the model to ignore its reasoning when it's in an explicit question-and-answer format (as opposed to a more free-form CoT), but I don't think we have a good understanding of why it helps. It's also worth noting that CoT decomposition helps mitigate the ignored reasoning problem, but actually is more susceptible to biasing features in the context than CoT. Depending on how you weigh the two, it's possible that CoT might still come out ahead on reasoning faithfulness (we chose to weigh the two equally).
Thanks for this post Lawrence! I agree with it substantially, perhaps entirely.One other thing that I thing interacts with the difficulty of evaluation in some ways is the fact that many AI safety researchers think that most of the work done by some other researchers is approximately useless, or even net-negative in terms of reducing existential risk. I think it's pretty easy to wrap an evaluation of a research direction or agenda and an evaluation of a particular researcher together. I think this is actually pretty justified for more senior researchers, s... (read more)
I'm pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve in favor of things like developing better scalable oversight techniques (which I am also in favor of, to be clear). I don't think I'm quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average and not worst-case).
One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In part
in particular how much causal scrubbing can be turned into an exploratory tool to find circuits rather than just to verify them
I'd like to flag that this has been pretty easy to do - for instance, this process can look like resample ablating different nodes of the computational graph (eg each attention head/MLP), finding the nodes that when ablated most impact the model's performance and are hence important, and then recursively searching for nodes that are relevant to the current set of important nodes by ablating nodes upstream to each important node.