Top posts
In "The case for CoT unfaithfulness is overstated", @nostalgebraist pointed out that reading the chain-of-thought (CoT) reasoning of models is neglected as an interpretability technique. I completely agree, but I also believe that with our current methods, there's a significant risk of being misled by the CoTs of models like...
Thanks to Neel Nanda, Madeleine Chang, Michael Chen, Oliver Zhang, and Aidan O’Gara for their early feedback on this post. Summary: Alignment researchers could record the intermediate outputs of their work in anticipation of using the data to train future AI models. These models could eventually accelerate alignment research by...