Summary: When discussing with other AI safety researchers the possibility that LLMs will cease to reason in transparent natural language, we have sometimes noticed that we talk past each other: e.g., when discussing ‘neuralese reasoning’, some people have indefinitely long chains of recurrent activations in mind, while others think of...
Update: Our code for the original version of this post contained some bugs. The high-level takeaways remain the same after making the corrections, but there are important differences in the details. We give an overview of those changes in Appendix E. We apologize for having presented inaccurate results at first!...
Note 1: This article was written for the EA UC Berkeley Distillation Contest, and is also my capstone project for the AGISF course. Note 2: All claims here about what different researchers believe and which definitions they endorse are my interpretations; any interpretation errors are my own....