Summary: When discussing with other AI safety researchers the possibility that LLMs will cease to reason in transparent natural language, we have sometimes noticed that we talk past each other: e.g., when discussing ‘neuralese reasoning’, some people have indefinitely long chains of recurrent activations in mind, while others think of...
Update: Our code for the original version of this post contained some bugs. The high-level takeaways remain the same after making the corrections, but there are important differences in the details. We give an overview of those changes in Appendix E. We apologize for having presented inaccurate results at first!...
Note 1: This article was written for the EA UC Berkeley Distillation Contest, and is also my capstone project for the AGISF course. Note 2: All claims here about what different researchers believe and which definitions they endorse are my interpretations; any interpretation errors are my own....