I recently spent a few months thinking about whether LLM-based models can do model-based planning, and wrote a ~40-page report on it: "Report on LLMs and model-based planning". The doc is still a bit rough around the edges - most notably, the concepts of "efficient planning" and "sufficiently convoluted" tasks...
Work performed as part of Neel Nanda's MATS 6.0 (Summer 2024) training program. TLDR This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many...
Thanks to Dan Roberts and Sho Yaida for comments on a draft of this post. In this post, I would like to draw attention to the book Principles of Deep Learning Theory (PDLT), which I think represents a significant advance in our understanding of how neural networks work [1]. Among...
This is a linkpost for a review of Ajeya Cotra's Biological Anchors report (see also update here) that I wrote in April 2022. It's since won a prize from the EA criticism and red-teaming contest, so I thought it might be good to share here for further discussion. Here's a...
In this post, I'll summarize my views on AGI safety after thinking about it for two years while on the Research Scholars Programme at FHI -- before which I was quite skeptical about the entire subject. [1] My goals here are twofold: (1) to introduce the problem from scratch in...
Introduction Transparency is the problem of opening the black box of an AI system and understanding exactly how it works, by tracking the processes in its internal state. In this post, I’ll argue that making AI systems more transparent could be very useful from a longtermist or...