Preamble

I heavily recommend @beren's "Deconfusing Direct vs Amortised Optimisation". It's a very important conceptual clarification that has changed how I think about many issues bearing on technical AI safety. Currently, it's the most important blog post I've read this year. This sequence (if I get around to completing it)...
Abstract

Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground-truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability...
Epistemic Status

Unsure[1], partially noticing my own confusion. Hoping Cunningham's Law can help resolve it.

Related Answer

Confusions About Arguments From Expected Utility Maximisation

Some MIRI people (e.g. Rob Bensinger) still highlight EU maximisers as the paradigm case for existentially dangerous AI systems. I'm confused by this for a few...