This is partly a linkpost for Predictive Concept Decoders, and partly a response to Neel Nanda's Pragmatic Vision for AI Interpretability and Leo Gao's Ambitious Vision for Interpretability. There is currently something of a debate in the interpretability community between pragmatic interpretability---grounding problems in empirically measurable safety tasks---and ambitious interpretability---obtaining...
TL;DR: We present a retrieval-augmented LM system that approaches the performance of the human crowd on judgmental forecasting. Paper: https://arxiv.org/abs/2402.18563 (Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, and Jacob Steinhardt) Twitter thread: https://twitter.com/JacobSteinhardt/status/1763243868353622089 Abstract Forecasting future events is important for policy and decision-making. In this work, we study whether language models (LMs) can...
GPT-4 surprised many people with its abilities at coding, creative brainstorming, letter-writing, and other skills. Surprises in machine learning are not restricted to GPT-4: I was previously surprised by Minerva’s mathematical abilities, as were many competitive forecasters. How can we be less surprised by developments in machine learning? Our brains...
[Note: this post was drafted before Sydney (the Bing chatbot) was released, but Sydney demonstrates some particularly good examples of some of the issues I discuss below. I've therefore added a few Sydney-related notes in relevant places.] I’ve previously argued that machine learning systems often exhibit emergent capabilities, and that...
To learn more about this work, check out the paper. We assume general familiarity with transformer circuits. Intro: There isn't much interpretability work that explains end-to-end how a model is able to do some task (outside of toy models). In this work, we make progress towards this goal by understanding...
Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post. Previously, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask Language Understanding) datasets. I argued that most people, including myself, significantly underestimated the...
My students and collaborators have been doing some particularly awesome work over the past several months, and to highlight that I wanted to summarize their papers here, and explain why I'm excited about them. There are six papers in three categories. Human-Aligned AI * The Effects of Reward Misspecification: Mapping and...