jsteinhardt — AI Alignment Forum

The Case for Evaluating Model Behaviors

Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on. From a safety perspective, capability evaluations have a place: by understanding how close we are to different capabilities, and the rate of progress...

May 20 · 42

Scalable End-to-End Interpretability

This is partly a linkpost for Predictive Concept Decoders, and partly a response to Neel Nanda's Pragmatic Vision for AI Interpretability and Leo Gao's Ambitious Vision for Interpretability. There is currently something of a debate in the interpretability community between pragmatic interpretability---grounding problems in empirically measurable safety tasks---and ambitious interpretability---obtaining...

Dec 18, 2025 · 124

Approaching Human-Level Forecasting with Language Models

by Fred Zhang, dannyhalawi, and jsteinhardt

TL;DR: We present a retrieval-augmented LM system that nears human crowd performance on judgemental forecasting. Paper: https://arxiv.org/abs/2402.18563 (Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, and Jacob Steinhardt). Twitter thread: https://twitter.com/JacobSteinhardt/status/1763243868353622089 Abstract: Forecasting future events is important for policy and decision-making. In this work, we study whether language models (LMs) can...

Feb 29, 2024 · 60

What will GPT-2030 look like?

GPT-4 surprised many people with its abilities at coding, creative brainstorming, letter-writing, and other skills. Surprises in machine learning are not restricted to GPT-4: I was previously surprised by Minerva’s mathematical abilities, as were many competitive forecasters. How can we be less surprised by developments in machine learning? Our brains...

Jun 7, 2023 · 185

Emergent Deception and Emergent Optimization

[Note: this post was drafted before Sydney (the Bing chatbot) was released, but Sydney demonstrates some particularly good examples of some of the issues I discuss below. I've therefore added a few Sydney-related notes in relevant places.] I’ve previously argued that machine learning systems often exhibit emergent capabilities, and that...

Feb 20, 2023 · 64

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

by RowanWang, Alexandre Variengien, Arthur Conmy, Buck, and jsteinhardt

To learn more about this work, check out the paper. We assume general familiarity with transformer circuits. Intro: There isn’t much interpretability work that explains end-to-end how a model is able to do some task (except for toy models). In this work, we make progress towards this goal by understanding...

Oct 28, 2022 · 102

Forecasting ML Benchmarks in 2023

Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post. Previously, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask) datasets. I argued that most people, including myself, significantly underestimated the...

Jul 18, 2022 · 36

Jacob Steinhardt

What will GPT-2030 look like?

More Is Different for AI

Scalable End-to-End Interpretability

Future ML Systems Will Be Qualitatively Different
