In collaboration with Scale AI, we are releasing MASK (Model Alignment between Statements and Knowledge), a benchmark with over 1,000 scenarios specifically designed to measure AI honesty. As AI systems grow more capable and autonomous, it is increasingly important to measure their propensity to lie to humans. Often, LLM developers...
I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google) and Alexandr Wang (Scale AI). Below is the executive summary, followed by additional commentary highlighting portions of the paper which might be relevant to this collection of readers.

Executive Summary

Rapid advances in AI are poised to reshape...
In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: “15 or 20 [years].” In light of this, we are...
Read the associated paper "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?": https://arxiv.org/abs/2407.21792

Focus on safety problems that aren’t solved with scale. Benchmarks are crucial in ML to operationalize the properties we want models to have (knowledge, reasoning, ethics, calibration, truthfulness, etc.). They act as a criterion to judge...
The UC Berkeley course I co-taught now has lecture videos available: https://www.youtube.com/playlist?list=PLJ66BAXN6D8H_gRQJGjmbnS5qCWoxJNfe Course site: Understanding LLMs: Foundations and Safety. Separately, the content of a more conceptual AI safety course is available at https://www.aisafetybook.com/
tl;dr A one-dimensional PCA projection of OpenAI's text-embedding-ada-002 achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable to the 74.6% accuracy of BERT-large finetuned on the entire ETHICS Util training dataset. This suggests that language models are developing implicit representations of human utility even without direct preference...
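The technique above — using the top principal component of embeddings as a one-dimensional utility score — can be sketched as follows. This is an illustrative toy, not the paper's code: the "embeddings" are synthetic stand-ins for text-embedding-ada-002 outputs, and the resulting accuracy does not reproduce the 73.7% figure.

```python
# Sketch: classify "utility" by projecting embeddings onto their first
# principal component. Synthetic data stands in for real embeddings.
import numpy as np

rng = np.random.default_rng(0)

# Fake embeddings: two classes separated along a hidden direction.
n, d = 200, 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)  # 1 = higher-utility scenario
emb = rng.normal(size=(n, d)) + np.outer(2.0 * (labels - 0.5), direction) * 3.0

# One-dimensional PCA projection: top right-singular vector of centered data.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[0]  # 1-D utility scores

# Fix the sign so class 1 gets higher scores, then threshold at zero
# (the projection of centered data has mean zero).
if np.corrcoef(scores, labels)[0, 1] < 0:
    scores = -scores
preds = (scores > 0).astype(int)
accuracy = (preds == labels).mean()
print(f"accuracy: {accuracy:.2f}")
```

The point of the sketch is that no supervised training happens: PCA is unsupervised, and the labels are only used to pick the sign of the axis, mirroring how a single projection of frozen embeddings can recover a utility-like dimension.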
We’ve recently published on our website a summary of our paper on catastrophic risks from AI, which we are cross-posting here. We hope this summary makes our research more accessible and shares our policy recommendations in a more convenient format. (Previously we had a smaller summary...