The Best of LessWrong

When posts turn more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked the highest for all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
alkjash
Prune
johnswentworth
Gears vs Behavior
Raemon
Noticing Frame Differences
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”

Optimization

sarahconstantin
The Pavlov Strategy
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
HoldenKarnofsky
This Can't Go On
Scott Alexander
Studies On Slack
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Jameson Quinn
A voting theory primer for rationalists
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land

World

Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
johnswentworth
Transportation as a Constraint
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
Bird Concept
Unconscious Economics
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
zhukeepa
How uniform is the neocortex?
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
KatjaGrace
Elephant seal 2
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
Kaj_Sotala
Book summary: Unlocking the Emotional Brain

AI Strategy

Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
KatjaGrace
Counterarguments to the basic AI x-risk case
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
Adam Scholl
Safetywashing
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects

Technical AI Safety

Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
paulfchristiano
The strategy-stealing assumption
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
Alex Flint
The ground of optimization
paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
TurnTrout
Towards a New Impact Measure
#4

ARC explores the challenge of extracting information from AI systems that isn't directly observable in their outputs, i.e., "eliciting latent knowledge." They present a hypothetical AI-controlled security system to demonstrate how relying solely on visible outcomes can lead to deceptive or harmful results. The authors argue that developing methods to reveal an AI's full understanding of a situation is crucial for ensuring the safety and reliability of advanced AI systems.

7 [anonymous]
ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.
Things about ELK that I benefited from
Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and presenting counterexamples to those strategies. This style of thinking is straightforward and elegant, and I think the examples in the report helped me (and others) understand ARC’s general style of thinking.
Understanding the alignment problem. ELK presents alignment problems in a very “show, don’t tell” fashion. While many of the problems introduced in ELK have been written about elsewhere, ELK forces you to think through the reasons why your training strategy might produce a dishonest agent (the human simulator) as opposed to an honest agent (the direct translator). The interactive format helped me more deeply understand some of the ways in which alignment is difficult.
Common language & a shared culture. ELK gave people a concrete problem to work on. A whole subculture emerged around ELK, with many junior alignment researchers using it as their first opportunity to test their fit for theoretical alignment research. There were weekend retreats focused on ELK. It was one of the main topics that people were discussing from Jan-Feb 2022. People shared their training strategy ideas over lunch and dinner. It’s difficult to know for sure what kind of effect this had on the community as a whole. But at least for me, my current best-guess is that this shared culture helped me understand alignment, increased the amount of time I spent thinking/talking about alignment, and helped me connect with peers/collaborators who we
5 Matthew "Vaniver" Gray
I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points. I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development of the AI safety community, I think this is one of the most important posts from 2021 (even tho the prize and much of the related work happened in 2022). I don't really need to ask for follow-on work; there's already tons, as you can see from the ELK tag. I think it is maybe underappreciated by the broad audience how much this is an old problem, and I appreciate the appendix that gives credit to earlier thinking, while thinking this doesn't erode any of the credit Paul, Mark, and Ajeya should get for the excellent packaging. [To the best of my knowledge, ELK is still an open problem, and one of the things that I appreciated about the significant focus on ELK specifically was helping give people better models of how quickly progress happens in this space, and what it looks like (or doesn't look like).]
#10

Nate Soares moderates a long conversation between Richard Ngo and Eliezer Yudkowsky on AI alignment. The two discuss topics like "consequentialism" as a necessary part of strong intelligence, the difficulty of alignment, and potential pivotal acts to address existential risk from advanced AI. 

30 Oliver Habryka
I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall. A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off.
Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment becomes much harder, and a lot of things that people currently label "AI Alignment" kind of stop feeling real, and I have this feeling that even though a really quite substantial fraction of the people I talk to about AI Alignment are compelled by Eliezer's argument for difficulty, that there is some kind of structural reason that AI Alignment as a field can't really track these arguments.
Like, a lot of people's jobs and funding rely on these arguments being false, and also, if these arguments are correct, the space of perspectives on the problem suddenly loses a lot of common ground on how to proceed or what to do, and it isn't really obvious that you even want an "AI Alignment field" or lots of "AI Alignment research organizations" or "AI Alignment student groups". Like, because we don't know how to solve this problem, it really isn't clear what the right type of social organization is, and there aren't obviously great gains from trade, and so from a coalition perspective, you don't get a coalition of people who think these arguments are real.
I feel deeply confused about this. Over the last two years, I think I wrongly ended up just kind of investing into an ecosystem of people that somewhat structurally can't really handle these arguments, and makes plans that assume that these arguments are false, and in doing so actually mostly makes the world worse, by having a far too optimistic stance on the differen
#25

What's the type signature of an agent? John Wentworth proposes Selection Theorems as a way to explore this question. Selection Theorems tell us what agent type signatures will be selected for in broad classes of environments. This post outlines the concept and how to work on it.

8 Cinera Verinia
Epistemic Status: I am an aspiring selection theorist and I have thoughts.
----------------------------------------
Why Selection Theorems?
Learning about selection theorems was very exciting. It's one of those concepts that felt so obviously right. A missing component in my alignment ontology that just clicked and made everything stronger.
Selection Theorems as a Compelling Agent Foundations Paradigm
There are many reasons to be sympathetic to agent foundations style safety research as it most directly engages the hard problems/core confusions of alignment/safety. However, one concern with agent foundations research is that we might build sky high abstraction ladders that grow increasingly disconnected from reality. Abstractions that don't quite describe the AI systems we deal with in practice. I think that in presenting this post, Wentworth successfully sidestepped the problem. He presented an intuitive story for why the Selection Theorems paradigm would be fruitful; it's general enough to describe many paradigms of AI system development, yet concrete enough to say nontrivial/interesting things about the properties of AI systems (including properties that bear on their safety). Wentworth presents a few examples of extant selection theorems (most notably the coherence theorems) and later argues that selection theorems have a lot of research "surface area" and new researchers could be onboarded (relatively) quickly. He also outlined concrete steps people interested in selection theorems could take to contribute to the program. Overall, I found this presentation of the case for selection theorems research convincing. I think that selection theorems provide a solid framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development of artificial intelligence. That is regardless of what
#27

Paul Christiano describes his research methodology for AI alignment. He focuses on trying to develop algorithms that can work "in the worst case" - i.e. algorithms for which we can't tell any plausible story about how they could lead to egregious misalignment. He alternates between proposing alignment algorithms and trying to think of ways they could fail.

6 Ben Pace
Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and gets detailed on just one idea for 10x the space thus communicating less of the big picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Eliciting Latent Knowledge.
Overview of why:
* The motivation behind most of the proposals Paul has spent a lot of time on (iterated amplification, imitative generalization) is explained clearly and succinctly.
  * For a quick summary, this involves
    * A proposal for useful ML-systems designed with human feedback
    * An argument that the human-feedback ML-systems will have flaws that kill you
    * A proposal for using ML assistants to debug the original ML system
    * An argument that the ML systems will not be able to understand the original human-feedback ML-systems
    * A proposal for training the human-feedback ML-systems in a way that requires understandability
    * An argument that this proposal is uncompetitive
    * ??? (I believe the proposals in the ELK document are the next step here)
* A key problem when evaluating very high-IQ, impressive, technical work, is that it is unclear which parts of the work you do not understand because you do not understand an abstract technical concept, and which parts are simply judgment calls based on the originator of the idea. This post shows very clearly which is which: many of the examples and discussions are technical, but the standard for "plausible failure story" and "sufficiently precise algorithm" and "sufficiently doomed" are all judgment calls, as are the proposed solutions. I'm not even sure I get on the bus at step 1, that the right next step is to consider ML
#39

The RL algorithm "EfficientZero" achieves better-than-human performance on Atari games after only 2 hours of gameplay experience. This seems like a major advance in sample efficiency for reinforcement learning. The post breaks down how EfficientZero works and what its success might mean.

6 1a3orn
I remain pretty happy with most of this, looking back -- I think this remains clear, accessible, and about as truthful as possible without getting too technical. I do want to grade my conclusions / predictions, though.
(1). I predicted that this work would quickly be exceeded in sample efficiency. This was wrong -- it's been a bit over a year and EfficientZero is still SOTA on Atari. My 3-to-24-month timeframe hasn't run out, but I said that I expected "at least a 25% gain" towards the start of the time, which hasn't happened.
(2). There has been a shift to multitask domains, or to multi-benchmark papers. This wasn't too hard of a prediction, but I think it was correct. (Although of course good evidence for such a shift would require a comprehensive lit review.) To sample two -- DreamerV3 is a very recently released model-based DeepMind algorithm. It does very well at Atari100k -- it gets a better mean score than everything but EfficientZero -- but it also does well at DMLab + 4 other benchmarks + even crafting a Minecraft diamond. The paper emphasizes the robustness of the algorithm, and is right to do so -- once you get human-level sample efficiency on Atari100k, you really want to make sure you aren't just overfitting to that! And of course the infamous Gato is a multitask agent across a host of different domains, although the ultimate impact of it remains unclear at the moment.
(3). And finally -- well, the last conclusion, that there is still a lot of space for big gains in performance in RL even without field-overturning new insights, is inevitably subjective. But I think the evidence still supports it.
6 Cinera Verinia
Epistemic Status: I don't actually know anything about machine learning or reinforcement learning and I'm just following your reasoning/explanation.
This does not actually follow. Policies return probability distributions over actions ("strategies"), and it's not necessarily the case that the output of the optimal policy in the current state is a pure strategy. Mixed strategies are especially important and may be optimal in multi agent environments (a pure Nash equilibrium may not exist, but a mixed Nash equilibrium is guaranteed to exist).
Though maybe for single player decision making, optimal play is never mixed strategies? For any mixed strategy, there may exist an action in that strategy's support (set of actions that the strategy assigns positive probability to) that has an expected return that is not lower than the strategy itself? I think this may be the case for deterministic environments, but I'm too tired to work out the maths right now.
IIRC randomised choice is mostly useful in multi-agent environments, environments where the environment has free variables in its transition rules that may be sensitive to your actions (i.e. the environment itself can be profitably modelled as an agent [where the state transitions are its actions]), or is otherwise non deterministic/stochastic (including stochastic behaviour that arises from uncertainty).
So I think greedy search for the action that attains the highest value for the optimal policy's action value function is only equivalent to the optimal policy if the environment is:
* Deterministic
* Fully observable/the agent has perfect information
* Agent knows all the "laws of physics"/state transition rules of the environment
* Fixed low level state transitions that do not depend on agent
(I may be missing some other criteria necessary to completely obviate mixed strategies.)
I think these conditions are actually quite strong!
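The two halves of the review above (pure strategies suffice when action values are fixed; mixing can matter against an adversary) can be checked on toy numbers. This is only a rough sketch in Python: the action values `q`, the mixed strategy `mixed`, and the use of matching pennies as the adversarial example are all illustrative assumptions, not taken from the post or from EfficientZero.

```python
def mixture_value(q_values, probs):
    """Expected return of a mixed strategy, given fixed per-action returns."""
    return sum(p * q for p, q in zip(probs, q_values))


def best_action_in_support(q_values, probs):
    """Index of the highest-return action that the strategy plays with p > 0."""
    support = [i for i, p in enumerate(probs) if p > 0]
    return max(support, key=lambda i: q_values[i])


# Single-agent case with fixed (hypothetical) action values for one state.
# A mixture's value is a weighted average of its support, so some action
# in the support always does at least as well as the mixture itself.
q = [1.0, 3.0, 2.0]
mixed = [0.2, 0.5, 0.3]
best = best_action_in_support(q, mixed)
assert q[best] >= mixture_value(q, mixed)

# Adversarial case: matching pennies, payoffs for the row player.
# (Hypothetical stand-in for "the environment can be modelled as an agent".)
PAYOFF = [[1, -1],
          [-1, 1]]


def worst_case_value(row_probs):
    """Row player's expected payoff against a best-responding column player."""
    return min(
        sum(row_probs[r] * PAYOFF[r][c] for r in range(2))
        for c in range(2)
    )


print(worst_case_value([1.0, 0.0]))  # pure strategy: -1.0
print(worst_case_value([0.5, 0.5]))  # uniform mixture: 0.0
```

In the fixed-values case the greedy choice never loses to a mixture, while in matching pennies every pure strategy is fully exploitable and only the mixture guarantees 0. That is consistent with the review's guess that greedy action selection matches the optimal policy only under fairly strong conditions on the environment.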