The Best of LessWrong

Once posts are more than a year old, the LessWrong community reviews them and votes on how well they have stood the test of time. These are the posts that have ranked highest across all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
TsviBT
Please don't throw your mind away
Raemon
Noticing Frame Differences
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Duncan Sabien (Deactivated)
Basics of Rationalist Discourse
alkjash
Prune
johnswentworth
Gears vs Behavior
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”
Raemon
Feedbackloop-first Rationality
LoganStrohl
Fucking Goddamn Basics of Rationalist Discourse
Raemon
Tuning your Cognitive Strategies
johnswentworth
Lessons On How To Get Things Right On The First Try

Optimization

So8res
Focus on the places where you feel shocked everyone's dropping the ball
Jameson Quinn
A voting theory primer for rationalists
sarahconstantin
The Pavlov Strategy
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
GeneSmith
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
HoldenKarnofsky
This Can't Go On
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Scott Alexander
Studies On Slack
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land
jenn
Things I Learned by Spending Five Thousand Hours In Non-EA Charities
Richard_Ngo
The ants and the grasshopper
So8res
Enemies vs Malefactors
Elizabeth
Change my mind: Veganism entails trade-offs, and health is one of the axes

World

Kaj_Sotala
Book summary: Unlocking the Emotional Brain
Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
johnswentworth
Transportation as a Constraint
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Bird Concept
Unconscious Economics
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
L Rudolf L
A Disneyland Without Children
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
zhukeepa
How uniform is the neocortex?
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
KatjaGrace
Elephant seal 2
Zvi
Book Review: Going Infinite
Rafael Harth
Why it's so hard to talk about Consciousness
Duncan Sabien (Deactivated)
Social Dark Matter
Eric Neyman
How much do you believe your results?
Malmesbury
The Talk: a brief explanation of sexual dimorphism
moridinamael
The Parable of the King and the Random Process
Henrik Karlsson
Cultivating a state of mind where new ideas are born

Practical

AI Strategy

paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects
Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
Eliezer Yudkowsky
Pausing AI Developments Isn't Enough. We Need to Shut it All Down
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
KatjaGrace
Counterarguments to the basic AI x-risk case
Adam Scholl
Safetywashing
habryka
AI Timelines
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Orpheus16
Speaking to Congressional staffers about AI risk
Tom Davidson
What a compute-centric framework says about AI takeoff speeds
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
Joe Carlsmith
Predictable updating about AI risk
Raemon
"Carefully Bootstrapped Alignment" is organizationally hard
KatjaGrace
We don’t trade with ants

Technical AI Safety

paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Buck
AI Control: Improving Safety Despite Intentional Subversion
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
evhub
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
paulfchristiano
The strategy-stealing assumption
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
Alex Flint
The ground of optimization
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
TurnTrout
Towards a New Impact Measure
LawrenceC
Natural Abstractions: Key Claims, Theorems, and Critiques
Zack_M_Davis
Alignment Implications of LLM Successes: a Debate in One Act
johnswentworth
Natural Latents: The Math
TurnTrout
Steering GPT-2-XL by adding an activation vector
Jessica Rumbelow
SolidGoldMagikarp (plus, prompt generation)
So8res
Deep Deceptiveness
Charbel-Raphaël
Davidad's Bold Plan for Alignment: An In-Depth Explanation
Charbel-Raphaël
Against Almost Every Theory of Impact of Interpretability
Joe Carlsmith
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Eliezer Yudkowsky
GPTs are Predictors, not Imitators
peterbarnett
Labs should be explicit about why they are building AGI
HoldenKarnofsky
Discussion with Nate Soares on a key alignment difficulty
Jesse Hoogland
Neural networks generalize because of this one weird trick
paulfchristiano
My views on “doom”
technicalities
Shallow review of live agendas in alignment & safety
Vanessa Kosoy
The Learning-Theoretic Agenda: Status 2023
ryan_greenblatt
Improving the Welfare of AIs: A Nearcasted Proposal
#2

AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.

15 · Adam Shimi

In "Why Read The Classics?", Italo Calvino proposes many different definitions of a classic work of literature, including this one:

For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn't mean I agree with everything in it, or that I don't think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had missed or forgotten some nice argument, some interesting takeaway.

With that, a caveat: I'm collaborating with Evan Hubinger (one of the authors) on projects related to ideas introduced in this sequence, especially to Deceptive Alignment. I am thus probably positively biased about this work. That being said, I have no problem saying I disagree with collaborators, so I don't think I'm too biased to write this review. (Small point: I, among other people, tend to describe this sequence/paper as mainly Evan's work, but he repeatedly told me that everyone participated equally, and that the names are in alphabetical order, not contribution order. So let's keep that in mind.)

Summary

Let's start the review proper with a post-by-post summary (except for the conclusion):

* (Introduction) This first post introduces the idea of mesa-optimizers, the learned optimizers from the title. A mesa-optimizer is an optimizer which is the result of a learning process, and it comes with the issue of inner alignment: how aligned is the objective of the mesa-optimizer (over which we don't have direct control) with the objective of the base-optimizer that produced this mesa-optimizer? The post then splits the safety questions related to mesa-optimizers into two categories: understanding which conditions make mesa-optimizers appear; and understanding how aligned the mesa-objective is with the base-objective.
* (Conditions for Mesa-Optimization) This post tackles the first category outlined in the introduction: how can m…
#10

Examining the concept of optimization, Abram Demski distinguishes between "selection" (like search algorithms that evaluate many options) and "control" (like thermostats or guided missiles). He explores how this distinction relates to ideas of agency and mesa-optimization, and considers various ways to define the difference. 
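
As a rough illustration of that distinction (my own sketch, not code from the post), here is a toy contrast between a selection process, which explicitly scores many candidates up front, and a control process, which only ever reacts to the current state:

```python
# Illustrative sketch: the same goal, "temperature near 20", reached by a
# selection process (evaluate many options, pick the best) and by a control
# process (correct the current state with feedback, no explicit search).
import random

def select_best(objective, candidates):
    """Selection: score every candidate explicitly and return the argmax."""
    return max(candidates, key=objective)

def thermostat_run(start_temp, target=20.0, steps=50):
    """Control: a thermostat never enumerates futures; it just corrects the error."""
    temp = start_temp
    for _ in range(steps):
        temp += 0.5 if temp < target else -0.5  # heat or cool based on feedback
        temp += random.uniform(-0.1, 0.1)       # environmental noise
    return temp

# Selection: pick the best of 1000 explicitly evaluated settings.
setting = select_best(lambda s: -abs(s - 20.0),
                      [random.uniform(10, 30) for _ in range(1000)])

print(f"selected setting: {setting:.2f}")
print(f"controlled temperature after 50 steps: {thermostat_run(5.0):.2f}")
```

The selection process needs the whole candidate set and the objective in hand before acting; the thermostat never represents any option other than the state it is currently in, yet its behavior still steers the system toward the target.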

19 · Adam Shimi

Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection) and external optimization (optimizing systems from Alex Flint's The ground of optimization). So I do without control, or at least without Abram's exact definition of control.

Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constrained by going back to Yudkowsky's optimization power). The big distinction is between doing internal search (like in optimization algorithms or mesa-optimizers) and acting so as to optimize something. It is intuitive that you can do the second without the first, but before Alex Flint's definition, I couldn't put into words my intuition that the first implies the second. So my current picture of optimization is Internal Optimization (Internal Search/Selection) ⊂ External Optimization (Optimizing Systems).

This means that I think of this post as one of the first instances of grappling with this distinction, without agreeing completely with the way it ends up making that distinction.
8 · johnswentworth

In a field like alignment or embedded agency, it's useful to keep a list of one or two dozen ideas which seem like they should fit neatly into a full theory, although it's not yet clear how. When working on a theoretical framework, you regularly revisit each of those ideas and think about how it fits in. Every once in a while, a piece will click, and another large chunk of the puzzle will come together.

Selection vs control is one of those ideas. It seems like it should fit neatly into a full theory, but it's not yet clear what that will look like. I revisit the idea pretty regularly (maybe once every 3-4 months) to see how it fits with my current thinking. It has not yet had its time, but I expect it will (that's why it's on the list, after all).

Bearing in mind that the puzzle piece has not yet properly clicked, here are some current thoughts on how it might connect to other pieces:

* Selection and control have different type signatures.
  * A selection process optimizes for the values of variables in some model, which may or may not correspond to anything in the real world. Human values seem to be like this - see Human Values Are A Function Of Humans' Latent Variables.
  * A control process, on the other hand, directly optimizes things in its environment. A thermostat, for instance, does not necessarily contain any model of the temperature a few minutes in the future; it just directly optimizes the value of the temperature a few minutes in the future.
* The post basically says it, but it's worth emphasizing: reinforcement learning is a control process, expected utility maximization is a selection process. The difference in type signatures between RL and EU maximization is the same as the difference in type signatures between selection and control.
* Inner and outer optimizers can have different type signatures: an outer controller (e.g. RL) can learn an inner selector (e.g. utility maximizer), or an outer selector (e.g. a human) can build an inner controller (e…
#17

Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.

But he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.
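
As a toy illustration of that tendency (my own Monte Carlo sketch, not the post's formal setup), consider a tiny deterministic environment where the first move either commits to a single dead-end outcome or moves to a hub from which three outcomes remain reachable. Sampling reward functions at random, the option-preserving move is optimal for most of them:

```python
# Toy Monte Carlo sketch: the first move either goes to a dead-end state D, or to
# a hub H from which the agent can still reach any of A, B, or C. Terminal reward
# is drawn independently per state. For most sampled reward functions, the optimal
# policy takes the "keep options open" move -- the power-seeking one.
import random

def optimal_first_move(rewards):
    """Return which first move is optimal under one sampled reward function."""
    value_deadend = rewards["D"]
    value_hub = max(rewards["A"], rewards["B"], rewards["C"])  # can still pick any
    return "hub" if value_hub > value_deadend else "deadend"

samples = 100_000
hub_wins = sum(
    optimal_first_move({s: random.random() for s in "ABCD"}) == "hub"
    for _ in range(samples)
)
print(f"'keep options open' is optimal in {hub_wins / samples:.1%} of reward draws")
# ~75%: the hub beats the dead-end whenever any one of A, B, C draws a higher
# reward than D, even though no reward function mentions "power" at all.
```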

26 · johnswentworth

This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.

I see two (related) central problems, from which various other symptoms follow:

1. POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
2. Unstructured MDPs are a bad model in which to formulate instrumental convergence. In particular, they are bad for building a gears-level understanding of what features of the environment give rise to convergence.

Some things I've thought a lot about over the past year seem particularly well-suited to address these problems, so I have a fair bit to say about them.

Why Unstructured MDPs Are A Bad Model For Instrumental Convergence

The basic problem with unstructured MDPs is that the entire world-state is a single, monolithic object. Some symptoms of this problem:

* it's hard to talk about "resources", which seem fairly central to instrumental convergence
* it's hard to talk about multiple agents competing for the same resources
* it's hard to talk about which parts of the world an agent controls/doesn't control
* it's hard to talk about which parts of the world agents do/don't care about
* ... indeed, it's hard to talk about the world having "parts" at all
* it's hard to talk about agents not competing, since there's only one monolithic world-state to control
* any action which changes the world at all changes the entire world-state; there's no built-in w…
6 · Alex Turner

One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility.

Since last December, I've spent several hundred hours expanding the formal results and rewriting the paper; I've generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not imply. For example, the main paper is 9 pages long; in Appendix B, I further dedicated 3.5 pages to exploring the nuances of the formal definition of 'power-seeking' (Definition 6.1).

However, there are a few things I wish I'd gotten right the first time around. Therefore, I've restructured and rewritten much of the post. Let's walk through some of the changes.

'Instrumentally convergent' replaced by 'robustly instrumental'

Like many good things, this terminological shift was prompted by a critique from Andrew Critch.

Roughly speaking, this work considered an action to be 'instrumentally convergent' if it's very probably optimal, with respect to a probability distribution on a set of reward functions. For the formal definition, see Definition 5.8 in the paper. This definition is natural. You can even find it echoed by Tony Zador in the Debate on Instrumental Convergence:

(Zador uses "set of scenarios" instead of "set of reward functions", but he is implicitly reasoning: "with respect to my beliefs about what kind of objective functions we will implement and what the agent will confront in deployment, I predict that deadly actions have a negligible probability of being optimal.")

While discussing this definition of 'instrumental convergence', Andrew asked me: "what, exactly, is doing the converging? There is no limiting process. Optimal policies just are."

It would be more appropriate to say that an ac…
#20

The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail. 

10 · Zvi
This post is even-handed and well-reasoned, and explains the issues involved well. The strategy-stealing assumption seems important, as a lot of predictions are inherently relying on it either being essentially true, or effectively false, and I think the assumption will often effectively be a crux in those disagreements, for reasons the post illustrates well. The weird thing is that Paul ends the post saying he thinks the assumption is mostly true, whereas I thought the post was persuasive that the assumption is mostly false. The post illustrates that the unaligned force is likely to have many strategic and tactical advantages over aligned forces, that should allow the unaligned force to, at a minimum, 'punch above its weight' in various ways even under close-to-ideal conditions. And after the events of 2020, and my resulting updates to my model of humans, I'm highly skeptical that we'll get close to ideal. Either way, I'm happy to include this.
#21

Impact measures may be a powerful safeguard for AI systems - one that doesn't require solving the full alignment problem. But what exactly is "impact", and how can we measure it?

#22

Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then improve it. Evan Hubinger explains the concept, reviews prior work, and discusses implications for AI alignment and understanding inductive biases.
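
Evan's post is about deep double descent specifically, but the model-wise version of the curve is easy to reproduce in a much simpler setting. Here is a minimal sketch (my own, assuming only numpy): minimum-norm least squares on random ReLU features, where test error typically spikes near the interpolation threshold (feature count ≈ number of training points) and then falls again as the model grows; exact numbers vary with the seed.

```python
# Minimal model-wise double descent sketch with random ReLU features.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 10
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

def relu_features(X, W):
    """Random ReLU features: fixed random projection followed by max(0, .)."""
    return np.maximum(X @ W, 0.0)

for n_features in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    Phi_train, Phi_test = relu_features(X_train, W), relu_features(X_test, W)
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features  test MSE {test_mse:8.3f}")
```

The "more knobs" intuition from the reviews below shows up here directly: past the interpolation threshold, the extra features let the minimum-norm solution stay simple instead of contorting a bare-minimum set of parameters to fit the noise.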

0 · Zvi
I've stepped back from thinking about ML and alignment the last few years, so I don't know how this fits into the discourse about it, but I felt like I got important insight here and I'd be excited to include this. The key concept that bigger models can be simpler seems very important.  In my words, I'd say that when you don't have enough knobs, you're forced to find ways for each knob to serve multiple purposes slash combine multiple things, which is messy and complex and can be highly arbitrary, whereas with lots of knobs you can do 'the thing you naturally actually want to do.' And once you get sufficiently powerful, the overfitting danger isn't getting any worse with the extra knobs, so sure, why not? I also strongly agree with orthonormal that including the follow-up as an addendum adds a lot to this post. If it's worth including this, it's worth including both, even if the follow-up wasn't also nominated. 
9 · orthonormal
If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.
#37

Gradient hacking is when a deceptively aligned AI deliberately acts to influence how the training process updates it. For example, it might try to become more brittle in ways that prevent its objective from being changed. This poses challenges for AI safety, as the AI might try to remove evidence of its deception during training.

13 · Adam Shimi

This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious, and if plausible, very dangerous. On the other hand, the presentation doesn't go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms of the way gradient hacking was initially stated, and explaining why I consider this problem so important.

(Caveat: I'm not pretending that any of my objections are unknown to Evan. We're working together on searching for experimental evidence of gradient hacking, and this review is heavily influenced by this collaboration.)

What is gradient hacking?

Evan defines it as:

So we have a trained mesa-optimizer that is deceptively aligned (its mesa-objective differs from the base-objective it is trained on). This deceptive system finds a way to influence the gradient generated during training, and thus can basically ensure it never becomes inner aligned (among other things).

Before checking how exactly this could be possible, we should think a bit more about what this implies. If the deceptive model has arbitrary access to the gradient, then from that point on, the base-objective has only minimal influence on the training. Some influence remains because we probably want to test the end result, and thus it should work well with the base-objective. But that's pretty much the only constraint left. It could also pretty much deal with deception detectors, because it can make itself not detectable:

To put it pithily: if gradient hacking happens, we're fucked.

How could it happen, though? Well, that's actually two distinct questions: how could a model gradient hack, and how could training create a model which gradient hacks. The post mostly focuses on the first one.

How could a model gradient hack?

The first example comes from a quoted footnote of Risks from Learned Optimization:

This im…
#43

So we're talking about how to make good decisions, or the idea of 'bounded rationality', or what sufficiently advanced Artificial Intelligences might be like; and somebody starts dragging up the concepts of 'expected utility' or 'utility functions'.

And before we even ask what those are, we might first ask, Why?

39 · johnswentworth

Things To Take Away From The Essay

First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It's debatable whether it should be considered a coherence theorem at all.

Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who've only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay:

* the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model
* the second argues that two of the VNM axioms are unrealistic

I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.

So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage's theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent's implied probabilities are instead derived. Yudkowsky's essay does a good job communicating these concepts, but doesn't emphasize that this is different from VNM.

One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet…
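
As a concrete instance of the "throws away resources" horn (my own toy example; the items, fee, and utilities are made up for illustration): an agent with cyclic preferences will pay a small fee for every trade it sees as an upgrade and can be walked around the cycle indefinitely, while an agent ranking outcomes by any fixed utility function stops after at most a couple of genuinely improving trades.

```python
# Toy money-pump: cyclic preferences leak money forever; a fixed utility doesn't.
FEE = 0.01
cyclic_prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x preferred to y

def cyclic_agent(offer, held):
    """Prefers A over B, B over C, and C over A -- no consistent utility function."""
    return (offer, held) in cyclic_prefers

UTILITY = {"A": 3, "B": 2, "C": 1}

def coherent_agent(offer, held):
    """Accepts a trade only if it strictly increases a fixed utility."""
    return UTILITY[offer] > UTILITY[held]

def run_trades(accepts, holding="C", rounds=9):
    """Each round, offer the item the cyclic agent would see as an upgrade; count fees."""
    next_offer = {"C": "B", "B": "A", "A": "C"}
    fees_paid = 0.0
    for _ in range(rounds):
        offer = next_offer[holding]
        if accepts(offer, holding):
            fees_paid += FEE
            holding = offer
    return fees_paid

print(f"cyclic agent pays   {run_trades(cyclic_agent):.2f}")    # keeps paying every round
print(f"coherent agent pays {run_trades(coherent_agent):.2f}")  # pays for 2 real upgrades, then stops
```
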
#44

A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks should you buy to maximize productivity?"

0 · magfrump
I want to have this post in a physical book so that I can easily reference it. It might actually work better as a standalone pamphlet, though. 
4 · Adam Shimi

How do you review a post that was not written for you? I'm already doing research in AI Alignment, and I don't plan on creating a group of collaborators for the moment. Still, I found some parts of this useful. Maybe that's how you do it: by taking different profiles, and running through the most useful advice for each profile from the post. Let's do that.

Full-time researcher (no team or MIRIx chapter)

For this profile (which is mine, by the way), the most useful piece of advice from this post comes from the model of transmitters and receivers. I'm convinced that I've been using it intuitively for years, but having an explicit model is definitely a plus when trying to debug a specific situation, or to explain how it works to someone less used to thinking like that.

Full-time researcher who wants to build a team/MIRIx chapter

Obviously, this profile benefits from the great advice on building a research group. I would expect someone with this profile to understand the social dynamics part relatively well, so the most useful advice is probably the detailed logistics of getting such a group off the ground. I also believe that the escalating asks and rewards is a less obvious social dynamic to take into account.

Aspiring researcher (no team or MIRIx chapter)

The section "You and your research" was probably written with this profile in mind. It tries to push towards exploration instead of exploitation, babble instead of prune. And for so many people that I know who feel obligated to understand everything before toying with a question, this is the prescribed medicine. I want to push back just a little on the "follow your curiosity" vibe, as I believe that there are ways to check how promising the current ideas are for AI Alignment. But I definitely understand that the audience is more "wannabe researchers stifled by their internal editor", so pushing for curiosity and exploration makes sense.

Aspiring researcher who wants to build a team/MIRIx chapter

In additio…