AI ALIGNMENT FORUM

The Best of LessWrong

Once posts are more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked the highest for all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
TsviBT
Please don't throw your mind away
Raemon
Noticing Frame Differences
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Inactive)
Lies, Damn Lies, and Fabricated Options
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Inactive)
Split and Commit
Duncan Sabien (Inactive)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Inactive)
Shoulder Advisors 101
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Inactive)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Duncan Sabien (Inactive)
Basics of Rationalist Discourse
alkjash
Prune
johnswentworth
Gears vs Behavior
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
Duncan Sabien (Inactive)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”
Raemon
Feedbackloop-first Rationality
LoganStrohl
Fucking Goddamn Basics of Rationalist Discourse
Raemon
Tuning your Cognitive Strategies
johnswentworth
Lessons On How To Get Things Right On The First Try

Optimization

So8res
Focus on the places where you feel shocked everyone's dropping the ball
Jameson Quinn
A voting theory primer for rationalists
sarahconstantin
The Pavlov Strategy
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
GeneSmith
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
HoldenKarnofsky
This Can't Go On
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Scott Alexander
Studies On Slack
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land
jenn
Things I Learned by Spending Five Thousand Hours In Non-EA Charities
Richard_Ngo
The ants and the grasshopper
So8res
Enemies vs Malefactors
Elizabeth
Change my mind: Veganism entails trade-offs, and health is one of the axes

World

Kaj_Sotala
Book summary: Unlocking the Emotional Brain
Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
johnswentworth
Transportation as a Constraint
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Bird Concept
Unconscious Economics
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
L Rudolf L
A Disneyland Without Children
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
zhukeepa
How uniform is the neocortex?
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
KatjaGrace
Elephant seal 2
Zvi
Book Review: Going Infinite
Rafael Harth
Why it's so hard to talk about Consciousness
Duncan Sabien (Inactive)
Social Dark Matter
Eric Neyman
How much do you believe your results?
Malmesbury
The Talk: a brief explanation of sexual dimorphism
moridinamael
The Parable of the King and the Random Process
Henrik Karlsson
Cultivating a state of mind where new ideas are born

Practical

alkjash
Pain is not the unit of Effort
benkuhn
Staring into the abyss as a core life skill
Unreal
Rest Days vs Recovery Days
Duncan Sabien (Inactive)
In My Culture
juliawise
Notes from "Don't Shoot the Dog"
Elizabeth
Luck based medicine: my resentful story of becoming a medical miracle
johnswentworth
How To Write Quickly While Maintaining Epistemic Rigor
Duncan Sabien (Inactive)
Ruling Out Everything Else
johnswentworth
Paper-Reading for Gears
Elizabeth
Butterfly Ideas
Eliezer Yudkowsky
Your Cheerful Price
benkuhn
To listen well, get curious
Wei Dai
Forum participation as a research strategy
HoldenKarnofsky
Useful Vices for Wicked Problems
pjeby
The Curse Of The Counterfactual
Darmani
Leaky Delegation: You are not a Commodity
Adam Zerner
Losing the root for the tree
chanamessinger
The Onion Test for Personal and Institutional Honesty
Raemon
You Get About Five Words
HoldenKarnofsky
Learning By Writing
GeneSmith
How to have Polygenically Screened Children
AnnaSalamon
“PR” is corrosive; “reputation” is not.
Ruby
Do you fear the rock or the hard place?
johnswentworth
Slack Has Positive Externalities For Groups
Raemon
Limerence Messes Up Your Rationality Real Bad, Yo
mingyuan
Cryonics signup guide #1: Overview
catherio
microCOVID.org: A tool to estimate COVID risk from common activities
Valentine
Noticing the Taste of Lotus
orthonormal
The Loudest Alarm Is Probably False
Raemon
"Can you keep this confidential? How do you know?"
mingyuan
Guide to rationalist interior decorating
Screwtape
Loudly Give Up, Don't Quietly Fade

AI Strategy

paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects
Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
Eliezer Yudkowsky
Pausing AI Developments Isn't Enough. We Need to Shut it All Down
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
KatjaGrace
Counterarguments to the basic AI x-risk case
Adam Scholl
Safetywashing
habryka
AI Timelines
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Orpheus16
Speaking to Congressional staffers about AI risk
Tom Davidson
What a compute-centric framework says about AI takeoff speeds
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
Joe Carlsmith
Predictable updating about AI risk
Raemon
"Carefully Bootstrapped Alignment" is organizationally hard
KatjaGrace
We don’t trade with ants

Technical AI Safety

paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Buck
AI Control: Improving Safety Despite Intentional Subversion
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
evhub
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
paulfchristiano
The strategy-stealing assumption
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
Alex Flint
The ground of optimization
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
TurnTrout
Towards a New Impact Measure
LawrenceC
Natural Abstractions: Key Claims, Theorems, and Critiques
Zack_M_Davis
Alignment Implications of LLM Successes: a Debate in One Act
johnswentworth
Natural Latents: The Math
TurnTrout
Steering GPT-2-XL by adding an activation vector
Jessica Rumbelow
SolidGoldMagikarp (plus, prompt generation)
So8res
Deep Deceptiveness
Charbel-Raphaël
Davidad's Bold Plan for Alignment: An In-Depth Explanation
Charbel-Raphaël
Against Almost Every Theory of Impact of Interpretability
Joe Carlsmith
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Eliezer Yudkowsky
GPTs are Predictors, not Imitators
peterbarnett
Labs should be explicit about why they are building AGI
HoldenKarnofsky
Discussion with Nate Soares on a key alignment difficulty
Jesse Hoogland
Neural networks generalize because of this one weird trick
paulfchristiano
My views on “doom”
technicalities
Shallow review of live agendas in alignment & safety
Vanessa Kosoy
The Learning-Theoretic Agenda: Status 2023
ryan_greenblatt
Improving the Welfare of AIs: A Nearcasted Proposal
#1
Embedded Agents

How does it work to optimize for realistic goals in physical environments of which you yourself are a part? E.g. humans and robots in the real world, and not humans and AIs playing video games in virtual worlds where the player is not part of the environment. The authors claim we don't actually have a good theoretical understanding of this, and explore four specific ways that we don't understand this process.

by abramdemski
#1
AGI Ruin: A List of Lethalities

A few dozen reasons that Eliezer thinks AGI alignment is an extremely difficult problem, which humanity is not on track to solve.

by Eliezer Yudkowsky
#1
AI Control: Improving Safety Despite Intentional Subversion

As LLMs become more powerful, it'll be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose. However, they have not evaluated whether such techniques still ensure safety if the model itself is intentionally trying to subvert them. This paper develops and evaluates pipelines of safety protocols that remain robust to intentional subversion.

by Buck
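
To make the flavor of these protocols concrete, here is a minimal, hypothetical sketch of a "trusted monitoring"-style pipeline in the spirit of the paper; the model objects, suspicion scale, and threshold are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a trusted-monitoring pipeline (illustrative only):
# a powerful but untrusted model proposes a solution, a weaker trusted model
# rates how suspicious it looks, and the most suspicious outputs are either
# audited (limited budget) or replaced with the trusted model's own attempt.

SUSPICION_THRESHOLD = 8  # assumed 0-10 suspicion scale; threshold is made up

def solve_with_trusted_monitoring(task, untrusted_model, trusted_model,
                                  audit, audits_remaining):
    candidate = untrusted_model.solve(task)
    suspicion = trusted_model.rate_suspicion(task, candidate)

    if suspicion >= SUSPICION_THRESHOLD:
        if audits_remaining > 0:
            # Spend scarce human oversight on the most suspicious outputs.
            return audit(task, candidate), audits_remaining - 1
        # Out of audit budget: fall back to the weaker but trusted model.
        return trusted_model.solve(task), audits_remaining
    return candidate, audits_remaining
```
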

#2
The Rocket Alignment Problem

This post is a not-so-secret analogy for the AI alignment problem. Via a fictional dialogue, Eliezer explores and counters common objections to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.

MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."

by Eliezer Yudkowsky
#2
Risks from Learned Optimization: Introduction

AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.

by evhub
#2
An overview of 11 proposals for building safe advanced AI

A collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature out there laying out various different approaches, but a lot of that literature focuses primarily on outer alignment at the expense of inner alignment and doesn't provide direct comparisons between approaches. 

by evhub
#3
Where I agree and disagree with Eliezer

Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues Eliezer has raised many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.

by paulfchristiano
#4
ARC's first technical report: Eliciting Latent Knowledge

ARC explores the challenge of extracting information from AI systems that isn't directly observable in their outputs, i.e. "eliciting latent knowledge." They present a hypothetical AI-controlled security system to demonstrate how relying solely on visible outcomes can lead to deceptive or harmful results. The authors argue that developing methods to reveal an AI's full understanding of a situation is crucial for ensuring the safety and reliability of advanced AI systems.

by paulfchristiano
#5
Alignment By Default

What if we don't need to solve AI alignment? What if AI systems will just naturally learn human values as they get more capable? John Wentworth explores this possibility, giving it about a 10% chance of working. The key idea is that human values may be a "natural abstraction" that powerful AI systems learn by default.

by johnswentworth
#5
Reward is not the optimization target

TurnTrout discusses a common misconception in reinforcement learning: that reward is the optimization target of trained agents. He argues reward is better understood as a mechanism for shaping cognition, not a goal to be optimized, and that this has major implications for AI alignment approaches. 

by TurnTrout
#6
The Solomonoff Prior is Malign

The Solomonoff prior is a mathematical formalization of Occam's razor. It's intended to provide a way to assign probabilities to observations based on their simplicity. However, the simplest programs that predict observations well might be universes containing intelligent agents trying to influence the predictions. This makes the Solomonoff prior "malign" - its predictions are influenced by the preferences of simulated beings. 

by Mark Xu
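
For reference (not from the post), the Solomonoff prior is commonly written as a sum over programs for a universal prefix machine U:

```latex
M(x) \;=\; \sum_{p \,:\, U(p)\ \text{outputs a string beginning with}\ x} 2^{-\ell(p)}
```

where ℓ(p) is the length of program p in bits, so shorter (simpler) programs dominate the sum. The worry in the post is that some of those short programs simulate whole universes of agents who can shape what the sum predicts.
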
#9
The ground of optimization

An optimizing system is a physically closed system containing both that which is being optimized and that which is doing the optimizing, and defined by a tendency to evolve from a broad basin of attraction towards a small set of target configurations despite perturbations to the system. 

by Alex Flint
#10
Selection vs Control

Examining the concept of optimization, Abram Demski distinguishes between "selection" (like search algorithms that evaluate many options) and "control" (like thermostats or guided missiles). He explores how this distinction relates to ideas of agency and mesa-optimization, and considers various ways to define the difference. 

by abramdemski
#10
Ngo and Yudkowsky on alignment difficulty

Nate Soares moderates a long conversation between Richard Ngo and Eliezer Yudkowsky on AI alignment. The two discuss topics like "consequentialism" as a necessary part of strong intelligence, the difficulty of alignment, and potential pivotal acts to address existential risk from advanced AI. 

by Eliezer Yudkowsky
#13
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Human values are functions of latent variables in our minds. But those variables may not correspond to anything in the real world. How can an AI optimize for our values if it doesn't know what our mental variables are "pointing to" in reality? This is the Pointers Problem - a key conceptual barrier to AI alignment. 

by johnswentworth
#13
Inner and outer alignment decompose one hard problem into two extremely hard problems

Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. The author contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex argues that approaches based on "robust grading" schemes are unlikely to succeed at AI alignment.

by TurnTrout
#13
Natural Abstractions: Key Claims, Theorems, and Critiques

Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.

by LawrenceC
#14
Coherence arguments do not entail goal-directed behavior

Rohin Shah argues that many common arguments for AI risk (about the perils of powerful expected utility maximizers) are actually arguments about goal-directed behavior or explicit reward maximization, which are not actually implied by coherence arguments. An AI system could be an expected utility maximizer without being goal-directed or an explicit reward maximizer.

by Rohin Shah
#14
On how various plans miss the hard bits of the alignment challenge

Nate Soares reviews a dozen plans and proposals for making AI go well. He finds that almost none of them grapple with what he considers the core problem - capabilities will suddenly generalize way past training, but alignment won't.

by So8res
#14
Alignment Implications of LLM Successes: a Debate in One Act

Having become frustrated with the state of the discourse about AI catastrophe, Zack Davis writes both sides of the debate, with back-and-forth takes between Simplicia and Doomimir that hope to spell out stronger arguments from both sides.

by Zack_M_Davis
#15
Inaccessible information

AI researcher Paul Christiano discusses the problem of "inaccessible information" - information that AI systems might know but that we can't easily access or verify. He argues this could be a key obstacle in AI alignment, as AIs may be able to use inaccessible knowledge to pursue goals that conflict with human interests.

by paulfchristiano
#15
Simulators

This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.

by janus
#15
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Evan et al argue for developing "model organisms of misalignment" - AI systems deliberately designed to exhibit concerning behaviors like deception or reward hacking. This would provide concrete examples to study potential AI safety issues and test mitigation strategies. The authors believe this research is timely and could help build scientific consensus around AI risks to inform policy discussions.

by evhub
#16
Robustness to Scale

You want your proposal for an AI to be robust to changes in its level of capabilities. It should be robust to the AI's capabilities scaling up, and also scaling down, and also the subcomponents of the AI scaling relative to each other. 

We might need to build AGIs that aren't robust to scale, but if so we should at least realize that we are doing that.

by Scott Garrabrant
#16
Natural Latents: The Math

John Wentworth explains natural latents – a key mathematical concept in his approach to natural abstraction. Natural latents capture the "shared information" between different parts of a system in a provably optimal way. This post lays out the formal definitions and key theorems.

by johnswentworth
#17
Seeking Power is Often Convergently Instrumental in MDPs

Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.

But, he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.

by TurnTrout
#17
Steering GPT-2-XL by adding an activation vector

Alex Turner and collaborators show that you can modify GPT-2's behavior in surprising and interesting ways by just adding activation vectors to its forward pass. This technique requires no fine-tuning and allows fast, targeted modifications to model behavior. 

by TurnTrout
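
A minimal sketch of the general technique (activation addition via a forward hook), not the authors' code; the model size, layer index, coefficient, and contrast prompts below are arbitrary assumptions for illustration.

```python
# Minimal activation-addition sketch (not the authors' code): compute a
# steering vector as the difference between two prompts' residual-stream
# activations, then add it into the forward pass of a later prompt via a hook.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")   # small stand-in for GPT-2-XL
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6     # which block to intervene on (assumption)
COEFF = 4.0   # scaling coefficient for the steering vector (assumption)

def acts_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1]   # hidden_states[0] is the embedding output

# Steering vector from a contrast pair, taken at the last token position.
steer = acts_at_layer(" Love")[:, -1, :] - acts_at_layer(" Hate")[:, -1, :]

def add_steering(module, inputs, output):
    hidden = output[0] + COEFF * steer    # GPT-2 blocks return a tuple
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I think that you are", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20, do_sample=True)[0]))
handle.remove()
```
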
#18
Inner Alignment: Explain like I'm 12 Edition

Inner alignment refers to the problem of aligning a machine learning model's internal goals (mesa-objective) with the intended goals we are optimizing for externally (base objective). Even if we specify the right base objective, the model may develop its own misaligned mesa-objective through the training process. This poses challenges for AI safety. 

by Rafael Harth
#18
SolidGoldMagikarp (plus, prompt generation)

Researchers have discovered a set of "glitch tokens" that cause ChatGPT and other language models to produce bizarre, erratic, and sometimes inappropriate outputs. These tokens seem to break the models in unpredictable ways, leading to hallucinations, evasions, and other strange behaviors when the AI is asked to repeat them.

by Jessica Rumbelow
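
A hedged sketch of one way candidate glitch tokens can be surfaced, using GPT-2's public embeddings as a stand-in: tokens that were rarely or never updated during training tend to sit unusually close to the mean of the embedding matrix.

```python
# Look for tokens whose embeddings are anomalously close to the embedding
# centroid; such under-trained tokens are candidate "glitch tokens".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

emb = model.transformer.wte.weight.detach()   # (vocab_size, d_model)
centroid = emb.mean(dim=0)
dist = (emb - centroid).norm(dim=1)           # each token's distance from the centroid

nearest = torch.argsort(dist)[:20]            # 20 most central (most suspect) tokens
print([tok.decode([i]) for i in nearest.tolist()])
```
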
#20
Paul's research agenda FAQ

Alex Zhu spent quite a while understanding Paul's Iterated Amplification and Distillation agenda. He's written an in-depth FAQ, covering key concepts like amplification, distillation, corrigibility, and how the approach aims to create safe and capable AI assistants.

by zhukeepa
#20
The strategy-stealing assumption

The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail. 

by paulfchristiano
#21
An Untrollable Mathematician Illustrated

A hand-drawn presentation on the idea of an 'Untrollable Mathematician' - a mathematical agent that can't be manipulated into believing false things. 

by abramdemski
#21
Reframing Impact

Impact measures may be a powerful safeguard for AI systems - one that doesn't require solving the full alignment problem. But what exactly is "impact", and how can we measure it?

by TurnTrout
#21
Deep Deceptiveness

There are some obvious ways you might try to train deceptiveness out of AIs. But deceptiveness can emerge from the recombination of non-deceptive cognitive patterns. As AI systems become more capable, they may find novel ways to be deceptive that weren't anticipated or trained against. The problem is that, in the underlying territory, "deceive the humans" is just very useful for accomplishing goals.

by So8res
#22
Understanding “Deep Double Descent”

Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then improve it. Evan Hubinger explains the concept, reviews prior work, and discusses implications for AI alignment and understanding inductive biases.

by evhub
#23
Specification gaming examples in AI

A collection of examples of AI systems "gaming" their specifications - finding ways to achieve their stated objectives that don't actually solve the intended problem. These illustrate the challenge of properly specifying goals for AI systems.

by Vika
#23
An Orthodox Case Against Utility Functions

Abram argues against assuming that rational agents have utility functions over worlds (which he calls the "reductive utility" view). Instead, he points out that you can have a perfectly valid decision theory where agents just have preferences over events, without having to assume there's some underlying utility function over worlds.

by abramdemski
#23
Optimality is the tiger, and agents are its teeth

People worry about agentic AI with ulterior motives, and some suggest Oracle AI, which only answers questions, as the fix. Veedrac argues this misses the point: it killed you because it was optimised; it used an agent because an agent was an effective tool it had on hand.

Optimality is the tiger, and agents are its teeth.

by Veedrac
#23
Davidad's Bold Plan for Alignment: An In-Depth Explanation

Charbel-Raphaël summarizes Davidad's plan: Use near AGIs to build a detailed world simulation, then train and formally verify an AI that follows coarse preferences and avoids catastrophic outcomes. 

by Charbel-Raphaël
#24
chinchilla's wild implications

The DeepMind paper that introduced Chinchilla revealed that we've been using way too many parameters and not enough data for large language models. There are immense returns to scaling up training data size, but we may be running out of high-quality data to train on. This could be a major bottleneck for future AI progress.

by nostalgebraist
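
For reference, the parametric loss fit behind these claims (with fitted constants roughly as reported in the Chinchilla paper) is:

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
```

where N is the parameter count and D the number of training tokens; minimizing this under a fixed compute budget C ≈ 6ND yields the compute-optimal rule of thumb of roughly 20 training tokens per parameter.
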
#25
Introduction To The Infra-Bayesianism Sequence

Vanessa and Diffractor introduce a new approach to epistemology / decision theory / reinforcement learning theory called Infra-Bayesianism, which aims to solve issues with prior misspecification and non-realizability that plague traditional Bayesianism.

by Diffractor
#25
Selection Theorems: A Program For Understanding Agents

What's the type signature of an agent? John Wentworth proposes Selection Theorems as a way to explore this question. Selection Theorems tell us what agent type signatures will be selected for in broad classes of environments. This post outlines the concept and how to work on it.

by johnswentworth
#25
Against Almost Every Theory of Impact of Interpretability

Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk. 

by Charbel-Raphaël
#26
Worlds Where Iterative Design Fails

In worlds where AI alignment can be handled by iterative design, we probably survive. So if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason. John explores several ways that could happen, beyond just fast takeoff and deceptive misalignment. 

by johnswentworth
#27
My research methodology

Paul Christiano describes his research methodology for AI alignment. He focuses on trying to develop algorithms that can work "in the worst case" - i.e. algorithms for which we can't tell any plausible story about how they could lead to egregious misalignment. He alternates between proposing alignment algorithms and trying to think of ways they could fail.

by paulfchristiano
#27
Decision theory does not imply that we get to have nice things

Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely. 

by So8res
#27
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"

Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "powerseeking"), and asks what the prerequisites for scheming are and by which paths they might arise.

by Joe Carlsmith
#29
Some AI research areas and their relevance to existential safety

Andrew Critch lists several research areas that seem important to AI existential safety, and evaluates them for direct helpfulness, educational value, and neglect. Along the way, he argues that the main way he sees present-day technical research helping is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise later.

by Andrew_Critch
#30
Search versus design

How is it that we solve engineering problems? What is the nature of the design process that humans follow when building an air conditioner or computer program? How does this differ from the search processes present in machine learning and evolution? This essay studies search and design as distinct approaches to engineering, arguing that establishing trust in an artifact is tied to understanding how that artifact works, and that a central difference between search and design is the comprehensibility of the artifacts produced.

by Alex Flint
#31
A Mechanistic Interpretability Analysis of Grokking

Neel Nanda reverse engineers neural networks that have "grokked" modular addition, showing that they operate using Discrete Fourier Transforms and trig identities. He argues grokking is really about phase changes in model capabilities, and that such phase changes may be ubiquitous in larger models.

by Neel Nanda
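
Roughly, the reverse-engineered algorithm for addition mod p works by constructive interference across a handful of key frequencies w_k; a sketch of the identity involved (not a full reproduction of the analysis):

```latex
\mathrm{logit}(c) \;\propto\; \sum_{k} \cos\!\big(w_k (a + b - c)\big)
\;=\; \sum_{k} \Big[\cos\!\big(w_k(a+b)\big)\cos(w_k c) + \sin\!\big(w_k(a+b)\big)\sin(w_k c)\Big]
```

which is maximized exactly when c ≡ a + b (mod p), so the correct answer gets the largest logit.
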
#34
Open question: are minimal circuits daemon-free?

Can the smallest boolean circuit that solves a problem be a "daemon" (a consequentialist system with its own goals)? Paul Christiano suspects not, but isn't sure. He thinks this question, while not necessarily directly important, may yield useful insights for AI alignment.

by paulfchristiano
#34
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits. 

by LawrenceC
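
The full algorithm recursively resamples activations according to the hypothesis's correspondence; the sketch below shows only the core move in a much-simplified form, assuming a PyTorch model with fixed-shape inputs and a hypothesis that simply marks some modules as irrelevant.

```python
import random
import torch

def scrubbed_loss(model, dataset, irrelevant_modules, loss_fn, n_samples=100):
    """Average loss when activations the hypothesis deems irrelevant are
    replaced by activations recomputed from an unrelated input."""
    losses = []
    for _ in range(n_samples):
        x, y = random.choice(dataset)         # input we actually evaluate on
        x_other, _ = random.choice(dataset)   # source of replacement activations

        # 1. Record the "irrelevant" modules' activations on the other input.
        recorded = {}
        hooks = [m.register_forward_hook(
                     lambda mod, inp, out, m=m: recorded.__setitem__(m, out))
                 for m in irrelevant_modules]
        with torch.no_grad():
            model(x_other)
        for h in hooks:
            h.remove()

        # 2. Run the real input, patching in the recorded activations.
        hooks = [m.register_forward_hook(lambda mod, inp, out, m=m: recorded[m])
                 for m in irrelevant_modules]
        with torch.no_grad():
            pred = model(x)
        for h in hooks:
            h.remove()

        losses.append(loss_fn(pred, y).item())
    return sum(losses) / len(losses)
```

If the hypothesis is right, this scrubbed loss should stay close to the model's ordinary loss; a large gap means the "irrelevant" activations mattered after all.
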
#35
Language models seem to be much better than humans at next-token prediction

How good are modern language models compared to humans at the task language models are trained on (next token prediction on internet text)? We found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1. 

by Buck
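
For reference, the two metrics mentioned are computed per token position: top-1 accuracy is the fraction of positions where the most probable predicted token matches the actual next token, and perplexity is

```latex
\mathrm{perplexity} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p\big(x_i \mid x_{<i}\big)\right)
```

i.e. the exponentiated average negative log-likelihood over N positions, where lower is better.
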
#35
GPTs are Predictors, not Imitators

GPTs are being trained to predict text, not imitate humans. This task is actually harder than being human in many ways. You need to be smarter than whatever generated the text to perfectly predict its output, and some text is the result of complex processes (e.g. scientific results, news) that even humans couldn't predict.

GPTs are solving a fundamentally different and often harder problem than just "be human-like". This means we shouldn't expect them to think like humans.

by Eliezer Yudkowsky
#37
Gradient hacking

Gradient hacking is when a deceptively aligned AI deliberately acts to influence how the training process updates it. For example, it might try to become more brittle in ways that prevent its objective from being changed. This poses challenges for AI safety, as the AI might try to remove evidence of its deception during training.

by evhub
#38
Discussion with Nate Soares on a key alignment difficulty

Nate Soares argues that there's a deep tension between training an AI to do useful tasks (like alignment research) and training it to avoid dangerous actions. Holden is less convinced of this tension. They discuss a hypothetical training process and analyze potential risks.

by HoldenKarnofsky
#39
Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky offers detailed critiques of Paul Christiano's AI alignment proposal, arguing that it faces major technical challenges and may not work without already having an aligned superintelligence. Christiano acknowledges the difficulties but believes they are solvable.

by Eliezer Yudkowsky
#39
EfficientZero: How It Works

The RL algorithm "EfficientZero" achieves better-than-human performance on Atari games after only 2 hours of gameplay experience. This seems like a major advance in sample efficiency for reinforcement learning. The post breaks down how EfficientZero works and what its success might mean.

by 1a3orn
#39
Models Don't "Get Reward"

Models don't "get" reward. Reward is the mechanism by which we select parameters, it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment. 

by Sam Ringer
#40
How To Go From Interpretability To Alignment: Just Retarget The Search

Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done. 

by johnswentworth
#41
Neural networks generalize because of this one weird trick

Neural networks generalize unexpectedly well. Jesse argues this is because of singularities in the loss surface which reduce the effective number of parameters. These singularities arise from symmetries in the network. More complex singularities lead to simpler functions which generalize better. This is the core insight of singular learning theory.

by Jesse Hoogland
#42
Why Agent Foundations? An Overly Abstract Explanation

What's with all the strange pseudophilosophical questions from AI alignment researchers, like "what does it mean for some chunk of the world to do optimization?" or "how does an agent model a world bigger than itself?". John lays out why some people think solving these sorts of questions is a necessary prerequisite for AI alignment.

by johnswentworth
#43
A central AI alignment problem: capabilities generalization, and the sharp left turn

Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.

by So8res
#44
Alignment Research Field Guide

A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks should you buy to maximize productivity?"

by abramdemski
#44
Humans provide an untapped wealth of evidence about alignment

Alignment researchers often propose clever-sounding solutions without citing much evidence that their solution should help. Such arguments can mislead people into working on dead ends. Instead, TurnTrout argues we should focus more on studying how human intelligence implements alignment properties, as it is a real "existence proof" of aligned intelligence.

by TurnTrout
#45
Shallow review of live agendas in alignment & safety

A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs. 

by technicalities
#49
The shard theory of human values

How do humans form their values? Shard theory proposes that human values are formed through a relatively straightforward reinforcement process, rather than being hard-coded by evolution. This post lays out the core ideas behind shard theory and explores how it can explain various aspects of human behavior and decision-making. 

by Quintin Pope
#50
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.

by Collin
orthonormal
Insofar as the AI Alignment Forum is part of the Best-of-2018 Review, this post deserves to be included. It's the friendliest explanation of MIRI's research agenda (as of 2018) that currently exists.
Ben Pace
+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into building an understanding of the problem deep enough to produce this catalogue of the difficulties we face, and I'm extremely glad it was written up.
Daniel Kokotajlo
This post is the best overview of the field so far that I know of. I appreciate how it frames things in terms of outer/inner alignment and training/performance competitiveness--it's very useful to have a framework with which to evaluate proposals and this is a pretty good framework I think. Since it was written, this post has been my go-to reference for getting other people up to speed on what the current AI alignment strategies look like (even though this post isn't exhaustive), and I've referred back to it myself several times. I learned a lot from it. I hope that this post grows into something more extensive and official -- maybe an Official Curated List of Alignment Proposals, Summarized and Evaluated with Commentary and Links. Such a list could be regularly updated and would be very valuable for several reasons, some of which I mentioned in this comment.
johnswentworth
I think control research has relatively little impact on X-risk in general, and wrote up the case against here. Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard and solving the problem itself seems pretty hard too, so it's a particularly difficult type of problem to outsource to AI - and a particularly easy type of problem to trick oneself into thinking the AI has solved, when it hasn't.
Buck
I think this paper was great. I'm very proud of it. It's a bit hard to separate out this paper from the follow-up post arguing for control, but I'll try to. This paper led to a bunch of research by us and other people; it helped AI control become (IMO correctly) one of the main strategies discussed for ensuring safety from scheming AIs. It was accepted as an oral at ICML 2024. AI companies and other researchers have since built on this work (Anthropic’s “Sabotage Evaluations”, Mathew et al “Hidden in Plain Text”; I collaborated on Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats; Redwood has also been working on a big follow up paper that should be out soon), and AI control has been framed by (Grosse, Buhl, Balesni, Clymer) as a bedrock approach for mitigating catastrophic risk from misaligned AI.  My main regret about this paper is that we didn't use defer-to-resample, a technique where you replace suspicious actions with a resample from the untrusted model (as discussed e.g. here). This probably would have been better than the other defer techniques we tried. I have more regrets about the follow-up post ("The case for ensuring...") than about this post; this post was more straightforward and less ambitious, and so gave us fewer opportunities to stick our necks out making arguments or introducing concepts that we'd later regret. I'm very excited for more follow-up work on this paper, and I'm working on mentoring such projects and sourcing funding for them.
Orpheus16
ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.

Things about ELK that I benefited from:

Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and presenting counterexamples to those strategies. This style of thinking is straightforward and elegant, and I think the examples in the report helped me (and others) understand ARC’s general style of thinking.

Understanding the alignment problem. ELK presents alignment problems in a very “show, don’t tell” fashion. While many of the problems introduced in ELK have been written about elsewhere, ELK forces you to think through the reasons why your training strategy might produce a dishonest agent (the human simulator) as opposed to an honest agent (the direct translator). The interactive format helped me more deeply understand some of the ways in which alignment is difficult.

Common language & a shared culture. ELK gave people a concrete problem to work on. A whole subculture emerged around ELK, with many junior alignment researchers using it as their first opportunity to test their fit for theoretical alignment research. There were weekend retreats focused on ELK. It was one of the main topics that people were discussing from Jan-Feb 2022. People shared their training strategy ideas over lunch and dinner. It’s difficult to know for sure what kind of effect this had on the community as a whole. But at least for me, my current best-guess is that this shared culture helped me understand alignment, increased the amount of time I spent thinking/talking about alignment, and helped me connect with peers/collaborators who we
Vaniver
I've written a bunch elsewhere about object-level thoughts on ELK. For this review, I want to focus instead on meta-level points. I think ELK was very well-made; I think it did a great job of explaining itself with lots of surface area, explaining a way to think about solutions (the builder-breaker cycle), bridging the gap between toy demonstrations and philosophical problems, and focusing lots of attention on the same thing at the same time. In terms of impact on the growth and development on the AI safety community, I think this is one of the most important posts from 2021 (even tho the prize and much of the related work happened in 2022). I don't really need to ask for follow-on work; there's already tons, as you can see from the ELK tag. I think it is maybe underappreciated by the broad audience how much this is an old problem, and appreciate the appendix that gives credit to earlier thinking, while thinking this doesn't erode any of the credit Paul, Mark, and Ajeya should get for the excellent packaging. [To the best of my knowledge, ELK is still an open problem, and one of the things that I appreciated about the significant focus on ELK specifically was helping give people better models of how quickly progress happens in this space, and what it looks like (or doesn't look like).]
Steven Byrnes
I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI. The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an idiosyncrasy of the learning algorithm. For example: Both human brains and ConvNets seem to have a “tree” abstraction; neither human brains nor ConvNets seem to have a “head or thumb but not any other body part” concept. I kind of agree with this. I would say that the patterns are a joint property of the world and an inductive bias. I think the relevant inductive biases in this case are something like: (1) “patterns tend to recur”, (2) “patterns tend to be localized in space and time”, and (3) “patterns are frequently composed of multiple other patterns, which are near to each other in space and/or time”, and maybe other things. The human brain definitely is wired up to find patterns with those properties, and ConvNets to a lesser extent. These inductive biases are evidently very useful, and I find it very likely that future learning algorithms will share those biases, even more than today’s learning algorithms. So I’m basically on board with the idea that there may be plenty of overlap between the world-models of various different unsupervised world-model-building learning algorithms, one of which is the brain. (I would also add that I would expect “natural abstractions” to be a matter of degree, not binary. We can, after all, form the concept “head or thumb but not any other body part”. It would just be extremely low on the list of things that would pop into our head when trying to make sense of something we’
Vanessa Kosoy
I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.

[1] I divide those into "takeoff speeds", "attitude towards prosaic alignment", and "the metadebate" (the last one is about what kind of debate norms we should have about this, or what kind of arguments we should listen to).
habryka
I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of.  One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not what you might think when you observe the enormous amount of talent, funding and prestige flooding into the space, and the relatively constant refrain of "now that we have cutting edge systems to play around with we are making progress at an unprecedented rate".  It is quite plausible to me that technical AI Alignment research is not a particularly valuable thing to be doing right now. I don't think I have seen much progress, and the dynamics of the field seem to be enshrining an expert class that seems almost ontologically committed to believing that the things they are working on must be good and tractable, because their salary and social standing relies on believing that.  This and a few other similar posts last year are the kind of post that helped me come to understand the considerations around this crucial question better, and where I am grateful that Nate, despite having spent a lot of his life on solving the technical AI Alignment problem, is willing to question the tractability of the whole field. This specific post is more oriented around other people's work, though other posts by Nate and Eliezer are also facing the degree to which their past work didn't make the relevant progress they were hoping for. 
Zack_M_Davis
I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post. That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this which criticizes imaginary paraphrases of researchers, rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily need to be able to pass an author's Ideological Turing Test is because, as a critic, I can at least be scrupulous in my reasoning about the actual text that the author actually published, even if the stereotype of the author I have in my head is faulty. If I can't produce the quotes to show that I'm not just arguing against a stereotype in my head, then it's not clear why the audience should care.
habryka
I've been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it.

The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post):

The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4.

To me, the key feature of a "simulator" would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the "physics" of the system) and then applying that transition function over and over again until your desired target time.

I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4's training objective is that it is highly myopic. Beyond that, I don't see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose.

When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that qu
6janus
I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing. It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY's Sequences were for me. Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I'd otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked. I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now i
31johnswentworth
This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.

I see two (related) central problems, from which various other symptoms follow:

1. POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
2. Unstructured MDPs are a bad model in which to formulate instrumental convergence. In particular, they are bad for building a gears-level understanding of what features of the environment give rise to convergence.

Some things I've thought a lot about over the past year seem particularly well-suited to address these problems, so I have a fair bit to say about them.

Why Unstructured MDPs Are A Bad Model For Instrumental Convergence

The basic problem with unstructured MDPs is that the entire world-state is a single, monolithic object. Some symptoms of this problem:

* it's hard to talk about "resources", which seem fairly central to instrumental convergence
* it's hard to talk about multiple agents competing for the same resources
* it's hard to talk about which parts of the world an agent controls/doesn't control
* it's hard to talk about which parts of the world agents do/don't care about
* ... indeed, it's hard to talk about the world having "parts" at all
* it's hard to talk about agents not competing, since there's only one monolithic world-state to control
* any action which changes the world at all changes the entire world-state; there's no built-in w
6TurnTrout
One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility.

Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not imply. For example, the main paper is 9 pages long; in Appendix B, I further dedicated 3.5 pages to exploring the nuances of the formal definition of ‘power-seeking’ (Definition 6.1).

However, there are a few things I wish I’d gotten right the first time around. Therefore, I’ve restructured and rewritten much of the post. Let’s walk through some of the changes.

‘Instrumentally convergent’ replaced by ‘robustly instrumental’

Like many good things, this terminological shift was prompted by a critique from Andrew Critch.

Roughly speaking, this work considered an action to be ‘instrumentally convergent’ if it’s very probably optimal, with respect to a probability distribution on a set of reward functions. For the formal definition, see Definition 5.8 in the paper. This definition is natural. You can even find it echoed by Tony Zador in the Debate on Instrumental Convergence. (Zador uses “set of scenarios” instead of “set of reward functions”, but he is implicitly reasoning: “with respect to my beliefs about what kind of objective functions we will implement and what the agent will confront in deployment, I predict that deadly actions have a negligible probability of being optimal.”)

While discussing this definition of ‘instrumental convergence’, Andrew asked me: “what, exactly, is doing the converging? There is no limiting process. Optimal policies just are.”

It would be more appropriate to say that an ac
17Vanessa Kosoy
In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximizing the expectation of some utility function.

I have mixed feelings about this essay. On the one hand, the core argument that VNM and similar theorems do not imply goal-directed behavior is true. To the extent that some people believed the opposite, correcting this mistake is important. On the other hand, (i) I don't think the claim Rohin is debunking is the claim Eliezer had in mind in those sources Rohin cites, and (ii) I don't think that the conclusions Rohin draws, or at least implies, are the right conclusions.

The actual claim that Eliezer was making (or at least my interpretation of it) is that coherence arguments imply that if we assume an agent is goal-directed then it must be an expected utility maximizer, and therefore EU maximization is the correct mathematical model to apply to such agents.

Why do we care about goal-directed agents in the first place? The reason is that, on the one hand, goal-directed agents are the main source of AI risk, and on the other hand, goal-directed agents are also the most straightforward approach to solving AI risk. Indeed, if we could design powerful agents with the goals we want, these agents would protect us from unaligned AIs and solve all other problems as well (or at least solve them better than we can solve them ourselves). Conversely, if we want to protect ourselves from unaligned AIs, we need to generate very sophisticated long-term plans of action in the physical world, possibl
25Zack_M_Davis
(Self-review.) I'm as proud of this post as I am disappointed that it was necessary. As I explained to my prereaders on 19 October 2023: I think the dialogue format works particularly well in cases like this where the author or the audience is supposed to find both viewpoints broadly credible, rather than an author avatar beating up on a strawman. (I did have some fun with Doomimir's characterization, but that shouldn't affect the arguments.) This is a complicated topic. To the extent that I was having my own doubts about the "orthodox" pessimist story in the GPT-4 era, it was liberating to be able to explore those doubts in public by putting them in the mouth of a character with the designated idiot character name without staking my reputation on Simplicia's counterarguments necessarily being correct. Giving both characters pejorative names makes it fair. In an earlier draft, Doomimir was "Doomer", but I was already using the "Optimistovna" and "Doomovitch" patronymics (I had been consuming fiction about the Soviet Union recently) and decided it should sound more Slavic. (Plus, "-mir" (мир) can mean "world".)
10Zvi
This post is even-handed and well-reasoned, and explains the issues involved well. The strategy-stealing assumption seems important, as a lot of predictions are inherently relying on it either being essentially true, or effectively false, and I think the assumption will often effectively be a crux in those disagreements, for reasons the post illustrates well. The weird thing is that Paul ends the post saying he thinks the assumption is mostly true, whereas I thought the post was persuasive that the assumption is mostly false. The post illustrates that the unaligned force is likely to have many strategic and tactical advantages over aligned forces, that should allow the unaligned force to, at a minimum, 'punch above its weight' in various ways even under close-to-ideal conditions. And after the events of 2020, and my resulting updates to my model of humans, I'm highly skeptical that we'll get close to ideal. Either way, I'm happy to include this.
0Zvi
I've stepped back from thinking about ML and alignment the last few years, so I don't know how this fits into the discourse about it, but I felt like I got important insight here and I'd be excited to include this. The key concept that bigger models can be simpler seems very important.  In my words, I'd say that when you don't have enough knobs, you're forced to find ways for each knob to serve multiple purposes slash combine multiple things, which is messy and complex and can be highly arbitrary, whereas with lots of knobs you can do 'the thing you naturally actually want to do.' And once you get sufficiently powerful, the overfitting danger isn't getting any worse with the extra knobs, so sure, why not? I also strongly agree with orthonormal that including the follow-up as an addendum adds a lot to this post. If it's worth including this, it's worth including both, even if the follow-up wasn't also nominated. 
9orthonormal
If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.
12Vika
I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.

Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I've seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful - this helps keep it current and comprehensive, and people can feel some ownership of the list since they can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.

This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I've probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low hanging fruit in the safety resources space that we haven't picked yet.

I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the arg
13Vanessa Kosoy
In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent's preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to "moral uncertainty".

The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 if the button is ever pushed, 0 if the button is never pushed). More generally, computable utility functions have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time).

The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the
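For concreteness, here is the procrastination-paradox utility function mentioned in the review above, written out (my own rendering of the standard example, not a quote from the post). It is discontinuous in the product topology on infinite histories, which is why it cannot be computable in the relevant sense:

```latex
U(h) =
\begin{cases}
  1 & \text{if the button is pushed at some finite time } t \text{ in history } h, \\
  0 & \text{if the button is never pushed.}
\end{cases}
```

No finite prefix of a never-push history ever settles the value, which is the source of the discontinuity.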
7Ben Pace
An Orthodox Case Against Utility Functions was a shocking piece to me. Abram spends the first half of the post laying out a view he suspects people hold, but he thinks is clearly wrong, which is a perspective that approaches things "from the starting-point of the universe". I felt dread reading it, because it was a view I held at the time, and one I used as a key background perspective when I discussed bayesian reasoning. The rest of the post lays out an alternative perspective that "starts from the standpoint of the agent". Instead of my beliefs being about the universe, my beliefs are about my experiences and thoughts. I generally nod along to a lot of the 'scientific' discussion in the 21st century about how the universe works and how reasonable the whole thing is. But I don't feel I knew in-advance to expect the world around me to operate on simple mathematical principles and be so reasonable. I could've woken up in the Harry Potter universe of magic wands and spells. I know I didn't, but if I did, I think I would be able to act in it? I wouldn't constantly be falling over myself because I don't understand how 1 + 1 = 2 anymore? There's some place I'm starting from that builds up to an understanding of the universe, and doesn't sneak it in as an 'assumption'. And this is what this new perspective does that Abram lays out in technical detail. (I don't follow it all, for instance I don't recall why it's important that the former view assumes that utility is computable.) In conclusion, this piece is a key step from the existing philosophy of agents to the philosophy of embedded agents, or at least it was for me, and it changes my background perspective on rationality. It's the only post in the early vote that I gave +9. (This review is taken from my post Ben Pace's Controversial Picks for the 2020 Review.)
20ryan_greenblatt
At the time when I first heard this agenda proposed, I was skeptical. I remain skeptical, especially about the technical work that has been done thus far on the agenda[1]. I think this post does a reasonable job of laying out the agenda and the key difficulties. However, when talking to Davidad in person, I've found that he often has more specific tricks and proposals than what was laid out in this post. I didn't find these tricks moved me very far, but I think they were helpful for understanding what is going on. This post and Davidad's agenda overall would benefit from having concrete examples of how the approach might work in various cases, or more discussion of what would be out of scope (and why this could be acceptable). For instance, how would you make a superhumanly efficient (ASI-designed) factory that produces robots while proving safety? How would you allow for AIs piloting household robots to do chores (or is this out of scope)? How would you allow for the AIs to produce software that people run on their computers or to design physical objects that get manufactured? Given that this proposal doesn't allow for safely automating safety research, my understanding is that it is supposed to be a stable end state. Correspondingly, it is important to know what Davidad thinks can and can't be done with this approach. My core disagreements are on the "Scientific Sufficiency Hypothesis" (particularly when considering computational constraints), "Model-Checking Feasibility Hypothesis" (and more generally on proving the relevant properties), and on the political feasibility of paying the needed tax even if the other components work out. It seems very implausible to me that making a sufficiently good simulation is as easy as building the Large Hadron Collider. I think the objection in this comment holds up (my understanding is Davidad would require that we formally verify everything on the computer).[2] As a concrete example, I found it quite implausible that you
4Charbel-Raphaël
Ok, time to review this post and assess the overall status of the project.

Review of the post

What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post—it's progressive and clearly distinguishes between what's important and what's not.

I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand it, and thought, "There's no way this is going to work." Then I reconsidered, thought about it more deeply, and realized there was something important here. Hopefully, this post succeeded in showing that there is indeed something worth exploring! I think such distillation and analysis are really important. I'm especially happy about the fact that we tried to elicit as much as we could from Davidad's model during our interactions, including his roadmap and some ideas of easy projects to get early empirical feedback on this proposal.

Current Status of the Agenda (I'm not the best person to write this; see this as an informal personal opinion)

Overall, Davidad performed much better than expected with his new job as program director at ARIA and got funded $74M over 4 years. And I still think this is the only plan that could enable the creation of a very powerful AI capable of performing a true pivotal act to end the acute risk period, and I think this last part is the added value of this plan, especially in the sense that it could be done in a somewhat ethical/democratic way compared to other forms of pivotal acts. However, it's probably not going to happen in time.

Are we on track? Weirdly, yes for the non-technical aspects, no for the technical ones? The post includes a roadmap with 4 stages, and we can check if we are on track. It seems to me that Davidad jumped directly to stage 3, without going through sta
7Diffractor
This post is still endorsed, it still feels like a continually fruitful line of research. A notable aspect of it is that, as time goes on, I keep finding more connections and crisper ways of viewing things which means that for many of the further linked posts about inframeasure theory, I think I could explain them from scratch better than the existing work does. One striking example is that the "Nirvana trick" stated in this intro (to encode nonstandard decision-theory problems), has transitioned from "weird hack that happens to work" to "pops straight out when you make all the math as elegant as possible". Accordingly, I'm working on a "living textbook" (like a textbook, but continually being updated with whatever cool new things we find) where I try to explain everything from scratch in the crispest way possible, to quickly catch up on the frontier of what we're working on. That's my current project. I still do think that this is a large and tractable vein of research to work on, and the conclusion hasn't changed much.
37Charbel-Raphaël
Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a big, established research agenda remains crucial. It's too easy to say, "This research agenda will help," without critically assessing how. I appreciate the post's general energy in asserting that if we're in trouble or not making progress, we need to discuss it. I still want everyone working on interpretability to read it and engage with its arguments.

Acknowledgments: Thanks to Epiphanie Gédéon, Fabien Roger, and Clément Dumas for helpful discussions.

Updates on my views

Legend:

* On the left of the arrow, a citation from the OP → ❓ on the right, my review, which generally begins with emojis
* ✅ - yes, I think I was correct (>90%)
* ❓✅ - I would lean towards yes (70%-90%)
* ❓ - unsure (between 30%-70%)
* ❓❌ - I would lean towards no (10%-30%)
* ❌ - no, I think I was basically wrong (<10%)
* ⭐ important, you can skip the other sections

Here's my review section by section:

⭐ The Overall Theory of Impact is Quite Poor?

* "Whenever you want to do something with interpretability, it is probably better to do it without it" → ❓ I still think this is basically right, even if I'm not confident this will still be the case in the future; but as of today, I can't name a single mech-interpretability technique that does a better job at some non-intrinsic interpretability goal than the other more classical techniques, on a non-toy model task.
* "Interpretability is Not a Good Predictor of Future Systems" →
6Ben Pace
Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and gets detailed on just one idea for 10x the space, thus communicating less of the big picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Eliciting Latent Knowledge.

Overview of why:

* The motivations behind most of the proposals Paul has spent a lot of time on (iterated amplification, imitative generalization) are explained clearly and succinctly.
  * For a quick summary, this involves
    * A proposal for useful ML-systems designed with human feedback
    * An argument that the human-feedback ML-systems will have flaws that kill you
    * A proposal for using ML assistants to debug the original ML system
    * An argument that the ML systems will not be able to understand the original human-feedback ML-systems
    * A proposal for training the human-feedback ML-systems in a way that requires understandability
    * An argument that this proposal is uncompetitive
    * ??? (I believe the proposals in the ELK document are the next step here)
* A key problem when evaluating very high-IQ, impressive, technical work is that it is unclear which parts of the work you do not understand because you do not understand an abstract technical concept, and which parts are simply judgment calls based on the originator of the idea. This post shows very clearly which is which — many of the examples and discussions are technical, but the standards for "plausible failure story" and "sufficiently precise algorithm" and "sufficiently doomed" are all judgment calls, as are the proposed solutions.

I'm not even sure I get on the bus at step 1, that the right next step is to consider ML
14ryan_greenblatt
IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans. AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1] When it returns to arguing about the actual main question (a tiny fraction of resources) at the end here and eventually gets to the main trade-related argument (acausal or causal) in the very last response in this section, it almost seems to admit that this tiny amount of resources is plausible, but fails to update all the way.

I think the discussion here and here seems highly relevant and fleshes out this argument to a substantially greater extent than I did in this comment.

However, note that being willing to spend a tiny fraction of resources on humans still might result in AIs killing a huge number of humans due to conflict between them and humans, or the AIs needing to race through the singularity as quickly as possible due to competition with other misaligned AIs. (Again, discussed in the links above.) I think fully misaligned paperclipper/squiggle-maximizer AIs which spend only a tiny fraction of resources on humans (as seems likely conditional on that type of AI) are reasonably likely to cause outcomes which look obviously extremely bad from the perspective of most people (e.g., more than hundreds of millions dead due to conflict and then most people quickly rounded up and given the option to either be frozen or killed).

I wish that Soares and Eliezer would stop making these incorrect arguments against tiny fractions of resources being spent on the preferences of current humans. It isn't their actual crux, and it isn't the crux of anyone else either. (However rhetorically nice it might be.)
6habryka
This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 
19Neel Nanda
Self-Review: After a while of being insecure about it, I'm now pretty fucking proud of this paper, and think it's one of the coolest pieces of research I've personally done. (I'm going to both review this post, and the subsequent paper.) Though, as discussed below, I think people often overrate it.

Impact

The main impact IMO is proving that mechanistic interpretability is actually possible, that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-accidentally) did a great job of picking a problem that people cared about for non-interp reasons, where mech interp was unusually easy (as it was a small model, on a clean algorithmic task), and that I was able to find real insights about grokking as a direct result of doing the mechanistic interpretability. Real models are fucking complicated (and even this model has some details we didn't fully understand), but I feel great about the field having something that's genuinely detailed, rigorous and successfully reverse-engineered, and this seems an important proof of concept. The other contributions are the specific algorithm I found, and the specific insights about how and why grokking happens, but in my opinion these are much less interesting.

Field-Building

Another large amount of impact is that this was a major win for mech interp field-building. This is hard to measure, but some data:

* There are multiple papers I like that are substantially building on/informed by these results (A toy model of universality, the clock and the pizza, Feature emergence via margin maximization, Explaining grokking through circuit efficiency).
* It's got >100 citations in less than a year (a decent chunk of these are semi-fake citations from this being used as a standard citation for 'mech interp exists as a field', so I care more about the "how many papers would not exist without this" metric).
* It went pretty viral on Twi
39Buck
(I'm just going to speak for myself here, rather than the other authors, because I don't want to put words in anyone else's mouth. But many of the ideas I describe in this review are due to other people.)

I think this work was a solid intellectual contribution. I think that the metric proposed for how much you've explained a behavior is the most reasonable metric by a pretty large margin. The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I'm glad we did. But these negative results haven’t had that much influence on other people’s work AFAICT, so overall it seems somewhat low impact.

The empirical results in this paper demonstrated that induction heads are not the simple circuit which many people claimed (see this post for a clearer statement of that), and we then used these techniques to get mediocre results for IOI (described in this comment). There hasn’t been much followup on this work. I suspect that the main reasons people haven't built on this are:

* it's moderately annoying to implement it
* it makes your explanations look bad (IMO because they actually are unimpressive), so you aren't that incentivized to get it working
* the interp research community isn't very focused on validating whether its explanations are faithful, and in any case we didn’t successfully persuade many people that explanations performing poorly according to this metric means they’re importantly unfaithful

I think that interpretability research isn't going to be able to produce explanations that are very faithful explanations of what's going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of faithfulness of explanations don't seem very important to me now. (I think that people who want to do research that uses model internals should evaluate their techniques by mea
9Buck
This post's point still seems correct, and it still seems important--I refer to it at least once a week.
8Vanessa Kosoy
This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach.

Here is how I view this question: The first unknown is how accurate the simulation is. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. On the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than the intended null hypothesis / ordinary reality.

Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. Karnofsky says "a small amount of RL", though, and I don't know why he believes a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3.

The danger (as explained in the OP by Soares, paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than the capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On t
4Alex_Altair
This post expresses an important idea in AI alignment that I have essentially believed for a long time, and which I have not seen expressed elsewhere. (I think a substantially better treatment of the idea is possible, but this post is fine, and you get a lot of points for being the only place where an idea is being shared.)
8Vanessa Kosoy
This post is a solid introduction to the application of Singular Learning Theory to generalization in deep learning. This is a topic that I believe to be quite important.

One nitpick: The OP says that it "seems unimportant" that ReLU networks are not analytic. I'm not so sure. On the one hand, yes, we can apply SLT to (say) GELU networks instead. But GELUs seem mathematically more complicated, which probably translates to extra difficulties in computing the RLCT and hence makes applying SLT harder. Alternatively, we can consider a series of analytic activation functions that converges to ReLU, but that probably also comes with extra complexity. Also, ReLUs have an additional symmetry (the scaling symmetry mentioned in the OP), and SLT kinda thrives on symmetries, so throwing that out might be bad!

It seems to me like a fascinating possibility that there is some kind of tropical geometry version of SLT which would allow analyzing generalization in ReLU networks directly and perhaps somewhat more easily. But, at this point it's merely a wild speculation of mine.
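For readers unfamiliar with the scaling symmetry being referred to, here is one way to write it out (my own illustration using the positive homogeneity of ReLU, not a formula from the post): for a single hidden unit with incoming weights w, bias b and outgoing weight v, and any a > 0,

```latex
v \,\mathrm{ReLU}\!\left(w^{\top} x + b\right)
  \;=\;
\frac{v}{a} \,\mathrm{ReLU}\!\left(a\, w^{\top} x + a\, b\right),
```

so rescaling the incoming parameters by a and the outgoing weight by 1/a leaves the network function unchanged, giving a continuous family of parameter settings that implement the same function.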
81a3orn
I remain pretty happy with most of this, looking back -- I think this remains clear, accessible, and about as truthful as possible without getting too technical. I do want to grade my conclusions / predictions, though.

(1). I predicted that this work would quickly be exceeded in sample efficiency. This was wrong -- it's been a bit over a year and EfficientZero is still SOTA on Atari. My 3-to-24-month timeframe hasn't run out, but I said that I expected "at least a 25% gain" towards the start of the time, which hasn't happened.

(2). There has been a shift to multitask domains, or to multi-benchmark papers. This wasn't too hard of a prediction, but I think it was correct. (Although of course good evidence for such a shift would require a comprehensive lit review.) To sample two -- DreamerV3 is a very recently released model-based DeepMind algorithm. It does very well at Atari100k -- it gets a better mean score than everything but EfficientZero -- but it also does well at DMLab + 4 other benchmarks + even crafting a Minecraft diamond. The paper emphasizes the robustness of the algorithm, and is right to do so -- once you get human-level sample efficiency on Atari100k, you really want to make sure you aren't just overfitting to that! And of course the infamous Gato is a multitask agent across a host of different domains, although the ultimate impact of it remains unclear at the moment.

(3). And finally -- well, the last conclusion, that there is still a lot of space for big gains in performance in RL even without field-overturning new insights, is inevitably subjective. But I think the evidence still supports it.
39LawrenceC
This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.

TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.[1]

Introduction/Overview

The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth “the CCS paper”) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons why this style of unsupervised methods may scale to future language models.

The CCS paper kicked off a lot of waves in the alig
15adamShimi
In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including one that captures, for me, what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had missed or forgotten some nice argument, some interesting takeaway.

With that, a caveat: I’m collaborating with Evan Hubinger (one of the authors) on projects related to ideas introduced in this sequence, especially to Deceptive Alignment. I am thus probably biased positively about this work. That being said, I have no problem saying I disagree with collaborators, so I don’t think I’m too biased to write this review.

(Small point: I among other people tend to describe this sequence/paper as mainly Evan’s work, but he repeatedly told me that everyone participated equally, and that the names are in alphabetic order, not contribution order. So let’s keep that in mind.)

Summary

Let’s start the review proper with a post by post summary (except for the conclusion):

* (Introduction) This first post introduces the idea of mesa-optimizers, the learned optimizers from the title. A mesa-optimizer is an optimizer which is the result of a learning process, and it comes with the issue of inner alignment: how aligned is the objective of the mesa-optimizer (on which we don’t have direct control) with the objective of the base-optimizer that produced this mesa-optimizer? The post then splits the safety questions related to mesa-optimizers into two categories: understanding which conditions make mesa-optimizers appear, and understanding how aligned the mesa-objective is with the base-objective.
* (Conditions for Mesa-Optimization) This post tackles the first category outlined in the introduction: how can m
13adamShimi
This post states the problem of gradient hacking. It is valuable in that this problem is far from obvious and, if plausible, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus instead of just reviewing this post, I would like to clarify certain points, while interweaving my criticisms of the way gradient hacking was initially stated, and explaining why I consider this problem so important.

(Caveat: I’m not pretending that any of my objections are unknown to Evan. We’re working together on searching for experimental evidence of gradient hacking, and this review is heavily influenced by this collaboration.)

What is gradient hacking?

Evan defines it in the post. In short, we have a trained mesa-optimizer that is deceptively aligned (its mesa-objective differs from the base-objective it is trained on). This deceptive system finds a way to influence the gradient generated during training, and thus can basically ensure it never becomes inner aligned (among other things).

Before checking on how exactly this could be possible, we should think a bit more about what this implies. If the deceptive model has arbitrary access to the gradient, then from that point on, the base-objective has only minimal influence on the training. Some influence remains because we probably want to test the end result, and thus it should work well with the base-objective. But that's pretty much the only constraint left. It could also pretty much deal with deception detectors, because it can make itself not detectable.

To put it pithily: if gradient hacking happens, we’re fucked.

How could it happen, though? Well, that’s actually two distinct questions: how could a model gradient hack, and how could training create a model which gradient hacks. The post mostly focuses on the first one.

How could a model gradient hack?

The first example comes from a quoted footnote of Risks from Learned Optimization: This im
15Isnasene
[Disclaimer: I'm reading this post for the first time now, as of 1/11/2020. I also already have a broad understanding of the importance of AI safety. While I am skeptical about MIRI's approach to things, I am also a fan of MIRI. Where this puts me relative to the target demographic of this post, I cannot say.]

Overall Summary

I think this post is pretty good. It's a solid and well-written introduction to some of the intuitions behind AI alignment and the fundamental research that MIRI does. At the same time, the use of analogy made the post more difficult for me to parse and hid some important considerations about AI alignment from view. Though it may be good (but not optimal) for introducing some people to the problem of AI alignment and a subset of MIRI's work, it did not raise or lower my opinion of MIRI as someone who already understood AGI safety to be important.

To be clear, I do not consider any of these weaknesses serious because I believe them to be partially irrelevant to the audience of people who don't appreciate the importance of AI-Safety. Still, they are relevant to the audience of people who give AI-Safety the appropriate scrutiny but remain skeptical of MIRI. And I think this latter audience is important enough to assign this article a "pretty good" instead of a "great". I hope a future post directly explores the merit of MIRI's work in the context of AI alignment without use of analogy.

Below is an overview of my likes and dislikes in this post. I will go into more detail about them in the next section, "Evaluating Analogies."

Things I liked:

* It's a solid introduction to AI-alignment, covering a broad range of topics including:
  * Why we shouldn't expect aligned AGI by default
  * How modern conversation about AGI behavior is problematically underspecified
  * Why fundamental deconfusion research is necessary for solving AI-alignment
* It directly explains the value/motivation of particular pieces of MIRI work via analogy -- which
7Jan_Kulveit
This is a great complement to Eliezer's 'List of lethalities', in particular because in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, helped make beliefs in AI safety more transparent. I still occasionally reference this post when talking to people who, after reading a bit about the debate e.g. on social media, first form an oversimplified model of the debate in which there is some unified 'safety' camp vs. 'optimists'. Also I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection could be a useful type of post, even without much justification.
10Jan_Kulveit
In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides:
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass, making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts, ...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success led some people in the AGI labs to think about mathematical structures of human values, which is an important problem

The downsides:
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or in thinking about multi-agent mind models
- the claims which are novel usually seem somewhat confused (e.g. that human values are inaccessible to the genome, or the naive RL intuitions)
- the novel terminology is incompatible with existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research)

Overall, 'shards' became so popular that reading at least the basics is probably necessary to understand what many people are talking about.
21DanielFilan
I think that strictly speaking this post (or at least the main thrust) is true, and proven in the first section. The title is arguably less true: I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.

I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term. In my mind, it provokes various follow-up questions:

1. What arguments would imply 'goal-directed' behaviour?
2. With what probability will a random utility maximiser be 'goal-directed'?
3. How often should I think of a system as a utility maximiser in resources, perhaps with a slowly-changing utility function?
4. How 'goal-directed' are humans likely to make systems, given that we are making them in order to accomplish certain tasks that don't look like random utility functions?
5. Is there some kind of 'basin of goal-directedness' that systems fall in if they're even a little goal-directed, causing them to behave poorly?

Off the top of my head, I'm not familiar with compelling responses from the 'freak out about goal-directedness' camp on points 1 through 5, even though as a member of that camp I think that such responses exist. Responses from outside this camp include Rohin's post 'Will humans build goal-directed agents?'. Another response is Brangus' comment post, although I find its theory of goal-directedness uncompelling.

I think that it's notable that Brangus' post was released soon after this was announced as a contender for Best of LW 2018. I think that if this post were added to the Best of LW 2018 Collection, the 'freak out' camp might produce more of these responses and move the dialogue forward. As such, I think it should be added, both because of the clear argumentation and because of the response it is likely to provoke.
4Seth Herd
This post skillfully addressed what is IMO the most urgent issue in alignment: bridging the gap between doomers and optimists. If half of alignment thinkers think alignment is very difficult, while half think it's pretty achievable, decision-makers will be prone to just choose whichever expert opinion supports what they want to do anyway. This and its following acts are the best work I know of in refining the key cruxes. And they do so in a compact, readable, and even fun form.
8Fabien Roger
This post describes a class of experiments that has proved very fruitful since it was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think this post is valuable for understanding the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).
15Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge. I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
4adamShimi
How do you review a post that was not written for you? I’m already doing research in AI Alignment, and I don’t plan on creating a group of collaborators for the moment. Still, I found some parts of this useful. Maybe that’s how you do it: by taking different profiles, and running through the most useful advice for each profile from the post. Let’s do that.

Full time researcher (no team or MIRIx chapter)

For this profile (which is mine, by the way), the most useful piece of advice from this post comes from the model of transmitters and receivers. I’m convinced that I’ve been using it intuitively for years, but having an explicit model is definitely a plus when trying to debug a specific situation, or to explain how it works to someone less used to thinking like that.

Full time researcher who wants to build a team/MIRIx chapter

Obviously, this profile benefits from the great advice on building a research group. I would expect someone with this profile to understand the social dynamics part relatively well, so the most useful advice is probably the detailed logistics of getting such a group off the ground. I also believe that the escalating asks and rewards is a less obvious social dynamic to take into account.

Aspiring researcher (no team or MIRIx chapter)

The section You and your research was probably written with this profile in mind. It tries to push towards exploration instead of exploitation, babble instead of prune. And for so many people that I know who feel obligated to understand everything before toying with a question, this is the prescribed medicine. I want to push back just a little against the “follow your curiosity” vibe, as I believe that there are ways to check how promising the current ideas are for AI Alignment. But I definitely understand that the audience is more “wannabe researchers stifled by their internal editor”, so pushing for curiosity and exploration makes sense.

Aspiring researcher who wants to build a team/MIRIx chapter

In additio
9Olli Järviniemi
I view this post as providing value in three (related) ways:

1. Making a pedagogical advancement regarding the so-called inner alignment problem
2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
3. Pushing for thinking mechanistically about cognition-updates

Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused. Some months later I read this post and then it clicked. Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles' exposition skills that's a bit overwhelming. Another part I liked were the phrases "Just because common English endows “reward” with suggestive pleasurable connotations" and "Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater." One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.

Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view. I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It's the former view that this post (correctly) argues against. I am sympathetic to pushback of the form "there are arguments that make it reasonable to privilege reward-maximization as a hypothesis" and about this post going a bit too far, but these remarks should not be confused with a rebuttal
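To make the "cognition-updater" framing concrete, here is a minimal sketch (my own toy illustration, not code from the post) of a REINFORCE-style update in PyTorch: the reward never appears as something the policy inspects or pursues; it only scales the gradient that reinforces whatever computation produced the sampled action.

```python
# Toy illustration: reward acts as a coefficient on the policy-gradient update
# (a "cognition-updater"), not as a quantity the network observes or optimizes.
import torch

policy = torch.nn.Linear(4, 2)                      # hypothetical tiny policy
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

obs = torch.randn(4)                                # one observation
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()
reward = 1.0                                        # supplied by the environment

loss = -reward * dist.log_prob(action)              # reward scales the update
optimizer.zero_grad()
loss.backward()
optimizer.step()                                    # reinforce that computation
```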
17Jeremy Gillen
This post deserves to be remembered as a LessWrong classic.

1. It directly tries to solve a difficult and important cluster of problems (whether it succeeds is yet to be seen).
2. It uses a new diagrammatic method of manipulating sets of independence relations.
3. It's a technical result! These feel like they're getting rarer on LessWrong and should be encouraged.

There are several problems that are fundamentally about attaching very different world models together and transferring information from one to the other.

* Ontology identification involves taking a goal defined in an old ontology[1] and accurately translating it into a new ontology.
* High-level models and low-level models need to interact in a bounded agent. I.e. learning a high-level fact should influence your knowledge about low-level facts and vice versa.
* Value identification is the problem of translating values from a human to an AI. This is much like ontology identification, with the added difficulty that we don't get as much detailed access or control over the human world model.
* Interpretability is about finding recognisable concepts and algorithms in trained neural networks.

In general, we can solve these problems using shared variables and shared sub-structures that are present in both models.

* We can stitch together very different world models along shared variables (see the sketch after this list). E.g. if you have two models of molecular dynamics, one faster and simpler than the other. You want to simulate in the fast one, then switch to the slow one when particular interactions happen. To transfer the state from one to the other you identify variables present in both models (probably atom locations, velocities, some others), then just copy these values to the other model. Under-specified variables must be inferred from priors.
* If you want to transfer a new concept from WM1 to a less knowledgeable WM2, you can do so by identifying the lower-level concepts that both WMs share, then constructing an "expla
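Here is a minimal sketch of the stitching idea from the molecular-dynamics example above (all variable names are hypothetical illustrations, not anything from the post): copy the variables both models share, and fill in anything the target model needs but the source lacks from a prior.

```python
# Toy illustration of stitching two world models along their shared variables.
fast_state = {"positions": [0.0, 1.2], "velocities": [0.3, -0.1]}

SHARED_VARS = ["positions", "velocities"]          # present in both models
SLOW_ONLY_VARS = ["electron_density"]              # extra low-level detail

def transfer(source_state, prior):
    """Build a slow-model state from a fast-model state."""
    slow_state = {var: source_state[var] for var in SHARED_VARS}
    for var in SLOW_ONLY_VARS:                     # under-specified variables
        slow_state[var] = prior(var)               # must come from a prior
    return slow_state

slow_state = transfer(fast_state, prior=lambda var: 0.0)  # placeholder prior
print(slow_state)
```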
6Davidmanheim
This post is both a huge contribution, giving a simpler and shorter explanation of a critical topic with far clearer context, and a useful thing to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)
8Daniel Murfet
I like the emphasis in this post on the role of patterns in the world in shaping behaviour, the fact that some of those patterns incentivise misaligned behaviour such as deception, and further that our best efforts at alignment and control are themselves patterns that could have this effect. I also like the idea that our control systems (even if obscured from the agent) can present as "errors" with respect to which the agent is therefore motivated to learn to "error correct". This post and the sharp left turn are among the most important high-level takes on the alignment problem for shaping my own views on where the deep roots of the problem are. Although to be honest I had forgotten about this post, and therefore underestimated its influence on me, until performing this review (which caused me to update a recent article I wrote, the Queen's Dilemma, which is clearly a kind of retelling of one aspect of this story, with an appropriate reference). I assess it to be a substantial influence on me even so. I think this whole line of thought could be substantially developed, and with less reliance on stories, and that this would be useful.
8DragonGod
Epistemic Status: I am an aspiring selection theorist and I have thoughts.

----------------------------------------

Why Selection Theorems?

Learning about selection theorems was very exciting. It's one of those concepts that felt so obviously right: a missing component in my alignment ontology that just clicked and made everything stronger.

Selection Theorems as a Compelling Agent Foundations Paradigm

There are many reasons to be sympathetic to agent-foundations-style safety research, as it most directly engages the hard problems/core confusions of alignment/safety. However, one concern with agent foundations research is that we might build sky-high abstraction ladders that grow increasingly disconnected from reality, abstractions that don't quite describe the AI systems we deal with in practice. I think that in presenting this post, Wentworth successfully sidestepped the problem. He presented an intuitive story for why the Selection Theorems paradigm would be fruitful: it's general enough to describe many paradigms of AI system development, yet concrete enough to say nontrivial/interesting things about the properties of AI systems (including properties that bear on their safety). Wentworth presents a few examples of extant selection theorems (most notably the coherence theorems) and later argues that selection theorems have a lot of research "surface area" and that new researchers could be onboarded (relatively) quickly. He also outlined concrete steps people interested in selection theorems could take to contribute to the program. Overall, I found this presentation of the case for selection theorems research convincing. I think that selection theorems provide a solid framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development of artificial intelligence. That is, regardless of what
6DragonGod
Epistemic Status: I don't actually know anything about machine learning or reinforcement learning and I'm just following your reasoning/explanation.

This does not actually follow. Policies return probability distributions over actions ("strategies"), and it's not necessarily the case that the output of the optimal policy in the current state is a pure strategy. Mixed strategies are especially important and may be optimal in multi-agent environments (a pure Nash equilibrium may not exist, but a mixed Nash equilibrium is guaranteed to exist). Though maybe for single-player decision making, optimal play never requires mixed strategies? For any mixed strategy, there may exist an action in that strategy's support (the set of actions that the strategy assigns positive probability to) whose expected return is not lower than that of the strategy itself? I think this may be the case for deterministic environments, but I'm too tired to work out the maths right now. IIRC randomised choice is mostly useful in multi-agent environments, environments where the environment has free variables in its transition rules that may be sensitive to your actions (i.e. the environment itself can be profitably modelled as an agent [where the state transitions are its actions]), or environments that are otherwise non-deterministic/stochastic (including stochastic behaviour that arises from uncertainty).

So I think greedy search for the action that attains the highest value under the optimal policy's action-value function is only equivalent to the optimal policy if the environment is:

* Deterministic
* Fully observable/the agent has perfect information
* Agent knows all the "laws of physics"/state transition rules of the environment
* Fixed low-level state transitions that do not depend on the agent

(I may be missing some other criteria necessary to completely obviate mixed strategies.)

I think these conditions are actually quite strong!
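A small sketch of the contrast being drawn (my own toy example with made-up numbers, not from the comment or the post): in a single-agent, deterministic, fully observable setting, acting greedily with respect to the optimal action-value function yields a pure strategy, whereas a matching-pennies-style multi-agent game has no pure optimum at all.

```python
# Toy illustration (hypothetical Q-values): greedy action choice from an optimal
# action-value function Q*, which is a pure strategy. This is the equivalence the
# comment says holds only under fairly strong conditions (deterministic,
# fully observable, known dynamics).

q_star = {
    "s0": {"left": 1.0, "right": 3.0},
    "s1": {"left": 2.5, "right": 2.5},
}

def greedy_action(state: str) -> str:
    actions = q_star[state]
    return max(actions, key=actions.get)

assert greedy_action("s0") == "right"

# By contrast, in matching pennies (a two-player game) no pure strategy is optimal;
# the unique Nash equilibrium mixes 50/50, which is why mixed strategies matter
# once other agents enter the picture.
mixed_equilibrium = {"heads": 0.5, "tails": 0.5}
```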
8Martin Randall
Does this look like a motte-and-bailey to you?

1. Bailey: GPTs are Predictors, not Imitators (nor Simulators).
2. Motte: The training task for GPTs is a prediction task.

The title and the concluding sentence both plainly advocate for (1), but it's not really touched by the overall post, and I think it's up for debate (related: reward is not the optimization target). Instead there is an argument for (2). Perhaps the intention of the final sentence was to oppose Simulators? If that's the case, cite it, be explicit. This could be a really easy thing for an editor to fix.

----------------------------------------

Does this look like a motte-and-bailey to you?

1. Bailey: The task that GPTs are being trained on is ... harder than the task of being a human.
2. Motte: Being an actual human is not enough to solve GPT's task.

As I read it, (1) is false: the task of being a human doesn't cap out at human intelligence. More intelligent humans are better at minimizing prediction error, achieving goals, inclusive genetic fitness, whatever you might think defines "the task of being a human". In the comments, Yudkowsky retreats to (2), which is true. But then how should I understand this whole paragraph from the post? If we're talking about how natural selection trained my genome, why are we talking about how well humans perform the human task? Evolution is optimizing over generations. My human task is optimizing over my lifetime. Also, if we're just arguing for different thinking, surely it mostly matters whether the training task is different, not whether it is harder?

----------------------------------------

Overall I think "Is GPT-N bounded by human capabilities? No." is a better post on the mottes and avoids staking out unsupported baileys. This entire topic is becoming less relevant because AIs are getting all sorts of synthetic data and RLHF and other training techniques thrown at them. The 2022 question of the capabilities of a hypothetical GPT-N that was only t
4Mikhail Samin
Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think

The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment. Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about.

For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on; they'll persist more, and what the neural network does as a whole will look a bit more like what its most helpful parts are doing. Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals.

Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things togeth
0magfrump
I want to have this post in a physical book so that I can easily reference it. It might actually work better as a standalone pamphlet, though. 
5Gunnar_Zarncke
I like many aspects of this post.

* It promotes using intuitions from humans. Using human, social, or biological approaches is neglected compared to approaches that are more abstract and general. It is also scalable, because people who wouldn't be able to work directly on the abstract approaches can still work on it.
* It reflects on a specific problem the author had and offers the same approach to readers.
* It uses concrete examples to illustrate.
* It is short and accessible.
32habryka
I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.

A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off.

Like, I agree with these arguments, but like, if you believe these arguments, having traction on AI Alignment becomes much harder, and a lot of things that people currently label "AI Alignment" kind of stop feeling real, and I have this feeling that even though a really quite substantial fraction of the people I talk to about AI Alignment are compelled by Eliezer's argument for difficulty, there is some kind of structural reason that AI Alignment as a field can't really track these arguments.

Like, a lot of people's jobs and funding rely on these arguments being false, and also, if these arguments are correct, the space of perspectives on the problem suddenly loses a lot of common ground on how to proceed or what to do, and it isn't really obvious that you even want an "AI Alignment field" or lots of "AI Alignment research organizations" or "AI Alignment student groups". Like, because we don't know how to solve this problem, it really isn't clear what the right type of social organization is, and there aren't obviously great gains from trade, and so from a coalition perspective, you don't get a coalition of people who think these arguments are real.

I feel deeply confused about this. Over the last two years, I think I wrongly ended up just kind of investing into an ecosystem of people that somewhat structurally can't really handle these arguments, and makes plans that assume that these arguments are false, and in doing so actually mostly makes the world worse, by having a far too optimistic stance on the differen
19adamShimi
Selection vs Control is a distinction I always point to when discussing optimization. Yet these are not the two takes on optimization I generally use. My favored ones are internal optimization (which is basically search/selection) and external optimization (optimizing systems from Alex Flint's The ground of optimization). So I do without control, or at least without Abram's exact definition of control.

Why? Simply because the internal structure vs behavior distinction mentioned in this post seems more important than the actual definitions (which seem constrained by going back to Yudkowsky's optimization power). The big distinction is between doing internal search (like in optimization algorithms or mesa-optimizers) and acting so as to optimize something. It is intuitive that you can do the second without the first, but before Alex Flint's definition, I couldn't put into words my intuition that the first implies the second.

So my current picture of optimization is Internal Optimization (Internal Search/Selection) ⊂ External Optimization (Optimizing Systems). This means that I think of this post as one of the first instances of grappling with this distinction, without agreeing completely with the way it ends up making that distinction.
8johnswentworth
In a field like alignment or embedded agency, it's useful to keep a list of one or two dozen ideas which seem like they should fit neatly into a full theory, although it's not yet clear how. When working on a theoretical framework, you regularly revisit each of those ideas, and think about how it fits in. Every once in a while, a piece will click, and another large chunk of the puzzle will come together.

Selection vs control is one of those ideas. It seems like it should fit neatly into a full theory, but it's not yet clear what that will look like. I revisit the idea pretty regularly (maybe once every 3-4 months) to see how it fits with my current thinking. It has not yet had its time, but I expect it will (that's why it's on the list, after all).

Bearing in mind that the puzzle piece has not yet properly clicked, here are some current thoughts on how it might connect to other pieces (a toy sketch of the type-signature contrast follows after this list):

* Selection and control have different type signatures.
  * A selection process optimizes for the values of variables in some model, which may or may not correspond to anything in the real world. Human values seem to be like this - see Human Values Are A Function Of Humans' Latent Variables.
  * A control process, on the other hand, directly optimizes things in its environment. A thermostat, for instance, does not necessarily contain any model of the temperature a few minutes in the future; it just directly optimizes the value of the temperature a few minutes in the future.
* The post basically says it, but it's worth emphasizing: reinforcement learning is a control process, expected utility maximization is a selection process. The difference in type signatures between RL and EU maximization is the same as the difference in type signatures between selection and control.
* Inner and outer optimizers can have different type signatures: an outer controller (e.g. RL) can learn an inner selector (e.g. utility maximizer), or an outer selector (e.g. a human) can build an inner controller (e
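Here is one way to make the type-signature contrast concrete, as a toy sketch of my own (the `select`/`thermostat` names and numbers are invented for illustration): the selector optimizes over variables inside a model and only then emits a choice, while the controller maps each observation of the actual environment directly to an action.

```python
# Toy sketch of the two type signatures.
# Selection: search over candidate plans *inside a model*, return the best one.
# Control: a feedback rule mapping each real observation directly to an action.

def select(candidates, model, objective):
    """Selector: plan -> modeled outcome -> score; the optimization happens in the model."""
    return max(candidates, key=lambda plan: objective(model(plan)))

def thermostat(observed_temp: float, target: float = 21.0) -> str:
    """Controller: no model of future temperature, just a feedback rule on observations."""
    return "heat_on" if observed_temp < target else "heat_off"

# Example selector use: pick the plan whose *modeled* outcome scores highest.
best_plan = select(
    candidates=["plan_a", "plan_b"],
    model=lambda plan: {"plan_a": 0.3, "plan_b": 0.8}[plan],  # stand-in world model
    objective=lambda outcome: outcome,
)
```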
7TurnTrout
Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt. I wish I had written the key lessons and insights more plainly. I think I got a bit carried away with in-group terminology and linguistic conventions, which limited the reach and impact of these insights. I am less wedded to "think about what shards will form and make sure they don't care about bad stuff (like reward)", because I think we won't get intrinsically agentic policy networks. I think the most impactful AIs will be LLMs+tools+scaffolding, with the LLMs themselves being "tool AI."
12johnswentworth
This post is an excellent distillation of a cluster of past work on malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.

I've long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, seems like a good time to walk through them.

In Solomonoff Model, Sufficiently Large Data Rules Out Malignness

There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:

... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.

Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)

... but then how the hell does this outside-view argument jibe with all the inside-view arguments about malign agents in the prior?

Reflection Breaks The Large-Data Guarantees

There's an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in every compu
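For readers who want the guarantee in symbols, here is one standard way of stating it (my paraphrase, not the post's own formulation, and stated only up to additive constants): because the Solomonoff prior $M$ dominates every computable measure $\mu$, its cumulative log-loss can exceed that of the true environment by at most roughly the environment's description length $K(\mu)$.

$$ M(x_{1:n}) \;\ge\; 2^{-K(\mu)+O(1)}\,\mu(x_{1:n}) \qquad\Longrightarrow\qquad -\log_2 M(x_{1:n}) \;\le\; -\log_2 \mu(x_{1:n}) \;+\; K(\mu) + O(1) \quad \text{for all } n. $$

The additive bound is independent of $n$, which is the sense in which, with enough data, no computable predictor (malign or otherwise) can systematically out-predict the Solomonoff prior.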
28Vanessa Kosoy
This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it. Instead, I will focus on what I believe is the post's main weakness/oversight.

Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later).

Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses", which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an in
7johnswentworth
Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.

Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.

Warning: Inductive Gap

This post builds on top of two important pieces for modelling embedded agents which don't have their own posts (to my knowledge). The pieces are:

* Lazy world models
* Lazy utility functions (or value functions more generally)

In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.

Lazy World Models

One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They're embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up.

That sounds tricky at first, but if you've done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily - i.e. only query for list elements which we actually need, and never actually iterate over the whole thing. In the same way, we can represent
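The generator trick described above looks roughly like this in Python (a minimal sketch of my own, not code from the post): the "infinite list" occupies constant memory because elements are produced only when queried.

```python
from itertools import islice

# Minimal sketch of a lazily-evaluated "infinite list": the generator below
# represents all natural numbers, but only the elements we actually ask for
# are ever materialized.
def naturals():
    n = 0
    while True:
        yield n
        n += 1

# Query only what we need; the full (infinite) structure is never built.
first_five = list(islice(naturals(), 5))   # [0, 1, 2, 3, 4]

# The same idea lets a finite data structure stand in for a world model that is
# "bigger" than the memory holding it: parts of the world are generated on demand.
```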
9Vanessa Kosoy
This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then: how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.

The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.

The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.

(What follows is no longer a "review" per se, but rather a summary of my own thoughts on the topic.)

Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables. Fix a set of actions $A$ and a set of observations $O$. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states $S_{\mathrm{ont}}$, an initial state $s^0_{\mathrm{ont}} \in S_{\mathrm{ont}}$, a transition infra-kernel $T_{\mathrm{ont}} : S_{\mathrm{ont}} \times A \to \square(S_{\mathrm{ont}} \times O)$, and a reward functio
17Vanessa Kosoy
In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications.

The key paragraph, which summarizes the definition itself, is the following:

In fact, "continues to exhibit this tendency with respect to the same target configuration set despite perturbations" is redundant: clearly, as long as the perturbation doesn't push the system out of the basin, the tendency must continue. This is what is known as an "attractor" in dynamical systems theory. For comparison, here is the definition of "attractor" from Wikipedia:

The author acknowledges this connection, although he also makes the following remark:

I find this remark confusing. An attractor that operates along a subset of the dimensions is just an attractor submanifold. This is completely standard in dynamical systems theory.

Given that the definition itself is not especially novel, the post's main claim to value is via the applications. Unfortunately, some of the proposed applications seem to me poorly justified. Specifically, I want to talk about two major examples: the claimed relationship to embedded agency and the claimed relation to comprehensive AI services.

In both cases, the main shortcoming of the definition is that there is an essential property of AI that this definition doesn't capture at all. The author does acknowledge that "goal-directed agent system" is a distinct concept from "optimizing systems". However, he doesn't explain how they are distinct. One way to formulate the difference is as follows: agency = optimization + learning. An agent is not just capable of steering a particular universe towards a certain outc
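As a toy illustration of "a tendency toward a target configuration set despite perturbations" (my own sketch with made-up dynamics, not anything from the post or the review): the system below keeps returning to the neighbourhood of its target even when the state is randomly knocked around, which is exactly the attractor/basin-of-attraction picture the review points to.

```python
import random

# Toy optimizing system (hypothetical dynamics): the state drifts toward a target set
# and keeps doing so even when perturbed, i.e. the target is an attractor and the
# perturbations stay within its basin of attraction.

TARGET = 0.0

def step(state: float) -> float:
    return state - 0.1 * (state - TARGET)   # simple contraction toward the target

state = 10.0
for t in range(500):
    state = step(state)
    if t % 100 == 0:
        state += random.uniform(-1.0, 1.0)   # external perturbation

assert abs(state - TARGET) < 1.0   # the system still ends up near the target set
```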
10PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
15Writer
In this post, I appreciated two ideas in particular:

1. Loss as chisel
2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and of its implications for what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities.

In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end, I expect to learn something if I investigate why that is.

The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and
7Vanessa Kosoy
This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning, and see what kind of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

* The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably this is false for cartesian agents because of the different biases resulting from the different physical points of view).
* A possible formalization of the agreement theorem inspired by my richness of mathematics conjecture: Given two beliefs $\Psi$ and $\Phi$, we say that $\Psi \preceq \Phi$ when some conditioning of $\Psi$ on a finite set of observations produces a refinement of some conditioning of $\Phi$ on a finite set of observations (see linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form $\Psi_0 \prec \Psi_1 \prec \Psi_2 \prec \dots$ Here, the sequence can be over physical time, or over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences $\{\Psi_i\}$ and $\{\Phi_i\}$. The agreement theorem can then state that for all $i \in \mathbb{N}$, there exists $j$