Some small corrections/additions to my section ("Altair agent foundations"). I'm currently calling it "Dovetail research". That's not publicly written anywhere yet, but if it were listed as that here, it might help people who are searching for it later this year.
Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake
I wouldn't put number 9. It's not intended to "solve" most of these problems, but it is intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agendas.
Target case: worst-case
definitely not worst-case, more like pessimistic-case
Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K
Add "José Pedro Faustino"
Estimated # FTEs: 1-10
I'd call it 2, averaged throughout 2024.
Some outputs in 2024: mostly exposition but it’s early days
[Image from aisafety.world]
The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor.
The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options and the standing critiques; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard).
“AI safety” means many things. We’re targeting work that intends to prevent very competent cognitive systems from having large unintended effects on the world.
This time we also made an effort to identify work that doesn’t show up on LW/AF by trawling conferences and arXiv. Our list is not exhaustive, particularly for academic work, which is unindexed and hard to discover. If we missed you or got something wrong, please comment and we will edit.
The method section is important, but we put it at the bottom anyway.
Here’s a spreadsheet version also.
Editorial
Agendas with public outputs
1. Understand existing models
Evals
(Figuring out how trained models behave. Arguably not itself safety work but a useful input.)
Various capability and safety evaluations
Various red-teams
Eliciting model anomalies
Interpretability
(Figuring out what a trained model is actually computing.[3])
Good-enough mech interp
Sparse Autoencoders
Simplex: computational mechanics for interp
Pr(Ai)2R: Causal Abstractions
Concept-based interp
Leap
EleutherAI interp
Understand learning
(Figuring out how the model figured it out.)
Timaeus: Developmental interpretability
Saxe lab
See also
2. Control the thing
(Figuring out how to predictably detect and quash misbehaviour.)
Iterative alignment
Control evaluations
Guaranteed safe AI
Assistance games / reward learning
Social-instinct AGI
Prevent deception and scheming
(through methods besides mechanistic interpretability)
Mechanistic anomaly detection
Cadenza
Faithful CoT through separation and paraphrasing
Indirect deception monitoring
See also “retarget the search”.
Surgical model edits
(interventions on model internals)
Activation engineering
See also unlearning.
Goal robustness
(Figuring out how to keep the model doing what it has been doing so far.)
Mild optimisation
3. Safety by design
(Figuring out how to avoid using deep learning.)
Conjecture: Cognitive Software
See also parts of Guaranteed Safe AI involving world models and program synthesis.
4. Make AI solve it
(Figuring out how models might help with figuring it out.)
Scalable oversight
(Figuring out how to get AI to help humans supervise models.)
OpenAI Superalignment
Automated Alignment Research
Weak-to-strong generalization
Supervising AIs improving AIs
Cyborgism
Transluce
Task decomposition
Recursive reward modelling is supposedly not dead but instead one of the tools OpenAI will build. Another line tries to make something honest out of chain of thought / tree of thought.
Adversarial
Deepmind Scalable Alignment
Anthropic: Bowman/Perez
Latent adversarial training
See also FAR (below). See also obfuscated activations.
5. Theory
(Figuring out what we need to figure out, and then doing that.)
The Learning-Theoretic Agenda
Question-answer counterfactual intervals (QACI)
Understanding agency
(Figuring out ‘what even is an agent’ and how it might be linked to causality.)
Causal Incentives
Hierarchical agency
(Descendents of) shard theory
Dovetail research
boundaries / membranes
Understanding optimisation
Corrigibility
(Figuring out how we get superintelligent agents to keep listening to us. Arguably, the scalable oversight agendas are ~atheoretical approaches to this.)
Behavior alignment theory
Ontology Identification
(Figuring out how AI agents think about the world and how to get superintelligent agents to tell us what they know. Much of interpretability is incidentally aiming at this. See also latent knowledge.)
Natural abstractions
ARC Theory: Formalizing heuristic arguments
Understand cooperation
(Figuring out how inter-AI and AI/human game theory should or would work.)
Pluralistic alignment / collective intelligence
Center on Long-Term Risk (CLR)
FOCAL
Alternatives to utility theory in alignment
See also: Chris Leong's Wisdom Explosion
6. Miscellaneous
(those hard to classify, or those making lots of bets rather than following one agenda)
AE Studio
Anthropic Alignment Capabilities / Alignment Science / Assurance / Trust & Safety / RSP Evaluations
Apart Research
Apollo
Cavendish Labs
Center for AI Safety (CAIS)
CHAI
Deepmind Alignment Team
Elicit (ex-Ought)
FAR
Krueger Lab (Mila)
MIRI
NSF SLES
OpenAI
Superalignment
Safety Systems
OpenAI Alignment Science
OpenAI Safety and Security Committee
OpenAI AGI Readiness / Mission Alignment
Palisade Research
Tegmark Group / IAIFI
UK AI Safety Institute
US AI Safety Institute
Agendas without public outputs this year
Graveyard (known to be inactive)
Method
We again omit technical governance, AI policy, and activism. This is even more of an omission than it was last year, so see other reviews.
We started with last year’s list and moved any agendas without public outputs this year into their own section. We also listed agendas known to be inactive in the Graveyard.
An agenda is an odd unit; it can be larger than one team, and researchers and agendas are often in a many-to-many relation. The framing also excludes illegible or exploratory research – anything which doesn’t have a manifesto.
All organisations have private info, and in all cases we’re working off public info, so remember that we will be systematically off by some measure.
We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
As they are largely outside the scope of this review, subproblem 6 (Pivotal processes likely require incomprehensibly complex plans) does not appear at all, and the following appear only scarcely, with large error bars for accuracy:
We added some new agendas, including by scraping relevant papers from arXiv and ML conferences. We scraped every Alignment Forum post and reviewed the top 100 posts by karma and novelty. The inclusion criterion is vibes: whether it seems relevant to us.
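For concreteness, here is a minimal sketch of the kind of arXiv keyword trawl described above, using only Python’s standard library and arXiv’s public Atom API. It is not our actual pipeline, and the example query term is just a placeholder; the real work was skimming the returned titles and abstracts by hand.

```python
# Minimal sketch of an arXiv keyword trawl (illustrative only, not our actual script).
# Queries arXiv's public Atom API and prints recent matching titles for manual review.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"

def search_arxiv(query: str, max_results: int = 100):
    """Yield (title, abstract, link) for papers matching an arXiv API query string."""
    params = urllib.parse.urlencode({
        "search_query": query,       # e.g. 'all:"AI alignment"'
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    })
    with urllib.request.urlopen(f"{ARXIV_API}?{params}") as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(f"{ATOM}entry"):
        title = " ".join(entry.findtext(f"{ATOM}title", "").split())
        abstract = " ".join(entry.findtext(f"{ATOM}summary", "").split())
        link = entry.findtext(f"{ATOM}id", "").strip()
        yield title, abstract, link

if __name__ == "__main__":
    # Placeholder query; the real keyword list was longer and curated by hand.
    for title, _, link in search_arxiv('all:"AI alignment"', max_results=25):
        print(f"- {title}\n  {link}")
```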
We dropped the operational criteria this year because we made our point last year and they’re clutter.
Lastly, we asked some reviewers to comment on the draft.
Other reviews and taxonomies
Acknowledgments
Thanks to Vanessa Kosoy, Nora Ammann, Erik Jenner, Justin Shovelain, Gabriel Alfour, Raymond Douglas, Walter Laurito, Shoshannah Tekofsky, Jan Hendrik Kirchner, Dmitry Vaintrob, Leon Lang, Tushita Jha, Leonard Bereska, and Mateusz Bagiński for comments. Thanks to Joe O’Brien for sharing their taxonomy. Thanks to our Manifund donors and to OpenPhil for top-up funding.
Vanessa Kosoy notes: ‘IMHO this is a very myopic view. I don't believe plain foundation models will be transformative, and even in the world in which they will be transformative, it will be due to implicitly doing RL "under the hood".’
Also, actually, Christiano’s original post is about the alignment of prosaic AGI, not the prosaic alignment of AGI.
This is fine as a standalone description, but in practice lots of interp work is aimed at interventions for alignment or control. This is one reason why there’s no overarching “Alignment” category in our taxonomy.
Often less strict than formal verification but "directionally approaching it": probabilistic checking.
Nora Ammann notes: “I typically don’t cash this out into preferences over future states, but what parts of the statespace we define as safe / unsafe. In SgAI, the formal model is a declarative model, not a model that you have to run forward. We also might want to be more conservative than specifying preferences and instead "just" specify unsafe states -- i.e. not ambitious intent alignment.”
Satron adds that this is lacking in concrete criticism and that more expansion on object-level problems would be useful.
Scheming can be a problem before this point, obviously. It could just be too expensive to catch AIs that aren't smart enough to fool human experts.
Indirectly; Kosoy is funded for her work on Guaranteed Safe AI.
Called “average-case” in Ngo’s post.
Our addition.