AI ALIGNMENT FORUM
All Posts
Sorted by Magic (New & Upvoted)
Karma · Title · Author(s) · Posted · Comments

25 · AXRP Episode 22 - Shard Theory with Quintin Pope · DanielFilan · 10h · 0
18 · Instrumental Convergence? [Draft] · J. Dmitri Gallow · 1d · 0
56 · ARC is hiring theoretical researchers · Paul Christiano, Jacob Hilton, Mark Xu · 3d · 3
57 · Inference-Time Intervention: Eliciting Truthful Answers from a Language Model · likenneth · 5d · 0
23 · MetaAI: less is less for alignment. · Cleo Nardo · 3d · 1
27 · TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI · Andrew Critch · 3d · 0
136 · Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · Dan H · 17d · 11
24 · Introduction to Towards Causal Foundations of Safe AGI · Tom Everitt, Lewis Hammond, Francis Rhys Ward, Ryan Carey, James Fox, Matt MacDermott, Sebastian Benthall · 3d · 0
69 · What will GPT-2030 look like? · Jacob Steinhardt · 8d · 0
16 · Contingency: A Conceptual Tool from Evolutionary Biology for Alignment · clem_acs · 3d · 0
47 · Algorithmic Improvement Is Probably Faster Than Scaling Now · johnswentworth · 10d · 0
49 · Takeaways from the Mechanistic Interpretability Challenges · Stephen Casper · 7d · 1
82 · Announcing Apollo Research · Marius Hobbhahn, Beren Millidge, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer · 17d · 4
12 · Explicitness · Tsvi Benson-Tilsen · 4d · 0
34 · A Playbook for AI Risk Reduction (focused on misaligned AI) · HoldenKarnofsky · 9d · 2
109 · Steering GPT-2-XL by adding an activation vector · Alex Turner, Monte MacDiarmid, David Udell, lisathiergart, Ulisse Mini · 1mo · 54
21 · How biosafety could inform AI standards · Olivia Jimenez · 7d · 3
9 · formalizing the QACI alignment formal-goal · Tamsin Leake, JuliaHP · 6d · 0
5 · an Evangelion dialogue explaining the QACI alignment plan · Tamsin Leake · 6d · 0
29 · A comparison of causal scrubbing, causal abstractions, and related methods · Erik Jenner, Adrià Garriga-Alonso, Egor Zverev · 7d · 0
43 · Think carefully before calling RL policies "agents" · Alex Turner · 14d · 7
46 · Sentience matters · Nate Soares · 17d · 0
40 · Cosmopolitan values don't come free · Nate Soares · 16d · 0
52 · Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · Spencer Becker-Kahn · 15d · 0
21 · An Exercise to Build Intuitions on AGI Risk · Lauro Langosco · 8d · 2
54 · Conjecture internal survey: AGI timelines and probability of human extinction from advanced AI · Maris Sala · 25d · 0
48 · Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) · Scott Emmons · 15d · 0
23 · Wikipedia as an introduction to the alignment problem · SoerenMind · 17d · 0
29 · Uncertainty about the future does not imply that AGI will go well · Lauro Langosco · 14d · 0
34 · Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · Rubi J. Hudson, Johannes Treutlein · 20d · 11
61 · The Waluigi Effect (mega-post) · Cleo Nardo · 3mo · 24
15 · Wildfire of strategicness · Tsvi Benson-Tilsen · 11d · 8
99 · GPTs are Predictors, not Imitators · Eliezer Yudkowsky · 2mo · 8
18 · How to Think About Activation Patching · Neel Nanda · 12d · 3
51 · When is Goodhart catastrophic? · Drake Thomas, Thomas Kwa · 23d · 4
19 · PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" · Lukas Finnveden · 16d · 6
146 · SolidGoldMagikarp (plus, prompt generation) · Jessica Rumbelow, mwatkins · 4mo · 16
13 · Shutdown-Seeking AI · Simon Goldstein · 15d · 4
20 · TinyStories: Small Language Models That Still Speak Coherent English · Ulisse Mini · 18d · 1
66 · Prizes for matrix completion problems · Paul Christiano · 1mo · 0
31 · Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2 · Stefan Heimersheim, Marius Hobbhahn · 22d · 0
15 · Unfaithful Explanations in Chain-of-Thought Prompting · miles · 13d · 0
77 · My Objections to "We're All Gonna Die with Eliezer Yudkowsky" · Quintin Pope · 3mo · 63
52 · Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1 · Stefan Heimersheim, Marius Hobbhahn · 1mo · 0
37 · LeCun's "A Path Towards Autonomous Machine Intelligence" has an unsolved technical alignment problem · Steve Byrnes · 1mo · 10
29 · 'Fundamental' vs 'applied' mechanistic interpretability research · Lee Sharkey · 23d · 2
30 · Some background for reasoning about dual-use alignment research · Charlie Steiner · 1mo · 1
45 · Clarifying and predicting AGI · Richard Ngo · 1mo · 13
41 · AGI safety career advice · Richard Ngo · 1mo · 12
11 · Boomerang - protocol to dissolve some commitment races · Filip Sondej · 17d · 0