AI ALIGNMENT FORUM
All Posts
Sorted by Magic (New & Upvoted)
2018
Frontpage Posts
Karma | Title | Author(s) | Posted | Comments
37 | 2018 AI Alignment Literature Review and Charity Comparison | Larks | 2y | 4
35 | Embedded Agents | Abram Demski, Scott Garrabrant | 2y | 7
11 | An Untrollable Mathematician Illustrated | Abram Demski | 2y | 2
31 | Realism about rationality | ricraz | 2y | 50
22 | The Rocket Alignment Problem | Eliezer Yudkowsky | 2y | 5
11 | Challenges to Christiano’s capability amplification proposal | Eliezer Yudkowsky | 2y | 2
2 | Prisoners' Dilemma with Costs to Modeling | Scott Garrabrant | 2y | 1
8 | Robustness to Scale | Scott Garrabrant | 2y | 6
27 | Subsystem Alignment | Abram Demski, Scott Garrabrant | 2y | 3
21 | Embedded Agency (full-text version) | Scott Garrabrant, Abram Demski | 2y | 1
23 | Towards a New Impact Measure | Alex Turner | 2y | 119
30 | Robust Delegation | Abram Demski, Scott Garrabrant | 2y | 2
24 | Decision Theory | Abram Demski, Scott Garrabrant | 2y | 8
11 | Toward a New Technical Explanation of Technical Explanation | Abram Demski | 2y | 2
6 | Sources of intuitions and data on AGI | Scott Garrabrant | 2y | 0
24 | Two Neglected Problems in Human-AI Safety | Wei Dai | 2y | 20
23 | Introducing the AI Alignment Forum (FAQ) | Oliver Habryka, Ben Pace, Raymond Arnold, Jim Babcock | 2y | 0
26 | Coherence arguments do not imply goal-directed behavior | Rohin Shah | 2y | 45
8 | Open question: are minimal circuits daemon-free? | Paul Christiano | 2y | 9
18 | Why we need a *theory* of human values | Stuart Armstrong | 1y | 6
9 | Paul's research agenda FAQ | Alex Zhu | 2y | 31
24 | Embedded World-Models | Abram Demski, Scott Garrabrant | 2y | 5
4 | Prize for probable problems | Paul Christiano | 2y | 0
3 | Optimization Amplifies | Scott Garrabrant | 2y | 3
11 | When does rationality-as-search have nontrivial implications? | nostalgebraist | 2y | 4
1 | Announcement: AI alignment prize round 3 winners and next round | Vladimir Slepnev | 2y | 0
3 | Alignment Newsletter #13: 07/02/18 | Rohin Shah | 2y | 0
21 | Embedded Curiosities | Scott Garrabrant, Abram Demski | 2y | 0
13 | Topological Fixed Point Exercises | Scott Garrabrant, Sam Eisenstat | 2y | 40
6 | Beyond Astronomical Waste | Wei Dai | 2y | 9
5 | Announcing AlignmentForum.org Beta | Raymond Arnold | 2y | 17
8 | History of the Development of Logical Induction | Scott Garrabrant | 2y | 2
2 | Counterfactual Mugging Poker Game | Scott Garrabrant | 2y | 0
8 | Arguments about fast takeoff | Paul Christiano | 2y | 1
19 | Clarifying "AI Alignment" | Paul Christiano | 2y | 73
21 | Three AI Safety Related Ideas | Wei Dai | 2y | 30
22 | In Logical Time, All Games are Iterated Games | Abram Demski | 2y | 7
2 | A general model of safety-oriented AI development | Wei Dai | 2y | 5
8 | Comment on decision theory | Rob Bensinger | 2y | 13
2 | Understanding Iterated Distillation and Amplification: Claims and Oversight | William Saunders | 2y | 0
2 | Worrying about the Vase: Whitelisting | Alex Turner | 2y | 2
12 | Humans can be assigned any values whatsoever… | Stuart Armstrong | 2y | 0
20 | Preface to the sequence on value learning | Rohin Shah | 2y | 5
2 | Beware of black boxes in AI alignment research | Vladimir Slepnev | 2y | 0
14 | Assuming we've solved X, could we do Y... | Stuart Armstrong | 2y | 6
13 | Bottle Caps Aren't Optimisers | DanielFilan | 2y | 9
8 | Mathematical Mindset | komponisto | 2y | 0
4 | Another take on agent foundations: formalizing zero-shot reasoning | Alex Zhu | 2y | 0
18 | Reasons compute may not drive AI capabilities growth | Kythe | 2y | 10
14 | Bayesian Probability is for things that are Space-like Separated from You | Scott Garrabrant | 2y | 1
2 | Impact Measure Desiderata | Alex Turner | 2y | 38
19 | EDT solves 5 and 10 with conditional oracles | Jessica Taylor | 2y | 5
9 | Multi-agent predictive minds and AI alignment | Jan_Kulveit | 2y | 0
2 | Can corrigibility be learned safely? | Wei Dai | 2y | 15
12 | Future directions for ambitious value learning | Rohin Shah | 2y | 6
8 | Fixed Point Discussion | Scott Garrabrant | 2y | 0
3 | Specification gaming examples in AI | Samuel Rødal | 2y | 5
12 | What is ambitious value learning? | Rohin Shah | 2y | 12
11 | Mechanistic Transparency for Machine Learning | DanielFilan | 2y | 5
6 | Specification gaming examples in AI | Vika | 2y | 8
9 | Corrigibility | Paul Christiano | 2y | 2
7 | Alignment Newsletter #34 | Rohin Shah | 2y | 0
8 | Fixed Point Exercises | Scott Garrabrant | 2y | 0
12 | Factored Cognition | Andreas Stuhlmüller | 2y | 1
13 | Reducing collective rationality to individual optimization in common-payoff games using MCMC | Jessica Taylor | 2y | 10
10 | The easy goal inference problem is still hard | Paul Christiano | 2y | 1
3 | Buridan's ass in coordination games | Jessica Taylor | 2y | 26
10 | New safety research agenda: scalable agent alignment via reward modeling | Vika | 2y | 12
3 | UDT can learn anthropic probabilities | Vladimir Slepnev | 2y | 0
1 | Non-Adversarial Goodhart and AI Risks | Davidmanheim | 2y | 0
10 | The Steering Problem | Paul Christiano | 2y | 2
11 | Prosaic AI alignment | Paul Christiano | 2y | 0
5 | Alignment Newsletter #33 | Rohin Shah | 2y | 0
8 | Policy Approval | Abram Demski | 2y | 13
4 | Bridging syntax and semantics, empirically | Stuart Armstrong | 2y | 0
8 | (A -> B) -> A | Scott Garrabrant | 2y | 2
0 | Self-regulation of safety in AI research | G Gordon Worley III | 2y | 0
7 | Using expected utility for Good(hart) | Stuart Armstrong | 2y | 1
3 | Petrov corrigibility | Stuart Armstrong | 2y | 9
9 | Agents That Learn From Human Behavior Can't Learn Human Values That Humans Haven't Learned Yet | steven0461 | 2y | 5
7 | Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence. | Ryan Carey | 2y | 0
7 | Asymptotic Decision Theory (Improved Writeup) | Diffractor | 2y | 13
3 | Counterfactuals, thick and thin | Nisan | 2y | 4
4 | Overcoming Clinginess in Impact Measures | Alex Turner | 2y | 5
1 | Alignment Newsletter #27 | Rohin Shah | 2y | 0
4 | Alignment Newsletter #29 | Rohin Shah | 2y | 0
6 | Diagonalization Fixed Point Exercises | Scott Garrabrant, Sam Eisenstat | 2y | 13
8 | Model Mis-specification and Inverse Reinforcement Learning | Owain Evans, Jacob Steinhardt | 2y | 0
11 | Discussion on the machine learning approach to AI safety | Vika | 2y | 2
1 | Alignment Newsletter #16: 07/23/18 | Rohin Shah | 2y | 0
6 | Anthropic probabilities and cost functions | Stuart Armstrong | 2y | 0
2 | Knowledge is Freedom | Scott Garrabrant | 2y | 1
9 | Acknowledging Human Preference Types to Support Value Learning | Nandi, Sabrina and Erin | 2y | 0
0 | Idea: Open Access AI Safety Journal | G Gordon Worley III | 2y | 0
13 | New DeepMind AI Safety Research Blog | Vika | 2y | 0
8 | Hyperreal Brouwer | Scott Garrabrant | 2y | 0
5 | Alignment Newsletter #32 | Rohin Shah | 2y | 0
12 | Intuitions about goal-directed behavior | Rohin Shah | 2y | 5
9 | Penalizing Impact via Attainable Utility Preservation | Alex Turner | 2y | 0
7 | Alignment Newsletter #37 | Rohin Shah | 2y | 2
Showing 100 of 254 posts.