2018
Frontpage Posts
37
2018 AI Alignment Literature Review and Charity Comparison
Larks
2y
4
35
Embedded Agents
Abram Demski, Scott Garrabrant
2y
7
11
An Untrollable Mathematician Illustrated
Abram Demski
2y
2
31
Realism about rationality
ricraz
2y
50
22
The Rocket Alignment Problem
Eliezer Yudkowsky
2y
5
11
Challenges to Christiano’s capability amplification proposal
Eliezer Yudkowsky
2y
2
2
Prisoners' Dilemma with Costs to Modeling
Scott Garrabrant
2y
1
8
Robustness to Scale
Scott Garrabrant
2y
6
27
Subsystem Alignment
Abram Demski, Scott Garrabrant
2y
3
21
Embedded Agency (full-text version)
Scott Garrabrant, Abram Demski
2y
1
23
Towards a New Impact Measure
Alex Turner
2y
119
30
Robust Delegation
Abram Demski, Scott Garrabrant
2y
2
24
Decision Theory
Abram Demski, Scott Garrabrant
2y
8
11
Toward a New Technical Explanation of Technical Explanation
Abram Demski
2y
2
6
Sources of intuitions and data on AGI
Scott Garrabrant
2y
0
24
Two Neglected Problems in Human-AI Safety
Wei Dai
2y
20
23
Introducing the AI Alignment Forum (FAQ)
Oliver Habryka, Ben Pace, Raymond Arnold, Jim Babcock
2y
0
26
Coherence arguments do not imply goal-directed behavior
Rohin Shah
2y
45
8
Open question: are minimal circuits daemon-free?
Paul Christiano
2y
9
18
Why we need a *theory* of human values
Stuart Armstrong
1y
6
9
Paul's research agenda FAQ
Alex Zhu
2y
31
24
Embedded World-Models
Abram Demski, Scott Garrabrant
2y
5
4
Prize for probable problems
Paul Christiano
2y
0
3
Optimization Amplifies
Scott Garrabrant
2y
3
11
When does rationality-as-search have nontrivial implications?
nostalgebraist
2y
4
1
Announcement: AI alignment prize round 3 winners and next round
Vladimir Slepnev
2y
0
3
Alignment Newsletter #13: 07/02/18
Rohin Shah
2y
0
21
Embedded Curiosities
Scott Garrabrant, Abram Demski
2y
0
13
Topological Fixed Point Exercises
Scott Garrabrant, Sam Eisenstat
2y
40
6
Beyond Astronomical Waste
Wei Dai
2y
9
5
Announcing AlignmentForum.org Beta
Raymond Arnold
2y
17
8
History of the Development of Logical Induction
Scott Garrabrant
2y
2
2
Counterfactual Mugging Poker Game
Scott Garrabrant
2y
0
8
Arguments about fast takeoff
Paul Christiano
2y
1
19
Clarifying "AI Alignment"
Paul Christiano
2y
73
21
Three AI Safety Related Ideas
Wei Dai
2y
30
22
In Logical Time, All Games are Iterated Games
Abram Demski
2y
7
2
A general model of safety-oriented AI development
Wei Dai
2y
5
8
Comment on decision theory
Rob Bensinger
2y
13
2
Understanding Iterated Distillation and Amplification: Claims and Oversight
William Saunders
2y
0
2
Worrying about the Vase: Whitelisting
Alex Turner
2y
2
12
Humans can be assigned any values whatsoever…
Stuart Armstrong
2y
0
20
Preface to the sequence on value learning
Rohin Shah
2y
5
2
Beware of black boxes in AI alignment research
Vladimir Slepnev
2y
0
14
Assuming we've solved X, could we do Y...
Stuart Armstrong
2y
6
13
Bottle Caps Aren't Optimisers
DanielFilan
2y
9
8
Mathematical Mindset
komponisto
2y
0
4
Another take on agent foundations: formalizing zero-shot reasoning
Alex Zhu
2y
0
18
Reasons compute may not drive AI capabilities growth
Kythe
2y
10
14
Bayesian Probability is for things that are Space-like Separated from You
Scott Garrabrant
2y
1
2
Impact Measure Desiderata
Alex Turner
2y
38
19
EDT solves 5 and 10 with conditional oracles
Jessica Taylor
2y
5
9
Multi-agent predictive minds and AI alignment
Jan_Kulveit
2y
0
2
Can corrigibility be learned safely?
Wei Dai
2y
15
12
Future directions for ambitious value learning
Rohin Shah
2y
6
8
Fixed Point Discussion
Scott Garrabrant
2y
0
3
Specification gaming examples in AI
Samuel Rødal
2y
5
12
What is ambitious value learning?
Rohin Shah
2y
12
11
Mechanistic Transparency for Machine Learning
DanielFilan
2y
5
6
Specification gaming examples in AI
Vika
2y
8
9
Corrigibility
Paul Christiano
2y
2
7
Alignment Newsletter #34
Rohin Shah
2y
0
8
Fixed Point Exercises
Scott Garrabrant
2y
0
12
Factored Cognition
Andreas Stuhlmüller
2y
1
13
Reducing collective rationality to individual optimization in common-payoff games using MCMC
Jessica Taylor
2y
10
10
The easy goal inference problem is still hard
Paul Christiano
2y
1
3
Buridan's ass in coordination games
Jessica Taylor
2y
26
10
New safety research agenda: scalable agent alignment via reward modeling
Vika
2y
12
3
UDT can learn anthropic probabilities
Vladimir Slepnev
2y
0
1
Non-Adversarial Goodhart and AI Risks
Davidmanheim
2y
0
10
The Steering Problem
Paul Christiano
2y
2
11
Prosaic AI alignment
Paul Christiano
2y
0
5
Alignment Newsletter #33
Rohin Shah
2y
0
8
Policy Approval
Abram Demski
2y
13
4
Bridging syntax and semantics, empirically
Stuart Armstrong
2y
0
8
(A -> B) -> A
Scott Garrabrant
2y
2
0
Self-regulation of safety in AI research
G Gordon Worley III
2y
0
7
Using expected utility for Good(hart)
Stuart Armstrong
2y
1
3
Petrov corrigibility
Stuart Armstrong
2y
9
9
Agents That Learn From Human Behavior Can't Learn Human Values That Humans Haven't Learned Yet
steven0461
2y
5
7
Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.
Ryan Carey
2y
0
7
Asymptotic Decision Theory (Improved Writeup)
Diffractor
2y
13
3
Counterfactuals, thick and thin
Nisan
2y
4
4
Overcoming Clinginess in Impact Measures
Alex Turner
2y
5
1
Alignment Newsletter #27
Rohin Shah
2y
0
4
Alignment Newsletter #29
Rohin Shah
2y
0
6
Diagonalization Fixed Point Exercises
Scott Garrabrant, Sam Eisenstat
2y
13
8
Model Mis-specification and Inverse Reinforcement Learning
Owain Evans, Jacob Steinhardt
2y
0
11
Discussion on the machine learning approach to AI safety
Vika
2y
2
1
Alignment Newsletter #16: 07/23/18
Rohin Shah
2y
0
6
Anthropic probabilities and cost functions
Stuart Armstrong
2y
0
2
Knowledge is Freedom
Scott Garrabrant
2y
1
9
Acknowledging Human Preference Types to Support Value Learning
Nandi, Sabrina and Erin
2y
0
0
Idea: Open Access AI Safety Journal
G Gordon Worley III
2y
0
13
New DeepMind AI Safety Research Blog
Vika
2y
0
8
Hyperreal Brouwer
Scott Garrabrant
2y
0
5
Alignment Newsletter #32
Rohin Shah
2y
0
12
Intuitions about goal-directed behavior
Rohin Shah
2y
5
9
Penalizing Impact via Attainable Utility Preservation
Alex Turner
2y
0
7
Alignment Newsletter #37
Rohin Shah
2y
2
