AI Alignment Forum

Shard Theory

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

1. Humans provide an untapped wealth of evidence about alignment
   Alex Turner, Quintin Pope (56 karma, 28 comments)

2. Human values & biases are inaccessible to the genome
   Alex Turner (41 karma, 25 comments)

3. General alignment properties
   Alex Turner (26 karma, 1 comment)

4. Evolution is a bad analogy for AGI: inner alignment
   Quintin Pope (27 karma, 0 comments)

5. Reward is not the optimization target
   Alex Turner (89 karma, 82 comments)

6. The shard theory of human values
   Quintin Pope, Alex Turner (72 karma, 31 comments)

7. Understanding and avoiding value drift
   Alex Turner (21 karma, 5 comments)

8. A shot at the diamond-alignment problem
   Alex Turner (36 karma, 45 comments)

9. Don't design agents which exploit adversarial inputs
   Alex Turner, Garrett Baker (31 karma, 27 comments)

10. Don't align agents to evaluations of plans
    Alex Turner (25 karma, 28 comments)

11. Alignment allows "nonrobust" decision-influences and doesn't require robust grading
    Alex Turner (31 karma, 31 comments)

12. Inner and outer alignment decompose one hard problem into two extremely hard problems
    Alex Turner (42 karma, 11 comments)