
Shard Theory

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Posts in this sequence:

1. Humans provide an untapped wealth of evidence about alignment (Alex Turner, Quintin Pope)
2. Human values & biases are inaccessible to the genome (Alex Turner)
3. General alignment properties (Alex Turner)
4. Evolution is a bad analogy for AGI: inner alignment (Quintin Pope)
5. Reward is not the optimization target (Alex Turner)
6. The shard theory of human values (Quintin Pope, Alex Turner)
7. Understanding and avoiding value drift (Alex Turner)
8. A shot at the diamond-alignment problem (Alex Turner)
9. Don't design agents which exploit adversarial inputs (Alex Turner, Garrett Baker)
10. Don't align agents to evaluations of plans (Alex Turner)
11. Alignment allows "nonrobust" decision-influences and doesn't require robust grading (Alex Turner)
12. Inner and outer alignment decompose one hard problem into two extremely hard problems (Alex Turner)