
Shard Theory

Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Posts in this sequence:

1. Humans provide an untapped wealth of evidence about alignment (TurnTrout, Quintin Pope)
2. Human values & biases are inaccessible to the genome (TurnTrout)
3. General alignment properties (TurnTrout)
4. Evolution is a bad analogy for AGI: inner alignment (Quintin Pope)
5. Reward is not the optimization target (TurnTrout)
6. The shard theory of human values (Quintin Pope, TurnTrout)
7. Understanding and avoiding value drift (TurnTrout)
8. A shot at the diamond-alignment problem (TurnTrout)
9. Don't design agents which exploit adversarial inputs (TurnTrout, Garrett Baker)
10. Don't align agents to evaluations of plans (TurnTrout)
11. Alignment allows "nonrobust" decision-influences and doesn't require robust grading (TurnTrout)
12. Inner and outer alignment decompose one hard problem into two extremely hard problems (TurnTrout)