AI Alignment Forum

Shard Theory

Sequence created Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Posts in this sequence:

1. Humans provide an untapped wealth of evidence about alignment
   Alex Turner, Quintin Pope (53 karma, 28 comments)

2. Human values & biases are inaccessible to the genome
   Alex Turner (41 karma, 25 comments)

3. General alignment properties
   Alex Turner (25 karma, 1 comment)

4. Evolution is a bad analogy for AGI: inner alignment
   Quintin Pope (26 karma, 0 comments)

5. Reward is not the optimization target
   Alex Turner (84 karma, 77 comments)

6. The shard theory of human values
   Quintin Pope, Alex Turner (69 karma, 30 comments)

7. Understanding and avoiding value drift
   Alex Turner (21 karma, 5 comments)

8. A shot at the diamond-alignment problem
   Alex Turner (33 karma, 35 comments)

9. Don't design agents which exploit adversarial inputs
   Alex Turner, Garrett Baker (31 karma, 27 comments)

10. Don't align agents to evaluations of plans
    Alex Turner (24 karma, 28 comments)

11. Alignment allows "nonrobust" decision-influences and doesn't require robust grading
    Alex Turner (28 karma, 31 comments)

12. Inner and outer alignment decompose one hard problem into two extremely hard problems
    Alex Turner (44 karma, 11 comments)