AI Alignment Forum

Shard Theory

Sequence created Jul 14, 2022 by Quintin Pope

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Posts in this sequence:

1. Humans provide an untapped wealth of evidence about alignment
   Alex Turner, Quintin Pope (53 karma, 28 comments)

2. Human values & biases are inaccessible to the genome
   Alex Turner (41 karma, 25 comments)

3. General alignment properties
   Alex Turner (25 karma, 1 comment)

4. Evolution is a bad analogy for AGI: inner alignment
   Quintin Pope (26 karma, 0 comments)

5. Reward is not the optimization target
   Alex Turner (84 karma, 77 comments)

6. The shard theory of human values
   Quintin Pope, Alex Turner (69 karma, 30 comments)

7. Understanding and avoiding value drift
   Alex Turner (21 karma, 5 comments)

8. A shot at the diamond-alignment problem
   Alex Turner (33 karma, 35 comments)

9. Don't design agents which exploit adversarial inputs
   Alex Turner, Garrett Baker (31 karma, 27 comments)

10. Don't align agents to evaluations of plans
    Alex Turner (24 karma, 28 comments)

11. Alignment allows "nonrobust" decision-influences and doesn't require robust grading
    Alex Turner (28 karma, 31 comments)

12. Inner and outer alignment decompose one hard problem into two extremely hard problems
    Alex Turner (44 karma, 11 comments)