AI ALIGNMENT FORUM

Shard Theory

Edited by David Udell, Noosphere89, et al. last updated 30th Dec 2024

Shard Theory is an alignment research program about the relationship between training variables and learned values in trained reinforcement learning (RL) agents. It aims to progressively flesh out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of the learned algorithms in ML generally.

Shard theory's basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard garners reinforcement, the circuits that implement it are strengthened, so that shard becomes more likely to trigger again in the future when given similar cognitive inputs.
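
The reinforcement dynamic described above can be illustrated with a deliberately crude toy model. This is only a sketch under stated assumptions, not shard theory's own formalism: the `Shard` class, the feature-overlap activation rule, the `strength` scalar, and the learning rate `lr` are all hypothetical names and choices introduced here for illustration. Real shards would be distributed circuits inside a trained network, not hand-written objects.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical toy shard: a context-gated bid to steer behavior."""
    name: str
    trigger: dict          # feature name -> weight; which contexts activate it
    strength: float = 1.0  # grows when the shard's behavior is reinforced

    def activation(self, context: dict) -> float:
        # Activation is the strength-scaled overlap between context and trigger.
        overlap = sum(w * context.get(f, 0.0) for f, w in self.trigger.items())
        return self.strength * max(overlap, 0.0)


def winning_shard(shards, context):
    """Return the shard with the highest activation in this context."""
    return max(shards, key=lambda s: s.activation(context))


def reinforce(shard, reward, lr=0.5):
    """Strengthen the shard whose behavior was just rewarded (assumed update rule)."""
    shard.strength += lr * reward


juice = Shard("juice", {"juice_visible": 1.0})
explore = Shard("explore", {"novelty": 1.0})
shards = [juice, explore]

# A mixed context: some juice in view, slightly more novelty.
mixed = {"juice_visible": 0.6, "novelty": 0.7}
before = winning_shard(shards, mixed)  # explore wins: 0.7 beats 0.6

# The juice shard steers behavior in a juice-heavy context and gets rewarded.
rewarded_context = {"juice_visible": 1.0, "novelty": 0.2}
reinforce(winning_shard(shards, rewarded_context), reward=1.0)

after = winning_shard(shards, mixed)  # juice wins: 1.5 * 0.6 beats 0.7
no_juice = winning_shard(shards, {"novelty": 0.7})  # juice shard stays silent

print(before.name, after.name, no_juice.name)  # explore juice explore
```

After one rewarded episode, the juice shard wins in a similar context where it previously lost, yet it still does not fire when none of its triggering features are present. That is the sense in which the toy shard is both reinforced and contextually activated.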

Because shards make up an appreciable fraction of a neural network, large neural nets can have quite intelligent constituent shards. These shards can be sophisticated enough to be well modeled as playing negotiation games with each other, which potentially explains human psychological phenomena such as akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

Posts tagged Shard Theory
12
74 | The shard theory of human values | Quintin Pope, TurnTrout | 3y | 33 | 5
72 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 3y | 22 | 2
47 | Contra shard theory, in the context of the diamond maximizer problem | So8res | 3y | 3 | 3
23 | Understanding and avoiding value drift | TurnTrout | 3y | 7 | 1
94 | Reward is not the optimization target | TurnTrout | 3y | 88 | 3
140 | Understanding and controlling a maze-solving policy network | TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell | 3y | 23 | 2
45 | Shard Theory: An Overview | David Udell | 3y | 2 | 2
43 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 3y | 14 | 3
36 | A shot at the diamond-alignment problem | TurnTrout | 3y | 45 | 1
25 | Shard Theory - is it true for humans? | Rishika | 1y | 0 | 2
64 | Gradient Routing: Masking Gradients to Localize Computation in Neural Networks | cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout | 1y | 4 | 2
47 | Predictions for shard theory mechanistic interpretability results | TurnTrout, Ulisse Mini, peligrietzer | 3y | 6 | 2
42 | Human values & biases are inaccessible to the genome | TurnTrout | 3y | 38 | 2
41 | Disentangling Shard Theory into Atomic Claims | Leon Lang | 3y | 1 | 1
27 | Research agenda: Supervising AIs improving AIs | Quintin Pope, Owen D, Roman Engeler, jacquesthibs | 3y | 0