AI ALIGNMENT FORUM

Shard Theory

Edited by David Udell, Noosphere89, et al. Last updated 30th Dec 2024.

Shard Theory is an alignment research program that studies the relationship between training variables and learned values in trained reinforcement learning (RL) agents. It aims to progressively flesh out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of the learned algorithms in ML generally.

Shard theory's basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard garners reinforcement, the circuits that implement it are strengthened, so that shard becomes more likely to trigger again in the future when given similar cognitive inputs.
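
The reinforcement loop described above can be illustrated with a toy model. This is only a minimal sketch under loud assumptions, not the formalism used by shard theory's authors: the `Shard` class, the string-valued `trigger` and `action` fields, the softmax choice rule, and the juice example are all hypothetical stand-ins chosen for illustration. Real shards would be circuits inside a network, not table entries.

```python
import math
from dataclasses import dataclass


@dataclass
class Shard:
    """Toy stand-in for a shard (hypothetical, for illustration only)."""
    name: str
    trigger: str           # context feature that activates this shard
    action: str            # behavior the shard steers toward
    strength: float = 0.0  # how strongly the shard bids when active


def action_probabilities(shards, context):
    """Softmax over the strengths of the shards activated by this context."""
    active = [s for s in shards if s.trigger in context]
    total = sum(math.exp(s.strength) for s in active)
    probs = {}
    for s in active:
        probs[s.action] = probs.get(s.action, 0.0) + math.exp(s.strength) / total
    return probs


def reinforce(shards, context, action, reward, lr=1.0):
    """Strengthen only the shards that were active and steered toward `action`."""
    for s in shards:
        if s.trigger in context and s.action == action:
            s.strength += lr * reward


shards = [
    Shard("juice", trigger="juice_visible", action="approach_juice"),
    Shard("rest", trigger="juice_visible", action="stay_put"),
    Shard("explore", trigger="dark_room", action="wander"),
]
context = {"juice_visible"}

before = action_probabilities(shards, context)
reinforce(shards, context, "approach_juice", reward=1.0)
after = action_probabilities(shards, context)

print(before["approach_juice"], after["approach_juice"])
```

In this sketch, only the shard that was active in the rewarded context and bid for the rewarded behavior gains strength, so it becomes more likely to steer behavior the next time a similar context appears. The shard tied to a different context is left unchanged.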

Because shards make up an appreciable fraction of a neural network, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, which (potentially) explains human psychological phenomena like akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

Posts tagged Shard Theory
The shard theory of human values, by Quintin Pope, TurnTrout (3y; 74 karma; 33 comments)
Shard Theory in Nine Theses: a Distillation and Critical Appraisal, by LawrenceC (3y; 72 karma; 22 comments)
Contra shard theory, in the context of the diamond maximizer problem, by So8res (3y; 47 karma; 3 comments)
Understanding and avoiding value drift, by TurnTrout (3y; 23 karma; 7 comments)
Reward is not the optimization target, by TurnTrout (3y; 94 karma; 88 comments)
Understanding and controlling a maze-solving policy network, by TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell (3y; 140 karma; 23 comments)
Shard Theory: An Overview, by David Udell (3y; 45 karma; 2 comments)
Inner and outer alignment decompose one hard problem into two extremely hard problems, by TurnTrout (3y; 43 karma; 14 comments)
A shot at the diamond-alignment problem, by TurnTrout (3y; 36 karma; 45 comments)
Shard Theory - is it true for humans?, by Rishika (1y; 25 karma; 0 comments)
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks, by cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout (9mo; 64 karma; 4 comments)
Predictions for shard theory mechanistic interpretability results, by TurnTrout, Ulisse Mini, peligrietzer (3y; 47 karma; 6 comments)
Human values & biases are inaccessible to the genome, by TurnTrout (3y; 42 karma; 38 comments)
Disentangling Shard Theory into Atomic Claims, by Leon Lang (3y; 41 karma; 1 comment)
Research agenda: Supervising AIs improving AIs, by Quintin Pope, Owen D, Roman Engeler, jacquesthibs (2y; 27 karma; 0 comments)