AI ALIGNMENT FORUM

Shard Theory

Written by David Udell, Noosphere89, et al. last updated 30th Dec 2024

Shard theory is an alignment research program about the relationship between training variables and learned values in trained reinforcement learning (RL) agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of the learned algorithms in ML generally.

In shard theory's basic ontology of RL, shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard garners reinforcement, the circuits that implement it are strengthened, so that shard is more likely to trigger again in the future when given similar cognitive inputs.
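
The reinforcement dynamic described above can be illustrated with a deliberately crude toy model. This is only a sketch under stated assumptions, not a claim about how shards are implemented in real networks: the `Shard` class, the set-overlap trigger, the sigmoid gate, and the `reinforce` update rule are all hypothetical names and choices made for illustration.

```python
import math

class Shard:
    """Toy stand-in for a shard: a context-gated computation that favors one action."""

    def __init__(self, name, trigger, action):
        self.name = name
        self.trigger = frozenset(trigger)  # context features that activate the shard (assumed)
        self.action = action               # behavior the shard steers toward
        self.strength = 0.0                # stand-in for how entrenched its circuits are

    def activation(self, context):
        # Fraction of trigger features present, scaled by a sigmoid of the strength.
        overlap = len(self.trigger & set(context)) / len(self.trigger)
        gate = 1.0 / (1.0 + math.exp(-self.strength))
        return overlap * gate

def reinforce(shards, context, reward, lr=1.0):
    """Strengthen each shard in proportion to how active it was when reward arrived."""
    for shard in shards:
        shard.strength += lr * reward * shard.activation(context)

juice = Shard("juice", {"sees_juice", "thirsty"}, "drink")
play = Shard("play", {"sees_toy"}, "grab_toy")
shards = [juice, play]

context = {"sees_juice", "thirsty"}
before = juice.activation(context)
for _ in range(5):
    reinforce(shards, context, reward=1.0)
after = juice.activation(context)

# The reinforced shard now fires more strongly in the same context and in a
# similar one, while the shard that was inactive during reward is unchanged.
similar = juice.activation({"sees_juice"})
print(before, after, similar, play.strength)
```

In this toy, only the shard that was active when reward arrived is strengthened, and the strengthening carries over to partially matching contexts, which is the qualitative pattern the paragraph describes.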

Because an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, which (potentially) explains human psychological phenomena like akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

Posts tagged Shard Theory
The shard theory of human values (Quintin Pope, Alex Turner; 3y; karma 74; 33 comments)
Shard Theory in Nine Theses: a Distillation and Critical Appraisal (Lawrence Chan; 3y; karma 72; 22 comments)
Contra shard theory, in the context of the diamond maximizer problem (Nate Soares; 3y; karma 47; 3 comments)
Understanding and avoiding value drift (Alex Turner; 3y; karma 23; 7 comments)
Reward is not the optimization target (Alex Turner; 3y; karma 94; 88 comments)
Understanding and controlling a maze-solving policy network (Alex Turner, peligrietzer, Ulisse Mini, Monte MacDiarmid, David Udell; 2y; karma 140; 23 comments)
Shard Theory: An Overview (David Udell; 3y; karma 45; 2 comments)
Inner and outer alignment decompose one hard problem into two extremely hard problems (Alex Turner; 3y; karma 44; 14 comments)
A shot at the diamond-alignment problem (Alex Turner; 3y; karma 36; 45 comments)
Shard Theory - is it true for humans? (Rishika Bose; 1y; karma 25; 0 comments)
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (cloud, Jacob G-W, Evžen Wybitul, Joseph Miller, Alex Turner; 7mo; karma 62; 3 comments)
Predictions for shard theory mechanistic interpretability results (Alex Turner, Ulisse Mini, peligrietzer; 2y; karma 47; 6 comments)
Human values & biases are inaccessible to the genome (Alex Turner; 3y; karma 42; 38 comments)
Disentangling Shard Theory into Atomic Claims (Leon Lang; 2y; karma 41; 1 comment)
Research agenda: Supervising AIs improving AIs (Quintin Pope, Owen D, Roman Engeler, Jacques Thibodeau; 2y; karma 27; 0 comments)