Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight

Jacy Reese Anthis

I read several shard theory posts and found the details interesting, but I couldn't quite see the big picture. I'm used to hearing "theory" refer to a falsifiable generalization of data. It is typically stated as a sentence, paragraph, list, mathematical expression, or diagram early in a scientific paper. Theories range from extremely precise like Newton's three laws of motion to extremely broad like Frankish's consciousness illusionism (i.e., consciousness is an illusion).

I have also been generally confused about what it means to "solve alignment" before AGI arrives given that there is not (yet) consensus around any pre-AGI^[1] formalization of the alignment problem itself: Wouldn't any proposed solution still have a significant number of people (say, >10% of Alignment Forum users) who think that it doesn't even pose the problem the right way? What should our "theories" even be aiming for?

With help from Nathan Helm-Burger, I think I now better understand what's referred to as shard theory and want to share my understanding as an exercise in problem and solution formulation in alignment. I think "shard theory" refers to four sequential components: a Shard Hunch that motivates a two-part Shard Question, the first part of which is currently being answered by the gradual development of an actual Shard Theory of human values, which hopefully provides answers to the second part with Shard Insight that can be implemented in AI systems to facilitate alignment.^[2] Namely:

Shard Hunch: A human brain is a general intelligence, and its intentions and behavior are reasonably aligned with its many shards of values (i.e., little bits of "contextual influences on decision-making"^[3]). Maybe something like that alignment can work for AI too!^[4]
Shard Question: How does the human brain ensure alignment with its values, and how can we use that information to ensure the alignment of an AI with its designers' values?
Shard Theory: The brain ensures alignment with its values by doing A, B, C, etc.
Shard Insight: We can ensure the alignment of an AI with its designers' values by doing X, Y, Z, etc. mapped from shard theory.

This is exciting! Now when I read shard theory research, I feel like I properly understand it as gradually filling in A, B, C, X, Y, Z, etc. For example, Assumptions 1, 2, and 3 in "The Shard Theory of Human Values" are examples of A, B, and C, and the two arguments in "Reward is Not the Optimization Target" and "Human Values & Biases are Inaccessible to the Genome" are examples of X and Y. I also think the specific idea of a "shard" is less central to these claims than I thought; it seems the first of those posts could parsimoniously replace "shard" with "value" (in the dictionary sense) with very little meaning lost, and the latter two posts don't even use the word. I wonder if something like "Brain-Inspired Alignment" would be a clearer label, at least until a central concept like shards emerges in the research.

Shard research is also at a very early stage, so it is inevitably less focused on stating and validating the falsifiable, non-trivial claims that could be an actual shard theory (which is usually what we discuss in science) and instead seems to mostly be developing a language for eventually specifying shard theory—much like how Rubin's potential outcomes (POs) and Pearl's directed acyclic graphs (DAGs) were important developments in causality research because they allowed for the clear statement of falsifiable, nontrivial causal theories. Pope and Turner also use the terms "paradigm" and "frame," which I think are more fitting for what they have done so far than shard "theory" per se though less specific than "language."

For example, the post "Reward is not the optimization target" and Paul Christiano's reply seem better read not as claim and counter-claim, but as thinking about the most useful neuroscience-inspired way to define "reward," "optimization," etc. These discussions seem to involve some hand waving and talking past each other, so I also wonder if more explicitly approaching shard theory as building a language, not as sharing an extant theory, would help us think more clearly. In any case, these meta-questions seem inevitable as the field of AI alignment advances and we come closer to developed theories and solutions—whatever that means.

^[1]
Post-AGI formalizations of alignment, such as thresholds for how much value persists, seem less controversial but also less useful than a pre-AGI formalization would be. And they still seem far from uncontroversial. For example, some make an appeal to moral nature, so to speak, to keep human value as close to its current path as possible while ensuring AI safety, while others see this as false or confused.
^[2]
Pope and Turner say in "The Shard Theory of Human Values" that "“Shard theory” also has been used to refer to insights gained by considering the shard theory of human values and by operating the shard frame on alignment. ... We don’t like this ambiguous usage. We would instead say something like “insights from shard theory.”" I take that to mean they do not include anything about AI alignment itself as shard theory. I think this will confuse many people because of how central AI alignment is to the shard theory project.
^[3]
This definition of value (i.e., shard) is unintuitively broad, as Pope and Turner acknowledge. I think precisifying and clarifying that will be an important part of building shard theory.
^[4]
The Shard Hunch is most clearly stated in the first blockquote in Turner's "Looking Back on My Alignment PhD" and in Turner's comment on "Where I Agree and Disagree with Eliezer."

While Quintin and I were careful in selecting the name "shard", I think that calling the present version "shard theory" may have been a mistake, in part for the reasons you note. We aren't at the "precise predictions" phase yet, but I do think present shard theory makes some informal predictions.

For example, I think that agents will competently generalize in multiple ways, depending on the context they find themselves in.

I.e., for an agent trained via deep RL on mazes where the exit is randomly on the right half, and the agent starts on the left... the trained policy won't just be running search to reach the end of the maze, with a globally activated mesa-objective across possible contexts. Rather, the agent may have a "going right" policy sub-component, which increases logits on actions which go towards the right half of the maze, and it may have a "go to red thing" policy sub-component, if the maze exit was red. So an agent might generalize to go right in the absence of red items, and go towards red squares if visible, and do both in a competent, contextual way.
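
The maze example above can be pictured with a toy sketch. This is not code from any shard theory post, and it is not a trained network: the shard functions, the observation format, and the logit values are all hypothetical, chosen only to illustrate what "contextually activated sub-components that add to action logits" could look like, in contrast to a single global objective.

```python
import math

ACTIONS = ["up", "down", "left", "right"]

def go_right_shard(obs):
    """Hypothetical shard: active in every context, nudges toward 'right'."""
    return {"right": 1.0}

def go_to_red_shard(obs):
    """Hypothetical shard: active only when a red square is visible.

    Grid convention (an assumption): x grows rightward, y grows downward.
    """
    red = obs.get("red_pos")
    if red is None:
        return {}  # shard is silent when its context is absent
    ax, ay = obs["agent_pos"]
    rx, ry = red
    bumps = {}
    if rx > ax:
        bumps["right"] = 2.0
    elif rx < ax:
        bumps["left"] = 2.0
    if ry > ay:
        bumps["down"] = 2.0
    elif ry < ay:
        bumps["up"] = 2.0
    return bumps

def policy_logits(obs, shards):
    """Sum the logit contributions of every shard for this observation."""
    logits = {a: 0.0 for a in ACTIONS}
    for shard in shards:
        for action, bump in shard(obs).items():
            logits[action] += bump
    return logits

def softmax(logits):
    """Turn logits into action probabilities (numerically stable)."""
    m = max(logits.values())
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def greedy_action(obs, shards=(go_right_shard, go_to_red_shard)):
    """Pick the highest-logit action under the combined shards."""
    logits = policy_logits(obs, shards)
    return max(ACTIONS, key=lambda a: logits[a])
```

With no red square in view, only the "going right" shard fires, so `greedy_action({"agent_pos": (0, 0), "red_pos": None})` picks `"right"`. With a red square directly to the left, the red shard's larger bump outweighs it and the agent heads left. Neither behavior comes from a search over maze exits; each comes from whichever influences the current context activates.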

Pope and Turner say in "The Shard Theory of Human Values" that "“Shard theory” also has been used to refer to insights gained by considering the shard theory of human values and by operating the shard frame on alignment. ... We don’t like this ambiguous usage. We would instead say something like “insights from shard theory.”" I take that to mean they do not include anything about AI alignment itself as shard theory. I think this will confuse many people because of how central AI alignment is to the shard theory project.

Hm, I didn't mean to imply that. Point (2) of that decomposition was:

The shard paradigm/theory/frame of AI alignment analyzes the value formation processes which will occur in deep learning, and tries to figure out their properties.

This definitely includes AI alignment insights as part of shard theory, but not as part of shard theory$_{\text{human values}}$. What I was trying to gesture at is how e.g. reward != optimization target is not necessarily making predictions about modular contextual influences within policy networks trained via e.g. PPO. Instead, Reward!=OT explains a few general insights into the nature of alignment in the deep learning regime.

That seems like a useful decomposition! Point 2 seems to beg the question: why does it assume that the brain can "ensure alignment with its values", as opposed to, say, synthesizing an illusion of values by aggregating data from various shards?

Thanks for the comment. I take "beg the question" to mean "assumes its conclusion," but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned in which brains would fall into either category. For example, insofar as our values are of a certain evolutionary sort (e.g., valuing reproduction), human brains have misaligned mesaoptimization like craving sugar. If sugar craving itself is the value, then arguably we're well-aligned.

In terms of synthesizing an illusion, what exactly would make it illusory? If the synthesis (i.e., combination of the various shards and associated data) is leading to brains going about their business in a not-catastrophic way (e.g., not being constantly insane or paralyzed), then that seems to meet the bar for alignment that many, particularly agent foundations proponents, favor. See, for example, Nate's recent post:

Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.

The example I like is just getting an AI to fill a container of water, which human brains are able to do, but in Fantasia, the sorcerer's apprentice Mickey Mouse was not able to do! So that's a basic sense in which brains are aligned, but again I'm not sure how exactly you would differentiate alignment with its values from synthesis of an illusion.

I meant this:

Shard Question: How does the human brain ensure alignment with its values, and how can we use that information to ensure the alignment of an AI with its designers' values?

which does indeed beg the question in the standard meaning of it.

My point is that there is very much no alignment between different values! They are independent at best and contradictory in many cases. There is an illusion of coherent values that is a rationalization. The difference in values sometimes leads to catastrophic Fantasia-like outcomes on the margins (e.g. people with addiction don't want to be on drugs but are), but most of the time it results in a mild akrasia (I am writing this instead of doing something that makes me money). This seems like a good analogy: http://max.mmlc.northwestern.edu/mdenner/Demo/texts/swan_pike_crawfish.htm

Hm. I think you can dissolve the perceived question-begging by replacing "values" with its substance:

How does the genome, in the presence of e.g. modern Western culture, reliably form decision-influences which push the person to e.g. take actions which increase the welfare of their family and friends? (i.e. where do friendship-shards come from?)

We're then asking a relatively well-defined question with a guaranteed-to-exist answer.