The shard theory of human values

Quintin Pope; TurnTrout

TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.

We think that human value formation is extremely important for AI alignment. We have empirically observed exactly one process which reliably produces agents which intrinsically care about certain objects in the real world, which reflect upon their values and change them over time, and which—at least some of the time, with non-negligible probability—care about each other. That process occurs millions of times each day, despite genetic variation, cultural differences, and disparity in life experiences. That process produced you and your values.

Human values look so strange and inexplicable. How could those values be the product of anything except hack after evolutionary hack? We think this is not what happened. This post describes the shard theory account of human value formation, split into three sections:

1. Details our working assumptions about the learning dynamics within the brain,
2. Conjectures that reinforcement learning grows situational heuristics of increasing complexity, and
3. Uses shard theory to explain several confusing / “irrational” quirks of human decision-making.

Terminological note: We use “value” to mean a contextual influence on decision-making. Examples:

Wanting to hang out with a friend.
Feeling an internal urge to give money to a homeless person.
Feeling an internal urge to text someone you have a crush on.
That tug you feel when you are hungry and pass by a donut.

To us, this definition seems importantly type-correct and appropriate—see Appendix A.2. The main downside is that the definition is relatively broad—most people wouldn’t list “donuts” among their “values.” To avoid this counterintuitiveness, we would refer to a “donut shard” instead of a “donut value.” (“Shard” and associated terminology are defined in section II.)

I. Neuroscientific assumptions

The shard theory of human values makes three main assumptions. We think each assumption is pretty mainstream and reasonable. (For pointers to relevant literature supporting these assumptions, see Appendix A.3.)

Assumption 1: The cortex^[1] is basically (locally) randomly initialized. According to this assumption, most of the circuits in the brain are learned from scratch, in the sense of being mostly randomly initialized and not mostly genetically hard-coded. While the high-level topology of the brain may be genetically determined, we think that the local connectivity is not primarily genetically determined. For more clarification, see [Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain.

Thus, we infer that human values & biases are inaccessible to the genome:

It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…

[This leaves us with] a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t people want to wirehead, why do people almost always care about other people if the genome can’t even write circuitry that detects and rewards thoughts about people?

Assumption 2: The brain does self-supervised learning. According to this assumption, the brain is constantly predicting what it will next experience and think, from whether a V1 neuron will detect an edge, to whether you’re about to recognize your friend Bill (which grounds out as predicting the activations of higher-level cortical representations). (See On Intelligence for a book-long treatment of this assumption.)

In other words, the brain engages in self-supervised predictive learning: Predict what happens next, then see what actually happened, and update to do better next time.
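As a toy illustration of this predict, observe, update loop (and not a claim about how cortex implements it), here is a minimal sketch in which a single weight learns to predict the next value of a sensory stream by gradient descent on squared prediction error. The linear predictor, the learning rate, and the stream itself are assumptions made up for the example.

```python
def train_predictor(stream, lr=0.1, epochs=50):
    """Learn w so that w * x_t approximates x_{t+1}, one step at a time."""
    w = 0.0  # stand-in for uninformative initial circuitry
    for _ in range(epochs):
        for current, nxt in zip(stream, stream[1:]):
            prediction = w * current    # predict what happens next
            error = prediction - nxt    # see what actually happened
            w -= lr * error * current   # update to do better next time
    return w

# Hypothetical sensory stream in which each value is half the previous one.
stream = [1.0, 0.5, 0.25, 0.125, 0.0625]
w = train_predictor(stream)
```

After training, `w` sits close to 0.5, the regularity hidden in the stream. Nothing told the predictor what the regularity was; it only had its own prediction errors to learn from.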

Definition. Consider the context available to a circuit within the brain. Any given circuit is innervated by axons from different parts of the brain. These axons transmit information to the circuit. Therefore, whether a circuit fires is not primarily dependent on the external situation navigated by the human, or even what the person senses at a given point in time. A circuit fires depending on whether its inputs^[2] (the mental context) trigger it or not. This is what the "context" of a shard refers to.

Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain). In some^[3] fashion, the brain reinforces thoughts and mental subroutines which have led to reward, so that they will be more likely to fire in similar contexts in the future. We suspect that the “base” reinforcement learning algorithm is relatively crude, but that people reliably bootstrap up to smarter credit assignment.

Summary. Under our assumptions, most of the human brain is locally randomly initialized. The brain has two main learning objectives: self-supervised predictive loss (we view this as building your world model; see Appendix A.1) and reward (we view this as building your values, as we are about to explore).

II. Reinforcement events shape human value shards

This section lays out a bunch of highly specific mechanistic speculation about how a simple value might form in a baby’s brain. For brevity, we won’t hedge statements like “the baby is reinforced for X.” We think the story is good and useful, but don’t mean to communicate absolute confidence via our unhedged language.

Given the inaccessibility of world model concepts, how does the genetically hard-coded reward system dispense reward in the appropriate mental situations? For example, suppose you send a drunk text, and later feel embarrassed, and this triggers a penalty. How is that penalty calculated? By information inaccessibility and the absence of text messages in the ancestral environment, the genome isn’t directly hard-coding a circuit which detects that you sent an embarrassing text and then penalizes you. Nonetheless, such embarrassment seems to trigger (negative) reinforcement events... and we don’t really understand how that works yet.

Instead, let’s model what happens if the genome hardcodes a sugar-detecting reward circuit. For the sake of this section, suppose that the genome specifies a reward circuit which takes as input the state of the taste buds and the person’s metabolic needs, and produces a reward if the taste buds indicate the presence of sugar while the person is hungry. By assumption 3 in section I, the brain does reinforcement learning and credit assignment to reinforce circuits and computations which led to reward. For example, if a baby picks up a pouch of apple juice and sips some, that leads to sugar-reward. The reward makes the baby more likely to pick up apple juice in similar situations in the future.

Therefore, a baby may learn to sip apple juice which is already within easy reach. However, without a world model (much less a planning process), the baby cannot learn multi-step plans to grab and sip juice. If the baby doesn’t have a world model, then she won’t be able to act differently in situations where there is or is not juice behind her. Therefore, the baby develops a set of shallow situational heuristics which involve sensory preconditions like “IF juice pouch detected in center of visual field, THEN move arm towards pouch.” The baby is basically a trained reflex agent.
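The hypothetical sugar circuit and the resulting reflex heuristics can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the percept strings, and the additive strength update are all invented for illustration, and the real credit assignment is presumably far messier.

```python
def sugar_reward(sugar_on_taste_buds, hungry):
    """Hypothetical hard-coded circuit: reward only for sugar while hungry."""
    return 1.0 if (sugar_on_taste_buds and hungry) else 0.0

def reflex_policy(strengths, percept):
    """Fire the strongest learned IF-percept-THEN-action heuristic, if any."""
    candidates = {
        action: s for (p, action), s in strengths.items()
        if p == percept and s > 0
    }
    return max(candidates, key=candidates.get) if candidates else None

def reinforce(strengths, percept, action, reward, lr=0.5):
    """Strengthen the heuristic that fired just before the reward arrived."""
    key = (percept, action)
    strengths[key] = strengths.get(key, 0.0) + lr * reward

strengths = {}
percept = "juice pouch in center of visual field"
before = reflex_policy(strengths, percept)  # no heuristic learned yet

# The baby happens to grab and sip while hungry, so the circuit fires.
reward = sugar_reward(sugar_on_taste_buds=True, hungry=True)
reinforce(strengths, percept, "move arm towards pouch", reward)

after = reflex_policy(strengths, percept)   # the learned reflex now fires
# With no world model, juice behind the baby produces no matching percept.
elsewhere = reflex_policy(strengths, "empty visual field")
```

The policy is keyed purely on the current percept, which is why this agent cannot act differently depending on whether there is juice behind it.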

However, when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.^[4]

By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics exchange information with the budding world model.
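One crude way to get "both parts of that sequence are reinforced" is a rule resembling eligibility traces from standard reinforcement learning: every heuristic that fired shortly before the reward receives credit, with more recent ones receiving more. This is a swapped-in textbook device used for illustration, not a claim about the brain's actual credit assignment; the decay value and the heuristic labels are assumptions.

```python
def assign_credit(fired, reward, decay=0.5, lr=1.0):
    """Return strength updates for heuristics that fired before a reward.

    `fired` is ordered oldest-first. The most recent heuristic gets full
    credit, and each earlier one gets `decay` times the credit of the next.
    """
    updates = {}
    credit = lr * reward
    for heuristic in reversed(fired):
        updates[heuristic] = updates.get(heuristic, 0.0) + credit
        credit *= decay
    return updates

# (mental context, action) pairs in the order they fired.
episode = [
    ("juice modeled as behind me", "turn around"),
    ("juice pouch in front of me", "grab and drink"),
]
updates = assign_credit(episode, reward=1.0)
```

Note that the first heuristic is keyed on a world-model concept ("juice modeled as behind me") rather than on a raw percept, which is how the reinforced heuristics come to depend on the world model.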

A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events. For example, the juice-shard consists of the various decision-making influences which steer the baby towards the historical reinforcer of a juice pouch. These contextual influences were all reinforced into existence by the activation of sugar reward circuitry upon drinking juice. A subshard is a contextually activated component of a shard. For example, “IF juice pouch in front of me THEN grab” is a subshard of the juice-shard. It seems plain to us that learned value shards are^[5] most strongly activated in the situations in which they were historically reinforced and strengthened. (For more on terminology, see Appendix A.2.)

While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm,^[6] because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).

The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseam) should generally be penalized away.^[7] Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
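The "bidding" picture can be made concrete with a small sketch. Everything here is an illustrative assumption: the dictionary representation of shards, the activation numbers, and the rule that the plan with the highest activation-weighted total bid wins.

```python
def choose_plan(plans, shards, context):
    """Pick the plan with the highest total bid from all active shards."""
    def total_bid(plan):
        return sum(
            shard["activation"](context) * shard["bid"](plan)
            for shard in shards
        )
    return max(plans, key=total_bid)

# Each plan is described by what the world model predicts it leads to.
plans = [
    {"name": "crawl to juice", "juice": 1, "adult_contact": 0},
    {"name": "crawl to adult", "juice": 0, "adult_contact": 1},
]

juice_shard = {
    # Hypothetical: the juice-shard is more active in a hungry context.
    "activation": lambda ctx: 1.0 if ctx["hungry"] else 0.2,
    "bid": lambda plan: plan["juice"],
}
social_shard = {
    "activation": lambda ctx: 0.6,
    "bid": lambda plan: plan["adult_contact"],
}

shards = [juice_shard, social_shard]
hungry_choice = choose_plan(plans, shards, {"hungry": True})
sated_choice = choose_plan(plans, shards, {"hungry": False})
```

Because a whole plan is selected at once, the agent commits to one course of action rather than dithering between targets, and the winner changes with the mental context rather than with the plans themselves.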

Importantly, however, the juice-shard is shaped to bid for plans which the world model predicts actually lead to juice being consumed, and not necessarily for plans which lead to sugar-reward-circuit activation. You might wonder: “Why wouldn’t the shard learn to value reward circuit activation?” The effect of drinking juice is that the baby's credit assignment reinforces the computations which were causally responsible for producing the situation in which the hardcoded sugar-reward circuitry fired.

But what is reinforced? The content of the responsible computations includes a sequence of heuristics and decisions, one of which involved the juice pouch abstraction in the world model. Those are the circuits which actually get reinforced and become more likely to fire in the future. Therefore, the juice-heuristics get reinforced. The heuristics coalesce into a so-called shard of value as they query the world model and planner to implement increasingly complex multi-step plans.

In contrast, in this situation, the baby's decision-making does not involve “if this action is predicted to lead to sugar-reward, then bid for the action.” This non-participating heuristic probably won’t be reinforced or created, much less become a shard of value.^[8]

This is important. We see how the reward system shapes our values, without our values entirely binding to the activation of the reward system itself. We have also laid bare the manner in which the juice-shard is bound to your model of reality instead of simply your model of future perception. Looking back across the causal history of the juice-shard’s training, the shard has no particular reason to bid for the plan “stick a wire in my brain to electrically stimulate the sugar reward-circuit”, even if the world model correctly predicts the consequences of such a plan. In fact, a good world model predicts that the person will drink fewer juice pouches after becoming a wireheader, and so the juice-shard in a reflective juice-liking adult bids against the wireheading plan! Humans are not reward-maximizers; they are value shard-executors.

This, we claim, is one reason why people (usually) don’t want to wirehead and why people often want to avoid value drift. According to the sophisticated reflective capabilities of your world model, if you popped a pill which made you 10% more okay with murder, your world model predicts futures which are bid against by your current shards because they contain too much murder.
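The contrast between a reward-maximizer and a shard-executor can be shown in a few lines. The numbers below are made up purely for illustration; the only structural assumption is that each plan is scored on what the world model predicts, and that the juice-shard reads the predicted juice consumption rather than the predicted reward signal.

```python
# World-model predictions for two plans (all numbers are invented).
predictions = {
    "keep living normally": {"juice_pouches": 100, "reward_signal": 100},
    "wirehead": {"juice_pouches": 5, "reward_signal": 10_000},
}

def reward_maximizer_score(outcome):
    """A hypothetical agent that values reward-circuit activation itself."""
    return outcome["reward_signal"]

def juice_shard_bid(outcome):
    """The juice-shard bids on predicted juice consumption, not on reward."""
    return outcome["juice_pouches"]

def preferred(score):
    """Return the plan whose predicted outcome scores highest."""
    return max(predictions, key=lambda plan: score(predictions[plan]))

maximizer_pick = preferred(reward_maximizer_score)
shard_pick = preferred(juice_shard_bid)
```

Both agents share the same accurate world model. They differ only in which predicted quantity their decision-making was shaped to track.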

We’re pretty confident that the reward circuitry is not a complicated hard-coded morass of alignment magic which forces the human to care about real-world juice. No, the hypothetical sugar-reward circuitry is simple. We conjecture that the order in which the brain learns abstractions makes it convergent to care about certain objects in the real world.

III. Explaining human behavior using shard theory

The juice-shard formation story is simple and—if we did our job as authors—easy to understand. However, juice-consumption is hardly a prototypical human value. In this section, we’ll show how shard theory neatly explains a range of human behaviors and preferences.

As people, we have lots of intuitions about human behavior. However, intuitively obvious behaviors still have to have mechanistic explanations—such behaviors still have to be retrodicted by a correct theory of human value formation. While reading the following examples, try looking at human behavior with fresh eyes, as if you were seeing humans for the first time and wondering what kinds of learning processes would produce agents which behave in the ways described.

Altruism is contextual

Consider Peter Singer’s drowning child thought experiment:

Imagine you come across a small child who has fallen into a pond and is in danger of drowning. You know that you can easily and safely rescue him, but you are wearing an expensive pair of shoes that will be ruined if you do.

Probably,^[9] most people would save the child, even at the cost of the shoes. However, few of those people donate an equivalent amount of money to save a child far away from them. Why do we care more about nearby visible strangers as opposed to distant strangers?

We think that the answer is simple. First consider the relevant context. The person sees a drowning child. What shards activate? Consider the historical reinforcement events relevant to this context. Many of these events involved helping children and making them happy. These events mostly occurred face-to-face.

For example, perhaps there is a hardcoded reward circuit which is activated by a crude subcortical smile-detector and a hardcoded attentional bias towards objects with relatively large eyes. Then reinforcement events around making children happy would cause people to care about children. For example, an adult’s credit assignment might correctly credit decisions like “smiling at the child” and “helping them find their parents at a fair” as responsible for making the child smile. “Making the child happy” and “looking out for the child’s safety” are two reliable correlates of smiles, and so people probably reliably grow child-subshards around these correlates.

This child-shard most strongly activates in contexts similar to the historical reinforcement events. In particular, “knowing the child exists” will activate the child-shard less strongly than “knowing the child exists and also seeing them in front of you.” “Knowing there are some people hurting somewhere” activates altruism-relevant shards even more weakly still. So it’s no grand mystery that most people care more when they can see the person in need.
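A toy way to express "activates most strongly in contexts similar to the historical reinforcement events" is to score the current mental context by its feature overlap with past reinforcement contexts. The feature sets and the overlap measure are assumptions chosen for simplicity, not a proposed mechanism.

```python
def activation(context, reinforcement_contexts):
    """Average feature overlap between now and past reinforcement contexts."""
    overlaps = [
        len(context & past) / len(past) for past in reinforcement_contexts
    ]
    return sum(overlaps) / len(overlaps)

# Hypothetical history: child-related reinforcement mostly happened face-to-face.
history = [
    {"child exists", "child visible", "child nearby"},
    {"child exists", "child visible", "child nearby"},
    {"child exists", "child visible"},
]

seeing_child = activation(
    {"child exists", "child visible", "child nearby"}, history
)
knowing_only = activation({"child exists"}, history)
abstract_suffering = activation({"people hurting somewhere"}, history)
```

Under these assumptions the ordering matches the text: seeing the child activates the shard most, merely knowing the child exists activates it less, and an abstract awareness of distant suffering barely registers.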

Shard theory retrodicts that altruism tends to be biased towards nearby people (and also the ingroup), without positing complex, information-inaccessibility-violating adaptations like the following:

We evolved in small groups in which people helped their neighbors and were suspicious of outsiders, who were often hostile. Today we still have these “Us versus Them” biases, even when outsiders pose no threat to us and could benefit enormously from our help. Our biological history may predispose us to ignore the suffering of faraway people, but we don’t have to act that way. — Comparing the Effect of Rational and Emotional Appeals on Donation Behavior

Similarly, you may be familiar with scope insensitivity: that the function from (# of children at risk) → (willingness to pay to protect the children) is not linear, but perhaps logarithmic. Is it that people “can’t multiply”? Probably not.

Under the shard theory view, it’s not that brains can’t multiply, it’s that for most people, the altruism-shard is most strongly invoked in face-to-face, one-on-one interactions, because those are the situations which have been most strongly touched by altruism-related reinforcement events. Whatever the altruism-shard’s influence on decision-making, it doesn’t steer decision-making so as to produce a linear willingness-to-pay relationship.

Friendship strength seems contextual

Personally, I (TurnTrout) am more inclined to make plans with my friends when I’m already hanging out with them—when we are already physically near each other. But why?

Historically, when I’ve hung out with a friend, that was fun and rewarding and reinforced my decision to hang out with that friend, and to continue spending time with them when we were already hanging out. As above, one possible way this could^[10] happen is via a genetically hardcoded smile-activated reward circuit.

Since shards more strongly influence decisions in their historical reinforcement situations, the shards reinforced by interacting with my friend have the greatest control over my future plans when I’m actually hanging out with my friend.

Milgram is also contextual

The Milgram experiment(s) on obedience to authority figures was a series of social psychology experiments conducted by Yale University psychologist Stanley Milgram. They measured the willingness of study participants, men in the age range of 20 to 50 from a diverse range of occupations with varying levels of education, to obey an authority figure who instructed them to perform acts conflicting with their personal conscience. Participants were led to believe that they were assisting an unrelated experiment, in which they had to administer electric shocks to a "learner". These fake electric shocks gradually increased to levels that would have been fatal had they been real. — Wikipedia

We think that people convergently learn obedience- and cooperation-shards which more strongly influence decisions in the presence of an authority figure, perhaps because of historical obedience-reinforcement events in the presence of teachers / parents. These shards strongly activate in this situation.

We don’t pretend to have sufficient mastery of shard theory to a priori quantitatively predict Milgram’s obedience rate. However, shard theory explains why people obey so strongly in this experimental setup, but not in most everyday situations: The presence of an authority figure and of an official-seeming experimental protocol. This may seem obvious, but remember that human behavior requires a mechanistic explanation. “Common sense” doesn’t cut it. “Cooperation- and obedience-shards more strongly activate in this situation because this situation is similar to historical reinforcement contexts” is a nontrivial retrodiction.

Indeed, varying the contextual features dramatically affected the percentage of people who administered “lethal” shocks:

[Figure: obedience rates across variations of the Milgram experiment.]

Sunflowers and timidity

Consider the following claim: “People reliably become more timid when surrounded by tall sunflowers. They become easier to sell products to and ask favors from.”

Let’s see if we can explain this with shard theory. Consider the mental context. The person knows there’s a sunflower near them. What historical reinforcement events pertain to this context? Well, the person probably has pleasant associations with sunflowers, perhaps spawned by aesthetic reinforcement events which reinforced thoughts like “go to the field where sunflowers grow” and “look at the sunflower.”

Therefore, the sunflower-timidity-shard was grown from… Hm. It wasn’t grown. The claim isn’t true, and this shard doesn’t exist, because it’s not downstream of past reinforcement.

Thus: Shard theory does not explain everything, because shards are grown from previous reinforcement events and previous thoughts. Shard theory constrains anticipation around actual observed human nature.

Optional exercise: Why might it feel wrong to not look both ways before crossing the street, even if you have reliable information that the coast is clear?

Optional exercise: Suppose that it's more emotionally difficult to kill a person face-to-face than from far away and out of sight. Explain via shard theory.^[11]

We think that many biases are convergently produced artifacts of the human learning process & environment

We think that simple reward circuitry leads to different cognition activating in different circumstances. Different circumstances can activate cognition that implements different values, and this can lead to inconsistent or biased behavior. We conjecture that many biases are convergent artifacts of the human training process and internal shard dynamics. It is not that people are randomly, or by hardcoding, more or less “rational” in different situations.

Projection bias

Humans have a tendency to mispredict their future marginal utilities by assuming that they will remain at present levels. This leads to inconsistency as marginal utilities (for example, tastes) change over time in a way that the individual did not expect. For example, when individuals are asked to choose between a piece of fruit and an unhealthy snack (such as a candy bar) for a future meal, the choice is strongly affected by their "current" level of hunger. — Dynamic inconsistency - Wikipedia

We believe that this is not a misprediction of how tastes will change in the future. Many adults know perfectly well that they will later crave the candy bar. However, a satiated adult has a greater probability of choosing fruit for their later self, because their deliberative shards are more strongly activated than their craving-related shards. The current level of hunger strongly controls which food-related shards are activated.

Sunk cost fallacy

Why are we hesitant to shift away from the course of action that we’re currently pursuing? There are two shard theory-related factors that we think contribute to sunk cost fallacy:

1. The currently active shards are probably those that bid for the current course of action. They also have more influence, since they’re currently very active. Thus, the currently active shard coalition supports the current course of action more strongly, when compared to your “typical” shard coalitions. This can cause the you-that-is-pursuing-the-course-of-action to continue, even after your “otherwise” self would have stopped.
2. Shards activate more strongly in concrete situations. Actually seeing a bear will activate self-preservation shards more strongly than simply imagining a bear. Thus, the concrete benefits of the current course of action will more easily activate shards than the abstract benefits of an imagined course of action. This can lead to overestimating the value of continuing the current activity relative to the value of other options.

Time inconsistency

A person might deliberately avoid passing through the sweets aisle in a supermarket in order to avoid temptation. This is a very strange thing to do, and it makes no sense from the perspective of an agent maximizing expected utility over quantities like "sweet food consumed" and "leisure time" and "health." Such an EU-maximizing agent would decide to buy sweets or not, but wouldn’t worry about entering the aisle itself. Avoiding temptation makes perfect sense under shard theory.

Shards are contextually activated, and the sweet-shard is most strongly activated when you can actually see sweets. We think that planning-capable shards are manipulating future contexts so as to prevent the full activation of your sweet shard.
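A minimal sketch of this kind of context manipulation follows. It assumes, purely for illustration, that whichever shard is most strongly activated wins at each aisle, that the sweet-shard's activation jumps when sweets are visible, and that route planning happens in a context where sweets are not in view.

```python
def sweet_shard_activation(sees_sweets):
    """Hypothetical: the sweet-shard is far stronger when sweets are visible."""
    return 0.9 if sees_sweets else 0.2

HEALTH_SHARD_ACTIVATION = 0.5  # assumed roughly constant across aisles

def buys_sweets(route):
    """In the store, the most strongly activated shard wins at each aisle."""
    return any(
        sweet_shard_activation(aisle == "sweets") > HEALTH_SHARD_ACTIVATION
        for aisle in route
    )

def plan_route(routes):
    """Planning ahead (no sweets in view), prefer routes that avoid purchase."""
    return min(routes, key=buys_sweets)  # False sorts before True

routes = [
    ["produce", "sweets", "dairy"],
    ["produce", "cereal", "dairy"],
]
chosen = plan_route(routes)
```

The planner does not change what it values between aisles. It predicts which shards each future context would activate and picks the context where the outcome it currently favors is the one that wins.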

Similarly,

(A) Which do you prefer, to be given 500 dollars today or 505 dollars tomorrow?
(B) Which do you prefer, to be given 500 dollars 365 days from now or 505 dollars 366 days from now?

In such situations, people tend to choose $500 in (A) but $505 in (B), which is inconsistent with exponentially-discounted-utility models of the value of money. To explain this observed behavioral regularity using shard theory, consider the historical reinforcement contexts around immediate and delayed gratification. If contexts involving short-term opportunities activate different shards than contexts involving long-term opportunities, then it’s unsurprising that a person might choose 500 dollars in (A) but 505 dollars in (B).^[12] (Of course, a full shard theory explanation must explain why those contexts activate different shards. We strongly intuit that there’s a good explanation, but do not think we have a satisfying story here yet.)
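To see why this pattern is inconsistent with exponential discounting, note that adding a common delay multiplies both options by the same factor, so the preference cannot flip. The sketch below checks this numerically and contrasts it with a hyperbolic discount curve, a standard alternative from behavioral economics that is used here only as a comparison and is not part of the shard theory account. The discount parameters are arbitrary assumptions.

```python
def exponential_value(amount, days, daily_factor=0.999):
    """Exponentially discounted value: amount * daily_factor ** days."""
    return amount * daily_factor ** days

def hyperbolic_value(amount, days, k=0.05):
    """Hyperbolically discounted value: amount / (1 + k * days)."""
    return amount / (1 + k * days)

def prefers_larger_later(value, delay):
    """True if 505 at delay + 1 days is valued above 500 at `delay` days."""
    return value(505, delay + 1) > value(500, delay)

# Question (A): delay of 0 days. Question (B): delay of 365 days.
exp_today = prefers_larger_later(exponential_value, 0)
exp_next_year = prefers_larger_later(exponential_value, 365)
hyp_today = prefers_larger_later(hyperbolic_value, 0)
hyp_next_year = prefers_larger_later(hyperbolic_value, 365)
```

The exponential agent gives the same answer to both questions, whichever answer that is. The hyperbolic agent takes the 500 dollars in (A) and the 505 dollars in (B), reproducing the observed pattern without explaining it mechanistically.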

Framing effect

This is another bias that’s downstream of shards activating contextually. Asking the same question in different contexts can change which value-shards activate, and thus change how people answer the question. Consider also: People are hesitant to drink from a cup labeled “poison”, even if they themselves were the one to put the label there.

Other factors driving biases

There are many different reasons why someone might act in a biased manner. We’ve described some shard theory explanations for the listed biases. These explanations are not exhaustive. While writing this, we found an experiment with results that seem contrary to the shard theory explanations of sunk cost. Namely, experiment 4 (specifically, the uncorrelated condition) in this study on sunk cost in pigeons.

However, the cognitive biases literature is so large and heterogeneous that there probably isn’t any theory which cleanly explains all reported experimental outcomes. We think that shard theory has decently broad explanatory power for many aspects of human values and biases, even though not all observations fit neatly into the shard theory frame. (Alternatively, we might have done the shard theory analysis wrong for experiment 4.)

Why people can't enumerate all their values

Shards being contextual also helps explain why we can’t specify our full values. We can describe a moral theory that seems to capture our values in a given mental context, but it’s usually easy to find some counterexample to such a theory—some context or situation where the specified theory prescribes absurd behavior.

If shards implement your values, and shards activate situationally, your values will also be situational. Once you move away from the mental context / situation in which you came up with the moral theory, you might activate shards that the theory fails to capture. We think that this is why the static utility function framing is hard to operate for humans.

E.g., the classical utilitarian maxim to maximize joy might initially seem appealing, but it doesn’t take long to generate a new mental context which activates shards that value emotions other than joy, or shards that value things in physical reality beyond your own mental state.

You might generate such new mental contexts by directly searching for shards that bid against pure joy maximization, or by searching for hypothetical scenarios which activate such shards ("finding a counterexample", in the language of moral philosophy). However, there is no clean way to query all possible shards, and we can’t enumerate every possible context in which shards could activate. It's thus very difficult to precisely quantify all of our values, or to create an explicit utility function that describes our values.

Content we aren’t (yet) discussing

The story we’ve presented here skips over important parts of human value formation. E.g., humans can do moral philosophy and refactor their deliberative moral framework without necessarily encountering any externally-activated reinforcement events, and humans also learn values through processes like cultural osmosis or imitation of other humans. Additionally, we haven’t addressed learned reinforcers (where a correlate of reinforcement events eventually becomes reinforcing in and of itself). We’ve also avoided most discussion of shard theory’s AI alignment implications.

This post explains our basic picture of shard formation in humans. We will address deeper shard theory-related questions in later posts.

Conclusion

Working from three reasonable assumptions about how the brain works, shard theory implies that human values (e.g. caring about siblings) are implemented by contextually activated circuits which activate in situations downstream of past reinforcement (e.g. when physically around siblings) so as to steer decision-making towards the objects of past reinforcement (e.g. making plans to spend more time together). According to shard theory, human values may be complex, but much of human value formation is simple.

For shard theory discussion, join our Discord server. Charles Foster wrote Appendix A.3. We thank David Udell, Peter Barnett, Raymond Arnold, Garrett Baker, Steve Byrnes, and Thomas Kwa for feedback on this finalized post. Many more people provided feedback on an earlier version.

Appendices

A.1 The formation of the world model

Most of our values seem to be about the real world. Mechanistically, we think that this means that they are functions of the state of our world model. We therefore infer that human values do not form durably or in earnest until after the human has learned a proto-world model. Since the world model is learned from scratch (by assumption 1 in section I), the world model takes time to develop. In particular, we infer that babies don’t have any recognizable “values” to speak of.

Therefore, to understand why human values empirically coalesce around the world model, we will sketch a detailed picture of how the world model might form. We think that self-supervised learning (assumption 2 in section I) produces your world model.

Due to learning from scratch, the fancy and interesting parts of your brain start off mostly useless. Here’s a speculative^[13] story about how a baby learns to reduce predictive loss, in the process building a world model:

The baby is born^[14] into a world where she is pummeled by predictive error after predictive error, because most of her brain consists of locally randomly initialized neural circuitry.
The baby’s brain learns that a quick loss-reducing hack is to predict that the next sensory activations will equal the previous ones: That nothing will observationally change from moment to moment. If the baby is stationary, much of the visual scene is constant (modulo saccades). Similar statements may hold for other sensory modalities, from smell (olfaction) to location of body parts (proprioception).
1. At the same time, the baby starts learning edge detectors in V1^[15] (which seem to be universally learned / convergently useful in vision tasks) in order to take advantage of visual regularities across space and time, from moment to moment.
The baby learns to detect when she is being moved or when her eyes are about to saccade, in order to crudely anticipate e.g. translations of part of the visual field. For example, given the prior edge-detector activations and her current acceleration, the baby predicts that the next edge detectors to light up will be a certain translation of the previous edge-detector patterns.
1. This acceleration → visual translation circuitry is reliably learned because it’s convergently useful for reducing predictive loss in many situations under our laws of physics.
2. Driven purely by her self-supervised predictive learning, the baby has learned something interesting about how she is embedded in the world.
3. Once the “In what way is my head accelerating?” circuit is learned, other circuits can invoke it. This pushes toward modularity and generality, since it’s easier to learn a circuit which is predictively useful for two tasks, than to separately learn two variants of the same circuit. See also invariant representations.
The baby begins to learn rules of thumb e.g. about how simple objects move. She continues to build abstract representations of how movement relates to upcoming observations.
1. For example, she gains another easy reduction in predictive loss by using her own motor commands to predict where her body parts will soon be located (i.e. to predict upcoming proprioceptive observations).
2. This is the beginning of her self-model.
The rules of thumb become increasingly sophisticated. Object recognition and modeling begins in order to more precisely predict low- and medium-level visual activations, like “if I recognize a square-ish object at time t and it has smoothly moved left for k timesteps, predict I will recognize a square-ish object at time t+1 which is yet farther left in my visual field.”
As the low-hanging fruit are picked, the baby’s brain eventually learns higher-level rules.
1. “If a stationary object is to my right and I turn my head to the left, then I will stop seeing it, but if I turn my head back to the right, I will see it again.”
2. This rule requires statefulness via short-term memory and some coarse summary of the object itself (small time-scale object permanence within a shallow world-model).
Object permanence develops from the generalization of specific heuristics for predicting common objects, to an invariant scheme for handling objects and their relationship to the child.
1. Developmental milestones vary from baby to baby because it takes them a varying amount of time to learn certain keystone but convergent abstractions, such as self-models.
2. Weak evidence that this learning timeline is convergent: Crows (and other smart animals) reach object permanence milestones in a similar order as human babies reach them.
3. The more abstractions are learned, the easier it is to lay down additional functionality. When we see a new model of car, we do not have to relearn our edge detectors or car-detectors.
Learning continues, but we will stop here.
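The earliest step in this story (predicting that the next sensory activations will equal the previous ones) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "sensory stream" is a slowly drifting one-dimensional random walk, and the function names are hypothetical. It only shows why a "nothing changes" predictor is a cheap way to cut predictive loss relative to untrained guessing; it is not a model of a real brain.

```python
import random

def make_stream(n_steps, drift=0.05, seed=0):
    """Hypothetical slowly changing 1-D 'sensory' signal (a random walk)."""
    rng = random.Random(seed)
    x, stream = 0.0, []
    for _ in range(n_steps):
        x += rng.gauss(0.0, drift)
        stream.append(x)
    return stream

def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and what was observed."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def persistence_loss(stream):
    """Loss of the hack: predict that the next observation equals the previous one."""
    return mean_squared_error(stream[:-1], stream[1:])

def random_guess_loss(stream, seed=1):
    """Loss of an untrained baseline whose predictions ignore the input."""
    rng = random.Random(seed)
    guesses = [rng.gauss(0.0, 1.0) for _ in stream[1:]]
    return mean_squared_error(guesses, stream[1:])

stream = make_stream(1000)
print("persistence loss: ", persistence_loss(stream))
print("random-guess loss:", random_guess_loss(stream))
```

On a signal that changes slowly, the persistence predictor's loss is set by the size of the moment-to-moment drift, which is why it beats predictions unrelated to the input.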

In this story, the world model is built from the self-supervised loss signal. Reinforcement probably also guides and focuses attention. For example, perhaps brainstem-hardcoded (but crude) face detectors hook into a reward circuit which focuses the learning on human faces.

A.2 Terminology

Shards are not full subagents

In our conception, shards vary in their sophistication (e.g. IF-THEN reflexes vs planning-capable, reflective shards which query the world model in order to steer the future in a certain direction) and generality of activating contexts (e.g. only activates when hungry and a lollipop is in the middle of the visual field vs activates whenever you're thinking about a person). However, we think that shards are not discrete subagents with their own world models and mental workspaces. We currently estimate that most shards are "optimizers" to the extent that a bacterium or a thermostat is an optimizer.

“Values”

We defined^[16] “values” as “contextual influences on decision-making.” We think that “valuing someone’s friendship” is what it feels like from the inside to be an algorithm with a contextually activated decision-making influence which increases the probability of e.g. deciding to hang out with that friend. Here are three extra considerations and clarifications.

Type-correctness. We think that our definition is deeply appropriate in certain ways. Just because you value eating donuts, doesn’t mean you want to retain that pro-donut influence on your decision-making. This is what it means to reflectively endorse a value shard—that the shards which reason about your shard composition, bid for the donut-shard to stick around. By the same logic, it makes total sense to want your values to change over time—the “reflective” parts of you want the shard composition in the future to be different from the present composition. (For example, many arachnophobes probably want to drop their fear of spiders.) Rather than humans being “weird” for wanting their values to change over time, we think it’s probably the default for smart agents meeting our learning-process assumptions (section I).

Furthermore, your values do not reflect a reflectively endorsed utility function. First off, those are different types of objects. Values bid for and against options, while a utility function grades options. Second, your values vary contextually, while any such utility function would be constant across contexts. More on these points later, in more advanced shard theory posts.
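To make the type distinction concrete, here is a minimal sketch of contextually activated shards that bid on plans, next to a context-free utility function that grades them. The shard names, the bid sizes, and the rule of summing bids and taking the largest total are all assumptions made for illustration; the post does not specify how bids are aggregated.

```python
# Toy illustration only: the shard names, bid sizes, and the sum-then-argmax
# aggregation rule are assumptions made for this sketch, not claims from the post.

def donut_shard(context, plan):
    """Bids for eating the donut, but only when a donut is in view."""
    if "donut_visible" not in context:
        return 0.0
    return 2.0 if plan == "eat donut" else 0.0

def health_shard(context, plan):
    """Activates after exercise; bids against the donut and for walking past."""
    if "just_left_gym" not in context:
        return 0.0
    return {"eat donut": -3.0, "walk past": 1.0}.get(plan, 0.0)

SHARDS = [donut_shard, health_shard]

def choose_plan(context, plans):
    """Sum each plan's bids across shards and pick the plan with the largest total."""
    totals = {plan: sum(shard(context, plan) for shard in SHARDS) for plan in plans}
    return max(totals, key=totals.get)

# For contrast: a context-free utility function grades options the same way everywhere.
FIXED_UTILITY = {"eat donut": 1.0, "walk past": 0.0}

def choose_by_utility(plans):
    """Pick the plan with the highest fixed grade, ignoring context entirely."""
    return max(plans, key=FIXED_UTILITY.get)

plans = ["eat donut", "walk past"]
print(choose_plan({"donut_visible"}, plans))
print(choose_plan({"donut_visible", "just_left_gym"}, plans))
print(choose_by_utility(plans))
```

In this sketch the same set of shards produces different choices in different contexts, while the fixed utility table cannot: its ranking of the two plans never changes.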

Different shard compositions can produce similar urges. If you feel an urge to approach nearby donuts, that indicates a range of possibilities:

A donut shard is firing to increase P(eating the donut) because the WM indicates there’s a short plan that produces that outcome, and seeing/smelling a donut activates the donut shard particularly strongly.
A hedonic shard is firing to increase P(eating the donut) because the WM indicates there’s a short plan that produces a highly pleasurable outcome.
A social shard is firing because your friends are all eating donuts, and the social shard was historically reinforced for executing plans where you “fit in” / gain their approval.
…

So, just because you feel an urge to eat the donut, doesn’t necessarily mean you have a donut shard or that you “value” donuts under our definition. (But you probably do.)

Shards are just collections of subshards. One subshard of your family-shard might steer towards futures where your family is happy, while another subshard may influence decisions so that your mother is proud of you. On my (TurnTrout’s) current understanding, “family shard” is just an abstraction of a set of heterogeneous subshards which are downstream of similar historical reinforcement events (e.g. related to spending time with your family). By and large, subshards of the same shard do not all steer towards the same kind of future.

“Shard Theory”

Over the last several months, many people have read either a draft version of this document, Alignment Forum comments by shard theory researchers, or otherwise heard about “shard theory” in some form. However, in the absence of a canonical public document explaining the ideas and defining terms, “shard theory” has become overloaded. Here, then, are several definitions.

This document lays out (the beginning of) the shard theory of human values. This theory attempts a mechanistic account of how values / decision-influencers arise in human brains.
1. As hinted at by our remark on shard theory mispredicting behavior in pigeons, we also expect this theory to qualitatively describe important aspects of animal cognition (insofar as those animals satisfy learning from scratch + self-supervised learning + reinforcement learning).
2. Typical shard theory questions:
  1. “What is the mechanistic process by which a few people developed preferences over what happens under different laws of physics?”
  2. “What is the mechanistic basis of certain shards (e.g. people respecting you) being ‘reflectively endorsed’, while other shards (e.g. avoiding spiders) can be consciously ‘planned around’ (e.g. going to exposure therapy so that you stop embarrassingly startling when you see a spider)?” Thanks to Thane Ruthenis for this example.
  3. “Why do humans have good general alignment properties, like robustness to ontological shifts?”
The shard paradigm/theory/frame of AI alignment analyzes the value formation processes which will occur in deep learning, and tries to figure out their properties.
1. Typical questions asked under this paradigm/frame:
  1. “How can we predictably control the way in which a policy network generalizes? For example, under what training regimes and reinforcement schedules would a CoinRun agent generalize to pursuing coins instead of the right end of the level? What quantitative relationships and considerations govern this process?”
  2. “Will deep learning agents robustly and reliably navigate ontological shifts?”
2. This paradigm places a strong (and, we argue, appropriate) emphasis on taking cues from humans, since they are the only empirical examples of real-world general intelligences which “form values” in some reasonable sense.
3. That said, alignment implications are out of scope for this post. We postpone discussion to future posts.
“Shard theory” also has been used to refer to insights gained by considering the shard theory of human values and by operating the shard frame on alignment.
1. We don’t like this ambiguous usage. We would instead say something like “insights from shard theory.”
2. Example insights include Reward is not the optimization target and Human values & biases are inaccessible to the genome.

A.3 Evidence for neuroscience assumptions

In section I, we stated that shard theory makes three key neuroscientific assumptions. Below we restate those assumptions, and give pointers to what we believe to be representative evidence from the psychology & neuroscience literature:

The cortex is basically locally randomly initialized.
1. Steve Byrnes has already written on several key lines of evidence that suggest the telencephalon (which includes the cerebral cortex) & cerebellum learn primarily from scratch. We recommend his writing as an entrypoint into that literature.
2. One easily observable weak piece of evidence: humans are super altricial—if the genome hardcoded a bunch of the cortex, why would babies take so long to become autonomous?
The brain does self-supervised learning.
1. Certain forms of spike-timing dependent plasticity (STDP) as observed in many regions of telencephalon would straightforwardly support self-supervised learning at the synaptic level, as connections are adjusted such that earlier inputs (pre-synaptic firing) anticipate later outputs (post-synaptic firing).
2. Within the hippocampus, place-selective cells fire in the order of the spatial locations they are bound to, with a coding scheme that plays out whole sequences of place codes that the animal will later visit.
3. If the predictive processing framework is an accurate picture of information processing in the brain, then the brain obviously does self-supervised learning.
The brain does reinforcement learning.
1. Within captive animal care, positive reinforcement training appears to be a common paradigm (see this paper for a reference in the case of nonhuman primates). This at least suggests that “shaping complex behavior through reward” is possible.
2. Operant & respondent conditioning methods like fear conditioning have a long history of success, and are now related back to key neural structures that support the acquisition and access of learned responses. These paradigms work so well, experimenters have been able to use them to have mice learn to directly control the activity of a single neuron in their motor cortex.
3. Wolfram Schultz and colleagues have found that the signaling behavior of phasic dopamine in the mesocorticolimbic pathway mirrors that of a TD error (or reward prediction error).
4. In addition to finding correlates of reinforcement learning signals in the brain, artificial manipulation of those signal correlates (through optogenetic stimulation, for example) produces the behavioral adjustments that would be predicted from their putative role in reinforcement learning.
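For readers unfamiliar with the TD error mentioned above: in standard temporal-difference learning it is the quantity delta = r + gamma * V(next state) - V(current state), i.e. the gap between what was received plus what is now expected, and what was expected before. The sketch below runs tabular TD(0) on a hypothetical two-step episode (cue, then reward). The episode structure, learning rate, and function name are assumptions chosen for illustration, not a model of the dopamine system.

```python
def td_errors_over_trials(n_trials=200, alpha=0.1, gamma=1.0):
    """Tabular TD(0) on a hypothetical episode: cue -> reward_state -> end.

    Returns the TD error observed at reward delivery on each trial,
    plus the learned value table.
    """
    value = {"cue": 0.0, "reward_state": 0.0}  # the terminal state has value 0
    errors_at_reward = []
    for _ in range(n_trials):
        # Step 1: cue -> reward_state, no reward delivered yet.
        delta_cue = 0.0 + gamma * value["reward_state"] - value["cue"]
        value["cue"] += alpha * delta_cue
        # Step 2: reward_state -> terminal, reward of 1 delivered.
        delta_reward = 1.0 + gamma * 0.0 - value["reward_state"]
        value["reward_state"] += alpha * delta_reward
        errors_at_reward.append(delta_reward)
    return errors_at_reward, value

errors, value = td_errors_over_trials()
print("TD error at reward, first trial:", errors[0])
print("TD error at reward, last trial: ", errors[-1])
print("learned value of the cue:       ", value["cue"])
```

The error at reward delivery starts large and shrinks toward zero as the reward becomes predicted, while the cue acquires value. This toy has no pre-cue state, so it does not show the prediction error moving to the cue itself.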

^[1]
More precisely, we adopt Steve Byrnes’ stronger conjecture that the telencephalon and cerebellum are locally ~randomly initialized.
^[2]
There are non-synaptic ways to transmit information in the brain, including ephaptic transmission, gap junctions, and volume transmission. We also consider these to be part of a circuit’s mental context.
^[3]
We take an agnostic stance on the form of RL in the brain, both because we have trouble spelling out exact neurally plausible base credit assignment and reinforcement learning algorithms, and so that the analysis does not make additional assumptions.
^[4]
In psychology, “shaping” roughly refers to this process of learning increasingly sophisticated heuristics.
^[5]
Shards activate more strongly in historical reinforcement contexts, according to our RL intuitions, introspective experience, and inference from observed human behavior. We have some abstract theoretical arguments that RL should work this way in the brain, but won't include them in this post.
^[6]
We think human planning is less like Monte-Carlo Tree Search and more like greedy heuristic search. The heuristic is computed in large part by the outputs of the value shards, which themselves receive input from the world model about the consequences of the plan stub.
^[7]
For example, turning back and forth while hungry might produce continual slight negative reinforcement events, at which point good credit assignment blames and downweights the micro-incoherences.
^[8]
We think that “hedonic” shards of value can indeed form, and this would be part of why people seem to intrinsically value “rewarding” experiences. However, two points. 1) In this specific situation, the juice-shard forms around real-life juice. 2) We think that even self-proclaimed hedonists have some substantial values which are reality-based instead of reward-based.
^[9]
We looked for a citation but couldn’t find one quickly.
^[10]
We think the actual historical hanging-out-with-friend reinforcement events transpire differently. We may write more about this in future essays.
^[11]
“It’s easier to kill a distant and unseen victim” seems common-sensically true, but we couldn’t actually find citations. Therefore, we are flagging this as possibly wrong folk wisdom. We would be surprised if it were wrong.
^[12]
Shard theory reasoning says that while humans might be well-described as “hyperbolic discounters”, the real mechanistic explanation is importantly different. People may well not be doing any explicitly represented discounting; instead, discounting may only convergently arise as a superficial regularity! This presents an obstacle to alignment schemes aiming to infer human preferences by assuming that people are actually discounting.
^[13]
We made this timeline up. We expect that we got many details wrong for a typical timeline, but the point is not the exact order. The point is to outline the kind of process by which the world model might arise only from self-supervised learning.
^[14]
For simplicity, we start the analysis at birth. There is probably embryonic self-supervised learning as well. We don’t think it matters for this section.
^[15]
Interesting but presently unimportant: My (TurnTrout’s) current guess is that given certain hard-coded wiring (e.g. where the optic nerve projects), the functional areas of the brain comprise the robust, convergent solution to: How should the brain organize cognitive labor to minimize the large metabolic costs of information transport (and, later, decision-making latency)? This explains why learning a new language produces a new Broca’s area close to the original, and it explains why rewiring ferrets’ retinal projections into the auditory cortex seems to grow a visual cortex there instead. (jacob_cannell posited a similar explanation in 2015.)

The actual function of each functional area is overdetermined by the convergent usefulness of e.g. visual processing or language processing. Convergence builds upon convergence to produce reliable but slightly-varied specialization of cognitive labor across people’s brains. That is, people learn edge detectors because they’re useful, and people’s brains put them in V1 in order to minimize the costs of transferring information.

Furthermore, this process compounds upon itself. Initially there were weak functional convergences, and then mutations finetuned regional learning hyperparameters and connectome topology to better suit those weak functional convergences, and then the convergences sharpened, and so on. We later found that Voss et al.’s Branch Specialization made a similar conjecture about the functional areas.
^[16]
I (TurnTrout) don’t know whether philosophers have already considered this definition (nor do I think that’s important to our arguments here). A few minutes of searching didn’t return any such definition, but please let me know if it already exists!

Man, what a post!

My knowledge of alignment is somewhat limited, so keep in mind some of my questions may be a bit dumb simply because there are holes in my understanding.

It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…

I basically agree with the last sentence of this statement, but I'm trying to figure out how to square it with my knowledge of genetics. Political attitudes, for example, are heritable. Yet I agree there are no hardcoded versions of "democrat" or "republican" in the brain.

This leaves us with a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t people want to wirehead, why do people almost always care about other people if the genome can’t even write circuitry that detects and rewards thoughts about people?

This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome. So the genome is definitely playing a role in creating and shaping biases. I don't know exactly how it does that, but we can observe that such biases are heritable, and we can actually point to specific base pairs in the genome that play a role.

Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.

Wow. I'm not sure if you're aware of this research, but shard theory sounds shockingly similar to Guyenet's description of how the parasitic lamprey makes decisions in "The Hungry Brain". Let me just quote the whole section from Scott Alexander's review of the book:

How does the lamprey decide what to do? Within the lamprey basal ganglia lies a key structure called the striatum, which is the portion of the basal ganglia that receives most of the incoming signals from other parts of the brain. The striatum receives “bids” from other brain regions, each of which represents a specific action. A little piece of the lamprey’s brain is whispering “mate” to the striatum, while another piece is shouting “flee the predator” and so on. It would be a very bad idea for these movements to occur simultaneously – because a lamprey can’t do all of them at the same time – so to prevent simultaneous activation of many different movements, all these regions are held in check by powerful inhibitory connections from the basal ganglia. This means that the basal ganglia keep all behaviors in “off” mode by default. Only once a specific action’s bid has been selected do the basal ganglia turn off this inhibitory control, allowing the behavior to occur. You can think of the basal ganglia as a bouncer that chooses which behavior gets access to the muscles and turns away the rest. This fulfills the first key property of a selector: it must be able to pick one option and allow it access to the muscles.

Spoiler: the pallium is the region that evolved into the cerebral cortex in higher animals.

Each little region of the pallium is responsible for a particular behavior, such as tracking prey, suctioning onto a rock, or fleeing predators. These regions are thought to have two basic functions. The first is to execute the behavior in which it specializes, once it has received permission from the basal ganglia. For example, the “track prey” region activates downstream pathways that contract the lamprey’s muscles in a pattern that causes the animal to track its prey. The second basic function of these regions is to collect relevant information about the lamprey’s surroundings and internal state, which determines how strong a bid it will put in to the striatum. For example, if there’s a predator nearby, the “flee predator” region will put in a very strong bid to the striatum, while the “build a nest” bid will be weak…

Each little region of the pallium is attempting to execute its specific behavior and competing against all other regions that are incompatible with it. The strength of each bid represents how valuable that specific behavior appears to the organism at that particular moment, and the striatum’s job is simple: select the strongest bid. This fulfills the second key property of a selector – that it must be able to choose the best option for a given situation…

With all this in mind, it’s helpful to think of each individual region of the lamprey pallium as an option generator that’s responsible for a specific behavior. Each option generator is constantly competing with all other incompatible option generators for access to the muscles, and the option generator with the strongest bid at any particular moment wins the competition.

You can read the whole review here or the book here. It sounds like you may have independently rederived a theory of how the brain works that neuroscientists have known about for a while.

I think this independent corroboration of the basic outline of the theory makes it even more likely shard theory is broadly correct.

I hope someone can work on the mathematics of shard theory. It seems fairly obvious to me that shard theory or something similar to it is broadly correct, but for it to impact alignment, you're probably going to need a more precise definition that can be operationalized and give specific predictions about the behavior we're likely to see.

I assume that shards are composed of some group of neurons within a neural network, correct? If so, it would be useful if someone can actually map them out. Exactly how many neurons are in a shard? Does the number change over time? How often do neurons in a shard fire together? Do neurons ever get reassigned to another shard during training? In self-supervised learning environments, do we ever observe shards guiding behavior away from contexts in which other shards with opposing values would be activated?

Answers to all the above questions seem likely to be downstream of a mathematical description of shards.

This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome.

I'd also imagine that mathematical skill is heritable. [Finds an article on Google Scholar] The abstract of https://doi.org/10.1037/a0015115 seems to agree. Yet due to information inaccessibility and lack of selection pressure ancestrally, I infer math ability probably isn't hardcoded.

There are a range of possible explanations which reconcile these two observations, like "better genetically specified learning hyperparameters in brain regions which convergently get allocated to math" or "tweaks to the connectivity initialization procedure^[1] involving that brain region (how neurons get ~randomly wired up at the local level)."

I expect similar explanations for heritability of biases.

So the genome is definitely playing a role in creating and shaping biases. I don't know exactly how it does that, but we can observe that such biases are heritable, and we can actually point to specific base pairs in the genome that play a role.

Agreed.

^[1]
Compare eg the efficacy of IID Gaussian initialization of weights in an ANN vs using Xavier to tamp down the variance of activations in later layers.
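A rough sketch of that comparison, under simplifying assumptions: the network below is a stack of purely linear, square layers in plain Python (no nonlinearity, no training), and the function name is hypothetical. It pushes one random input through the stack and records the variance of the activations at each layer, once with unit-variance Gaussian weights and once with Xavier (Glorot) normal initialization, whose standard deviation is sqrt(2 / (fan_in + fan_out)).

```python
import math
import random

def activation_variances(weight_std, width=64, depth=5, seed=0):
    """Push one random input through `depth` linear layers; return per-layer variance.

    Assumption for this sketch: layers are linear and square (width x width).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    variances = []
    for _ in range(depth):
        # Fresh IID Gaussian weight matrix for this layer.
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        x = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        mean = sum(x) / width
        variances.append(sum((xi - mean) ** 2 for xi in x) / width)
    return variances

width = 64
# Naive choice: every weight drawn from a unit-variance Gaussian.
naive = activation_variances(weight_std=1.0, width=width)
# Xavier/Glorot normal init: std = sqrt(2 / (fan_in + fan_out)).
xavier = activation_variances(weight_std=math.sqrt(2.0 / (width + width)), width=width)

print("naive  final-layer variance:", naive[-1])
print("xavier final-layer variance:", xavier[-1])
```

With unit-variance weights the activation variance grows by roughly a factor of the layer width at every layer, whereas the Xavier scaling keeps it on the order of the input variance. The analogy to the genome is only that a small change to the initialization procedure can matter a great deal for what the network later learns.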

Could you clarify what you mean by values not being "hack after evolutionary hack"?

What this sounds like, but I think you don't mean: "Human values are all emergent from a simple and highly general bit of our genetic blueprint, which was simple for evolution to find and has therefore been unchanged more or less since the invention of within-lifetime learning. Evolution never developed a lot of elaborate machinery to influence our values."

What I think you do mean: "Human values are emergent from a simple and general bit of our genetic blueprint (our general learning algorithm), plus a bunch of evolutionary nudges (maybe slightly hackish) to guide this learning algorithm towards things like friendship, eating when hungry, avoiding disgusting things, etc. Some of these nudges generalize so well they've basically persisted across mammalian evolution, while some of them humans only share with social primates, but the point is that even though we have really different values from chimpanzees, that's more because our learning algorithm is scaled up and our environment is different, the nudges on the learning algorithm have barely had to change at all."

What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."

What I think you do mean:

This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor!

What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."

I'd say our position contrasts with "A substantial portion of human value formation is genetically pre-determined in a complicated way, such that values are more like adaptations and less like exaptations—more like contextually-activated genetic machinery and influences than learned artifacts of simple learning-process-signals."

In terms of past literature, I disagree with the psychological nativism I've read thus far. I also have not yet read much evolutionary psychology, but expect to deem most of it implausible due to information inaccessibility of the learned world model.

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts,...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- in part, this success leads some people in the AGI labs to think about mathematical structures of human values, which is an important problem

The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or thinking about multi-agent mind models
- the claims which are novel seem usually somewhat confused (eg human values are inaccessible to the genome or naive RL intuitions)
- the novel terminology is incompatible with existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for advancement of understanding, paradoxically, this may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than pointing to relevant existing research)

Overall, 'shards' have become so popular that reading at least the basics is probably necessary to understand what many people are talking about.

Curated. "Big if true". I love the depth and detail in shard theory. Separate from whether all its details are correct, I feel reading and thinking about this will get me towards a better understanding of humans and artificial networks both, if only via making me reflect on how things work.

I do fear that shard theory gets a bit too much popularity from the coolness of the name, but I do think there is merit here, and if we had more theories of this scope, it'd be quite good.

As you allude by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant for intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is “How did human values evolve despite natural selection optimizing for something different and more crude?” For intent alignment, the relevance is “How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?”)

Studying the conditions for the evolution of genuinely prosocial motivations seems promising to me.

By “prosocial motivations,” I mean something like “trying to be helpful and cooperative” at least in situations where this is “low cost.” (In this sense, classical utilitarians with prosocial motivations are generally safe to be around even for those of us who don’t want to be replaced by hedonium.)

We can make some interesting observations on prosocial motivations in humans:

Due to Elephant in the Brain issues, an aspiration to be prosocial isn't always enough to generate prosociality as a virtue in the way that counts. Something like high metacognition + commitment to high integrity seem required as well.
Not all people have genuinely prosocial motivations.
People who differ from each other on prosocial motivations (and metacognition and integrity) seem to fall into "surprisingly" distinct clusters.

By the last bullet point, I mean that it seems plausible that we can learn a lot about someone's character even in situations that are obviously "a test." E.g., the best venture capitalists don't often fall prey to charlatan founders. Paul Graham writes about his wife Jessica Livingston:

I'm better at some things than Jessica, and she's better at some things than me. One of the things she's best at is judging people. She's one of those rare individuals with x-ray vision for character. She can see through any kind of faker almost immediately. Her nickname within YC was the Social Radar, and this special power of hers was critical in making YC what it is. The earlier you pick startups, the more you're picking the founders. Later stage investors get to try products and look at growth numbers. At the stage where YC invests, there is often neither a product nor any numbers.

If Graham is correct about his wife's ability, this means that people with "shady character" sometimes fail in test situations specifically due to their character – which is strange because you'd expect that the rational strategy in these situations is "act as though you had good character."

In humans, "perfect psychopaths" arguably don't exist. That is, people without genuinely prosocial motivations, even when they're highly intelligent, don't behave the same as genuinely prosocial people in 99.9% of situations while saving their deceitful actions for the most high-stakes situations. Instead, it seems likely that they can't help but behave in subtly suspicious ways even in situations where they're able to guess that judges are trying to assess their character.

From the perspective of Shard Theory's approach, it seems interesting to ask "Why is this?"

My take (inspired by a lot of armchair psychology and – even worse – armchair evolutionary psychology) is the following:

Asymmetric behavioral strategies: Even in "test situations" where the time and means for evaluation are limited (e.g., trial tasks followed by lengthy interviews), people can convey a lot of relevant information through speech. Honest strategies have some asymmetric benefits ("words aren't cheap"). (The term "asymmetric behavioral strategies" is inspired by this comment on "asymmetric tools.")
- Pointing out others’ good qualities.
  - People who consistently praise others for their good qualities, even in situations where this isn’t socially advantageous, credibly signal that they don’t apply a zero-sum mindset to social situations.
- Making oneself transparent (includes sharing disfavorable information).
  - People who consistently tell others why they behave in certain ways, make certain decisions, or hold specific views, present a clearer picture of themselves. Others can then check that picture for consistency. The more readily one shares information, the harder it would be to keep lies consistent. The habit of proactive transparency also sets up a precedent: it makes it harder to suddenly shift to intransparency later on, at one’s convenience.
  - Pointing out one’s hidden negative qualities. One subcategory of “making oneself transparent” is when people disclose personal shortcomings even in situations where they would have been unlikely to otherwise come up. In doing so, they credibly signal that they don’t need to oversell themselves in order to gain others’ appreciation. The more openly someone discloses their imperfections, the more their honest intent and their genuine competencies will shine through.
- Handling difficult interpersonal conversations on private, prosocial emotions.
  - People who don’t shy away from difficult interpersonal conversations (e.g., owning up to one’s mistakes and trying to resolve conflicts) can display emotional depth and maturity as well as an ability to be vulnerable. Difficult interpersonal conversations thereby serve as a fairly reliable signal of someone’s moral character (especially in real-time without practice and rehearsal) because vulnerability is hard to fake for people who aren’t in touch with emotions like guilt and shame, or are incapable of feeling them. For instance, pathological narcissists tend to lack insight into their negative emotions, whereas psychopaths lack certain prosocial emotions entirely. If people with those traits nonetheless attempt to have difficult interpersonal conversations, they risk being unmasked. (Analogy: someone who lacks a sense of smell will be unmasked when talking about the intricacies of perfumery, even if they've practiced faking it.)
- Any individual signal can be faked. A skilled manipulator will definitely go out of their way to fake prosocial signals or cleverly spin up ambiguities in how to interpret past events. To tell whether a person is manipulative, I recommend giving relatively little weight to single examples of their behavior and focusing on the character qualities that show up the most consistently.
Developmental constraints: The way evolution works, mind designs "cannot go back to the drawing board" – single mutations cannot alter too many things at once without badly messing up the resulting design.
- For instance, manipulators get better at manipulating if they have a psychology of the sort (e.g.) "high approach seeking, low sensitivity to punishment." Developmental constraint: People cannot alter their dispositions at will.
- People who self-deceive become more credible liars. Developmental tradeoff: Once you self-deceive, you can no longer go back and "unroll" what you've done.
- Some people's emotions might have evolved to be credible signals, making people "irrationally" interpersonally vulnerable (e.g., disposition to be fearful and anxious) or "irrationally" affected by others' discomfort (e.g., high affective empathy). Developmental constraint: Faking emotions you don't have is challenging even for skilled manipulators.
- Different niches / life history strategies: Deceptive strategies seem to be optimized for different niches (at least in some cases). For instance, I've found that we can tell a lot about the character of men by looking at their romantic preferences. (E.g., if someone seeks out shallow relationship after shallow relationship and doesn't seem to want "more depth," that can be a yellow flag. It becomes a red flag if they're not honest about their motivations for the relationship and if they prefer to keep the connection shallow even though the other person would want more depth.)
"No man's land" in fitness gradients: In the ancestral environment, asymmetric tools + developmental constraints + inter-species selection pressure for character (neither too weak, nor too strong) produced fitness gradients that steer towards attractors of either high honesty vs high deceitfulness. From a fitness perspective, it sucks to "practice" both extremes of genuine honesty and dishonesty in the same phenotype because the strategies hone in on different sides of various developmental tradeoffs. (And there are enough poor judges of character so that dishonest phenotypes can mostly focus on niches where the attain high reward somewhat easily so they don't have to constantly expose themselves to the highest selection pressures for getting unmasked.)
Capabilities constraints (relative to the capabilities of competent judges): People who find themselves with the deceitful phenotype cannot bridge the gap and learn to act the exact same way a prosocial actor would act (but they can fool incompetent judges or competent judges who face time-constraints or information-constraints). This is a limitation of capabilities: it would be different if people were more skilled learners and had better control over their psychology.

In the context of training TAI systems, we could attempt to recreate these conditions and select for integrity and prosocial motivations. One difficulty here lies in recreating the right "developmental constraints" and in keeping a balance in the relative capabilities of judges and to-be-evaluated agents. (Humans presumably went through an evolutionary arms race related to assessing each others' competence and character, which means that people were always surrounded by judges of similar intelligence.)

Lastly, there's a problem where, if you dial up capabilities too much, it becomes increasingly easy to "fake everything." (For the reasons Ajeya explains in her account of deceptive alignment.)

(If anyone is interested in doing research on the evolution of prosociality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)

I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!

(If anyone is interested in doing research on the evolution of prosociality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)

I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read, I think it's quite important). I think that evolution-based analogies can easily go astray, for reasons pointed out in the essay. (It wasn't obvious to me that you went astray in your comment, to be clear -- more noting this for other readers.)

I think Shard Theory is one of the most promising approaches on human values that I've seen on LW, and I'm very happy to see this work posted. (Of course, I'm probably biased in that I also count my own approaches to human values among the most promising and Shard Theory shares a number of similarities with them - e.g. this post talks about something-like-shards issuing mutually competitive bids that get strengthened or weakened depending on how environmental factors activate those shards, and this post talked about values and world-models being learned in an intertwined manner.)

But how does this help with alignment? Sharded systems seem hard to robustly align outside of the context of an entity who participates on equal footing with other humans in society.

Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.

But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.

I'm basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to follow. This allows cooperative shards to grow. It doesn't seem like it would generalize to more powerful beings.

What do power differentials have to do with the kind of mechanistic training story posited by shard theory?

The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality.

So let's figure out how to supply good reinforcement events to AI agents. I think that approach will generalize pretty well (and is, in a sense, all that success requires in the deep learning alignment regime).

I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).

So for instance learning values by reinforcement events seems likely to lead to deception. If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

This doesn't become much of a problem in practice among humans (or well, it actually does seem to be a fairly significant problem, but not x-risk level significant), but the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other. (There may also be innate honesty instincts? But that runs into genome inaccessibility problems.)

These seem like standard objections around here so I assume you've thought about them. I just don't notice those thoughts anywhere in the work.

I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following?

If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)

So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are "easy" in the sense that they seem correctable by taking more context into consideration.

One issue with giving concrete examples is that I think nobody has gotten RLHF to work in problems that are too "big" for humans to have all the context. So we don't really know how it would work in the regime where it seems irreparably dangerous. Like I could say "what if we give it the task of coming up with plans for an engineering project and it has learned to not make pollution that causes health problems obvious? Due to previously having suggested a design with obvious pollution and having that design punished", but who knows how RLHF will actually be used in engineering?

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.

Like, you said:

shard theory resembles RLHF, and seems to share its flaws

So, if some alignment theory says "this approach (e.g. RLHF) is flawed and probably won't produce human-compatible values", and we notice "shard theory resembles RLHF", then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I'd update against the alignment theory / reasoning which called RLHF flawed. (Of course, there are reasons -- like inductive biases -- that RLHF-like processes could work in humans but not in AI, but any argument against RLHF would have to discriminate between the human/AI case in a way which accounts for those obstructions.)

On the object level:

If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"? If so -- that's not (necessarily) how credit assignment works. The AI's credit assignment isn't necessarily running along the lines of "people were deceived, so upweight computations which deceive people."

Perhaps the AI thought the person would approve of that statement, and so it did it, and got rewarded, which reinforces the approval-seeking shard? (Which is bad news in a different way!)
Perhaps the AI was just exploring into a statement suggested by a self-supervised pretrained initialization, off of an already-learned general heuristic of "sometimes emit completions from the self-supervised pretrained world model." Then the AI reinforces this heuristic (among other changes from the gradient).

the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other.

I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.

How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.

Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"?

Not necessarily a general deception shard, just it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person's life. Whether that's deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else.

If so -- that's not (necessarily) how credit assignment works.

Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access and training to use this write access to do what you are proposing.

I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.

Would you not agree that models are unaligned by default, unless there is something that aligns them?

Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.

Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say "RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation." And then, of course, we should compare the predicted alignment concerns in people, with the observed alignment situation, and update accordingly. I've updated down hard on alignment difficulty when I've run this exercise in the past.

Not necessarily a general deception shard, just it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person's life.

I don't see why this is worth granting the connotations we usually associate with "deception"
I think that if the AI just repeats "I saved someone's life" without that being true, we will find out and stop rewarding that?
1. Unless the AI didn't just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it's already deceptive in a much worse way.
2. But somehow the AI has to get to that cognitive state, first. I think it's definitely possible, but not at all clearly the obvious outcome.

Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access and training to use this write access to do what you are proposing.

That wasn't the part of the post I meant to point to. I was saying that just because we externally observe something we would call "deception/misleading task completion" (e.g. getting us to reward the AI for prosociality), does not mean that "deceptive thought patterns" get reinforced into the agent! The map is not the territory of the AI's updating process. The reward will, I think, reinforce and generalize the AI's existing cognitive subroutines which produced the judged-prosocial behavior, which subroutines don't necessarily have anything to do with explicit deception (as you noted).
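To make the credit-assignment point concrete, here is a minimal toy sketch. It assumes a hypothetical agent whose outputs come from a softmax over three named heuristics (the names, learning rate, and step count are all made up for illustration), trained with a REINFORCE-style update. The update only "sees" which heuristic actually produced the rewarded output, not the label an outside observer would attach to the outcome.

```python
import math

# Hypothetical internal heuristics; the names are illustrative only.
HEURISTICS = ["approval_seeking", "emit_pretrained_completion", "explicit_deception"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, chosen, reward, lr=0.5):
    """One REINFORCE-style update.

    For a softmax policy, the gradient of log pi(chosen) with respect to
    the logits is (one_hot(chosen) - probs), so reward flows to whichever
    heuristic was actually active.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == chosen else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(active_heuristic, steps=20):
    """Simplification: the same heuristic produces the rewarded output on
    every step, and the overseer always judges the output prosocial."""
    logits = [0.0] * len(HEURISTICS)
    chosen = HEURISTICS.index(active_heuristic)
    for _ in range(steps):
        logits = reinforce_step(logits, chosen, reward=1.0)
    return dict(zip(HEURISTICS, softmax(logits)))


if __name__ == "__main__":
    # The overseer was misled in every episode, yet the upweighted
    # computation is approval-seeking, not explicit deception.
    print(train("approval_seeking"))
```

In this toy, an observer could describe every rewarded episode as "the AI misled us," and the "explicit_deception" entry would still lose probability mass, because it never produced the rewarded behavior. Real credit assignment in a deep network is far messier than a three-way softmax, so this only illustrates the logical gap between the external label and the internal update.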

Would you not agree that models are unaligned by default, unless there is something that aligns them?

Is a donut "unaligned by default"? The networks start out randomly initialized. I agree that effort has to be put in to make the AI care about human-good outcomes in particular, as opposed to caring about ~nothing, or caring about some other random set of reward correlates. But I'm not assuming the model starts out deceptive, nor that it will become that with high probability. That's one question I'm trying to figure out with fresh eyes.

I think I sorta disagree in the sense that high-functioning sociopaths live in the same society as neurotypical people, but don’t wind up “aligned”. I think the innate reward function is playing a big role. (And by the way, nobody knows what that innate human reward function is or how it works, according to me.) That said, maybe the innate reward function is insufficient and we also need multi-agent dynamics. I don’t currently know.

I’m sympathetic to your broader point, but until somebody says exactly what the rewards (a.k.a. “reinforcement events”) are, I’m withholding judgment. I’m open to the weaker argument that there are kinda dumb obvious things to try where we don’t have strong reason to believe that they will create friendly AGI, but we also don’t have strong reason to believe that they won’t create friendly AGI. See here. This is a less pessimistic take than Eliezer’s, for example.

I'm often asked, Why "shard theory"? I suggested this name to Quintin when realizing that human values have the type signature of contextually activated decision-making influences. The obvious choice, then, was to call these things "shards of value"—drawing inspiration from Eliezer's Thou art godshatter, where he originally wrote "shard of desire."

(Contrary to several jokes, the choice was not just "because 'shard theory' sounds sick.")

This name has several advantages. Value-shards can have many subshards/facets which vary contextually (a real crystal may look slightly different along its faces or have an uneven growth pattern); value-shards grow in influence over time under repeated positive reinforcement (just as real crystals can grow); value-shards imply a degree of rigidity, but also incompleteness—they are pieces of a whole (on my current guess, the eventual utility function which is the reflective equilibrium of value-handshakes between the set of endorsed shards which bid as a function of their own future prospects). Lastly, a set of initial shards will (I expect) generally steer the future towards growing themselves (e.g. banana-shard leads to more banana-consumption -> more reward -> the shard grows and becomes more sophisticated); similarly, given an initial start of a crystalline lattice which is growing, I'd imagine it becomes more possible to predict the later lattice configuration due to the nature of crystals.

This is really interesting. It's hard to speak too definitively about theories of human values, but for what it's worth these ideas do pass my intuitive smell test.

One intriguing aspect is that, assuming I've followed correctly, this theory aims to unify different cognitive concepts in a way that might be testable:

On the one hand, it seems to suggest a path to generalizing circuits-type work to the model-based RL paradigm. (With shards, which bid for outcomes on a contextually activated basis, being analogous to circuits, which contribute to prediction probabilities on a contextually activated basis.)
On the other hand, it also seems to generalize the psychological concept of classical conditioning (Pavlov's salivating dog, etc.), which has tended to be studied over the short term for practical reasons, to arbitrarily (?) longer planning horizons. The discussion of learning in babies also puts one in mind of the unfortunate Little Albert Experiment, done in the 1920s:

For the experiment proper, by which point Albert was 11 months old, he was put on a mattress on a table in the middle of a room. A white laboratory rat was placed near Albert and he was allowed to play with it. At this point, Watson and Rayner made a loud sound behind Albert's back by striking a suspended steel bar with a hammer each time the baby touched the rat. Albert responded to the noise by crying and showing fear. After several such pairings of the two stimuli, Albert was presented with only the rat. Upon seeing the rat, Albert became very distressed, crying and crawling away.

[...]

In further experiments, Little Albert seemed to generalize his response to the white rat. He became distressed at the sight of several other furry objects, such as a rabbit, a furry dog, and a seal-skin coat, and even a Santa Claus mask with white cotton balls in the beard.

A couple more random thoughts on stories one could tell through the lens of shard theory:

As we age, if all goes well, we develop shards with longer planning horizons. Planning over longer horizons requires more cognitive capacity (all else equal), and long-horizon shards do seem to have some ability to either reinforce or dampen the influence of shorter-horizon shards. This is part of the continuing process of "internally aligning" a human mind.
Introspectively, I think there is also an energy cost involved in switching between "active" shards. Software developers understand this as context-switching, actively dislike it, and evolve strategies to minimize it in their daily work. I suspect a lot of the biases you might categorize under "resistance to change" (projection bias, sunk cost fallacy and so on) have this as a factor.

I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?

Anyway, great post. Looking forward to more.

I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy. Importantly, I didn't want the reader to think that we're positing a bunch of homunculi. Maybe I should have just written that.

But I also feel relatively ignorant about more advanced shard dynamics. While I can give interesting speculation, I don't have enough evidence-fuel to make such stories actually knowably correct.

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.

What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards?

I would find it plausible to assume by default that shards have something like differing world models since we know from cognitive psychology that e.g. different emotional states tend to activate similar memories (easier to remember negative things about your life when you're upset than if you are happy), and different emotional states tend to activate different shards.

I suspect that something like the Shadlen & Shohamy take on decision-making might be going on:

The proposal is that humans make choices based on subjective value [...] by perceiving a possible option and then retrieving memories which carry information about the value of that option. For instance, when deciding between an apple and a chocolate bar, someone might recall how apples and chocolate bars have tasted in the past, how they felt after eating them, what kinds of associations they have about the healthiness of apples vs. chocolate, any other emotional associations they might have (such as fond memories of their grandmother’s apple pie) and so on.
Shadlen & Shohamy further hypothesize that the reason why the decision process seems to take time is that different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.
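As a rough illustration of the bottleneck described in the quoted summary, here is a toy sketch. It assumes each memory network contributes one signed piece of evidence (positive favors a hypothetical option "A", negative favors "B"), that at most `n_pipes` networks can report per time step, and that a choice is made once the running total crosses a threshold. The function name, the evidence values, and the threshold are all invented for illustration and are not taken from Shadlen & Shohamy.

```python
def serial_accumulation(evidence, n_pipes=1, threshold=2.0):
    """Accumulate signed evidence from memory networks through a
    limited number of 'pipes'.

    evidence:  one signed value per memory network.
    n_pipes:   how many networks can report in a single time step.
    threshold: absolute evidence needed to commit to a choice.

    Returns (choice, steps), where choice is "A", "B", or "undecided".
    """
    total = 0.0
    steps = 0
    for start in range(0, len(evidence), n_pipes):
        # Only n_pipes networks get access to the accumulator per step.
        total += sum(evidence[start:start + n_pipes])
        steps += 1
        if abs(total) >= threshold:
            break
    if abs(total) < threshold:
        return "undecided", steps
    return ("A" if total > 0 else "B"), steps


if __name__ == "__main__":
    memories = [0.5, 0.7, -0.2, 0.6, 0.9]
    for pipes in (1, 2, 5):
        print(pipes, serial_accumulation(memories, n_pipes=pipes))
```

With the same memories, fewer pipes means more serial steps before the threshold is reached, which is the qualitative claim about decision time in the summary above.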

Under that view, I think that shards would effectively have separate world models, since each physically separate memory network suggesting that an action is good or bad is effectively its own shard; and since a memory network is a miniature world model, there's a sense in which shards are nothing but separate world models.

E.g. the memory of "licking the juice tasted sweet" is a miniature world model according to which licking the juice lets you taste something sweet, and is also a shard. (Or at least it forms an important component of a shard.) That miniature world model is separate from the shard/memory network/world model holding instances of times when adults taught the child to say "thank you" when given something; the latter shard only has a world model of situations where you're expected to say "thank you", and no world model of the consequences of licking juice.

Got it. That makes sense, thanks!

I wonder how the following behavioral patterns fit into Shard Theory:

Many mammalian species have a strong default aversion to the young of their own species. They (including females) deliberately avoid contact with the young and can even be infanticidal. Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them etc., something she has never done before. Do you think this can be explained by the rewiring of her reward circuit such that she finds simple actions associated with the pups highly rewarding and then bootstraps to learning complex behaviors from that?
Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water, even though on all previous occasions they found it extremely aversive, which caused them to develop a conditioned fear response to the cue predictive of salty water. (see Steve's post on this experiment)

Physiological events associated with pregnancy (mostly hormones) rewire the mother's brain such that when she gives birth, she immediately takes care of the young, grooms them etc., something she has never done before.
Salt-starved rats develop an appetite for salt and are drawn to stimuli predictive of extremely salty water

I've been wondering about the latter for a while. These two results are less strongly predicted by shard theoretic reasoning than by "hardcoded" hypotheses. Pure-RL+SL shard theory loses points on these two observations, and points to other mechanisms IMO (or I'm missing some implications of pure-RL+SL shard theory).

This is pretty exciting. I've not really done any direct work to push forward alignment in the last couple years, but this is exactly the sort of direction I was hoping someone would go when I wrote my research agenda for deconfusing human values. What came out of it was that there was some research to do that I wasn't equipped to do myself, and I'm very happy to say you've done the sort of thing I had hoped for.

On first pass this seems to address many of the common problems with traditional approaches to formalizing values. I hope that this proves a fruitful line of research!

Ever since the discovery that the mammalian dopamine system implements temporal difference learning of reward prediction error, a longstanding question for those seeking a satisfying computational account of subjective experience has been: what is the relationship between happiness and reward (or reward prediction error)? Are they the same thing?

Or if not, is there some other natural correspondence between our intuitive notion of “being happy” and some identifiable computational entity in a reinforcement learning agent?

A simple reflection shows that happiness is not identical to reward prediction error: If I’m on a long, tiring journey of predictable duration, I still find relief at the moment I reach my destination. This is true even for journeys I’ve taken many times before, so that there can be little question that my unconscious has had opportunity to learn the predicted arrival time, and this isn’t just a matter of my conscious predictions getting ahead of my unconscious ones.

On the other hand, I also gain happiness from learning, well before I arrive, that traffic on my route has dissipated. So there does seem to be some amount of satisfaction gained just from learning new information, even prior to “cashing it in”. Hence, happiness is not identical to simple reward either.
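To make the distinction between reward and reward prediction error concrete, here is a minimal tabular TD(0) sketch of the journey example. The state names, the arrival rewards, the 50/50 traffic odds, and the learning rate are all illustrative assumptions, not anything measured; the point is only that the two quantities come apart at exactly the moments described above.

```python
import random

def train_td0(episodes=5000, alpha=0.02, seed=0):
    """Tabular TD(0) on a toy journey; returns the learned value table."""
    rng = random.Random(seed)
    # Hypothetical arrival rewards: arriving after a clear road beats arriving after a jam.
    arrival_reward = {"clear": 1.0, "jammed": 0.2}
    value = {"en_route": 0.0, "clear": 0.0, "jammed": 0.0}
    for _ in range(episodes):
        # Step 1: traffic news arrives mid-journey; no reward is delivered yet.
        news = "clear" if rng.random() < 0.5 else "jammed"
        delta = 0.0 + value[news] - value["en_route"]
        value["en_route"] += alpha * delta
        # Step 2: arrival; the terminal state has value 0.
        delta = arrival_reward[news] + 0.0 - value[news]
        value[news] += alpha * delta
    return value

value = train_td0()
# Reward prediction error (RPE) = reward + V(next) - V(current), with discount 1.
rpe_on_good_news = 0.0 + value["clear"] - value["en_route"]  # reward is 0 here
rpe_on_arrival = 1.0 + 0.0 - value["clear"]                  # reward is 1 here
print(f"good news: reward=0.0, RPE={rpe_on_good_news:+.2f}")
print(f"arrival:   reward=1.0, RPE={rpe_on_arrival:+.2f}")
```

After training, the good-news moment carries a clearly positive RPE with zero reward, while the fully predicted arrival carries reward with an RPE near zero. If relief is felt at both moments, it cannot be tracking either quantity alone.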

Perhaps shard theory can offer a straightforward answer here: happiness (respectively suffering) is when a realized feature of the agent’s world model corresponds to something that a shard which is currently active values (respectively devalues).

If this is correct, then happiness, like value, is not a primitive concept like reward (or reward prediction error), but instead relies on at least having a proto-world model.

It also explains the experience some have had, achieved through the use of meditation or other deliberate effort, of bodily pain without attendant suffering. They are presumably finding ways to activate shards that simply do not place negative value on pain.

Finally: happiness is then not a unidimensional, inter-comparable thing, but instead each type is to an extent sui generis. This comports with my intuition: I have no real scale on which I can weigh the pleasure of an orgasm against the delight of mathematical discovery.

In the things you write, I see a clear analogy with Bernard Baars' Global Workspace Theory. Especially his writings on "Goal Frames" and "Frame Stacks" seem to overlap with some of your ideas on how shards bid for global dominance. Also not unlike Dennett's "Fame in the Brain".

GWT is also largely a theory on how a massively parallel group of processors can give rise to a limited, serial conscious experience. His work is a bit difficult to get into and it's been a while, so it would take me some more time to write up a distillation. Let me know if you are interested.

Thank you for the post!

I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation; however, I'm not sure how much predictive power you gain out of the shards. The model of value formation you present feels close to the AlphaGo setup:

You have an encoder E, an action decoder D, and a value head V. You train D∘E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems trained with D∘E doing exactly supervised learning), and train V∘E with hard-coded sparse rewards. This looks very close to shard theory, except that you replace V with a bunch of shards, right? However, I think this latter part doesn't make predictions different from "V is a neural network", because neural networks often learn context-dependent things, and I expect the AlphaGo V-network to be very context-dependent.
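One way to make the comparison concrete is the sketch below. It is not the post's formal definition of a shard: modelling V as a sum of sigmoid-gated linear terms is an assumption made purely for illustration, and the feature meanings and weights are hypothetical. Only E and V are sketched; D is omitted.

```python
import math

def encode(observation):
    """E: stand-in encoder mapping an observation (list of floats) to features."""
    return [math.tanh(x) for x in observation]

def monolithic_value(z, weights):
    """V as a single learned function of the encoding (here simply linear)."""
    return sum(w * zi for w, zi in zip(weights, z))

def sharded_value(z, shards):
    """V as a sum of contextually gated terms; each shard is (gate_w, value_w)."""
    total = 0.0
    for gate_w, value_w in shards:
        # Gate in (0, 1): how strongly this shard is activated by the context.
        gate = 1.0 / (1.0 + math.exp(-sum(g * zi for g, zi in zip(gate_w, z))))
        total += gate * sum(v * zi for v, zi in zip(value_w, z))
    return total

# Hypothetical "juice" shard keyed on feature 0 (e.g. "juice pouch in view").
juice_shard = ([8.0, 0.0], [1.0, 0.0])
in_context = encode([3.0, 0.0])
out_of_context = encode([-3.0, 0.0])
print(sharded_value(in_context, [juice_shard]))       # shard is active
print(sharded_value(out_of_context, [juice_shard]))   # shard is nearly silent
print(monolithic_value(out_of_context, [1.0, 0.0]))   # linear V still responds
```

Both versions are just functions of the encoding, and a sufficiently expressive network could represent the gated form too, which is the worry raised here: the gated decomposition only says something new if it constrains which functions V ends up being.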

Is sharding a way to understand what neural networks can do in human understandable terms? Or is it a claim about what kind of neural network V is (because there are neural networks which aren't very "shard-like")?

Or do you think that sharding explains more than "the brain is like AlphaGo"? For example, maybe it's hard for different parts of the V network to self-reflect. But that feels pretty weak, because humans don't do that much either. Did I miss important predictions that shard theory makes and the classic RL+supervised learning setup doesn't?

How does shard theory explain romantic jealousy? It seems like most people feel jealous when their romantic partner does things like dancing with someone else or laughing at their jokes. How do shards like this form from simple reward circuitry? I'm having trouble coming up with a good story of how this happens. I would appreciate it if someone could sketch one out for me.

I don't know.

Speculatively, jealousy responses/worries could be downstream of imitation/culture (which "raises the hypothesis"/has self-supervised learning ingrain the completion, such that now the cached completion is a consequence which can be easily hit by credit assignment / upweighted into a real shard). Another source would be negative reward events on outcomes where you end up alone / their attentions stray. Which, itself, isn't from simple reward circuitry, but a generalization of other learned reward events which I expect are themselves downstream of simple reward circuitry. (Not that that reduces my confusion much)

Man, what a post!

My knowledge of alignment is somewhat limited, so keep in mind some of my questions may be a bit dumb simply because there are holes in my understanding.

It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…

This leaves us with a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t people want to wirehead, why do people almost always care about other people if the genome can’t even write circuitry that detects and rewards thoughts about people?

Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.

How does the lamprey decide what to do? Within the lamprey basal ganglia lies a key structure called the striatum, which is the portion of the basal ganglia that receives most of the incoming signals from other parts of the brain. The striatum receives “bids” from other brain regions, each of which represents a specific action. A little piece of the lamprey’s brain is whispering “mate” to the striatum, while another piece is shouting “flee the predator” and so on. It would be a very bad idea for these movements to occur simultaneously – because a lamprey can’t do all of them at the same time – so to prevent simultaneous activation of many different movements, all these regions are held in check by powerful inhibitory connections from the basal ganglia. This means that the basal ganglia keep all behaviors in “off” mode by default. Only once a specific action’s bid has been selected do the basal ganglia turn off this inhibitory control, allowing the behavior to occur. You can think of the basal ganglia as a bouncer that chooses which behavior gets access to the muscles and turns away the rest. This fulfills the first key property of a selector: it must be able to pick one option and allow it access to the muscles.

Spoiler: the pallium is the region that evolved into the cerebral cortex in higher animals.

Each little region of the pallium is responsible for a particular behavior, such as tracking prey, suctioning onto a rock, or fleeing predators. These regions are thought to have two basic functions. The first is to execute the behavior in which it specializes, once it has received permission from the basal ganglia. For example, the “track prey” region activates downstream pathways that contract the lamprey’s muscles in a pattern that causes the animal to track its prey. The second basic function of these regions is to collect relevant information about the lamprey’s surroundings and internal state, which determines how strong a bid it will put in to the striatum. For example, if there’s a predator nearby, the “flee predator” region will put in a very strong bid to the striatum, while the “build a nest” bid will be weak…

Each little region of the pallium is attempting to execute its specific behavior and competing against all other regions that are incompatible with it. The strength of each bid represents how valuable that specific behavior appears to the organism at that particular moment, and the striatum’s job is simple: select the strongest bid. This fulfills the second key property of a selector – that it must be able to choose the best option for a given situation…

With all this in mind, it’s helpful to think of each individual region of the lamprey pallium as an option generator that’s responsible for a specific behavior. Each option generator is constantly competing with all other incompatible option generators for access to the muscles, and the option generator with the strongest bid at any particular moment wins the competition.
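The selection scheme in the quoted passage can be sketched as a simple strongest-bid rule. The three option generators, their bid formulas, and the state fields below are hypothetical stand-ins chosen for illustration; the passage itself only says that each region computes a bid from the animal's surroundings and internal state, and that the strongest bid wins.

```python
def flee_predator_bid(state):
    """Very strong bid whenever a predator is nearby."""
    return 10.0 if state["predator_nearby"] else 0.0

def track_prey_bid(state):
    """Bid scales with hunger, but only when prey is visible."""
    return 5.0 * state["hunger"] if state["prey_visible"] else 0.0

def build_nest_bid(state):
    """Weak constant bid for a low-urgency behavior."""
    return 1.0

OPTION_GENERATORS = {
    "flee_predator": flee_predator_bid,
    "track_prey": track_prey_bid,
    "build_nest": build_nest_bid,
}

def select_behavior(state, generators=OPTION_GENERATORS):
    """Striatum-like selector: collect every bid, release only the strongest."""
    bids = {name: bid(state) for name, bid in generators.items()}
    winner = max(bids, key=bids.get)
    return winner, bids

state = {"predator_nearby": True, "prey_visible": True, "hunger": 0.9}
winner, bids = select_behavior(state)
print(winner, bids)
```

In this toy version, every behavior is "off" unless it wins, which mirrors the default inhibition described above. A shard-style variant would bid on plans evaluated by the world model rather than on fixed motor behaviors.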

You can read the whole review here or the book here. It sounds like you may have independently rederived a theory of how the brain works that neuroscientists have known about for a while.

I think this independent corroboration of the basic outline of the theory makes it even more likely shard theory is broadly correct.

Answers to all the above questions seem likely to be downstream of a mathematical description of shards.

This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome.

I expect similar explanations for heritability of biases.

So the genome is definitely playing a role in creating and shaping biases. I don't know exactly how it does that, but we can observe that such biases are heritable, and we can actually point to specific base pairs in the genome that play a role.

Agreed.

Compare, e.g., the efficacy of IID Gaussian initialization of weights in an ANN vs. using Xavier initialization to tamp down the variance of activations in later layers.
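As a rough illustration of that comparison, the sketch below pushes one random input through a stack of purely linear layers and records the spread of activations at each depth. The width, depth, and absence of nonlinearities are simplifying assumptions; the Xavier rule used is the standard normal variant with standard deviation sqrt(2 / (fan_in + fan_out)).

```python
import math
import random

def activation_std_by_layer(weight_std_fn, width=64, depth=6, seed=0):
    """Return the standard deviation of activations after each linear layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    stds = []
    for _ in range(depth):
        std = weight_std_fn(width, width)
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        # Linear layer only; nonlinearities are left out to isolate the scaling effect.
        x = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
        mean = sum(x) / width
        stds.append(math.sqrt(sum((xi - mean) ** 2 for xi in x) / width))
    return stds

def unit_gaussian(fan_in, fan_out):
    """Unscaled IID Gaussian weights."""
    return 1.0

def xavier(fan_in, fan_out):
    """Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

naive = activation_std_by_layer(unit_gaussian)
scaled = activation_std_by_layer(xavier)
for layer, (n, s) in enumerate(zip(naive, scaled), start=1):
    print(f"layer {layer}: unit Gaussian std={n:.3g}, Xavier std={s:.3g}")
```

With unit-variance weights the activation scale grows by roughly a factor of sqrt(width) per layer, while the Xavier-scaled stack stays near its input scale.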

Could you clarify what you mean by values not being "hack after evolutionary hack"?

What I think you do mean:

This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor!

What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."

I do fear that shard theory gets a bit too much popularity from the coolness of the name, but I do think there is merit here, and if we had more theories of this scope, it'd be quite good.

We can make some interesting observations on prosocial motivations in humans:

Due to Elephant in the Brain issues, an aspiration to be prosocial isn't always enough to generate prosociality as a virtue in the way that counts. Something like high metacognition + commitment to high integrity seems required as well.
Not all people have genuinely prosocial motivations.
People who differ from each other on prosocial motivations (and metacognition and integrity) seem to fall into "surprisingly" distinct clusters.

I'm better at some things than Jessica, and she's better at some things than me. One of the things she's best at is judging people. She's one of those rare individuals with x-ray vision for character. She can see through any kind of faker almost immediately. Her nickname within YC was the Social Radar, and this special power of hers was critical in making YC what it is. The earlier you pick startups, the more you're picking the founders. Later stage investors get to try products and look at growth numbers. At the stage where YC invests, there is often neither a product nor any numbers.

Asymmetric behavioral strategies: Even in "test situations" where the time and means for evaluation are limited (e.g., trial tasks followed by lengthy interviews), people can convey a lot of relevant information through speech. Honest strategies have some asymmetric benefits ("words aren't cheap"). (The term "asymmetric behavioral strategies" is inspired by this comment on "asymmetric tools.")
- Pointing out others’ good qualities.
  - People who consistently praise others for their good qualities, even in situations where this isn’t socially advantageous, credibly signal that they don’t apply a zero-sum mindset to social situations.
- Making oneself transparent (includes sharing disfavorable information).
  - People who consistently tell others why they behave in certain ways, make certain decisions, or hold specific views, present a clearer picture of themselves. Others can then check that picture for consistency. The more readily one shares information, the harder it would be to keep lies consistent. The habit of proactive transparency also sets up a precedent: it makes it harder to suddenly shift to intransparency later on, at one’s convenience.
  - Pointing out one’s hidden negative qualities. One subcategory of “making oneself transparent” is when people disclose personal shortcomings even in situations where they would have been unlikely to otherwise come up. In doing so, they credibly signal that they don’t need to oversell themselves in order to gain others’ appreciation. The more openly someone discloses their imperfections, the more their honest intent and their genuine competencies will shine through.
- Handling difficult interpersonal conversations on private, prosocial emotions.
  - People who don’t shy away from difficult interpersonal conversations (e.g., owning up to one’s mistakes and trying to resolve conflicts) can display emotional depth and maturity as well as an ability to be vulnerable. Difficult interpersonal conversations thereby serve as a fairly reliable signal of someone’s moral character (especially in real-time without practice and rehearsing) because vulnerability is hard to fake for people who aren’t in touch with emotions like guilt and shame, or are incapable of feeling them. For instance, pathological narcissists tend to lack insight into their negative emotions, whereas psychopaths lack certain prosocial emotions entirely. If people with those traits nonetheless attempt to have difficult interpersonal conversations, they risk being unmasked. (Analogy: someone who lacks a sense of smell will be unmasked when talking about the intricacies of perfumery, even if they've practiced faking it.)
- Any individual signal can be faked. A skilled manipulator will definitely go out of their way to fake prosocial signals or cleverly spin up ambiguities in how to interpret past events. To tell whether a person is manipulative, I recommend giving relatively little weight to single examples of their behavior and focusing on the character qualities that show up the most consistently.
Developmental constraints: The way evolution works, mind designs "cannot go back to the drawing board" – single mutations cannot alter too many things at once without badly messing up the resulting design.
- For instance, manipulators get better at manipulating if they have a psychology of the sort (e.g.) "high approach seeking, low sensitivity to punishment." Developmental constraint: People cannot alter their dispositions at will.
- People who self-deceive become more credible liars. Developmental tradeoff: Once you self-deceive, you can no longer go back and "unroll" what you've done.
- Some people's emotions might have evolved to be credible signals, making people "irrationally" interpersonally vulnerable (e.g., disposition to be fearful and anxious) or "irrationally" affected by others' discomfort (e.g., high affective empathy). Developmental constraint: Faking emotions you don't have is challenging even for skilled manipulators.
- Different niches / life history strategies: Deceptive strategies seem to be optimized for different niches (at least in some cases). For instance, I've found that we can tell a lot about the character of men by looking at their romantic preferences. (E.g., if someone seeks out shallow relationship after shallow relationship and doesn't seem to want "more depth," that can be a yellow flag. It becomes a red flag if they're not honest about their motivations for the relationship and if they prefer to keep the connection shallow even though the other person would want more depth.)
"No man's land" in fitness gradients: In the ancestral environment, asymmetric tools + developmental constraints + inter-species selection pressure for character (neither too weak, nor too strong) produced fitness gradients that steer towards attractors of either high honesty or high deceitfulness. From a fitness perspective, it sucks to "practice" both extremes of genuine honesty and dishonesty in the same phenotype because the strategies home in on different sides of various developmental tradeoffs. (And there are enough poor judges of character that dishonest phenotypes can mostly focus on niches where they attain high reward somewhat easily, so they don't have to constantly expose themselves to the highest selection pressures for getting unmasked.)
Capabilities constraints (relative to the capabilities of competent judges): People who find themselves with the deceitful phenotype cannot bridge the gap and learn to act the exact same way a prosocial actor would act (but they can fool incompetent judges or competent judges who face time-constraints or information-constraints). This is a limitation of capabilities: it would be different if people were more skilled learners and had better control over their psychology.

Lastly, there's a problem where, if you dial up capabilities too much, it becomes increasingly easy to "fake everything." (For the reasons Ajeya explains in her account of deceptive alignment.)

I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!

(If anyone is interested in doing research on the evolution of prosociality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)

But how does this help with alignment? Sharded systems seem hard to robustly align outside of the context of an entity who participates on equal footing with other humans in society.

But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.

What do power differentials have to do with the kind of mechanistic training story posited by shard theory?

I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something).

These seem like standard objections around here so I assume you've thought about them. I just don't notice those thoughts anywhere in the work.

I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following?

If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.

Like, you said:

shard theory resembles RLHF, and seems to share its flaws

On the object level:

If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

Perhaps the AI thought the person would approve of that statement, and so it did it, and got rewarded, which reinforces the approval-seeking shard? (Which is bad news in a different way!)
Perhaps the AI was just exploring into a statement suggested by a self-supervised pretrained initialization, off of an already-learned general heuristic of "sometimes emit completions from the self-supervised pretrained world model." Then the AI reinforces this heuristic (among other changes from the gradient).

the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other.

I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.

Are you saying "The AI says something which makes us erroneously believe it saved a person's life, and we reward it, and this can spawn a deception-shard"?

If so -- that's not (necessarily) how credit assignment works.

I don't know whether this is a problem at all, in general. I expect unaligned models to convergently deceive us. But that requires them to already be unaligned.

Would you not agree that models are unaligned by default, unless there is something that aligns them?

Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.

Not necessarily a general deception shard; it just spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person's life.

I don't see why this is worth granting the connotations we usually associate with "deception"
I think that if the AI just repeats "I saved someone's life" without that being true, we will find out and stop rewarding that?
1. Unless the AI didn't just happen to get erroneously rewarded for prosociality (as originally discussed), but planned for that to happen, in which case it's already deceptive in a much worse way.
2. But somehow the AI has to get to that cognitive state, first. I think it's definitely possible, but not at all clearly the obvious outcome.

Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access, and training to use this write access, to do what you are proposing.

Would you not agree that models are unaligned by default, unless there is something that aligns them?

(Contrary to several jokes, the choice was not just "because 'shard theory' sounds sick.")

This is really interesting. It's hard to speak too definitively about theories of human values, but for what it's worth these ideas do pass my intuitive smell test.

One intriguing aspect is that, assuming I've followed correctly, this theory aims to unify different cognitive concepts in a way that might be testable:

On the one hand, it seems to suggest a path to generalizing circuits-type work to the model-based RL paradigm. (With shards, which bid for outcomes on a contextually activated basis, being analogous to circuits, which contribute to prediction probabilities on a contextually activated basis.)
On the other hand, it also seems to generalize the psychological concept of classical conditioning (Pavlov's salivating dog, etc.), which has tended to be studied over the short term for practical reasons, to arbitrarily (?) longer planning horizons. The discussion of learning in babies also puts one in mind of the unfortunate Little Albert Experiment, done in the 1920s:

For the experiment proper, by which point Albert was 11 months old, he was put on a mattress on a table in the middle of a room. A white laboratory rat was placed near Albert and he was allowed to play with it. At this point, Watson and Rayner made a loud sound behind Albert's back by striking a suspended steel bar with a hammer each time the baby touched the rat. Albert responded to the noise by crying and showing fear. After several such pairings of the two stimuli, Albert was presented with only the rat. Upon seeing the rat, Albert became very distressed, crying and crawling away.

[...]

In further experiments, Little Albert seemed to generalize his response to the white rat. He became distressed at the sight of several other furry objects, such as a rabbit, a furry dog, and a seal-skin coat, and even a Santa Claus mask with white cotton balls in the beard.

A couple more random thoughts on stories one could tell through the lens of shard theory:

As we age, if all goes well, we develop shards with longer planning horizons. Planning over longer horizons requires more cognitive capacity (all else equal), and long-horizon shards do seem to have some ability to either reinforce or dampen the influence of shorter-horizon shards. This is part of the continuing process of "internally aligning" a human mind.
Introspectively, I think there is also an energy cost involved in switching between "active" shards. Software developers understand this as context-switching, actively dislike it, and evolve strategies to minimize it in their daily work. I suspect a lot of the biases you might categorize under "resistance to change" (projection bias, sunk cost fallacy and so on) have this as a factor.

Anyway, great post. Looking forward to more.

I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?

But I also feel relatively ignorant of more advanced shard dynamics. While I can give interesting speculation, I don't have enough evidence-fuel to make such stories actually knowably correct.

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy.

I suspect that something like the Shadlen & Shohamy take on decision-making might be going on:

The proposal is that humans make choices based on subjective value [...] by perceiving a possible option and then retrieving memories which carry information about the value of that option. For instance, when deciding between an apple and a chocolate bar, someone might recall how apples and chocolate bars have tasted in the past, how they felt after eating them, what kinds of associations they have about the healthiness of apples vs. chocolate, any other emotional associations they might have (such as fond memories of their grandmother’s apple pie) and so on.
Shadlen & Shohamy further hypothesize that the reason why the decision process seems to take time is that different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.
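A toy version of this serial-bottleneck idea is sketched below. The memory list, the signed evidence values, the decision threshold, and the `pipes` parameter are all illustrative assumptions rather than anything taken from Shadlen & Shohamy; the sketch only shows why a narrow channel between memory networks and the accumulator would stretch out decision time.

```python
def decide(memories, threshold=2.0, pipes=1):
    """Accumulate signed evidence; at most `pipes` memories report per time step.

    Positive values favor the apple, negative values favor the chocolate bar.
    """
    evidence = 0.0
    steps = 0
    for start in range(0, len(memories), pipes):
        steps += 1
        evidence += sum(value for _, value in memories[start:start + pipes])
        if abs(evidence) >= threshold:
            break  # enough evidence accumulated to commit to a choice
    choice = "apple" if evidence > 0 else "chocolate"
    return choice, steps, evidence

# Hypothetical memories retrieved while choosing between the two snacks.
memories = [
    ("apples tasted fine", 0.5),
    ("chocolate tasted great", -1.0),
    ("apples are healthy", 1.0),
    ("grandmother's apple pie", 1.5),
    ("felt sluggish after chocolate", 0.5),
]

for pipes in (1, 2, 5):
    choice, steps, evidence = decide(memories, pipes=pipes)
    print(f"pipes={pipes}: choice={choice}, steps={steps}, evidence={evidence}")
```

The same memories produce the same choice in each case, but the number of time steps needed shrinks as the channel widens, which is the qualitative pattern the hypothesis predicts.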
