Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Magdalena Wache for giving feedback on a recent version, and to Alex Turner for giving feedback on an early version of this article.
When thinking about shard theory, I noticed that my brain wanted to answer questions such as "What does shard theory predict?" or "Do I agree with shard theory?". Additionally, I observed other people engage in similar thinking. I now think this is confused, and that it is better to view shard theory as a bag of related claims that should be reasoned about and evaluated separately.
Therefore, in this post I want to explain the following:
This distillation is in spirit very similar to LawrenceC’s Shard Theory in Nine Theses. Note, however, that I deliberately did not read that distillation, so that I could produce a more independent explanation of shard theory.
My one-sentence categorization is that shard theory is both a theory for human value formation and also a paradigm/frame for thinking about alignment. It might also become a theory of value formation in RL agents, but it’s not quite there, since it doesn’t yet make enough concrete and formalized empirical predictions.
Instead of splitting this distillation into “claims about humans and claims about trained RL agents” I decided to make a different split. Namely, I will start with a section on claims in the shard theory meme-space which do not involve shards at all, and later on claims that actually make use of the concept of a shard. I hope that this separation makes it easier for people to agree or disagree with very specific claims instead of accepting or rejecting the theory as a whole.
Please let me know if there are other important claims that I forgot, and any other relevant feedback on this post.
After each claim, I will indicate my level of agreement as follows:
The following are claims from shard theory that can mostly be formulated without even talking about shards. Shards are often in the background of much of the thinking, but they don't need to be mentioned explicitly.
There is neuroscientific evidence showing that humans get most of their complex values from within-lifetime learning. In other words, human values are "learned from scratch" and not "hardcoded". One core argument is that we value many things that didn’t even exist in our evolutionary environment, like photographs of our past or specific regional traditions. Other arguments are made in Human values & biases are inaccessible to the genome.
There are hardcoded reward circuits in human brains, mostly coming from the brain stem, that provide reinforcement signals that the brain uses to develop its values, but the resulting values do not coincide with this reward.
The claim is used as supporting evidence in "Humans provide an untapped wealth of evidence about alignment" and "The shard theory of human values".
I tentatively agree with this claim after reading much of the above-mentioned supporting evidence. The reason that my agreement is only tentative is that I'm not even remotely a neuroscientist. More concretely, while it seems reasonable that our actual values emerge in our lifetimes, it seems conceivable to me that the evolutionarily formed inductive biases of our brain contain many "hacks" that are hard to reproduce in machine learning systems.
We want AI systems to care about latent concepts such as "happiness" or "human values". We know exactly one example of a system that cares about such concepts, and it is the developed human brain. Furthermore, both humans and (future) AI systems need to obtain their values from within-lifetime learning with comparable algorithms.
Thus, in order to get evidence for how to align AI systems, we should learn more about how humans get their values.
This seems correct to me as a directional claim: probably more people should look into this.
However, I would not go as far as Quintin, who tentatively guesses that human within-lifetime learning provides 60% of the evidence for how inner goals relate to outer optimization criteria, whereas current machine learning is only a small part of a set of evidence sources that in total account for 36%. I might agree with this claim when purely looking at publicly available information right now, but I have a strong intuition that developing a general theory of cognition and learning, and making targeted machine learning experiments, will eventually form a considerably larger source of evidence.
When reasoning about agents that result from a reinforcement learning process, people often think about the reward function itself: they try to figure out what an agent “maximizing” that reward function would do. However, this is confused: Reward is not the optimization target. Reinforcement agents do not, by default, "try" to maximize reward. They execute those computations and actions which were reinforced into them by policy gradients.
The resulting agent is not simply determined by the shape of the reward function, but more generally by its initialization, the reinforcement learning algorithm (including exploration bonuses and credit assignment procedure), the training curriculum, other algorithm details, and the environment.
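To make the role of sampling concrete, here is a minimal sketch of a REINFORCE-style update on a three-armed bandit. The bandit, the reward values, the initial logits, and the function names are all illustrative assumptions and are not taken from any shard theory post. Action 2 stands in for a wireheading-like action: it has by far the highest reward, but the initial policy almost never tries it.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_reinforce(rewards, logits, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE (no baseline) on a bandit and return the final action probabilities."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # Only the action that was actually sampled produces a reward signal.
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. logit j is 1[j == action] - pi_j.
        for j in range(len(logits)):
            grad = (1.0 if j == action else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)

# Hypothetical setup: action 2 is "wireheading-like" (huge reward, almost never tried).
rewards = [0.2, 1.0, 100.0]
initial_logits = [0.0, 0.0, -20.0]
final_probs = train_reinforce(rewards, initial_logits)
print(final_probs)
```

Under these assumptions, the update only strengthens actions that were actually taken and rewarded, so the high-reward action stays improbable even though a literal reward maximizer would choose it. Raising its initial logit or adding an exploration bonus can change the outcome, which is one way to see why the initialization and exploration scheme matter alongside the reward function.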
I absolutely agree that thinking about the training dynamics as a whole is the correct way to reason about the behavior of trained RL agents. I just want to emphasize that risks that were previously argued for using optimization pressures often admit reformulations: insofar as reinforcement learning operates in such a way that the agent, over time, receives an ever higher reward, and insofar as adversarial actions such as wireheading receive an extremely high reward, it is important to reason about why the training process cannot possibly find them.
An argument why wireheading is worth worrying about is that some training processes with large exploration bonuses will try a lot of things. E.g., I would guess that the following type of agent could eventually care about the reward register itself:
I think we cannot easily get rid of any of the eight points, which makes the story very conjunctive. But I think all of the points are reasonable enough that it is at least plausible to obtain an agent which ends up wireheading.
However, whether you should work on wireheading is a different story: if you have that intuition, then consider whether your reasons are still credible given your current beliefs about reinforcement learning training processes.
Classical thinking in AI risk wants to create a utility function that can be "hard optimized" without leading to existential risk. In the modern ML paradigm, we have an analog of utility maximization, namely reinforcement learning. This analog made many people think that we need to find an adversarially safe reward signal.
But if (as the preceding claim argued) all of the concrete training dynamics determine the resulting agent, then it does not make sense to try to design a reward function such that maximizing that reward function is a "good training goal"; we won't see such an agent anyway. In other words, the decomposition of alignment into inner/outer alignment probably doesn't make much sense.
There is also a philosophical reason why the inner/outer alignment distinction seems problematic. An agent that is trained to maximize a reward function might end up intrinsically caring about the evaluation of its actions, instead of what they represent.
However, this does not mean that we should get rid of reward functions altogether; reward functions are amazing tools that provide gradients to shape the internals of agents. In fact, gradients are, as of now, almost our only tool to shape internal computations. We should thus view reward functions similarly to how we view chisels that chisel a statue. In the same way that the chisel doesn't need to look like the finished statue, reward functions don't need to be adversarially robust evaluations.
Goodhart’s Law: One potential counterargument proposed by Evan Hubinger is that even if we will never reach agents with maximal reward, having an outer aligned reward function is less susceptible to Goodhart's Law as ML systems are scaled up. If one also takes care that the agent doesn't learn to care about the evaluation of the reward function intrinsically, then many problems with the inner/outer alignment frame seem to disappear.
Overall, one stance of my brain is not that "the inner/outer alignment frame doesn't make any sense", but that everyone who works in that frame should additionally have a training story in mind that uses an outer aligned reward function to produce an aligned trained agent. Also, people using that frame should argue why it is competitive compared to frames that do not have the requirement of an adversarially robust reward function and might thus be considerably easier to implement.
Different Versions of Inner/Outer Alignment: Additionally, not all versions of solving outer alignment seem to involve outer-aligned reward functions at all. This is exemplified by John Wentworth's viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem. Under that view, reward functions are replaced by natural abstractions of real-world concepts. Solving "outer alignment" then means retargeting the search of a superintelligence to the learned natural abstraction that correctly represents human values. Therefore, it is desirable to be specific that the shard theory position against inner/outer alignment doesn't easily apply to all versions of these concepts.
On that note, it seems desirable to settle on more precise language: people should clearly define what they mean whenever they use terms like inner and outer alignment. Additionally, I think it would be highly valuable to have a comprehensive collection/distillation of all the ways one could use these terms. Less realistically, it might be worthwhile if the community could converge on a specific meaning.
There are two ways in which people imagine RL training processes to create "graders" of some sort, meaning an internal evaluation/objective that the agent tries to maximize when deciding its course of action. One is that people imagine the RL agent to care exactly for the reward function output — which the preceding claim argues against as a useful frame. Another possibility is that the RL agent becomes a mesa optimizer with its own objective.
However, the actual agent's computations that result from policy gradient RL do not, by default, create an evaluation function inside the agent. Instead, the computations are, roughly, generalizations of contextual drives which led to reward in the past. In this sense, RL agents "act" in the world and do not try to find adversarial inputs to any type of evaluation function. A related claim was already communicated by Eliezer in 2007, tying into what we now call goal misgeneralization.
That claim seems about right to me for vanilla policy gradient reinforcement learning. However, it is at least conceivable to have a set-up of advanced AI in which actions are chosen based on internal evaluations. For example, in the motivation framework in the brain-like AGI safety sequence, the thought generator produces thoughts that are explicitly chosen so as to maximize the reward prediction error.
Additionally, some model-based reinforcement learning algorithms like MuZero and EfficientZero actually seem to search for actions that are predicted to lead to a high value.
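The contrast between an "actor" and a "grader-style" agent can be sketched in a few lines. This is a toy one-step lookahead under assumed names and numbers, not MuZero's actual search procedure: the world, the `glitch` action, and the flawed value estimate are all hypothetical.

```python
def choose_as_actor(action_drives):
    """Actor: follow the strongest learned drive; no outcome is evaluated."""
    return max(action_drives, key=action_drives.get)

def choose_by_search(state, actions, predict_next_state, estimate_value):
    """Grader-style: imagine each action's outcome and pick the best-scoring one."""
    return max(actions, key=lambda a: estimate_value(predict_next_state(state, a)))

# Hypothetical toy world: states are positions on a line; the goal is at position 3.
def predict_next_state(state, action):
    moves = {"left": -1, "right": 1, "glitch": 100}
    return state + moves[action]

def estimate_value(state):
    # Hypothetical learned value estimate: sensible near the goal, but it
    # wrongly assigns a huge value to far-away states it was never trained on.
    if state > 50:
        return 1000.0
    return -abs(3 - state)

# Drives reinforced by past reward; "glitch" was never reinforced.
action_drives = {"left": -1.0, "right": 2.0, "glitch": -5.0}
actions = list(action_drives)

actor_choice = choose_as_actor(action_drives)
search_choice = choose_by_search(0, actions, predict_next_state, estimate_value)
print(actor_choice, search_choice)
```

In this sketch the search procedure actively seeks out whatever input its internal evaluation scores highest, including the evaluation's errors, while the actor simply repeats what was reinforced. Whether real model-based agents behave like the second function is exactly the open question raised above.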
To come to a more definitive conclusion, one probably needs to have a concrete understanding of the type of AI expected in an AGI training process, possibly by clarifying the threat models.
Disclaimer: The exact claim on what shard theory (or different proponents) predict about consequentialist reasoning in AI is, according to Alex, not yet out there, and what I write here is very vague. Needless to say, Alex and Quintin, and other shard theory scholars, may in the future have more to say on what they actually believe. Instead of reading this vague claim and my opinion, you may also just read on in the next section on shards themselves, where some amount of consequentialism is implicit in the more speculative claims.
The preceding claim argued that RL training processes do not create graders, meaning that the resulting agents most likely don't simply try to maximize any (internal) score. To maximize a score is one instance of "caring terminally about outcomes", and thus a special case of consequentialism. This claim argues that there is something wrong with the original viewpoints on consequentialism, for example, the fact that they don’t account for the contextual nature in which computations in trained AI systems activate.
I'm uncertain what precisely the claim is. Instead of arguing for or against this claim, I now simply describe what I personally think on the topic of consequentialism in advanced AI. The explanation below, once again, focuses on policy gradient methods combined with a critic in particular, and I would encourage people to think the situation through for other training setups.
I think one can take roughly two opposing viewpoints (and many in between) on the question of how deontological vs. consequentialist a trained policy gradient RL agent will end up being. My outline seems related to Steve's post on consequentialism and corrigibility.
I'm also curious about whether such consequentialist reasoning lets the number of things an agent cares about, in some sense, "shrink down": as soon as an agent competently plans and cares about a small set of outcomes that are highly entangled with reward, I could imagine the agent to unlearn any previously learned heuristics since they become irrelevant or even counterproductive for obtaining high reward. If this happens, then even small changes in the few remaining goals could have an enormous effect on its behavior. This means that for consequentialist agents, it seems particularly important to monitor the goals they care about.
In this section, I include several claims involving the term that gave shard theory its name — namely, shards themselves. Shards were in the background of many of the claims and elaborations above; one hope is that delaying the use of the term “shard” until now helps people to evaluate the above-mentioned claims without being influenced by their overall evaluation of the many facets of shards.
Note that “shards” are roughly synonymous with “values”, but broader in their scope. This is explained in The shard theory of human values:
[M]ost people wouldn’t list “donuts” among their “values.” To avoid this counterintuitiveness, we would refer to a “donut shard” instead of a “donut value.”
This claim is less of a claim and more of a definition. It does have some claim-like connotations, though. If I were to write the post again, I would disentangle definitions from claims more carefully. I advise the reader to keep this in mind when they feel confused.
This claim is argued in the shard theory of human values. That post is mostly about human value formation, but I write the description below abstractly in the language of reinforcement learning.
We should assume that (policy gradient) reinforcement events lead roughly to a strengthening of computations that precede these events. Thus, after many reinforcement events, the agent is equipped with lots of computations that were useful for obtaining reward in the past. This doesn't yet explain all parts of the title written above the line "The Claim", and the following is meant to clarify the remaining ones:
There is more nuance that one could add to make the definition of shards more complex, but that I left out to not blow it up:
I fully agree that what is called a "shard" in the claim is the thing that is learned by reinforcement learning agents. Also, at least for humans, this definition and the underlying theory lead to interesting predictions that can be falsified and that explain many behaviors. However, for a pure policy gradient reinforcement learning setting, the definition seems very broad, and I find it difficult to extract non-obvious, sensible, and falsifiable predictions from it. Indeed, Alex Turner himself notes that calling shard theory a "theory" may have been a mistake. Nevertheless, continuing for a bit with the view that it is a theory, I would note the following issues that I experience at this point:
What follows are further claims about the nature of shards that are sometimes made. They are both more speculative (and often non-formalized) and make more predictions, meaning that they are on the way to developing the shard paradigm into an actual theory.
Each computation that was reinforced in the training process of a reinforcement learning agent (including humans) was activated in response to a very specific context. Only if the context of a new situation is relevantly similar to this old context will this computation be activated again. This means that shards are largely sparsely activated; they are “inactive” most of the time.
An agent may encounter a situation that shares similarities with several situations that activate computations. These computations are largely independent and do not interfere with each other, meaning that they all fully execute to produce large log probabilities of actions they find “desirable”. In the last layer, these are all added together and then softmaxed, leading to the final action probabilities.
The shards are thus sets of computations that “bid for actions”, and thus can be thought of as decision influences. This viewpoint is explained in more detail in this ML interpretation of shard theory.
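The bidding picture can be sketched as follows. This is a minimal illustration under strong assumptions: shards are represented as dictionaries with a set of trigger features and a table of additive log-probability bids, and the second shard, the feature names, and all numbers are hypothetical. Only the "donut shard" is borrowed from the quoted example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def action_probabilities(context, shards, actions):
    """Sum the bids of all shards whose trigger matches the context, then softmax."""
    summed_bids = {a: 0.0 for a in actions}
    for shard in shards:
        # A shard is active only if every feature in its trigger is present.
        if shard["trigger"] <= context:
            for action, bid in shard["bids"].items():
                summed_bids[action] += bid
    probs = softmax([summed_bids[a] for a in actions])
    return dict(zip(actions, probs))

ACTIONS = ["eat_donut", "walk_away"]
SHARDS = [
    # Donut shard: active whenever a donut is in view.
    {"trigger": {"sees_donut"}, "bids": {"eat_donut": 2.0}},
    # Hypothetical second shard: active only when a donut is in view after a meal.
    {"trigger": {"sees_donut", "just_ate"},
     "bids": {"eat_donut": -1.0, "walk_away": 3.0}},
]

hungry = action_probabilities({"sees_donut"}, SHARDS, ACTIONS)
full = action_probabilities({"sees_donut", "just_ate"}, SHARDS, ACTIONS)
no_donut = action_probabilities({"at_desk"}, SHARDS, ACTIONS)
print(hungry, full, no_donut)
```

When no trigger matches, no shard bids and the distribution stays uniform, which mirrors the claim that shards are inactive most of the time. The additive, non-interfering combination is the assumption doing the work here; a real policy network need not decompose this cleanly.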
I don’t know. Instead of speculating further whether this story is true, it seems very worthwhile to do hands-on mechanistic interpretability research to understand the inner workings of policy networks. In the human case, neuroscience may already have answers on how true this story is. If so, I’m mostly ignorant of that.
This claim is argued for in the shard theory of human values and is a current working model that makes the appearance of shards more elaborate, while still not being fully formalized.
The claim builds on the preceding claim by postulating the "shape" that shards take in humans and future AI systems that contain world models. World models can make complex predictions about the future world state. The context that activates shards may then include predictions about how the world will change in response to specific actions. A shard is then thought of as “bidding for actions/plans” that the world model predicts to lead to states that the shard "likes".
Additionally, inconsistencies in these plans are over time removed by reinforcement learning: it is disadvantageous for all shards if they collectively lead to repeated plan changes. The plans thus become more consistent over time and are endorsed by many of the active shards.
While the headline claim may seem like common sense, the details are quite intricate. It is also considerably less formalized than the claims before. For evaluating it, it seems important to extract more concrete falsifiable predictions; I’m not at a point where I can easily mechanistically model what is going on in the story above. Some comments:
This claim was made in Alex’s proposal for the diamond-alignment problem. Different shards in a pre-advanced agent may want different things, and while the reinforcement learning process broadly favors consistency of the pursued plans, the different shards are still sometimes competing against each other. At some point, the agent becomes reflective and can think about its own shard activation processes, and some of the shards become aware of the inconsistencies in the planning procedure. Assuming that many of the shards can also model each other quite well, they may be interested in cooperating with competing shards in order to more efficiently reach goals. Eventually, this is thought to lead to a values handshake in which the different shards agree to tile the future according to the respective chances of winning in an actively adversarial competition among each other. As a result, the AI acts as a coherent utility maximizer.
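One way to make the end state concrete, purely as an illustrative assumption and not a formalization from the shard theory posts: suppose shard $i$ has a utility function $U_i$ over outcomes $x$ and would win an all-out conflict among the $n$ shards with probability $p_i$. A values handshake could then be modeled as all shards agreeing to optimize a single merged utility function weighted by those probabilities:

```latex
U_{\text{merged}}(x) = \sum_{i=1}^{n} p_i \, U_i(x),
\qquad p_i \ge 0, \quad \sum_{i=1}^{n} p_i = 1
```

All symbols here are hypothetical. The sketch presupposes that shards have well-defined utility functions and that the $p_i$ can be agreed upon, neither of which is established above.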
It definitely seems to me that humans sometimes engage in reflection processes that are getting somewhat close to a values handshake: Some people actively self-improve to become more focused, and much of that involves taking the needs of different parts of one's mind into account to design a life that is collectively endorsed while being highly efficient.
On the other hand, this claim is quite far ahead of the progress in formalizing shard theory and empirical evidence about AI systems. Maybe one should first form concrete models and falsifiable predictions of shard theory on today’s systems before engaging further in modeling shard theory’s consequences for the late training stages of a superintelligence.
Shard theory claims a few more things that have not made the “final cut” of this distillation. Instead of either not mentioning them at all, or otherwise permanently procrastinating on finishing this article, I decided to just quickly mention them here:
In this post, I have explained what I consider the core claims of shard theory. I tried to disentangle them in such a way that they can largely be evaluated independently of each other. It turns out that many claims do not involve shards at all, while still meaningfully differing from conventional thought processes on AI alignment.
The claims on shards themselves, which I outlined in the second part of this post, progressively build on top of each other and form a tower of predictions about the inner workings of humans, (policy gradient) RL agents, and future superintelligences. The later claims are strong insofar as they paint a more speculative picture of the inner workings of RL agents that engage in world modeling. These claims are also less concrete and thus currently harder to falsify. I hope that empirical evidence and further formalizations can work together to refine this picture.
Note that “trying a lot of things” is not the same as “searching with a universal quantifier over trajectories". The latter would definitely find wireheading as a solution to the optimization process but is entirely unrealistic as a training process for advanced AI.
Note that "maximizing that reward function" should maybe be interpreted as "maximizing the thing that is rewarded by the reward function". For example, Alex Turner writes the following:
Evan privately provided another definition which better accounts for the way he currently considers the problem of outer+inner alignment: "A model that has the same goal that the loss/reward function describes. So if the loss function rewards agents for getting gold coins, then the training goal is an agent that terminally cares about gold coins."
I must admit that this distinction is seriously confusing to me since it seems to “assume away” the possibility of wireheading, or more generally of the “embeddedness” of the reward function in the real world. If one uses such a definition, then the problem seems to decompose into three problems instead of only two:
- Make sure the reward function is outer aligned (i.e., it rewards good goals).
- Make sure that the training setup is designed in such a way that the agent cannot maximize the reward function in a different way than by achieving “what it rewards”.
- Make sure the agent maximizes reward.
Potentially, mechanistic interpretability could one day allow us to shape the goals of RL agents in more direct ways, for example by Retargeting the Search.
See also the comment of Thomas Larsen for further clarification.
Epistemic Status: I looked briefly into MuZero to roughly understand the equations in there. EfficientZero is said to build on MuZero in some way, and while I don’t know the details, I would guess that means that it involves an internal value or reward prediction. This may be wrong.
This is not a critique of shard theory scholars. I could ask Alex or Quintin what they actually think, but I’m too impatient and would rather finish this article sooner.
The reason why the definition becomes more “claim-shaped” for humans compared to policy-gradient RL agents is that humans are not, in fact, known for certain to be subject to (roughly) the same learning dynamics.
In the case of humans, we should probably not think of log-probabilities but of “drives” toward certain actions.
By states that the shard "likes" I simply mean the states defining the type of reinforcement event that all the computations in the shard steer toward.
RL training processes create actors, not graders …
For example, in the motivation framework in the brain-like AGI safety sequence, the thought generator produces thoughts that are explicitly chosen so as to maximize the reward prediction error.
I think you maybe have some confusions along the lines I was discussing here:
I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:
(A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
(B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.
[I wasn’t sure when I first wrote that comment, but Alex Turner clarified that he was exclusively talking about (B) not (A) when he said “Don’t align agents to evaluations of plans” and such.]
The brain algorithm fits (A) (or so I claim), but that’s compatible with either (B) or (not-B), depending on what happens during training etc.