Disentangling Shard Theory into Atomic Claims

Leon Lang

Introduction

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Magdalena Wache for giving feedback on a recent version, and to Alex Turner for giving feedback on an early version of this article.

When thinking about shard theory, I noticed that my brain wanted to answer questions such as "What does shard theory predict?" or "Do I agree with shard theory?". Additionally, I observed other people engage in similar thinking. I now think this is confused, and that it is better to view shard theory as a bag of related claims that should be reasoned about and evaluated separately.

Therefore, in this post I want to explain the following:

What do I currently consider the main claims of shard theory?
What is my own stance on these claims?

This distillation is in spirit very similar to LawrenceC’s Shard Theory in Nine Theses. Note, however, that I deliberately did not read that distillation, so that I could produce a more independent explanation of shard theory.

My one-sentence categorization is that shard theory is both a theory for human value formation and also a paradigm/frame for thinking about alignment. It might also become a theory of value formation in RL agents, but it’s not quite there yet since it doesn’t make enough concrete and formalized empirical predictions.

Instead of splitting this distillation into “claims about humans and claims about trained RL agents” I decided to make a different split. Namely, I will start with a section on claims in the shard theory meme-space which do not involve shards at all, and later on claims that actually make use of the concept of a shard. I hope that this separation makes it easier for people to agree or disagree with very specific claims instead of accepting or rejecting the theory as a whole.

Please let me know if there are other important claims that I forgot, and any other relevant feedback on this post.

After each claim, I will indicate my level of agreement as follows:

✓ : Agree
(✓) : Tentatively agree
? : Neither agree nor disagree

Shard Theory claims without any shards

The following are claims from shard theory that can mostly be formulated without even talking about shards. Shards are often in the background of much of the thinking, but they don't need to be mentioned explicitly.

Humans get their values from within-lifetime learning

The Claim

There is neuroscientific evidence showing that humans get most of their complex values from within-lifetime learning. In other words, human values are "learned from scratch" and not "hardcoded". One core argument is that we value many things that didn’t even exist in our evolutionary environment, like photographs of our past or specific regional traditions. Other arguments are made in Human values & biases are inaccessible to the genome.

There are hardcoded reward circuits in human brains, mostly coming from the brain stem, that provide reinforcement signals that the brain uses to develop its values, but the resulting values do not coincide with this reward.

The claim is used as supporting evidence in humans provide an untapped wealth of evidence about alignment and the shard theory of human values.

My Opinion: (✓)

I tentatively agree with this claim after reading much of the above-mentioned supporting evidence. The reason that my agreement is only tentative is that I'm not even remotely a neuroscientist. More concretely, while it seems reasonable that our actual values emerge in our lifetimes, it seems conceivable to me that the evolutionarily formed inductive biases of our brain contain many "hacks" that are hard to reproduce in machine learning systems.

To achieve alignment of ML systems, we should learn more about how humans get their values

The Claim

We want AI systems to care about latent concepts such as "happiness" or "human values". We know exactly one example of a system that cares about such concepts, and it is the developed human brain. Furthermore, both humans and (future) AI systems need to obtain their values from within-lifetime learning with comparable algorithms.

Thus, in order to get evidence for how to align AI systems, we should learn more about how humans get their values.

My Opinion: ✓

This seems correct to me as a directional claim: probably more people should look into this.

However, I would not go as far as Quintin, who tentatively guesses that human within-lifetime learning provides 60% of the evidence for how inner goals relate to outer optimization criteria, whereas current machine learning is only a small part of a set of evidence sources that in total account for 36%. I might agree with this claim when purely looking at publicly available information right now, but I have a strong intuition that developing a general theory of cognition and learning, and making targeted machine learning experiments, will eventually form a considerably larger source of evidence.

Training dynamics, not optimization pressures, determine the behavior of RL agents

The Claim

When reasoning about agents that result from a reinforcement learning process, people often think about the reward function itself: they try to figure out what an agent “maximizing” that reward function would do. However, this is confused: Reward is not the optimization target. Reinforcement learning agents do not, by default, "try" to maximize reward. They execute those computations and actions which were reinforced into them by policy gradients.

The resulting agent is not simply determined by the shape of the reward function, but more generally by its initialization, the reinforcement learning algorithm (including exploration bonuses and credit assignment procedure), the training curriculum, other algorithm details, and the environment.
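To make this concrete, here is a minimal sketch, assuming a single-state softmax policy over two actions and a plain REINFORCE update. The function names (`softmax`, `reinforce_step`, `train`) and all numbers are illustrative assumptions, not taken from the posts cited above. Reward enters only as a multiplier on the gradient of the log-probability of the action that was actually sampled, so an action the policy almost never tries is almost never reinforced, however much reward it would yield.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update for a single-state softmax policy.

    d/d logit_k of log pi(action) = 1[k == action] - pi(k),
    so each logit moves by lr * reward * (indicator - pi(k)).
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(rewards, init_logits, steps=200, seed=0):
    """Sample actions from the current policy and reinforce what was sampled."""
    rng = random.Random(seed)
    logits = list(init_logits)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        logits = reinforce_step(logits, action, rewards[action])
    return softmax(logits)


# Hypothetical setup: action 1 pays ten times more reward,
# but the initialization makes the policy almost never try it.
final_probs = train(rewards=[1.0, 10.0], init_logits=[8.0, -8.0])
```

In this toy setup the final policy is shaped by the initialization and the sampling process as much as by the reward values: a "reward maximizer" would pick action 1, whereas the trained policy will typically keep preferring action 0.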

My Opinion: ✓

I absolutely agree that thinking about the training dynamics as a whole is the correct way to reason about the behavior of trained RL agents. I just want to emphasize that risks that were previously argued for using optimization pressures often admit reformulations: insofar as reinforcement learning operates in such a way that the agent, over time, receives an ever higher reward, and insofar as adversarial actions — like wireheading — receive an extremely high reward, it is important to reason why the training process cannot possibly find them.

An argument why wireheading is worth worrying about is that some training processes with large exploration bonuses will try a lot of things.^[1] E.g., I would guess that the following type of agent could eventually care about the reward register itself:

An agent trained to an AGI level of competence...
with a strong exploration bonus...
which eventually learns to chunk the world into “lots of abstract concepts”...
which understands that it is trained by RL…
and thereby learns about its “reward register”...
and whose exploration bonus makes it eventually curious about artificially trying to set the reward to a high value…
and whose intelligence and power make it capable of that undertaking...
and whose curiosity happens to suppress thoughts about how the wireheading event will produce a policy gradient that changes the agent's goals.

I think we cannot easily get rid of any of the eight points, which makes the story very conjunctive. But I think all of the points are reasonable enough that it is at least plausible to obtain an agent which ends up wireheading.
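As a rough illustration of what "conjunctive" means here, the chain rule says that the probability of the whole story is the product of the probability of each step given that the earlier steps hold. The sketch below uses purely illustrative credences (they are assumptions for the sake of arithmetic, not estimates made in this post).

```python
import math


def conjunction_probability(conditional_probs):
    """Chain rule: P(A1 and ... and An) = prod_i P(Ai | A1, ..., A(i-1))."""
    return math.prod(conditional_probs)


# Purely illustrative credences for the eight steps, not estimates from the post.
optimistic = conjunction_probability([0.8] * 8)   # roughly 0.17
pessimistic = conjunction_probability([0.5] * 8)  # roughly 0.004
```

Even fairly generous per-step credences multiply out to a noticeably smaller joint probability, while the result stays non-negligible as long as each step is reasonably likely given the earlier ones.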

However, whether you should work on wireheading is a different story: if you have that intuition, then consider whether your reasons are still credible given your current beliefs about reinforcement learning training processes.

Rewards are tools to provide a gradient, not training goals; Against inner/outer alignment

The Claim

Classical thinking in AI risk wants to create a utility function that can be "hard optimized" without leading to existential risk. In the modern ML paradigm, we have an analog of utility maximization, namely reinforcement learning. This analog made many people think that we need to find an adversarially safe reward signal.

But if (as the preceding claim argued) all of the concrete training dynamics determine the resulting agent, then it does not make sense to try to design a reward function such that maximizing that reward function^[2] is a "good training goal"; we won't see such an agent anyway. In other words, the decomposition of alignment into inner/outer alignment probably doesn't make much sense.

There is also a philosophical reason why the inner/outer alignment distinction seems problematic. An agent that is trained with the aim of maximizing a reward function might end up intrinsically caring about the evaluation of its actions, instead of what they represent.

However, this does not mean that we should get rid of reward functions altogether; reward functions are amazing tools that provide gradients to shape the internals of agents. In fact, gradients are, as of now, almost our only tool to shape internal computations.^[3] We should thus view reward functions similarly to how we view chisels that chisel a statue. In the same way that the chisel doesn't need to look like the finished statue, reward functions don't need to be adversarially robust evaluations.

My Opinion: ?

Goodhart’s Law: One potential counterargument proposed by Evan Hubinger is that even if we will never reach agents with maximal reward, having an outer aligned reward function is less susceptible to Goodhart's Law as ML systems are scaled up. If one also takes care that the agent doesn't learn to care about the evaluation of the reward function intrinsically, then many problems with the inner/outer alignment frame seem to disappear.

Overall, one stance of my brain is not that "the inner/outer alignment frame doesn't make any sense", but that everyone who works in that frame should additionally have a training story in mind that uses an outer aligned reward function to produce an aligned trained agent. Also, people using that frame should argue why it is competitive compared to frames that do not have the requirement of an adversarially robust reward function and might thus be considerably easier to implement.

Different Versions of Inner/Outer Alignment: Additionally, not all versions of solving outer alignment seem to involve outer-aligned reward functions at all. This is exemplified by John Wentworth's viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem.^[4] Under that view, reward functions are replaced by natural abstractions of real-world concepts. Solving "outer alignment" then means retargeting the search of a superintelligence to the learned natural abstraction that correctly represents human values. Therefore, it is desirable to be specific that the shard theory position against inner/outer alignment doesn't easily apply to all versions of these concepts.

On that note, it seems desirable to settle on more precise language: people should clearly define what they mean whenever they use terms like inner and outer alignment. Additionally, I think it would be highly valuable to have a comprehensive collection/distillation of all the ways one could use these terms. Less realistically, it might be worthwhile if the community could converge on a specific meaning.

RL training processes create actors, not graders

The Claim

There are two ways in which people imagine RL training processes to create "graders" of some sort, meaning an internal evaluation/objective that the agent tries to maximize when deciding its course of action. One is that people imagine the RL agent to care exactly for the reward function output — which the preceding claim argues against as a useful frame. Another possibility is that the RL agent becomes a mesa optimizer with its own objective.

However, the actual agent's computations that result from policy gradient RL do not, by default, create an evaluation function inside the agent. Instead, the computations are, roughly, generalizations of contextual drives which led to reward in the past. In this sense, RL agents "act" in the world and do not try to find adversarial inputs to any type of evaluation function. A related claim was already communicated by Eliezer in 2007, tying into what we now call goal misgeneralization.

My Opinion: ?

That claim seems about right to me for vanilla policy gradient reinforcement learning. However, it is at least conceivable to have a set-up of advanced AI in which actions are chosen based on internal evaluations. For example, in the motivation framework in the brain-like AGI safety sequence, the thought generator produces thoughts that are explicitly chosen so as to maximize the reward prediction error.

Additionally, some model-based reinforcement learning algorithms like MuZero and EfficientZero actually seem to search for actions that are predicted to lead to a high value.^[5]

To come to a more definitive conclusion, one probably needs to have a concrete understanding of the type of AI expected in an AGI training process, possibly by clarifying the threat models.

(Naive) consequentialist reasoning might not be convergent in advanced AI

The Claim

Disclaimer: The exact claim on what shard theory (or different proponents) predict about consequentialist reasoning in AI is, according to Alex, not yet out there, and what I write here is very vague. Needless to say, Alex and Quintin, and other shard theory scholars, may in the future have more to say on what they actually believe. Instead of reading this vague claim and my opinion, you may also just read on in the next section on shards themselves, where some amount of consequentialism is implicit in the more speculative claims.

The preceding claim argued that RL training processes do not create graders, meaning that the resulting agents most likely don't simply try to maximize any (internal) score. To maximize a score is one instance of "caring terminally about outcomes", and thus a special case of consequentialism. This claim argues that there is something wrong with the original viewpoints on consequentialism, for example, the fact that they don’t account for the contextual nature in which computations in trained AI systems activate.

My Opinion: ?

I'm uncertain what precisely the claim is.^[6] Instead of arguing for or against this claim, I now simply describe what I personally think on the topic of consequentialism in advanced AI. The explanation below, once again, focuses on policy gradient methods combined with a critic in particular, and I would encourage people to think the situation through for other training setups.

I think one can take roughly two opposing viewpoints (and many in between) on the question of how deontological vs. consequentialist a trained policy gradient RL agent will end up being. My outline seems related to Steve's post on consequentialism and corrigibility.

Deontology: On the one end of the spectrum, agents mostly engage in random actions in much of the training process, and, more or less by chance, this leads them to sometimes enter states that have a high evaluation by the critic. This then leads to a contextual strengthening of the action just performed.
I imagine this process to eventually form a “forest-like structure”: the trunks of the trees are actions that immediately lead to past reinforcement events. Splitting points into branches are states. The branches themselves are contextually activated actions that “lead down the tree” to states closer to a past reinforcement event. If a new action is strengthened in the context of a state that does not yet belong to a tree, then this action forms a new tiny branch/leaf of a tree, growing it to a larger size.
- This feels like deontology to me: in its internal computations, the agent likely does not think about the final (rewarding) outcome, but mainly about the "locally" sensible thing to do. Insofar as the agent can be thought to "care" about anything, it probably cares about "doing the right thing" at any given moment.
- In some sense, actions that were in the training process "instrumentally" valuable for obtaining reward may feel "terminally" valuable for such a "deontological" agent.
- The story is made more complicated by function approximation, meaning that the agent generalizes from states seen during training to new states. This applies to both the policy and the value function. Both generalizations can make the agent pursue learned actions that do not actually lead to a high reward.
Consequentialism: On the other end of the spectrum, agents build an elaborate world model and have internal predictors of outcomes that they can directly use to make decisions. Instead of performing actions, they perform “thoughts”, which include plans, and search for plans leading to highly favorable results. Once the best plan is found, it sets actions in motion, eventually leading to states with high reward. A few comments:
- Mechanistically, this view feels a lot less clear to me than the "deontology"-view. It would be quite an effort for me to think about how the training process of a vanilla policy gradient agent would go about finding such a policy. How does the credit assignment algorithm go about improving the search algorithm for the agent instead of building up contextual heuristics for what to do? How does the learning process decide which few outcomes end up as goals of the search processes?
- From the "optimization"-perspective, I can see that such a policy might be more optimal with respect to the reward function, but in light of the above claim that optimization pressures are over-used in reasoning about agent dynamics, this is not very convincing.
- Nevertheless, in the section on shards that “bid on plans” below, I explain a shard theory interpretation of trained RL agents that is quite similar to what I write about here. Also, see Section 3.2 of the new version of The Alignment Problem from a Deep Learning Perspective for preliminary evidence on learned implicit planning and goals in RL agents.

I'm also curious about whether such consequentialist reasoning lets the number of things an agent cares about, in some sense, "shrink down": as soon as an agent competently plans and cares about a small set of outcomes that are highly entangled with reward, I could imagine the agent to unlearn any previously learned heuristics since they become irrelevant or even counterproductive for obtaining high reward. If this happens, then even small changes in the few remaining goals could have an enormous effect on its behavior. This means that for consequentialist agents, it seems particularly important to monitor the goals they care about.

Shard Theory claims involving shards

In this section, I include several claims involving the term that gave shard theory its name — namely, shards themselves. Shards were in the background of many of the claims and elaborations above; one hope is that delaying the use of the term “shard” until now helps people to evaluate the above-mentioned claims without being influenced by their overall evaluation of the many facets of shards.

Note that “shards” are roughly synonymous with “values”, but broader in their scope. This is explained in The shard theory of human values:

[M]ost people wouldn’t list “donuts” among their “values.” To avoid this counterintuitiveness, we would refer to a “donut shard” instead of a “donut value.”

Shards are roughly sets of contextually activated computations that were useful in steering toward reward in the past

The Claim

This claim is less of a claim and more of a definition. It does have some claim-like connotations, though. If I were to write the post again, I would disentangle definitions from claims more carefully. I advise the reader to keep this in mind when they feel confused.

This claim is argued in the shard theory of human values. That post is mostly about human value formation, but I write the description below abstractly in the language of reinforcement learning.

We should assume that (policy gradient) reinforcement events lead roughly to a strengthening of computations that precede these events. Thus, after many reinforcement events, the agent is equipped with lots of computations that were useful for obtaining reward in the past. This doesn't yet explain all parts of the title written above the line "The Claim", and the following is meant to clarify the remaining ones:

The claim says the actions "steer toward" reward. This is meant to convey that the agent does not necessarily immediately obtain reward after each learned action, but that larger action sequences may be necessary for finally obtaining reward.
The claim says the computations are "contextually activated": they activate in the specific context in which they were first invoked. However, function approximation means that the same computation may be activated in many contexts that share similarities with the original states. This is part of what the term “roughly” is supposed to convey.
The claim does not say that these computations still effectively steer toward reward: they were useful in the past. Indeed, the reward function may meanwhile have changed, or other changes to the policy may change the precise causal effect of the learned computations.
- As an example of the reward function being changed, consider the case that it is replaced by the "null"-function that doesn't reward anything. Depending on the RL algorithm, this may lead to the agent's policy being frozen and the agent continuing to behave in exactly the same way it did before, without ever receiving any reward anymore.
- As an example of how changes to the policy may change the causal effect of specific computations, consider the case that someone develops a desire to see apples, which early in training may lead to many apples. However, once the policy changes to also value art, the person may combine these desires by buying art that shows apples.
Now, those actions that were useful in steering to the same or a similar type of reinforcement event in the past are bundled together into a set, and that set is then called a shard. For example, an agent may obtain reward for picking up things in the real world. All the computations that steer it toward picking up an object are then part of the "pick up object"-shard. Using the deontology view I explained above, one could also say that all computations belonging to the same "tree" are part of the same shard.
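The "null"-function example above can be checked on a toy case. This is a minimal sketch, assuming plain REINFORCE on a single-state softmax policy without a baseline, entropy bonus, or weight decay (any of which could still move the policy); `null_reward`, `policy_gradient_update`, and the logits are hypothetical names and values chosen for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_update(logits, action, reward, lr=0.1):
    """REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]


def null_reward(state, action):
    """Hypothetical reward function that never rewards anything."""
    return 0.0


trained_logits = [2.0, -1.0, 0.5]  # stands in for a previously trained policy
logits = list(trained_logits)
for step in range(1000):
    action = step % len(logits)  # whichever action is taken, the update is zero
    logits = policy_gradient_update(logits, action, null_reward(None, action))
```

Because every update is multiplied by a reward of zero, the logits never move: under these assumptions the previously learned computations stay in place and keep driving behavior, even though they no longer lead to any reward.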

There is more nuance that one could add to make the definition of shards more complex, but that I left out to not blow it up:

Actually, we should add the words “generalizations of” before “contextually” since the policy involves deep learning and thus generalizes its computations to new contexts;
Additionally, not all convergently reinforced actions may ever have been useful for steering toward reward: in actor-critic policy gradient reinforcement learning, actions are strengthened if they lead to states with a higher evaluation than the state one is coming from. However, there is also generalization happening in the learned value function of the critic. If this generalization is erroneous, then the high predicted value may not correspond to the prospect of achieving a high reward. This can be bad or good depending on whether the approximation errors reflect something true about our actual intentions.
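A minimal sketch of the actor-critic point above, assuming a one-step temporal-difference advantage estimate (one common choice; the posts do not commit to a specific estimator). The state names and value numbers are hypothetical. An action is strengthened when its advantage is positive, and the advantage is computed from the critic's predictions, not from the true prospects of reward.

```python
def td_advantage(reward, value_state, value_next_state, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next_state - value_state


# Hypothetical values: the critic wrongly generalizes a high value to "dead_end".
true_value = {"start": 1.0, "dead_end": 0.0}
critic_value = {"start": 1.0, "dead_end": 3.0}

# The action "start -> dead_end" yields no immediate reward.
adv_under_critic = td_advantage(
    0.0, critic_value["start"], critic_value["dead_end"], gamma=1.0
)
adv_under_truth = td_advantage(
    0.0, true_value["start"], true_value["dead_end"], gamma=1.0
)
```

With these numbers the critic-based advantage is positive while the advantage under the true values is negative, so the policy gradient would strengthen an action that never was useful for steering toward reward.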

My Opinion: (✓)

I fully agree that what is called a "shard" in the claim is the thing that is learned by reinforcement learning agents. Also, at least for humans, this definition and the underlying theory lead to interesting predictions that can be falsified and that explain many behaviors.^[7] However, for a pure policy gradient reinforcement learning setting, the definition seems very broad, and I find it difficult to extract non-obvious, sensible, and falsifiable predictions from it. Indeed, Alex Turner himself notes that calling shard theory a "theory" may have been a mistake. Nevertheless, continuing for a bit with the view that it is a theory, I would note the following issues that I experience at this point:

Sometimes during training, the agent may perform a useless action, and shortly after performs an action leading to reward. The credit assignment algorithm will then likely strengthen the useless action. This strengthened useless action is then (at least temporarily) baked into the agent, but by definition not actually part of a "shard": "usefulness" is part of that definition! This means that an agent has two types of computations: ones that were useful for steering toward reward, and ones that were not.
What falsifiable predictions can I make about computations in trained RL agents or about their training process based on this? It seems like I can't: "shard" is simply a name attached to certain sets of computations in RL agents which don't encompass all computations in these agents. Definitions do not constitute theorems.
Possibly this is beside the point, and making predictions is not what the concept of a "shard" was meant for. Maybe the point is just to shape our thinking in such a way that it leads us to make mechanistic claims about the types of computations that form in reinforcement learning agents. This is a view I fully agree with.
In this sense, shard theory is a frame or paradigm to think in, and might have the side-effect that people make better predictions about the behavior of RL agents than people who mainly think about optimization pressures. This claim seems fair, and I think I have observed at least one person to have wildly better alignment takes after reading reward is not the optimization target.
Someone could possibly argue that the definition of shards does in fact imply predictions: namely that the actions preceding a reinforcement event, and the actions preceding entering states with a high evaluation from a critic, will be more likely after the next gradient step; and that one needs to make mechanistic claims when arguing that any other type of action or thought process will convergently be strengthened in an RL training process. I fully agree with this; it just feels somewhat tautological to me. However, if this is not intuitive/clear to you, then from your point of view, shard theory does make interesting falsifiable predictions, and I think you should engage with shard theory immediately.

What follows are further claims about the nature of shards that are sometimes made. They are both more speculative (and often non-formalized) and make more predictions, meaning that they are on the way to developing the shard paradigm into an actual theory.

Shards in Humans and AI Systems are Sparsely Activated Decision Influences

The Claim

Each computation that was reinforced in the training process of a reinforcement learning agent (including humans) was activated in response to a very specific context. Only if the context of a new situation is relevantly similar to this old context will this computation be activated again. This means that shards are largely sparsely activated; they are “inactive” most of the time.

An agent may encounter a situation that shares similarities with several situations that activate computations. These computations are largely independent and do not interfere with each other, meaning that they all fully execute to produce large log probabilities of actions they find “desirable”.^[8] In the last layer, these are all added together and then softmaxed, leading to the final action probabilities.

The shards are thus sets of computations that “bid for actions”, and thus can be thought of as decision influences. This viewpoint is explained in more detail in this ML interpretation of shard theory.
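The "bidding" picture can be sketched as follows, assuming each shard contributes one additive logit per action and that the contributions are summed before a softmax, as described above. The shard names, actions, and numbers are hypothetical, and real policy networks need not decompose this cleanly.

```python
import math


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_shard_bids(shard_bids):
    """Sum the per-action logit contributions of all shards, then softmax."""
    n_actions = len(next(iter(shard_bids.values())))
    total = [0.0] * n_actions
    for bid in shard_bids.values():
        for k, contribution in enumerate(bid):
            total[k] += contribution
    return softmax(total)


# Hypothetical shards and actions, purely for illustration.
actions = ["eat_donut", "go_running", "keep_working"]
shard_bids = {
    "donut_shard": [3.0, 0.0, 0.0],
    "fitness_shard": [-1.0, 2.0, 0.0],
    "inactive_shard": [0.0, 0.0, 0.0],  # context does not match: no influence
}
probs = combine_shard_bids(shard_bids)
```

In this sketch a sparsely activated shard corresponds to a row of zeros, which leaves the action probabilities unchanged, while active shards shift probability toward the actions they bid for.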

My Opinion: ?

I don’t know. Instead of speculating further whether this story is true, it seems very worthwhile to do hands-on mechanistic interpretability research to understand the inner workings of policy networks. In the human case, neuroscience may already have answers on how true this story is. If so, I’m mostly ignorant of that.

Shards in humans and future AI systems use a world model to collectively pursue coherent plans

The Claim

This claim is argued for in the shard theory of human values and is a current working model that makes the appearance of shards more elaborate, while still not being fully formalized.

The claim builds on the preceding claim by postulating the "shape" that shards take in humans and future AI systems that contain world models. World models can make complex predictions about the future world state. The context that activates shards may then include predictions about how the world will change in response to specific actions. A shard is then thought of as “bidding for actions/plans” that the world model predicts to lead to states that the shard "likes".^[9]

Additionally, inconsistencies in these plans are over time removed by reinforcement learning: it is disadvantageous for all shards if they collectively lead to repeated plan changes. The plans thus become more consistent over time and are endorsed by many of the active shards.
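One very simplified way to picture shards that bid on plans is sketched below. It assumes a lookup-table "world model" that maps each plan to a predicted state, shards that score predicted states, and a final choice by largest summed bid. All of these are illustrative assumptions: the names are hypothetical, and a sampled (softmax-style) choice would be just as consistent with the claim as the argmax used here.

```python
def choose_plan(plans, world_model, shards):
    """Return the plan whose predicted outcome gets the largest summed bid."""

    def total_bid(plan):
        predicted_state = world_model[plan]
        return sum(bid(predicted_state) for bid in shards.values())

    return max(plans, key=total_bid)


# Hypothetical world model: maps each plan to a predicted future state.
world_model = {
    "bake_cake": {"sugar": 5, "exercise": 0},
    "go_hiking": {"sugar": 1, "exercise": 4},
    "hike_then_cafe": {"sugar": 3, "exercise": 4},
}

# Hypothetical shards: each scores how much it "likes" a predicted state.
shards = {
    "sugar_shard": lambda state: 1.0 * state["sugar"],
    "fitness_shard": lambda state: 1.5 * state["exercise"],
}

chosen = choose_plan(list(world_model), world_model, shards)
```

With these numbers the chosen plan is `"hike_then_cafe"`, the one that both shards partially endorse, which mirrors the idea that plans endorsed by many active shards win out.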

My Opinion: ?

While the headline claim may seem like common sense, the details are quite intricate. It is also considerably less formalized than the claims before. For evaluating it, it seems important to extract more concrete falsifiable predictions; I’m not at a point where I can easily mechanistically model what is going on in the story above. Some comments:

Whether this story is accurate for humans, or at least a good model, can probably be answered both introspectively (with the caveat of it being hard to communicate and there being no ground truth) and using neuroscience. On the neuroscience front, the brain-like AGI safety sequence seems to lend support to some version of this claim.
This claim comes close to the speculations on consequentialism I presented above and additionally includes competition or cooperation between different drives.
Whether the story is accurate in RL agents probably depends on the training setup. I could imagine that implicitly model-based systems like EfficientZero come closer to the description than simple policy gradient RL agents.

Late training stages might involve a shard reflection process leading to a coherent utility function

The Claim

This claim was made in Alex’s proposal for the diamond-alignment problem. Different shards in a pre-advanced agent may want different things, and while the reinforcement learning process broadly favors consistency of the pursued plans, the different shards are still sometimes competing against each other. At some point, the agent becomes reflective and can think about its own shard activation processes, and some of the shards become aware of the inconsistencies in the planning procedure. Assuming that many of the shards can also model each other quite well, they may be interested in cooperating with competing shards in order to more efficiently reach goals. Eventually, this is thought to lead to a values handshake in which the different shards agree to tile the future according to the respective chances of winning in an actively adversarial competition among each other. As a result, the AI acts as a coherent utility maximizer.

My Opinion: ?

It definitely seems to me that humans sometimes engage in reflection processes that are getting somewhat close to a values handshake: Some people actively self-improve to become more focused, and much of that involves taking the needs of different parts of one's mind into account to design a life that is collectively endorsed while being highly efficient.

On the other hand, this claim is quite far ahead of the progress in formalizing shard theory and empirical evidence about AI systems. Maybe one should first form concrete models and falsifiable predictions of shard theory on today’s systems before engaging further in modeling shard theory’s consequences for the late training stages of a superintelligence.

Further Claims

Shard theory claims a few more things that did not make the “final cut” of this distillation. Instead of either leaving them out entirely or procrastinating indefinitely on finishing this article, I decided to just mention them briefly here:

To avoid value drift, the agent can try to cause desirable shards to be active during and before reinforcement events.
Thinking in the shard frame provides an outline for how to solve the diamond alignment problem.
Agents will “competently generalize in multiple ways”.
The term “shard” makes sense.
Sometimes, mathematically simple reward functions can reinforce complex, desired behavior.

Conclusion

In this post, I have explained what I consider the core claims of shard theory. I tried to disentangle them in such a way that they can largely be evaluated independently of each other. It turns out that many claims do not involve shards at all, while still meaningfully differing from conventional thought processes on AI alignment.

The claims on shards themselves, which I outlined in the second part of this post, progressively build on top of each other and form a tower of predictions about the inner workings of humans, (policy gradient) RL agents, and future superintelligences. The later claims are strong insofar as they paint a more speculative picture of the inner workings of RL agents that engage in world modeling. These claims are also less concrete and thus currently harder to falsify. I hope that empirical evidence and further formalizations can work together to refine this picture.

^{^}
Note that “trying a lot of things” is not the same as “searching with a universal quantifier over trajectories”. The latter would definitely find wireheading as a solution to the optimization process but is entirely unrealistic as a training process for advanced AI.
^{^}
Note that "maximizing that reward function" should maybe be interpreted as "maximizing the thing that is rewarded by the reward function". For example, Alex Turner writes the following:
Evan privately provided another definition which better accounts for the way he currently considers the problem of outer+inner alignment:
"A model that has the same goal that the loss/reward function describes. So if the loss function rewards agents for getting gold coins, then the training goal is an agent that terminally cares about gold coins."
I must admit that this distinction is seriously confusing to me since it seems to “assume away” the possibility of wireheading, or more generally of the “embeddedness” of the reward function in the real world. If one uses such a definition, then the problem seems to decompose into three problems instead of only two:
- Make sure the reward function is outer aligned (i.e., it rewards good goals).
- Make sure that the training setup is designed in such a way that the agent cannot maximize the reward function in a different way than by achieving “what it rewards”.
- Make sure the agent maximizes reward.
^{^}
Potentially, mechanistic interpretability could one day allow us to shape the goals of RL agents in more direct ways, for example by Retargeting the Search.
^{^}
See also Thomas Larsen’s comment for further clarification.
^{^}
Epistemic Status: I looked briefly into MuZero to roughly understand the equations in there. EfficientZero is said to build on MuZero in some way, and while I don’t know the details, I would guess that means that it involves an internal value or reward prediction. This may be wrong.
^{^}
This is not a critique of shard theory scholars. I could ask Alex or Quintin what they actually think, but I’m too impatient and would rather finish this article sooner.
^{^}
The reason why the definition becomes more “claim-shaped” for humans compared to policy-gradient RL agents is that humans are not, in fact, known for certain to be subject to (roughly) the same learning dynamics.
^{^}
In the case of humans, we should probably not think of log-probabilities but of “drives” toward certain actions.
^{^}
By states that the shard "likes" I simply mean the states defining the type of reinforcement event that all the computations in the shard steer toward.

Comment by Steven Byrnes:

RL training processes create actors, not graders …
For example, in the motivation framework in the brain-like AGI safety sequence, the thought generator produces thoughts that are explicitly chosen so as to maximize the reward prediction error.

I think you maybe have some confusions along the lines I was discussing here:

I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:
(A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
(B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.

[I wasn’t sure when I first wrote that comment, but Alex Turner clarified that he was exclusively talking about (B) not (A) when he said “Don’t align agents to evaluations of plans” and such.]

The brain algorithm fits (A) (or so I claim), but that’s compatible with either (B) or (not-B), depending on what happens during training etc.
