I use the shard theory of human values to clarify what value drift is, how it happens, and how it might be avoided by a highly intelligent agent—even if that agent doesn't have any control over its future experiences. Along the way, I give a shard theory account of rationalization.
Defining "value drift"
Recapitulating part of shard theory. Reward is that which reinforces. Considering the case of reinforcement learning in humans, reward causes your brain’s credit assignment algorithms to reinforce the actions and thoughts which led to that reward, making those actions and thoughts more likely to be selected in the future.
For example, suppose you recognize a lollipop, and move to pick it up, and then lick the lollipop. Since the lollipop produces reward, these thoughts will be reinforced and you will be more likely to act similarly in such situations in the future. You become more of the kind of person who will move to pick up a lollipop when you recognize lollipops, and who will navigate to lollipop-containing locations to begin with.
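This reinforcement story can be caricatured in a few lines of code. This is a toy sketch under illustrative assumptions (the contexts, actions, reward size, and learning rate are all made up), not shard theory's actual mechanism:

```python
import random

# Toy model: reward upweights the (context, action) pairs that preceded it,
# making them more likely to be selected in that context later.
propensity = {}  # (context, action) -> selection weight

def act(context, actions):
    """Sample an action in proportion to its learned propensity."""
    weights = [propensity.get((context, a), 1.0) for a in actions]
    return random.choices(actions, weights=weights)[0]

def reinforce(trajectory, reward, lr=0.5):
    """Crude credit assignment: upweight every step that led to the reward."""
    for context, action in trajectory:
        key = (context, action)
        propensity[key] = propensity.get(key, 1.0) + lr * reward

# Licking the lollipop produces reward, so both the approach and the lick
# become more likely in lollipop-containing situations.
reinforce([("see lollipop", "move to pick up"),
           ("holding lollipop", "lick")], reward=10)
```

After the reinforcement event, `act("see lollipop", ...)` is biased towards "move to pick up": you have become more of the kind of agent that approaches lollipops.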
With that in mind, I think that shard theory offers a straightforward definition of "value drift":
Definition. Value drift occurs when reinforcement events substantially change the internal "balance of power" among the shards activated in everyday situations.
For example, consider the classic example of taking a pill which makes you enjoy killing people. Under shard theory, this change would be implemented as a murder-shard which activates in a wide range of contexts, steers planning towards murder, and thereby changes your decision-making substantially.
But it's better to try to explain phenomena which, you know, are known to actually happen in real life. Another simple example of value drift is when someone snorts cocaine. At a (substantial) gloss, the huge hit of reward extremely strongly upweights the decision to do cocaine; the strength of the reward leads to an unusually strong cocaine-shard which activates in an unusually wide range of situations.
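The definition above can be illustrated with another toy sketch: shard strengths as bare weights, where an outsized reinforcement event rebalances the internal "balance of power." The shard names, numbers, and update rule are purely illustrative assumptions:

```python
# Toy illustration of value drift as a rebalancing of shard strengths.
shard_strength = {"friendship": 5.0, "career": 4.0, "cocaine": 0.0}

def dominant_shard():
    """Which shard currently has the most decision-making influence?"""
    return max(shard_strength, key=shard_strength.get)

def reinforcement_event(shard, reward, lr=0.1):
    shard_strength[shard] += lr * reward

reinforcement_event("friendship", reward=3)   # ordinary reward: a small nudge
print(dominant_shard())                       # still "friendship"

reinforcement_event("cocaine", reward=100)    # huge hit of reward
print(dominant_shard())                       # now "cocaine": the balance has drifted
```

The everyday reinforcement events barely move the balance; the single huge reward produces an unusually strong shard, which is this post's definition of value drift.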
Here's a more complicated example of value drift. I'll give one possible mechanistic story for the "value drift" which can occur when an atheist (Alice) dates a religious person (Rick), and why that situation might predictably lead to Alice converting or Rick deconverting. I'll consider a scenario where Alice converts.
First, reinforcement events cause Alice to develop shards of value around making Rick happy and making Rick like her. Alice's new shards (non-introspectively-apparently) query her world model for plans which make Rick happier and which make Rick like her more. Obviously, if Alice converted, they would have more in common, and Rick would be happy. Since these plans lead to Rick being happy and liking Alice more, these shards bid for those plans.
Only, the plan is not bid for directly in an introspectively obvious manner. That would provoke opposition from Alice's other values (which oppose deliberately changing her religious status just to make Rick happy). Alice's self-model predicts this opposition, and so her Rick-happiness- and Rick-approval-shards don't bid for the "direct" conversion plan, because it isn't predicted to work (and therefore won't lead to a future where Rick is happier and approves of Alice more). No, instead, these two shards rationalize internally-observable reasons why Alice should start going to Rick's church: "it's respectful", "church is interesting", "if I notice myself being persuaded I can just leave", "I'll get to spend more time with Rick."
Here, then, is the account:
- Alice's Rick-shards query her world model for plans which lead to Rick being happier and liking Alice more,
- so her world model returns a plan where she converts and goes to church with Rick;
- for this plan to be adopted, its purpose must be hidden so that other shards do not bid against it,
- so the church-plan is pitched via "rationalizations" which are optimized to win over the rest of Alice's shard economy,
- so that she actually decides to implement the church-going plan,
- so that she gets positive reinforcement for going to church,
- so that she grows a religion-shard,
- (this is where the value drift happens, since her internal shard balance significantly changes!)
- so that she converts,
- so that Rick ends up happier and liking Alice more.
Her Rick-shards plan to induce value drift, and optimize the plan to make sure that it's hard for her other shards to realize the implicitly-planned outcome (Alice converting) and bid against it. This is one kind of decision-making algorithm which rationalizes against itself.
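The obfuscated-bid dynamic can be caricatured as follows. In this sketch, the shard economy votes only on a plan's surface pitch, while the implicitly-planned outcome stays hidden; the shards, scores, and plan structures are all illustrative assumptions:

```python
# Caricature of rationalization: shards evaluate a plan's pitch, not its
# implicitly-planned outcome.
def shard_economy_approves(plan, shards):
    """Each shard scores only what it can observe: the plan's pitch."""
    return sum(shard(plan["pitch"]) for shard in shards) > 0

def integrity_shard(pitch):
    return -10 if "convert" in pitch else 0  # opposes explicit conversion plans

def curiosity_shard(pitch):
    return 2 if "interesting" in pitch else 0

shards = [integrity_shard, curiosity_shard]

# Two plans with the same implicitly-planned outcome, pitched differently.
direct_plan = {"pitch": "convert to make Rick happy", "outcome": "conversion"}
obfuscated_plan = {"pitch": "church is interesting", "outcome": "conversion"}

print(shard_economy_approves(direct_plan, shards))      # rejected
print(shard_economy_approves(obfuscated_plan, shards))  # approved
```

The direct plan provokes opposition and loses the vote; the rationalized pitch, optimized against the other shards' observables, wins, even though both plans lead to the same outcome.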
Under shard theory, rationality is sometimes hard because "conscious-you" has to actually fight deception by other parts of yourself.
One simple trick for avoiding value drift
Imagine you’ve been kidnapped by an evil, mustache-twirling villain who wants to corrupt your value system. They tie you to a chair and prepare to stimulate your reward circuitry. They want to ruin your current values by making you into an addict and a wireheader.
Exercise: How do you come out of the experience with your values intact?
In principle, the answer is simple. You pretend, sufficiently convincingly, that you're experiencing a situation congruent with your endorsed values, so that your brain's credit assignment algorithm reinforces your pretend-actions when the brain-stimulation reward occurs!
Consider that the brain does not directly observe the outside world. The outside world’s influence on your thinking is screened off by the state of your brain. The state of the brain constitutes the mental context. If you want to determine the output of a brain circuit, the mental context screens off the state of the world. In particular, this applies to the value updating process by which you become more or less likely to invoke certain bundles of heuristics (“value shards”) in certain mental contexts.
For example, suppose you lick a red lollipop, but that produces a large negative reward (maybe it was treated with awful-tasting chemicals). Mental context: “It’s Tuesday. I am in a room with a red lollipop. It looks good. I’m going to lick it. I think it will be good.” The negative reward reshapes your cognition, making you less likely to think similar thoughts and take similar actions in similar future situations.
Of the thoughts which were thunk before the negative reward, the credit assignment algorithm somehow identifies the relevant thoughts to include “It looks good”, “I’m going to lick it”, “I think it will be good”, and the various motor commands. You become less likely to think these thoughts in the future. In summary, the reason you become less likely to think these thoughts is that you thought them while executing the plan which produced negative reward, and credit assignment identified them as relevant to that result.
Credit assignment cannot and will not penalize thoughts which do not get thunk at all, or which it deems “not relevant” to the result at hand. Therefore, in principle, you could just pretend really hard that you’re in a mental context where you save a puppy’s life. When the electrically stimulated reward hits, the altruism-circuits get reinforced in the imagined mental context. You become more altruistic overall.
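The key claim (credit assignment sees only the represented mental context, not the true external cause of reward) can be made concrete with one more toy sketch. Everything here is an illustrative assumption, not a model of the brain:

```python
# Toy model: credit assignment is a function of the *represented* mental
# context and the reward. Crucially, it has no access to external reality.
shard_strength = {"altruism": 1.0, "wireheading": 1.0}

def credit_assignment(represented_context, reward, lr=0.1):
    """Reinforce whichever shards the mental context implicates as relevant."""
    for shard in represented_context["active_shards"]:
        shard_strength[shard] += lr * reward

# External reality: an electrode fires in the reward center. But the agent
# is pretending really hard that it just saved a puppy.
pretend_context = {"situation": "saving a puppy", "active_shards": ["altruism"]}
credit_assignment(pretend_context, reward=50)

print(shard_strength)  # altruism reinforced; wireheading untouched
```

Since the electrode never appears in the mental context, no wireheading-shard gets reinforced; the reward lands on the altruism-circuits instead.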
Of course, you have to actually dupe the credit assignment algorithm into ignoring the latent “true” mental context. But your credit assignment is not infinitely clever. And if it were, well, you could (in principle) add an edge-case for situations like this. So there is, in principle, a way to do it.
Therefore, your values can always be safe in your own mind, if you’re clever, foresightful, and have enough write access to fool credit assignment. Even if you don’t have control over your own future observations.
If this point still does not seem obvious, consider a scenario where you are blindfolded and made to believe that you are about to taste a lollipop. Then your captors fake the texture and smell and feel of a lollipop in your mouth, while directly stimulating your taste buds in the same way the lollipop would have. They remove the apparatus, and you go home. Do you think you have become reshaped to value electrical stimulation of your tongue? No. That is impossible, since your brain has no idea what actually happened. Credit assignment responds to reward depending on the mental context, not on the external situation.
Misunderstanding this point can lead to confusion. If you have a wire stuck in your brain’s reward center, surely that reward reinforces having a wire stuck in your brain! Usually so, but not logically so. Your brain can only reward based on its cognitive context, based on the thoughts it actually thought which it identifies as relevant to the achievement of the reward. Your brain is not directly peering out at reality and making you more likely to enter that state in the future.
Conclusion

Value drift occurs when your values shift. In shard theory, this means that your internal decision-making influences (i.e. shards) are rebalanced by reinforcement events. For example, if you try cocaine, that causes your brain's credit assignment to strongly upweight decision-making which uses cocaine and which pursues rewarding activities.
Value drift is caused by credit assignment. Credit assignment can only depend on its observable mental context; it can't directly peer out at the world to objectively figure out what caused the reward event. Therefore, you can (in theory) avoid value drift by tricking credit assignment into thinking that the reward was caused by a decision to e.g. save a puppy's life. In that case, credit assignment would reinforce your altruism-shard. While humans probably can't dupe their own credit assignment algorithm to this extent, AIs can probably add edge cases to their own updating processes. But knowing how value drift works (on this theory, via "unendorsed" reinforcement events) seems practically helpful for avoiding and navigating value-risky situations, like gaining lots of power or money.
Thanks to Justis Mills for proofreading.
These credit assignment algorithms may be hardcoded and/or learned.
I feel confused about how, mechanistically, other shards wouldn't fully notice the proto-deceptive plan being evaluated by the self-model, but presently think this "partial obfuscation" happens in shard dynamics for human beings. I think the other shards do somewhat observe the proto-deception, and this is why good rationalists can learn to rationalize less.
In The shard theory of human values, we defined the "mental context" of a circuit to be the inputs to that circuit which determine whether it fires or not. Here, I use "mental context" to also refer to the state of the entire brain, without considering a specific circuit. I think both meanings are appropriate and expect the meaning will be clear from the context.
"Credit assignment penalizes thoughts" seems like a reasonable frame to me, but I'm flagging that this could misrepresent the mechanistic story of human cognition in some unknown-to-me way.