Understanding and avoiding value drift

TurnTrout

I use the shard theory of human values to clarify what value drift is, how it happens, and how it might be avoided by a highly intelligent agent—even if that agent doesn't have any control over its future experiences. Along the way, I give a shard theory account of rationalization.

Defining "value drift"

Recapitulating part of shard theory. Reward is that which reinforces. Considering the case of reinforcement learning in humans, reward causes your brain’s credit assignment algorithms^[1] to reinforce the actions and thoughts which led to that reward, making those actions and thoughts more likely to be selected in the future.

For example, suppose you recognize a lollipop, and move to pick it up, and then lick the lollipop. Since the lollipop produces reward, these thoughts will be reinforced and you will be more likely to act similarly in such situations in the future. You become more of the kind of person who will move to pick up a lollipop when you recognize lollipops, and who will navigate to lollipop-containing locations to begin with.

With that in mind, I think that shard theory offers a straightforward definition of "value drift":

Definition. Value drift occurs when reinforcement events substantially change the internal "balance of power" among the shards activated in everyday situations.

For example, consider the classic "example" of taking a pill which makes you enjoy killing people. Under shard theory, this change would be implemented as a murder-shard that activates in a wide range of contexts in order to steer planning towards murder, and therefore starts steering your decision-making substantially differently.

But it's better to try to explain phenomena which, you know, are known to actually happen in real life. Another simple example of value drift is when someone snorts cocaine. At a (substantial) gloss, the huge hit of reward extremely strongly upweights the decision to do cocaine; the strength of the reward leads to an unusually strong cocaine-shard which activates in an unusually wide range of situations.
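To make "balance of power" slightly more concrete, here is a toy sketch, not a claim about how brains actually work. It assumes, purely for illustration, that each shard has a scalar strength, that influence is a softmax over those strengths, and that drift can be measured as the total variation distance between the balance before and after a reinforcement event. The shard names, the numbers, and the functions `balance` and `drift` are all hypothetical.

```python
import math

def balance(strengths):
    """Softmax over shard strengths: each shard's share of total influence."""
    m = max(strengths.values())
    exps = {name: math.exp(s - m) for name, s in strengths.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def drift(before, after):
    """Total variation distance between two balances (0 = no change, 1 = maximal)."""
    return 0.5 * sum(abs(before[name] - after[name]) for name in before)

# Hypothetical starting strengths (illustrative numbers only).
strengths = {"friends": 1.0, "work": 1.0, "cocaine": -2.0}

# An ordinary reinforcement event nudges one shard a little.
ordinary = dict(strengths, work=1.2)
# An outsized reward massively upweights one shard.
outsized = dict(strengths, cocaine=4.0)

b0 = balance(strengths)
small_drift = drift(b0, balance(ordinary))
large_drift = drift(b0, balance(outsized))
print(f"ordinary event: {small_drift:.3f}, outsized reward: {large_drift:.3f}")
```

In this toy model, the ordinary event barely moves the balance, while the outsized reward hands most of the influence to a single shard. That second case is what the definition above would count as value drift.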

Here's a more complicated example of value drift. I'll give one possible mechanistic story for the "value drift" which occurs to an atheist (Alice) dating a religious person (Rick), and why that situation might predictably lead to Alice converting or Rick deconverting. I'll consider a scenario where Alice converts.

First, reinforcement events cause Alice to develop shards of value around making Rick happy and making Rick like her. Alice's new shards (non-introspectively-apparently) query her world model for plans which make Rick happier and which make Rick like her more. Obviously, if Alice converted, they would have more in common, and Rick would be happy. Since these plans lead to Rick being happy and liking Alice more, these shards bid for those plans.

Only, the plan is not bid for directly in an introspectively obvious manner. That would provoke opposition from Alice's other values (which oppose deliberately changing her religious status just to make Rick happy). Alice's self-model predicts this opposition, and so her Rick-happiness- and Rick-approval-shards don't bid for the "direct" conversion plan, because it isn't predicted to work (and therefore won't lead to a future where Rick is happier and approves of Alice more). No, instead, these two shards rationalize internally-observable reasons why Alice should start going to Rick's church: "it's respectful", "church is interesting", "if I notice myself being persuaded I can just leave", "I'll get to spend more time with Rick."^[2]

Here, then, is the account:

1. Alice's Rick-shards query her world model for plans which lead to Rick being happier and liking Alice more,
2. so her world model returns a plan where she converts and goes to church with Rick;
3. In order to do this, the plan's purpose must be hidden so that other shards do not bid against the plan,
4. so this church-plan is pitched via "rationalizations" which are optimized to win over the rest of Alice's shard economy,
5. so that she actually decides to implement the church-going plan,
6. so that she gets positive reinforcement for going to church,
7. so that she grows a religion-shard,
   (This is where the value drift happens, since her internal shard balance significantly changes!)
8. so that she converts,
9. so that Rick ends up happier and liking Alice more.

Her Rick-shards plan to induce value drift, and optimize the plan to make sure that it's hard for her other shards to realize the implicitly-planned outcome (Alice converting) and bid against it. This is one kind of decision-making algorithm which rationalizes against itself.
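The account above can be caricatured in a few lines of code. This is a minimal sketch under strong assumptions: shards are modeled as scoring functions over plans, the Rick-shard scores the outcome the world model predicts, and the other shards only react to what is apparent in the pitch. The `Plan` fields, the shard functions, and every number are hypothetical and chosen only to illustrate the dynamic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    pitch: str
    apparent_effects: frozenset   # what the rest of the shard economy "sees"
    predicted_effects: frozenset  # what the world model actually predicts

def rick_shard(plan):
    # Bids on the predicted outcome, whatever the pitch says.
    return 2.0 if "rick_happier" in plan.predicted_effects else 0.0

def identity_shard(plan):
    # Opposes conversion, but only when conversion is apparent.
    return -3.0 if "alice_converts" in plan.apparent_effects else 0.0

def curiosity_shard(plan):
    return 0.5 if "novel_experience" in plan.apparent_effects else 0.0

SHARDS = [rick_shard, identity_shard, curiosity_shard]

def total_bid(plan):
    return sum(shard(plan) for shard in SHARDS)

outcome = frozenset({"alice_converts", "rick_happier"})
direct = Plan("convert to make Rick happy", outcome, outcome)
framed = Plan(
    "go to church: it's respectful and interesting",
    frozenset({"novel_experience", "time_with_rick"}),
    outcome,
)
stay_home = Plan("stay home", frozenset(), frozenset())

chosen = max([direct, framed, stay_home], key=total_bid)
print(chosen.pitch)
```

The direct plan loses to staying home because the identity shard bids against it. The framed plan wins even though the world model predicts the same outcome, because the opposition never fires. Whether real shards can be kept in the dark this way is exactly the uncertainty flagged in footnote 2.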

Under shard theory, rationality is sometimes hard because "conscious-you" has to actually fight deception by other parts of yourself.

One simple trick for avoiding value drift

Imagine you’ve been kidnapped by an evil, mustache-twirling villain who wants to corrupt your value system. They tie you to a chair and prepare to stimulate your reward circuitry. They want to ruin your current values by making you into an addict and a wireheader.

Exercise: How do you come out of the experience with your values intact?

In principle, the answer is simple. You just convince yourself you’re experiencing a situation congruent with your endorsed values, in a sufficiently convincing way that your brain’s credit assignment algorithm reinforces your pretend-actions when the brain stimulation reward occurs!

Consider that the brain does not directly observe the outside world. The outside world’s influence on your thinking is screened off by the state of your brain. The state of the brain constitutes the mental context. If you want to determine the output of a brain circuit, the mental context^[3] screens off the state of the world. In particular, this applies to the value updating process by which you become more or less likely to invoke certain bundles of heuristics (“value shards”) in certain mental contexts.

For example, suppose you lick a red lollipop, but that produces a large negative reward (maybe it was treated with awful-tasting chemicals). Mental context: “It’s Tuesday. I am in a room with a red lollipop. It looks good. I’m going to lick it. I think it will be good.” The negative reward reshapes your cognition, making you less likely to think similar thoughts and take similar actions in similar future situations.

Of the thoughts which were thunk before the negative reward, the credit assignment algorithm somehow identifies the relevant thoughts to include “It looks good”, “I’m going to lick it”, “I think it will be good”, and the various motor commands. You become less likely to think these thoughts in the future. In summary, the reason you become less likely to think these thoughts is that you thought them while executing the plan which produced negative reward, and credit assignment identified them as relevant to that result.

Credit assignment cannot and will not penalize thoughts^[4] which do not get thunk at all, or which it deems “not relevant” to the result at hand. Therefore, in principle, you could just pretend really hard that you’re in a mental context where you save a puppy’s life. When the electrically stimulated reward hits, the altruism-circuits get reinforced in the imagined mental context. You become more altruistic overall.
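Here is a toy sketch of that argument, assuming (for illustration only) that credit assignment is a function of the thoughts present in the mental context plus a relevance judgment, and nothing else. The function `credit_assignment`, the shard names, and the learning rate are hypothetical. Note that the external situation is deliberately not an input.

```python
def credit_assignment(strengths, mental_context, relevant, reward, lr=0.5):
    """Update only shards whose thoughts were thunk AND were deemed relevant.

    The external world is not a parameter: the update can only depend on
    the mental context.
    """
    updated = dict(strengths)
    for shard in mental_context:
        if shard in relevant:
            updated[shard] = updated.get(shard, 0.0) + lr * reward
    return updated

strengths = {"altruism": 1.0, "wireheading": 0.0, "tuesday-noticing": 0.0}

# What is "really" happening. Nothing below ever reads this variable.
external_situation = "tied to a chair, reward circuitry being stimulated"

# Case 1: you attend to the stimulation itself.
naive = credit_assignment(
    strengths,
    mental_context={"wireheading", "tuesday-noticing"},
    relevant={"wireheading"},
    reward=10.0,
)

# Case 2: you pretend really hard that you are saving a puppy.
pretending = credit_assignment(
    strengths,
    mental_context={"altruism", "tuesday-noticing"},
    relevant={"altruism"},
    reward=10.0,
)

print(naive)
print(pretending)
```

The same reward arrives in both cases. Which shard it strengthens depends only on which thoughts were active and judged relevant, so in the second case the wireheading shard is untouched. The irrelevant "it's Tuesday" thought is never updated in either case.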

Of course, you have to actually dupe the credit assignment algorithm into ignoring the latent “true” mental context. But your credit assignment is not infinitely clever. And if it were, well, you could (in principle) add an edge-case for situations like this. So there is, in principle, a way to do it.

Therefore, your values can always be safe in your own mind, if you’re clever, foresightful, and have enough write access to fool credit assignment. Even if you don’t have control over your own future observations.

If this point still does not seem obvious, consider a scenario where you are blindfolded, and made to believe that you are about to taste a lollipop. Then, your captors fake the texture and smell and feel of a lollipop in your mouth, while directly stimulating your taste buds in the same way the lollipop would have. They remove the apparatus, and you go home. Do you think you have become reshaped to value electrical stimulation of your tongue? No. That is impossible, since your brain has no idea about what actually happened. Credit assignment responds to reward depending on the mental context, not on the external situation.

Misunderstanding this point can lead to confusion. If you have a wire stuck in your brain’s reward center, surely that reward reinforces having a wire stuck in your brain! Usually so, but not logically so. Your brain can only reward based on its cognitive context, based on the thoughts it actually thought which it identifies as relevant to the achievement of the reward. Your brain is not directly peering out at reality and making you more likely to enter that state in the future.

Conclusion

Value drift occurs when your values shift. In shard theory, this means that your internal decision-making influences (i.e. shards) are rebalanced by reinforcement events. For example, if you try cocaine, that causes your brain's credit assignment to strongly upweight decision-making which uses cocaine and which pursues rewarding activities.

Value drift is caused by credit assignment. Credit assignment can only depend on its observable mental context, and can't directly peer out at the world to objectively figure out what caused the reward event. Therefore, you can (in theory) avoid value drift by tricking credit assignment into thinking that the reward was caused by a decision to e.g. save a puppy's life. In that case, credit assignment would reinforce your altruism-shard. While humans probably can't dupe their own credit assignment algorithm to this extent, AIs can probably add edge cases to their own updating process. But knowing how value drift works (on this theory, via "unendorsed" reinforcement events) seems practically helpful for avoiding or navigating value-risky situations, like gaining lots of power or money.

Thanks to Justis Mills for proofreading.

^[1] These credit assignment algorithms may be hardcoded and/or learned.
^[2] I feel confused about how, mechanistically, other shards wouldn't fully notice the proto-deceptive plan being evaluated by the self-model, but presently think this "partial obfuscation" happens in shard dynamics for human beings. I think the other shards do somewhat observe the proto-deception, and this is why good rationalists can learn to rationalize less.
^[3] In The shard theory of human values, we defined the "mental context" of a circuit to be the inputs to that circuit which determine whether it fires or not. Here, I use "mental context" to also refer to the state of the entire brain, without considering a specific circuit. I think both meanings are appropriate and expect the meaning will be clear from the context.
^[4] "Credit assignment penalizes thoughts" seems like a reasonable frame to me, but I'm flagging that this could misrepresent the mechanistic story of human cognition in some unknown-to-me way.

This was a cool post, I found the core point interesting. Very similar to gradient hacker design.

As a general approach to avoiding value drift, it does have a couple very big issues (which I'm guessing TurnTrout already understands, but which I'll point out for others). First very big issue: it requires the agent basically decouple its cognition from reality when the relevant reward is applied. That's only useful if the value-drift-inducing events only occur once in a while and are very predictable. If value drift just occurs continuously due to everyday interactions, or if it occurs unpredictably, then the strategy probably can't be implemented without making the agent useless.

Second big issue: it only applies to reward-induced value drift within an RL system. That's not the only setting in which value drift is an issue - for instance, MIRI's work on value drift focused mainly on parent-child value drift in chains of successor AIs. Value drift induced by gradual ontology shifts is another example.

As a general approach to avoiding value drift

One interpretation of this phrase is that we want AI to generally avoid value drift -- to get good values in the AI, and then leave it. (This probably isn't what you meant, but I'll leave a comment for other readers!) For AI and for humans, value drift need not be bad. In the human case, going to anger management can be humanely-good value drift. And human-aligned shards of a seed AI can deliberately steer into more situations where the AI gets rewarded while helping people, in order to reinforce the human-aligned coalitional weight.

I am confused by the part where the Rick-shard can anticipate which plan the other shards will bid for. If I understood shard theory correctly, shards do not have their own world model; they can just bid actions up or down according to the consequences those actions might have under the world model that is available to all shards. Please correct me if I am wrong about this point.

So I don’t see how the Rick-shard could really "trick" the atheism-shard via rationalisation.

If the Rick-shard sees that "church-going for respect reasons" will lead to conversion, then the atheism-shard has to see that too, because they query the same world model. So the atheism-shard should bid against that plan just as heavily as against "going to church for conversion reasons".

I think there is something else going on here. I think the Rick-shard does not trick the atheism-shard, but rather the conscious part that is not described by shard theory.

I think your comment highlights an important uncertainty of mine. Here's my best guess:

I think planning involves world-model invocations (ie the predictive machinery which predicts relevant observables for plan stubs, like "get in my car"). It seems to me that there is subconscious planning, to some degree. If true, you wouldn't notice the world-model being invoked because it's sub-conscious. Insofar as "you" are in part composed of some set of shards or some algorithm which aggregates shard outputs, it's therefore true that the world-model invocations aren't globally visible. Therefore, it's possible for certain kinds of WM invocations to not be visible to certain shards, even though those shards usually "hook into the WM" (eg check # of diamonds the plan leads to).

Separately, I'd guess that shards can be shaped to invoke the world model (e.g. "if this plan gets considered, will it be executed?") without themselves being agents.

I don't think that shards are distinct - neither physically nor logically, so they can't hide stuff in the sense of keeping it out of view of the other shards.

Also, I don't think "querying for plans" is a good summary of what goes on in the brain.

I'm coming more from a brain-like AGI lens, and my account of what goes on would be a bit different. I'm trying to phrase this in shard theory terminology.

First, a prerequisite: Why do Alice's shards generate thoughts that value Rick's state, to begin with? The Rick-shard has learned that actions that make Rick happy result in states of Alice that are reinforced (Alice being happy/healthy).

Given that, I see the process as follows:

Alice's Rick-shards generate thoughts at different levels of abstraction about Alice being happy/healthy because Rick is happy/likes her. Examples:
1. Conversion (maybe a cached thought out of discussions they had) -> will have low predicted value for Alice
2. Going to church with Rick -> mixed
3. Being close to Rick in the Church (emphasis on closeness, Church in the background, few aspects active) -> trend positively
4. Being in the Church and thinking it's wrong -> consistent
5. Rick being happy that she joins him -> positive
So the world model returns no plan but only fragments of potential plans, some where she converts and goes to church with Rick, some not, some other combinations.
As there is no plan, no purpose must be hidden. Shards only bid for or against parts of plans.
Some of these fragments satisfy enough requirements of both retaining atheist Alice's values (which are predicted to be good for her) as well as scoring on Rick-happiness. Elaborating on these fragments will lead to the activation of more details that are at least somewhat compatible with all shards too. We only call the result of this a "rationalization."
So that she eventually generates enough detailed thoughts that score positively that she actually decides to implement an aggregate of these fragments, which we can call a church-going plan.
So that she gets positive reinforcement for going to church,
which reinforces all aspects of the experience, including being in church, which, in aggregate, we can call a religion-shard,
1. I agree that this changes her internal shard balance significantly - she has learned something she didn't know before, and that leaves her better off (as measured by some fundamental health/happiness measurements).
2. I think this can be meaningfully called value drift only with respect to either existing shards (though these are an abstraction we are using, not something that's fundamental to the brain), or with respect to Alice's interpretations/verbalizations - thoughts that are themselves reinforced by shards.
so that more such thoughts come up, and she eventually converts,
so that Rick ends up happier and liking Alice more - though that was never the "plan" to begin with.

In short: There is no top-down planning but bottom-up action generation. All planning is constructed out of plan fragments that are compatible with all (existing) shards.
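One way to picture this bottom-up story is as a filter over plan fragments rather than a search for a plan. The sketch below is only an illustration under assumed numbers: each fragment gets a hypothetical score from each shard, and a fragment survives if no shard objects strongly. The fragment names paraphrase the examples above. The scores, the `veto` threshold, and the `compatible` function are made up.

```python
# Hypothetical per-shard scores for each thought fragment (illustrative only).
fragments = {
    "convert": {"rick": 1.0, "atheist_identity": -1.0},
    "go to church with Rick": {"rick": 0.8, "atheist_identity": -0.2},
    "be close to Rick, church in the background": {"rick": 0.9, "atheist_identity": 0.0},
    "Rick is happy that she joins him": {"rick": 1.0, "atheist_identity": 0.0},
}

def compatible(scores, veto=-0.5):
    """A fragment survives if no shard scores it at or below the veto threshold."""
    return min(scores.values()) > veto

# No top-down plan is ever requested; the "plan" is whatever survives.
church_going_plan = [name for name, scores in fragments.items() if compatible(scores)]
print(church_going_plan)
```

Here "convert" is filtered out, yet the surviving fragments still add up to going to church. Nothing in this toy model has to hide a purpose from anything else.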

Thanks for detailing it. I understand you to describe ~iterative filtering and refinement of a crude proto-plan (church-related thoughts and impulses) which filter down into a more detailed plan, where each piece is selected to be amenable to all relevant shards (without explicit planning).

I think it doesn't sound quite right to me, still, for a few reasons I'll think about more.

I regret that this post doesn't focus on practical advice derived from shard theory. Instead, I mostly focused on a really cool ideal-agency trick ("pretend really hard to wholly fool your own credit assignment"), which is cool but impracticable for real people (joining the menagerie currently inhabited by e.g. logical inductors, value handshakes, and open-source game theory).

I think that shard theory suggests a range of practical ways to improve your own value formation and rationality. For example, suppose I log in and see that my friend John complimented this post. This causes a positive reward event. By default, I might (subconsciously) think "this feels good because John complimented me." Which causes me to be more likely to act to make John (and others) approve of me.

However, that's not how I want to structure my motivation. Instead, in this situation, I can focus on the cognition I want reinforced:

this feels good because John complimented me, which happened because I thought carefully this spring and came up with new ideas, and then communicated them clearly. I'm glad I thought carefully, that was great. I noticed confusion when Quintin claimed (IIRC) that wireheading always makes you more of a wireheader—I stopped to ask whether that was actually true. What do I think I know, and why do I think I know it? Noticing that confusion was also responsible for this moment.

I'm basically repeating and focusing the parts I want to be reinforced. While I don't have a tight first-principles argument that this conscious attention does in fact redirect my credit assignment in the right way, I really think it should, and so I've started this practice on that hunch.