Originally a shortform comment.
Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard.
Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In "The shard theory of human values", I wrote about how:
- A baby learns "IF juice in front of me, THEN drink",
- The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
- The baby is later far from juice, and bumbles around until she's near the juice, whereupon she drinks the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
- Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
The juice shard chains into itself, as its outputs cause the learning process to further reinforce and generalize the juice-shard. This shard reinforces itself across time and thought-steps.
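The self-chaining story above can be sketched as a toy model. This is only an illustration under loud assumptions: contexts are collapsed to an integer "distance from juice", bumbling is replaced by a deterministic step toward the juice, and credit assignment reinforces only the context visited right before the shard fired. The names `run_episode` and `train_juice_shard` are hypothetical.

```python
def run_episode(triggers: set, start: int) -> list:
    """Return the contexts visited before the juice shard took over.

    `triggers` is the set of distances at which the shard already activates.
    Distance 0 ("juice in front of me") is always a trigger, so the loop ends.
    """
    position = start
    visited = []
    while position not in triggers:
        visited.append(position)
        position -= 1  # stand-in for "bumbling around" until near the juice
    return visited  # the shard fires at `position`, the baby drinks, reward arrives


def train_juice_shard(episodes: int, max_distance: int = 5) -> set:
    """Toy credit assignment: each rewarded episode extends the shard by one context."""
    triggers = {0}  # initially only "IF juice in front of me, THEN drink"
    for _ in range(episodes):
        visited = run_episode(triggers, max_distance)
        if visited:
            # Assumption: the context just before activation gets reinforced,
            # so the shard's own output is what lets it generalize further.
            triggers.add(visited[-1])
    return triggers


if __name__ == "__main__":
    for n in (0, 1, 3, 5):
        print(n, sorted(train_juice_shard(n)))
```

In this sketch, the shard's activation is what produces the reward that widens its own activation contexts, which is the sense in which it chains into itself.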
But a "don't kill" shard seems like it should remain... stubby? Primitive? The "don't kill" shard can't self-chain into not doing something. If you're going to kill someone, and then don't because of the don't-kill shard, and that avoids predicted negative reward... Then maybe the "don't kill" shard gets reinforced and generalized a bit because it avoided negative reward (and so reward was higher than predicted, which I think would trigger e.g. a reinforcement event in people).
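The reinforcement event guessed at above can be written as a simple prediction-error update. This is a minimal sketch, assuming reinforcement scales with how much the actual reward exceeded the predicted reward; the function name `reinforce`, the scalar "shard strength", and the learning rate are all hypothetical.

```python
def reinforce(strength: float, actual_reward: float,
              predicted_reward: float, learning_rate: float = 0.1) -> float:
    """Strengthen a shard in proportion to positive reward prediction error."""
    prediction_error = actual_reward - predicted_reward
    # Assumption: only better-than-predicted outcomes reinforce the shard.
    return strength + learning_rate * max(prediction_error, 0.0)


if __name__ == "__main__":
    # The agent predicted a penalty of -1 for killing, the shard vetoed the
    # plan, and the actual reward was 0: reward was higher than predicted.
    print(reinforce(strength=1.0, actual_reward=0.0, predicted_reward=-1.0))
```

Note that this update only makes the existing veto stronger. Nothing in it widens the set of situations the shard acts in, which matches the "stubby" intuition.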
But—on my current guesses and intuitions—that shard doesn't become more sophisticated, it doesn't become reflective, it doesn't "agentically participate" in the internal shard politics (e.g. the agent's "meta-ethics", deciding what kind of agent it "wants to become"). Other parts of the agent want things, they want paperclips or whatever, and that's harder to do if the agent isn't allowed to kill anyone.
Crucially, the no-killing injunction can probably be steered around by the agent's other values. The obvious route of lesioning the no-killing shard might be reflectively predicted by the world model to lead to more murder, and therefore be bid against by the no-killing shard... but there are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might bid up lesioning plans which are optimized so that killing is not a salient plan feature to the reflective world-model. Such a plan does not activate the no-killing shard.
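The steering-around failure mode can be illustrated with a toy bidding model. Everything here is an assumption made for illustration: plans are reduced to a set of salient features plus a paperclip payoff, each shard bids a number, and the agent picks the plan with the highest total bid. The names `Plan`, `no_kill_bid`, `paperclip_bid`, and `choose_plan` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Plan:
    name: str
    salient_features: FrozenSet[str]  # what the reflective world-model surfaces
    paperclips: float                 # hypothetical payoff to the paperclip shard


def no_kill_bid(plan: Plan) -> float:
    """The no-killing shard only activates when killing is a salient feature."""
    return -100.0 if "killing" in plan.salient_features else 0.0


def paperclip_bid(plan: Plan) -> float:
    return plan.paperclips


def choose_plan(plans: List[Plan]) -> Plan:
    """Pick the plan with the highest summed bid across shards."""
    return max(plans, key=lambda p: no_kill_bid(p) + paperclip_bid(p))


PLANS = [
    Plan("make paperclips peacefully", frozenset({"paperclips"}), 1.0),
    Plan("lesion the no-kill shard", frozenset({"self-modification", "killing"}), 10.0),
    # Same downstream effect, but optimized so killing is not salient.
    Plan("lesion, framed as routine cleanup", frozenset({"self-modification"}), 10.0),
]

if __name__ == "__main__":
    print(choose_plan(PLANS).name)
```

In this sketch, the veto works against the plan where killing is salient, and fails against the plan optimized to hide that feature, because the shard's bid depends on what is surfaced rather than on what the plan leads to.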
This line of argumentation is a point in favor of the following: Don't embed a shard whose only content is "don't kill." Instead, make a shard which wants to protect / save / help people. That can chain into itself across time.
- Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why.
- This is one point in favor of the "convergent consequentialism" hypothesis, in some form (which I won't actually detail right now).
- I think that people are not usually defined by negative values (e.g. "don't kill"), but by positives, and perhaps this is important.