Originally a shortform comment.

Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard. 

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

  1. A baby learns "IF juice in front of me, THEN drink",
  2. The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
  3. The baby is later far from juice, and bumbles around until they're near the juice, whereupon she drinks the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
  4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
  5. ...

The juice shard chains into itself, as its outputs cause the learning process to further reinforce and generalize the juice-shard. This shard reinforces itself across time and thought-steps. 

But a "don't kill" shard seems like it should remain... stubby? Primitive? The "don't kill" shard can't self-chain into not doing something. If you're going to kill someone, and then don't because of the don't-kill shard, and that avoids predicted negative reward... Then maybe the "don't kill" shard gets reinforced and generalized a bit because it avoided negative reward (and so reward was higher than predicted, which I think would trigger e.g. a reinforcement event in people). 

But—on my current guesses and intuitions[1]—that shard doesn't become more sophisticated, it doesn't become reflective, it doesn't "agentically participate" in the internal shard politics (e.g. the agent's "meta-ethics", deciding what kind of agent it "wants to become"). Other parts of the agent want things, they want paperclips or whatever, and that's harder to do if the agent isn't allowed to kill anyone. 

Crucially, the no-killing injunction can probably be steered around by the agent's other values. While the obvious route of lesioning the no-killing shard might be reflectively-predicted by the world model to lead to more murder, and therefore bid against by the no-killing shard... There are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might bid up lesioning plans which are optimized so as to not make the killing a salient plan feature to the reflective world-model, and thus, the plan does not activate the no-killing shard.

This line of argumentation is a point in favor of the following: Don't embed a shard which doesn't want to kill. Make a shard which wants to protect / save / help people. That can chain into itself across time.


Other points:

  • Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why. 
  • This is one point in favor of the "convergent consequentialism" hypothesis, in some form. 
  • I think that people are not usually defined by negative values (e.g. "don't kill"), but by positives, and perhaps this is important.
  1. ^

    Which I won't actually detail right now.

New Comment
8 comments, sorted by Click to highlight new comments since: Today at 8:05 PM

I think the key question is how much policy-caused variance in the outcome there is.

That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near a juice but not in front of the juice, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't trigger reinforcement. The fact that these two plausible actions differ in their reinforcement is what I am referring to with "policy-caused variance".

If you are told not to kill someone, then you can usually stay sufficiently far away from killing anyone, in such a way that the variance in killing is 0, because killing itself has a constant value of 0. (Unlike juice-drinking which might have a value of 0 or 1, depending on both scenario and actions.)

But you could also have a case with a positive value that fails to be robust/lasting, if the positive value fails to have variance. One example would be if the positive value is bounded and always achieved; for instance you might imagine that if you are always carrying a juice dispenser, juice-drinking would always have a value of 1, and therefore again there wouldn't be any variance to reinforce seeking juice.

A more subtle point is that if the task is too difficult, e.g. if you are in a desert with no juice available, then the policy-caused variance is also 0, because the juice is constant 0. This is a general point that encompasses many failures of reinforcement learning. Often, if you don't do things like reward shaping, then reinforcement learning simply fails, because the task is too complex to learn.

I think future RL algorithms will be able to succeed using less policy-caused variance than present RL algorithms require by using models to track the outcomes through many layers of interactions.

Finally got around to reading this, only to discover that I independently noticed a similar thing just a couple days ago, see here for how I’m thinking about it. :)

I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they're on the same kind of framing/scale.

It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you're drinking juice), and hence possibly only activate less frequently, or only when parts of the state space like that are accessible, whereas prohibitions specify a large part of the state space that is bad (but not so much that the complement is a small portion - there are perhaps many potential states where you eat mouldy food, but the complement of that set is still not a similar size to the set of states of drinking juice). The first feels more suited to forming longer-term plans towards the small part of the state space (cf this definition of optimisation), whereas the second is less so. Then shards that start doing optimisation like this are hence more likely to become agentic/self-reflective/meta-cognitive etc.

In effect, positive values are more likely/able to self-chain because they actually (kind of, implicitly) specify optimisation goals, and hence shards can optimise them, and hence grow and improve that optimisation power, whereas prohibitions specify a much larger desirable state set, and so don't require or encourage optimisation as much.

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property. 

In the described scenario, "don't kill humans" may in fact lead to scenarios where the AI can again kill humans, but this feels like an ambient statistical property of the world (killing people is easy) and not like a property of the shard's optimization (the shard isn't influencing logits on the basis of whether those actions will lead to future opportunities to not kill people, or something?). So I do expect "don't kill people" to become more sophisticated/reflective, but I intuitively feel there remains some important difference that I can't quite articulate.

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.

 

I'm trying to understand why the juice shard has this propety. Which of these (if any) are the the explanation for this:

  • Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforcement when the reward comes, even though it's only a single reinforcement event (actually getting the juice).
  • Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues)

The first seems at least plausibly to also to apply to "avoid moldy food", if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)

The second does seem to be more specific to juice than mold, but it seems to me that's because getting juice is rare, and is something we can better and better at, whereas avoiding moldy food is something that's fairly easy to learn, and past that there's not much reinforcement to happen. If that's the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to "rare states and skills in which improvement leads to more reward".

Having just read tailcalled comment, I think that is in some sense another of phasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what's actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement is one possibly environmental feature that produces policy-caused variance.

My initial thoughts were:

  • On one hand, if you positively reinforce, the system will seek it out, if you negatively reinforce the system will work around it.
  • On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

Having thought about it a bit more, I think I managed to resolve the tension. It seems that if at least one of the actions is positive utility, then the system has a reason to maneuver you into a hypothetical state where you choose between them, while if both are negative utility then the system has a reason to actively steer you away from having to make such a choice.

(This analysis is still naive in that it doesn't account for opportunity cost).

I'd really love to see greater formalisation of this intuition. Even what I've said above is quite ambiguous.

On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g. whether person pursues juice in general).

(In general, I think reward is not best understood as an optimization target like "utility.")

Good point.

(That said, it seems like to useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet).