Positive values seem more robust and lasting than prohibitions

I think the key question is how much policy-caused variance in the outcome there is.

That is, juice can chain onto itself because we assume there will be a bunch of scenarios where reinforcement-triggering depends a lot on the choices made. If you are near a juice but not in front of the juice, you can walk up to the juice, which triggers reinforcement, or you can not walk up to the juice, which doesn't trigger reinforcement. The fact that these two plausible actions differ in their reinforcement is what I am referring to with "policy-caused variance".

If you are told not to kill someone, then you can usually stay sufficiently far away from killing anyone, in such a way that the variance in killing is 0, because killing itself has a constant value of 0. (Unlike juice-drinking which might have a value of 0 or 1, depending on both scenario and actions.)

But you could also have a case with a positive value that fails to be robust/lasting, if the positive value fails to have variance. One example would be if the positive value is bounded and always achieved; for instance you might imagine that if you are always carrying a juice dispenser, juice-drinking would always have a value of 1, and therefore again there wouldn't be any variance to reinforce seeking juice.

A more subtle point is that if the task is too difficult, e.g. if you are in a desert with no juice available, then the policy-caused variance is also 0, because the juice is constant 0. This is a general point that encompasses many failures of reinforcement learning. Often, if you don't do things like reward shaping, then reinforcement learning simply fails, because the task is too complex to learn.

I think future RL algorithms will be able to succeed using less policy-caused variance than present RL algorithms require by using models to track the outcomes through many layers of interactions.

[-]Steven Byrnes3y30

Finally got around to reading this, only to discover that I independently noticed a similar thing just a couple days ago, see here for how I’m thinking about it. :)

[-]RobertKirk3y20

I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they're on the same kind of framing/scale.

It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you're drinking juice), and hence possibly only activate less frequently, or only when parts of the state space like that are accessible, whereas prohibitions specify a large part of the state space that is bad (but not so much that the complement is a small portion - there are perhaps many potential states where you eat mouldy food, but the complement of that set is still not a similar size to the set of states of drinking juice). The first feels more suited to forming longer-term plans towards the small part of the state space (cf this definition of optimisation), whereas the second is less so. Then shards that start doing optimisation like this are hence more likely to become agentic/self-reflective/meta-cognitive etc.

In effect, positive values are more likely/able to self-chain because they actually (kind of, implicitly) specify optimisation goals, and hence shards can optimise them, and hence grow and improve that optimisation power, whereas prohibitions specify a much larger desirable state set, and so don't require or encourage optimisation as much.

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

[-]TurnTrout3y41

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.

In the described scenario, "don't kill humans" may in fact lead to scenarios where the AI can again kill humans, but this feels like an ambient statistical property of the world (killing people is easy) and not like a property of the shard's optimization (the shard isn't influencing logits on the basis of whether those actions will lead to future opportunities to not kill people, or something?). So I do expect "don't kill people" to become more sophisticated/reflective, but I intuitively feel there remains some important difference that I can't quite articulate.

[-]RobertKirk3y30

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.

I'm trying to understand why the juice shard has this propety. Which of these (if any) are the the explanation for this:

Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforcement when the reward comes, even though it's only a single reinforcement event (actually getting the juice).
Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues)

The first seems at least plausibly to also to apply to "avoid moldy food", if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)

The second does seem to be more specific to juice than mold, but it seems to me that's because getting juice is rare, and is something we can better and better at, whereas avoiding moldy food is something that's fairly easy to learn, and past that there's not much reinforcement to happen. If that's the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to "rare states and skills in which improvement leads to more reward".

Having just read tailcalled comment, I think that is in some sense another of phasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what's actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement is one possibly environmental feature that produces policy-caused variance.

[-]Chris_Leong3y10

My initial thoughts were:

On one hand, if you positively reinforce, the system will seek it out, if you negatively reinforce the system will work around it.
On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

Having thought about it a bit more, I think I managed to resolve the tension. It seems that if at least one of the actions is positive utility, then the system has a reason to maneuver you into a hypothetical state where you choose between them, while if both are negative utility then the system has a reason to actively steer you away from having to make such a choice.

(This analysis is still naive in that it doesn't account for opportunity cost).

I'd really love to see greater formalisation of this intuition. Even what I've said above is quite ambiguous.

[-]TurnTrout3y30

On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g. whether person pursues juice in general).

(In general, I think reward is not best understood as an optimization target like "utility.")

[-]Chris_Leong3y10

Good point.

(That said, it seems like to useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet).

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

25

Positive values seem more robust and lasting than prohibitions

25