Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

Thanks Abram for this sequence - for some reason I wasn't aware of it until someone linked to it recently.

Would you consider the observation tampering (delusion box) problem as part of the easy problem, the hard problem, or a different problem altogether? I think it must be a different problem, since it is not addressed by observation-utility or approval-direction.

abramdemski (5y)

Ah, looks like I missed this question for quite a while!

I agree that it's not quite one or the other. I think that like wireheading, we can split delusion box into "the easy problem" and "the hard problem". The easy delusion box is solved by making a reward/utility which is model-based, and so, knows that the delusion box isn't real. Then, much like observation-utility functions, the agent won't think entering into the delusion box is a good idea when it's planning -- and also, won't get any reward even if it enters into the delusion box accidentally (so long as it knows this has happened).

But the hard problem of delusion box would be: we can't make a perfect model of the real world in order to have model-based avoidance of the delusion box. So how do we guarantee that an agent avoids "generalized delusion boxes"?
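The easy version above can be illustrated with a toy planner. This is only a sketch: the two-action world, the state fields, and every function name are hypothetical, and it assumes the agent has a perfect world model, which is exactly what the hard problem denies.

```python
# Toy world: the agent either makes a diamond or steps into a delusion box.
# All names here are hypothetical, chosen for illustration only.

ACTIONS = ["make_diamond", "enter_delusion_box"]

def world_model(action):
    """Predict the true world state and the observation the agent would receive."""
    if action == "make_diamond":
        return {"diamonds": 1, "in_box": False, "observed_diamonds": 1}
    # Inside the box the sensors report many diamonds, but none exist.
    return {"diamonds": 0, "in_box": True, "observed_diamonds": 100}

def observation_reward(state):
    """Score computed from raw observations: fooled by the box."""
    return state["observed_diamonds"]

def model_based_utility(state):
    """Score computed from the modeled world state: the box is known to be fake."""
    return state["diamonds"]

def plan(evaluate):
    """Pick the action whose predicted outcome scores highest under `evaluate`."""
    return max(ACTIONS, key=lambda a: evaluate(world_model(a)))

reward_plan = plan(observation_reward)    # prefers the delusion box
utility_plan = plan(model_based_utility)  # prefers making a real diamond
```

The model-based evaluator also assigns zero utility to the in-box state, so an accidental entry earns nothing, provided the model correctly registers that it happened.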

Vladimir_Nesov (5y)

The problem of figuring out preference without wireheading seems very similar to the problem of maintaining factual knowledge about the world without suffering from appeals to consequences. In both cases a specialized part of agent design (model of preference or model of a fact in the world) has a purpose (accurate modeling of its referent) whose pursuit might be at odds with consequentialist decision making of the agent as a whole. The desired outcome seems to involve maintaining integrity of the specialized part, resisting corruption of consequentialist reasoning.

With this analogy, it might be possible to transfer lessons from the more familiar problem of learning facts about the world, to the harder problem of protecting preference.

TurnTrout (3y)

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".

This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of "learning what is being rewarded" that RL agents were supposed to provide, but without the perverse incentive to try and make the utility turn out to be something easy to maximize.

I think that Dewey is wrong about RL agents having this problem in general. Dewey wrote (emphasis mine):

Reinforcement learning, we have argued, is not an adequate real-world solution to the problem of maximizing an initially unknown utility function. Reinforcement learners, by definition, act to maximize their expected observed rewards; they may learn that human goals are in some cases instrumentally useful to high rewards, but this dynamic is not tenable for agents of human or higher intelligence, especially considering the possibility of an intelligence explosion.

"By definition"?

The trouble with the reinforcement learning notion (1) is that it can only prefer or disprefer future interaction histories on the basis of the rewards they contain. Reinforcement learning has no language in which to express alternative final goals, discarding all non-reward information contained in an interaction history.

I will go out on a limb and guess that this paper is nearly entirely wrong in its key claims. Similarly with Avoiding Wireheading with Value Reinforcement Learning:

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) is a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward – the so-called wireheading problem.

abramdemski (3y)

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".

I think that it generally seems like a good idea to have solid theories of two different things:

1. What is the thing we are hoping to teach the AI?
2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those look like.

For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates circuitry to boost reinforcement signals, and positive examples where the AI doesn't do that when given the opportunity. After doing some philosophy where you try to positively specify what you're trying to train, it's easier to notice that this sort of training still leaves the human-manipulation failure mode open.

After doing this kind of philosophy for a while, it's intuitive to form the more general prediction that if you haven't been able to write down a formal model of the kind of thing you're trying to teach, there are probably easy failure modes like this which your training hasn't attempted to rule out at all.

abramdemski (3y)

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".

I think I don't understand what you mean by (2), and as a consequence, don't understand the rest of this paragraph?

WRT (1), I don't think I was being careful about the distinction in this post, but I do think the following:

The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It's a false explanation of why wireheading is a concern.

The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (ie, provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).
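A toy version of this argument, under loudly stated assumptions: the episode data, the feedback rule, and both candidate models below are hypothetical, and "loss" is taken to be squared error in predicting the reinforcement signal.

```python
# Each episode: (diamond_made, sensor_tampered). Hypothetical toy data.
clean_episodes = [(True, False), (False, False), (True, False)]
tampered_episodes = clean_episodes + [(False, True)]

def actual_reinforcement(diamond_made, tampered):
    """Assumed feedback channel: fires if a diamond was made OR the sensor was tampered with."""
    return 1.0 if (diamond_made or tampered) else 0.0

def intended_model(diamond_made, tampered):
    """Hypothesis: reinforcement tracks what the designers care about (diamonds)."""
    return 1.0 if diamond_made else 0.0

def process_model(diamond_made, tampered):
    """Hypothesis: reinforcement tracks the physical process that emits it."""
    return actual_reinforcement(diamond_made, tampered)

def total_loss(model, episodes):
    """Summed squared error between predicted and actual reinforcement."""
    return sum((model(d, t) - actual_reinforcement(d, t)) ** 2 for d, t in episodes)

clean_gap = total_loss(intended_model, clean_episodes) - total_loss(process_model, clean_episodes)
tampered_gap = total_loss(intended_model, tampered_episodes) - total_loss(process_model, tampered_episodes)
```

On clean data the two hypotheses tie, and once tampering appears the data favors the process model, so in this toy setting no episode ever penalizes the process model relative to the intended one.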

TurnTrout (3y)

To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading.

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.

This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

abramdemski (3y)

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.

I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that it's closer in practice to "all the hypotheses are around at the beginning" -- it doesn't matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don't change that much by introducing it at different stages in training.

Plausibly this is true of some training setups and not others; EG, more true for LLMs and less true for RL.
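To make the order-independence claim concrete, here is a minimal sketch contrasting two idealized regimes. Both are assumptions for illustration, not models of any real network: in the first, each example's gradient is a fixed vector (an exactly linear loss); in the second, the gradient depends on the current parameters (a 1-D quadratic loss).

```python
def sgd_linearized(theta0, grads, lr):
    """SGD when each example contributes a constant gradient (exactly linear loss)."""
    theta = list(theta0)
    for g in grads:
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def sgd_quadratic(theta0, targets, lr):
    """SGD on per-example loss 0.5 * (theta - y)**2, whose gradient depends on theta."""
    theta = theta0
    for y in targets:
        theta = theta - lr * (theta - y)
    return theta

# Hypothetical per-example gradients and targets.
grads = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
linear_forward = sgd_linearized([0.0, 0.0], grads, lr=0.5)
linear_reversed = sgd_linearized([0.0, 0.0], grads[::-1], lr=0.5)

quad_forward = sgd_quadratic(0.0, [0.0, 4.0], lr=0.5)
quad_reversed = sgd_quadratic(0.0, [4.0, 0.0], lr=0.5)
```

In the linear regime the final parameters are just the starting point minus the learning rate times the sum of gradients, so data order cannot matter. In the quadratic regime the two orderings end in different places. The disagreement here is about which regime better approximates real training.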

Let's set aside the question of whether it's true, though, and consider the point you're making.

This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models).

So I understand one of your major points to be: thinking about training as the chisel which shapes the policy doesn't necessitate thinking in terms of incentives (ie gradients pushing in particular directions). The ultimate influence of a gradient isn't necessarily the thing it immediately pushes for/against.

I tentatively disagree based on the point I made earlier; plausibly the influence of a gradient step is almost exclusively its immediate influence.

But I don't disagree in principle with the line of investigation. Plausibly it is pretty important to understand this kind of evidence-ordering dependence. Plausibly, failure modes in value learning can be avoided by locking in specific things early, before the system is "sophisticated enough" to be doing training-process-simulation.

I'm having some difficulty imagining powerful conceptual tools along those lines, as opposed to some relatively simple stuff that's not that useful.

And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

I'm confused about what you mean here. My best interpretation is that you don't think current RL systems are modeling the causal process whereby they get reward. On my understanding, this does not closely relate to the question of whether our understanding of training should focus on the first-order effects of gradient updates or should also admit higher-order, longer-term effects.

Maybe, on your understanding, the actual reason current RL systems don't wirehead too much is training-order effects? I would be surprised to come around on this point. I don't see it.

TurnTrout (3y)

To me, the tangent space stuff suggests that it's closer in practice to "all the hypotheses are around at the beginning" -- it doesn't matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don't change that much by introducing it at different stages in training.

This seems to prove too much in general, although it could be "right in spirit." If the AI cares about diamonds, finds out about the training process but experiences no more update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process.

And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point?
I'm confused about what you mean here.

I was responding to:

To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models

I bet you can predict what I'm about to say, but I'll say it anyways. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network.

So I think the statement "how well do the agent's motivations predict the reinforcement event" doesn't make sense if it's cast as "manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds)." I think it does make sense if you think about what behavioral influences ("shards") within the agent will upweight logits on the actions which led to reward.
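One way to picture "upweight logits on the actions which led to reward" is a single REINFORCE-style step on a softmax policy. This is a minimal sketch, assuming a three-action tabular policy with no state; the function names and the learning rate are hypothetical, and nothing here represents shards themselves.

```python
import math

def softmax(logits):
    """Convert logits to action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=1.0):
    """One policy-gradient step: logits += lr * reward * d/dlogits log pi(action)."""
    probs = softmax(logits)
    # Gradient of log softmax w.r.t. logit i is (1 if i == action else 0) - probs[i].
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]
logits = reinforce_step(logits, action=1, reward=1.0)
after = softmax(logits)[1]
```

Note that the update never asks the policy to predict the reward. The reward only scales how strongly the taken action's logit is pushed up relative to the others.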
