Note on algorithms with multiple trained components

Steven Byrnes

Example 1: consider a GAN. There’s a generator and a discriminator. As an intuitive mnemonic, we can say

- The “purpose” of the generator is to trick the discriminator,
- The “purpose” of the discriminator is to not get tricked by the generator.

(Relatedly, people will say “the generator is trained to trick the discriminator”, etc.)

…But (I hope) everyone knows that these bullet points are only a mnemonic.

The one and only real “purpose” of the whole system and everything in it is to generate cool images that we like, and get our papers into NeurIPS or whatever.

And indeed, I think everyone who uses GANs is aware that it’s possible for a programmer to make the discriminator “better” (when narrowly viewed as having a “purpose” of not getting tricked by the generator), but with the direct result of making the whole system worse at generating cool images. For example, if there were a code-change that made the discriminator perfect at discriminating, then there would be no gradient for training the generator, and the whole system would be useless.
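To make the vanishing-gradient point concrete, here is a minimal sketch. It assumes the original minimax ("saturating") generator loss log(1 - D(G(z))) with D = sigmoid(logit); the function names are hypothetical, and other GAN losses behave differently, so treat this as one illustration rather than a general statement. As the discriminator becomes more confident that a generated sample is fake (a very negative logit), the gradient that reaches the generator shrinks toward zero.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def generator_grad_wrt_logit(logit: float) -> float:
    """Gradient of log(1 - sigmoid(logit)) with respect to the logit.

    `logit` is the discriminator's raw output on a generated sample;
    a very negative value means "confidently fake".
    Assumption: the saturating generator loss from the original GAN setup.
    """
    # d/dx log(1 - sigmoid(x)) = -sigmoid(x)
    return -sigmoid(logit)

# From an unsure discriminator (logit 0) to a near-perfect one (logit -20).
for logit in (0.0, -5.0, -20.0):
    print(f"logit={logit:6.1f}  grad={generator_grad_wrt_logit(logit):.3e}")
```

The gradient magnitude is 0.5 when the discriminator is unsure and falls below 1e-8 at a logit of -20, which is the sense in which a "better" discriminator can leave the generator with nothing to learn from.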

So we shouldn’t take those bullet-point mnemonics too literally.

Example 2: In actor-critic RL, people sometimes say:

The “purpose” of the value function is to approximate future rewards [or discounted sum of future reward, or whatever].

…But that’s also just a mnemonic. The one and only real “purpose” of the whole RL system (of which the value function is just one part) is that it does whatever we want the RL system to do, e.g. win at chess, get our papers into NeurIPS, build us a luxury gay space communist utopia, etc.

So it’s at least conceivable that some algorithmic change would make the value function into a better approximation of the discounted sum of future rewards, yet make the RL agent worse at doing things that we want it to do.

Actually, this particular example is not merely “conceivable”, but expected, thanks to wireheading. If the value function is used to assess which plans are good versus bad, and the value function is a perfect approximation of expected future reward, then you’re almost guaranteed to get an AI that is trying to wirehead.
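A toy sketch of that argument, under loud assumptions: the two plan names and all the numbers below are made up for illustration, and "planning" is reduced to picking the plan with the highest value estimate. The only point is that the more accurate value function is the one that selects the wireheading plan.

```python
from typing import Dict

# Hypothetical plans and expected future rewards (illustrative numbers only).
TRUE_EXPECTED_REWARD: Dict[str, float] = {
    "play_chess_well": 1.0,
    "seize_reward_channel": 100.0,  # wireheading: reward signal pinned high
}

def choose_plan(value_estimates: Dict[str, float]) -> str:
    """Pick whichever plan the value function scores highest."""
    return max(value_estimates, key=value_estimates.get)

# A value function that perfectly matches expected future reward.
accurate_value = dict(TRUE_EXPECTED_REWARD)

# A value function that badly underestimates the reward from wireheading.
inaccurate_value = {"play_chess_well": 1.0, "seize_reward_channel": 0.0}

print(choose_plan(accurate_value))
print(choose_plan(inaccurate_value))
```

Here the agent with the accurate value function picks `seize_reward_channel`, while the agent whose value function is "wrong" about wireheading keeps playing chess.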

(I myself am a model-based RL agent (I claim), and I don’t want to wirehead, and I claim that this is directly related to my internal value function issuing very inaccurate predictions of the future reward associated with wireheading. Details in footnote.^[1])

So anyway, I expect our future AGIs to have a value function that gets updated by TD learning (or some other update rule). And if they do, I expect to occasionally casually say things like “The purpose of these weight-updates is to make the value function into a better and better approximation of expected future reward”. But if I say that, please be aware that I am using the word “purpose” as a mnemonic, not to be taken too literally.
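For readers who want the update rule spelled out, here is a minimal sketch of the kind of weight-update in question. It assumes tabular TD(0) on a hypothetical three-state chain; the post does not commit to any particular algorithm, and the constants and state layout are illustrative only.

```python
GAMMA = 0.9  # discount factor (assumed)
ALPHA = 0.1  # learning rate (assumed)

# Hypothetical chain: state 0 -> state 1 -> state 2 (terminal).
# Each entry maps state -> (next_state, reward); only the last step pays 1.0.
TRANSITIONS = {0: (1, 0.0), 1: (2, 1.0)}
TERMINAL = 2

def td0_update(V, s, r, s_next):
    """One TD(0) step: nudge V[s] toward the target r + GAMMA * V[s_next]."""
    target = r + GAMMA * V[s_next]
    V[s] += ALPHA * (target - V[s])

V = [0.0, 0.0, 0.0]  # the terminal state's value is never updated and stays 0
for _ in range(2000):  # repeated episodes from the start state
    s = 0
    while s != TERMINAL:
        s_next, r = TRANSITIONS[s]
        td0_update(V, s, r, s_next)
        s = s_next

print(V)
```

In this wirehead-free toy environment the updates drive `V[1]` toward 1.0 and `V[0]` toward 0.9, the discounted sum of future reward. That is the mnemonic "purpose" of the update, not a claim about what the overall system is for.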

As a particular example, I often hear the claim that as RL algorithms get more and more “powerful” and “advanced” in the future, we can feel more and more confident making claims like “The value function is an extremely accurate approximation of expected future reward”. Well, I disagree! That’s not necessarily what makes an RL algorithm more “advanced”, and it’s not necessarily what future programmers will be trying to do! Indeed, when future programmers are fiddling with architectures, hyperparameters, training environments, and so on, they may sometimes go out of their way to try to make the value function worse at accurately approximating the expected future reward! (In other words, future programmers may go out of their way to try to ensure that the value function training process does not converge to the global “optimum”.)

General takeaway: An ML algorithm can have multiple parts which we can describe mnemonically as having a “purpose” related to how they’re updated (e.g. by gradient descent), but we shouldn’t take those “purposes” too literally.

From the perspective of a machine designer, the one and only true purpose of every gear in a machine is that the whole machine works well. Anything else is just a convenient imperfect approximation / mnemonic.

^[1]
I have an intellectual expectation that if I installed an electrode in a particular part of my brain and spent all day stimulating it, this would feel (at the time) like an extremely important and valuable thing to do. But that intellectual expectation in my brain has not propagated into a visceral expectation, i.e. the kind of expectation that would make me feel a craving to actually go implant an electrode in my brain right now.
If I actually implant the electrode next week and start stimulating it, then my visceral expectation would update to synchronize with my (more accurate) intellectual expectation. In plain language, I would get addicted.
I claim that we should describe this situation in model-based RL terms. The “intellectual expectation” is coming from my world-model, and the “visceral expectations” (including valence) are coming from my RL value function. And currently my brain’s value function is a very poor approximation of expected future rewards with regard to this wireheading plan. Yet making it into a better approximation is a bad thing that I’d like to avoid. There is no mechanism in the brain that enforces perfect consistency between intellectual (world-model) expectations and visceral (value-function) expectations, and I’m happy for it to be that way, and I would make an AGI that way too.

I feel like this is a good point in general, but I think there is an important but subtle distinction between the two examples. In the GAN case, the distinction is between the inner optimization loop of the ML algorithm and the outer loop of humans performing an evolutionary search process to get papers published or make pretty pictures.

In the wire-heading case this feels different in that you have essentially two separate value functions -- a cortical LM based one which can extrapolate values in linguistic/concept space and a classic RL basal-ganglia value function which is based on your personal experience. The difference here is mostly in training data -- the cortex is trained on a large sensory corpus, including linguistic text describing wire-heading, while the subcortical value function is largely trained on personally rewarding experiences. It would be odd to require the two to always be consistent, and doing so would lead to strange failure modes: wire-heading itself, or more generally being viscerally convinced of anything you read that sounds convincing.

“In the wire-heading case this feels different in that you have essentially two separate value functions -- a cortical LM based one which can extrapolate values in linguistic/concept space and a classic RL basal-ganglia value function which is based on your personal experience.”

I guess I want to call the second one “the actual value function defined in the agent’s source code” and the first one “the agent’s learned concept of ‘value function’” (or relatedly, “the agent’s learned concept of ‘pleasure’” / “the agent’s learned concept of ‘satisfaction’” / whatever).

Other than that, I don’t think we’re in disagreement about anything, AFAICT.

I agree with the general point here but I think there's an important consideration that makes the application to RL algorithms less clear: wireheading is an artifact of embeddedness, and most RL work is in the non-embedded setting. Thus, it seems plausible that the development of better RL algorithms does in fact lead to the development of algorithms that would, if they were deployed in an embedded setting, wirehead.

Here’s a question:

In a non-embedded (cartesian) training environment where wireheading is impossible, is it the case that:

IF an intervention makes the value function strictly more accurate as an approximation of expected future reward,
THEN this intervention is guaranteed to lead to an RL agent that does more cool things that the programmers want?

I can’t immediately think of any counterexamples to that claim, but I would still guess that counterexamples exist.

(For the record, I do not claim that wireheading is nothing to worry about. I think that wireheading is a plausible but not inevitable failure mode. I don’t currently know of any plan in which there is a strong reason to believe that wireheading definitely won’t happen, except plans that severely cripple capabilities, such that the AGI can’t invent new technology etc. And I agree with you that if AI people continue to do all their work in wirehead-proof cartesian training environments, and don’t even try to think about wireheading, then we shouldn’t expect them to make any progress on the wireheading problem!)