Models Don't "Get Reward"

dsj (3y)

I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.

In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure), and it’s probably not reward in future episodes (which seems to be the basis for some concerns around “situationally aware” agents acting nicely during training so they can trick us and get to act evil after training when they’re more powerful).

One caveat with the selection metaphor, though: it can be misleading in its own way. Taken naively, it implies that we’re selecting uniformly from all possible random initializations that would reach very small loss on the training set. In fact, gradient descent will prefer points at the bottom of large attractor basins of somewhat small loss, not just points that have very small loss in isolation. And this is before even taking into account the nonstationarity of the training data in a typical reinforcement learning setting, where the sampled trajectories change over time as the agent itself changes.

One way this distinction can matter: if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.
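
To make the "equally good but more risky" point concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: a one-dimensional "policy parameter" `theta`, a hand-built `reward` landscape with two peaks of equal height (one wide, one narrow), and a helper `mean_reward_nearby` that averages reward over small perturbations as a stand-in for "slightly less competent versions" of the policy. It says nothing about how real training runs behave; it only shows what the two cases look like side by side.

```python
import math

def reward(theta):
    """Hypothetical reward landscape with two peaks of equal height 1.0."""
    # Wide peak centered at theta = -2 (standard deviation 1.0).
    wide = math.exp(-((theta + 2.0) ** 2) / (2 * 1.0 ** 2))
    # Narrow peak centered at theta = +2 (standard deviation 0.05).
    narrow = math.exp(-((theta - 2.0) ** 2) / (2 * 0.05 ** 2))
    return max(wide, narrow)

def mean_reward_nearby(center, radius=0.5, steps=101):
    """Average reward over evenly spaced perturbations of `center`."""
    total = 0.0
    for i in range(steps):
        offset = -radius + 2 * radius * i / (steps - 1)
        total += reward(center + offset)
    return total / steps

# Both peaks look equally good when evaluated exactly at the peak...
print(reward(-2.0), reward(2.0))
# ...but they differ a lot once the parameter is perturbed slightly.
print(mean_reward_nearby(-2.0), mean_reward_nearby(2.0))
```

In this toy landscape the wide peak keeps most of its reward under perturbation while the narrow peak loses most of it, which is the sense in which the "risky" policy is less likely to be selected for.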

This might actually suggest a strategy for training out deception: do it early and intensely, before the model becomes competent at it, punishing detectable deception (when e.g. interpretability tools can reveal it) much more than honest mistakes, with the hope of knocking the model out of any attractor basin for very deceptive behavior early on, when we can clearly see it, rather than later on, when its deceptions have gotten good enough that we have trouble detecting them. (This assumes that there is an “honesty” attractor basin, i.e. that low-competence versions of honesty generalize naturally, remaining honest as models become more competent. If not, then this fact might itself be apparent for multiple increments of competence prior to the model getting good enough to frequently trick us, or even being situationally aware enough that it acts as if it were honest because it knows it’s not good enough to trick us.)

More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to only fine-tune fully pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.

[ETA: Just want to clarify that the last two paragraphs are pretty speculative and possibly wrong or overstated! I was mostly thinking out loud. Definitely would like to hear good critiques of this.

Also changed a few words around for clarity.]

Raemon (3y)

Curated. I think I had read a bunch of stuff pointing in this direction before, but somehow this post helped the concepts (i.e. the distinction between selecting for bad behavior and for goal-directedness) be a lot clearer in my mind.

David Reber (3y)

Under the "reward as selection" framing, I find the behaviour much less confusing:
We use reward to select for actions that led to the agent reaching the coin.
This selects for models implementing the algorithm "move towards the coin".
However, it also selects for models implementing the algorithm "always move to the right".
It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.
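
The selection argument quoted above can be sketched with a toy one-dimensional world. This is not the actual CoinRun environment: the function `run_episode`, the two policies, the grid width, and the coin positions are all hypothetical stand-ins chosen to keep the example small.

```python
def run_episode(policy, coin_pos, start=5, width=11, max_steps=20):
    """Return 1.0 if the agent reaches the coin within max_steps, else 0.0."""
    pos = start
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1.0
        # The policy returns a step of -1 or +1; clamp to the grid edges.
        pos = max(0, min(width - 1, pos + policy(pos, coin_pos)))
    return 1.0 if pos == coin_pos else 0.0

def move_toward_coin(pos, coin_pos):
    return 1 if coin_pos > pos else -1

def always_move_right(pos, coin_pos):
    return 1  # ignores the coin entirely

train_coins = [10] * 5   # assumption: in training the coin is always at the right edge
test_coins = [0, 2, 10]  # assumption: at test time the coin placement varies

for policy in (move_toward_coin, always_move_right):
    train_reward = sum(run_episode(policy, c) for c in train_coins)
    test_reward = sum(run_episode(policy, c) for c in test_coins)
    print(policy.__name__, train_reward, test_reward)
```

Both policies collect the same total reward on the training placements, so selection on reward alone cannot tell them apart. They only come apart on the varied placements.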

I've been reconsidering the CoinRun example as well recently from a causal perspective, and your articulation helped me crystallize my thoughts. Building on the points above, it seems clear that the core issue is one of causal confusion: that is, the true causal model M is "move right" -> "get the coin" -> "get reward". However, if the variable "did you get the coin" is effectively latent (because the model selection doesn't discriminate on this variable), then the causal model M is indistinguishable from M', which is "move right" -> "get reward" (which, though it is not the true causal model governing the system, generates the same observational distribution).
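
A minimal sketch of that indistinguishability, under made-up assumptions: `model_M` and `model_M_prime` are hypothetical encodings of M and M' as plain functions, and `coin_on_right` is an illustrative switch for where the coin sits. With the coin fixed on the right (the observational setting), the two models assign the same reward to every action; they only disagree under an intervention that moves the coin.

```python
def model_M(action, coin_on_right=True):
    """M: action -> got_coin -> reward."""
    # You get the coin only by moving toward the side where it sits.
    got_coin = (action == "right") == coin_on_right
    return int(got_coin)

def model_M_prime(action, coin_on_right=True):
    """M': action -> reward, with no coin variable at all."""
    return int(action == "right")

actions = ("left", "right")

# Observational setting: coin always on the right, as in training.
observational_match = all(
    model_M(a) == model_M_prime(a) for a in actions
)

# Intervention: move the coin to the left and compare again.
interventional_match = all(
    model_M(a, coin_on_right=False) == model_M_prime(a, coin_on_right=False)
    for a in actions
)

print(observational_match, interventional_match)
```

The first comparison agrees on every action and the second does not, so data gathered only in the observational setting cannot separate M from M'.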

In fact, the incorrect model M' has a shorter description length, so there may be a bias here against learning the true causal model. If so, I believe we have a compelling explanation for the CoinRun phenomenon which does not require the existence of a mesa-optimizer, and which does indicate we should be more concerned about causal confusion.

Chris_Leong (8mo)

I really liked the analogy of taking actions, falling asleep then waking up (possibly with some modifications) and continuing.

I was already aware of your main point, but the way you've described it is a much clearer way of thinking about this.

Chris van Merwijk (3y)

The core point in this post is obviously correct, and yes, people's thinking is muddled if they don't take this into account. This point is core to the "Risks from Learned Optimization" paper (so it's not exactly new, but it's good if it's explained in different/better ways).

AI ALIGNMENT FORUM

How Vanilla Reinforcement Learning Works

Why Does This Matter?

Rewriting the Threat Model

One Final Exercise For the Reader