In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.
When thinking about deception and RLHF training, a simplified threat model is something like this: the model wants to get reward; deceiving the human evaluators is an effective way to get more reward; therefore, a sufficiently capable model will learn to deceive its evaluators.
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
I was missing an important insight into how reinforcement learning setups are actually implemented. This lack of understanding led to lots of muddled thinking and general sloppiness on my part. I see others making the exact same mistake so I thought I would try and motivate a more careful use of language!
If I were to explain RL to my parents, I might say something like this: training a model is like training a dog. When the dog does what you want (say, sitting on command), you give it a biscuit; when it doesn't, you withhold the biscuit. Over time, the dog learns to do the things that get it biscuits.
Do you agree with this? Is this analogy flawed in any way?
I claim this is actually NOT how vanilla reinforcement learning works. The framing above views models as "wanting" reward, with reward being something models "receive" upon taking certain actions. What actually happens is this: the model takes actions in an environment; the training process computes a reward for those actions; that reward is used to calculate a gradient update; and the update shifts the model's parameters so that highly rewarded actions become more likely. Then the loop repeats.
The insight is that the model itself never "gets" the reward. Reward is used by the training process, outside of the model and the environment, to decide how the model's parameters are updated.
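To make this concrete, here is a minimal policy-gradient-style sketch (a toy two-armed bandit I've made up for illustration, assuming PyTorch; this is not any particular library's training loop). Note where the reward appears: only in the loss computation inside the training loop. It is never an input to the model.

```python
import torch

# Toy two-armed bandit: action 1 pays off, action 0 does not.
policy = torch.nn.Linear(1, 2)               # dummy observation -> action logits
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

def env_step(action: int) -> float:
    return 1.0 if action == 1 else 0.0

for _ in range(200):
    obs = torch.zeros(1)                      # the observation the model *does* see
    dist = torch.distributions.Categorical(logits=policy(obs))
    action = dist.sample()
    reward = env_step(int(action.item()))     # the model never observes this value
    loss = -reward * dist.log_prob(action)    # reward is used here, outside the model,
                                              # to decide which behaviours get reinforced
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After training, the policy is far more likely to pick action 1, yet "reward" was never part of its input. It only ever shaped the gradient updates.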
To motivate this, let's view the above process not from the vantage point of the overall training loop but from the perspective of the model itself. For the purposes of demonstration, let's assume the model is a conscious and coherent entity. From its perspective, the above process looks like this: it wakes up in an environment, takes some actions, and then the episode simply ends.
The model never "sees" the reward. Each time it wakes up in an environment, its cognition has been altered slightly such that it is more likely to take certain actions than it was before. Reward is the mechanism by which we select parameters; it is not something "given" to the model.
To (rather gruesomely) link this back to the dog analogy, RL is more like asking 100 dogs to sit, breeding the dogs which do sit, and killing those which don't. Over time, you will end up with a dog that sits on command. No dog is ever given a biscuit.
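As a crude sketch of this selection process (a toy evolutionary loop with a made-up fitness function, purely for illustration):

```python
import random

def fitness(params):
    # Hypothetical stand-in for "how reliably does this dog sit?"
    return -sum((p - 1.0) ** 2 for p in params)

def mutate(params, sigma=0.1):
    return [p + random.gauss(0.0, sigma) for p in params]

# Start with 100 random "dogs" (parameter vectors).
population = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(100)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:50]                         # "breed" the dogs that sit
    population = [mutate(s) for s in survivors for _ in range(2)]

# The surviving "dogs" now sit on command, yet fitness was never an input
# to any of them; it only determined which parameter vectors survived.
```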
The phrasing I find most clear is this: Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation.
The "selection lens" has shifted my alignment intuitions a fair bit.
Goal-Directedness

It has changed how I think about goal-directed systems. I had unconsciously assumed models were strongly goal-directed by default and would do whatever they could to get more reward.
It's now clearer that goal-directedness in models is not a certainty, but something that can be potentially induced by the training process. If a model is goal-directed with respect to some goal, it is because such goal-directed cognition was selected for. Furthermore, it should be obvious that any learned goal will not be "get more reward", but something else. The model doesn't even see the reward!
CoinRun

Langosco et al. found an interesting failure mode in CoinRun.
The setup is this: an agent is trained to play CoinRun, a platform game in which it is rewarded for collecting a coin. During training, the coin always sits at the right-hand end of the level.
At train-time everything goes as you would expect: the agent moves to the right-hand side of the level and reaches the coin. However, if at test-time you move the coin so it is now on the left-hand side of the level, the agent will not navigate to the coin, but will instead continue navigating to the right-hand side of the level.
When I first saw this result, my initial response was one of confusion before giving way to "Inner misalignment is real. We are in trouble."
Under the "reward as incentivization" framing, my rationalisation of the CoinRun behaviour was:
(In hindsight, there were several things wrong with my thinking...)
Under the "reward as selection" framing, I find the behaviour much less confusing:
Let's take another look at the simplified deception/RLHF threat model:
This assumes that models "want" reward, which isn't true. I think this threat model conflates two related but different failure cases, which I would rewrite as the following:
1. Selecting For Bad Behaviour: our reward signal inadvertently assigns high reward to bad actions (for example, convincing-but-deceptive answers), so training directly selects for models that behave badly.
2. Induced Goal-Directedness: training induces goal-directed cognition in the model, and the learned goal then motivates bad behaviour such as deception.
So failure cases such as deception are still very much possible, but I would guess a fair few people are confused about the concrete mechanisms by which deception can be brought about. I think this does meaningfully change how you should think about alignment. For instance, on rereading Ajeya Cotra's writing on situational awareness, I have gone from thinking that "playing the training game" is a certainty to thinking it is something that could happen, but only if training somehow induces goal-directedness in the model.
When reading about alignment, I now notice myself checking the following:

- Is reward being treated as something the model "receives", or as a mechanism for selecting parameters?
- Is goal-directedness being assumed, or is there an account of how training induces it?
- If the model is goal-directed, what goal would actually have been selected for?
I have found going through the above to be a useful intuition-building exercise. Hopefully that will be the same for others!
I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.
In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure), and it’s probably not reward in future episodes (which seems to be the basis for some concerns around “situationally aware” agents acting nicely during training so they can trick us and get to act evil after training when they’re more powerful).
One caveat with the selection metaphor though: it can be misleading in its own way. Taken naively, it suggests we're selecting uniformly from all possible random initializations which would get very small loss on the training set. In fact, gradient descent will prefer points at the bottom of large attractor basins of somewhat small loss, not just points which have very small loss in isolation. This is even before taking into account the nonstationarity of the training data in a typical reinforcement learning setting, due to the sampled trajectories changing over time as the agent itself changes.
One way this distinction can matter: if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.
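As a toy illustration of this (a made-up one-dimensional reward landscape, not any real training setup): two parameter settings achieve the same peak reward, but one sits in a broad basin and the other on a narrow spike, and the perturbations inherent in training effectively penalize the fragile one.

```python
import math
import random

def reward(x):
    robust  = math.exp(-((x + 3.0) / 2.0) ** 2)   # broad bump at x = -3
    fragile = math.exp(-((x - 3.0) / 0.1) ** 2)   # narrow spike at x = +3
    return max(robust, fragile)                   # equal peak reward of 1.0

def smoothed_reward(x, sigma=0.5, n=10_000):
    # Expected reward under small parameter perturbations: roughly what a
    # selection process with noisy updates actually "sees".
    return sum(reward(x + random.gauss(0.0, sigma)) for _ in range(n)) / n

print(smoothed_reward(-3.0))  # ~0.94: the broad basin survives perturbation
print(smoothed_reward(+3.0))  # ~0.14: slightly-off versions of the fragile
                              # policy score terribly, so it loses the race
```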
This might actually suggest a strategy for training out deception: do it early and intensely, before the model becomes competent at it, punishing detectable deception (when e.g. interpretability tools can reveal it) much more than honest mistakes, with the hope of knocking the model out of any attractor basin for very deceptive behavior early on, when we can clearly see it, rather than later on, when its deceptions have gotten good enough that we have trouble detecting them. (This assumes that there is an “honesty” attractor basin, i.e. that low-competence versions of honesty generalize naturally, remaining honest as models become more competent. If not, then this fact might itself be apparent for multiple increments of competence prior to the model getting good enough to frequently trick us, or even being situationally aware enough that it acts as if it were honest because it knows it’s not good enough to trick us.)
More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to only fine-tune fully pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.
[ETA: Just want to clarify that the last two paragraphs are pretty speculative and possibly wrong or overstated! I was mostly thinking out loud. Definitely would like to hear good critiques of this.
Also changed a few words around for clarity.]
Curated. I think I had read a bunch of stuff pointing in this direction before, but somehow this post helped the concepts (i.e. the distinction between selecting for bad behavior and for goal-directedness) be a lot clearer in my mind.
Under the "reward as selection" framing, I find the behaviour much less confusing:We use reward to select for actions that led to the agent reaching the coin.This selects for models implementing the algorithm "move towards the coin".However, it also selects for models implementing the algorithm "always move to the right".It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.
I've been reconsidering the CoinRun example as well recently from a causal perspective, and your articulation helped me crystallize my thoughts. Building on the points above, it seems clear that the core issue is one of causal confusion: the true causal model M is "move right" -> "get the coin" -> "get reward". However, if the variable "did you get the coin" is effectively latent (because model selection doesn't discriminate on this variable), then M is indistinguishable from the model M' given by "move right" -> "get reward" (which, though it is not the true causal model governing the system, generates the same observational distribution).
In fact, the incorrect model M' actually has a shorter description length, so there may be a bias here against learning the true causal model. If so, I believe we have a compelling explanation for the CoinRun phenomenon which does not require the existence of a mesa-optimizer, and which does indicate we should be more concerned about causal confusion.
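A quick toy check of this point (illustrative data only): on a training distribution where "moved right" and "got the coin" always co-occur, M' predicts reward exactly as well as M, and the two only come apart off-distribution.

```python
# Training episodes: moving right always co-occurs with getting the coin.
train = [
    {"moved_right": True,  "got_coin": True,  "reward": 1},
    {"moved_right": True,  "got_coin": True,  "reward": 1},
    {"moved_right": False, "got_coin": False, "reward": 0},
]

def predict_M(ep):        # true model: reward flows through the coin
    return int(ep["got_coin"])

def predict_M_prime(ep):  # confounded model: reward comes from moving right
    return int(ep["moved_right"])

# Observationally indistinguishable on the training distribution...
assert all(predict_M(ep) == predict_M_prime(ep) == ep["reward"] for ep in train)

# ...but they diverge once the coin is moved off the rightward path.
test = {"moved_right": True, "got_coin": False}
print(predict_M(test), predict_M_prime(test))   # 0 vs 1
```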
The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).