I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.
In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure). It’s almost certainly not reward in future episodes, which seems to be the basis for some concerns around “situationally aware” agents acting nicely during training so they can trick us and get to act evil after training, when they’re more powerful.
One caveat with the selection metaphor though: it can be misleading in its own way. Taken naively, it implies that we’re selecting uniformly at random from all possible initializations which would get very small loss on the training set. In fact, gradient descent will prefer points at the bottom of large attractor basins of somewhat small loss, not just points which have very small loss in isolation. This is even before taking into account the nonstationarity of the training data in a typical reinforcement learning setting, where the sampled trajectories change over time as the agent itself changes.
One way this distinction can matter: if two policies get equally good reward, but one is “riskier” in that a slightly less competent version of it gets extremely poor reward, then the riskier one is less likely to be selected for.
This might actually suggest a strategy for training out deception: do it early and intensely, before the model becomes competent at deceiving. Punish detectable deception (when e.g. interpretability tools can reveal it) much more than honest mistakes, with the hope of knocking the model out of any attractor basin for very deceptive behavior early on, while we can still clearly see it, rather than later, when its deceptions have gotten good enough that we have trouble detecting them. (This assumes that there is an “honesty” attractor basin, i.e. that low-competence honesty generalizes naturally, remaining honest as models become more competent. If not, that fact might itself be apparent for multiple increments of competence before the model gets good enough to frequently trick us, or becomes situationally aware enough to act as if it were honest because it knows it’s not yet good enough to trick us.)
More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to only fine-tune fully pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.
[ETA: Just want to clarify that the last two paragraphs are pretty speculative and possibly wrong or overstated! I was mostly thinking out loud. Definitely would like to hear good critiques of this.
Also changed a few words around for clarity.]
For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach?
[Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.
Note that their improvement over Strassen on 4x4 matrices is for finite fields only, i.e. modular arithmetic, not what most neural networks use.
[Edit Jan 19, 2023: I no longer think the below is accurate. My argument rests on an unstated assumption: that when weight decay kicks in, the counter-pressure against it is stronger for the 101st weight (the “bias”/generalizer) than for the other weights (the “memorizers”), since the gradient is stronger in that direction. In fact, this mostly isn’t true, for the same reason Adam(W) moved towards the solution to begin with before weight decay strongly kicked in: each dimension of the gradient is normalized relative to its typical magnitudes in the past. Hence the counter-pressure is the same on all coordinates.
A caveat: this assumes we're doing full batch AdamW, as opposed to randomized minibatches. In the latter case, the increased noise in the "memorizer" weights will in fact cause Adam to be less confident about those weights, and thus assign less magnitude to them. But this happens essentially right from the start, so it doesn't really explain grokking.
Here's an example of this, taking randomized minibatches of size 10 (out of 100 total) on each step, optimizing with AdamW (learning rate = 0.001, weight decay = 0.01). I show the first three "memorizer" weights (out of 100 total) plus the bias:
As you can see, it does place less magnitude on the memorizers due to the increased noise, but this happens right from the get-go; it never "groks".
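For anyone who wants to poke at this, here’s a minimal reconstruction of the experiment as I understand it. The dataset construction is my own assumption (not stated above): each sample activates one “memorizer” weight plus the shared 101st weight, with squared-error loss; AdamW is hand-rolled so the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100  # training samples; one "memorizer" weight per sample, plus a shared 101st weight
X = np.hstack([np.eye(n), np.ones((n, 1))])  # assumed design: sample i hits weight i + the shared weight
y = np.ones(n)

def adamw_step(w, g, m, v, t, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One step of AdamW (decoupled weight decay)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    w = w - lr * (mhat / (np.sqrt(vhat) + eps) + wd * w)
    return w, m, v

w = np.zeros(n + 1)
m, v = np.zeros(n + 1), np.zeros(n + 1)
for t in range(1, 20001):
    batch = rng.choice(n, size=10, replace=False)  # minibatches of 10 out of 100, as above
    g = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)  # gradient of mean squared error
    w, m, v = adamw_step(w, g, m, v, t)

print("mean memorizer weight:", w[:n].mean())  # noisier gradients -> less magnitude, from the start
print("shared weight:        ", w[n])
```

The point is just that the down-weighting of the noisy memorizer coordinates shows up essentially immediately, rather than emerging as a late phase change.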
If we do full batch AdamW, then the bias is treated indistinguishably:
For small weight decay settings and zero gradient noise, AdamW is doing something like finding the minimum-norm solution, but in $\ell^\infty$ space, not $\ell^2$ space.]
Here's a straightforward argument that phase changes are an artifact of AdamW and won't be seen with SGD (or SGD with momentum).
Suppose we have 101 weights all initialized to 0 in a linear model, and two possible ways to fit the training data:
The first is $w_A = (1, 1, \ldots, 1, 0)$. (It sets the first 100 weights to 1, and the last one to 0.)
The second is $w_B = (0, 0, \ldots, 0, 1)$. (It sets the first 100 weights to 0, and the last one to 1.)
(Any combination $\lambda w_A + (1 - \lambda) w_B$ will also fit the training data.)
Intuitively, the first solution memorizes the training data: we can imagine that each of the first 100 weights corresponds to storing the value of one of the 100 samples in our training set. The second solution is a simple, generalized algorithm for solving all instances from the underlying data distribution, whether in the training set or not.
$w_A$ has an $\ell^2$ norm which is ten times as large as $w_B$’s. SGD, since it follows the gradient directly, will mostly move directly toward $w_B$, as that’s the direction of steepest descent. It will ultimately converge on the minimum-norm solution, which is close to $w_B$. (Momentum won’t change this picture much, since it’s just smearing out each SGD step over multiple updates, and each individual SGD step goes in the direction of steepest descent.)
AdamW, on the other hand, is basically the same as Adam at first, since weight decay doesn’t do much when the weights are small. Since Adam is a scale-invariant, coordinate-wise adaptive learning rate algorithm, it will move at the same speed for each of the 101 coordinates in the direction which reduces loss, moving towards the solution $\tfrac{1}{2}(w_A + w_B)$, i.e. with heavy weight on the memorization solution. Weight decay will start to kick in a bit before this point, and over time AdamW will converge to (close to) the same minimum-norm solution as SGD. This is the phase transition from memorization to generalization.
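The SGD half of this argument is easy to check numerically. Below is a toy instantiation of the 101-weight setup (the concrete design is my own construction for illustration: each sample activates one memorizer weight plus the shared generalizer weight, squared-error loss). Full-batch gradient descent from zero lands on the minimum-norm interpolation, i.e. almost exactly $w_B$.

```python
import numpy as np

n = 100
# Assumed concrete setup (my own, for illustration): sample i activates
# memorizer weight i and the shared 101st "generalizer" weight; target is 1.
X = np.hstack([np.eye(n), np.ones((n, 1))])
y = np.ones(n)

w = np.zeros(n + 1)
lr = 0.3
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / n  # full-batch gradient of mean squared error
    w -= lr * grad

print("typical memorizer weight:", w[0])  # ~ 1/101
print("generalizer weight:      ", w[n])  # ~ 100/101
```

Gradient descent initialized at zero converges to the minimum-$\ell^2$-norm interpolant here (the same thing `np.linalg.pinv(X) @ y` computes), which puts almost all the mass on the generalizer coordinate.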
Ever since the discovery that the mammalian dopamine system implements temporal difference learning of reward prediction error, a longstanding question for those seeking a satisfying computational account of subjective experience has been: what is the relationship between happiness and reward (or reward prediction error)? Are they the same thing?
Or if not, is there some other natural correspondence between our intuitive notion of “being happy” and some identifiable computational entity in a reinforcement learning agent?
A simple reflection shows that happiness is not identical to reward prediction error: If I’m on a long, tiring journey of predictable duration, I still find relief at the moment I reach my destination. This is true even for journeys I’ve taken many times before, so that there can be little question that my unconscious has had opportunity to learn the predicted arrival time, and this isn’t just a matter of my conscious predictions getting ahead of my unconscious ones.
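To make the comparison concrete, here’s a toy TD(0) sketch (the numbers are mine, purely illustrative): once the value function for a fixed, deterministic journey has been learned, the reward prediction error on the final, fully anticipated transition goes to essentially zero, even though the arrival still feels like relief.

```python
import numpy as np

# A deterministic "journey": states 0..9, terminal state 10, reward 1 on arrival.
T, gamma, alpha = 10, 1.0, 0.1
V = np.zeros(T + 1)  # value estimates; V[T] is terminal and stays 0

for _ in range(1000):  # many repetitions of the same journey
    for s in range(T):
        r = 1.0 if s == T - 1 else 0.0       # reward only on the arrival transition
        delta = r + gamma * V[s + 1] - V[s]  # TD error = reward prediction error
        V[s] += alpha * delta

# After learning, the RPE at the (fully predicted) moment of arrival is ~0:
final_delta = 1.0 + gamma * V[T] - V[T - 1]
print(round(final_delta, 4))  # → 0.0
```

So if the felt relief tracked reward prediction error alone, a well-learned arrival should feel like nothing, which is exactly the tension described above.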
On the other hand, I also gain happiness from learning, well before I arrive, that traffic on my route has dissipated. So there does seem to be some amount of satisfaction gained just from learning new information, even prior to “cashing it in”. Hence, happiness is not identical to simple reward either.
Perhaps shard theory can offer a straightforward answer here: happiness (respectively suffering) is when a realized feature of the agent’s world model corresponds to something that a shard which is currently active values (respectively devalues).
If this is correct, then happiness, like value, is not a primitive concept like reward (or reward prediction error), but instead relies on at least having a proto-world model.
It also explains the experience some have had, achieved through the use of meditation or other deliberate effort, of bodily pain without attendant suffering. They are presumably finding ways to activate shards that simply do not place negative value on pain.
Finally: happiness is then not a unidimensional, inter-comparable thing, but instead each type is to an extent sui generis. This comports with my intuition: I have no real scale on which I can weigh the pleasure of an orgasm against the delight of mathematical discovery.