Thanks for doing this! I think a lot of people would be very interested in the debate transcripts if you posted them on GitHub or something.
In §3.1–3.3, you look at the main known ways that altruism between humans has evolved — direct and indirect reciprocity, as well as kin and group selection — and ask whether we expect such altruism from AI towards humans to be similarly adaptive.
However, as observed in R. Joyce (2007). The Evolution of Morality (p. 5),
Evolutionary psychology does not claim that observable human behavior is adaptive, but rather that it is produced by psychological mechanisms that are adaptations. The output of an adaptation need not be adaptive.
This is a subtle distinction which demands careful inspection.
In particular, are there circumstances under which AI training procedures and/or market or evolutionary incentives may produce psychological mechanisms which lead to altruistic behavior towards human beings, even when that altruistic behavior is not adaptive? For example, could altruism learned towards human beings early on, when humans have something to offer in return, be “sticky” later on (perhaps via durable, self-perpetuating power structures), when humans have nothing useful to offer? Or could learned altruism towards other AIs be broadly-scoped enough that it applies to humans as well, just as human altruistic tendencies sometimes apply to animal species which can offer us no plausible reciprocal gain? This latter case is analogous to the situation analyzed in your paper, and yet somehow a different result has (sometimes) occurred in reality than that predicted by your analysis.
I don’t claim the conclusion is wrong, but I think a closer look at this subtlety would give the arguments for it more force.
Although you don’t look at network reciprocity / spatial selection.
Even factory farming, which might seem like a counterexample, is not really. For the very existence of humans altruistically motivated to eliminate it — and who have a real shot at success — demands explanation under your analysis.
A similar point is (briefly) made in K. E. Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence, §18 “Reinforcement learning systems are not equivalent to reward-seeking agents”:
Reward-seeking reinforcement-learning agents can in some instances serve as models of utility-maximizing, self-modifying agents, but in current practice, RL systems are typically distinct from the agents they produce … In multi-task RL systems, for example, RL “rewards” serve not as sources of value to agents, but as signals that guide training[.]
And an additional point which calls into question the view of RL-produced agents as the product of one big training run (whose reward specification we better get right on the first try), as opposed to the product of an R&D feedback loop with reward as one non-static component:
RL systems per se are not reward-seekers (instead, they provide rewards), but are instead running instances of algorithms that can be seen as evolving in competition with others, with implementations subject to variation and selection by developers. Thus, in current RL practice, developers, RL systems, and agents have distinct purposes and roles.…RL algorithms have improved over time, not in response to RL rewards, but through research and development. If we adopt an agent-like perspective, RL algorithms can be viewed as competing in an evolutionary process where success or failure (being retained, modified, discarded, or published) depends on developers’ approval (not “reward”), which will consider not only current performance, but also assessed novelty and promise.
RL systems per se are not reward-seekers (instead, they provide rewards), but are instead running instances of algorithms that can be seen as evolving in competition with others, with implementations subject to variation and selection by developers. Thus, in current RL practice, developers, RL systems, and agents have distinct purposes and roles.
RL algorithms have improved over time, not in response to RL rewards, but through research and development. If we adopt an agent-like perspective, RL algorithms can be viewed as competing in an evolutionary process where success or failure (being retained, modified, discarded, or published) depends on developers’ approval (not “reward”), which will consider not only current performance, but also assessed novelty and promise.
Ever since the discovery that the mammalian dopamine system implements temporal difference learning of reward prediction error, a longstanding question for those seeking a satisfying computational account of subjective experience has been: what is the relationship between happiness and reward (or reward prediction error)? Are they the same thing?
Or if not, is there some other natural correspondence between our intuitive notion of “being happy” and some identifiable computational entity in a reinforcement learning agent?
A simple reflection shows that happiness is not identical to reward prediction error: If I’m on a long, tiring journey of predictable duration, I still find relief at the moment I reach my destination. This is true even for journeys I’ve taken many times before, so that there can be little question that my unconscious has had opportunity to learn the predicted arrival time, and this isn’t just a matter of my conscious predictions getting ahead of my unconscious ones.
On the other hand, I also gain happiness from learning, well before I arrive, that traffic on my route has dissipated. So there does seem to be some amount of satisfaction gained just from learning new information, even prior to “cashing it in”. Hence, happiness is not identical to simple reward either.
Perhaps shard theory can offer a straightforward answer here: happiness (respectively suffering) is when a realized feature of the agent’s world model corresponds to something that a shard which is currently active values (respectively devalues).
If this is correct, then happiness, like value, is not a primitive concept like reward (or reward prediction error), but instead relies on at least having a proto-world model.
It also explains the experience some have had, achieved through the use of meditation or other deliberate effort, of bodily pain without attendant suffering. They are presumably finding ways to activate shards that simply do not place negative value on pain.
Finally: happiness is then not a unidimensional, inter-comparable thing, but instead each type is to an extent sui generis. This comports with my intuition: I have no real scale on which I can weigh the pleasure of an orgasm against the delight of mathematical discovery.
I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.
In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure), and it’s probably not reward in future episodes (which seems to be the basis for some concerns around “situationally aware” agents acting nicely during training so they can trick us and get to act evil after training when they’re more powerful).
One caveat with the selection metaphor though: it can be misleading in its own way. Taken naively, it implies something like that we’re selecting uniformly from all possible random initializations which would get very small loss on the training set. In fact, gradient descent will prefer points at the bottom of large attractor basins of somewhat small loss, not just points which have very small loss in isolation. This is even before taking into account the nonstationarity of the training data in a typical reinforcement learning setting, due to the sampled trajectories changing over time as the agent itself changes.
One way this distinction can matter: if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.
This might actually suggest a strategy for training out deception: do it early and intensely, before the model becomes competent at it, punishing detectable deception (when e.g. interpretability tools can reveal it) much more than honest mistakes, with the hope of knocking the model out of any attractor basin for very deceptive behavior early on, when we can clearly see it, rather than later on, when its deceptions have gotten good enough that we have trouble detecting them. (This assumes that there is an “honesty” attractor basin, i.e. that low-competence versions of honesty generalize naturally, remaining honest as models become more competent. If not, then this fact might itself be apparent for multiple increments of competence prior to the model getting good enough to frequently trick us, or even being situationally aware enough that it acts as if it were honest because it knows it’s not good enough to trick us.)
More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to only fine-tune fully pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.
[ETA: Just want to clarify that the last two paragraphs are pretty speculative and possibly wrong or overstated! I was mostly thinking out loud. Definitely would like to hear good critiques of this.
Also changed a few words around for clarity.]
For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach?
[Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
There was a recent Twitter thread about this. See here and here.
Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.
Note that their improvement over Strassen on 4x4 matrices is for finite fields only, i.e. modular arithmetic, not what most neural networks use.
[Edit Jan 19, 2023: I no longer think the below is accurate. My argument rests on an unstated assumption: that when weight decay kicks in, the counter-pressure against it is stronger for the 101th weight (the "bias/generalizer") than the other weights (the "memorizers") since the gradient is stronger in that direction. In fact, this mostly isn't true, for the same reason Adam(W) moved towards the 12M+12G solution to begin with before weight decay strongly kicked in: each dimension of the gradient is normalized relative to its typical magnitudes in the past. Hence the counter-pressure is the same on all coordinates.
A caveat: this assumes we're doing full batch AdamW, as opposed to randomized minibatches. In the latter case, the increased noise in the "memorizer" weights will in fact cause Adam to be less confident about those weights, and thus assign less magnitude to them. But this happens essentially right from the start, so it doesn't really explain grokking.
Here's an example of this, taking randomized minibatches of size 10 (out of 100 total) on each step, optimizing with AdamW (learning rate = 0.001, weight decay = 0.01). I show the first three "memorizer" weights (out of 100 total) plus the bias:
As you can see, it does place less magnitude on the memorizers due to the increased noise, but this happens right from the get-go; it never "groks".
If we do full batch AdamW, then the bias is treated indistinguishably:
For small weight decay settings and zero gradient noise, AdamW is doing something like finding the minimum-norm solution, but in L∞ space, not L2 space.]
Here's a straightforward argument that phase changes are an artifact of AdamW and won't be seen with SGD (or SGD with momentum).
Suppose we have 101 weights all initialized to 0 in a linear model, and two possible ways to fit the training data:
The first is M=[1,1,…,1,1,0]. (It sets the first 100 weights to 1, and the last one to 0.)
The second is G=[0,0,…,0,0,1]. (It sets the first 100 weights to 0, and the last one to 1.)
(Any combination aM+bG with a+b=1 will also fit the training data.)
Intuitively, the first solution M memorizes the training data: we can imagine that each of the first 100 weights corresponds to storing the value of one of the 100 samples in our training set. The second solution G is a simple, generalized algorithm for solving all instances from the underlying data distribution, whether in the training set or not.
M has an L2 norm which is ten times as large as G. SGD, since it follows the gradient directly, will mostly move directly toward G as it's the direction of steeper descent. It will ultimately converge on the minimum norm solution 1101M+100101G. (Momentum won't change this picture much, since it's just smearing out each SGD step over multiple updates, and each individual SGD step goes in the direction of steepest descent.)
AdamW, on the other hand, is basically the same as Adam at first, since L2 weight decay doesn't do much when the weights are small. Since Adam is a scale-invariant, coordinate-wise adaptive learning rate algorithm, it will move at the same speed for each of the 101 coordinates in the direction which reduces loss, moving towards the solution 12M+12G, i.e. with heavy weight on the memorization solution. Weight decay will start to kick in a bit before this point, and over time AdamW will converge to (close to) the same minimum-norm solution 1101M+100101G as SGD. This is the phase transition from memorization to generalization.