TL;DR: The dynamics of human learning processes and reward circuitry are more relevant than evolution for understanding how inner values arise from outer optimization criteria.
This post is related to Steve Byrnes’ Against evolution as an analogy for how humans will create AGI, but more narrowly focused on how we should make inferences about values.
Thanks to Alex Turner, Charles Foster, and Logan Riggs for their feedback on a draft of this post.
How should we expect AGI development to play out?
True precognition appears impossible, so we use various analogies to AGI development, such as evolution, current day humans, or current day machine learning. Such analogies are far from perfect, but we still may be able to extract useful information by carefully examining them.
In particular, we want to understand how inner values relate to the outer optimization criteria. Human evolution is one possible source of data on this question. In this post, I’ll argue that human evolution actually provides very little usable evidence on AGI outcomes. In contrast, analogies to the human learning process are much more fruitful.
One way people motivate extreme levels of concern about inner misalignment is to reference the fact that evolution failed to align humans to the objective of maximizing inclusive genetic fitness. From Eliezer Yudkowsky’s AGI Ruin post:
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about…
I don't think that "evolution -> human values" is the most useful reference class when trying to understand how outer optimization criteria relate to inner values. Evolution didn't directly optimize over our values. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's values. So, there are really (at least) two classes of observations from which we can draw evidence:
I will present five reasons why I think evidence from (2) “human learning -> human values” is more relevant to predicting AGI.
The relationship we want to make inferences about is:
I think that "AI learning -> AI values" is much more similar to "human learning -> human values" than it is to "evolution -> human values". Steve Byrnes makes this case in much more detail in his post on the matter. Two of the ways I think AI learning more closely resembles human learning, and not evolution, are:
"AI learning -> AI values", "human learning -> human values", and “evolution -> human values” each represent very different optimization processes, with many specific dissimilarities between any pair of them. However, I think the balance of dissimilarities points to "human learning -> human values" being the closer reference class for "AI learning -> AI values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" compared to from "evolution -> human values". There are billions of instances of humans, and each of them presumably have somewhat different learning processes / reward circuit configurations / learning environments. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
One common objection is that “human learning” represents a tiny region in the space of all possible mind designs, and so we cannot easily generalize our observations of humans to minds in general. This is, of course, true, and it greatly limits the strength of any AI-related conclusions we can draw from looking at "human learning -> human values". However, I again hold that inferences from "evolution -> human values" suffer from an even more extreme version of this same issue. "Evolution -> human values" represent an even more restricted look at the general space of optimization processes than we get from the observed variations in different humans' learning processes, reward circuit configurations, and learning environments.
Human evolution happened hundreds of thousands of years ago. We are deeply uncertain about the details of the human ancestral environment and which traits were under what selection pressure. We are still unsure about what precise selection pressure led humans to be so generally intelligent at all. We are very far away from being able to precisely quantify all the potentially values-related selection pressures in the ancestral environment, or how those selection pressures changed our reward systems or our tendencies to form downstream values.
In contrast, human within lifetime learning happens all the time right now. It’s available for analysis and even experimental intervention. Given two evidence sources about a given phenomenon, where one evidence source is much more easily accessible than the other, then all else equal, the more accessible evidence source should represent a greater fraction of our total information on the phenomenon. This is another reason why we should expect evidence from humans to account for a greater proportion of our total information about how inner values relate to outer optimization criteria.
I think that a careful account of how evolution shaped our learning process in the ancestral environment implies that evolution had next to no chance of aligning humans with inclusive genetic fitness.
There are no features of the ancestral environment which would lead to an ancestral human learning about the abstract idea of inclusive genetic fitness. There were no ancestral humans that held an explicit representation of inclusive genetic fitness. So, there was never an opportunity for evolution to select for humans who attached their values to an explicit representation of inclusive genetic fitness.
Regardless of how difficult it is, in general, to get learning systems to form values around different abstract concepts, evolution could not have possibly gotten us to form a value around the particular abstraction of inclusive genetic fitness because we didn’t form such an abstraction in the ancestral environment. Ancestral humans had zero variance in their tendency to form values around inclusive genetic fitness. Evolution cannot select for traits that don’t vary across a population, so evolution could not have selected for humans that formed their values around inclusive genetic fitness.
In contrast, the sorts of things that we humans end up valuing are usually the sorts of things that are easy to form abstractions around. Thus, we are not doomed by the same difficulty that likely prevented evolution from aligning humans to inclusive genetic fitness.
This point is extremely important. I want to make sure to convey it correctly, so I will quote two previous expressions of this point by other sources:
Risks from Learned Optimization notes that the lack of environmental data related to inclusive genetic fitness effectively increases the description length complexity of specifying an intelligence that deliberately optimizes for inclusive genetic fitness:
…description cost is especially high if the learned algorithm’s input data does not contain easy-to-infer information about how to optimize for the base objective. Biological evolution seems to differ from machine learning in this sense, since evolution’s specification of the brain has to go through the information funnel of DNA. The sensory data that early humans received didn’t allow them to infer the existence of DNA, nor the relationship between their actions and their genetic fitness. Therefore, for humans to have been aligned with evolution would have required them to have an innately specified model of DNA, as well as the various factors influencing their inclusive genetic fitness. Such a model would not have been able to make use of environmental information for compression, and thus would have required a greater description length. In contrast, our models of food, pain, etc. can be very short since they are directly related to our input data.
From Alex Turner (in private communication):
If values form because reward sends reinforcement flowing back through a person's cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-minimization process and thus are easy to form values around. But if there aren't local mutations which make a person more probable to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can't instill this value. Even if the descendents of that person will later be able to think thoughts about fitness.
There are many sources of empirical evidence that can inform our intuitions regarding how inner goals relate to outer optimization criteria. My current (not very deeply considered) estimate of how to weight these evidence sources is roughly:
Edit: since writing this post, I've learned a lot more about inductive biases and what deep learning theory we currently have, so my relative weightings have shifted quite a lot towards "current results in machine learning".
I think that using "human learning -> human values" as our reference class for inner goals versus outer optimization criteria suggests a much more straightforward relationship between the two, as compared to the (lack of a) relationship suggested by "evolution -> human values". Looking at the learning trajectories of individual humans, it seems like a given person's values have a great deal in common with the sorts of experiences they've found rewarding in their lives up to that point in time. E.g., a person who grew up with and displayed affection for dogs probably doesn't want a future totally devoid of dogs, or one in which dogs suffer greatly.
Please note that I am not arguing that humans are inner aligned, or that looking at humans implies inner alignment is easy. Humans are misaligned with maximizing their outer reward source (activation of reward circuitry). I operationalize this misalignment as: "After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they'd acquired from their learning environment which are misaligned to reward maximization in the new environment".
For example, one way in which humans are inner misaligned is that, if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but has other consequences that are extremely bad by light of the human's preexisting values (e.g., killing a beloved family member), most humans won't push the button.
I also think this regularity in inner values is reasonably robust to large increases in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence. It's probably not as robust to your choice of which specific human to try this with. E.g., many people would screw themselves over with reckless self-modification. My point is that higher capabilities alone do not automatically render inner values completely alien to those demonstrated at lower capabilities.
(Part 2 will address whether the “sharp left turn” demonstrated by human capabilities with respect to evolution implies that we should expect a similar sharp left turn in AI capabilities.)