Kaj Sotala


Where I agree and disagree with Eliezer

Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior.

I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the number of their surviving offspring, such as by everyone wanting to donate to sperm/egg banks".

But evolution does not seem to maximize fitness in that sense, where the fitness of a species would be a distinct thing-in-the-world that could be directly observed and optimized for. Something like "docileness" or "size", as used in animal breeding, would be a much better analogy, since those things are something that you can directly observe and optimize for - and human breeders do.

And... if humans had been explicitly bred for friendliness and corrigibility for a while, it seems to me that they likely would want to do the analogous thing of maximizing-their-donations-to-sperm/egg-banks. After all, we can already see that people who are high on either end of some personality trait such as altruism/selfishness, dominance/submission, openness/conservatism, etc., are likely to view that trait as a virtue (as long as nothing in the environment too overwhelmingly disproves this) and seek to become even more like that.

Altruistic people often want to become even more altruistic, selfish people to eliminate their altruistic "weaknesses", dominant people to become more dominant, submissive people to make it easier for themselves to submit (this has some strong counterforces in our culture where submissiveness is generally considered undesirable, but you can still see it valued in e.g. workplace cultures where workers resent reforms that would give them more autonomy, preferring bosses to "just tell them what to do"), open people to become more open to experience, and so on.

Probably if people high on such traits were offered chances to self-modify to become even more so - which seems analogous to the sperm/egg bank thing, since it's the cognitive optimization form of the instinctive thing - quite a few of them would.

A central AI alignment problem: capabilities generalization, and the sharp left turn

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF.

This analogy gets brought out a lot, but has anyone actually spelled it out explicitly? Because it's not clear to me that it holds if you try to explicitly work out the argument. 

In particular, I don't quite understand what it would mean for evolution to optimize the species for fitness, given that fitness is defined as a measure of reproductive success within the species. A genotype has high fitness if it tends to increase in frequency relative to other genotypes in that species.

To be more precise, there is a measure of "absolute fitness" that refers to a specific genotype's success from one generation to the next: if a genotype has 100 individuals in one generation and 80 individuals in the next generation, then it has an absolute fitness of 0.8. But AFAIK evolutionary biology generally focuses on relative fitness - on how well a genotype performs relative to others in the species. If genotype A has an absolute fitness of 1.2 and genotype B has an absolute fitness of 1.5, then genotype B will tend to become more common than A, even though both have fitness > 1. 

Quoting from this Nature Reviews Genetics article:

Although absolute fitness is easy to think about, evolutionary geneticists almost always use a different summary statistic, relative fitness. The relative fitness of a genotype, symbolized w, equals its absolute fitness normalized in some way. In the most common normalization, the absolute fitness of each genotype is divided by the absolute fitness of the fittest genotype 11, such that the fittest genotype has a relative fitness of one. We can also define a selection coefficient, a measure of how much worse the A2 allele is than A1. Mathematically, w2 = 1−s. Just as before, we can calculate various statistics characterizing relative fitness. We can, for instance, find the mean relative fitness (w̄ = pw1 + qw2), as well as the variance in relative fitness. [...]

It is the relative fitness of a genotype that almost always matters in evolutionary genetics. The reason is simple. Natural selection is a differential process: there are winners and losers. It is, therefore, the difference in fitness that typically matters.

Going with our previous example, genotype A would have a relative fitness of 0.8 (1.2 divided by 1.5) and genotype B would have a relative fitness of 1.
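The normalization can be checked with a few lines of Python. This is only a minimal sketch of the "divide by the fittest genotype" convention described in the quote; the function names and the equal genotype frequencies used for the mean are illustrative assumptions, not something taken from the paper.

```python
def relative_fitness(absolute):
    """Divide each genotype's absolute fitness by that of the fittest genotype."""
    fittest = max(absolute.values())
    return {genotype: w / fittest for genotype, w in absolute.items()}


def mean_relative_fitness(frequencies, relative):
    """Frequency-weighted mean of relative fitness (p*w1 + q*w2 for two genotypes)."""
    return sum(frequencies[g] * relative[g] for g in relative)


# Absolute fitnesses from the example above.
absolute = {"A": 1.2, "B": 1.5}
relative = relative_fitness(absolute)

# Assumed for illustration only: both genotypes are equally common.
frequencies = {"A": 0.5, "B": 0.5}
mean_w = mean_relative_fitness(frequencies, relative)

print({g: round(w, 3) for g, w in relative.items()})
print(round(mean_w, 3))
```

Note that B's relative fitness is 1 only because B happens to be the fittest genotype present; if a fitter genotype C appeared, both A's and B's relative fitness would drop without anything about A or B themselves having changed.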

The most natural interpretation of the "fitness of the species" would be as the mean relative fitness of the species:

In the late 1960s and early 1970s, Alan Robertson 24 and George Price 25 independently showed that the amount by which any trait, X, changes from one generation to the next is given by the genetic covariance between the trait and relative fitness. (The relevant covariance here is the “additive genetic covariance,” a statistic that disentangles the additive from dominance and epistatic effects of alleles 26) If a trait strongly covaries with relative fitness, it will change a good deal from one generation to the next; if not, not. This result is now known as the Secondary Theorem of Natural Selection 27, 28.

If the trait, X, is relative fitness itself, the additive genetic covariance between X and fitness collapses into the additive genetic variance in relative fitness, VA (w). Theory allows us to predict, therefore, how much the average relative fitness of a population will change from one generation to the next under selection: it will change by VA (w). Because a variance cannot be negative, the mean relative fitness of a population either increases or does not change under natural selection (the latter possibility could occur if, for instance, the population harbors no genetic variation). This finding, the Fundamental Theorem of Natural Selection, was first derived by Ronald A. Fisher 29 early in the history of evolutionary genetics. Despite the misleading nomenclature, the Fundamental Theorem is clearly a special case of the Secondary Theorem. It is the Secondary Theorem that is more fundamental.
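In symbols, the two quoted results can be restated as follows. This is only a restatement of the passage above, not an independent derivation; writing the per-generation change in a trait's mean as $\Delta\bar{X}$ and the additive genetic covariance as $\mathrm{Cov}_A$ is an assumed shorthand rather than the paper's own notation.

```latex
\[
\Delta \bar{X} = \mathrm{Cov}_A(X, w)
\qquad \text{(Secondary Theorem)}
\]
\[
\Delta \bar{w} = \mathrm{Cov}_A(w, w) = V_A(w) \ge 0
\qquad \text{(Fundamental Theorem: the special case } X = w\text{)}
\]
```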

However, it seems to me that, because the mean relative fitness is defined by reference to the genotype with the highest fitness within the species, the definition of the mean relative fitness itself changes over time. If the highest-fitness genotype changes over time - because the environment changes (due to changes in the climate, other species, etc.), or because of the emergence of a new trait - then the mean relative fitness of the species also changes. The species might also be spread across different regions, with the same genotype having different fitness in different regions:

A genotype’s fitness might vary spatially. Within a generation, a genotype might enjoy high fitness if it resides in one region but lower fitness if it resides in other regions. In diploids, spatial variation in fitness can, under certain conditions, maintain genetic variation in a population, a form of so-called balancing selection. The conditions required depend on the precise way in which natural selection acts.

In one scenario, different regions, following viability selection, contribute a fixed proportion of adults to a large random-mating population. This scenario involves “soft selection”: selection acts in a way that changes genotype frequencies within a region but that does not affect the number of adults produced by the region. [...]

In another scenario, different regions, following viability selection, contribute variable proportions of adults to a large random-mating population, depending on the genotypes (and thus fitnesses) of individuals within a region. This scenario involves “hard selection”: selection acts in a way that changes genotype frequencies within a region and affects the number of adults produced by the region.


The Fundamental Theorem of Natural Selection implies that the mean relative fitness, w̄, of a population generally increases through time and specifies the amount by which it will increase per small unit of time. This suggests a tempting way to think about natural selection: it is a process that increases mean relative fitness.

While attractive and often powerful, it should be emphasized that— surprisingly— the mean fitness of a population does not always increase under natural selection. Population geneticists have identified a number of scenarios in which selection acts but [mean relative fitness] does not increase. These include frequency dependent selection (wherein the fitness of a genotype depends on its frequency in a population) and, in sexual species, certain forms of epistasis (wherein the fitness of a genotype depends on non-additive effects over multiple loci). Put differently, these findings show that the Fundamental Theorem of Natural Selection does not invariably hold. 

The paper does note that one can adopt alternative definitions of fitness under which the fundamental theorem does hold, but that the "relevant literature is forbidding". The general takeaway that I would draw from this is that fitness is not a clear-cut, "carves reality at the joints" kind of measure that evolution would directly optimize in the sense in which you directly optimize, say, the number of correct classifications that a neural net gets on MNIST.

Rather, it's a theoretical fiction or an abstract measure that can be defined in different ways, and which is defined in different ways in different contexts, depending on what one wants to achieve. But that's a simplifying interpretation imposed on a complex process for the purpose of modeling it, rather than something that the process actually has as an explicit optimization target. So there are ways in which you could view evolution as if it was optimizing for something, but it's not clear to me that it can be said to actually be optimizing for anything in particular - at least not in the sense in which we talk about a machine learning system being optimized for a particular goal.
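To make the frequency-dependent case from the quoted passage concrete, here is a toy sketch. Everything in it is an illustrative assumption rather than something from the paper: two haploid types ("cooperators" and "defectors"), the payoff parameters `benefit` and `cost`, and the choice to track the mean of the unnormalized fitness values rather than fitness rescaled by the fittest type. Defectors are fitter than cooperators at every frequency, so selection favors them at every step, yet the population's mean fitness falls as they spread.

```python
def simulate(p_coop=0.9, benefit=0.5, cost=0.1, generations=50):
    """Track cooperator frequency and mean fitness under frequency-dependent selection.

    Both types gain `benefit * p_coop` from the cooperators around them;
    only cooperators pay `cost`. All parameter values are made up.
    """
    history = []
    for _ in range(generations):
        w_coop = 1.0 + benefit * p_coop - cost
        w_defect = 1.0 + benefit * p_coop
        mean_w = p_coop * w_coop + (1.0 - p_coop) * w_defect
        history.append((p_coop, mean_w))
        # Standard discrete-time selection update for a haploid population.
        p_coop = p_coop * w_coop / mean_w
    return history


history = simulate()
for generation in (0, 10, 25, 49):
    p, mean_w = history[generation]
    print(f"generation {generation}: cooperators={p:.3f}, mean fitness={mean_w:.3f}")
```

Which quantity "goes up" here depends on the bookkeeping: if fitness is instead rescaled so that the currently fittest type always scores 1, the mean of the rescaled values rises toward 1 as defectors take over. That dependence on the chosen definition is the point being made above.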

Confused why a "capabilities research is good for alignment progress" position isn't discussed more

Agreed, but the black-box experimentation seems like it's plausibly a prerequisite for actual understanding? E.g. you couldn't analyze InceptionV1 or CLIP to understand its inner workings before you actually had those models. To use your car engine metaphor from the other comment, we can't open the engine and stick it full of sensors before we actually have the engine. And now that we do have engines, people are starting to stick them full of sensors, even if most of the work is still focused on building even fancier engines.

It seems reasonable to expect that as long as there are low-hanging fruit to be picked using black boxes, we get a lot of black boxes and the occasional paper dedicated to understanding what's going on with them and how they work. Then when it starts getting harder to get novel interesting results with just black box tinkering, the focus will shift to greater theoretical understanding and more thoroughly understanding everything that we've accomplished so far.

2021 AI Alignment Literature Review and Charity Comparison

I would be happy to see you write a top-level post about this paper. :)

Biology-Inspired AGI Timelines: The Trick That Never Works

I had a pretty strong negative reaction to it. I got the feeling that the post derives much of its rhetorical force from setting up an intentionally stupid character who can be condescended to, and that this is used to sneak in a conclusion that would seem much weaker without that device.

DeepMind: Generally capable agents emerge from open-ended play

Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.

Cortés, Pizarro, and Afonso as Precedents for Takeover

However, I don't think this is the whole explanation. The technological advantage of the conquistadors was not overwhelming.

With regard to the Americas at least, I just happened to read this article by a professional military historian, who characterizes the Native American military technology as being "thousands of years behind their Old World agrarian counterparts", which sounds like the advantage was actually rather overwhelming.

There is a massive amount of literature to explain what is sometimes called ‘the Great Divergence‘ (a term I am going to use here as valuable shorthand) between Europe and the rest of the world between 1500 and 1800. Of all of this, most readers are likely only to be familiar with one work, J. Diamond’s Guns, Germs and Steel (1997), which is unfortunate because Diamond’s model of geographic determinism is actually not terribly well regarded in the debate (although, to be fair, it is still better than some of the truly trash nationalistic nonsense that gets produced on this topic). Diamond asks the Great Divergence question with perhaps the least interesting framing: “Why Europe and not the New World?” and so we might as well get that question out of the way first.

I am well aware that when EU4 was released, this particular question – and generally the relative power of New World societies as compared to Old World societies – was a point of ferocious debate among fans (particularly on Paradox’s own forums). What makes this actually a less central question (though still an important one) is that the answer is wildly overdetermined. That is to say, any of these causes – the germs, the steel (though less the guns; Diamond’s attention is on the wrong developments there), but also horses, ocean-going ships, and dense, cohesive, disciplined military formations would have been enough in isolation to give almost any complex agrarian Old-World society military advantages which were likely to prove overwhelming in the event. The ‘killer technologies’ that made the conquest of the New World possible were (apart from the ships) old technologies in much of Afroeurasia; a Roman legion or a Han Chinese army of some fifteen centuries earlier would have had many of the same advantages had they been able to surmount the logistical problem of actually getting there. In the face of the vast shear in military technology (though often not in other technologies) which put Native American armies thousands of years behind their Old World agrarian counterparts, it is hard not to conclude that whatever Afroeurasian society was the first to resolve the logistical barriers to putting an army in the New World was also very likely to conquer it.

(On these points, see J.F. Guilmartin, “The Cutting Edge: An Analysis of the Spanish Invasion and Overthrow of the Inca Empire, 1532-1539,” in Transatlantic Encounters: Europeans and Andeans in the Sixteenth Century, eds. K. J. Andrien and R. Adorno (1991) and W.E. Lee, “The Military Revolution of Native North America: Firearms, Forts and Politics” in Empires and Indigenes: Intercultural Alliance, Imperial Expansion and Warfare in the Early Modern World, ed. W.E. Lee (2011). Both provide a good sense of the scale of the ‘technological shear’ between old world and new world armies and in particular that the technologies which were transformative were often not new things like guns, but very old things, like pikes, horses and metal axes.)

With regard to the Indian Ocean, he writes:

the Portuguese cartaz-system (c. 1500-c. 1700) [was] the main way that the Portuguese and later European powers wrested control over trade in the Indian Ocean; it only worked because Portuguese warships were functionally unbeatable by anything else afloat in the region due to differences in local styles of shipbuilding.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk

Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over their competitors (p. 326-328).

Sotala & Yampolskiy 2015 (p. 18) previously argued that:

In general, any broad domain involving high stakes, adversarial decision making and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but domains such as corporate management, fraud detection and warfare could plausibly make use of all the intelligence they can get. If oneʼs opponents in the domain are also using increasingly autonomous AI/AGI, there will be an arms race where one might have little choice but to give increasing amounts of control to AI/AGI systems.

Testing The Natural Abstraction Hypothesis: Project Intro

Oh cool! I put some effort into pursuing a very similar idea earlier:

I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.

but wasn't sure of how exactly to test it or work on it so I didn't get very far.

One idea that I had for testing it was rather different; make use of brain imaging research that seems able to map shared concepts between humans, and see whether that methodology could be used to also compare human-AI concepts:

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.

The farthest that I got with my general approach was "Defining Human Values for Value Learners". It felt (and still feels) to me like concepts are quite task-specific: two people in the same environment will develop very different concepts depending on the job that they need to perform... or even depending on the tools that they have available. The spatial concepts of sailors practicing traditional Polynesian navigation are sufficiently different from those of modern sailors that the "traditionalists" have extreme difficulty understanding what the kinds of bird's-eye-view maps we're used to are even representing - and vice versa; Western anthropologists had considerable difficulties figuring out what exactly it was that the traditional navigation methods were even talking about.

(E.g. the traditional way of navigating from one island to another involves imagining a third "reference" island and tracking its location relative to the stars as the journey proceeds. Some anthropologists thought that this third island was meant as an "emergency island" to escape to in case of unforeseen trouble, an interpretation challenged by the fact that the reference island may sometimes be completely imagined, so obviously not suitable as a backup port. Chapter 2 of Hutchins 1995 has a detailed discussion of the way that different tools for performing navigation affect one's conceptual representations, including the difficulties both the anthropologists and the traditional navigators had in trying to understand each other due to having incompatible concepts.)

Another example is legal concepts; e.g. American law traditionally held that a landowner did not only control his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of the airplane, this raised the question: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control?

Eventually, the law was altered so that landowners couldn't forbid airplanes from flying over their land. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel. In that case, we can think that our concept for landownership existed for the purpose of some vaguely-defined task (enabling the things that are commonly associated with owning land); when technology developed in a way that the existing concept started interfering with another task we value (fast travel), the concept came to be redefined so as to enable both tasks most efficiently.

This seemed to suggest an interplay between concepts and values; our values are to some extent defined in terms of our concepts, but our values and the tools that we have available for furthering our values also affect how we define our concepts. This line of thought led me to think that that interaction must be rooted in what was evolutionarily beneficial:

... evolution selects for agents which best maximize their fitness, while agents cannot directly optimize for their own fitness as they are unaware of it. Agents can however have a reward function that rewards behaviors which increase the fitness of the agents. The optimal reward function is one which maximizes (in expectation) the fitness of any agents having it. Holding the intelligence of the agents constant, the closer an agent’s reward function is to the optimal reward function, the higher their fitness will be. Evolution should thus be expected to select for reward functions that are closest to the optimal reward function. In other words, organisms should be expected to receive rewards for carrying out tasks which have been evolutionarily adaptive in the past. [...]

We should expect an evolutionarily successful organism to develop concepts that abstract over situations that are similar with regards to receiving a reward from the optimal reward function. Suppose that a certain action in state s1 gives the organism a reward, and that there are also states s2–s5 in which taking some specific action causes the organism to end up in s1. Then we should expect the organism to develop a common concept for being in the states s2–s5, and we should expect that concept to be “more similar” to the concept of being in state s1 than to the concept of being in some state that was many actions away.

In other words, we have some set of innate values that our brain is trying to optimize for; if concepts are task-specific, then this suggests that the kinds of concepts that will be natural to us are those which are beneficial for achieving our innate values given our current (social, physical and technological) environment. E.g. for a child, the concepts of "a child" and "an adult" will seem very natural, because there are quite a few things that an adult can do for furthering or hindering the child's goals that fellow children can't do. (And a specific subset of all adults named "mom and dad" is typically even more relevant for a particular child than any other adults are, making this an even more natural concept.)

That in turn seems to suggest that in order to see what concepts will be natural for humans, we need to look at fields such as psychology and neuroscience in order to figure out what our innate values are and how the interplay of innate and acquired values develops over time. I've had some hope that some of my later work on the structure and functioning of the mind would be relevant for that purpose.
