Kaj Sotala


2021 AI Alignment Literature Review and Charity Comparison

I would be happy to see you write a top-level post about this paper. :)

Biology-Inspired AGI Timelines: The Trick That Never Works

I had a pretty strong negative reaction to it. I got the feeling that the post derives much of its rhetorical force from setting up an intentionally stupid character who can be condescended to, and that this is used to sneak in a conclusion that would seem much weaker without that device.

DeepMind: Generally capable agents emerge from open-ended play

Didn't they train a separate MuZero agent for each game? E.g. the page you link only talks about being able to learn without pre-existing knowledge.

Cortés, Pizarro, and Afonso as Precedents for Takeover

However, I don't think this is the whole explanation. The technological advantage of the conquistadors was not overwhelming.

With regard to the Americas at least, I just happened to read this article by a professional military historian, who characterizes the Native American military technology as being "thousands of years behind their Old World agrarian counterparts", which sounds like the advantage was actually rather overwhelming.

There is a massive amount of literature to explain what is sometimes called ‘the Great Divergence‘ (a term I am going to use here as valuable shorthand) between Europe and the rest of the world between 1500 and 1800. Of all of this, most readers are likely only to be familiar with one work, J. Diamond’s Guns, Germs and Steel (1997), which is unfortunate because Diamond’s model of geographic determinism is actually not terribly well regarded in the debate (although, to be fair, it is still better than some of the truly trash nationalistic nonsense that gets produced on this topic). Diamond asks the Great Divergence question with perhaps the least interesting framing: “Why Europe and not the New World?” and so we might as well get that question out of the way first.

I am well aware that when EU4 was released, this particular question – and generally the relative power of New World societies as compared to Old World societies – was a point of ferocious debate among fans (particularly on Paradox’s own forums). What makes this actually a less central question (though still an important one) is that the answer is wildly overdetermined. That is to say, any of these causes – the germs, the steel (though less the guns; Diamond’s attention is on the wrong developments there), but also horses, ocean-going ships, and dense, cohesive, disciplined military formations would have been enough in isolation to give almost any complex agrarian Old-World society military advantages which were likely to prove overwhelming in the event. The ‘killer technologies’ that made the conquest of the New World possible were (apart from the ships) old technologies in much of Afroeurasia; a Roman legion or a Han Chinese army of some fifteen centuries earlier would have had many of the same advantages had they been able to surmount the logistical problem of actually getting there. In the face of the vast shear in military technology (though often not in other technologies) which put Native American armies thousands of years behind their Old World agrarian counterparts, it is hard not to conclude that whatever Afroeurasian society was the first to resolve the logistical barriers to putting an army in the New World was also very likely to conquer it.

(On these points, see J.F. Guilmartin, “The Cutting Edge: An Analysis of the Spanish Invasion and Overthrow of the Inca Empire, 1532-1539,” in Transatlantic Encounters: Europeans and Andeans in the Sixteenth Century, eds. K. J. Andrien and R. Adorno (1991) and W.E. Lee, “The Military Revolution of Native North America: Firearms, Forts and Politics” in Empires and Indigenes: Intercultural Alliance, Imperial Expansion and Warfare in the Early Modern World, ed. W.E. Lee (2011). Both provide a good sense of the scale of the ‘technological shear’ between old world and new world armies and in particular that the technologies which were transformative were often not new things like guns, but very old things, like pikes, horses and metal axes.)

With regard to the Indian Ocean, he writes:

the Portuguese cartaz-system (c. 1500-c. 1700) [was] the main way that the Portuguese and later European powers wrested control over trade in the Indian Ocean; it only worked because Portuguese warships were functionally unbeatable by anything else afloat in the region due to differences in local styles of shipbuilding.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk

Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over their competitors (p. 326-328).

Sotala & Yampolskiy 2015 (p. 18) previously argued that:

In general, any broad domain involving high stakes, adversarial decision making and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but domains such as corporate management, fraud detection and warfare could plausibly make use of all the intelligence they can get. If oneʼs opponents in the domain are also using increasingly autonomous AI/AGI, there will be an arms race where one might have little choice but to give increasing amounts of control to AI/AGI systems.

Testing The Natural Abstraction Hypothesis: Project Intro

Oh cool! I put some effort into pursuing a very similar idea earlier:

I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.

but wasn't sure of how exactly to test it or work on it so I didn't get very far.

One idea that I had for testing it was rather different; make use of brain imaging research that seems able to map shared concepts between humans, and see whether that methodology could be used to also compare human-AI concepts:

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.
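The shared-space comparison described above can be illustrated with a toy example. This is only a minimal sketch under strong assumptions, not the method of any cited study: the "human" and "AI" vectors are synthetic, the two spaces are assumed to have the same dimensionality, and the translation is assumed to be a pure rotation (orthogonal Procrustes alignment). The function names `fit_orthogonal_map` and `retrieval_accuracy` are hypothetical. The map is fitted on some concepts and then checked on held-out ones, by asking whether a translated human vector lands closest to the AI vector for the same concept.

```python
import numpy as np

def fit_orthogonal_map(human, ai):
    """Return the orthogonal matrix R minimizing ||human @ R - ai||_F."""
    u, _, vt = np.linalg.svd(human.T @ ai)
    return u @ vt

def retrieval_accuracy(human, ai, rotation):
    """Fraction of concepts whose mapped human vector is nearest to its own AI vector."""
    mapped = human @ rotation
    # Pairwise squared Euclidean distances between mapped human and AI vectors.
    dists = ((mapped[:, None, :] - ai[None, :, :]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == np.arange(len(ai))).mean())

# Synthetic stand-in data (an assumption, not real recordings): the "human"
# space is a hidden rotation of the "AI" space plus a little noise.
rng = np.random.default_rng(0)
n_concepts, dim, n_train = 40, 16, 30
ai_vecs = rng.normal(size=(n_concepts, dim))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
human_vecs = ai_vecs @ hidden_rotation.T + 0.01 * rng.normal(size=(n_concepts, dim))

# Fit the translation on the first 30 concepts, evaluate on the remaining 10.
rotation = fit_orthogonal_map(human_vecs[:n_train], ai_vecs[:n_train])
accuracy = retrieval_accuracy(human_vecs[n_train:], ai_vecs[n_train:], rotation)
print(f"held-out retrieval accuracy: {accuracy:.2f}")
```

High held-out accuracy here only reflects how the synthetic data was built; with real neural and model activations, the interesting question is how far accuracy falls below that ceiling.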

The farthest that I got with my general approach was "Defining Human Values for Value Learners". It felt (and still feels) to me like concepts are quite task-specific: two people in the same environment will develop very different concepts depending on the job that they need to perform... or even depending on the tools that they have available. The spatial concepts of sailors practicing traditional Polynesian navigation are sufficiently different from those of modern sailors that the "traditionalists" have extreme difficulty understanding what the kinds of bird's-eye-view maps we're used to are even representing - and vice versa; Western anthropologists had considerable difficulties figuring out what exactly it was that the traditional navigation methods were even talking about.

(E.g. the traditional way of navigating from one island to another involves imagining a third "reference" island and tracking its location relative to the stars as the journey proceeds. Some anthropologists thought that this third island was meant as an "emergency island" to escape to in case of unforeseen trouble, an interpretation challenged by the fact that the reference island may sometimes be completely imagined, so obviously not suitable as a backup port. Chapter 2 of Hutchins 1995 has a detailed discussion of the way that different tools for performing navigation affect one's conceptual representations, including the difficulties both the anthropologists and the traditional navigators had in trying to understand each other due to having incompatible concepts.)

Another example is legal concepts; e.g. American law traditionally held that a landowner did not only control his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of the airplane, this raised the question: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control?

Eventually, the law was altered so that landowners couldn't forbid airplanes from flying over their land. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel. In that case, we can think that our concept for landownership existed for the purpose of some vaguely-defined task (enabling the things that are commonly associated with owning land); when technology developed in a way that the existing concept started interfering with another task we value (fast travel), the concept came to be redefined so as to enable both tasks most efficiently.

This seemed to suggest an interplay between concepts and values; our values are to some extent defined in terms of our concepts, but our values and the tools that we have available for furthering our values also affect how we define our concepts. This line of thought led me to think that that interaction must be rooted in what was evolutionarily beneficial:

... evolution selects for agents which best maximize their fitness, while agents cannot directly optimize for their own fitness as they are unaware of it. Agents can however have a reward function that rewards behaviors which increase the fitness of the agents. The optimal reward function is one which maximizes (in expectation) the fitness of any agents having it. Holding the intelligence of the agents constant, the closer an agent’s reward function is to the optimal reward function, the higher their fitness will be. Evolution should thus be expected to select for reward functions that are closest to the optimal reward function. In other words, organisms should be expected to receive rewards for carrying out tasks which have been evolutionarily adaptive in the past. [...]

We should expect an evolutionarily successful organism to develop concepts that abstract over situations that are similar with regards to receiving a reward from the optimal reward function. Suppose that a certain action in state s1 gives the organism a reward, and that there are also states s2–s5 in which taking some specific action causes the organism to end up in s1. Then we should expect the organism to develop a common concept for being in the states s2–s5, and we should expect that concept to be “more similar” to the concept of being in state s1 than to the concept of being in some state that was many actions away.
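The s1–s5 example can be made concrete with a toy sketch. It is a loose illustration rather than the formalism of the quoted paper: it assumes a small deterministic state graph (the extra states s6 and s7 and the function name `steps_to_reward` are hypothetical) and uses "number of actions away from the rewarding state" as a crude stand-in for similarity with regard to receiving reward. States that end up in the same group are the ones the argument expects to share a concept.

```python
from collections import deque

def steps_to_reward(transitions, rewarding_state):
    """Minimum number of actions needed to reach rewarding_state from each state."""
    # Reverse the edges so one breadth-first search from the reward covers all states.
    predecessors = {}
    for state, successors in transitions.items():
        for nxt in successors:
            predecessors.setdefault(nxt, []).append(state)
    distance = {rewarding_state: 0}
    queue = deque([rewarding_state])
    while queue:
        current = queue.popleft()
        for prev in predecessors.get(current, []):
            if prev not in distance:
                distance[prev] = distance[current] + 1
                queue.append(prev)
    return distance

# Hypothetical toy environment: s2-s5 each have an action leading to s1;
# s6 and s7 are further away.
transitions = {
    "s1": [],
    "s2": ["s1"], "s3": ["s1"], "s4": ["s1"], "s5": ["s1"],
    "s6": ["s5"],
    "s7": ["s6"],
}
distance = steps_to_reward(transitions, "s1")

# Group states by how many actions separate them from the reward.
groups = {}
for state, d in sorted(distance.items()):
    groups.setdefault(d, []).append(state)
print(groups)
```

In this toy graph s2–s5 fall into one group a single action away from s1, while s6 and s7 sit in groups further out, matching the intuition that the s2–s5 concept should be "more similar" to s1 than a state many actions away.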

In other words, we have some set of innate values that our brain is trying to optimize for; if concepts are task-specific, then this suggests that the kinds of concepts that will be natural to us are those which are beneficial for achieving our innate values given our current (social, physical and technological) environment. E.g. for a child, the concepts of "a child" and "an adult" will seem very natural, because there are quite a few things that an adult can do for furthering or hindering the child's goals that fellow children can't do. (And a specific subset of all adults named "mom and dad" is typically even more relevant for a particular child than any other adults are, making this an even more natural concept.)

That in turn seems to suggest that in order to see what concepts will be natural for humans, we need to look at fields such as psychology and neuroscience in order to figure out what our innate values are and how the interplay of innate and acquired values develops over time. I've had some hope that some of my later work on the structure and functioning of the mind would be relevant for that purpose.

How do we prepare for final crunch time?

Does any military use meditation as part of its training? 

Yes, e.g.

This [2019] winter, Army infantry soldiers at Schofield Barracks in Hawaii began using mindfulness to improve shooting skills — for instance, focusing on when to pull the trigger amid chaos to avoid unnecessary civilian harm.

The British Royal Navy has given mindfulness training to officers, and military leaders are rolling it out in the Army and Royal Air Force for some officers and enlisted soldiers. The New Zealand Defence Force recently adopted the technique, and military forces of the Netherlands are considering the idea, too.

This week, NATO plans to hold a two-day symposium in Berlin to discuss the evidence behind the use of mindfulness in the military.

A small but growing group of military officials support the techniques to heal trauma-stressed veterans, make command decisions and help soldiers in chaotic battles.

“I was asked recently if my soldiers call me General Moonbeam,” said Maj. Gen. Piatt, who was director of operations for the Army and now commands its 10th Mountain Division. “There’s a stereotype this makes you soft. No, it brings you on point.”

The approach, he said, is based on the work of Amishi Jha, an associate professor of psychology at the University of Miami. She is the senior author of a paper published in December about the training’s effectiveness among members of a special operations unit.

The paper, in the journal Progress in Brain Research, reported that the troops who went through a monthlong training regimen that included daily practice in mindful breathing and focus techniques were better able to discern key information under chaotic circumstances and experienced increases in working memory function. The soldiers also reported making fewer cognitive errors than service members who did not use mindfulness.

The findings, which build on previous research showing improvements among soldiers and professional football players trained in mindfulness, are significant in part because members of the special forces are already selected for their ability to focus. The fact that even they saw improvement speaks to the power of the training, Dr. Jha said. [...]

Mr. Boughton has thought about whether mindfulness is anathema to conflict. “The purists would say that mindfulness was never developed for war purpose,” he said.

What he means is that mindfulness is often associated with peacefulness. But, he added, the idea is to be as faithful to compassionate and humane ideals as possible given the realities of the job.

Maj. Gen. Piatt underscored that point, describing one delicate diplomatic mission in Iraq that involved meeting with a local tribal leader. Before the session, he said, he meditated in front of a palm tree, and found himself extremely focused when the delicate conversation took place shortly thereafter.

“I was not taking notes. I remember every word she was saying. I wasn’t forming a response, just listening,” he said. When the tribal leader finished, he said, “I talked back to her about every single point, had to concede on some. I remember the expression on her face: This is someone we can work with.”

Fixing The Good Regulator Theorem

Appreciate this post! I had seen the good regulator theorem referenced every now and then, but wasn't sure what exactly the relevant claims were, and wouldn't have known how to go through the original proof myself. This is helpful.

(E.g. the result was cited by Frith & Metzinger as part of their argument that, as an agent seeks to avoid being punished by society, this constitutes an attempt to regulate society's behavior; and for the regulation to be successful, the agent needs to internalize a model of the society's preferences, which once internalized becomes something like a subagent which then regulates the agent in turn and causes behaviors such as self-punishment. It sounds like the math of the theorem isn't very strongly relevant for that particular argument, though some form of the overall argument still sounds plausible to me regardless.)

The Case for a Journal of AI Alignment

IMO, a textbook would either overlook big chunks of the field or look more like an enumeration of approaches than a unified resource.

Textbooks that cover a number of different approaches without taking a position on which one is the best are pretty much the standard in many fields. (I recall struggling with it in some undergraduate psychology courses, as previous schooling didn't prepare me for a textbook that would cover three mutually exclusive theories and present compelling evidence in favor of each, before moving on and presenting three mutually exclusive theories about some other phenomenon on the very next page.)
