Kaj Sotala

Comments

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk

Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over their competitors (p. 326-328).

Sotala & Yampolskiy 2015 (p. 18) previously argued that:

In general, any broad domain involving high stakes, adversarial decision making and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but domains such as corporate management, fraud detection and warfare could plausibly make use of all the intelligence they can get. If oneʼs opponents in the domain are also using increasingly autonomous AI/AGI, there will be an arms race where one might have little choice but to give increasing amounts of control to AI/AGI systems.

Testing The Natural Abstraction Hypothesis: Project Intro

Oh cool! I put some effort into pursuing a very similar idea earlier:

I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.

but wasn't sure of how exactly to test it or work on it so I didn't get very far.

One idea that I had for testing it was rather different; make use of brain imaging research that seems able to map shared concepts between humans, and see whether that methodology could be used to also compare human-AI concepts:

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.

The farthest that I got with my general approach was "Defining Human Values for Value Learners". It felt (and still feels) to me like concepts are quite task-specific: two people in the same environment will develop very different concepts depending on the job that they need to perform...  or even depending on the tools that they have available. The spatial concepts of sailors practicing traditional Polynesian navigation are sufficiently different from those of modern sailors that the "traditionalists" have extreme difficulty understanding what the kinds of birds-eye-view maps we're used to are even representing - and vice versa; Western anthropologists had considerable difficulties figuring out what exactly it was that the traditional navigation methods were even talking about. 

(E.g. the traditional way of navigating from one island to another involves imagining a third "reference" island and tracking its location relative to the stars as the journey proceeds. Some anthropologists thought that this third island was meant as an "emergency island" to escape to in case of unforeseen trouble, an interpretation challenged by the fact that the reference island may sometimes be completely imagined, so obviously not suitable as a backup port. Chapter 2 of Hutchins 1995 has a detailed discussion of the way that different tools for performing navigation affect one's conceptual representations, including the difficulties both the anthropologists and the traditional navigators had in trying to understand each other due to having incompatible concepts.)

Another example are legal concepts; e.g. American law traditionally held that a landowner did not only control his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of this airplane, this raised the question: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control?

Eventually, the law was altered so that landowners couldn't forbid airplanes from flying over their land. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel. In that case, we can think that our concept for landownership existed for the purpose of some vaguely-defined task (enabling the things that are commonly associated with owning land); when technology developed in a way that the existing concept started interfering with another task we value (fast travel), the concept came to be redefined so as to enable both tasks most efficiently.

This seemed to suggest an interplay between concepts and values; our values are to some extent defined in terms of our concepts, but our values and the tools that we have available for furthering our values also affect that how we define our concepts. This line of thought led me to think that that interaction must be rooted in what was evolutionarily beneficial:

... evolution selects for agents which best maximize their fitness, while agents cannot directly optimize for their own fitness as they are unaware of it. Agents can however have a reward function that rewards behaviors which increase the fitness of the agents. The optimal reward function is one which maximizes (in expectation) the fitness of any agents having it. Holding the intelligence of the agents constant, the closer an agent’s reward function is to the optimal reward function, the higher their fitness will be. Evolution should thus be expected to select for reward functions that are closest to the optimal reward function. In other words, organisms should be expected to receive rewards for carrying out tasks which have been evolutionarily adaptive in the past. [...]

We should expect an evolutionarily successful organism to develop concepts that abstract over situations that are similar with regards to receiving a reward from the optimal reward function. Suppose that a certain action in state s1 gives the organism a reward, and that there are also states s2–s5 in which taking some specific action causes the organism to end up in s1. Then we should expect the organism to develop a common concept for being in the states s2–s5, and we should expect that concept to be “more similar” to the concept of being in state s1 than to the concept of being in some state that was many actions away.

In other words, we have some set of innate values that our brain is trying to optimize for; if concepts are task-specific, then this suggests that the kinds of concepts that will be natural to us are those which are beneficial for achieving our innate values given our current (social, physical and technological) environment. E.g. for a child, the concepts of "a child" and "an adult" will seem very natural, because there are quite a few things that an adult can do for furthering or hindering the child's goals that fellow children can't do. (And a specific subset of all adults named "mom and dad" is typically even more relevant for a particular child than any other adults are, making this an even more natural concept.)

That in turn seems to suggest that in order to see what concepts will be natural for humans, we need to look at fields such as psychology and neuroscience in order to figure out what our innate values are and how the interplay of innate and acquired values develops over time. I've had some hope that some of my later work on the structure and functioning of the mind would be relevant for that purpose.

How do we prepare for final crunch time?

Does any military use meditation as part of its training? 

. Yes, e.g.

This [2019] winter, Army infantry soldiers at Schofield Barracks in Hawaii began using mindfulness to improve shooting skills — for instance, focusing on when to pull the trigger amid chaos to avoid unnecessary civilian harm.

The British Royal Navy has given mindfulness training to officers, and military leaders are rolling it out in the Army and Royal Air Force for some officers and enlisted soldiers. The New Zealand Defence Force recently adopted the technique, and military forces of the Netherlands are considering the idea, too.

This week, NATO plans to hold a two-day symposium in Berlin to discuss the evidence behind the use of mindfulness in the military.

A small but growing group of military officials support the techniques to heal trauma-stressed veterans, make command decisions and help soldiers in chaotic battles.

“I was asked recently if my soldiers call me General Moonbeam,” said Maj. Gen. Piatt, who was director of operations for the Army and now commands its 10th Mountain Division. “There’s a stereotype this makes you soft. No, it brings you on point.”

The approach, he said, is based on the work of Amishi Jha, an associate professor of psychology at the University of Miami. She is the senior author of a paper published in December about the training’s effectiveness among members of a special operations unit.

The paper, in the journal Progress in Brain Research, reported that the troops who went through a monthlong training regimen that included daily practice in mindful breathing and focus techniques were better able to discern key information under chaotic circumstances and experienced increases in working memory function. The soldiers also reported making fewer cognitive errors than service members who did not use mindfulness.

The findings, which build on previous research showing improvements among soldiers and professional football players trained in mindfulness, are significant in part because members of the special forces are already selected for their ability to focus. The fact that even they saw improvement speaks to the power of the training, Dr. Jha said. [...]

Mr. Boughton has thought about whether mindfulness is anathema to conflict. “The purists would say that mindfulness was never developed for war purpose,” he said.

What he means is that mindfulness is often associated with peacefulness. But, he added, the idea is to be as faithful to compassionate and humane ideals as possible given the realities of the job.

Maj. Gen. Piatt underscored that point, describing one delicate diplomatic mission in Iraq that involved meeting with a local tribal leader. Before the session, he said, he meditated in front of a palm tree, and found himself extremely focused when the delicate conversation took place shortly thereafter.

“I was not taking notes. I remember every word she was saying. I wasn’t forming a response, just listening,” he said. When the tribal leader finished, he said, “I talked back to her about every single point, had to concede on some. I remember the expression on her face: This is someone we can work with.”

Fixing The Good Regulator Theorem

Appreciate this post! I had seen the good regulator theorem referenced every now and then, but wasn't sure what exactly the relevant claims were, and wouldn't have known how to go through the original proof myself. This is helpful.

(E.g. the result was cited by Frith & Metzinger as part of their argument that, as an agent seeks to avoid being punished by society, this constitutes an attempt to regulate society's behavior; and for the regulation be successful, the agent needs to internalize a model of the society's preferences, which once internalized becomes something like a subagent which then regulates the agent in turn and causes behaviors such as self-punishment. It sounds like the math of the theorem isn't very strongly relevant for that particular argument, though some form of the overall argument still sounds plausible to me regardless.)

The Case for a Journal of AI Alignment

IMO, a textbook would either overlook big chunks of the field or look more like an enumeration of approaches than a unified resource.

Textbooks that cover a number of different approaches without taking a position on which one is the best are pretty much the standard in many fields. (I recall struggling with it in some undergraduate psychology courses, as previous schooling didn't prepare me for a textbook that would cover three mutually exclusive theories and present compelling evidence in favor of each. Before moving on and presenting three mutually exclusive theories about some other phenomenon on the very next page.)

Why Subagents?

The "many decisions can be thought of as a committee requiring unanimous agreement" model felt intuitively right to me, and afterwards I've observed myself behaving in ways which seem compatible with it, and thought of this post.

The ethics of AI for the Routledge Encyclopedia of Philosophy

You probably know of these already, but just in case: lukeprog wrote a couple of articles on the history of AI risk thought [1, 2] going back to 1863. There's also the recent AI ethics article in the Stanford Encyclopedia of Philosophy.

I'd also like to imagine that my paper on superintelligence and astronomical suffering might say something that someone might consider important, but that is of course a subjective question. :-)

AGI safety from first principles: Introduction

because who's talking about medium-size risks from AGI?

Well, I have talked about them... :-)

The capability claim is often formulated as the possibility of an AI achieving a decisive strategic advantage (DSA). While the notion of a DSA has been implicit in many previous works, the concept was first explicitly defined by Bostrom (2014, p. 78) as “a level of technological and other advantages sufficient to enable [an AI] to achieve complete world domination.”

However, assuming that an AI will achieve a DSA seems like an unnecessarily strong form of the capability claim, as an AI could cause a catastrophe regardless. For instance, consider a scenario where an AI launches an attack calculated to destroy human civilization. If the AI was successful in destroying humanity or large parts of it, but the AI itself was also destroyed in the process, this would not count as a DSA as originally defined. Yet, it seems hard to deny that this outcome should nonetheless count as a catastrophe.

Because of this, this chapter focuses on situations where an AI achieves (at least) a major strategic advantage (MSA), which we will define as “a level of technological and other advantages sufficient to pose a catastrophic risk to human society.” A catastrophic risk is one that might inflict serious damage to human well-being on a global scale and cause 10 million or more fatalities (Bostrom & Ćirković 2008).

What Decision Theory is Implied By Predictive Processing?

I read you to be asking "what decision theory is implied by predictive processing as it's implemented in human brains". It's my understanding that while there are attempts to derive something like a "decision theory formulated entirely in PP terms", there are also serious arguments for the brain actually having systems that are just conventional decision theories and not directly derivable from PP.

Let's say you try, as some PP theorists do, to explain all behavior as free energy minimization as opposed to expected utility maximization. Ransom et al. (2020) (current sci-hub) note that this makes it hard to explain cases where the mind acts according to a prediction that has a low probability of being true, but a high cost if it were true. 

For example, the sound of rustling grass might be indicative either of the wind or of a lion; if wind is more likely, then predictive processing says that wind should become the predominant prediction. But for your own safety it can be better to predict that it's a lion, just in case. "Predict a lion" is also what standard Bayesian decision theory would recommend, and it seems like the correct solution... but to get that correct solution, you need to import Bayesian decision theory as an extra ingredient, it doesn't fall naturally out of the predictive processing framework.

That sounds to me like PP, or at least PP as it exists, is something that's compatible with implementing different decision theories, rather than one that implies a specific decision theory by itself.

Load More