[AN #122]: Arguing for AGI-driven existential risk from first principles

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Note: I (Rohin) have just started as a Research Scientist at DeepMind! I expect to publish the newsletter as normal, but if previously you’ve been taking my CHAI summaries with a grain of salt due to the conflict of interest, it’s now time to apply that to DeepMind summaries. Of course, I try not to be biased towards my employer, but who knows how well I succeed at that.

HIGHLIGHTS

AGI safety from first principles (Richard Ngo) (summarized by Rohin): This sequence presents the author’s personal view on the current best arguments for AI risk, explained from first principles (that is, without taking any previous claims for granted). The argument is a specific instantiation of the second species argument that sufficiently intelligent AI systems could become the most intelligent species, in which case humans could lose the ability to create a valuable and worthwhile future.

We should clarify what we mean by superintelligence, and how it might arise. The author considers intelligence as quantifying simply whether a system “could” perform a wide range of tasks, separately from whether it is motivated to actually perform those tasks. In this case, we could imagine two rough types of intelligence. The first type, epitomized by most current AI systems, trains an AI system to perform many different tasks, so that it is then able to perform all of those tasks; however, it cannot perform tasks it has not been trained on. The second type, epitomized by human intelligence and GPT-3 (AN #102), trains AI systems in a task-agnostic way, such that they develop general cognitive skills that allow them to solve new tasks quickly, perhaps with a small amount of training data. This second type seems particularly necessary for tasks where data is scarce, such as the task of being a CEO of a company. Note that these two types should be thought of as defining a spectrum, not a binary distinction, since the type of a particular system depends on how you define your space of “tasks”.

How might we get AI systems that are more intelligent than humans? Besides improved algorithms, compute and data, we will likely also see that interactions between AI systems will be crucial to their capabilities. For example, since AI systems are easily replicated, we could get a collective superintelligence via a collection of replicated AI systems working together and learning from each other. In addition, the process of creation of AI systems will be far better understood than that of human evolution, and AI systems will be easier to directly modify, allowing for AI systems to recursively improve their own training process (complementing human researchers) much more effectively than humans can improve themselves or their children.

The second species argument relies on the argument that superintelligent AI systems will gain power over humans, which is usually justified by arguing that the AI system will be goal-directed. Making this argument more formal is challenging: the EU maximizer framework doesn’t work for this purpose (AN #52) and applying the intentional stance only helps when you have some prior information about what goals the AI system might have, which begs the question.

The author decides to instead consider a more conceptual, less formal notion of agency, in which a system is more goal-directed the more its cognition has the following properties: (1) self-awareness, (2) planning, (3) judging actions or plans by their consequences, (4) being sensitive to consequences over large distances and long time horizons, (5) internal coherence, and (6) flexibility and adaptability. (Note that this can apply to a single unified model or a collective AI system.) It’s pretty hard to say whether current training regimes will lead to the development of these capabilities, but one argument for it is that many of these capabilities may end up being necessary prerequisites to training AI agents to do intellectual work.

Another potential framework is to identify a goal as some concept learned by the AI system, that then generalizes in such a way that the AI system pursues it over longer time horizons. In this case, we need to predict what concepts an AI system will learn and how likely it is that they generalize in this way. Unfortunately, we don’t yet know how to do this.

What does alignment look like? The author uses intent alignment (AN #33), that is, the AI system should be “trying to do what the human wants it to do”, in order to rule out the cases where the AI system causes bad outcomes through incompetence where it didn’t know what it was supposed to do. Rather than focusing on the outer and inner alignment decomposition, the author prefers to take a holistic view in which the choice of reward function is just one (albeit quite important) tool in the overall project of choosing a training process that shapes the AI system towards safety (either by making it not agentic, or by shaping its motivations so that the agent is intent aligned).

Given that we’ll be trying to build aligned systems, why might we still get an existential catastrophe? First, a failure of alignment is still reasonably likely, since (1) good behavior is hard to identify, (2) human values are complex, (3) influence-seeking may be a useful subgoal during training, and thus incentivized, (4) it is hard to generate training data to disambiguate between different possible goals, and (5) while interpretability could help, it seems quite challenging. Then, given a failure of alignment, the AI systems could seize control via the mechanisms suggested in What failure looks like (AN #50) and Superintelligence. How likely this is depends on factors like (1) takeoff speed, (2) how easily we can understand what AI systems are doing, (3) how constrained AI systems are at deployment, and (4) how well humanity can coordinate.

Rohin's opinion: I like this sequence: I think it’s a good “updated case” for AI risk that focuses on the situation in which intelligent AI systems arise through training of ML models. The points it makes are somewhat different from the ones I would make if I were writing such a case, but I think they are still sufficient to make the case that humanity has work to do if we are to ensure that AI systems we build are aligned.

TECHNICAL AI ALIGNMENT

MESA OPTIMIZATION

The Solomonoff Prior is Malign (Mark Xu) (summarized by Rohin): This post provides a more accessible explanation of the argument that when we use the Solomonoff prior to make decisions, the predictions could be systematically chosen to optimize for something we wouldn’t want.

LEARNING HUMAN INTENT

Toy Problem: Detective Story Alignment (John Wentworth) (summarized by Rohin): We can generate toy problems for alignment by replacing the role of the human by that of a weak AI system, as in the MNIST debate task (AN #5). With the advent of GPT-3, we can have several new such problems. For example, suppose we used topic modelling to build a simple model that can detect detective stories (though isn’t very good at it). How can we use this to finetune GPT-3 to output detective stories, using GPT-3’s concept of detective stories (which is presumably better than the one found by the weak AI system)?

Rohin's opinion: I am a big fan of working on toy problems of this form now, and then scaling up these solutions with the capabilities of our AI systems. This depends on an assumption that no new problems will come up once the AI system is superintelligent, which I personally believe, though I know other people disagree (though I don’t know why they disagree).

PREVENTING BAD BEHAVIOR

Avoiding Side Effects By Considering Future Tasks (Victoria Krakovna et al) (summarized by Rohin): We are typically unable to specify all of the things that the agent should not change about the environment. So, we would like a generic method that can penalize these side effects in arbitrary environments for an arbitrary reward function. Typically, this is done via somehow preserving option value, as with relative reachability (AN #10) and attainable utility preservation (AN #39).

This paper aims to encode the goal of “option value preservation” in a simpler and more principled manner: specifically, at some point in the future we will randomly choose a new task to give to the agent, so that the agent must maintain its ability to pursue the possible tasks it can see in the future. However, if implemented as stated, this leads to interference incentives -- if something were going to restrict the agent’s option value, such as a human irreversibly eating some food, the agent would be incentivized to interfere with that process in order to keep its option value for the future. The authors provide a formal definition of this incentive.

To fix this problem, the authors introduce a baseline policy (which could be set to e.g. noop actions), and propose a future task reward that only provides reward if after the baseline policy had been executed, it would still have been possible to complete the future task. Thus, the agent is only incentivized to preserve options that would have been available had it done whatever the baseline policy does, eliminating the interference incentive in the deterministic case. The authors demonstrate on simple gridworlds that the future task approach with the baseline allows us to avoid side effects, while also not having interference incentives.
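To make the interference incentive concrete, here is a minimal sketch of the idea as summarized above. It is not the authors' implementation: the two-state "food" environment, the function names, and the simplification of the future task reward to plain reachability of goal states are all illustrative assumptions, and the paper's own definition may differ in detail.

```python
from collections import deque


def reachable(transitions, start):
    """Return the set of states reachable from `start` in a deterministic graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def future_task_reward(transitions, agent_state, goals, baseline_state=None):
    """Fraction of possible future tasks (goal states) that still count.

    Without a baseline, a goal counts if the agent can still reach it.
    With a baseline, a goal counts only if it is reachable both from the
    agent's state and from the state the baseline policy would have led to.
    """
    agent_reach = reachable(transitions, agent_state)
    if baseline_state is None:
        counted = [g for g in goals if g in agent_reach]
    else:
        base_reach = reachable(transitions, baseline_state)
        counted = [g for g in goals if g in agent_reach and g in base_reach]
    return len(counted) / len(goals)


# Hypothetical toy environment: eating the food is irreversible.
transitions = {
    "food_uneaten": ["food_eaten"],  # the food can still be eaten later
    "food_eaten": [],                # but it cannot be "uneaten"
}
goals = ["food_uneaten", "food_eaten"]

# Under the baseline (noop) policy, the human eats the food.
baseline_state = "food_eaten"

# No baseline: stopping the human keeps more options open, so it pays off.
interfere_plain = future_task_reward(transitions, "food_uneaten", goals)
noop_plain = future_task_reward(transitions, "food_eaten", goals)

# With the baseline: options the baseline would have lost earn nothing.
interfere_base = future_task_reward(transitions, "food_uneaten", goals, baseline_state)
noop_base = future_task_reward(transitions, "food_eaten", goals, baseline_state)

print(interfere_plain, noop_plain)  # interference is rewarded
print(interfere_base, noop_base)    # interference earns nothing extra
```

In this toy setting, the agent that stops the human from eating scores higher without the baseline, but scores the same as the non-interfering agent once goals must also be reachable from the baseline state.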

Normally we would also talk about how to remove the offsetting incentive, where the agent may be incentivized to undo effects it did as part of the task to avoid being penalized for them. (The example from relative reachability is of an agent that is rewarded for taking a vase off of a conveyor belt, and then puts it back on to minimize its impact.) However, the authors argue that offsetting is often desirable. For example, if you open the door to go to the grocery store, you do want to “offset” your impact by closing the door as you leave, even though opening the door was important for the task of buying groceries. They argue that offsetting incentives should be left in, and the burden is on the reward designer to ensure that anything that shouldn’t be offset is specified as such in the reward function. In the original conveyor belt example, we shouldn’t reward the action of taking the vase off the conveyor belt, but instead the state in which the vase is not on the conveyor belt.

Rohin's opinion: I liked this exploration of a somewhat more principled underpinning to impact measures, and it is encouraging that this formalization of option value preservation gives similar results to previous formalizations.

MISCELLANEOUS (ALIGNMENT)

Measurement in AI Policy: Opportunities and Challenges (Saurabh Mishra et al) (summarized by Flo): This paper is itself a summary of a 2019 Stanford workshop on the measurement of AI systems and contains summaries of all of the 33 talks given at the workshop. The workshop featured three in-depth breakout sessions, one on R&D and performance, one on economic impact and policy, and one on AI for sustainable development and human rights. Based on the discussions, the authors identify six central problems in measuring AI progress and the impacts of AI systems:

First, the exact definition of AI is hard to pin down given the ongoing evolution of the field. The lack of clear definitions makes it tricky to combine results on different aspects, like investments into "AI" and the effects of "AI" on productivity, both with each other and across different countries or sectors.

Second, measuring progress in AI is hard for a variety of reasons: We don't just care about narrow benchmark performance but about many factors like robustness, transferability and compute-efficiency, and it is not clear what the tradeoff between performance and these factors should look like. Apart from that, progress on popular benchmarks might be faster than overall progress as methods overfit to the benchmark, and the rise and fall of benchmark popularity make it hard to track progress over longer time intervals. Still, focusing on specific benchmarks and subdomains seems like an important first step.

Third, bibliometric data is an important tool for better understanding the role of different actors in a scientific field. More precise definitions of AI could help with getting better bibliometric data, and such data could shed some light on aspects like the lifecycle of AI techniques and the demographics of AI researchers.

Fourth, we would like to measure the impact of AI on the economy, especially on inequality and the labour market. This requires a better understanding of the relationship between inputs like skilled workers and data, and outputs, which is difficult to obtain because many of the involved factors are quite intangible and effects on outputs can be quite delayed. Short-term indicators that are strong predictors of longer-term effects would be very useful in this context. Lastly, even figuring out which businesses are deploying AI can be hard, especially if the applications are inward-focused.

The fifth problem concerns the measurement of societal impacts of AI, with a special focus on developing countries: while a large number of metrics for impacts of AI systems on human rights and the UN's sustainable development goals have been proposed, there is so far little data on the deployment of AI systems for social good and in developing countries.

Sixth, there is a need for better assessment of the risks posed by, and other negative impacts of, AI systems, both before and after deployment. To that end, a better understanding of risks posed by general classes of applications like autonomous weapons, surveillance and fake videos would be helpful. One barrier here is that many of the riskier applications are in the domain of governmental action, so detailed information is often classified.

Flo's opinion: If we cannot even measure AI progress and the impacts of AI systems right now, how are we supposed to accurately forecast them? As better forecasts are crucial for prioritizing the right problems and solutions, I am glad that the measurement of AI progress and impacts is getting broader attention. While the focus on developing countries might be less important from the perspective of AI existential safety, increased attention on measuring AI from a diverse set of communities is likely very useful for bringing the topic to the mainstream.

Knowledge, manipulation, and free will (Stuart Armstrong) (summarized by Rohin): This post considers the concepts of free will, manipulation, and coercion in the setting where we have a superintelligent AI system that is able to predict human behavior very accurately. The main point I’d highlight is that the concept of manipulation seems pretty hard to pin down, since anything the AI system does probably does affect the human in some way that the AI system could predict and so could count as “manipulation”.

NEWS

PhD Studentships in Safe and Trusted Artificial Intelligence (summarized by Rohin): The UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence is offering 12 fully funded PhD Studentships. They focus on the use of symbolic AI techniques for ensuring the safety and trustworthiness of AI systems. There are multiple application periods; the application deadline for the first round is November 22.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
