[AN #59] How arguments for AI risk have changed over time

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback.

Highlights

A shift in arguments for AI risk (Tom Sittler): Early arguments for AI safety focus on existential risk caused by a failure of alignment combined with a sharp, discontinuous jump in AI capabilities. The discontinuity assumption is needed in order to argue for a treacherous turn, for example: without a discontinuity, we would presumably see less capable AI systems fail to hide their misaligned goals from us, or attempt to deceive us without success. Similarly, in order for an AI system to obtain a decisive strategic advantage, it would need to be significantly more powerful than all the other AI systems already in existence, which requires some sort of discontinuity.

Now, there are several other arguments for AI risk, though none of them has been made in great detail, and they are spread out over a few blog posts. This post analyzes several of them and points out some open questions.

First, even without a discontinuity, a failure of alignment could lead to a bad future: since the AIs have more power and intelligence, their values will determine what happens in the future, rather than ours. (Here it is the difference between AIs and humans that matters, whereas for a decisive strategic advantage it is the difference between the most intelligent agent and the next-most intelligent agents that matters.) See also More realistic tales of doom (AN #50) and Three impacts of machine intelligence. However, it isn't clear why we wouldn't be able to fix the misalignment at the early stages when the AI systems are not too powerful.

Even if we ignore alignment failures, there are other AI risk arguments. In particular, since AI will be a powerful technology, it could be used by malicious actors; it could help ensure robust totalitarian regimes; it could increase the likelihood of great-power war; and it could lead to stronger competitive pressures that erode value. With all of these arguments, it's not clear why they are specific to AI in particular, as opposed to any important technology, and the arguments for risk have not been sketched out in detail.

The post ends with an exhortation to AI safety researchers to clarify which sources of risk motivate them, because it will influence what safety work is most important, it will help cause prioritization efforts that need to determine how much money to allocate to AI risk, and it can help avoid misunderstandings with people who are skeptical of AI risk.

Rohin's opinion: I'm glad to see more work of this form; it seems particularly important to gain more clarity on what risks we actually care about, because it strongly influences what work we should do. In the particular scenario of an alignment failure without a discontinuity, I'm not satisfied with the solution "we can fix the misalignment early on", because early on even if the misalignment is apparent to us, it likely will not be easy to fix, and the misaligned AI system could still be useful because it is "aligned enough", at least at this low level of capability.

Personally, the argument that motivates me most is "AI will be very impactful, and it's worth putting effort into making sure that that impact is positive". I think the scenarios involving alignment failures without a discontinuity are a particularly important subcategory of this argument: while I do expect we will be able to handle this issue if it arises, this is mostly because of meta-level faith in humanity to deal with the problem. We don't currently have a good object-level story for why the issue won't happen, or why it will be fixed when it does happen, and it would be good to have such a story in order to be confident that AI will in fact be beneficial for humanity.

I know less about the non-alignment risks, and my work doesn't really address any of them. They seem worth more investigation; currently my feeling towards them is "yeah, those could be risks, but I have no idea how likely the risks are".

Technical AI alignment

Learning human intent

Learning biases and rewards simultaneously (Rohin Shah et al): Typically, inverse reinforcement learning assumes that the demonstrator is optimal, or that any mistakes they make are caused by random noise. Without a model of how the demonstrator makes mistakes, we should expect that IRL would not be able to outperform the demonstrator (AN #31). So, a natural question arises: can we learn the systematic mistakes that the demonstrator makes from data? While there is an impossibility result (AN #31) here, we might hope that it is only a problem in theory, not in practice.

In this paper, my coauthors and I propose that we learn the cognitive biases of the demonstrator, by learning their planning algorithm. The hope is that the cognitive biases are encoded in the learned planning algorithm. We can then perform bias-aware IRL by finding the reward function that when passed into the planning algorithm results in the observed policy. We have two algorithms which do this, one which assumes that we know the ground-truth rewards for some tasks, and one which tries to keep the learned planner “close to” the optimal planner. In a simple environment with simulated human biases, the algorithms perform better than the standard IRL assumptions of perfect optimality or Boltzmann rationality -- but they lose a lot of performance by using an imperfect differentiable planner to learn the planning algorithm.
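As a rough illustration of the Boltzmann rationality baseline mentioned above, here is a minimal sketch. It assumes the common formulation in which the demonstrator picks an action with probability proportional to exp(beta * Q(a)); the action values, the `beta` settings and the function names are hypothetical and are not taken from the paper. Under this model, mistakes are random noise around the best action rather than systematic biases, which is the limitation the paper tries to address.

```python
import math

def boltzmann_policy(q_values, beta):
    """Probability of each action, proportional to exp(beta * Q(a))."""
    top = max(q_values)  # subtract the max so exp() cannot overflow
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def demo_log_likelihood(demo_actions, q_values, beta):
    """Log-probability of the observed demonstrations under the Boltzmann model."""
    probs = boltzmann_policy(q_values, beta)
    return sum(math.log(probs[a]) for a in demo_actions)

# Hypothetical action values for a single state (an assumption for illustration).
q_values = [1.0, 2.0, 0.5]

uniform = boltzmann_policy(q_values, beta=0.0)         # beta = 0: actions chosen at random
noisy = boltzmann_policy(q_values, beta=1.0)           # moderate noise around the best action
near_optimal = boltzmann_policy(q_values, beta=50.0)   # large beta approaches perfect optimality
```

Standard IRL would search for the reward (here, the values in `q_values`) that maximizes `demo_log_likelihood`; the bias-aware approach described above replaces the fixed Boltzmann planner with a learned one.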

Rohin's opinion: Although this only got published recently, it’s work I did over a year ago. I’m no longer very optimistic about ambitious value learning (AN #31), and so I’m less excited about its impact on AI alignment now. In particular, it seems unlikely to me that we will need to infer all human values perfectly, without any edge cases or uncertainties, which we then optimize as far as possible. I would instead want to build AI systems that start with an adequate understanding of human preferences, and then learn more over time, in conjunction with optimizing for the preferences they know about. However, this paper is more along the former line of work, at least for long-term AI alignment.

I do think that this is a contribution to the field of inverse reinforcement learning -- it shows that by using an appropriate inductive bias, you can become more robust to (cognitive) biases in your dataset. It’s not clear how far this will generalize, since it was tested on simulated biases on simple environments, but I’d expect it to have at least a small effect. In practice though, I expect that you’d get better results by providing more information, as in T-REX (AN #54).

Cognitive Model Priors for Predicting Human Decisions (David D. Bourgin, Joshua C. Peterson et al) (summarized by Cody): Human decision making is notoriously difficult to predict, being a combination of expected value calculation and likely-not-fully-enumerated cognitive biases. Normally we could predict well using a neural net with a ton of data, but data about human decision making is expensive and scarce. This paper proposes that we pretrain a neural net on lots of data simulated from theoretical models of human decision making and then finetune on the small real dataset. In effect, we are using the theoretical model as a kind of prior, that provides the neural net with a strong inductive bias. The method achieves better performance than existing theoretical or empirical methods, without requiring feature engineering, both on existing datasets and a new, larger dataset collected via Mechanical Turk.
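The pretrain-then-finetune idea can be sketched with a deliberately tiny stand-in: a linear model instead of a neural net, and made-up functions `theory` and `human` instead of real cognitive models and real data. Everything named here is an assumption for illustration. In this toy the prior shows up as a better starting point under a small, fixed finetuning budget, which is only a loose analogue of the inductive bias described above.

```python
def predict(params, x):
    w, b = params
    return w * x + b

def mse(params, xs, ys):
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(params, xs, ys, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error, starting from params."""
    w, b = params
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)

def theory(x):
    """Hypothetical theoretical model of the decision (cheap to sample from)."""
    return 2.0 * x

def human(x):
    """Hypothetical real behavior, which deviates somewhat from the theory."""
    return 2.5 * x + 0.5

# Plenty of simulated data, very little real data.
sim_xs = [i / 10 for i in range(11)]
sim_ys = [theory(x) for x in sim_xs]
real_xs = [0.0, 0.5, 1.0]
real_ys = [human(x) for x in real_xs]
test_xs = [0.25, 0.75]
test_ys = [human(x) for x in test_xs]

pretrained = train((0.0, 0.0), sim_xs, sim_ys, steps=1000)   # learn the prior
finetuned = train(pretrained, real_xs, real_ys, steps=50)    # adapt to real data
scratch = train((0.0, 0.0), real_xs, real_ys, steps=50)      # same budget, no prior
```

Comparing `mse(finetuned, test_xs, test_ys)` with `mse(scratch, test_xs, test_ys)` shows the effect of starting from the theory-derived prior in this toy setting.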

Cody's opinion: I am a little cautious about making a strong statement about the importance of this paper, since I don't have as much domain knowledge in cognitive science as I do in machine learning, but overall this "treat your theoretical model like a generative model and sample from it" idea seems like an elegant and plausibly more broadly extensible way of incorporating theoretical priors alongside real data.

Miscellaneous (Alignment)

Self-confirming prophecies, and simplified Oracle designs (Stuart Armstrong): This post presents a toy environment to model self-confirming predictions by oracles, and demonstrates the results of running a deluded oracle (that doesn't realize its predictions affect the world), a low-bandwidth oracle (that must choose from a small set of possible answers), a high-bandwidth oracle (that can choose from a large set of answers) and a counterfactual oracle (that chooses the correct answer, conditional on us not seeing the answer).
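To make the distinction between these oracle types concrete, here is a small sketch. It does not reproduce the post's toy environment: the `outcome` function, the numbers and the answer sets are all hypothetical, chosen only so that showing a prediction changes the result.

```python
def outcome(shown_prediction=None):
    """Hypothetical world: the result moves toward any prediction that is shown."""
    base = 10.0
    if shown_prediction is None:  # the prediction is never revealed
        return base
    return base + 0.5 * shown_prediction

def error_if_shown(prediction):
    """How wrong a prediction turns out to be once people have seen it."""
    return abs(outcome(prediction) - prediction)

def best_answer(allowed_answers):
    """An oracle aware of its influence picks the allowed answer with the least error."""
    return min(allowed_answers, key=error_if_shown)

# Deluded oracle: predicts as if nobody will read the answer, but the answer is read.
deluded = outcome(None)
deluded_error = error_if_shown(deluded)

# Counterfactual oracle: predicts, and is scored on, the world where the answer stays hidden.
counterfactual = outcome(None)
counterfactual_error = abs(outcome(None) - counterfactual)

# Low-bandwidth oracle: only a handful of answers are allowed.
low_bandwidth = best_answer([0.0, 10.0, 20.0, 30.0])

# High-bandwidth oracle: a fine grid of answers from 0 to 100.
high_bandwidth = best_answer([i / 2 for i in range(201)])
```

In this toy world a prediction p is self-confirming when `outcome(p) == p`, which happens only at p = 20; both the low- and high-bandwidth oracles land there because that value is in their answer sets, while the deluded oracle is off by 5.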

Existential Risks: A Philosophical Analysis (Phil Torres): The phrase "existential risk" is often used in different ways. This paper considers the pros and cons of five different definitions.

Rohin's opinion: While this doesn't mention AI explicitly, I think it's useful to read anyway, because often which of the five concepts you use will affect what you think the important risks are.

AI strategy and policy

AGI will drastically increase economies of scale (Wei Dai): Economies of scale would normally mean that companies would keep growing larger and larger. With human employees, the coordination costs grow superlinearly, which ends up limiting the size to which a company can grow. However, with the advent of AGI, many of these coordination costs will be removed. If we can align AGIs to particular humans, then a corporation run by AGIs aligned to a single human would at least avoid principal-agent costs. As a result, the economies of scale would dominate, and companies would grow much larger, leading to more centralization.

Rohin's opinion: This argument is quite compelling to me under the assumption of human-level AGI systems that can be intent-aligned. Note though that while the development of AGI systems removes principal-agent problems, it doesn't remove issues that arise due to different agents having different (non-value-related) information.

The argument probably doesn't hold with CAIS (AN #40), where each AI service is optimized for a particular task, since there would be principal-agent problems between services.

It seems like the argument should mainly make us more worried about stable authoritarian regimes: the main effect based on this argument is a centralization of power in the hands of the AGI's overseers. This is less likely to happen with companies, because we have institutions that prevent companies from gaining too much power, though perhaps competition between countries could weaken such institutions. It could happen with government, but if long-term governmental power still rests with the people via democracy, that seems okay. So the risky situation seems to be when the government gains power, and the people no longer have effective control over government. (This would include scenarios with e.g. a government that has sufficiently good AI-fueled propaganda that they always win elections, regardless of whether their governing is actually good.)

Where are people thinking and talking about global coordination for AI safety? (Wei Dai)

Other progress in AI

Reinforcement learning

Unsupervised State Representation Learning in Atari (Ankesh Anand, Evan Racah, Sherjil Ozair et al) (summarized by Cody): This paper has two main contributions: an actual technique for learning representations in an unsupervised way, and an Atari-specific interface for giving access to the underlying conceptual state of the game (e.g. the locations of agents, locations of small objects, current remaining lives, etc) by parsing out the RAM associated with each state. Since the notional goal of unsupervised representation learning is often to find representations that can capture conceptually important features of the state without having direct access to them, this supervision system allows for more meaningful evaluation of existing methods by asking how well conceptual features can be predicted by learned representation vectors. The object-level method of the paper centers around learning representations that capture information about temporal state dynamics, which they do by maximizing mutual information between representations at adjacent timesteps. More specifically, they have both a local version of this, where a given 1/16th patch of the image has a representation that is optimized to be predictive of that same patch's next-timestep representation, and a local-global version, where the global representation is optimized to be predictive of representations of each patch. They argue this patch-level prediction makes their method better at learning concepts attached to small objects, and the empirical results do seem to support this interpretation.
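Mutual information between adjacent-timestep representations is usually maximized through a contrastive bound. The sketch below shows an InfoNCE-style loss of the kind used in Contrastive Predictive Coding, under simplifying assumptions: plain dot-product scoring, hand-written 2-d vectors in place of learned patch or global representations, and the other items in the batch acting as negatives. It is not the paper's implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(anchors, positives):
    """Mean cross-entropy of picking each anchor's own positive out of the batch.

    anchors[i] is a representation at time t; positives[i] is the matching
    representation at time t + 1. All other positives serve as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [dot(anchor, candidate) for candidate in positives]
        top = max(scores)  # log-sum-exp with the max subtracted for stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(anchors)

# Hypothetical representations (assumptions for illustration).
z_t = [[1.0, 0.0], [0.0, 1.0]]
z_next_aligned = [[2.0, 0.0], [0.0, 2.0]]    # each next state matches its anchor
z_next_shuffled = [[0.0, 2.0], [2.0, 0.0]]   # pairs deliberately mismatched

aligned_loss = info_nce_loss(z_t, z_next_aligned)
shuffled_loss = info_nce_loss(z_t, z_next_shuffled)
```

The loss is low when each representation identifies its own successor among the candidates and high when it cannot, so minimizing it pushes representations to carry information about temporal dynamics. In the local variant described above, the anchors and positives would be per-patch representations rather than whole-frame ones.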

Cody's opinion: The specific method is an interesting modification of previous Contrastive Predictive Coding work, but what I found most impressive about this paper was the engineering work involved in pulling metadata supervision signals out of the game by reading comments on disassembled source code to see exactly how metadata was being stored in RAM. This seems to have the potential of being a useful benchmark for Atari representation learning going forward (though admittedly Atari games are fairly conceptually straightforward to begin with).

Deep learning

XLNet: Generalized Autoregressive Pretraining for Language Understanding (Zhilin Yang, Zihang Dai et al): XLNet sets significantly improved state-of-the-art scores on many NLP tasks, beating out BERT. This was likely due to pretraining on significantly more data, though there are also architectural improvements.

News

Funding for Study and Training Related to AI Policy Careers: The Open Philanthropy Project has launched an AI policy scholarships program; the deadline for the first round is October 15.

Research Scholars Project Coordinator (Rose Hadshar): FHI is looking to hire a coordinator for the Research Scholars Programme. Application deadline is July 10.

Contest: $1,000 for good questions to ask to an Oracle AI (Stuart Armstrong)
