Listen to this newsletter on The Alignment Newsletter Podcast.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.









Draft report on existential risk from power-seeking AI (Joe Carlsmith) (summarized by Rohin): This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here’s the quick version of the argument: We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans, leading to an existential catastrophe. We will try and avert this, but will probably fail to do so since it is technically challenging and we are not capable of the necessary coordination.

There’s a lot of vague words in the argument above, so let’s introduce some terminology to make it clearer:

- Advanced capabilities: We say that a system has advanced capabilities if it outperforms the best humans on some set of important tasks (such as scientific research, business/military/political strategy, engineering, and persuasion/manipulation).

- Agentic planning: We say that a system engages in agentic planning if it (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world. This is a very broad definition and doesn’t have many of the connotations you might be used to for an agent. It does not need to be a literal planning algorithm -- for example, human cognition would count, despite (probably) not being just a planning algorithm.

- Strategically aware: We say that a system is strategically aware if it models the effects of gaining and maintaining power over humans and the real-world environment.

- PS-misaligned (power-seeking misaligned): On some inputs, the AI system seeks power in unintended ways due to problems with its objectives (if the system actually receives such inputs, then it is practically PS-misaligned).

The core argument is then that AI systems with advanced capabilities, agentic planning, and strategic awareness (APS-systems) will be practically PS-misaligned, to an extent that causes an existential catastrophe. Of course, we will try to prevent this -- why should we expect that we can’t fix the problem? The author considers possible remedies, and argues that they all seem quite hard:

- We could give AI systems the right objectives (alignment), but this seems quite hard -- it’s not clear how we would solve either outer or inner alignment.

- We could try to shape objectives to be e.g. myopic, but we don’t know how to do this, and there are strong incentives against myopia.

- We could try to limit AI capabilities by keeping systems special-purpose rather than general, but there are strong incentives for generality, and some special-purpose systems can be dangerous, too.

- We could try to prevent the AI system from improving its own capabilities, but this requires us to anticipate all the ways the AI system could improve, and there are incentives to create systems that learn and change as they gain experience.

- We could try to control the deployment situations to be within some set of circumstances where we know the AI system won’t seek power. However, this seems harder and harder to do as capabilities increase, since with more capabilities, more options become available.

- We could impose a high threshold of safety before an AI system is deployed, but the AI system could still seek power during training, and there are many incentives pushing for faster, riskier deployment (even if we have already seen warning shots).

- We could try to correct the behavior of misaligned AI systems, or mitigate their impact, after deployment. This seems like it requires humans to have comparable or superior power to the misaligned systems in question, though; and even if we are able to correct the problem at one level of capability, we need solutions that scale as our AI systems become more powerful.

The author breaks the overall argument into six conjunctive claims, assigns probabilities to each of them, and ends up computing a 5% probability of existential catastrophe from misaligned, power-seeking AI by 2070. This is a lower bound, since the six claims together add a fair number of assumptions, and there can be risk scenarios that violate these assumptions, and so overall the author would shade upward another couple of percentage points.

Rohin's opinion: This is a great investigation of the typical argument for existential risk from AI systems adversarially optimizing against humans. When I put my own numbers in without looking at Joe’s numbers, I got a 3% chance of existential catastrophe by 2070 through the argument in this post, though I think I underestimated the probability for claim (4) so I’d now get something more like 4%. (The main difference from Joe’s 5% is that I am more optimistic about possible remedies, though of course these differences are tiny relative to our high overall uncertainty.)

Comments on Carlsmith's “Is power-seeking AI an existential risk?” (Nate Soares) (summarized by Rohin): This response to the report above touches on many topics, but has three main object-level disagreements and one meta-level disagreement:

1. The author has significantly shorter timelines, though this is based on a very different argument structure than the one presented in the report above, and so it is hard to turn this into more concrete disagreements with the report.

2. The author expects that alignment is hard enough that we won’t solve it in time (which is not to say that it is harder than every other technical problem humanity has ever faced). It’s also not clear how to turn this into more concrete disagreements with the report.

3. The author does not expect to have warning shots where misaligned AI systems cause trillions of dollars of damage but don’t cause an existential catastrophe, because this seems like too narrow a capability range for us to hit in practice. Even if there are warning shots, he expects that civilization will continue to deploy risky AI systems anyway, similarly to how we are not banning gain-of-function research despite the warning shot of COVID-19.

4. On the meta level, the author expects that the decomposition of the AI risk argument into six conjunctive claims will typically bias you towards giving too low a probability on the overall conjunction.



The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (Anonymous) (summarized by Zach): Reward hacking occurs when RL agents exploit the difference between a true reward and a proxy. Reward hacking has been observed in practice (AN #1), and as reinforcement learning agents are trained with better algorithms, more data, and larger policies, they are at increased risk of overfitting their proxy objectives. However, reward hacking has not yet been systematically studied.

This paper fills this gap by constructing four example environments with a total of nine proxy rewards to investigate how reward hacking changes as a function of optimization power. They increase optimization power in several different ways, such as increasing the size of the neural net, or providing the model with more fine-grained observations.

Overall, the authors find that reward hacking occurs in five of the nine cases. Moreover, the authors observed phase transitions in four of these cases. These are stark transitions where a moderate increase in optimization power leads to a drastic increase in reward hacking behavior. This poses a challenge in monitoring the safety of ML systems. To address this the authors suggest performing anomaly detection to notice reward hacking and offer several baselines.

Zach's opinion: It is good to see an attempt at formalizing reward hacking. The experimental contributions are interesting and the anomaly detection method seems reasonable. However, the proxy rewards chosen to represent reward hacking are questionable. In my opinion, these rewards are obviously 'wrong' so it is less surprising that they result in undesired behavior. I look forward to seeing more comprehensive experiments on this subject.

Rohin’s opinion: Note that on OpenReview, the authors say that one of the proxy rewards (maximize average velocity for the driving environment) was actually the default and they only noticed it was problematic after they had trained large neural nets on that environment. I do agree that future proxy objectives will probably be less clearly wrong than most of the ones in this paper.



Shaking the foundations: delusions in sequence models for interaction and control (Pedro A. Ortega et al) (summarized by Robert): Delusions in language models (LMs) like GPT-3 occur when an incorrect generation early on throws the LM off the rails later. Specifically, if there is some unobserved context that influences how humans generate text that the LM is unaware of, then the LM will generate some plausible text -- and then take that text as evidence about what the unobserved context must be. This can be especially likely when the desired context or task for the generation is difficult to infer from the input. In these settings the human generating the text has access to a lot more information than the model, making generation harder for the model and delusions more likely: an incorrect generation will make it more likely that the model infers the task or context incorrectly. This also applies to sequence modelling approaches in RL like Decision Transformer (AN #153) and Trajectory Transformer (AN #153), where incorrectly chosen actions could change the model's beliefs about optimal future actions.

This work explains this problem using tools from causality and argues that these models should act as if their previous actions are causal interventions rather than observations. However, training a model in this way requires access to a model of the environment and the expert demonstrating trajectories in an online way, and the authors don't describe a way to do this with purely offline data (it may be fundamentally impossible). The authors do argue that in settings where the context or task information can be easily extracted from the observations so far, then delusions are less likely. This points to the importance of prompt engineering, or providing context information in another way to sequence models, so that they don't delude themselves.

Robert's opinion: Understanding specific failure modes of large language model generation seems useful, and the detailed mathematical explanation here makes it easier to understand what exactly the problem is, and what we can do to fix it. I'd be interested to see whether we can distinguish delusions from other failure modes and measure what proportion of failures are delusions (although failure modes likely can’t be as cleanly divided as I’m implying here). However, it seems fundamentally very difficult to train using offline data in a way that the model does learn to understand its own actions as interventions, so other solutions may need to be found.


GovAI Summer 2022 Fellowships (summarized by Rohin): Applications are now open for the GovAI 2022 Summer Fellowship! This is an opportunity for early-career individuals to spend three months working on an AI governance research project, learning about the field, and making connections with other researchers and practitioners. Application deadline is Jan 1.

Foundations of Cooperative AI Lab (summarized by Rohin): This new lab at CMU aims to create foundations of game theory appropriate for advanced, autonomous AI agents -- think of work on agent foundations and cooperative AI (AN #133). Apply for a PhD here (deadline Dec 9) or for a postdoc here.

Public reports are now optional for EA Funds grantees (Asya Bergal and Jonas Vollmer) (summarized by Rohin): This is your regular reminder that you can apply to the Long-Term Future Fund (and the broader EA Funds) for funding for a wide variety of projects. They have now removed the requirement for public reporting of your grant. They encourage you to apply if you have a preference for private funding.

Sydney AI Safety Fellowship (casebash) (summarized by Rohin): This 7-week fellowship will provide fellows from Australia and New Zealand the opportunity to pursue projects in AI Safety or spend time upskilling. Applications are due December 14.

New Comment