Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
The "most important century" series (Holden Karnofsky) (summarized by Rohin): In some sense, it is really weird for us to claim that there is a non-trivial chance that in the near future, we might build transformative AI and either (1) go extinct or (2) exceed a growth rate of (say) 100% per year. It feels like an extraordinary claim, and thus should require extraordinary evidence. One way of cashing this out: if the claim were true, this century would be the most important century, with the most opportunity for individuals to have an impact. Given the sheer number of centuries there are, this is an extraordinary claim; it should really have extraordinary evidence. This series argues that while the claim does seem extraordinary, all views seem extraordinary -- there isn’t some default baseline view that is “ordinary” to which we should be assigning most of our probability.
Specifically, consider three possibilities for the long-run future:
1. Radical: We will have a productivity explosion by 2100, which will enable us to become technologically mature. Think of a civilization that sends spacecraft throughout the galaxy, builds permanent settlements on other planets, harvests large fractions of the energy output from stars, etc.
2. Conservative: We get to a technologically mature civilization, but it takes hundreds or thousands of years. Let’s say even 100,000 years to be ultra conservative.
3. Skeptical: We never become technologically mature for some reason. Perhaps we run into fundamental technological limits, or we choose not to expand into the galaxy, or we’re in a simulation, etc.
It’s pretty clear why the radical view is extraordinary. What about the other two?
The conservative view implies that we are currently in the most important 100,000-year period. Given that life is billions of years old, and would presumably continue for billions of years to come once we reach a stable galaxy-wide civilization, that would make this the most important 100,000 year period out of tens of thousands of such periods. Thus the conservative view is also extraordinary, for the same reason that the radical view is extraordinary (albeit it is perhaps only half as extraordinary as the radical view).
The skeptical view by itself does not seem obviously extraordinary. However, while you could assign 70% probability to the skeptical view, it seems unreasonable to assign 99% probability to such a view -- that suggests some very strong or confident claims about what prevents us from colonizing the galaxy, which we probably shouldn’t have given our current knowledge. So, we need to have a non-trivial chunk of probability on the other views, which still opens us up to critique of having extraordinary claims.
Okay, so we’ve established that we should at least be willing to say something as extreme as “there’s a non-trivial chance we’re in the most important 100,000-year period”. Can we tighten the argument, to talk about the most important century? In fact, we can, by looking at the economic growth rate.
You are probably aware that the US economy grows around 2-3% per year (after adjusting for inflation), so a business-as-usual, non-crazy, default view might be to expect this to continue. You are probably also aware that exponential growth can grow very quickly. At the lower end of 2% per year, the economy would double every ~35 years. If this continued for 8200 years, we'd need to be sustaining multiple economies as big as today's entire world economy per atom in the galaxy. While this is not a priori impossible, it seems quite unlikely to happen. This suggests that we’re in one of fewer than 82 centuries that will have growth rates at 2% or larger, making it far less “extraordinary” to claim that we’re in the most important one, especially if you believe that growth rates are well correlated with change and ability to have impact.
The actual radical view that the author places non-trivial probability on is one we’ve seen before in this newsletter: it is one in which there is automation of science and technology through advanced AI or whole brain emulations or other possibilities. This allows technology to substitute for human labor in the economy, which produces a positive feedback loop as the output of the economy is ploughed back into the economy creating superexponential growth and a “productivity explosion”, where the growth rate increases far beyond 2%. The series summarizes and connects together many (AN #105), past (AN #154), Open (AN #121), Phil (AN #118) analyses (AN #145), which I won't be summarizing here (since we've summarized these analyses previously). While this is a more specific and “extraordinary” claim than even the claim that we live in the most important century, it seems like it should not be seen as so extraordinary given the arguments above.
This series also argues for a few other points important to longtermism, which I’ll copy here:
1. The long-run future is radically unfamiliar. Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between.
2. The long-run future could come much faster than we think, due to a possible AI-driven productivity explosion. (I briefly mentioned this above, but the full series devotes much more space and many more arguments to this point.)
3. We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, we aren't ready for this.
Read more: 80,000 Hours podcast on the topic
Rohin's opinion: I especially liked this series for the argument that 2% economic growth very likely cannot last much longer, providing quite a strong argument for the importance of this century, without relying at all on controversial facts about AI. At least personally I was previously uneasy about how “grand” or “extraordinary” AGI claims tend to be, and whether I should be far more skeptical of them as a result. I feel significantly more comfortable with these claims after seeing this argument.
Note though that it does not defuse all such uneasiness -- you can still look at how early we appear to be (given the billions of years of civilization that could remain in the future), and conclude that the simulation hypothesis is true, or that there is a Great Filter in our future that will drive us extinct with near-certainty. In such situations there would be no extraordinary impact to be had today by working on AI risk.
TECHNICAL AI ALIGNMENT
Why AI alignment could be hard with modern deep learning (Ajeya Cotra) (summarized by Rohin): This post provides an ELI5-style introduction to AI alignment as a major challenge for deep learning. It primarily frames alignment as a challenge in creating Saints (aligned AI systems), without getting Schemers (AI systems that are deceptively aligned (AN #58)) or Sycophants (AI systems that satisfy only the letter of the request, rather than its spirit, as in Another (outer) alignment failure story (AN #146)). Any short summary I write would ruin the ELI5 style, so I won’t attempt it; I do recommend it strongly if you want an introduction to AI alignment.
LEARNING HUMAN INTENT
B-Pref: Benchmarking Preference-Based Reinforcement Learning (Kimin Lee et al) (summarized by Zach): Deep RL has become a powerful method to solve a variety of sequential decision tasks using a known reward function for training. However, in practice, rewards are hard to specify making it hard to scale Deep RL for many applications. Preference-based RL provides an alternative by allowing a teacher to indicate preferences between a pair of behaviors. Because the teacher can interactively give feedback to an agent, preference-based RL has the potential to help address this limitation of Deep RL. Despite the advantages of preference-based RL it has proven difficult to design useful benchmarks for the problem. This paper introduces a benchmark (B-Pref) that is useful for preference-based RL in various locomotion and robotic manipulation tasks.
One difficulty with designing a useful benchmark is that teachers may have a variety of irrationalities. For example, teachers might be myopic or make mistakes. The B-Pref benchmark addresses this by emphasizing measuring performance under a variety of teacher irrationalities. They do this by providing various performance metrics to introduce irrationality into otherwise deterministic reward criteria. While previous approaches to preference-based RL work well when the teacher responses are consistent, experiments show they are not robust to feedback noise or teacher mistakes. Experiments also show that how queries are selected has a major impact on performance. With these results, the authors identify these two problems as areas for future work.
Zach's opinion: While the authors do a good job advocating for the problem of preference-based RL, I'm less convinced their particular benchmark is a large step forward. In particular, it seems the main contribution is not a suite of tasks, but rather a collection of different ways to add irrationality to the teacher oracle. The main takeaway of this paper is that current algorithms don't seem to perform well when the teacher can make mistakes, but this is quite similar to having a misspecified reward function. Beyond that criticism, the experiments support the areas suggested for future work.
Redwood Research’s current project (Buck Shlegeris) (summarized by Rohin): This post introduces Redwood Research’s current alignment project: to ensure that a language model finetuned on fanfiction never describes someone getting injured, while maintaining the quality of the generations of that model. Their approach is to train a classifier that determines whether a given generation has a description of someone getting injured, and then to use that classifier as a reward function to train the policy to generate non-injurious completions. Their hope is to learn a general method for enforcing such constraints on models, such that they could then quickly train the model to, say, never mention anything about food.
Distinguishing AI takeover scenarios (Sam Clarke et al) (summarized by Rohin): This post summarizes several AI takeover scenarios that have been proposed and categorizes them according to three main variables. Speed refers to the question of whether there is a sudden jump in AI capabilities. Uni/multipolarity asks whether a single AI system takes over, or many. Alignment asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. They also analyze other properties of the scenarios, such as how agentic, general and/or homogenous the AI systems are, and whether AI systems coordinate with each other or not. A followup post investigates social, economic, and technological characteristics of these scenarios. It also generates new scenarios by varying some of these factors.
Since these posts are themselves summaries and comparisons of previously proposed scenarios that we’ve covered in this newsletter, I won’t summarize them here, but I do recommend them for an overview of AI takeover scenarios.
Beyond fire alarms: freeing the groupstruck (Katja Grace) (summarized by Rohin): It has been claimed that there’s no fire alarm for AGI, that is, there will be no specific moment or event at which AGI risk becomes sufficiently obvious and agreed upon, so that freaking out about AGI becomes socially acceptable rather than embarrassing. People often implicitly argue for waiting for an (unspecified) future event that tells us AGI is near, after which everyone will know that it’s okay to work on AGI alignment. This seems particularly bad if no such future event (i.e. fire alarm) exists.
This post argues that this is not in fact the implicit strategy that people typically use to evaluate and respond to risks. In particular, it is too discrete. Instead, people perform “the normal dance of accumulating evidence and escalating discussion and brave people calling the problem early and eating the potential embarrassment”. As a result, the existence of a “fire alarm” is not particularly important.
Note that the author does agree that there is some important bias at play here. The original fire alarm post is implicitly considering a fear shame hypothesis: people tend to be less cautious in public because they expect to be negatively judged for looking scared. The author ends up concluding that there is something broader going on and proposes a few possibilities, many of which still suggest that people will tend to be less cautious around risks when they are observed.
Some points made in the very detailed, 15,000-word article:
1. Literal fire alarms don’t work by creating common knowledge, or by providing evidence of a fire. People frequently ignore fire alarms. In one experiment, participants continued to fill out questionnaires while a fire alarm rang, often assuming that someone will lead them outside if it is important.
2. They probably instead work by a variety of mechanisms, some of which are related to the fear shame hypothesis. Sometimes they provide objective evidence that is easier to use as a justification for caution than a personal guess. Sometimes they act as an excuse for cautious or fearful people to leave, without the implication that those people are afraid. Sometimes they act as a source of authority for a course of action (leaving the building).
3. Most of these mechanisms are amenable to partial or incremental effects, and in particular can happen with AGI risk. There are many people who have already boldly claimed that AGI risk is a problem. There exists person-independent evidence; for example, surveys of AI researchers suggest a 5% chance of extinction.
4. For other risks, there does not seem to have been a single discrete moment at which it became acceptable to worry about them (i.e. no “fire alarm”). This includes risks where there has been a lot of caution, such as climate change, the ozone hole, recombinant DNA, COVID, and nuclear weapons.
5. We could think about building fire alarms; many of the mechanisms above are social ones rather than empirical facts about the world. This could be one out of many strategies that we employ against the general bias towards incaution (the post suggests 16).
Rohin's opinion: I enjoyed this article quite a lot; it is really thorough. I do see a lot of my own work as pushing on some of these more incremental methods for increasing caution, though I think of it more as a combination of generating more or better evidence and communicating arguments in a manner more suited to a particular audience. Perhaps I will think of new strategies that aim to reduce fear shame instead.
Seeking social science students / collaborators interested in AI existential risks (Vael Gates) (summarized by Rohin): This post presents a list of research questions around existential risk from AI that can be tackled by social scientists. The author is looking for collaborators to expand the list and tackle some of the questions on it, and is aiming to provide some mentorship for people getting involved.
[Job ad] Research important longtermist topics at Rethink Priorities! (Linch Zhang) (summarized by Rohin): Of particular interest to readers, there are roles available in AI governance and strategy. The application deadline is Oct 24.
I don't think I agree with this—in particular, it seems like even given the simulation hypothesis, there could still be quite a lot of value to be had from influencing how that simulation goes. For example, if you think you're in an acausal trade simulation, succeeding in building aligned AI would have the effect of causing the simulation runner to trade with an aligned AI rather than a misaligned one, which could certainly have an “extraordinary impact.”
Yeah, I agree the statement is false as I literally wrote it, though what I meant was that you could easily believe you are in the kind of simulation where there is no extraordinary impact to have.