Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


AI Governance: Opportunity and Theory of Impact (Allan Dafoe) (summarized by Rohin): What is the theory of change for work on AI governance? Since the world is going to be vastly complicated by the broad deployment of AI systems in a wide variety of contexts, several structural risks will arise. AI governance research can produce “assets” (e.g. policy expertise, strategic insights, important networking connections, etc) that help humanity make better decisions around these risks. Let’s go into more detail.

A common perspective about powerful AI is the “superintelligence” perspective, in which we assume there is a single very cognitively powerful AI agent. This leads people to primarily consider “accident” and “misuse” risks, in which either the AI agent itself “wants” to harm us, or some bad actor uses the AI agent to harm us.

However, it seems likely that we should think of an ecology of AI agents, or AI as a general purpose technology (GPT), as in e.g. CAIS (AN #40) or Age of Em. In this case, we can examine the ways in which narrow AI could transform social, military, economic, and political systems, and the structural risks that may arise from that. Concrete examples of potential existential structural risks induced by AI include nuclear instability, geopolitical turbulence, authoritarianism, and value erosion through competition.

A key point about the examples above is that the relevant factors for each are different. For example, for nuclear instability, it is important to understand nuclear deterrence, first strike vulnerability and how it could change with AI processing of satellite imagery, undersea sensors, cyber surveillance and weapons, etc. In contrast, for authoritarianism, relevant processes include global winner-take-all-markets, technological displacement of labor, and authoritarian surveillance and control.

This illustrates a general principle: unlike in the superintelligence perspective, the scope of both risks and solutions in the ecology / GPT perspectives is very broad. As a result, we need a broad range of expertise and lots of connections with existing fields of research. In particular, “we want to build a metropolis -- a hub with dense connections to the broader communities of computer science, social science, and policymaking -- rather than an isolated island”.

Another important aspect here is that in order to cause better decisions to be made, we need to focus not just on generating the right ideas, but also on ensuring the right ideas are in the right places at the right time (e.g. by ensuring that people with the right tacit knowledge are part of the decision-making process). Instead of the “product model” of research that focuses on generating good ideas, we might instead want a “field-building model”, which also places emphasis on improving researchers’ competence on a variety of issues, bestowing prestige and authority on those who have good perspectives on long-term risks, improving researchers’ networks, and training junior researchers. However, often it is best to focus on the product model of research anyway, and get these benefits as a side effect.

To quote the author: “I think there is a lot of useful work that can be done in advance, but most of the work involves us building our competence, capacity, and credibility, so that when the time comes, we are in position and ready to formulate a plan. [...] Investments we make today should increase our competence in relevant domains, our capacity to grow and engage effectively, and the intellectual credibility and policy influence of competent experts.”

Rohin's opinion: See the next summary. Note also that the author is organizing the Cooperative AI Workshop (AN #116) to tackle some of these issues.

Andrew Critch on AI Research Considerations for Human Existential Safety (Lucas Perry and Andrew Critch) (summarized by Rohin): This podcast discusses the recent ARCHES (AN #103) document, and several thoughts surrounding it. There’s a lot in here that I won’t summarize, including a bunch of stuff that was in the summary of ARCHES. I’m going to focus primarily on the (substantial) discussion of how to prioritize within the realm of possible risks related in some way to AI systems.

Firstly, let’s be clear about the goal: ensuring existential safety, that is, making sure human extinction never happens. Note the author means literal extinction, as opposed to something like “the loss of humanity’s long-term potential”, because the former is clearer. While it is not always clear whether something counts as “extinction” (what if we all become uploads?), it is a lot clearer than whether a scenario counts as a loss of potential.

Typical alignment work focuses on the “single-single” case, where a single AI system must be aligned with a single human, as in e.g. intent alignment (AN #33). However, this isn’t ultimately what we care about: we care about multi-multi existential safety, that is, ensuring that when multiple AI systems act in a world with multiple humans, extinction does not happen. There are pretty significant differences between these: in particular, it’s not clear whether multi-multi “alignment” even has meaning, since it is unclear whether it makes sense to view humanity as an agent to which an AI system could be “aligned”.

Nonetheless, single-single alignment seems like an important subproblem of multi-multi existential safety: we will be delegating to AI systems in the future; it seems important that we know how to do so. How do we prioritize between single-single alignment, and the other subproblems of multi-multi existential safety? A crucial point is that single-single work will not be neglected, because companies have strong incentives to solve single-single alignment (both in the sense of optimizing for the right thing, and for being robust to distributional shift). In contrast, in multi-multi systems, it is often the case that there is a complex set of interacting effects that lead to some negative outcome, and there is no one actor to blame for the negative outcome, and as a result it doesn’t become anybody’s job to prevent that negative outcome.

For example, if you get a huge medical bill because the necessary authorization forms hadn’t been filled out, whose fault is it? Often in such cases there are many people to blame: you could blame yourself for not checking the authorization, or you could blame the doctor’s office for not sending the right forms or for not informing you that the authorization hadn’t been obtained, etc. Since it’s nobody’s job to fix such problems, they are and will remain neglected, and so work on them is more impactful.

Something like transparency is in a middle ground: it isn’t profitable yet, but probably will be soon. So, if someone were indifferent between a bunch of areas of research, the author would advise prioritizing e.g. multi-stakeholder delegation over transparency, and transparency over robustness. However, the author emphasizes that it’s far more important that people work in some area of research that they find intellectually enriching and relevant to existential safety.

The podcast has lots of other points; here is an incomplete quick selection of them:

- In a multi-multi world, without good coordination you move the world in a “random” direction. There are a lot of variables which have to be set just right for humans to survive (temperature, atmospheric composition, etc) that are not as important for machines. So sufficiently powerful systems moving the world in a “random” direction will lead to human extinction.

- One response to the multi-multi challenge is to have a single group make a powerful AI system and “take over the world”. This approach is problematic since many people will oppose such a huge concentration of power. In addition, it is probably not desirable even if possible, since it reduces robustness by creating a single point of failure.

- Another suggestion is to create a powerful AI system that protects humanity (but is still uncontrollable in that humanity cannot stop its operation). The author does not like this solution much, because if we get it wrong and deploy a misaligned uncontrollable AI system, then we definitely die. The author prefers that we instead always have control over the AI systems we deploy.

Rohin's opinion: Both this and the previous summary illustrate an increasingly common perspective:

1. The world is not going to look like “today’s world plus a single AGI agent”: instead, we will likely have a proliferation of many different AI systems specialized for different purposes.

2. In such a world, there are a lot of different challenges that aren’t standard intent alignment.

3. We should focus on these other challenges because [a variety of reasons].

If you have technical CS skills, how should you prioritize between this perspective and the more classical intent alignment perspective?

Importance. I’ve estimated (AN #80) a 10% chance of existential catastrophe via a failure of intent alignment, absent intervention from longtermists to address intent alignment. Estimates vary quite a lot, even among people who have thought about the problem a lot; I’ve heard as low as < 1% and as high as 80% (though these usually don’t assume “no intervention from longtermists”).

It’s harder to estimate the importance of structural risks and extinction risks highlighted in the two summaries above, but the arguments in the previous two posts seem reasonably compelling and I think I’d be inclined to assign a similar importance to it (i.e. similar probability of causing an existential catastrophe).

Note that this means I’m disagreeing with Critch: he believes that we are far more likely to go extinct through effects unique to multi-multi dynamics; in contrast I find the argument less persuasive because we do have governance, regulations, national security etc. that would already be trying to mitigate issues that arise in multi-multi contexts, especially things that could plausibly cause extinction.

Neglectedness. I’ve already taken into account neglectedness outside of EA in estimating the probabilities for importance. Within EA there is already a huge amount of effort going into intent alignment, and much less in governance and multi-multi scenarios -- perhaps a difference of 1-2 orders of magnitude; the difference is even higher if we only consider people with technical CS skills.

Tractability. I buy the argument in Dafoe’s article that for AI governance, due to our vast uncertainty, we need a “metropolis” model where field-building is quite important; I think that implies that solving the full problem (at today's level of knowledge) would require a lot of work and building of expertise. In contrast, with intent alignment, we have a single technical problem with significantly less uncertainty. As a result, I expect that a single unit of work currently goes further, in expectation, toward solving intent alignment than toward solving structural risks / multi-multi problems, and so intent alignment is more tractable.

I also expect technical ideas to be a bigger portion of "the full solution" in the case of intent alignment -- as Dafoe argues, I expect that for structural risks the solution looks more like "we build expertise and this causes various societal decisions to go better" as opposed to "we figure out how to write this piece of code differently so that it does better things". This doesn't have an obvious impact on tractability -- if anything, I'd guess it argues in favor of the tractability of work on structural risks, because it seems easier to me to create prestigious experts in particular areas than to make progress on a challenging technical problem whose contours are still uncertain since it arises primarily in the future.

I suspect that I disagree with Critch here: I think he is more optimistic about technical solutions to multi-multi issues themselves being useful. In the past I think humanity has resolved such issues via governance and regulations and it doesn’t seem to have relied very much on technical research; I’d expect that trend to continue.

Personal fit. This is obviously important, but there isn’t much in general for me to say about it.

Once again, I should note that this is all under the assumption that you have technical CS skills. I think overall I end up pretty uncertain which of the two areas I’d advise going into (assuming personal fit was equal in both areas). However, if you are more of a generalist, I feel much more inclined to recommend choosing some subfield of AI governance, again subject to personal fit, and Critch agrees with this.



Comparing Utilities (Abram Demski) (summarized by Rohin): This is a reference post about preference aggregation across multiple individually rational agents (in the sense that they have VNM-style utility functions), which explains the following points (among others):

1. The concept of “utility” in ethics is somewhat overloaded. The “utility” in hedonic utilitarianism is very different from the VNM concept of utility. The concept of “utility” in preference utilitarianism is pretty similar to the VNM concept of utility.

2. Utilities are not directly comparable, because affine transformations of utility functions represent exactly the same set of preferences. Without any additional information, concepts like “utility monster” are type errors.

3. However, our goal is not to compare utilities, it is to aggregate people’s preferences. We can instead impose constraints on the aggregation procedure.

4. If we require that the aggregation procedure produces a Pareto-optimal outcome, then Harsanyi’s utilitarianism theorem says that our aggregation procedure can be viewed as maximizing some linear combination of the utility functions.

5. We usually want to incorporate some notion of fairness. Different specific assumptions lead to different results, including variance normalization, Nash bargaining, and Kalai-Smorodinsky.
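
As a rough illustration of points 2, 4 and 5, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the post: the agents `alice` and `bob`, their utility numbers, the three outcomes, and the choice to normalize variance under a uniform distribution over outcomes. It shows that a positive affine rescaling leaves one agent's preferences over lotteries unchanged, that a naive unweighted sum of utilities is sensitive to that rescaling, and that normalizing each utility function first is one way to remove the sensitivity.

```python
from statistics import mean, pstdev

# Hypothetical outcomes and utilities (illustrative numbers only).
OUTCOMES = ["A", "B", "C"]
alice = {"A": 0.0, "B": 0.6, "C": 1.0}
bob = {"A": 1.0, "B": 0.7, "C": 0.0}


def affine(u, scale, shift):
    """Return scale * u + shift, which represents the same preferences if scale > 0."""
    if scale <= 0:
        raise ValueError("scale must be positive to preserve preferences")
    return {o: scale * v + shift for o, v in u.items()}


def expected_utility(u, lottery):
    """Expected utility of a lottery given as {outcome: probability}."""
    return sum(p * u[o] for o, p in lottery.items())


def prefers(u, first, second):
    """True if an agent with utility u strictly prefers lottery `first` to `second`."""
    return expected_utility(u, first) > expected_utility(u, second)


def best_by_weighted_sum(utilities, weights=None):
    """Outcome maximizing a linear combination of utilities (the Harsanyi form)."""
    weights = weights or [1.0] * len(utilities)
    return max(
        OUTCOMES,
        key=lambda o: sum(w * u[o] for w, u in zip(weights, utilities)),
    )


def variance_normalize(u):
    """Rescale u to zero mean and unit variance, assuming a uniform
    distribution over OUTCOMES (an assumption made for this sketch)."""
    values = list(u.values())
    mu, sigma = mean(values), pstdev(values)
    return {o: (v - mu) / sigma for o, v in u.items()}


# Same preferences for Alice, expressed on a different scale.
rescaled_alice = affine(alice, scale=100.0, shift=5.0)

coin_flip = {"A": 0.5, "C": 0.5}
sure_b = {"B": 1.0}

# Alice's own preferences do not change under the rescaling.
print(prefers(alice, sure_b, coin_flip), prefers(rescaled_alice, sure_b, coin_flip))

# The naive unweighted sum does change.
print(best_by_weighted_sum([alice, bob]), best_by_weighted_sum([rescaled_alice, bob]))

# After normalization the aggregate choice no longer depends on the scale.
print(
    best_by_weighted_sum([variance_normalize(alice), variance_normalize(bob)]),
    best_by_weighted_sum([variance_normalize(rescaled_alice), variance_normalize(bob)]),
)
```

With these made-up numbers, the unweighted sum picks B before the rescaling and C after it, even though nothing about Alice's preferences has changed. Seen this way, choosing the weights in the linear combination amounts to choosing a scale for each agent, which is where the fairness assumptions in point 5 come in.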


How Much Computational Power It Takes to Match the Human Brain (Joseph Carlsmith) (summarized by Asya): In this blog post, Joseph Carlsmith gives a summary of his longer report estimating the number of floating point operations per second (FLOP/s) which would be sufficient to perform any cognitive task that the human brain can perform. He considers four different methods of estimation.

Using the mechanistic method, he estimates the FLOP/s required to model the brain’s low-level mechanisms at a level of detail adequate to replicate human task-performance. He does this by estimating that ~1e13 - 1e17 FLOP/s is enough to replicate what he calls “standard neuron signaling” (neurons signaling to each other via electrical impulses at chemical synapses) and learning in the brain, and arguing that including the brain’s other signaling processes would not meaningfully increase these numbers. He also suggests that various considerations point weakly to the adequacy of smaller budgets.

Using the functional method, he identifies a portion of the brain whose function we can approximate with computers, and then scales up to FLOP/s estimates for the entire brain. One way to do this is by scaling up models of the human retina: Hans Moravec's estimates for the FLOP/s of the human retina imply 1e12 - 1e15 FLOP/s for the entire brain, while recent deep neural networks that predict retina cell firing patterns imply 1e16 - 1e20 FLOP/s.

Another way to use the functional method is to assume that current image classification networks with known FLOP/s requirements do some fraction of the computation of the human visual cortex, adjusting for the increase in FLOP/s necessary to reach robust human-level classification performance. Assuming somewhat arbitrarily that 0.3% to 10% of what the visual cortex does is image classification, and that the EfficientNet-B2 image classifier would require a 10x to 1000x increase in frequency to reach fully human-level image classification, he gets 1e13 - 3e17 implied FLOP/s to run the entire brain. Joseph holds the estimates from this method very lightly, though he thinks that they weakly suggest that the 1e13 - 1e17 FLOP/s estimates from the mechanistic method are not radically too low.

Using the limit method, Joseph uses the brain’s energy budget, together with physical limits set by Landauer’s principle, which specifies the minimum energy cost of erasing bits, to upper-bound required FLOP/s to ~7e21. He notes that this relies on arguments about how many bits the brain erases per FLOP, which he and various experts agree is very likely to be > 1 based on arguments about algorithmic bit erasures and the brain's energy dissipation.
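
The arithmetic behind a bound of this kind can be sketched in a few lines of Python. The inputs below are assumptions and do not come from the summary: a brain power budget of roughly 20 W and a temperature of roughly 310 K are commonly cited figures, and the function names are hypothetical. Under those assumptions the Landauer limit gives a number consistent with the ~7e21 quoted above.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)

# Assumptions for this sketch, not taken from the summary:
BRAIN_POWER_W = 20.0  # commonly cited estimate of the brain's power budget
BODY_TEMP_K = 310.0   # approximate human body temperature


def landauer_energy_per_bit(temp_k):
    """Minimum energy in joules to erase one bit at temperature temp_k: k * T * ln(2)."""
    return BOLTZMANN_J_PER_K * temp_k * math.log(2)


def max_bit_erasures_per_second(power_w, temp_k):
    """Upper bound on bit erasures per second for a given power budget."""
    return power_w / landauer_energy_per_bit(temp_k)


def flops_upper_bound(power_w, temp_k, erasures_per_flop=1.0):
    """Upper bound on FLOP/s if each FLOP erases at least erasures_per_flop bits."""
    return max_bit_erasures_per_second(power_w, temp_k) / erasures_per_flop


bound = flops_upper_bound(BRAIN_POWER_W, BODY_TEMP_K)
print(f"{bound:.1e} FLOP/s")  # prints "6.7e+21 FLOP/s"
```

The `erasures_per_flop` parameter makes the dependence on the bits-per-FLOP argument explicit: the bound only holds as stated if each FLOP erases at least one bit, and it gets tighter as that number grows.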

Lastly, Joseph briefly describes the communication method, which uses the communication bandwidth in the brain as evidence about its computational capacity. Joseph thinks this method faces a number of issues, but some extremely preliminary estimates suggest 1e14 FLOP/s based on comparing the brain to a V100 GPU, and 1e16 - 3e17 FLOP/s based on estimating the communication capabilities of brains in traversed edges per second (TEPS), a metric normally used for computers, and then converting to FLOP/s using the TEPS to FLOP/s ratio in supercomputers.

Overall, Joseph thinks it is more likely than not that 1e15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create). And he thinks it's unlikely (<10%) that more than 1e21 FLOP/s is required. For reference, an NVIDIA V100 GPU performs up to 1e14 FLOP/s (although FLOP/s is not the only metric which differentiates two computational systems).

Read more: Full Report: How Much Computational Power Does It Take to Match the Human Brain?

Asya's opinion: I really liked this post, although I haven't gotten a chance to get through the entire full-length report. I found the reasoning extremely legible and transparent, and there's no place where I disagree with Joseph's estimates or conclusions. See also Import AI's summary.


The "Backchaining to Local Search" Technique in AI Alignment (Adam Shimi) (summarized by Rohin): This post explains a technique to use in AI alignment, which the author dubs “backchaining to local search” (where local search refers to techniques like gradient descent and evolutionary algorithms). The key idea is to take some proposed problem with AI systems, and figure out mechanistically how that problem could arise when running a local search algorithm. This can help provide information about whether we should expect the problem to arise in practice.

Rohin's opinion: I’m a big fan of this technique: it has helped me notice that many of my concepts were confused. For example, this helped me get deconfused about wireheading and inner alignment. It’s an instance of the more general technique (that I also like) of taking an abstract argument and making it more concrete and realistic, which often reveals aspects of the argument that you wouldn’t have previously noticed.


The Open Phil AI Fellowship (summarized by Rohin): We’re now at the fourth cohort of the Open Phil AI Fellowship (AN #66)! Applications are due October 22.

Navigating the Broader Impacts of AI Research (summarized by Rohin): This is a workshop at NeurIPS; the title tells you exactly what it's about. The deadline to submit is October 12.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.

6 comments:

Regarding ARCHES, as an author:

  • I disagree with Critch that we should expect single/single delegation(/alignment) to be solved "by default" because of economic incentives.  I think economic incentives will not lead to it being solved well-enough, soon enough (e.g. see:  I guess Critch might put this in the "multi/multi" camp, but I think it's more general (e.g. I attribute a lot of the risk here to human irrationality/carelessness)
  • RE: "I find the argument less persuasive because we do have governance, regulations, national security etc. that would already be trying to mitigate issues that arise in multi-multi contexts, especially things that could plausibly cause extinction"... 1) These are all failing us when it comes to, e.g. climate change.  2) I don't think we should expect our institutions to keep up with rapid technological progress (you might say they are already failing to...).  My thought experiment from the paper is: "imagine if everyone woke up 1000000x smarter tomorrow."  Our current institutions would likely not survive the day and might or might not be improved quickly enough to keep ahead of bad actors / out-of-control conflict spirals.

I disagree with Critch that we should expect single/single delegation(/alignment) to be solved "by default" because of economic incentives.  I think economic incentives will not lead to it being solved well-enough, soon enough

Indeed, this is where my 10% comes from, and may be a significant part of the reason I focus on intent alignment whereas Critch would focus on multi/multi stuff.

My thought experiment from the paper is: "imagine if everyone woke up 1000000x smarter tomorrow."

Basically all of my arguments for "we'll be fine" rely on not having a huge discontinuity like that, so while I roughly agree with your prediction in that thought experiment, it's not very persuasive.

(The arguments do not rely on technological progress remaining at its current pace.)

keep up with rapid technological progress (you might say they are already failing to...)

At least in the US, our institutions are succeeding at providing public infrastructure (roads, water, electricity...), not having nuclear war, ensuring children can read, and allowing me to generally trust the people around me despite not knowing them. Deepfakes and facial recognition are small potatoes compared to that.

These are all failing us when it comes to, e.g. climate change.

I agree this is overall a point against my position (though I probably don't think it is as strong as you think it is).

Link to ARCHES is broken, current URL is this.

Thanks, fixed.

these usually don’t assume “no intervention from longtermists”

I think the "don't" is a typo?

No, I meant it as written. People usually give numbers without any assumptions attached, which I would assume means "I predict that in our actual world there is an X% chance of an existential catastrophe due to AI".