The Center on Long-Term Risk (CLR) is focused on reducing risks of astronomical suffering, or s-risks, from transformative artificial intelligence (TAI). S-risks are defined as risks of cosmically significant amounts of suffering. As has been discussed elsewhere, s-risks might arise by malevolence, by accident, or in the course of conflict.
We believe that s-risks arising from conflict are among the most important, tractable, and neglected of these. In particular, strategic threats by powerful AI agents or AI-assisted humans against altruistic values may be among the largest sources of expected suffering. Strategic threats have historically been a source of significant danger to civilization (the Cold War being a prime example). And the potential downsides from such threats, including those involving large amounts of suffering, may increase significantly with the emergence of transformative AI systems. For this reason, our current focus is technical and strategic analysis aimed at addressing these risks.
There are many other important interventions for s-risk reduction which are beyond the scope of this agenda. These include macrostrategy research on questions relating to s-risk; reducing the likelihood of s-risks from hatred, sadism, and other kinds of malevolent intent; and promoting concern for digital minds. CLR has been supporting work in these areas as well, and will continue to do so.
In this sequence of posts, we will present our research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence. It is a standalone document intended to be interesting to people working in AI safety and strategy, with academics working in relevant subfields as a secondary audience. With a broad focus on issues related to cooperation in the context of powerful AI systems, we think the questions raised in the agenda are valuable under a range of normative views and empirical beliefs about the future course of AI development, even if at CLR we are particularly concerned with s-risks.
The purpose of this sequence is to
- communicate what we think are the most important, tractable, and neglected technical AI research directions for reducing s-risks;
- communicate what we think are the most promising directions for reducing downsides from threats more generally;
- explicate several novel or little-discussed considerations concerning cooperation and AI safety, such as surrogate goals;
- propose concrete research questions which could be addressed as part of a CLR Fund-supported project, by those interested in working as full-time researchers at CLR, or by researchers in academia, at other EA organizations, think tanks, or AI labs;
- contribute to the portfolio of research directions which are of interest to the longtermist EA and AI safety communities broadly.
The agenda is divided into the following sections:
AI strategy and governance. What does the strategic landscape at the time of TAI development look like (e.g., unipolar or multipolar, the balance between offensive and defensive capabilities), and what does this imply for cooperation failures? How can we shape the governance of AI so as to reduce the chances of catastrophic cooperation failures?
Credibility. What might the nature of credible commitment among TAI systems look like, and what are the implications for improving cooperation? Can we develop new theory (such as open-source game theory) to account for relevant features of AI?
Peaceful bargaining mechanisms. Can we further develop bargaining strategies which do not lead to destructive conflict (e.g., by implementing surrogate goals)?
Contemporary AI architectures. How can we make progress on reducing cooperation failures using contemporary AI tools — for instance, learning to solve social dilemmas among deep reinforcement learners?
Humans in the loop. How do we expect human overseers or operators of AI systems to behave in interactions between humans and AIs? How can human-in-the-loop systems be designed to reduce the chances of conflict?
Foundations of rational agency, including bounded decision theory and acausal reasoning.
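As a minimal illustration of the kind of cooperation failure the "Contemporary AI architectures" direction studies (not an example from the agenda itself), the sketch below trains two independent, stateless ε-greedy Q-learners on a repeated prisoner's dilemma. The payoff values and hyperparameters are illustrative assumptions; the point is only that naive independent learners tend to settle on mutual defection, the canonical social-dilemma failure.

```python
import random

# Prisoner's dilemma payoffs (illustrative values): action 0 = cooperate, 1 = defect.
PAYOFFS = {
    (0, 0): (3, 3),  # mutual cooperation
    (0, 1): (0, 5),  # row cooperates, column defects
    (1, 0): (5, 0),
    (1, 1): (1, 1),  # mutual defection
}

def train(episodes=5000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # q[player][action]; the game is stateless, so each player keeps one Q-value per action.
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(episodes):
        acts = []
        for p in (0, 1):
            if rng.random() < epsilon:          # explore
                acts.append(rng.randrange(2))
            else:                               # exploit current greedy action
                acts.append(0 if q[p][0] > q[p][1] else 1)
        rewards = PAYOFFS[(acts[0], acts[1])]
        for p in (0, 1):
            a = acts[p]
            q[p][a] += alpha * (rewards[p] - q[p][a])  # simple running-average update
    return q

q = train()
greedy = [0 if qp[0] > qp[1] else 1 for qp in q]
print(greedy)  # independent learners typically converge to mutual defection, [1, 1]
```

Because defection strictly dominates for each learner in isolation, both players end up defecting even though mutual cooperation would give each a higher payoff; research in this direction asks how learning algorithms or mechanisms can be designed to avoid such equilibria.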
We plan to post two sections every other day. The next post in the sequence, "Sections 1 & 2: Introduction, Strategy and Governance" will be posted on Sunday, December 15.
By "cosmically significant", we mean significant relative to expected future suffering. Note that it may turn out that the amount of suffering we can influence is dwarfed by suffering that we can't influence. By "expected suffering in the future" we mean "expectation of action-relevant suffering in the future".
I find myself somewhat confused by s-risks as defined here; it's easy to generate clearly typical cases that very few would want, and hard to figure out where the boundaries are, and thus hard to figure out how much I should imagine this motivation impacting the research.
That is, consider the "1950s sci-fi prediction," where a slightly-more-competent version of humanity manages to colonize lots of different planets in ways that make them sort of duplicates of Earth. This seems like it would count as an s-risk if each planet has comparable levels of suffering to modern Earth and there are vastly more such planets. While this feels to me like "much worse than is possible," I'm not yet sold it's below the "ok" bar in the maxipok sense, but also it wouldn't seem too outlandish to think it's below that bar (depending on how bad you think life on Earth is now).
Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations? I don't get that sense from the six headings, but I can imagine how it might add different constraints or different focus for some of them. For example, on the point of AI strategy and governance, it seems easiest to encourage cooperation when there are no external forces potentially removing participants from a coalition, but adding in particular ethical views possibly excludes people who could have been included. You might imagine, say, a carnivorous TAI developer who wants factory farming to make it to the stars.
This isn't necessarily a point against this view, according to me; it definitely is the case that focusing on alignment at all implies having some sort of ethical view or goal you want to implement, and it may be the case that being upfront about those goals simplifies or directs the technical work, as opposed to saying "we'll let future-us figure out what the moral goals are, first let's figure out how to implement any goals at all." But it does make me interested in how much disagreement you think there is on the desirability of future outcomes, weighted by their likelihood or something, between people primarily motivated by the continued existence of human civilization and people primarily motivated by avoiding filling the universe with suffering, or whatever other categories you think are worth considering.
I think it definitely leads to a difference in prioritization among the things one could study under the broad heading of AI safety. Hopefully this will be clear in the body of the agenda. And, some considerations around possible downsides of certain alignment work might be more salient to those focused on s-risk; the possibility that attempts at alignment with human values could lead to very bad "near misses" is an example. (I think some other EAF researchers have more developed views on this than I do.) But, in this document and my own current research, I've tried to choose directions that are especially important from the s-risk perspective but which are also valuable by the lights of non-s-risk-focused folks working in the area.
[Just speaking for myself here]
For what it’s worth, EAF is currently deliberating about this definition and it might change soon.
Thanks, that helps!
Cool; if your deliberations include examples, it might be useful to include them if you end up writing an explanation somewhere.
We are now using a new definition of s-risks. I've edited this post to reflect the change.
S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.
Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.
Flo's summary for the Alignment Newsletter:
It seems to me that at least some parts of this research agenda are relevant for some special cases of "the failure mode of an amoral AI system that doesn't care about you". A lot of contemporary AIS research assumes some kind of human-in-the-loop setup (e.g. amplification/debate, recursive reward modeling) and for such setups it seems relevant to consider questions like "under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible?". Such questions seem relevant under a very wide range of moral systems (including ones that don't place much weight on s-risks).
I still wouldn't recommend working on those parts, because they seem decidedly less impactful than other options. But as written it does sound like I'm claiming that the agenda is totally useless for anything besides s-risks, which I certainly don't believe. I've changed that second paragraph to: