cross-posted to the EA forum here.

Introduction

As in 2016, 2017, 2018, 2019 and 2020 I have attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regards to possible investments.

My aim is basically to judge the output of each organisation in 2021 (technically: 2020-12-01 to 2021-11-30) and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2021 budgets to get a sense of urgency.

This document aims to be sufficiently broad that someone who has not paid any attention to the space all year could read it (and the linked documents) and be as well-informed about donation decisions as they reasonably could be without personally interviewing researchers and organisations.

I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I have overlooked or misconstrued. As ever I am painfully aware of the various corners I have had to cut due to time constraints from my job, as well as being distracted by 1) other projects, 2) the miracle of life and 3) computer games.

This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, all of which do a lot of work on other issues, which I attempt to cover but only very cursorily.

How to read this document

This document is fairly extensive, and some parts (particularly the methodology section) are largely the same as last year, so I don’t recommend reading from start to finish. Instead, I recommend navigating to the sections of most interest to you. You should also read the Conflict of Interest Section.

If you are interested in a specific research organisation, you can use the table of contents to navigate to the appropriate section. You might then also want to Ctrl+F for the organisation acronym in case they are mentioned elsewhere as well. Papers listed as ‘X researchers contributed to the following research led by other organisations’ are included in the section corresponding to their first author, and you can Ctrl+F to find them.

If you are interested in a specific topic, I have added a tag to each paper, so you can Ctrl+F for a tag to find associated work. The tags were chosen somewhat informally so you might want to search more than one, especially as a piece might seem to fit in multiple categories.

Here are the un-scientifically-chosen hashtags:

  • AgentFoundations
  • Amplification
  • Capabilities
  • Community
  • Ethics
  • Fiction
  • Forecasting
  • GameTheory
  • Interpretability
  • IRL
  • NearTerm
  • Obstruction
  • Other
  • Overview
  • Policy
  • Robustness
  • Scenarios
  • ShortTerm
  • Strategy
  • Translation
  • ValueLearning

You might also be interested in the ‘Organisation Second Preference’ section, which is new this year.

New to Artificial Intelligence as an existential risk?

If you are new to the idea of General Artificial Intelligence as presenting a major risk to the survival of human value, I recommend this Vox piece by Kelsey Piper, or for a more technical version this by Richard Ngo.

If you are already convinced and are interested in contributing technically, I recommend this piece by Jacob Steinhardt, which, unlike this document, covers pre-2019 research and is organised by topic rather than by organisation; or this from Hendrycks et al.; or this from Critch & Krueger; or this from Everitt et al., though the last is a few years old now.

Conflict of Interest

In the past I have had very demanding standards around Conflicts of Interest, including being critical of others for their lax treatment of the issue. Historically this was not an issue because I had very few conflicts. However this year I have accumulated a large number of such conflicts, and worse, conflicts that cannot all be individually publicly disclosed due to another ethical constraint.

As such the reader should assume I could be conflicted on any and all reviewed organisations.

Research Organisations

FHI: The Future of Humanity Institute

FHI is an Oxford-based Existential Risk Research organisation founded in 2005 by Nick Bostrom. They are affiliated with Oxford University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach. Their research can be found here.

Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work - as well as work on other Xrisks.

They ran a Research Scholars Program, where people could join them to do research at FHI. There is a fairly good review of this here, albeit from before the pandemic / hiring freeze.

The EA Meta Fund supported a special program for providing infrastructure and support to FHI, called the Future of Humanity Foundation. This reminds me somewhat of what BERI does.

In the past I have been very impressed with their research.

They didn’t share any information with me about hiring or departures.

Research

Cohen et al.'s Fully General Online Imitation Learning is a detailed paper presenting an online imitation learner that operates with bounds on how harmful it can be despite having no prior training phase. The imitator keeps track of the most plausible models of the teacher, and for each possible action gives it the minimum probability ascribed by any teacher-model. At the beginning, when you have many plausible models, this means a lot of probability mass is unallocated, and hence you request feedback a lot (i.e. an ersatz training period), but over time you should narrow down on the true model (though perhaps this is very slow?). They prove some results to suggest that this allows us to keep the probability of catastrophes low, proportionate to their risk under the true model, assuming the true model was in our 'most plausible' set at the beginning (which is not clear to me). Incidentally, I think this is a good example of the problems of academic research: the alignment forum blog post quite clearly lays out that this is about mesa-optimisers… but the paper (perhaps because of reviewers?) literally does not include the string 'mesa' at all, making it a lot harder to understand the significance! See also the discussion here. Overall I thought this was an excellent paper. Researchers from Deepmind were also named authors on the paper. #IRL
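
To make the pessimism mechanism concrete, here is a toy sketch (my own illustration, not the paper's algorithm; the function name and the dict representation of teacher-models are invented): each action gets the minimum probability any still-plausible model assigns it, and whatever mass the models disagree about becomes the probability of querying the teacher.

```python
def pessimistic_policy(model_probs):
    """Toy sketch of pessimistic imitation (illustration only, not the
    paper's exact algorithm). model_probs: one dict per still-plausible
    teacher-model, mapping action -> probability under that model."""
    actions = model_probs[0].keys()
    # each action gets the MINIMUM probability any plausible model assigns it
    pessimistic = {a: min(m[a] for m in model_probs) for a in actions}
    # probability mass the models disagree about is spent asking the teacher
    defer_prob = 1.0 - sum(pessimistic.values())
    return pessimistic, defer_prob

# two plausible teacher-models that partly disagree: we defer 20% of the time
probs, defer = pessimistic_policy([{"left": 0.7, "right": 0.3},
                                   {"left": 0.5, "right": 0.5}])
```

As the plausible set narrows towards the true model, the minima converge and the deferral probability shrinks, recovering ordinary imitation.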

Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and lengthy piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard to conversation bots. They argue for truthfulness (or at least, non-negligently-false) rather than honesty as a standard, both to avoid intentionality issues and also because, relative to humans, the costs of punishing unintended mistakes are much lower / less unfair. Especially hard topics include the truthfulness of AIs that are more expert than humans (and hence cannot be directly subject to human oversight) and cases where the truth is contested - for example, tech platforms' choice to suppress as misleading various claims about covid which contradicted official advice, even though the official advice was frequently clearly wrong. I'm not convinced that their approach would end up being significantly different from 'dominant ideology censors rivals' - even suggesting explicit warnings about controversy has the issue that what is controversial is itself controversial. See also the discussion here and here. Researchers from GPI, OpenAI were also named authors on the paper. #Strategy

Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. Of course, these models are trying to copy human responses, not be honest, so because many of the questions allude to common misconceptions, the more advanced models 'lie' more often. Interestingly they also used GPT-3 to evaluate the truth of these answers. See also the discussion here. Researchers from OpenAI were also named authors on the paper. #Other

Ord et al.'s Future Proof: The Opportunity to Transform the UK's Resilience to Extreme Risks is a flashy policy document recommending steps HMG could take to prepare for future risks. This includes things like having a Chief Risk Officer and avoiding AI control of nuclear weapons, as well as non-AI-relevant but still good recommendations for bio. In general I thought it did a significantly better job, and represented a more realistic theory of change, than many previous policy pieces. Researchers from CSER, Gov.AI were also named authors on the paper. #Policy

Manheim & Sandberg’s What is the Upper Limit of Value? argues that the total amount of value/growth humanity can create/experience is finite. This is related to typical 'limits to growth' arguments, except much better and a much higher ceiling: rather than worrying about peak oil, they discuss the speed of light and the difficulties extracting infinite value from a single electron. As they note, it may not apply in exotic models of physics though, and I do not understand why they think that you cannot assign a probability to something so low that nothing could ever convince you it happened. See also the discussion here. #Other

Hammond et al.'s Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice extends previous work on representing games as causal networks rather than payoff tables. This has the advantage of facilitating decomposability, making them much more easily understandable; the hope is that this type of work will help us to understand when agents have bad incentives to e.g. deceive. Researchers from Deepmind were also named authors on the paper. #GameTheory

Finnveden's Extrapolating GPT-N performance examines the performance scaling for GPT on a variety of tasks. He finds generally relatively smooth scaling, and interestingly comes to relatively similar conclusions to Ajeya's work despite the different methodology. #Forecasting

FHI researchers contributed to the following research led by other organisations:

They also produced a variety of pieces on biorisk and other similar subjects, which I am sure are very good and important but I have not read.

Finances

FHI is apparently currently banned from fundraising by the university, and hence cannot share financial information. I would guess their budget is relatively large. Apparently they have sufficient funds to last until the expected resumption of fundraising in the new year. I do not know the exact reason for this ban; the absence of disclosure makes due diligence of them very difficult.

If you wanted to donate to them anyway, here is the relevant web page.

GovAI: The Center for the Governance of AI

GovAI is an Oxford-based AI Governance Research organisation founded in 2021 by Allan Dafoe; Ben Garfinkel became Acting Director in 2021. They are affiliated with CEA. They were formerly a research center as part of FHI, but spun out this year to allow Allan to take up a position at Deepmind, and for increased operational independence from the university. Their research can be found here. Their declaration of independence can be found here.

I generally regarded the work they did as part of FHI as quite good, and presumably that is likely to continue.

The one possible downside of leaving the university affiliation is the loss of associated prestige.

Research

Zaidi & Dafoe's International Control of Powerful Technology: Lessons from the Baruch Plan for Nuclear Weapons attempts to draw conclusions relevant for AGI control. It's a very detailed account of negotiations, but I'm not sure how much we can learn from it, given that the plan failed, and it seems plausible that neither side was really negotiating in earnest anyway. #Strategy

Fischer et al.'s AI Policy Levers: A Review of the U.S. Government’s Tools to Shape AI Research, Development, and Deployment lays out various policy levers the USG can use to control AI. These generally focus on AI race issues—e.g. techniques to undermine Chinese competition—rather than AGI control, which would be an issue even with a unified world government. It focuses on tools based in current law, which I think makes sense, as even in previous crises (e.g. 2008, March 2020) the government response has leant heavily on repurposing existing programs and permissions. #Policy

Dafoe et al.'s Open Problems in Cooperative AI and Cooperative AI: Machines Must Learn to Find Common Ground give an overview of different ways to think about cooperation issues. This is not about principal-agent issues with getting an AI to do what its human wants, but about how to deal with multiple humans/AIs with different goals and knowledge. Much of the piece was effectively about human coordination - while there were some AI-specific ideas, like pre-commitment in lane merging for autonomous cars, ideas like AIs self-modifying to a joint utility function didn't get much discussion. Researchers from FHI, Deepmind were also named authors on the paper. #Strategy

Zhang's Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers asked a bunch of people at NeurIPS etc. what they thought of various ethical issues. Sort of a follow-up to Katja's previous researcher survey and Baobao's survey of the public, many of the results are not surprising - e.g. researchers trust science organisations and distrust the military and China. I was surprised/disappointed to see that experts were less concerned about Value Alignment than the general public (to the extent we can interpret these surveys literally). Researchers from Gov.AI were also named authors on the paper. #Strategy

Ding's China's Growing Influence over the Rules of the Digital Road describes China's approach to influencing technology standards, and suggests some policies the US might adopt.  #Policy

Garfinkel's A Tour of Emerging Cryptographic Technologies provides an overview of various cryptographic techniques (not primarily currency-related) and their relevance for issues like coordination problems and surveillance. #Overview

Dafoe et al.'s Reputations for Resolve and Higher-Order Beliefs in Crisis Bargaining conducts a semi-expert survey to evaluate foreign policy decision makers' use of reputation for resolve when evaluating geopolitical strategy. They find that reputation does matter, and the Domino Theory is true: a history of steadfastness causes others to expect you to be resolute in the future, and hence they are more likely to back down. #Strategy

Ding & Dafoe's Engines of Power: Electricity, AI, and General-Purpose Military Transformations argues that for countries to take advantage of AI in warfare they will require a lot of domestic civilian AI capabilities. This is in contrast to people who have argued that rapid AI technology diffusion would upset US dominance. However, I'm not sure the conclusions really pull through for AGI. #Forecasting

Zwetsloot et al.'s The Immigration Preferences of Top AI Researchers: New Survey Evidence (with Zhang, Anderljung, Horowitz and Dafoe; 2021-10-22; https://www.governance.ai/research-paper/the-immigration-preferences-of-top-ai-researchers-new-survey-evidence) surveys leading AI researchers on their immigration preferences. #Policy

Researchers from GovAI were also named contributors to the following papers:

Despite the name they have done a fair bit of work on non-AI related governance; these papers are not reviewed here.

Finances

They currently have around $3.5m cash, which they expect constitutes around 2-3 years of runway (less if they grow faster).

They are not currently actively soliciting donations but if you would like to donate to them, you can do so here.

CHAI: The Center for Human-Compatible AI

CHAI is a UC Berkeley-based AI Safety Research organisation founded in 2016 by Stuart Russell. They do ML-oriented safety research, especially around inverse reinforcement learning, and cover both near and long-term future issues.

As an academic organisation their members produce a very large amount of research; I have only tried to cover the most relevant below. It seems they do a better job engaging with academia than many other organisations, especially in terms of interfacing with the cutting edge of non-safety-specific research. The downside of this, from our point of view, is that not all of their research is focused on existential risks.

They have expanded somewhat to other universities outside Berkeley and have people at places like Princeton and Cornell.

Research

CHAI and their associated academics produce a huge quantity of research. Far more than for other organisations, their output is under-stated by my survey here; if they were a small organisation that only produced one report, there would be 100% coverage, but as it is this is just a sample of the pieces I felt most interested in. On the other hand, academic organisations tend to produce some slightly less relevant work, and I have focused on what seemed to me to be the top pieces.

Hendrycks et al.'s Unsolved Problems in ML Safety provides an overview of ML safety issues: Robustness, Monitoring, Alignment and 'External Safety'. It's basically an updated version of Concrete Problems, with one of the same authors. I generally think these pieces are quite good for helping provide easy on-ramps (with Google/OpenAI credentials) for mainstream researchers. There is probably not a huge amount of novel content here for readers of this article, though I thought the introductory motivation section was well written. See also the discussion here. Researchers from OpenAI were also named authors on the paper. #Overview

Laidlaw & Russell's Uncertain Decisions Facilitate Better Preference Learning presents an approach for learning utility functions from the behaviour of humans acting under uncertainty. They argue that uncertainty can actually make it easier to infer values, for similar reasons (I think) to why utility functions are derived from preferences over bets: by being less extreme, there is more prior uncertainty about which actions agents will take, and hence more information in their actions. #IRL

Roman et al.'s Accumulating Risk Capital Through Investing in Cooperation is a game theory paper about promoting cooperation while limiting downside. Basically they prove results about the tradeoff between the two goals, and show that you can get asymptotically good (i.e. maximally cooperative) behaviour. #GameTheory

Hendrycks et al.'s What Would Jiminy Cricket Do? Towards Agents That Behave Morally annotates a series of text-based games with action ethics scores and uses them for policy shaping. They note that many of the games actively rewarded immoral behaviour, and attempted to correct this with pro tanto ethical labelling. This was then used to adjust a model at the last step (not retraining using ethical judgements as rewards). #ValueLearning

Filan et al.'s Clusterability in Neural Networks produces a definition of clustering to identify potentially meaningful subcomponents in neural networks, and shows that actually trained (and hence meaningful in aggregate) nets have these subcomponents more than randomly generated nets. This basically tries to produce subcomponents of neurons that have strong intraconnections and weak interconnections, relative to their size. They also suggest two methods for promoting clustering in a network, for when you want to design interpretability into a system: eigenvector regularisation and initialising the weights with pre-existing clusters. #Interpretability
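
The underlying quantity can be sketched in a few lines (my own toy illustration with a made-up adjacency matrix; the paper's actual pipeline uses spectral methods on the full weight graph): the normalised cut of a partition, which is low exactly when intra-cluster weights dominate inter-cluster weights.

```python
def normalized_cut(adj, cluster):
    """Toy normalised cut (illustration, not the paper's full pipeline).
    adj: symmetric matrix of absolute connection weights between neurons;
    cluster: set of node indices forming one side of a 2-way partition."""
    n = len(adj)
    other = set(range(n)) - set(cluster)
    # total weight crossing the partition
    cut = sum(adj[i][j] for i in cluster for j in other)
    # volume = total weight incident to each side
    vol = lambda side: sum(adj[i][j] for i in side for j in range(n))
    return cut / vol(cluster) + cut / vol(other)

# two tightly-connected pairs of neurons joined by one weak edge (hypothetical)
adj = [[0.0, 1.0, 0.0, 0.0],
       [1.0, 0.0, 0.1, 0.0],
       [0.0, 0.1, 0.0, 1.0],
       [0.0, 0.0, 1.0, 0.0]]
```

Cutting along the weak edge ({0,1} vs {2,3}) scores far lower than a cut through the strong edges, which is the sense in which trained networks turn out to be more clusterable than randomly generated ones.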

Zhuang & Hadfield-Menell's Consequences of Misaligned AI provides a simple model of mis-alignment due to utility function mis-specification. Essentially the true utility function is a linear function of N attributes, but the agent's utility function only covers M<N, and the resource constraint is concave, so the robot ends up sacrificing the invisible N-M attributes. They then discuss some solutions, including classic ones like Armstrong's impact minimisation or the human constantly providing new local utility functions—both of which unfortunately require you to know what the other attributes are. #AgentFoundations
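
The failure mode can be shown with an even simpler linear toy version (my simplification, not the paper's model: the paper uses a concave constraint under which the omitted attributes are actively driven down, whereas here they are merely starved of resources):

```python
def proxy_optimum(true_weights, m_observed, budget=1.0):
    """Toy version of proxy misalignment (illustration, simpler than the
    paper's model). True utility is linear in N attributes, but the agent
    optimises only the first m_observed of them. With a linear budget
    constraint the proxy optimum is a corner solution: the whole budget
    goes to the best observed attribute, and the invisible attributes
    get nothing."""
    observed = true_weights[:m_observed]
    best = observed.index(max(observed))
    allocation = [0.0] * len(true_weights)
    allocation[best] = budget
    true_utility = sum(w * x for w, x in zip(true_weights, allocation))
    return allocation, true_utility

# three attributes matter to the human, but the agent only sees the first two
alloc, utility = proxy_optimum([1.0, 2.0, 5.0], m_observed=2)
```

The agent pours the whole budget into attribute 1 (weight 2.0) and forfeits all the value in the unobserved attribute (weight 5.0), even though that is where most of the true utility lay.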

Critch's What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) argues for a focus on multi-polar AI scenarios, and on the mechanisms by which they interact over the designs of individual AIs. The main part of the post is a series of scenarios, showing bad outcomes from otherwise aligned AIs because competition forces them to sacrifice alignment. Basically AI companies are super successful at making profits, but then eventually they become too powerful and we die. I found the models a bit confusing - they didn't seem to really explain how this competition goes from 'the best thing in the history of the world' to 'extinction'. Standard microeconomics suggests that the perfect competition you need to 'force' all firms to be maximally efficient is in fact great for consumers (humans). The article claims that humans will lose control because the AIs have become extremely powerful and well defended, but I don't see how this makes sense: fortified compounds are an expensive liability that firms in perfect competition cannot afford, and antagonising humanity (which is extremely rich and powerful because of all the stuff the AI firms have made for us) also sounds like a very expensive mistake to make. I think my steelman would be that these firms are exploiting insecure property rights, in which case the solution to AI alignment turns out to be... carbon taxes? #Scenarios

Lindner et al.'s Learning What To Do by Simulating the Past is an extension of Rohin's previous paper. Both papers basically try to learn from the latent information embodied in the world that already significantly reflects human preferences - e.g. if an AI discovers a fence in the woods, it can reasonably infer some human preferred a fence be there. This paper basically aims to move from gridworlds with trivial 'physics' to more realistic settings where you can't precisely compute the histories. Researchers from CHAI were also named authors on the paper. #IRL

Shah et al.'s The MineRL BASALT Competition on Learning from Human Feedback introduces a competition for learning human intent in a noisy environment: for tasks like “building waterfalls in Minecraft”. The idea here is that while humans have a strong sense for what is a 'nice looking' waterfall, we don't have a good formal specification of the task, so you are competing to design agents that are best at extracting human intent. By using Minecraft they provide a huge space of possible strategies (vs. e.g. Atari games) and by using human feedback you reduce (eliminate?) the potential for excess hyperparameter tuning. See also the discussion here. Researchers from OpenAI were also named authors on the paper. #ValueLearning

Hod's Detecting Modularity in Deep Neural Networks provides two criteria for whether a neural network subcomponent identified through Filan-style spectral analysis is capturing a 'real', human-intuitive subtask. The idea is that the correlation between the neurons, and the criticality of the subcomponent to overall performance, can be automatically calculated without requiring human input. #Interpretability

Lee et al.'s PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training proposes a model whereby agents initially learn in an unsupervised fashion to better economise on the scarce resource of human oversight later. This is in contrast to some other approaches where the teacher will provide near-constant feedback in the very early stages when uncertainty is high; presumably the cost is that this exploration is not safe. They also change the way the agent updates based on the samples shown to the humans, but I didn't quite follow that bit. #ValueLearning

Gates et al.'s A rational model of people’s inferences about others’ preferences based on response times suggests using how long people take to respond as a hidden measure of preference intensity. To my knowledge this 'free' bit of information has not been proposed previously. #ValueLearning

CHAI researchers contributed to the following research led by other organisations:

Finances

They have been funded by various EA organisations including the Open Philanthropy Project.

They spent $1,650,000 in 2020 and $1,250,000 in 2021, and plan to spend around $1,750,000 in 2022. They have around $11,000,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 6.2 years of runway, or close to 5 if they grow. Their 2021 spending was significantly below plan due to the pandemic.

If you wanted to donate to them, here is the relevant web page.

MIRI: The Machine Intelligence Research Institute

MIRI is a Berkeley-based independent AI Safety Research organisation founded in 2000 by Eliezer Yudkowsky and currently led by Nate Soares. They were responsible for much of the early movement building for the issue, but have refocused to concentrate on research for the last few years. With a fairly large budget now, they are the largest pure-play AI alignment shop. Their research can be found here.

In general they do very ‘pure’ mathematical work, in comparison to other organisations with more ‘applied’ ML or strategy focuses. I think this is especially notable because of the irreplaceability of the work. It seems quite plausible that some issues in AI safety will arise early on and in a relatively benign form for non-safety-orientated AI ventures (like autonomous cars or Minecraft helpers) – however the work MIRI does largely does not fall into this category. I have also historically been impressed with their research and staff.

Their agent foundations work is basically trying to develop the correct way of thinking about agents and learning/decision making by spotting areas where our current models fail and seeking to improve them. This includes things like thinking about agents creating other agents.

In December 2020 (hence in scope for this year’s review) they announced that the new research directions they launched in 2017 had been a disappointment, and they were winding down those programs. As a result most of their engineering staff have left. Given that I did not give them much credit in the past for this secret research program, this does not significantly change my opinion of them.

MIRI, in collaboration with CFAR, ran a series of four-day workshop/camps, the AI Risk for Computer Scientists workshops, which gather mathematicians/computer scientists who are potentially interested in the issue in one place to learn and interact. This sort of workshop seems very valuable to me as an on-ramp for technically talented researchers, which is one of the major bottlenecks in my mind. In particular they have led to hires for MIRI and other AI Risk organisations in the past. However, the website suggests these have been discontinued due to the pandemic.

They also support MIRIx workshops around the world, for people to come together to discuss and hopefully contribute towards MIRI-style work.

MIRI continue their policy of nondisclosure-by-default, something I’ve discussed in the past, which despite having some strong arguments in favour unfortunately makes it very difficult for me to evaluate them. I’ve included some particularly interesting blog posts some of their people have written below, but many of their researchers produce little to no public facing content.

They decided not to leave the Bay Area.

Research

Most of their work is non-public.

Garrabrant's Temporal Inference with Finite Factored Sets introduces a new way of doing temporal/causal inference via combinatorics. To summarise greatly, it introduces the idea of a set factorization, which is sort of dual to a set partition, and uses this to introduce an alternative to Judea Pearl's directed acyclic graph approach to causality. The appeal here, apart from being a neat new definition, is that this could help us infer causality without needing the graph, which does feel a bit like cheating - you basically get temporal direction from factor subset relations. See also the discussion here. Overall I thought this was an excellent paper. #AgentFoundations
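
The core definition is simple enough to check mechanically. Here is a toy verifier (my own illustration of the condition as I understand it from the post, omitting the nontriviality requirement; not MIRI code): a collection of partitions of S is a factorization iff picking one part from each partition always pins down exactly one element.

```python
from itertools import product

def is_factorization(s, partitions):
    """Toy check of the factorization condition on a finite set
    (illustration only): choosing one part from each partition must
    intersect in exactly one element of s, for every possible choice."""
    for choice in product(*partitions):
        overlap = set(s)
        for part in choice:
            overlap &= set(part)
        if len(overlap) != 1:
            return False
    return True

# {0,1,2,3} factors into 'parity' x 'magnitude', much like 2D coordinates
factors = [[{0, 1}, {2, 3}], [{0, 2}, {1, 3}]]
```

This gives you product-like 'coordinates' on a plain set without imposing any graph structure, which is what lets temporal direction be read off from relationships between factors rather than from edges.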

Yudkowsky's Discussion with Eliezer Yudkowsky on AGI interventions is a transcript of a Chatham House discussion Eliezer hosted on his views about the future. Very interesting in general; a lot of it describes trying to create the situation to be able to exploit a future miracle that might occur. He is generally pessimistic—even if one research org can be persuaded to behave sensibly, the code will just be stolen by rivals. One section I didn't understand was his desire for a secret team of 5 good ML researchers to try out various (probably unsuccessful) ideas—isn't that MIRI? There is also extended discussion on issues like convergence, AI deception, and the comments are also worth reading. #Strategy

Yudkowsky's Yudkowsky and Christiano discuss "Takeoff Speeds" combines an extended reply from Eliezer arguing for a rapid takeoff with some back-and-forth with Paul about various forecasts. Many of the arguments will not be very surprising to those who have read enough Eliezer, but the dialogue is very interesting to read, even if at times they struggled to pin down exactly the source of the disagreement. #Forecasting

Soares's Visible Thoughts Project and Bounty Announcement describes a prize MIRI are putting out for creating training datasets for them. Basically they want text runthroughs of a D&D campaign with every thought the dungeon master might have explicitly written out. Ultimately they want to use these datasets to train models whose reasoning steps are visible. #Interpretability

Finances

They spent $7,500,000 in 2020 and a 'similar' amount in 2021, and plan to spend around $6m in 2022. They have around $30,000,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 5.2 years of runway. This large amount of runway is due to some big recent crypto donations.

They have been supported by a variety of EA groups in the past, including OpenPhil.

They are not running a fundraiser this year due to strong reserves, but if you wanted to donate to them anyway here is the relevant web page.

GCRI: The Global Catastrophic Risks Institute

GCRI is a globally-based independent Existential Risk Research organisation founded in 2011 by Seth Baum and Tony Barrett. They cover a wide variety of existential risks, including artificial intelligence, and do policy outreach to governments and other entities. Their research can be found here. Their annual summary can be found here.

They run an advising and collaboration program in which they give guidance to people from around the world who want to help work on catastrophic risks, and potentially write papers with them.

In 2021 they hired Andrea Owe as Research Associate, and Robert de Neufville left earlier this month.

They have an AMA on the EA forum.

Research

de Neufville & Baum's Collective Action on Artificial Intelligence: A Primer and Review provides ways in which economists break collective action problems down and applies them to AI. These apply to both near-term and AGI issues. #Strategy

Owe & Baum's The Ethics of Sustainability for Artificial Intelligence discusses a variety of issues with the concept of sustainability, and links them to AI.  #Strategy

Owe & Baum's Moral Consideration of Nonhumans in the Ethics of Artificial Intelligence argues for the inclusion of animals (and nature/artificial agents) in AI ethics. It surveys existing statements of AI principles, few of which explicitly mention animals, and argues against anthropocentrism - at least to the extent that we should give *some* weight to animals, even if less than humans. #Ethics

Fitzgerald et al.'s 2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy is basically a massive list of currently existing AGI projects. Interestingly the paper argues that for-profit AGI projects at least claim to have ethical/humanitarian concerns significantly more frequently than academic ones. Obviously this is in places a fairly subjective exercise but I'm not aware of anything else of this nature since their earlier 2017 work.  #Overview

Baum & Owe's Artificial Intelligence Needs Environmental Ethics discusses environmental perspectives on a number of near-term AI issues, including whether artificial life counts for biodiversity.  #NearTerm

Baum et al.'s GCRI Statement on the January 6 US Capitol Insurrection condemns the protest. I thought the link to global genocide was a bit of a stretch. #Policy

GCRI researchers contributed to the following research led by other organisations:

Finances

They spent $300,000 in 2020 and $415,000 in 2021, and plan to spend around $350,000 in 2022. They have around $600,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.7 years of runway.

If you want to donate to GCRI, here is the relevant web page.

CSER: The Center for the Study of Existential Risk

CSER is a Cambridge based Existential Risk Research organisation founded in 2012 by Jaan Tallinn, Martin Rees and Huw Price, and then established by Seán Ó hÉigeartaigh with the first hire in 2015. After an intermission they are once again led by Seán and are affiliated with Cambridge University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach, including to the UK and EU parliaments - e.g. this. Their research can be found here.

Seán returns as leader this year, replacing Catherine Rhodes. Jess Whittlestone is leaving for CLTR/Alpenglow, and they hired Jessica Bland (who I heard good things about) and Paul Ingram.

In Nuño Sempere’s shallow review he suggested that many CSER researchers were not really focused on longtermist work, which is also my impression.

They had a submission on the EU AI act here.

Research

Hua & Belfield's AI & Antitrust: Reconciling Tensions Between Competition Law and Cooperative AI Development analyses various AI safety governance proposals for their compatibility with EU antitrust rules. The paper focuses on EU competition law because, even though the EU has no big AI firms, it is very enthusiastic about regulating US tech firms. Worryingly (but, to my mind, correctly) it argues that OpenAI's 'assist' clause, where they promise to help rather than compete with any other firm that gets close to AGI, could be in violation of the EU's Article 101(1), especially if it is invoked late in the race and between market leaders (both likely), and that the unilateral nature of the proposal only partially mitigates this. The conclusion of the paper is optimistic, but my reading of the specific arguments is quite negative; I think it would be very hard for an AI company to e.g. persuade a hostile regulator to give them credit for the speculative efficiency gains of collusion. Three strategies it doesn't consider are 1) avoid the EU (viable for OpenAI, not Google), 2) rely on EU enforcement being so slow it is simply irrelevant (seems plausible) and 3) push for reforms to weaken antitrust laws. Overall I thought this was an excellent paper. #Policy

Whittlestone & Clark's Why and How Governments Should Monitor AI Development recommends that governments build capabilities for monitoring the development of AI technologies. The examples in the paper are all nearterm things, but presumably the motivation is general readiness for AGI. Researchers from Anthropic were also named authors on the paper. #Policy

Maas & Stix's Bridging the gap: the case for an ‘Incompletely Theorized Agreement’ on AI policy argues, in a similar line to some previous papers, that people concerned with AI Xrisk have common cause on various current policy issues with those concerned with short-term AI. It suggests that public disagreement between near- and long-term people is bad because it reduces both camps' legitimacy and hence supports the rival 'AI race' competition framing. #Strategy

Maas's AI, Governance Displacement, and the (De)Fragmentation of International Law discusses the potential impacts of AI technologies on how international law is made and enforced.  #NearTerm

Maas's Aligning AI Regulation to Sociotechnical Change argues we should focus on the impacts of technologies, rather than the specific technologies themselves, when deciding whether to regulate.  #Policy

They also did work on various non-AI issues, which I have not read, but you can find on their website.

CSER researchers contributed to the following research led by other organisations:

Finances

They spent $854,000 in 2020 and $1,300,000 in 2021, and plan to spend around $1,300,000 in 2022. It seems that similar to GPI maybe ‘runway’ is not that meaningful - they suggested their grants begin to end in early 2022 and all end by mid-2024, the same dates as last year.

If you want to donate to them, here is the relevant web page.

OpenAI

OpenAI is a San Francisco based independent AI Research organisation founded in 2015 by Sam Altman, Elon Musk and others. They are one of the leading AGI research shops, with a significant focus on safety. Initially they planned to make all their research open, but changed plans and are now significantly more selective about disclosure - see for example here.

One of their biggest achievements is GPT-3, a massive natural language model that generates highly plausible continuations from prompts and seems to be very versatile. GPT-3 continues to be one of OpenAI's (and in fact anyone's) biggest AI capabilities achievements. In 2021 they released DALL-E, which is similar except that instead of creating text based on prompts it creates pictures. They pioneered a delayed-release program with GPT-2 to allow people to adapt to the perceived risks of this technology (and to normalise doing so for future technologies); GPT-3 is now generally available for anyone to use.

A notable GPT-3 derivative this year is GitHub Copilot, which intelligently assists programmers, based on a model trained on open-source code.

They have also done work on iteratively summarising books (summarising, and then summarising the summary, etc.) as a method for scaling human oversight.
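To make the recursion concrete, here is a minimal sketch of the scheme. Note that `summarize_chunk` is a stub standing in for a language-model call, and the fan-in and grouping logic are my own illustrative assumptions, not OpenAI's actual pipeline:

```python
# Recursive summarisation sketch: summarise chunks, then summarise
# groups of those summaries, until a single top-level summary remains.
# `summarize_chunk` is a stub (it keeps the first sentence); the real
# system would call a trained language model here.

def summarize_chunk(text: str) -> str:
    first = text.split(". ")[0]
    return first if first.endswith(".") else first + "."

def recursive_summarize(chunks: list[str], fan_in: int = 2) -> str:
    """Collapse a list of text chunks into one summary, level by level."""
    summaries = [summarize_chunk(c) for c in chunks]
    while len(summaries) > 1:
        # Concatenate neighbouring summaries and summarise each group.
        grouped = [" ".join(summaries[i:i + fan_in])
                   for i in range(0, len(summaries), fan_in)]
        summaries = [summarize_chunk(g) for g in grouped]
    return summaries[0]
```

The appeal for oversight is that each level's summaries can be checked by a human against the (much shorter) inputs to that level, rather than against the whole book.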

In December 2020 the majority of the OpenAI safety team left, with most of them (Dario Amodei, Chris Olah) going to found Anthropic and Paul Christiano going to found the Alignment Research Center. Ostensibly these were unrelated exits and not because of any problems at OpenAI; however their stated motivations for leaving do not fully make sense to me (why couldn’t they do the safety work they want to within OpenAI?) and OpenAI did not seem to have replacements lined up, though they do have Jan Leike now.

Helen Toner, of CSET and similar things, joined their board this year. OpenPhil’s Holden Karnofsky, who previously had joined the board after OpenPhil made a $30m donation which expired this year, has now left the board. Will Hurd, a US politician, also joined the board.

They have a submission on the EU AI act here.

Research

Cammarata et al.'s Curve Circuits is a very cool piece that shows the ability to understand a (moderately) large neural network in the Feynman sense: to recreate it. The authors are able to identify what individual neurons (and families of neurons) in a 50k+ parameter curve-recognition network 'mean'. To test this they describe the network's 'strategy' in English sentences, and then are able to re-implement (more or less) the network based on this description. This is much more explainable than I'd've expected a neural net to be! I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model? In any case you should read the paper for the psychedelic pictures if nothing else. Overall I thought this was an excellent paper. #Interpretability

Barnes & Christiano's Debate update: Obfuscated arguments problem describes a problem they weren't able to solve with their tests of AI safety through Debate. Basically there are arguments where, even if you know they are wrong, it is very hard to narrow down exactly where the error is. This means that the honest debater can't present the judge with the sort of knock-down evidence they want. Clearly this was always going to be a theoretical issue; the main update here is that these obfuscated arguments can arise quite commonly. #Amplification

Chen et al.'s Evaluating Large Language Models Trained on Code introduces and evaluates Codex, the programming-language GPT implementation that preceded GitHub Copilot. They basically take a massive pretrained GPT model and point it at GitHub, and then see how good it is at writing python functions based on natural language descriptions, tested with unit tests. They also attempted to write docstrings from code. It displays some of the same smooth scaling curves we see elsewhere with GPT. Some interesting behaviour emerges, including that it 'knows' how to write both good and bad code, and if your prompt includes bad code (e.g. because you are a bad programmer who really needs a copilot) it will assume you want to carry on writing bad code! That seems potentially like a good safety lesson, but overall this does seem rather like capacity-enhancing research to me. There are also some more 'silly' concerns, like that the AI might prefer some open-source packages over others, which would be unfair on the authors of the disfavoured packages. Researchers from OpenPhil were also named authors on the paper. #Capabilities
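The paper's headline metric is pass@k: the probability that at least one of k sampled completions passes all the unit tests. Rather than literally sampling k completions, the paper estimates it without bias by generating n ≥ k samples, counting the c that pass, and computing 1 − C(n−c, k)/C(n, k). A direct implementation of that formula (the estimator is from the paper; the code itself is an illustrative sketch):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed all tests."""
    if n - c < k:
        # Too few failures: every size-k subset must contain a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 1 passing sample out of 2, pass@1 comes out to 0.5, as expected.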

OpenAI Researchers also contributed to the following papers led by other organisations:

  • Unsolved Problems in ML Safety
  • The MineRL BASALT Competition on Learning from Human Feedback
  • Truthful AI: Developing and governing AI that does not lie

Finances

OpenAI was initially funded with money from Elon Musk as a not-for-profit. They have since created an unusual corporate structure including a for-profit entity, in which Microsoft is investing a billion dollars.

Given the strong funding situation at OpenAI, as well as their safety team’s position within the larger organisation, I think it would be difficult for individual donations to appreciably support their work. However it could be an excellent place to apply to work.

Google Deepmind

Deepmind is a London based AI Research organisation founded in 2010 by Demis Hassabis, Shane Legg and Mustafa Suleyman, and currently led by Demis Hassabis. They are affiliated with Google. As well as being arguably the most advanced AI research shop in the world, Deepmind has a very sophisticated AI Safety team, covering both ML safety and AGI safety.

We discussed AlphaFold last year, and there was significant further progress on protein folding this year with AlphaFold 2. Long-time followers of the space will recall this is a development Eliezer highlighted back in 2008. See also this speculation that Deepmind might have been trying to avoid publishing it until a competitor ‘forced’ their hand.

Rohin Shah and team continue to produce the AI Alignment Newsletter, covering in detail a huge number of interesting new developments, especially new papers. I really cannot praise these newsletters highly enough.

Research

Stooke et al.'s Open-Ended Learning Leads to Generally Capable Agents is a worryingly-titled paper showing a technique for training agents to deal with a wide variety of environments and objectives. While AlphaZero showed the same algorithm could learn Chess/Shogi/Go etc., learning any one of these games didn't directly help with the others. In this case they produce a 3D environment that can be configured in a wide variety of ways, with the idea that the agents will learn quite general lessons - at least relative to their XLand environment, which seems much more general than the board games. The training process is very involved - like with AlphaX, there are multiple generations, combined with clever ways of judging how hard a task is (so that agents are presented with hard-but-not-impossible things to learn from), and evaluation based on dominance rather than average scores. See also the discussion here. #Capabilities

Welbl et al.'s Challenges in Detoxifying Language Models tests and discusses various issues with automated 'toxicity' (rudeness/political correctness) filters for language models. Unsurprisingly these filters can generate a lot of false positives, and degrade the quality of the responses on other axes. I think this paper is a good illustration of the problems with 'Ethical Considerations' sections: while they list a number of issues, the fact that their approach by design requires the suppression of entire classes of true and important statements is not mentioned. #NearTerm

Gabriel's Towards a Theory of Justice for Artificial Intelligence argues it is possible to apply considerations of justice to AI, and then applies Rawlsianism to the issue. Unfortunately I don’t find a literal reading of Rawls very convincing (the collection of liberties and principles is highly arbitrary, as if Rawls was just trying to work backwards from his conclusion, and maximin requires an implausibly high level of risk aversion). #Ethics

Researchers from Deepmind were also named on the following papers:

Finances

Since they are part of Google, I think it would be difficult for individual donors to directly support their work. However it could be an excellent place to apply to work.

Anthropic

Anthropic is a San Francisco based for-profit AI Startup organisation founded in 2021 by Dario Amodei & Daniela Amodei. They are a highly safety-aligned firm founded by people who left the OpenAI safety team in 2020. Their website is here.

Research

Their first publication falls outside the time frame for this document by one day, and hence will go in next year’s review.

Finances

As a well-funded for-profit startup I would not expect them to need or want donations, but they could be a good place to work.

ARC: Alignment Research Center

ARC is a Berkeley based independent AI Safety Research organisation founded in 2021 by Paul Christiano. They work on Paul's agenda of trying to develop systems for scaling human oversight to allow for (commercially competitive) well controlled systems. Their research can be found here.

Research

You can read about their work on aligning human and AI ontologies here, and discussion here.

Christiano's Teaching ML to answer questions honestly instead of predicting human answers presents a possible approach to the problem of models learning to predict what a human would say rather than to answer honestly. Essentially he is attempting to bias our training algorithm away from the 'copy human explanation' approach and towards the 'give true explanation' approach in a variety of ways, including sequential training and producing a mini training set of extra ground truth. I must admit I don't exactly understand Step 2. See also the discussion here. #Interpretability

Christiano's Another (outer) alignment failure story describes a possible future misalignment scenario, where AIs become more and more influential, but we understand them less and less. We defer more and more of our decision-making to them, and things generally get better, though people worry about the loss of control. Eventually we see a treacherous turn and the AIs suddenly turn off all the cameras on us, though I don't exactly understand how this step fits with the rest of the story. #Scenarios

Christiano's A naive alignment strategy and optimism about generalization is a simple post describing a problem with a 'naïve' strategy of making AIs 'explain' what they are doing. The concern is that rather than learning to give the true explanation for their actions, they will instead learn how to give persuasive accounts. #Interpretability

ARC Researchers also contributed to the following papers led by other organisations:

  • Yudkowsky and Christiano discuss "Takeoff Speeds"

Finances

They are not looking for donations at this time; however they are hiring.

Redwood Research

Redwood is a Berkeley based independent AI Safety Research organisation that started doing public AI alignment research in 2021, founded by Nate Thomas, Bill Zito, and Buck Shlegeris. They aim to do highly practical safety work - taking theoretical safety insights from their own work and from other organisations (e.g. ARC) and proving it out in practical ML systems to ease adoption by non-alignment-focused AI teams.

The team members I know are pretty talented.

They have an extended and very informative AMA here.

Research

Shlegeris's Redwood Research’s current project provides an overview of Redwood's first research project (in progress). They are trying to 'handicap' GPT-3 to only produce non-violent completions; the idea is that there are many reasons we might ultimately want to apply some oversight function to an AI model, like "don't be deceitful", and if we want to get AI teams to apply this we need to be able to incorporate these oversight predicates into the original model in an efficient manner. #Obstruction

Shlegeris's The alignment problem in different capability regimes provides a disambiguation between a couple of different AI scenarios and the types of alignment problems and solutions that would be relevant.  #Overview

Finances

Redwood apparently has ample funding at the present time (They recently suggested that they didn’t expect to be able to produce a lot more output with more funding) and hence is not currently looking for donations from the general EA public.

Ought

Ought is a San Francisco based independent AI Safety Research organisation founded in 2018 by Andreas Stuhlmüller and run by Andreas and Jungwon Byun. They research methods of breaking up complex, hard-to-verify tasks into simple, easy-to-verify tasks - to ultimately allow us effective oversight over AIs. This includes building computer systems and recruiting test subjects. Apparently one of the best places to find their research is the mailing list here.

In the past they worked on factored generation – trying to break down questions into context-free chunks so that distributed teams could produce the answer (Christiano style) – and then factored evaluation – using similar distributed ideas to try to evaluate existing answers, which seemed a significantly easier task (by analogy to how verifying a solution is often easier than finding one, as with P vs NP).

They are now working on a system called Elicit, an automated research assistant, which uses language modelling to do things like try to propose new research directions and literature review.

James Brady will start as Head of Engineering in January 2022.

Research

Alex et al.'s RAFT: A Real-World Few-Shot Text Classification Benchmark provides a benchmark of real-world tasks from the Elicit community, like classifying NeurIPS ethics statements, for few-shot learning, and tests various models on them. The benchmark aims to measure how far current models are from automating economically valuable work. Researchers from Gov.AI were also named authors on the paper. #Capabilities

Finances

They spent $1,200,000 in 2020 and $1,400,000 in 2021, and plan to spend around $2,000,000 in 2022. They have around $3,800,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.9 years of runway.

If you want to donate you can do so here.

AI Impacts

AI Impacts is a Berkeley based AI Strategy organisation founded in 2014 by Katja Grace and Paul Christiano. They are affiliated with (a project of, with independent financing from) MIRI. They do various pieces of strategic background work, especially on AI Timelines - it seems their previous work on the relative rarity of discontinuous progress has been relatively influential. A lot of their work is in the form of a private wiki collecting potentially useful background information. Their research can be found here. You can see a description of the questions they work on here.

For most of this year they have been down to 1-2 people, but are planning on hiring back up in 2022.

Research

They have produced a series of pieces on how long it has historically taken for AIs to cover the human range (from beginner to expert to superhuman) for different tasks. This seems relevant because people only seem to really pay attention to AI progress in a field when it starts beating humans. These pieces include Starcraft, ImageNet, Go, Chess and Draughts.

Grace's Beyond fire alarms: freeing the groupstruck is a detailed response to Eliezer's classic post. She argues, contra Eliezer, that the main purpose of fire alarms is not to create common knowledge and overcome awkwardness; they also do normal things like provide evidence about the existence of fires, and make staying inside unpleasant. I thought this was persuasive, but also that Eliezer's main conclusion still held: even if he didn't understand fire alarms (indeed, prior to reading this post I didn't realise that I didn't understand fire alarms) it is still true and bad that there is no fire alarm, and it is worth making people aware of this. See also the discussion here. #Forecasting

The AI Vignettes Project was a series of exercises where people wrote short 'stories' for how AI development might unfold. Others then critiqued them to try to improve their plausibility. See also here. #Forecasting

Fernandez's How energy efficient are human-engineered flight designs relative to natural ones? finds that animal flight is significantly more energy-efficient than human flight.  #Forecasting

Grace's Argument for AI x-risk from large impacts lays out this argument for the importance of AGI safety and some responses.  #Forecasting

Grace's Coherence arguments imply a force for goal-directed behavior argues that coherence arguments do indeed show that agents which start out being weakly goal seeking will end up being strongly goal seeking.   See also the discussion here. #AgentFoundations

Finances

They spent $280,000 in 2020 and $240,000 in 2021, and plan to spend around $650,000 in 2022 (roughly twice their 2019 peak of $316,000). They have around $340,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 0.5 years of runway. In the past they have received support from EA organisations like OpenPhil and FHI.

MIRI administers their finances on their behalf; donations can be made here.

GPI: The Global Priorities Institute

GPI is an Oxford-based Academic Priorities Research organisation founded in 2018 by Hilary Greaves and part of Oxford University. They do work on issues in philosophy and economics likely to be very important for global prioritisation, much of which is, in my opinion, relevant to AI Alignment work. Their research can be found here.

They recently took on two new philosophy postdocs (Hayden Wilkinson and Adam Bales) and will be joined by Timothy Williamson soon; they didn’t disclose any departures.

Research

I cover only the more AI Xrisk relevant papers; notably we do not include the temporal discounting papers, even though they are relevant.

Mogensen's Do not go gentle: why the Asymmetry does not support anti-natalism argues that even if you hold the view that it is bad to create unhappy people but not good to create happy ones, it still doesn't follow that it would be good for humanity to go extinct. This is because, in order to avoid the well-known transitivity problem with the Asymmetry, you should adopt an incommensurability principle, which in turn means that combining neutral and bad things can be neutral overall. This is pretty counterintuitive, but I think this is basically just because the Asymmetry is counterintuitive to start with. #Ethics

Greaves & MacAskill's The case for strong longtermism argues that our most important decisions cannot be best overall without being best for the long-term future. This is actually somewhat weaker than previous discussions, because the scope covers only the most important decisions (career and donation), and hence not much 'everyday' behaviour. It is persuasive and diligent; I'd expect most readers here to already agree with the conclusions. #Ethics

Thorstad's The scope of longtermism discusses how many types of decisions Strong Longtermism is true for. He is pretty sceptical - due to knowledge problems plus 'washing out' he argues that while Strong Longtermism applies to a small number of Xrisk related questions, for virtually any other question (e.g. malaria funding) it fails to hold. #Ethics

Thomas's Simulation Expectation presents a refinement of Bostrom's simulation argument, moving from 'there are a lot of people in sims' to 'there are a lot of people like me in sims'. #Forecasting

Researchers from GPI were also named on the following papers:

Finances

They spent £850,000 in 2019/2020 (academic year) and £1,000,000 in 2020/21, below their plan of £1,400,000, and intend to spend around £1,800,000 in 2021/2022. They suggested that as part of Oxford University ‘cash on hand’ or ‘runway’ were not really meaningful concepts for them, as they need to fully-fund all employees for multiple years.

If you want to donate to GPI, you can do so here.

CLR: The Center on Long Term Risk

CLR is a London (previously Germany) based Existential Risk Research organisation founded in 2013 and led by Stefan Torges and Jesse Clifton. Until last year they were known as FRI (the Foundational Research Institute) and were part of the Effective Altruism Foundation (EAF). They do research on a number of fundamental long-term issues, with AI as one of their top focus areas.

In general they adopt what they refer to as ‘suffering-focused’ ethics, which I think is a quite misguided view, albeit one they seem to approach thoughtfully. A lot of their work is about avoiding conflict between different agents.

Research

Oesterheld & Conitzer's Safe Pareto Improvements for Delegated Game Playing presents an approach to delegated game playing where each principal 're-shapes' the incentives for their delegate. Basically each principal can exclude options and give their agent a different utility function in an attempt to push them towards not-less-globally-optimal play. It seems to assume a high degree of competence on behalf of the principals though, at which point do they really need agents? #GameTheory

Stastny et al.'s Multi-agent learning in mixed-motive coordination problems discusses games where agents have different preferences and there is no single cooperative equilibrium. Presumably this is a situation that seems likely to occur if we have multiple AGIs. They show that various existing techniques struggle in this setting. #GameTheory

Clifton's Collaborative game specification: arriving at common models in bargaining works on the issue of agents arriving at conflict 'unnecessarily' due to differing world models. The option it suggests is for the agents to try to share their models ahead of time, even though they don't trust each other, and use this resulting common model to optimise from. #GameTheory

Clifton's Weak identifiability and its consequences in strategic settings discusses the potential for unintended conflict in ultimatum games (and the like) due to uncertainty about the other agent's strategy. In the same way that a value learning agent can struggle to disambiguate preferences and beliefs of the humans it is studying, similarly it can be hard to distinguish resolve for agents. #GameTheory

Kokotajlo's Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain (part of a sequence) argues that, because evolution tends to produce complicated and messy designs, the fact that we are a long way from brain-level performance in some aspects doesn't necessarily mean we're a long way away from TAI.  #Forecasting

Researchers from CLR were also named on the following papers:

Finances

They have a collaboration with the Swiss-based Center for Emerging Risk Research, who fund part of their costs.

If you wanted to donate to CLR, you could do so here.

CSET: The Center for Security and Emerging Technology

CSET is a Washington based Think Tank founded in 2019 by Jason Matheny (ex IARPA), affiliated with Georgetown University. They analyse new technologies for their security implications and provide advice to the US government. At the moment they are mainly focused on near-term AI issues. Their research can be found here.

They seem to have good connections to the US government, especially the Democratic Party, who are currently in power; their cofounder Jason Matheny left to take up multiple senior roles in the Administration.

Most of the people they hire seem to be politics people, not EA people. 

Nuño Sempere’s evaluation of their work is available here.

Research

Arnold & Toner's AI Accidents: An Emerging Threat introduces some of the ideas around AI safety for policymakers. This includes robustness, specification problems and oversight. It basically tries to link them to near-term threats.  #Policy

Buchanan et al.'s How Language Models Could Change Disinformation investigates the potential for using GPT-3 for disinformation/propaganda campaigns. They run a series of experiments to generate plausible tweets to push narratives, add partisan slants to articles, and so on. The paper is obviously written with a left-wing audience in mind, which makes sense given they are attempting to influence the current US administration. #Policy

Fedasiuk et al.'s Harnessed Lightning discusses the ways in which the Chinese PLA is using AI.  #Forecasting

Murphy's Translation: Ethical Norms for New Generation Artificial Intelligence Released is a translation CSET did of a Chinese policy document on ethics in AI. Given the importance of China, and how few people speak Chinese, I think this is a pretty useful general activity, but it's hard to understand the significance of the document by itself; most of it is concerned with fairly high-level ethical goals. #Translation

Murphy's Translation: White Paper on Trustworthy Artificial Intelligence is a translation CSET did of a Chinese policy document on trustworthiness in AI. Unfortunately it doesn't seem to have a lot of discussion of Xrisk. #Translation

Baker's Ethics and Artificial Intelligence: A Policymaker's Introduction provides an overview of the ways governments could encourage the use of ethical standards, IRBs etc. for AI. It's mainly focused on near-term AI issues. #Policy

Mittelsteadt's Mechanisms to Ensure AI Arms Control Compliance describes various technical methods governments could use to ensure compliance with regulations on the use of AI. One of the suggestions is Van Eck phreaking, which remains very cool, but doesn't seem very practical. #Policy

Rudner & Toner's Key Concepts in AI Safety: An Overview is a very basic introduction to some of the issues in AI safety for policymakers; it is not Xrisk-focused, but has read-through to it. #Overview

Zwetsloot et al.'s The Immigration Preferences of Top AI Researchers: New Survey Evidence asked researchers who had published at NeurIPS etc. about their migration plans. As a first approximation, people want to live in the US (or UK) but have legal problems; other countries like China (and France) only really attract their own people back. Researchers from Gov.AI also appeared as authors on the paper. #Policy

Imbrie et al.'s Evaluating Rhetorical Dynamics in AI analyses the frequency of four different framings for AI futures in media articles. They find that the 'Killer Robots' framing peaked in 2015 (Musk/OpenAI?) and has decreased significantly since then. #Strategy

Aiken's Classifying AI Systems suggests some simplified classification schemas to make it easier for laymen to classify AI systems based on e.g. their autonomy and inputs.  #Overview

Crawford & Wulkan's Federal Prize Competitions discusses using prizes to incentivise AI development.  #Policy

Rudner & Toner's Key Concepts in AI Safety: Robustness and Adversarial Examples is a very basic introduction to (non-AGI) adversarial examples.  #Overview

Rudner & Toner's Key Concepts in AI Safety: Interpretability in Machine Learning is a very basic introduction to (non-AGI) model explainability.  #Overview

Finances

As they raised $50m from OpenPhil (a,b,c) this year, and have had similar successes in the past, I am assuming they do not need more donations at this time.

AI Safety camp

AISC is a globally based residential research camp organisation founded in 2018 by Linda Linsefors and currently led by Remmelt Ellen. They are affiliated with AI Safety Support. They bring together people who want to start doing technical AI research, hosting a 10-day camp aiming to produce publishable research. Their research can be found here. Their annual summary can be found here.

To the extent they can provide an on-ramp to get more technically proficient researchers into the field I think this is potentially very valuable. But I haven’t personally experienced the camps, and though I spoke to two people who found them valuable and seem good, these people were not randomly selected.

In the past each camp was run by different volunteers; they are in the process of transitioning to more consistent (and hence experienced) leaders.

Research

Koch et al.'s Objective Robustness in Deep Reinforcement Learning provides a series of toy examples demonstrating objective robustness failure. In each case the agent's capabilities are robust, so it can still navigate the environment, but it has failed to learn the objective properly. See also the discussion here. Researchers from CLR were also named authors on the paper. #Robustness

Finances

They spent $11,162 in 2020 and $29,665 in 2021, and plan to spend around $153,400 in 2022. They have around $236,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.5 years of runway.
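The naïve runway calculation used here (and for the other organisations below) is simply reserves divided by planned annual spend; a minimal sketch, ignoring complications like restricted funds and pledge timing:

```python
def runway_years(reserves: float, planned_annual_spend: float) -> float:
    """Naive runway estimate: (cash + pledged funding) / planned annual spend.

    Ignores restricted funds, pledge timing, and mid-year budget changes.
    """
    return reserves / planned_annual_spend

# AISC's figures: ~$236,000 in reserves against a ~$153,400 planned 2022 budget.
print(round(runway_years(236_000, 153_400), 1))  # -> 1.5
```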

If you want to donate, the web page is here.

FLI: The Future of Life Institute

FLI is a Boston-based independent existential risk organization, focusing on outreach, founded in large part to help organise the regranting of $10m from Elon Musk. They cover nuclear, biological and AI risks; one of their major projects is trying to ban Lethal Autonomous Weapons.

They had a submission on the EU AI act here.

Readers might be interested in their podcasts, such as here, here and here.

FLI received a large grant - $25m at time of donation, but crypto has rallied since then so probably more – from Vitalik Buterin, which they are using to fund a new grant program (similar to the previous ones funded by Elon Musk). These grants will fund both projects (the Shiba Inu Grants) and talent development like school programs or postdocs (the Vitalik Buterin Fellowships).

Nuño Sempere’s evaluation of FLI is available here.

Lightcone Infrastructure

Lightcone Infrastructure is a Berkeley based independent Meta Longtermist organisation founded in 2021 by Oliver Habryka. They provide a range of infrastructure and support to the Longtermist movement, most prominently the LessWrong website, but also the Lightcone office, workshops and retreats etc. Their slightly-out-of-date intro can be found here.

In general I have been pretty impressed with the team’s seriousness and strategic sense. Nuño Sempere’s evaluation of LessWrong is available here. Zvi’s views on Lightcone here.

Finances

They spent $500,000 in 2020 and $1,300,000 in 2021, and plan to spend around $2,000,000 in 2022. They have around $1,900,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1 year of runway.

The increase in budget is partly driven by their decision to pay nearly-market salaries (roughly 30% below market) for talent; to my knowledge they are basically the first EA org to do so. Obviously this increases their cost base a lot, but I think I tentatively support it – paying a lot less is basically like mandatory donations to your employer, which seems inefficient.

CLTR: Center for Long Term Resilience (formerly Alpenglow)

The CLTR is a London based independent policy think tank founded by Angus Mercer & Sophie Dannreuther. They work to connect top Xrisk researchers and ideas to the UK government. My impression is they are unusually skilful at this. Their website is here.

Research

CLTR researchers contributed to the following research led by other organisations:

Finances

If you wanted to donate you could do so here.

Rethink Priorities

Rethink Priorities is an internationally based independent EA research consultancy organisation founded in 2018 by Peter Wildeford & Marcus A Davis. They provide research on important EA issues for other EA organisations, and the broader movement. Their annual summary can be found here. You can read their research here.

Research

A lot of their work is contract work, where they are hired by another EA organisation to research specific topics, as described here. While historically their work has focused on other issues, they are currently ramping up their Longtermism work, which includes a significant AI governance component. Given the contract nature of much of their work, they will have a fair bit of non-public output, which obviously makes external evaluation a bit harder, though presumably this work is supported by organisations paying for that specific work anyway.

They haven’t published a lot on AI yet, but I have often been impressed with their work on other subjects in the past, and their description of planned projects (privately shared) seems sensible.

Finances

They spent $883,000 in 2020 and $2,100,000 in 2021, of which around $329,000 was for Longtermism, and tentatively plan to spend $1.5 - $4m in 2022 on Longtermism. (Note that these figures are higher than what they indicated in the budget section of their strategy post published in November 2021.) They have around $5,480,000 in cash and pledged funding, of which around $400,000 is earmarked for Longtermism. They suggested they had around 16 months of runway (because restricted funds cannot be used to run operations).

If you wanted to donate you could do so here.

Convergence

Convergence is a globally based independent Existential Risk Research organisation founded (incorporated and first grant) in 2018 by Justin Shovelain and David Kristoffersson. They do strategic research on x-risk reduction decision making. Their research can be found here.

They plan to hire several more people in 2022.

In 2021 they advised Lionheart Ventures on investing in AGI-related companies ethically, including evaluating 4 such firms. 

Research

No relevant public research for 2021.

Finances

They spent $14,000 in 2020 and $10,000 in 2021, and plan to spend around $100,000-300,000 in 2022. 

They recently received ‘substantial’ funding, and hence are not actively seeking donations at the moment, though if you wanted to donate anyway you could donate here.

SERI: The Stanford Existential Risk Initiative

SERI is a Stanford based student-faculty collaboration working on existential risk issues, founded in 2020; their website is here.

Research

GAA's Nuclear Espionage and AI Governance provides an overview of the impact of communist spies on the Manhattan project, and some potential lessons for AI safety.  It suggests that spying is more important if the scaling hypothesis is false and if AI projects are nationalised (as then nationalism could be a motivator, and groups might need to steal hardware if they can't buy it). His overall take seems to be that spying is bad, but he does note that secrecy tends to beget secrecy, and could be hard to combine with interpretability, which might be important for alignment. See also the discussion here. #Strategy

Other Research

I would like to emphasize that there is a lot of research I didn't have time to review, especially in this section, as I focused on reading organisation-donation-relevant pieces. In particular there is a lot of good work on the Alignment Forum. So please do not consider it an insult that your work was overlooked!

Filan's AXRP - the AI X-risk Research Podcast is a new podcast dedicated to discussing AI safety work.  #Overview

lifelonglearner and Hase's Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers is a ridiculously comprehensive overview of the work that has been done on making ML systems human-comprehensible over the last few years. I am going to have to admit I didn't read it all. #Interpretability

Turner's Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability argues that a wide range of policies, not just optimisers, are motivated to seek to control their environment, Omohundro-style. This is bad news insomuch as it presents a problem with various attempts to make AI 'unambitious' and hence safe. #AgentFoundations

Wentworth's Utility Maximization = Description Length Minimization shows that utility maximizers can be modelled as attempting to make the world simpler, according to a model of the world which assigns probability in accordance with utility. The maths is not complicated and once I read it the idea was obvious. Unfortunately it is now impossible for me to tell if it was obvious prior to reading - probably not! #AgentFoundations

Jiang et al.'s Delphi: Towards Machine Ethics and Norms is a transformer model trained on various ethical judgement datasets with a fun web frontend. They use five datasets, including Hendrycks et al.'s Aligning AI with Shared Human Values we discussed last year. There's been a lot of criticism of the model for producing absurd results (e.g. here) but it gave good responses to all but one of the ~30 prompts I gave it, including some attempted tricky ones; the only exceptions were, I suspect, side effects of their 'fix' for its prior political incorrectness. #ValueLearning

Wentworth's How To Get Into Independent Research On Alignment/Agency describes in a quite practical way John's experiences and advice for doing useful AI work outside of a research org. Given the recent dramatic increase in the viability of this as a career (largely due to the LTFF) I thought this was a good post, for making more people aware of this possibility if nothing else. #Overview

Cihon et al.'s Corporate Governance of Artificial Intelligence in the Public Interest expands on Belfield's work last year to show a very wide variety of ways in which AI corporations can be influenced, collaboratively and adversarially, to change their behaviour. My main concern is that it is not clear how to make sure these structures actually do useful work, as opposed to filling up with grifters and ideologues; they list some past 'success stories', but it is not clear to me that many of these instances of influencing corporate behaviour actually had a *positive* influence. (Also, more mundanely, they misunderstood why shareholders have more influence than bondholders: it is because shareholders are the residual claimants on cashflows.) Researchers from GCRI and the Legal Priorities Project were also named authors on the paper. #Policy

Cihon et al.'s AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries surveys the existing landscape for AI ethical certifications (basically all near-term) and discusses the potential for Xrisk relevant certification.  Researchers from GCRI and the Legal Priorities Project were also named authors on the paper. #Policy

Prunkl et al.'s Institutionalizing ethics in AI through broader impact requirements compares the NeurIPS ethics statements to similar things in other fields and considers their impact. They also include a number of suggestions for ameliorating their weaknesses. Researchers from Gov.AI were also named authors on the paper. #Policy

Ashurst et al.'s AI Ethics Statements: Analysis and lessons learnt from NeurIPS Broader Impact Statements provides descriptive statistics around the statements, and discusses why they were discontinued. It seems that in general they were not very successful at causing researchers to pay attention to important things. Researchers from Gov.AI were also named authors on the paper. #Policy

Davis's Feature Selection is a very well written short story about what it feels like on the inside to be a ML algorithm. I don't want to spoil it, but it does a good job illustrating various related points around e.g. Robustness. #Fiction

Niconiconi's Whole Brain Emulation: No Progress on C. elegans After 10 Years gives an update on progress (or lack thereof) in whole brain emulation for C. elegans. It seems that basically no-one was motivated to fund it so little progress has been made in the last 10 years. #Forecasting

Guterres's Our Common Agenda is a report by the UN secretary general. It is largely a list of pieties, but mentions existential risks and AI (but not directly AI Xrisk). #Policy

Brown et al.'s Value Alignment Verification proposes a number of tests to determine if an agent is aligned with a human. Somewhat implausibly to me they suggest this works even in cases where both human and AI are black boxes, so long as they share an ontology.  Researchers from CHAI were also named authors on the paper. #ValueLearning

Liu & Maas's 'Solving for X?' Towards a problem-finding framework to ground long-term governance strategies for artificial intelligence argues we should spend less time trying to solve AI governance problems and more time looking for new problems. It mentions supply chains being highly vulnerable to pandemics, but they actually held up pretty well? Heavy on buzzwords. Researchers from CSER were also named authors on the paper. #Strategy

Andrus et al.'s AI Development for the Public Interest: From Abstraction Traps to Sociotechnical Risks makes some comments about the relationship between technical and social problems and grad school.  Researchers from CHAI were also named authors on the paper. #Strategy

Chatila et al.'s Trustworthy AI provides some high-level discussion of issues like interpretability and human rights.  I was surprised by the confident assertion that because machines 'can only decide and act within a bounded set of possibilities' that they cannot make ethical decisions. Researchers from CHAI were also named authors on the paper. #Strategy

Maclure & Russell's AI for Humanity: The Global Challenges describes some possible applications of AI to the Sustainable Development Goals.  Researchers from CHAI were also named authors on the paper. #ShortTerm

Cave et al.'s Using AI ethically to tackle covid-19 raises various standard NearTerm objections against using AI to combat covid. I found this pretty unconvincing; none of the 'harms' it raises seem material compared to the pandemic. If an ML system for diagnosing covid based on the sound of coughs can save people's lives, the fact that not everyone has a smartphone doesn't seem like a good reason to ban it. Researchers from CSER were also named authors on the paper. #NearTerm

Klinova & Korinek's AI and Shared Prosperity recommends AI developers analyse the labour market impacts of their work. I think asking ML engineers - not even economists! - to attempt to micro-manage the economy in this way is a mistake, for reasons described by Lawson here. Researchers from Gov.AI were also named authors on the paper. #NearTerm

Capital Allocators & Other Organisations

One of my goals with this document is to help donors make an informed choice between the different organisations. However, it is quite possible that you regard this as too difficult, and wish instead to donate to someone else who will allocate on your behalf. This is of course much easier; now instead of having to solve the Organisation Evaluation Problem, all you need to do is solve the dramatically simpler Organisation Evaluator Organisation Evaluation Problem.

It’s worth noting that many of the orgs in this category, being supported by large endowments, do not really take outside money.

LTFF: Long-term future fund

LTFF is a globally based EA grantmaking organisation founded in 2017, currently led by Asya Bergal and part of EA Funds. They are one of four funds set up by CEA (but now operationally independent, though they still report to the CEA board) to allow individual donors to benefit from specialised capital allocators; this one focuses on long-term future issues, including a large focus on AI Alignment. Their website is here. In 2021 they did a May grant round (writeup, discussion). At time of writing there are no public write-ups for the grants from the rest of the year.

Grant applications are now accepted on a rolling basis: you can apply at any time of year.

The fund is now run by four people (plus advisors), and the grants have gone to a wide variety of causes, many of which would simply not be accessible to individual donors.

The fund managers are currently:

●        Asya Bergal

●        Adam Gleave

●        Oliver Habryka

●        Evan Hubinger

Evan is new, replacing Helen Toner, who left, and Matt Wage, who became an advisor. I know Asya and Habryka reasonably well and think they will make generally good grants; Adam and Evan I know less well but seem also good. There has been a fair bit of manager turnover, and this will probably continue.

Not mentioned on the website, there were also several part-time managers in 2021, a practice which seems likely to continue with different people:

  • Daniel Eth
  • Ozzie Gooen
  • Luisa Rodriguez
  • [one other non-voting ‘advisor’ who requested anonymity]
  • (sort of) Jonas Vollmer

 The managers now have the option to be paid for their work by CEA (on an OpenPhil grant).

In total for 2021, based on my calculations, they granted around $4.96m. In general most of the grants seem at least plausibly valuable to me, and many seemed quite good indeed. There weren’t any in 2021 that seemed significantly negative. I estimate that 66% of the dollars went to AI-relevant activities (including partial credit for some things), and 85% were to grants I would have made (again including partial credit).

I attempted to classify the recommended grants by type. Note that ‘training’ means paying an individual to self-study. One type of funding I’m not really sure how to classify is income support, where a researcher already has a stipend, but the LTFF thinks they could be more effective if they didn’t have to worry so much about (in the grand scheme of things) relatively small amounts of money.

I have deliberately omitted the exact percentages because this is an informal classification.

 

Of these categories, I am most excited by the Individual Research, Event and Platform projects. I am generally somewhat sceptical of paying people to ‘level up’ their skills. (Many) individual donors are perfectly capable of evaluating large organisations that publicly advertise for donations. In donating to the LTFF, I think (many) donors are hoping to be funding smaller projects that they could not directly access themselves. As it is, such donors will probably have to consider such organisation allocations a mild ‘tax’ – to the extent that different large organisations are chosen than they would have picked themselves.

Tetlock et al.’s forecasting work, for which the fund managers recommended $572,000, was the largest single grant (around 12% of the 2021 total), followed by EA Geneva at $310,000, and Amon Elders (PhD), BERI and Kristaps Zilgalvis (PhD), at $250,000 each.

I was able to view what the LTFF regarded as its ‘marginal but rejected’ applications from the May round; in general there were some decent projects there I’d be happy to fund.

In the past all grants had to be approved by CEA before they were made; my understanding is this requirement is more pro forma now due to the increased independence. I only know of one grant previously vetoed, and this was widely agreed to have been a bad grant, so losing this aspect of quality control seems like a shame to me.

The EA Funds have previously discussed trying to adopt an active grantmaking approach, where instead of just reactively evaluating proposals they receive they will actively search for good opportunities. However this does not seem to have happened to a significant degree yet.

Nuño Sempere did an excellent retrospective on outcomes from the LTFF’s 2018-2019 grant rounds here. My interpretation of his evaluation was generally quite positive for the LTFF:

Wentworth's How To Get Into Independent Research On Alignment/Agency suggests that the LTFF has been crucial to enabling the emergence of independent safety researcher as a viable occupation; this seems like a very major positive for the LTFF.

I am quite concerned about the lack of transparency the LTFF provides donors. In the past there have been a number of issues around this (e.g. incorrect numbers on the website, releasing or not releasing information and then claiming otherwise), but as a small volunteer-run organisation I figured these were to be expected. With the shift towards professional (paid) management, and a stated intention to provide better disclosure, I expected things to become significantly better.

However, this has not been the case.

Partly this is the result of deliberate policy change. In 2020 they made an anonymous grant (roughly 3% of the total), and they have now adopted a policy of allowing people to apply anonymously. I understand why this could be appealing for applicants, and why the LTFF would want to not exclude potentially good but publicity-shy applicants. However by doing so they undermine the ability of the donor community to provide oversight, which is definitely a bit concerning to me.

More concerning to me however what appears to be a lack of disclosure due to simple oversight. At time of writing (2021-12-20) the latest grant round mentioned on the LTFF website is April, despite apparently there also having been two since then – something donors would have no way of knowing without personally reaching out. The ‘Fund Payouts’ number, despite claiming to be ‘to date’, is around 8 months, 55 grants and ~$3.5m out of date. And despite having had many temporary Fund Managers this year, as far as I can see nowhere on the website are these mentioned.

As a result overall my impression is that donors have much less accurate information available to them to evaluate the LTFF this year than they did in prior years. While I’m grateful to them for personally sharing drafts about their recent activities with me, ideally this would be shared directly with all donors.

If you wish to donate to the LTFF you can do so here.

OpenPhil: The Open Philanthropy Project

The Open Philanthropy Project (separated from GiveWell in 2017) is an organisation dedicated to advising Dustin Moskovitz and Cari Tuna on how to give away over $15bn to a variety of causes, including existential risk. They have made extensive donations in this area and probably represent both the largest pool of EA-aligned capital (at least pre-FTX) and the largest team of EA capital allocators.

They described their strategy for AI governance, at a very high level, here.

Grants

They have a large and extremely capable grant evaluation team, though arguably small on an evaluator-per-dollar basis. In general I think they do a very good job of thinking strategically, analysing individual grants, and giving feedback to fundees for improvement.

You can see their grants for AI Risk here. It lists 26 AI Risk grants in the last 12 months, plus 2 other highly relevant ‘other’ grants. In total I estimate they spent about $68.5m on AI (giving partial credit for related grants).

This was dominated by two large grants:

  • CSET: $47m
  • CHAI: $12m

This compares to $324m across 135 grants in total over the period, so AI related work was around 21%.

They put out an RFP for various Longtermist outreach programs here.

The OpenPhil AI Fellowship basically fully funds AI PhDs for students who want to work on the long term impacts of AI. Looking back at the 2018 and 2019 classes (who presumably will have had enough time to do significant work since receiving the grants), scanning the abstracts of their publications on their websites suggests that over half have no AI safety relevant publications in 2019, 2020 and 2021, and only two are [co]authors on what I would consider a highly relevant paper. Apparently it is somewhat intentional that these fellowships are not intended to be specific to AI safety, though I do not really understand what they are intended for. OpenPhil suggested that part of the purpose was to build a community, which I don’t really understand, because there is limited evidence that the winners work together, at least for the first two cohorts.

They also launched a scholarship program last year which seems more tailored to people focused on the long-term future, though it is not AI specific, and they recommend AI-focused longtermists apply to the AI one first. There is also a separate scholarship program for technology policy people.

Their funding is sufficiently dominant in the EA movement that, as Linch pointed out, it can make independent evaluation difficult. Virtually everyone capable of doing so either has received OpenPhil money in the past or might want to do so in the future.

Research

Most of their research concerns their own granting, and is often non-public.

Cotra's The case for aligning narrowly superhuman models  suggests we should work on making large, 'general' AI models better able to achieve specific human goals. The one-sentence description makes this sound pretty bad but it's actually an interesting idea. You take systems like GPT-3, which seem like they have the 'power' to solve many interesting problems, but aren't 'motivated' to do so, and try to give them that motivation, *without* making them more powerful (e.g. scaling up). This could, perhaps, function as something of a dry-run for the big alignment task. One particular example she refers to as 'sandwiching': aligning an AI to help a group of laymen reach expert-level performance on a task, because the experts provide you with a ground-truth for performance evaluation. See also the discussion here. Overall I thought this was an excellent paper. #Strategy

Davidson's Could Advanced AI Drive Explosive Economic Growth? discusses some simple economic growth models and what they suggest for future growth. The core insight - that human-level AI could undo the demographic transition and unlock super-exponential growth again - should, I think, not be very surprising. #Forecasting

Karnofsky's All Possible Views About Humanity's Future Are Wild, and the related sequence of posts, argues that all plausible inside views suggest the future is going to be weird - e.g. massive growth, extinction, value lock in etc. Many of these ideas are not new but they are well presented. #Forecasting

Beckstead & Thomas's A paradox for tiny possibilities and enormous values discusses Pascalian arguments, and the problems that arise if you try to reject small-probability-massive-payoff cases.  In particular, they generalise beyond the expected utility framework. Researchers from GPI were also named authors on the paper. #Ethics

Finances

To my knowledge they are not currently soliciting donations from the general public, as they have a lot of money from Dustin and Cari, so incremental funding is less of a priority than for other organisations. They could be a good place to work however.

SFF: The Survival and Flourishing Fund

SFF (website) is a donor advised fund, taking over activities previously run by BERI, but now with a separate team. SFF was initially funded in 2019 by a grant of approximately $2 million from BERI, which in turn was funded by donations from philanthropist Jaan Tallinn; Jaan remains the largest funder.

You can read Zvi’s experience of being an evaluator for the fund here.

Grants

In its grantmaking SFF uses an innovative allocation process to combine the views of many grant evaluators (described here). SFF has published the results of two grantmaking rounds this year (described here and here), where they donated around $19.4m, of which I estimate around $13.8m (73%) was AI related, and 75% were to things I would have funded (giving partial credit both times).

The largest donations in the year were to:

  • LTFF: $2.1m
  • Alpenglow/CLTR: $1.9m
  • Lesswrong/Lightcone: $1.9m
  • CLR: $1.2m
  • CFAR: $1.2m
  • ALLFED: $1.2m
  • David Krueger’s group at Cambridge: $1m

FTX Foundation

The FTX foundation is in the process of being launched to distribute some of the profits from FTX/Alameda, and hired Nick Beckstead (formerly a program officer at Open Phil making grants in this area) as CEO so I expect them to make large and thoughtful grants to highly relevant organisations.

 

BERI: The Berkeley Existential Risk Initiative

BERI is a (formerly Berkeley-based) independent Xrisk organisation, founded by Andrew Critch but now led by Sawyer Bernath. They provide support to various university-affiliated existential risk groups to facilitate activities (like hiring engineers and assistants) that would be hard within the university context, alongside other activities - see their FAQ for more details.

In 2019 they pivoted, dropping various non-core activities (e.g. grantmaking) and are now essentially entirely focused on providing support to researchers engaged in longtermist (mainly x-risk) work at universities and other institutions. They have five main collaborations:

  • FHI: The Future of Humanity Institute
  • CSER: The Center for the Study of Existential Risks
  • CHAI: The Center for Human Compatible AI
  • SERI: The Stanford Existential Risk Initiative (previously a trial collaboration)
  • ALL: The Autonomous Learning Laboratory at UMass Amherst (previously a trial collaboration)

In addition they have a large number of trial collaborations:

  • CLTC: The Center for Long-Term Cybersecurity
  • CTPL: The Tech Policy Lab at Cornell
  • David Krueger’s unnamed lab at Cambridge
  • Dylan Hadfield-Menell’s lab at MIT
  • InterAct – the Interactive Autonomy and Collaborative Technologies Lab (Anca Dragan)
  • Meir Freidenberg and Joe Halpern at Cornell
  • The Anh Han group at Teesside
  • The Safe Robotics Laboratory at Princeton
  • The Sculpting Evolution Group at the MIT Media Lab
  • Yale Effective Altruism

I think this is potentially a pretty attractive project. University affiliated organisations provide the connection to mainstream academia that we need, but run the risk of inefficiency both due to their lack of independence from the central university and also the relative independence of their academics. BERI potentially offers a way for donors to support the university affiliated ecosystem in a targeted fashion.

In general they operate on a pull model, where they provide resources to help their groups achieve their goals, and seem quite unlikely to say ‘no’ unless the request was literally illegal or similar. BERI seems to effectively exercise discernment at the level of which organisations they collaborate with, not the individual project level. So if you were not a fan of the groups they collaborated with, supporting BERI would probably not be the right choice for you.

They are apparently quite relaxed about getting credit for work, so not all the stuff they support will list them in the acknowledgments.

Finances

They spent $2,800,000 in 2020 and $2,300,000 in 2021, and plan to spend around $2,000,000 in 2022. They have around $2,400,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.2 years of runway.

BERI is now seeking support from the general public. If you wanted to donate you can do so here. Note that you can restrict your funding to specific collaborations if you want, though my guess is funging might be ~100% for small donors.

Nonlinear Fund

Nonlinear is an internationally based independent Meta AI Safety organisation founded in 2021 by Kat Woods and Emerson Spartz. They aim to provide similar services to those Kat provided at Charity Entrepreneurship: helping launch new projects that provide value to the AI safety community. You can read about them here, or on their website here.

One of the big projects they plan to work on is helping EAs hire personal assistants, which seems like a potentially pretty effective way of unlocking people’s time, as well as several other projects which all seemed like broadly good ideas.

Research

Woods's The Nonlinear Library provides automatically generated voice versions of top EA content. Given that a lot of people like listening to podcasts, this seems like a potentially huge accessibility improvement, which I could imagine more conservative organisations like CEA being concerned about offering for legal reasons. #Community

Finances

They are not actively soliciting donations but if you wanted to anyway you can by reaching out to Kat.

80,000 Hours

80,000 Hours is a London based EA Movement-Building organisation founded in 2011 by Will MacAskill & Ben Todd and currently led by Ben Todd. They are affiliated with CEA, and provide career research, coaching and headhunting for the world's most important careers, of which AI safety is a significant focus. Their research can be found here. Their website is here.

During the year, Peters Hartree and McIntyre left, and María Gutiérrez Rojas might leave next year. They hired Bella Forristal, Benjamin H, Matt Reardon and Alex Lawsen.

Their ability to make connections for people seems extremely valuable.

80,000 Hours's AI/ML safety research job board collects various jobs that could be valuable for people interested in AI safety. At the time of writing it listed 128 positions, all of which seemed like good options that it would be valuable to have sensible people fill. I suspect most people looking for AI jobs would find some on here they hadn't heard of otherwise, though of course for any given person many will not be appropriate. They also have job boards for other EA causes. #Careers

They have a very good podcast, readers might be interested in these episodes from this year:

80k also produced what I now regard as my ‘default’ non-text-based EA intro link.

Finances

They spent $3,050,000 in 2020 and $3,032,000 in 2021, and plan to spend around $3,600,000 in 2022. They have around $6,600,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.8 years of runway.

If you wanted to donate you can do so here.

AISS: AI Safety Support

AISS is a globally based independent AI Safety Support organisation founded in 2020 by JJ Hepburn and Linda Linsefors, growing out of the AI Safety Camps which remain a project of their organisation. They aim to provide coaching and support services to people early in the AI safety researcher pipeline.

Finances

They spent $0 in 2020 and $170,000 in 2021, and plan to spend around $650,000 in 2022. They have around $440,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 0.7 years of runway.

If you wanted to donate you could do so here.

Other News

Google continues to improve the performance of its ASICs.

EA Cambridge built an AI Safety Fundamentals curriculum to run in the new year that is accepting applications.

Facebook got a lot of (somewhat misleading) negative press over leaked reports that its products made users unhappy; Nir Eyal points out that this sort of forced-openness reduces the incentives for tech companies to try to address such problems: if they had never researched the issue in the first place there would have been nothing to leak.

The EU has a proposal for a big AI regulation, somewhat modelled after GDPR. It focuses on the use of AI in what they perceive to be ‘high risk’ areas, like biometrics, utility infrastructure, and personal vetting. For these use cases the requirements are quite restrictive, demanding technical documentation and human overrides. However, it does not seem to apply to pre-deployment systems, and focuses on AI use cases, rather than the power of the system, to the extent that AGI systems not used in high risk applications are explicitly exempted. So it seems that much of e.g. Deepmind would currently be triply exempted: the UK has left the EU, much of their work is pre-market, and AGI is explicitly not a focus. It does ban subliminal messaging, which seems good – we do not want AIs changing people’s values – but excludes military AIs. In theory the institutions set up by this could provide infrastructure for further future AGI regulation, but in practice often EU regulations are not amended despite clear deficiencies, and political attention may move elsewhere. I’ve been told that ‘it will hurt EU AI companies, slowing down progress and reducing competition’ is not the reason policy EAs like it. See also here.

Organisation Second Preferences

A new strategy I employed this year was to ask each organisation I contacted which organisation, other than themselves, they would be most excited to see receive funding. I figured this could be a good way to take advantage of their domain-specific knowledge, including of research directions, strategies and personnel quality, in an efficient manner. There is however a potential bias towards well-known and socially central organisations.

Not every organisation was willing to name other orgs they preferred funding to go to; I should probably have made it explicit that I wouldn’t share this info except in aggregated (and hence largely anonymised) form. You should probably assume that organisations I had better social bonds with would be more likely to share this info.

The clear winner of this was the LTFF; no other organisation came close. Of course it is possible that some of these organisations may have thought that the LTFF might give them grants, but their expected share of such incremental dollars is likely small, and I think most of these reports were honest representations of their views. The LTFF was so far ahead of any other organisation that this seems like a significant data point in their favour. A distant second were non-specific sentiments along the lines of “fund something that seems underfunded”.

Methodological Thoughts

Inside View vs Outside View

This document is written mainly, but not exclusively, using publicly available information (as well as emailing the organisations with a few simple questions). In the tradition of active management, I hope to synthesise many individually well-known facts into a whole which provides new and useful insight to readers. Advantages of this are that 1) it is relatively unbiased, compared to inside information, which invariably favours those you are close to socially, and 2) most of it is legible and verifiable to readers. The disadvantage is that there are probably many pertinent facts that I am not party to! Wei Dai has written about how much discussion now takes place in private google documents – for example this Drexler piece apparently; in most cases I do not have access to these. If you want the inside scoop I am not your guy; all I can supply is exterior scooping.

We focus on papers, rather than outreach or other activities. This is partly because they are much easier to measure: while there has been a large increase in interest in AI safety over the last year, it’s hard to work out who to credit for this. It is also partly because I think progress has to come from persuading AI researchers, which happens through technical outreach and publishing good work, not popular/political work.

Organisations vs Individuals

Many capital allocators seem to operate under a sort of Great Man theory of investment, whereby the most important thing is to identify a guy to invest in who is really clever and ‘gets it’. I think there is a lot of merit in this (as argued here for example); however, I think I believe in it less than they do. In particular, I worry that this approach leads to over-funding skilled rhetoricians and those the investor/donor is socially connected to. Perhaps as a result of my institutional investment background, I place a lot more weight on historical results. Also, as a practical matter, it is hard for individual donors to fund individual researchers. But as part of a concession to the individual-first view I’ve started asking organisations if anyone significant has joined or left recently, though in practice I think organisations are far more willing to highlight new people joining than old people leaving.

Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is the correct approach for donors who are not tightly connected to the organisations in question. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies and I think the same model applies here. (I actually think that even those with close personal knowledge should use historical results more, to help overcome their biases.)

This judgement involves analysing a large number of papers relating to Xrisk that were produced during 2021. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric; I have tried to indicate cases where this doesn’t apply. I also attempted to include papers during December 2020, to take into account the fact that I'm missing the last month's worth of output from 2021, but I can't be sure I did this successfully.

Politics

My impression is that policy on most subjects, especially those that are more technical than emotional, is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers in academia and industry) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant technophobic opposition to GM foods or other kinds of progress. We don't want the 'us-vs-them' situation that has occurred with climate change to happen here. AI researchers who are dismissive of safety law, regarding it as an imposition and encumbrance to be endured or evaded, will probably be harder to convince of the need to voluntarily be extra-safe - especially as the regulations may actually be totally ineffective.

The only case I can think of where scientists are relatively happy about punitive safety regulations, nuclear power, is one where many of those initially concerned were scientists themselves; those regulations also had the effect of basically ending any progress in nuclear power (at great cost to climate change). Given this, I actually think policy outreach to the general population is probably negative in expectation.

If you’re interested in this, I’d recommend you read this blog post from a few years back.

Openness

I think there is a strong case to be made that openness in AGI capacity development is bad. As such I do not ascribe any positive value to programs to ‘democratize AI’ or similar.

One interesting question is how to evaluate non-public research. For a lot of safety research, openness is clearly the best strategy. But what about safety research that has, or potentially has, capabilities implications, or other infohazards? In this case it seems best if the researchers do not publish it. However, this leaves funders in a tough position – how can we judge researchers if we cannot read their work? Maybe instead of doing top secret valuable research they are just slacking off. If we donate to people who say “trust me, it’s very important and has to be secret” we risk being taken advantage of by charlatans; but if we refuse to fund, we incentivize people to reveal possible infohazards for the sake of money. (Is it even a good idea to publicise that someone else is doing secret research?)

For similar reasons I prefer research to not be behind paywalls or inside expensive books, but this is a significantly less important issue.

More prosaically, organisations should make sure to upload the research they have published to their website! Having gone to all the trouble of doing useful research, it is a constant shock to me how many organisations don’t take this simple step to significantly increase the reach of their work. Additionally, several times I have come across incorrect information on organisations’ websites.

Research Flywheel

My basic model for AI safety success is this:

  1. Identify interesting problems
    1. As a byproduct this draws new people into the field through altruism, nerd-sniping, apparent tractability
  2. Solve interesting problems
    1. As a byproduct this draws new people into the field through credibility and prestige
  3. Repeat

One advantage of this model is that it produces both object-level work and field growth.

Over time, hopefully an increasingly large fraction of AI researchers will be safety conscious, such that they voluntarily choose to adopt safer techniques, due to the desires of workers, management and specialist investors. This theory of change does not operate via politicians, governments or voters. It does have some weak spots, e.g. China.

There is also some value in arguing for the importance of the field (e.g. Bostrom’s Superintelligence) or addressing criticisms of the field.

Noticeably absent are strategic pieces. I find that a lot of these pieces do not add terribly much incremental value. Additionally, my suspicion is that strategy research is, to a certain extent, produced exogenously by people who are interested / technically involved in the field. This does not apply to technical strategy pieces, about e.g. whether CIRL or Amplification is a more promising approach.

There is somewhat of a paradox with technical vs ‘wordy’ pieces however: as a non-expert, it is much easier for me to understand and evaluate the latter, even though I think the former are much more valuable.

Differential AI progress

There are many problems that need to be solved before we have safe general AI, one of which is not producing unsafe general AI in the meantime. If nobody was doing non-safety-conscious research there would be little risk of, or haste towards, AGI – though we would be missing out on the potential benefits of safe AI.

There are several consequences of this:

  • To the extent that safety research also enhances capabilities, it is less valuable.
  • To the extent that capabilities research re-orientates subsequent research by third parties into more safety-tractable areas it is more valuable.
  • To the extent that safety results would naturally be produced as a by-product of capabilities research (e.g. autonomous vehicles) it is less attractive to finance.

One approach is to research things that will make contemporary ML systems safer, because you think AGI will be a natural outgrowth from contemporary ML. This has the advantage of faster feedback loops, but is also more replaceable (as per the previous section).

Another approach is to try to reason directly about the sorts of issues that will arise with superintelligent AI. This work is less likely to be produced exogenously by unaligned researchers, but it requires much more faith in theoretical arguments, unmoored from empirical verification.

Near-term AI safety issues

Capacity building vs tolerating poor epistemics?

Many people want to connect AI existential risk issues to ‘near-term’ issues; I am generally sceptical of this. For example, autonomous cars seem to risk only localised tragedies (though if they were hacked and all crashed simultaneously that would be much worse), and private companies should have good incentives here. Unemployment concerns seem exaggerated to me, as they have been for most of history (new jobs will be created), at least until we have AGI, at which point we have bigger concerns. Similarly, I generally think concerns about algorithmic bias are essentially political - I recommend this presentation - though there is at least some connection to the value learning problem there.

Some people argue that work on these near-term AI issues is worthwhile because it can introduce people to the broader risks around poor AI alignment. It could also lead to the creation of AI governance institutions that could then do useful work later. However, this seems somewhat disingenuous, and it risks attracting grifters while putting off people who recognise that these are bad concerns. For example, the paper mentioned above rejects the precautionary principle for AI on the basis of rejecting bad arguments about unemployment - had these pseudo-strawman views not been widespread, it would have been harder to reach this unfortunate conclusion.

It’s also the case that many of the policies people recommend as a result of these worries are potentially very harmful. A good example is GDPR and similar privacy regulations (including HIPAA), which have made many good things much more difficult - including degrading our ability to track the pandemic.

Some interesting speculation I read is the idea that discussing near-term AI safety issues might be a sort of ‘greenwashing’ immune response to Xrisk concerns. The ability to respond to long-term AI safety concerns with “yes, we agree AI ethics is very important, and that’s why we’re working on privacy and decolonising AI” seems like a very rhetorically powerful move.

Financial Reserves

Charities like having financial reserves to provide runway and guarantee that they will be able to keep the lights on for the immediate future. This could be justified if you thought that charities were expensive to create and destroy, and were worried about this occurring by accident due to the whims of donors. Unlike a company, which sells a product and so has a recurring source of revenue, charities depend on the continued goodwill of donors, so it seems reasonable that they should be more concerned about this.

Donors prefer charities not to hold too much in reserves. Firstly, those reserves are cash that could be being spent on outcomes now, by either the specific charity or others. Valuable future activities by charities are supported by future donations; they do not need to be pre-funded. Additionally, having reserves increases the risk of organisations ‘going rogue’, because they are insulated from the need to convince donors of their value.

As such, in general I do not give full credence to charities saying they need more funding because they want much more than 18 months or so of runway in the bank. If you have a year’s reserves now, after this December you will have that plus whatever you raise now, giving you a margin of safety before raising again next year.

I estimated reserves = (cash and grants) / (next year’s budget). In general I think of this as something of a measure of urgency. However, despite being prima facie a very simple calculation, there are many issues with this data. As such these figures should be considered suggestive only.
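As a worked illustration of this naïve calculation (the function name is my own), here it is applied to the figures quoted in the Finances sections above:

```python
def runway_years(cash_and_pledged: float, next_year_budget: float) -> float:
    """Naive runway estimate: reserves divided by next year's planned spending."""
    return cash_and_pledged / next_year_budget

# Figures from the Finances sections above.
print(round(runway_years(2_400_000, 2_000_000), 1))  # BERI: 1.2
print(round(runway_years(6_600_000, 3_600_000), 1))  # 80,000 Hours: 1.8
print(round(runway_years(440_000, 650_000), 1))      # AISS: 0.7
```

Note this ignores timing of pledges, restricted funds, and expected fundraising, which is why I call it naïve.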

Donation Matching

In general I believe that charity-specific donation matching schemes are somewhat dishonest, despite my having provided matching funding for at least one in the past.

Obviously cause-neutral donation matching is different and should be exploited. Everyone should max out their corporate matching programs if possible, and things like the annual Facebook Match continue to be great opportunities.

Poor Quality Research

Partly thanks to the efforts of the community, the field of AI safety is considerably more well respected and funded than was previously the case, which has attracted a lot of new researchers. While generally good, one side effect of this (perhaps combined with the fact that many low-hanging fruits of the insight tree have been plucked) is that a considerable amount of low-quality work has been produced. For example, there are a lot of papers which can be accurately summarized as asserting “just use ML to learn ethics”. Furthermore, the conventional peer review system seems to be extremely bad at dealing with this issue.

The standard view here is just to ignore low quality work. This has many advantages: for example, 1) it requires little effort, and 2) it doesn’t annoy people. This conspiracy of silence seems to be the strategy adopted by most scientific fields, except in extreme cases like anti-vaxxers.

However, I think there are some downsides to this strategy. A sufficiently large milieu of low-quality work might degrade the reputation of the field, deterring potentially high-quality contributors. While low-quality contributions might help improve Concrete Problems’ citation count, they may use up scarce funding.

Moreover, it is not clear to me that ‘just ignore it’ really generalizes as a community strategy. Perhaps you, enlightened reader, can judge that “How to solve AI Ethics: Just use RNNs” is not great. But is it really efficient to require everyone to independently work this out? Furthermore, I suspect that the idea that we can all just ignore the weak stuff is somewhat an example of typical mind fallacy. Several times I have come across people I respect according respect to work I found clearly pointless. And several times I have come across people I respect arguing persuasively that work I had previously respected was very bad – but I only learnt they believed this by chance! So I think it is quite possible that many people will waste a lot of time as a result of this strategy, especially if they don’t happen to move in the right social circles.

Having said all that, I am not a fan of unilateral action, and am somewhat selfishly conflict-averse, so will largely continue to abide by this non-aggression convention. My only deviation here is to make it explicit. If you’re interested in this you might enjoy this by 80,000 Hours.

The Bay Area

Much of the AI and EA communities, and especially the EA community concerned with AI, is located in the Bay Area, especially Berkeley and San Francisco. This is an extremely expensive place, and is dysfunctional both politically and socially. It seems to attract people who are extremely weird in socially undesirable ways, including numerous cult-like-things (though some are located elsewhere) –  though to be fair the people who are doing useful work in AI organisations seem to be drawn from a better distribution than the broader community. In general I think the centralization is bad, but if there must be centralization I would prefer it be almost anywhere other than Berkeley. Additionally, I think that, like VCs, some funders are geographically myopic, and biased towards funding things in the Bay Area. As such, I have a mild preference towards funding non-Bay-Area projects.

Conclusions

The size of the field continues to grow, both in terms of funding and researchers. Both make it increasingly hard for individual donors. I’ve attempted to subjectively weigh the productivity of the different organisations against the resources they used to generate that output, and donate accordingly.

An increasingly large amount of the best work is being done inside companies: Deepmind, OpenAI, Anthropic etc. While a good development overall - I am certainly very pleased that Deepmind has such a productive team - it means we can’t really do much here. And many of the not-for-profits are well funded.

My constant wish is to promote a lively intellect and independent decision-making among readers; hopefully my laying out the facts as I see them above will prove helpful to some. Here is my eventual decision, rot13'd so you can come to your own conclusions first (which I strongly recommend):

V jnag gb er-vgrengr gung V unir n ybg bs pbasyvpgf bs vagrerfg, fb guvf fubhyq abg or pbafvqrerq n 'erpbzzraqngvba' be nalguvat bs gur fbeg. Guvf vf whfg jurer V nz qbangvat guvf lrne.

[Ha?]sbeghangryl n ybg bs gur betnavfngvbaf gung V guvax qb gur orfg erfrnepu qb abg frrz yvxr cnegvphyneyl nggenpgvir shaqvat bccbeghavgvrf guvf lrne. Znal ner rvgure sbe-cebsvg pbzcnavrf be nyernql unir fgebat svanapvat sbe gurve pheerag cynaf. Bguref fvzcyl qb abg cebivqr gur yriry bs qvfpybfher erdhverq sbe rinyhngvba.

V qb erznva bcgvzvfgvp nobhg gur YGSS. Rira gubhtu gurve choyvp qvfpybfher unf orra jrnx guvf lrne, jung gurl'ir funerq cevingryl unf orra nqrdhngr, naq V guvax gurve fhccbeg sbe vaqvivqhny fnsrgl erfrnepuref vf rkgerzryl inyhnoyr. V jbhyq yvxr guvf gb or fhssvpvragyl bire-shaqrq gung fhssvpvrayl fxvyyrq crbcyr pna pbasvqragnyyl rffragvnyyl znxr n pnerre bhg YGSS shaqvat. Fb gung'f jurer V'z qbangvat guvf lrne. OREV cebonoyl jbhyq unir orra zl frpbaq cvpx.

Ohg lbh fubhyq pbzr gb lbhe bja pbapyhfvbaf!
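Once you have come to your own conclusions, the rot13'd passages above can be decoded mechanically; a minimal sketch using Python's standard library (the function name is mine):

```python
import codecs

def decode_rot13(text: str) -> str:
    """Decode a rot13'd passage using the stdlib rot_13 codec."""
    return codecs.decode(text, "rot_13")

print(decode_rot13("Ohg lbh fubhyq pbzr gb lbhe bja pbapyhfvbaf!"))
# prints "But you should come to your own conclusions!"
```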

However, I wish to emphasize that all the above organisations seem to be doing good work on the most important issue facing mankind. It is the nature of making decisions under scarcity that we must prioritize some over others, and I hope that all organisations will understand that this necessarily involves negative comparisons at times.

Thanks for reading this far; hopefully you found it useful. Apologies to everyone who did valuable work that I excluded!

If you found this post helpful, and especially if it helped inform your donations, please consider letting me and any organisations you donate to as a result know.

Disclosures

I have not in general checked all the proofs in these papers, and similarly trust that researchers have honestly reported the results of their simulations.

I have a large number of conflicts of interest that I cannot individually disclose.

I shared drafts of the individual organisation sections with representatives from FHI, Gov.AI, CHAI, MIRI, GCRI, Redwood Research, BERI, Ought, AI Impacts, GPI, ARC, CSET, Lightcone, CLTR/Alpenglow, CLR, OpenPhil, FTX, LTFF, NonLinear, Rethink Priorities, 80k, CSER (and possibly others I forgot)

My eternal gratitude to my anonymous reviewers for their invaluable reviewing. Any remaining mistakes are of course my own. I would also like to thank my wife and daughter for tolerating all the time I have spent/invested/wasted on this.

Looking for Research Assistant for Next Year

Over time the amount of ground we need to cover here has increased and my time has become more scarce. I have been spending more time just collecting dispersed information and less time being analytical. As such I think it might make sense to take on a Research Assistant for next year: basically I am looking for someone who is diligent, reliable and interested in AI Xrisk to email organisations, find information on websites and enter it into the spreadsheet. In the past CEA briefly expressed interest; possibly we could find funding for this.

Sources

This is a list of all the articles cited with their own individual paragraph.

AI Vignettes Project - 2021-06-25 - https://aiimpacts.org/ai-vignettes-project/

Aiken, Catherine - Classifying AI Systems - 2021-11-15 - https://cset.georgetown.edu/publication/classifying-ai-systems/

Alex, Neel; Lifland, Eli; Tunstall, Lewis; Thakur, Abhishek; Maham, Pegah; Riedel, C. Jess; Hine, Emmie; Ashurst, Carolyn; Sedille, Paul; Carlier, Alexis; Noetel, Michael; Stuhlmüller, Andreas - RAFT: A Real-World Few-Shot Text Classification Benchmark - 2021-10-28 - https://arxiv.org/abs/2109.14076

Andrus, McKane; Dean, Sarah; Gilbert, Thomas Krendl; Lambert, Nathan; Zick, Tom - AI Development for the Public Interest: From Abstraction Traps to Sociotechnical Risks - 2021-02-04 - https://arxiv.org/abs/2102.04255

Arnold, Zachary; Toner, Helen - AI Accidents: An Emerging Threat - 2021-07-15 - https://cset.georgetown.edu/publication/ai-accidents-an-emerging-threat/

Ashurst, Carolyn; Hine, Emmie; Sedille, Paul; Carlier, Alexis - AI Ethics Statements: Analysis and lessons learnt from NeurIPS Broader Impact Statements - 2021-11-02 - https://arxiv.org/abs/2111.01705

Baker, Jamie - Ethics and Artificial Intelligence: A Policymaker's Introduction - 2021-04-15 - https://cset.georgetown.edu/publication/ethics-and-artificial-intelligence/

Barnes, Beth; Christiano, Paul - Debate update: Obfuscated arguments problem - 2020-12-22 - https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem#comments

Baum, Seth; de Neufville, Robert; Barrett, Tony; Fitzgerald, McKenna - GCRI Statement on the January 6 US Capitol Insurrection - 2021-01-15 - https://gcrinstitute.org/gcri-statement-on-the-january-6-us-capitol-insurrection/

Baum, Seth; Owe, Andrea - Artificial Intelligence Needs Environmental Ethics - 2021-11-14 - https://gcrinstitute.org/artificial-intelligence-needs-environmental-ethics/

Beckstead, Nick; Thomas, Teruji - A paradox for tiny probabilities and enormous values - 2021-07-15 - https://globalprioritiesinstitute.org/nick-beckstead-and-teruji-thomas-a-paradox-for-tiny-probabilities-and-enormous-values/

Brown, Daniel S.; Schneider, Jordan; Dragan, Anca D.; Niekum, Scott - Value Alignment Verification - 2020-12-02 - https://arxiv.org/abs/2012.01557

Buchanan, Ben; Lohn, Andrew; Musser, Micah; Sedova, Katerina - How Language Models Could Change Disinformation - 2021-05-15 - https://cset.georgetown.edu/publication/truth-lies-and-automation/

Cammarata, Nick; Goh, Gabriel; Carter, Shan; Voss, Chelsea; Schubert, Ludwig; Olah, Chris - Curve Circuits - 2021-01-30 - https://distill.pub/2020/circuits/curve-circuits/

Cave, Stephen; Whittlestone, Jess; Nyrup, Rune; Ó hÉigeartaigh, Seán; Calvo, Rafael - Using AI ethically to tackle covid-19 - 2021-03-16 - https://www.bmj.com/content/372/bmj.n364

Cebul, Matthew; Dafoe, Allan; Monteiro, Nuno - Coercion and the Credibility of Assurances - 2021-07-15 - https://drive.google.com/file/d/1q-vRP19IznfPldBcaO6NglnSkyL7wYaL/view

Chatila, Raja; Dignum, Virginia; Fisher, Michael; Giannotti, Fosca; Morik, Katharina; Russell, Stuart; Yeung, Karen - Trustworthy AI - 2021-02-06 - https://smile.amazon.com/gp/product/B08W3XZ1TJ/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1

Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas; Brockman, Greg; Ray, Alex; Puri, Raul; Krueger, Gretchen; Petrov, Michael; Khlaaf, Heidy; Sastry, Girish; Mishkin, Pamela; Chan, Brooke; Gray, Scott; Ryder, Nick; Pavlov, Mikhail; Power, Alethea; Kaiser, Lukasz; Bavarian, Mohammad; Winter, Clemens; Tillet, Philippe; Such, Felipe Petroski; Cummings, Dave; Plappert, Matthias; Chantzis, Fotios; Barnes, Elizabeth; Herbert-Voss, Ariel; Guss, William Hebgen; Nichol, Alex; Paino, Alex; Tezak, Nikolas; Tang, Jie; Babuschkin, Igor; Balaji, Suchir; Jain, Shantanu; Saunders, William; Hesse, Christopher; Carr, Andrew N.; Leike, Jan; Achiam, Josh; Misra, Vedant; Morikawa, Evan; Radford, Alec; Knight, Matthew; Brundage, Miles; Murati, Mira; Mayer, Katie; Welinder, Peter; McGrew, Bob; Amodei, Dario; McCandlish, Sam; Sutskever, Ilya; Zaremba, Wojciech - Evaluating Large Language Models Trained on Code - 2021-07-07 - https://arxiv.org/abs/2107.03374

Christiano, Paul - A naive alignment strategy and optimism about generalization - 2021-06-09 - https://www.alignmentforum.org/posts/QvtHSsZLFCAHmzes7/a-naive-alignment-strategy-and-optimism-about-generalization

Christiano, Paul - Another (outer) alignment failure story - 2021-04-07 - https://www.alignmentforum.org/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story

Christiano, Paul - Teaching ML to answer questions honestly instead of predicting human answers - 2021-05-28 - https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66

Cihon, Peter; Kleinaltenkamp, Moritz; Schuett, Jonas; Baum, Seth - AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries - 2021-06-02 - https://gcrinstitute.org/ai-certification-advancing-ethical-practice-by-reducing-information-asymmetries/

Cihon, Peter; Schuett, Jonas; Baum, Seth - Corporate Governance of Artificial Intelligence in the Public Interest - 2021-07-05 - https://www.mdpi.com/2078-2489/12/7/275

Clifton, Jesse - Collaborative game specification: arriving at common models in bargaining - 2021-03-06 - https://longtermrisk.org/collaborative-game-specification/

Clifton, Jesse - Weak identifiability and its consequences in strategic settings - 2021-02-15 - https://longtermrisk.org/weak-identifiability-and-its-consequences-in-strategic-settings/

Cohen, Michael; Hutter, Marcus; Nanda, Neel - Fully General Online Imitation Learning - 2021-02-17 - https://arxiv.org/abs/2102.08686

Cotra, Ajeya - The case for aligning narrowly superhuman models - 2021-05-03 - https://www.alignmentforum.org/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models#Isn_t_this_not_neglected_because_lots_of_people_want_useful_AI_

Crawford, Ali; Wulkan, Ido - Federal Prize Competitions - 2021-11-15 - https://cset.georgetown.edu/publication/federal-prize-competitions/

Critch, Andrew - What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) - 2021-03-31 - https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic

Dafoe, Allan; Hatz, Sophia; Zhang, Baobao - Coercion and Provocation - 2019-11-14 - https://ora.ox.ac.uk/objects/uuid:fc9c9bd4-1cd1-45c4-9e3e-4cd9826171e4

Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; Collins, Tantum; McKee, Kevin R.; Leibo, Joel Z.; Larson, Kate; Graepel, Thore - Open Problems in Cooperative AI - 2020-12-15 - https://arxiv.org/abs/2012.08630

Dafoe, Allan; Zwetsloot, Remco; Cebul, Matthew  - Reputations for Resolve and Higher-Order Beliefs in Crisis Bargaining - 2021-03-11 - https://journals.sagepub.com/doi/10.1177/0022002721995549

Daniels, Matthew; Murphy, Ben - National Power After AI - 2021-07-15 - https://cset.georgetown.edu/publication/national-power-after-ai/

Davidson, Tom - Could Advanced AI Drive Explosive Economic Growth? - 2021-06-25 - https://www.openphilanthropy.org/could-advanced-ai-drive-explosive-economic-growth

Davis, Zach - Feature Selection - 2021-10-31 - https://www.lesswrong.com/posts/dYspinGtiba5oDCcv/feature-selection

de Neufville, Robert; Baum, Seth - Collective Action on Artificial Intelligence: A Primer and Review - 2021-07-15 - https://gcrinstitute.org/collective-action-on-artificial-intelligence-a-primer-and-review/

Ding, Jeffrey - China's Growing Influence over the Rules of the Digital Road - 2021-04-15 - https://sci-hubtw.hkvisa.net/10.1353/asp.2021.0015

Ding, Jeffrey; Dafoe, Allan - Engines of Power: Electricity, AI, and General-Purpose Military Transformations - 2021-06-08 - https://arxiv.org/abs/2106.04338

Drexler, Eric - QNRs: Toward Language for Intelligent Machines - 2021-08-27 - https://www.fhi.ox.ac.uk/qnrs/

Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William - Truthful AI: Developing and governing AI that does not lie - 2021-10-13 - https://arxiv.org/abs/2110.06674

Everitt, Tom; Carey, Ryan; Langlois, Eric; Ortega, Pedro A; Legg, Shane - Agent Incentives: A Causal Perspective - 2021-02-02 - https://arxiv.org/abs/2102.01685

Fedasiuk, Ryan; Melot, Jennifer; Murphy, Ben - Harnessed Lightning - 2021-10-15 - https://cset.georgetown.edu/publication/harnessed-lightning/

Fernandez, Ronny - How energy efficient are human-engineered flight designs relative to natural ones? - 2020-12-10 - https://aiimpacts.org/are-human-engineered-flight-designs-better-or-worse-than-natural-ones/

Filan, Daniel - AXRP - the AI X-risk Research Podcast - 2020-12-23 - https://axrp.net/

Filan, Daniel; Casper, Stephen; Hod, Shlomi; Wild, Cody; Critch, Andrew; Russell, Stuart - Clusterability in Neural Networks - 2021-03-04 - https://arxiv.org/abs/2103.03386

Finnveden, Lukas - Extrapolating GPT-N performance - 2020-12-18 - https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance#comments

Fischer, Sophie-Charlotte; Leung, Jade; Anderljung, Markus; O’Keefe, Cullen; Torges, Stefan; Khan, Saif M.; Garfinkel, Ben; Dafoe, Allan - AI Policy Levers: A Review of the U.S. Government’s Tools to Shape AI Research, Development, and Deployment - 2021-03-15 - https://www.governance.ai/research-paper/ai-policy-levers-a-review-of-the-u-s-governments-tools-to-shape-ai-research-development-and-deployment

Fitzgerald, McKenna; Boddy, Aaron; Baum, Seth - 2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy - 2020-12-31 - https://gcrinstitute.org/2020-survey-of-artificial-general-intelligence-projects-for-ethics-risk-and-policy/

GAA - Nuclear Espionage and AI Governance - 2021-10-04 - https://forum.effectivealtruism.org/posts/CKfHDw5Lmoo6jahZD/nuclear-espionage-and-ai-governance-1

Gabriel, Iason - Towards a Theory of Justice for Artificial Intelligence - 2021-10-27 - https://arxiv.org/abs/2110.14419

Galaz, Victor; Centeno, Miguel; Callahan, Peter; Causevic, Amar; Patterson, Thayer; Brass, Irina; Baum, Seth; Farber, Darry; Fischer, Joern; Garcia, David; McPhearson, Timon; Jimenex, Daniel; King, Brian; Larcey, Paul; Levy, Karen - Artificial Intelligence, Systemic Risks, and Sustainability - 2021-10-07 - https://www.sciencedirect.com/science/article/pii/S0160791X21002165?via%3Dihub

Garfinkel, Ben - A Tour of Emerging Cryptographic Technologies - 2021-05-15 - https://www.governance.ai/research-paper/a-tour-of-emerging-cryptographic-technologies

Garrabrant, Scott - Temporal Inference with Finite Factored Sets - 2021-10-23 - https://arxiv.org/abs/2109.11513

Gates, Vael; Callaway, Frederick; Ho, Mark; Griffiths, Thomas - A rational model of people’s inferences about others’ preferences based on response times - 2021-03-15 - https://psyarxiv.com/25zfx/

Grace, Katja - Argument for AI x-risk from large impacts - 2021-09-29 - https://aiimpacts.org/argument-from-large-impacts/

Grace, Katja - Beyond fire alarms: freeing the groupstruck - 2021-09-26 - https://aiimpacts.org/beyond-fire-alarms-freeing-the-groupstruck/

Grace, Katja - Coherence arguments imply a force for goal-directed behavior - 2021-03-25 - https://aiimpacts.org/coherence-arguments-imply-a-force-for-goal-directed-behavior/

Greaves, Hilary; MacAskill, William - The case for strong longtermism - 2021-06-15 - https://globalprioritiesinstitute.org/hilary-greaves-william-macaskill-the-case-for-strong-longtermism-2/

Guterres, António - Our Common Agenda - 2021-09-10 - https://www.un.org/en/un75/common-agenda

Hammond, Lewis; Fox, James; Everitt, Tom; Abate, Alessandro; Wooldridge, Michael - Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice - 2021-02-09 - https://arxiv.org/abs/2102.05008

Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob - Unsolved Problems in ML Safety - 2021-09-28 - https://arxiv.org/abs/2109.13916

Hendrycks, Dan; Mazeika, Mantas; Zou, Andy; Patel, Sahil; Zhu, Christine; Navarro, Jesus; Song, Dawn; Li, Bo; Steinhardt, Jacob - What Would Jiminy Cricket Do? Towards Agents That Behave Morally - 2021-10-25 - https://arxiv.org/abs/2110.13136

Hod, Shlomi; Casper, Stephen; Filan, Daniel; Wild, Cody; Critch, Andrew; Russell, Stuart - Detecting Modularity in Deep Neural Networks - 2021-10-13 - https://arxiv.org/abs/2110.08058

Hua, Shin-Shin; Belfield, Haydn - AI & Antitrust: Reconciling Tensions Between Competition Law and Cooperative AI Development - 2021-11-15 - https://yjolt.org/ai-antitrust-reconciling-tensions-between-competition-law-and-cooperative-ai-development

Imbrie, Andrew; Gelles, Rebecca; Dunham, James; Aiken, Catherine - Evaluating Rhetorical Dynamics in AI - 2021-05-15 - https://cset.georgetown.edu/publication/contending-frames/

Jiang, Liwei; Hwang, Jena D.; Bhagavatula, Chandra; Bras, Ronan Le; Forbes, Maxwell; Borchardt, Jon; Liang, Jenny; Etzioni, Oren; Sap, Maarten; Choi, Yejin - Delphi: Towards Machine Ethics and Norms - 2021-10-14 - https://arxiv.org/abs/2110.07574

Karnofsky, Holden - All Possible Views About Humanity's Future Are Wild - 2021-07-13 - https://forum.effectivealtruism.org/s/isENJuPdB3fhjWYHd/p/TwQzyP3QgttmuTHym

Klinova, Katya; Korinek, Anton - AI and Shared Prosperity - 2021-05-18 - https://arxiv.org/abs/2105.08475

Koch, Jack; Langosco, Lauro; Pfau, Jacob; Le, James; Sharkey, Lee - Objective Robustness in Deep Reinforcement Learning - 2021-05-28 - https://arxiv.org/abs/2105.14111

Kokotajlo, Daniel - Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain - 2021-01-18 - https://www.alignmentforum.org/posts/HhWhaSzQr6xmBki8F/birds-brains-planes-and-ai-against-appeals-to-the-complexity#comments

Korinek, Anton; Stiglitz, Joseph - Artificial Intelligence, Globalization, and Strategies for Economic Development - 2021-02-04 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3812820

Laidlaw, Cassidy; Russell, Stuart - Uncertain Decisions Facilitate Better Preference Learning - 2021-01-15 - https://proceedings.neurips.cc/paper/2021/hash/7f141cf8e7136ce8701dc6636c2a6fe4-Abstract.html

Lee, Kimin; Smith, Laura; Abbeel, Pieter - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training - 2021-06-09 - https://arxiv.org/abs/2106.05091

lifelonglearner; Hase, Peter - Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers - 2021-04-09 - https://www.alignmentforum.org/posts/GEPX7jgLMB8vR2qaK/opinions-on-interpretable-machine-learning-and-70-summaries

Lin, Stephanie; Hilton, Jacob; Evans, Owain - TruthfulQA: Measuring How Models Mimic Human Falsehoods - 2021-10-08 - https://arxiv.org/abs/2109.07958

Lindner, David; Shah, Rohin; Abbeel, Pieter; Dragan, Anca - Learning What To Do by Simulating the Past - 2021-04-08 - https://arxiv.org/abs/2104.03946

Liu, Hin-Yan; Maas, Matthijs - 'Solving for X?' Towards a problem-finding framework to ground long-term governance strategies for artificial intelligence - 2021-02 - https://www.researchgate.net/publication/342774816_%27Solving_for_X%27_Towards_a_problem-finding_framework_to_ground_long-term_governance_strategies_for_artificial_intelligence

Maas, Matthijs - AI, Governance Displacement, and the (De)Fragmentation of International Law - 2021-03-22 - https://www.cser.ac.uk/resources/ai-governance-displacement-and-defragmentation-international-law/

Maas, Matthijs - Aligning AI Regulation to Sociotechnical Change - 2021-06-23 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3871635

Maas, Matthijs; Stix, Charlotte - Bridging the gap: the case for an ‘Incompletely Theorized Agreement’ on AI policy - 2021-01-18 - https://www.cser.ac.uk/resources/bridging-gap-case-incompletely-theorized-agreement-ai-policy/

Maclure, Jocelyn; Russell, Stuart - AI for Humanity: The Global Challenges - 2021-02-06 - https://smile.amazon.com/gp/product/B08W3XZ1TJ

Manheim, David; Sandberg, Anders - What is the Upper Limit of Value? - 2021-01-27 - https://philarchive.org/rec/MANWIT-6

Mittelsteadt, Matthew - Mechanisms to Ensure AI Arms Control Compliance - 2021-02-15 - https://cset.georgetown.edu/publication/ai-verification/

Mogensen, Andreas - Do not go gentle: why the Asymmetry does not support anti-natalism - 2021-05-15 - https://globalprioritiesinstitute.org/do-not-go-gentle-why-the-asymmetry-does-not-support-anti-natalism-andreas-mogensen-global-priorities-institute-oxford-university/

Murphy, Ben - Translation: Ethical Norms for New Generation Artificial Intelligence Released - 2021-10-21 - https://cset.georgetown.edu/publication/ethical-norms-for-new-generation-artificial-intelligence-released/

Murphy, Ben - Translation: White Paper on Trustworthy Artificial Intelligence - 2021-09-14 - https://cset.georgetown.edu/publication/white-paper-on-trustworthy-artificial-intelligence/

Niconiconi - Whole Brain Emulation: No Progress on C. elgans After 10 Years - 2021-10-01 - https://www.lesswrong.com/posts/mHqQxwKuzZS69CXX5/whole-brain-emulation-no-progress-on-c-elgans-after-10-years

Oesterheld, Caspar; Conitzer, Vincent - Safe Pareto Improvements for Delegated Game Playing - 2021-05-03 - https://users.cs.duke.edu/~conitzer/safeAAMAS21.pdf

Ord, Toby; Mercer, Angus; Dannreuther, Sophie; Nelson, Cassidy; Lewis, Gregory; Millett, Piers; Whittlestone, Jess; Leung, Jade; Anderljung, Markus; Hilton, Sam; Belfield, Haydn - Future Proof: The Opportunity to Transform the UK's Resilience to Extreme Risks - 2021-06-15 - https://www.governance.ai/research-paper/futureproof-artificial-intelligence-chapter

Owe, Andrea; Baum, Seth - Moral Consideration of Nonhumans in the Ethics of Artificial Intelligence - 2021-06-07 - https://gcrinstitute.org/moral-consideration-of-nonhumans-in-the-ethics-of-artificial-intelligence/

Owe, Andrea; Baum, Seth - The Ethics of Sustainability for Artificial Intelligence - 2021-11-17 - https://gcrinstitute.org/the-ethics-of-sustainability-for-artificial-intelligence/

Prunkl, Carina; Ashurst, Carolyn; Anderljung, Markus; Webb, Helena; Leike, Jan; Dafoe, Allan - Institutionalizing ethics in AI through broader impact requirements - 2021-02-17 - http://www.cs.jhu.edu/~misha/DIReadingSeminar/Papers/Prunkl21.pdf

Roman, Charlotte; Dennis, Michael; Critch, Andrew; Russell, Stuart - Accumulating Risk Capital Through Investing in Cooperation - 2021-01-25 - https://arxiv.org/abs/2101.10305

Rudner, Tim; Toner, Helen - Key Concepts in AI Safety: An Overview - 2021-03-15 - https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-an-overview/

Rudner, Tim; Toner, Helen - Key Concepts in AI Safety: Interpretability in Machine Learning - 2021-03-15 - https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-interpretability-in-machine-learning/

Rudner, Tim; Toner, Helen - Key Concepts in AI Safety: Robustness and Adversarial Examples - 2021-03-15 - https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-robustness-and-adversarial-examples/

Shah, Rohin; Wild, Cody; Wang, Steven H.; Alex, Neel; Houghton, Brandon; Guss, William; Mohanty, Sharada; Kanervisto, Anssi; Milani, Stephanie; Topin, Nicholay; Abbeel, Pieter; Russell, Stuart; Dragan, Anca - The MineRL BASALT Competition on Learning from Human Feedback - 2021-07-05 - https://arxiv.org/abs/2107.01969

Shlegeris, Buck - Redwood Research’s current project - 2021-09-21 - https://www.alignmentforum.org/posts/k7oxdbNaGATZbtEg3/redwood-research-s-current-project

Shlegeris, Buck - The alignment problem in different capability regimes - 2021-09-21 - https://www.alignmentforum.org/posts/HHunb8FPnhWaDAQci/the-alignment-problem-in-different-capability-regimes

Soares, Nate - Visible Thoughts Project and Bounty Announcement - 2021-11-29 - https://www.alignmentforum.org/posts/zRn6cLtxyNodudzhw/visible-thoughts-project-and-bounty-announcement

Stastny, Julian; Treutlein, Johannes; Riché, Maxime; Clifton, Jesse - Multi-agent learning in mixed-motive coordination problems - 2021-03-15 - https://longtermrisk.org/files/stastny_et_al_implicit_bargaining.pdf

Stooke, Adam; Mahajan, Anuj; Barros, Catarina; Deck, Charlie; Bauer, Jakob; Sygnowski, Jakub; Trebacz, Maja; Jaderberg, Max; Mathieu, Michael; McAleese, Nat; Bradley-Schmieg, Nathalie; Wong, Nathaniel; Porcel, Nicolas; Raileanu, Roberta; Hughes-Fitt, Steph; Dalibard, Valentin; Czarnecki, Wojciech Marian - Open-Ended Learning Leads to Generally Capable Agents - 2021-07-27 - https://deepmind.com/research/publications/2021/open-ended-learning-leads-to-generally-capable-agents

Thomas, Teruji - Simulation Expectation - 2021-09-15 - https://globalprioritiesinstitute.org/simulation-expectation-teruji-thomas-global-priorities-institute-university-of-oxford/

Thorstad, David - The scope of longtermism - 2021-06-15 - https://globalprioritiesinstitute.org/the-scope-of-longtermism-david-thorstad-global-priorities-institute-university-of-oxford/

Trammell, Philip; Korinek, Anton - Economic Growth Under Transformative AI: A Guide to the Vast Range of Possibilities for Output Growth, Wages, and the Laborshare - 2020-02-04 - https://www.governance.ai/research-paper/economic-growth-under-transformative-ai-a-guide-to-the-vast-range-of-possibilities-for-output-growth-wages-and-the-laborshare

Turner, Alex - Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability - 2021-11-17 - https://www.lesswrong.com/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via

Welbl, Johannes; Glaese, Amelia; Uesato, Jonathan; Dathathri, Sumanth; Mellor, John; Hendricks, Lisa Anne; Anderson, Kirsty; Kohli, Pushmeet; Coppin, Ben; Huang, Po-Sen - Challenges in Detoxifying Language Models - 2021-09-15 - https://arxiv.org/abs/2109.07445

Wentworth, John - How To Get Into Independent Research On Alignment/Agency - 2021-11-18 - https://www.lesswrong.com/posts/P3Yt66Wh5g7SbkKuT/how-to-get-into-independent-research-on-alignment-agency#Meta

Wentworth, John - Utility Maximization = Description Length Minimization - 2021-02-18 - https://www.alignmentforum.org/posts/voLHQgNncnjjgAPH7/utility-maximization-description-length-minimization

Whittlestone, Jess; Clark, Jack - Why and How Governments Should Monitor AI Development - 2021-08-31 - https://www.cser.ac.uk/resources/why-and-how-governments-should-monitor-ai-development/

Woods, Kat - The Nonlinear Library - 2021-10-19 - https://forum.effectivealtruism.org/posts/JTZTBienqWEAjGDRv/listen-to-more-ea-content-with-the-nonlinear-library

Yudkowsky, Eliezer - Discussion with Eliezer Yudkowsky on AGI interventions - 2021-11-10 - https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions

Yudkowsky, Eliezer - Yudkowsky and Christiano discuss "Takeoff Speeds" - 2021-11-22 - https://forum.effectivealtruism.org/posts/rho5vtxSaEdXxLu3o/yudkowsky-and-christiano-discuss-takeoff-speeds

Zaidi, Waqar; Dafoe, Allan - International Control of Powerful Technology: Lessons from the Baruch Plan for Nuclear Weapons - 2021-03-15 - https://www.governance.ai/research-paper/international-control-of-powerful-technology-lessons-from-the-baruch-plan-for-nuclear-weapons

Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan - Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers - 2021-08-15 - https://jair.org/index.php/jair/article/view/12895/26701

Zhang, Tianjun; Rashidinejad, Paria; Jiao, Jiantao; Tian, Yuandong; Gonzalez, Joseph E.; Russell, Stuart - MADE: Exploration via Maximizing Deviation from Explored Regions - 2021-01-15 - https://proceedings.neurips.cc/paper/2021/hash/5011bf6d8a37692913fce3a15a51f070-Abstract.html

Zhuang, Simon; Hadfield-Menell, Dylan - Consequences of Misaligned AI - 2021-02-07 - https://arxiv.org/abs/2102.03896




Ought co-founder here. Seems worth clarifying how Elicit relates to alignment (cross-posted from EA forum):

1 - Elicit informs how to train powerful AI through decomposition

Roughly speaking, there are two ways of training AI systems:

  1. End-to-end training
  2. Decomposition of tasks into human-understandable subtasks

We think decomposition may be a safer way to train powerful AI if it can scale as well as end-to-end training.

Elicit is our bet on the compositional approach. We’re testing how feasible it is to decompose large tasks like “figure out the answer to this science question by reading the literature” by breaking them into subtasks like:

  • Brainstorm subquestions that inform the overall question
  • Find the most relevant papers for a (sub-)question
  • Answer a (sub-)question given an abstract for a paper
  • Summarize answers into a single answer

Over time, more of this decomposition will be done by AI assistants.
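The decomposition pipeline above can be sketched in a few lines. This is a toy illustration only, not Elicit's actual API: the function names are hypothetical, and the language-model calls are stubbed out with deterministic placeholders so the control flow is visible.

```python
# Toy sketch of the compositional approach: brainstorm subquestions, find
# relevant papers, answer each subquestion from an abstract, then summarize.
# Every model call is stubbed; in a real system each stub would invoke a
# language model (or, increasingly, an AI assistant doing the decomposition).

def brainstorm_subquestions(question):
    # Stub for "brainstorm subquestions that inform the overall question".
    return [f"subquestion 1 of: {question}", f"subquestion 2 of: {question}"]

def find_relevant_papers(subquestion, papers):
    # Stub for "find the most relevant papers": pretend the first is best.
    return papers[:1]

def answer_from_abstract(subquestion, abstract):
    # Stub for "answer a (sub-)question given an abstract".
    return f"answer to '{subquestion}' from '{abstract}'"

def summarize(answers):
    # Stub for "summarize answers into a single answer".
    return "; ".join(answers)

def answer(question, papers):
    subs = brainstorm_subquestions(question)
    partial = [answer_from_abstract(s, find_relevant_papers(s, papers)[0])
               for s in subs]
    return summarize(partial)
```

The safety-relevant property is that each stage is a human-understandable subtask whose inputs and outputs can be inspected, rather than one opaque end-to-end mapping from question to answer.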

At each point in time, we want to push the compositional approach to the limits of current language models, and keep up with (or exceed) what’s possible through end-to-end training. This requires that we overcome engineering barriers in gathering human feedback and orchestrating calls to models in a way that doesn’t depend much on current architectures.

I view this as the natural continuation of our past work where we studied decomposition using human participants. Unlike then, it’s now possible to do this work using language models, and the more applied setting has helped us a lot in reducing the gap between research assumptions and deployment.

2 - Elicit makes AI differentially useful for AI & tech policy, and other high-impact applications

In a world where AI capabilities scale rapidly, I think it’s important that these capabilities can support research aimed at guiding AI development and policy, and more generally help us figure out what’s true and make good plans as much as they help persuade and optimize goals with fast feedback or easy specification.

Ajeya mentions this point in The case for aligning narrowly superhuman models:

"Better AI situation in the run-up to superintelligence: If at each stage of ML capabilities progress we have made sure to realize models’ full potential to be helpful to us in fuzzy domains, we will be going into the next stage with maximally-capable assistants to help us navigate a potentially increasingly crazy world. We’ll be more likely to get trustworthy forecasts, policy advice, research assistance, and so on from our AI assistants. Medium-term AI challenges like supercharged fake news / clickbait or AI embezzlement seem like they would be less severe. People who are pursuing more easily-measurable goals like clicks or money seem like they would have less of an advantage over people pursuing hard-to-measure goals like scientific research (including AI alignment research itself). All this seems like it would make the world safer on the eve of transformative AI or AGI, and give humans more powerful and reliable tools for dealing with the TAI / AGI transition."

Beth mentions the more general point in Risks from AI persuasion under possible interventions: 

“Instead, try to advance applications of AI that help people understand the world, and advance the development of truthful and genuinely trustworthy AI. For example, support API customers like Ought who are working on products with these goals, and support projects inside OpenAI to improve model truthfulness.”

Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence is likely to be meaningful. This kind of comparison seems preferable to general arguments like that the Bay Area is expensive + has bad epistemics. 

(In terms of general arguments, I'd also mention that the Bay Area has the best track record in the world by a huge margin for producing technology companies and is among the top 5 regions in the world for cutting-edge scientific research.) 

ETA: I tried to clarify my thoughts in the reply to Larks.  

Is your argument about personnel overlap that one could do some sort of mixed effect regression, with location as the primary independent variable and controls for individual productivity? If so, I'm somewhat skeptical about the tractability: the sample size is not that big, the data seems messy, and I'm not sure it would necessarily capture the fundamental thing we care about. I'd be interested in the results if you wanted to give it a go though!
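For concreteness, here is a toy version of the kind of regression being discussed, run on synthetic data (the variable names and effect sizes are invented for illustration). A full mixed-effects model would add per-person random effects, which is exactly where the small, messy real sample makes tractability doubtful; this sketch just fits ordinary least squares with a location indicator and one observed individual control.

```python
import numpy as np

# Synthetic data: 40 researchers, a Bay Area indicator, an observed
# individual-ability control, and a noisy output measure with a true
# location effect of 0.5 (all numbers invented for illustration).
rng = np.random.default_rng(0)
n = 40
bay = rng.integers(0, 2, n).astype(float)   # 1 = Bay Area, 0 = elsewhere
ability = rng.normal(0.0, 1.0, n)           # individual productivity control
output = 1.0 + 0.5 * bay + 2.0 * ability + rng.normal(0.0, 0.5, n)

# OLS via least squares: output ~ intercept + bay + ability.
X = np.column_stack([np.ones(n), bay, ability])
coef, *_ = np.linalg.lstsq(X, output, rcond=None)
intercept, bay_effect, ability_effect = coef
```

With only dozens of researchers and a noisy outcome measure, the confidence interval on `bay_effect` would plausibly be wide enough to include zero, which is the tractability worry in a nutshell.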

More importantly, I'm not sure this analysis would be that useful. Geography-based priors only really seem useful for factors we can't directly observe; for an organization like CHAI our direct observations will almost entirely screen off this prior. The prior is only really important for factors where direct measurement is difficult, and hence we can't update away from the prior, but for those we can't do the regression. (Though I guess we could do the regression on known firms/researchers and extrapolate to new unknown orgs/individuals).

The way this plays out here is that we've already spent the vast majority of the article examining the research productivity of the organizations; geography-based priors only matter insofar as you think they can proxy for something else that is not captured in this.

As befits this being a somewhat secondary factor, it's worth noting that I think (though I haven't explicitly checked) in the past I have supported bay area organisations more than non-bay-area ones.   

I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations leads you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ from a new organization based in the Bay vs somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic). 

I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?

We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can't quite interpret individual neurons, but we've found some examples of where we can interpret what an individual attention head is doing.

I would be happy to see you write a top-level post about this paper. :)

Thanks! I'm probably not going to have time to write a top-level post myself, but I liked Evan Hubinger's post about it.

(Moderation note: added to the Alignment Forum from LessWrong.)

This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand. 

Thanks, that's very kind of you!

I noticed that you didn't mention infra-Bayesianism, not in 2020 and not this year. Any particular reason?

  • I prioritized posts by named organizations.
    • Diffractor does not list any institutional affiliations on his user page.
    • No institution I noticed listed the post/sequence on their 'research' page.
    • No institution I contacted mentioned the post/sequence.
  • No post in the sequence was that high in the list of 2021 Alignment Forum posts, sorted by karma.
  • Several other filtering methods also did not identify the post.

However, upon reflection it does seem to be MIRI-affiliated, so perhaps it should have been included; if I have time I may review and edit it in later.

Notice that in MIRI's summary of 2020 they wrote "From our perspective, our most interesting public work this year is Scott Garrabrant’s Cartesian frames model and Vanessa Kosoy’s work on infra-Bayesianism."

Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)

The link should include "that does not lie". 
length --> lengthy

 

Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. Of course, these models are trying to copy human responses, not be honest, so because many of the questions allude to common misconceptions, the more advanced models 'lie' more often. Interestingly they also used GPT-3 to evaluate the truth of these answers. See also the discussion here. Researchers from OpenPhil were also named authors on the paper. #Other

"OpenPhil" --> OpenAI
As a minor clarification, all the results in the paper are based on human evaluation of truth. But we show that GPT-3 can be used as a fairly reliable substitute for human evaluation under certain conditions. 

Thanks, fixed in both copies.