Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

More realistic tales of doom (Paul Christiano): This Vox article does a nice job of explaining the first part of this post, though I disagree with its characterization of the second part.

The typical example of AI catastrophe has a powerful and adversarial AI system surprising us with a treacherous turn allowing it to quickly take over the world (think of the paperclip maximizer). This post uses a premise of continuous AI development and broad AI deployment and depicts two other stories of AI catastrophe that Paul finds more realistic.

The first story is rooted in the fact that AI systems have a huge comparative advantage at optimizing for easily measured goals. We already see problems with humans optimizing for easily measured goals (scientific malpractice, outrage-inducing social media, etc.), and with AI these problems will be severely exacerbated. So far, we have been able to use human reasoning to ameliorate these problems, by changing incentives, enacting laws, or using common sense to interpret goals correctly. We will initially be able to use human reasoning to create good proxies, but over time, as AI systems become more capable, our ability to do this will lag further and further behind. We end up "going out with a whimper": ultimately our values are no longer shaping society's trajectory.

The second story starts out like the first story, but adds in a new complication: the AI system could develop internal goals of its own. AI performs a huge search over policies for ones that score well on the training objective. Unfortunately, a policy that optimizes for the goal of "having influence" will initially score well on most training objectives: when you don't already have influence, a good strategy for gaining influence is to do what your overseers want you to do. (Here "influence" doesn't mean just social influence; control over nukes also counts as influence.) At some point the system will be powerful enough that gaining influence no longer means doing what the overseers want. We will probably know about this dynamic through some catastrophic AI failures (e.g. an AI-run corporation stealing the money it manages), but may not be able to do anything about it because we would be extremely reliant on AI systems. Eventually, during some period of heightened vulnerability, one AI system may do something catastrophic, leading to a distribution shift which triggers a cascade of other AI systems (and human systems) failing, leading to an unrecoverable catastrophe (think something in the class of a hostile robot takeover). Note that "failure" here means an AI system "intentionally" doing something that we don't want, as opposed to the AI system not knowing what to do because it is not robust to distributional shift.

Rohin's opinion: Note that Paul thinks these scenarios are more realistic because he expects that many of the other problems (e.g. wireheading, giving AI systems an objective such that they don't kill humans) will be solved by default. I somewhat expect even the first story to be solved by default -- it seems to rest on a premise of human reasoning staying as powerful as it is right now, but it seems plausible that as AI systems grow in capability we will be able to leverage them to improve human reasoning (think of how paper or the Internet amplified human reasoning). The second story seems much more difficult -- I don't see any clear way that we can avoid influence-seeking behavior. It is currently my most likely scenario for an AI catastrophe that was a result of a failure of technical AI safety (or more specifically, intent alignment (AN #33)).

Read more: AI disaster won’t look like the Terminator. It’ll be creepier.

80K podcast: How can policy keep up with AI advances? (Rob Wiblin, Jack Clark, Miles Brundage and Amanda Askell): OpenAI policy researchers Jack Clark, Amanda Askell and Miles Brundage cover a large variety of topics relevant to AI policy, giving an outside-view perspective on the field as a whole. A year or two ago, the consensus was that the field required disentanglement research; now, while disentanglement research is still needed, there are more clearly defined important questions that can be tackled independently. People are now also taking action in addition to doing research, mainly by accurately conveying relevant concepts to policymakers. A common thread across policy is the framing of the problem as a large coordination problem, for which an important ingredient of the solution is to build trust between actors.

Another thread was the high uncertainty over specific details of scenarios in the future, but the emergence of some structural properties that allow us to make progress anyway. This implies that the goal of AI policy should be aiming for robustness rather than optimality. Some examples:

  • The malicious use of AI report was broad and high level because each individual example is different and the correct solution depends on the details; a general rule will not work. In fact, Miles thinks that they probably overemphasized how much they could learn from other fields in that report, since the different context means that you quickly hit diminishing returns on what you can learn.
  • None of them were willing to predict specific capabilities over more than a 3-year period, especially due to the steep growth rate of compute, which means that things will change rapidly. Nonetheless, there are structural properties that we can be confident will be important: for example, a trained AI system will be easy to scale via copying (which you can't do with humans).
  • OpenAI's strategy is to unify the fields of capabilities, safety and policy, since ultimately these are all facets of the overarching goal of developing beneficial AI. They aim to either be the main actor developing beneficial AGI, or to help the main actor, in order to be robust to many different scenarios.
  • Due to uncertainty, OpenAI tries to have policy institutions that make sense over many different time horizons. They are building towards a world with formal processes for coordinating between different AI labs, but use informal relationships and networking for now.

AI policy is often considered a field where it is easy to cause harm. They identify two (of many) ways this could happen: first, you could cause other actors to start racing (which you may not even realize, if it manifests as a substantial increase in some classified budget), and second, you could build coordination mechanisms that aren't the ones people want and that work fine for small problems but break once they are put under a lot of stress. Another common one people think about is information hazards. While they consider info hazards all the time, they also think that (within the AI safety community) these worries are overblown. Typically people overestimate how important or controversial their opinion is. Another common reason for not publishing is not being sure whether the work meets high intellectual standards, but in this case the conversation will be dominated by people with lower standards.

Miscellaneous other stuff:

  • Many aspects of races can make them much more collaborative, and it is not clear that AI corresponds to an adversarial race. In particular, large shared benefits make races much more collaborative.
  • Another common framing is to treat the military as an adversary, and try to prevent them from gaining access to AI. Jack thinks this is mistaken, since then the military will probably end up developing AI systems anyway, and you wouldn't have been able to help them make it safe.
  • There's also a lot of content at the end about career trajectories and working at OpenAI or the US government, which I won't get into here.

Rohin's opinion: It does seem like building trust between actors is a pretty key part of AI policy. That said, there are two kinds of trust that you can have: first, trust that the statements made by other actors are true, and second, trust that other actors are aligned enough with you in their goals that their success is also your success. The former can be improved by mechanisms like monitoring, software verification, etc., while the latter cannot. The former is often maintained using processes that impose a lot of overhead, while the latter usually does not require much overhead once established. The former can scale to large groups comprising thousands or millions of people, while the latter is much harder to scale. I think it's an open question in AI policy to what extent we need each of these kinds of trust to exist between actors. This podcast seems to focus particularly on the latter kind.

Other miscellaneous thoughts:

  • I think a lot of these views are conditioned on a gradual view of AI development, where there isn't a discontinuous jump in capabilities, and there are many different actors all deploying powerful AI systems.
  • Conditional on the military eventually developing AI systems, it seems worth it to work with them to make their AI systems safer. However, it's not inconceivable that AI researchers could globally coordinate to prevent military AI applications. This wouldn't prevent it from happening eventually, but could drastically slow it down, and let defense scale faster than offense. In that case, working with the military can also be seen as a defection in a giant coordination game with other AI researchers.
  • One of my favorite lines: "I would recommend everyone who has calibrated intuitions about AI timelines spend some time doing stuff with real robots and it will probably … how should I put this? … further calibrate your intuitions in quite a humbling way." (Not that I've worked with real robots, but many of my peers have.)

Technical AI alignment


More realistic tales of doom (Paul Christiano): Summarized in the highlights!

The Main Sources of AI Risk? (Wei Dai): This post lists different causes or sources of existential risk from advanced AI.

Technical agendas and prioritization

Unsolved research problems vs. real-world threat models (Catherine Olsson): Papers on adversarial examples often cite potential real-world problems as their motivation. As we've seen previously (AN #19, AN #24), many adversarial example settings are not very realistic threat models for any real-world problem. For example, adversarial "stickers" that cause vision models to fail to recognize stop signs could cause an autonomous vehicle to crash... but an adversary could also just knock over the stop sign if that was their goal.

There are more compelling reasons that we might care about imperceptible perturbation adversarial examples. First, they are a proof of concept, demonstrating that our ML models are not robust and make "obvious" mistakes and so cannot be relied on. Second, they form an unsolved research problem, in which progress can be made more easily than in real settings, because it can be formalized straightforwardly (unlike realistic settings). As progress is made in this toy domain, it can be used to inform new paradigms that are closer to realistic settings. But it is not meant to mimic real world settings -- in the real world, you need a threat model of what problems can arise from the outside world, which will likely suggest much more basic concerns than the "research problems", requiring solutions involving sweeping design changes rather than small fixes.
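To make the "can be formalized straightforwardly" point concrete, here is a minimal sketch of the imperceptible-perturbation setting: find a perturbation whose largest coordinate change is at most epsilon and which flips the model's prediction. The linear classifier, its weights, the input and the value of epsilon are all hypothetical choices made for illustration; nothing here comes from the post itself.

```python
# Toy illustration only: a hand-written linear classifier with made-up weights,
# not any model discussed in the post.

def score(weights, bias, x):
    """Linear score w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Classify as +1 or -1 from the sign of the score."""
    return 1 if score(weights, bias, x) > 0 else -1


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def linf_attack(weights, bias, x, epsilon):
    """Move each coordinate by at most epsilon toward the opposite class.

    For a linear model the gradient of the score with respect to x is just
    the weight vector, so stepping along sign(w) is the steepest change
    available under a max-norm budget of epsilon.
    """
    direction = -predict(weights, bias, x)
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


weights = [1.0, -2.0, 0.5, 1.5]  # hypothetical model parameters
bias = 0.0
x = [0.2, 0.1, 0.3, 0.1]         # hypothetical clean input
epsilon = 0.1                    # hypothetical perturbation budget

x_adv = linf_attack(weights, bias, x, epsilon)
print("clean prediction:", predict(weights, bias, x))
print("perturbed prediction:", predict(weights, bias, x_adv))
```

The whole threat model fits in one constraint (no coordinate moves by more than epsilon), which is what makes it a convenient research problem and also what separates it from a real-world adversary, who is not bound by any such budget.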

Rohin's opinion: I strongly agree with the points made in this post. I don't know to what extent researchers themselves agree with this point -- it seems like there is a lot of adversarial examples research that is looking at the imperceptible perturbation case and many papers that talk about new types of adversarial examples, without really explaining why they are doing this or giving a motivation that is about unsolved research problems rather than real world settings. It's possible that researchers do think of it as a research problem and not a real world problem, but present their papers differently because they think that's necessary in order to be accepted.

The distinction between research problems and real world threat models seems to parallel the distinction between theoretical or conceptual research and engineering in AI safety. The former typically asks questions of the form "how could we do this in principle, making simplifying assumptions X, Y and Z", even though X, Y and Z are known not to hold in the real world, for the sake of having greater conceptual clarity that can later be leveraged as a solution to a real world problem. Engineering work on the other hand is typically trying to scale an approach to a more complex environment (with the eventual goal of getting to a real world problem).

Learning human intent

Literal or Pedagogic Human? Analyzing Human Model Misspecification in Objective Learning (Smitha Milli et al): In Cooperative Inverse Reinforcement Learning, we assume a two-player game with a human and a robot where the robot doesn't know the reward R, but both players are trying to maximize the reward. Since one of the players is a human, we cannot simply compute the optimal strategy and deploy it -- we are always making some assumption about the human, that may be misspecified. A common assumption is that the human is playing optimally for the single-player version of the game, also known as a literal human. The robot then takes the best response actions given that assumption. Another assumption is to have a pedagogic human, who acts as though the robot is interpreting her literally. The robot that takes the best response actions with this assumption is called a pedagogic or pragmatic robot.

However, any assumption we make about the human is going to be misspecified. This paper looks at how we can be robust to misspecification, in particular if the human could be literal or pedagogic. The main result is that the literal robot is more robust to misspecification. The way I think about this is that the literal robot is designed to work with a literal human, and a pedagogic human is "designed" to work with the literal robot, so unsurprisingly the literal robot works well with both of them. On the other hand, the pedagogic robot is designed to work with the pedagogic human, but has no relationship with the literal human, and so should not be expected to work well. It turns out we can turn this argument into a very simple proof: (literal robot, pedagogic human) outperforms (literal robot, literal human) since the pedagogic human is designed to work well with the literal robot, and (literal robot, literal human) outperforms (pedagogic robot, literal human) since the literal robot is designed to work with the literal human.
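Written out as a chain of inequalities, the argument might look like the following. The notation is an assumption introduced here, not the paper's: V(R, H) stands for the expected reward when robot R is paired with human H, and "outperforms" is read as a weak inequality.

```latex
% Notation assumed for illustration: V(R, H) is the expected reward of the
% (robot, human) pair; "lit" = literal, "ped" = pedagogic.
\begin{align*}
V(R_{\text{lit}}, H_{\text{ped}}) &\ge V(R_{\text{lit}}, H_{\text{lit}})
  && \text{(the pedagogic human is designed to work with the literal robot)} \\
V(R_{\text{lit}}, H_{\text{lit}}) &\ge V(R_{\text{ped}}, H_{\text{lit}})
  && \text{(the literal robot is designed to work with the literal human)} \\
\Rightarrow \quad V(R_{\text{lit}}, H_{\text{ped}}) &\ge V(R_{\text{ped}}, H_{\text{lit}})
\end{align*}
```

The last line compares the two misspecified pairings: a literal robot facing a pedagogic human does at least as well as a pedagogic robot facing a literal human.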

They then check that the theory holds in practice. They find that the literal robot is better than the pedagogic robot even when humans are trying to be pedagogic, a stronger result than the theory predicted. The authors hypothesize that even when trying to be pedagogic, humans are more accurately modeled as a mixture of literal and pedagogic humans, and the extra robustness of the literal robot means that it is the better choice.

Rohin's opinion: I found this theorem quite unintuitive when I first encountered it, despite it being two lines long, which is something of a testament to how annoying and tricky misspecification can be. One way I interpret the empirical result is that the wider the probability distributions of our assumptions, the more robust they are to misspecification. A literal robot assumes that the human can take any near-optimal trajectory, whereas a pedagogic robot assumes that the human takes very particular near-optimal trajectories that best communicate the reward. So, the literal robot places probability mass over a larger space of trajectories given a particular reward, and does not update as strongly on any particular observed trajectory compared to the pedagogic robot, making it more robust.
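The "does not update as strongly" point can be sketched with a toy Bayesian update. Everything below is hypothetical: the two candidate rewards, the three trajectories and both likelihood tables are made-up numbers, chosen only so that one human model spreads probability over several trajectories (in the spirit of the literal robot's assumption) while the other concentrates it on one (in the spirit of the pedagogic robot's assumption). Neither table is taken from the paper.

```python
# Hypothetical numbers for illustration; neither human model comes from the paper.

def posterior(prior, likelihood, observed):
    """Bayes' rule over candidate rewards after one observed trajectory."""
    unnormalized = {r: prior[r] * likelihood[r][observed] for r in prior}
    total = sum(unnormalized.values())
    return {r: p / total for r, p in unnormalized.items()}


prior = {"A": 0.5, "B": 0.5}

# "Wide" human model: several trajectories are plausible under each reward.
wide_model = {"A": [0.4, 0.4, 0.2], "B": [0.2, 0.4, 0.4]}

# "Peaked" human model: one trajectory is strongly preferred under each reward.
peaked_model = {"A": [0.9, 0.05, 0.05], "B": [0.05, 0.05, 0.9]}

observed = 0  # index of the trajectory the human was seen to take

wide_belief = posterior(prior, wide_model, observed)
peaked_belief = posterior(prior, peaked_model, observed)

print("wide model,   P(reward A | trajectory 0):", round(wide_belief["A"], 3))
print("peaked model, P(reward A | trajectory 0):", round(peaked_belief["A"], 3))
```

With these numbers the wide model moves from 0.5 to about 0.67 on reward A, while the peaked model jumps to about 0.95. If the human actually had reward B and simply happened to take trajectory 0, the robot using the peaked model would be far more confidently wrong.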


SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability (Maithra Raghu et al)


Call for Papers: ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning (summarized by Dan H): Topics of this workshop include out-of-distribution detection, calibration, robustness to corruptions, robustness to adversaries, etc. Submissions are due April 30th.

AI strategy and policy

80K podcast: How can policy keep up with AI advances? (Rob Wiblin, Jack Clark, Miles Brundage and Amanda Askell): Summarized in the highlights!

A Survey of the EU's AI Ecosystem (Charlotte Stix): This report analyzes the European AI ecosystem. The key advantage that Europe has is a strong focus on ethical AI, as opposed to the US and China, which are more focused on capabilities research. However, Europe does face a significant challenge in staying competitive with AI, as it lacks both startup/VC funding and talented researchers (who often leave for other countries). While there are initiatives meant to help with this problem, it is too early to tell whether they will have an impact. The report also recommends having large multinational projects, along the lines of CERN and the Human Brain Project. See also Import AI.

Other progress in AI

Reinforcement learning

Assessing Generalization in Deep Reinforcement Learning (blog post) (Charles Packer and Katelyn Guo): This is a blog post summarizing Assessing Generalization in Deep Reinforcement Learning (AN #31).

Meta learning

Online Meta-Learning (Chelsea Finn, Aravind Rajeswaran et al)

Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (Eleni Triantafillou et al)
