Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

AI Safety Needs Social Scientists (Geoffrey Irving et al) (summarized by Richard): One approach to AI safety is to "ask humans a large number of questions about what they want, train an ML model of their values, and optimize the AI system to do well according to the learned values". However, humans give answers that are limited, biased and often in disagreement with each other, and so AI safety needs social scientists to figure out how to improve this data - which eventually may be gathered from thousands or millions of people. Of particular importance is the ability to design rigorous experiments, drawing from an interdisciplinary understanding of human cognition and behaviour. The authors discuss Debate (AN #5) as a case study of a safety technique whose success depends on empirical questions such as: how skilled are humans as judges by default? Can we train people to be better judges? Are there ways to restrict debate to make it easier to judge?

There are a couple of key premises underlying this argument. The first is that, despite human biases, there are correct answers to questions about human values - perhaps defined as the answer we would endorse if given all relevant information and unlimited time to think. However, it’s not necessary for AIs to always find those answers, as long as they are able to recognise cases in which they’re uncertain and do nothing (while there are some cases in which inaction can cause harm, such as a self-driving car ceasing to steer mid-journey, it seems that the most worrying long-term catastrophes can be avoided by inaction). Another reason for optimism is that even incomplete or negative results from social science experiments may be useful in informing technical safety research going forward. However, in some cases the systems we're trying to reason about are very different from anything we can test now - for example, AI debaters that are much stronger than humans.

Richard's opinion: This post, and its accompanying paper, seems very sensible to me. While I have some doubts about how informative human debate data will be about superhuman debaters, it certainly seems worth trying to gain more empirical information. Note that while the paper primarily discusses Debate, I think that many of its arguments are applicable to any human-in-the-loop safety methods (and probably others too). Currently I think Ought is the safety group focusing most on collecting human data, but I look forward to seeing other researchers doing so.

Technical AI alignment

Technical agendas and prioritization

FLI Podcast: AI Breakthroughs and Challenges in 2018 with David Krueger and Roman Yampolskiy (Ariel Conn, David Krueger and Roman Yampolskiy): David and Roman review AI progress in 2018 and speculate about its implications. Roman identified a pattern where we see breakthroughs like AlphaZero (AN #36), AlphaStar (AN #43) and AlphaFold (AN #36) so frequently now that it no longer seems as impressive when a new one comes out. David on the other hand sounded less impressed by progress on Dota and StarCraft, since both AI systems were capable of executing actions that humans could never do (fast reaction times for Dota and high actions-per-minute for StarCraft). He also thought that these projects didn't result in any clear general algorithmic insights the way AlphaZero did.

On the deep RL + robotics side, David identified major progress in Dactyl (AN #18) and QT-Opt (which I remember reading and liking but apparently I failed to put in the newsletter). He also cited GANs as having improved significantly, and talked about feature-wise transformations in particular. Roman noted the improving performance of evolutionary algorithms.

David also noted how a lot of results were obtained by creating algorithms that could scale, and then using a huge amount of compute for them, quoting AI and Compute (AN #7), Interpreting AI Compute Trends (AN #15) and Reinterpreting AI and Compute (AN #38).

On the policy side, they talked about deep fakes and the general trend that AI may be progressing too fast for us to keep up with its security implications. They do find it promising that researchers are beginning to accept that their research does have safety and security implications.

On the safety side, David noted that the main advance seemed to be with approaches using superhuman feedback, including debate (AN #5), iterated amplification (discussed frequently in this newsletter, but that paper was in AN #30) and recursive reward modeling (AN #34). He also identified unrestricted adversarial examples (AN #24) as an area to watch in the future.

Rohin's opinion: I broadly agree with the areas of AI progress identified here, though I would probably also throw in NLP, e.g. BERT. I disagree on the details -- for example, I think that OpenAI Five (AN #13) was much better than I would have expected at the time and the same would have been true of AlphaStar if I hadn't already seen OpenAI Five, and the fact that they did a few things that humans can't do barely diminishes the achievement at all. (My take is pretty similar to Alex Irpan's take in his post on AlphaStar.)

Treacherous Turn, Simulations and Brain-Computer Interfaces (Michaël Trazzi)

Learning human intent

AI Alignment Podcast: Human Cognition and the Nature of Intelligence (Lucas Perry and Joshua Greene) (summarized by Richard): Joshua Greene's lab has two research directions. The first is how we combine concepts to form thoughts: a process which allows us to understand arbitrary novel scenarios (even ones we don't think ever occurred). He discusses some of his recent research, which uses brain imaging to infer what's happening when humans think about compound concepts. While Joshua considers the combinatorial nature of thought to be important, he argues that to build AGI, it's necessary to start with "grounded cognition" in which representations are derived from perception and physical action, rather than just learning to manipulate symbols (like language).

Joshua also works on the psychology and neuroscience of morality. He discusses his recent work in which participants are prompted to consider Rawls' Veil of Ignorance argument (that when making decisions affecting many people, we should do so as if we don't know which one we are) and then asked to evaluate moral dilemmas such as trolley problems. Joshua argues that the concept of impartiality is at the core of morality, and that it pushes people towards more utilitarian ideas (although he wants to rebrand utilitarianism as "deep pragmatism" to address its PR problems).

Imitation Learning from Imperfect Demonstration (Yueh-Hua Wu et al)

Learning User Preferences via Reinforcement Learning with Spatial Interface Valuing (Miguel Alonso Jr)


Regularizing Black-box Models for Improved Interpretability (Gregory Plumb et al)


Adversarial Examples Are a Natural Consequence of Test Error in Noise (Nicolas Ford, Justin Gilmer et al) (summarized by Dan H): While this was previously summarized in AN #32, this draft is much more readable.

Improving Robustness of Machine Translation with Synthetic Noise (Vaibhav, Sumeet Singh, Craig Stewart et al) (summarized by Dan H): By injecting noise (such as typos, word omission, slang) into the training set of a machine translation model, the authors are able to improve performance on naturally occurring data. While this trick usually does not work for computer vision models, it can work for NLP models.
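As a rough illustration of the idea (not the authors' code), here is a minimal sketch of synthetic noise injection for translation training pairs. The helper names `add_synthetic_noise` and `augment`, the two noise types shown (adjacent-character swaps as typos, and word omission), and the probabilities are all assumptions made for this example; the paper's actual noise model may differ.

```python
import random


def add_synthetic_noise(sentence, rng, swap_prob=0.1, drop_prob=0.1):
    """Return a noisy copy of `sentence` (hypothetical helper for illustration)."""
    noisy_words = []
    for word in sentence.split():
        # Word omission: skip the word entirely with probability drop_prob.
        if rng.random() < drop_prob:
            continue
        # Typo: swap two adjacent characters with probability swap_prob.
        if len(word) > 1 and rng.random() < swap_prob:
            i = rng.randrange(len(word) - 1)
            chars = list(word)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            word = "".join(chars)
        noisy_words.append(word)
    return " ".join(noisy_words)


def augment(pairs, rng, copies=1, **noise_kwargs):
    """Keep each clean (source, target) pair and add copies with a noisy source."""
    augmented = list(pairs)
    for source, target in pairs:
        for _ in range(copies):
            noisy_source = add_synthetic_noise(source, rng, **noise_kwargs)
            augmented.append((noisy_source, target))
    return augmented
```

Only the source side is perturbed here, so the model still sees clean targets; that is one possible design choice rather than something the summary specifies.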

Push the Student to Learn Right: Progressive Gradient Correcting by Meta-learner on Corrupted Labels (Jun Shu et al)

Miscellaneous (Alignment)

AI Safety Needs Social Scientists (Geoffrey Irving et al): Summarized in the highlights!

AI strategy and policy

Humans Who Are Not Concentrating Are Not General Intelligences (Sarah Constantin): This post argues that humans who skim the stories produced by GPT-2 (AN #46) would not be able to tell that they were generated by a machine, because while skimming we are not able to notice the obvious logical inconsistencies in its writing. Key quote: "OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot". This suggests that fake news, social manipulation, etc. will become much easier. However, it might also force people to learn the skill of detecting the difference between humans and bots, which could let them learn to tell when they are actively focusing on something and are "actually learning" as opposed to skimming for "low order correlations".

Rohin's opinion: I noticed a variant of this effect myself while reading GPT-2 results -- my brain very quickly fell into the mode of skimming without absorbing anything, though it felt more like I had made the evaluation that there was nothing to gain from the content, which seems okay if the goal is to avoid fake news. I also find this to be particularly interesting evidence about the differences between our low-level, effortless pattern matching and our more effortful and accurate "logical reasoning".

Other progress in AI


InfoBot: Transfer and Exploration via the Information Bottleneck (Anirudh Goyal et al)

Reinforcement learning

An Overdue Post on AlphaStar (Alex Irpan): The first post in this two-parter talks about the impact of AlphaStar (AN #43) on the StarCraft community and broader public. I'm focusing on the second one, which talks about AlphaStar's technical details and implications. Some of this post overlaps with my summary of AlphaStar, but those parts are better fleshed out and have more details.

First, imitation learning is a surprisingly good base policy, getting to the level of a Gold player. It's surprising because you might expect the DAgger problem to be extreme: since there are so many actions in a StarCraft game, your imitation learning policy will make some errors, and those errors will then compound over the very long remainder of the episode as they take the policy further away from normal human play into states that the policy wasn't trained on.
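To make the compounding-error worry concrete, here is a toy calculation (my own sketch, not something from Alex's post). It assumes a pessimistic model in which the policy errs with probability `eps` on each step while it is still in states like those in the human data, and never recovers once it has erred. The function names and the no-recovery assumption are hypothetical simplifications.

```python
def expected_mistakes_compounding(eps, horizon):
    """Expected off-distribution steps if the first error is never recovered from."""
    on_track = 1.0  # probability that no error has happened so far
    total = 0.0
    for _ in range(horizon):
        on_track *= 1.0 - eps
        # With probability (1 - on_track) this step is spent off the demonstrated states.
        total += 1.0 - on_track
    return total


def expected_mistakes_independent(eps, horizon):
    """Baseline: errors are independent and do not affect which states come later."""
    return eps * horizon
```

Under these assumptions the compounding count is always at least the independent one, and for small `eps` it grows roughly quadratically in the horizon rather than linearly, which is why very long StarCraft episodes look like a hard case for pure imitation.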

Second, population-based training is probably crucial and will be important in the future, because it allows for exploring the full strategy space.

Third, the major challenge is getting RL to achieve okay performance; after that, agents very quickly become great. It took years of research to get Dota and StarCraft bots to reach decent play, and then a few more days of training got them to be world class. Fun quote: "although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training".

Fourth, there were a lot of research results that went into AlphaStar. This suggests that there are large gains to be had by throwing a lot of techniques together and seeing how well they work, which doesn't happen very much currently. There are good reasons for this: it's much easier to evaluate a technique if it's built upon a simple, standard algorithm rather than having to consider all of its interactions with other techniques which you may or may not be able to properly compare against. Still, there are going to be some cool results that we could do now if we just threw the right things together, and this sort of work also lets us test techniques in new settings to see which ones actually work in general, as opposed to only in the original evaluation.

Rohin's opinion: I really like this post, and agree with almost everything in it. On the imitation learning point, I also found it surprising how well imitation learning worked. Alex suggests that it could be that human data has enough variation that the agent can learn how to recover from incorrect decisions it could make. I think this is a partial explanation at best -- there is a huge combinatorial explosion, and it's not clear why you don't need a much larger dataset to cover the entire space. Maybe there are "natural" representations in any realistic complex environment that you start to accurately learn at the level of compute that they're using, and once you have those then imitation learning with sufficient variation can work well.

On the last point about tossing techniques together, I think this might sometimes be worth doing but often may not be. It makes sense to do this with any real task, since that's a test of the technique against reality. (Here StarCraft counts as a "real" task while Atari does not; the criterion is something like "if the task is successfully automated we are impressed regardless of how it is solved".) I'm less keen on tossing techniques together for artificial benchmarks. I think typically these techniques improve the sample efficiency by a constant multiplicative factor by adding something akin to a good inductive bias; in that case throwing them together may let us solve the artificial benchmark sooner but it doesn't give us great evidence that the "inductive bias" will be good for realistic tasks. I think I don't actually disagree with Alex very much on the object-level recommendations, I would just frame them differently.

Learning to Generalize from Sparse and Underspecified Rewards (Rishabh Agarwal et al)

Reward Shaping via Meta-Learning (Haosheng Zou, Tongzheng Ren et al)

Investigating Generalisation in Continuous Deep Reinforcement Learning (Chenyang Zhao et al)

Deep learning

Random Search and Reproducibility for Neural Architecture Search (Liam Li et al)


MIRI Summer Fellows Program (Colm Ó Riain): CFAR and MIRI are running the MIRI Summer Fellows Program from August 9-24. Applications are due March 31.

RAISE is launching their MVP (Toon Alfrink): The Road to AI Safety Excellence will begin publishing lessons on inverse reinforcement learning and iterated amplification on Monday. They are looking for volunteers for their testing panel, who will study the material for about one full day per week, with guidance from RAISE, and provide feedback on the material and in particular on any sources of confusion.
