Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


The Alignment Problem (Brian Christian) (summarized by Rohin): This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of:

- The gorilla misclassification incident

- The faulty reward in CoastRunners

- The gender bias in language models

- The failure of facial recognition models on minorities

- The COMPAS controversy (leading up to impossibility results in fairness)

- The neural net that thought asthma reduced the risk of pneumonia

It then moves on to agency and reinforcement learning, covering from a more historical and academic perspective how we have arrived at such ideas as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming (AN #97) and mesa optimization (AN #58) that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, rewards will do ~nothing, and the main requirement to get a competent agent is to provide good shaping rewards or a good curriculum. Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example):

- BF Skinner's experiments in training pigeons

- The invention of the perceptron

- The success of TD-Gammon, and later AlphaGo Zero

The final part, titled "Normativity", delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities -- how to get AI systems that optimize for their objectives -- this last one tackles head on the problem that we want AI systems that optimize for our (often-unknown) objectives, covering such topics as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty.

Rohin's opinion: I really enjoyed this book, primarily because of the tracing of the intellectual history of various ideas. While I knew of most of these ideas, and sometimes also who initially came up with the ideas, it's much more engaging to read the detailed stories of how that person came to develop the idea; Brian's book delivers this again and again, functioning like a well-organized literature survey that is also fun to read because of its great storytelling. I struggled a fair amount in writing this summary, because I kept wanting to somehow communicate the writing style; in the end I decided not to do it and to instead give a few examples of passages from the book in this post.



Clarifying “What failure looks like” (part 1) (Sam Clarke) (summarized by Rohin): The first scenario outlined in What failure looks like (AN #50) stems from a failure to specify what we actually want, so that we instead build AI systems that pursue proxies of what we want. As AI systems become responsible for more of the economy, human values become less influential relative to the proxy objectives the AI systems pursue, and as a result we lose control over the future. This post aims to clarify whether such a scenario leads to lock in, where we are stuck with the state of affairs and cannot correct it to get “back on course”. It identifies five factors which make this more likely:

1. Collective action problems: Many human institutions will face competitive (short-term) pressures to deploy AI systems with bad proxies, even if it isn’t in humanity’s long-term interest.

2. Regulatory capture: Influential people (such as CEOs of AI companies) may benefit from AI systems that optimize proxies, and so oppose measures to fix the issue (e.g. by banning such AI systems).

3. Ambiguity: There may be genuine ambiguity about whether it is better to have these AI systems that optimize for proxies, even from a long-term perspective, especially because all clear and easy-to-define metrics will likely be going up (since those can be turned into proxy objectives).

4. Dependency: AI systems may become so embedded in society that society can no longer function without them.

5. Opposition: The AI systems themselves may oppose any fixes we propose.

We can also look at historical precedents. Factors 1-3 have played an important role in climate change, though if it does lead to lock in, this will be “because of physics”, unlike the case with AI. The agricultural revolution, which arguably made human life significantly worse, still persisted thanks to its productivity gains (factor 1) and the loss of hunter-gathering skills (factor 4). When the British colonized New Zealand, the Maori people lost significant control over their future, because each individual chief needed guns (factor 1), trading with the British genuinely made them better off initially (factor 3), and eventually the British turned to manipulation, confiscation and conflict (factor 5).

With AI in particular, we might expect that an increase in misinformation and echo chambers exacerbates ambiguity (factor 3), and that due to its general-purpose nature, dependency (factor 4) may be more of a risk.

The post also suggests some future directions for estimating the severity of lock in for this failure mode.

Rohin's opinion: I think this topic is important and the post did it justice. I feel like factors 4 and 5 (dependency and opposition) capture the reasons I expect lock in, with factors 1-3 as less important but still relevant mechanisms. I also really liked the analogy with the British colonization of New Zealand -- it felt like it was in fact quite analogous to how I’d expect this sort of failure to happen.

"Unsupervised" translation as an (intent) alignment problem (Paul Christiano) (summarized by Rohin): We have previously seen that a major challenge for alignment is that our models may learn inaccessible information (AN #104) that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a particular concrete problem to ground this out.

Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it “should” have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem to meet this bar.

One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).


Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery (Kristian Hartikainen et al) (summarized by Robert): In reinforcement learning (RL), reward function specification is a central problem in training a successful policy. For a large class of tasks, we can frame the problem as goal-directed RL: giving a policy a representation of a goal (for example coordinates in a map, or a picture of a location) and training the policy to reach this goal. In this setting, the naive reward function would be to give a reward of 1 when the policy reaches the goal state (or very close to it), and a reward of 0 otherwise. However, this makes it difficult to train the correct policy, as it will need to explore randomly for a long time before finding the true reward. Instead, if we had a notion of distance within the environment, we could use the negative distance from the goal state as the reward function - this would give the policy good information about which direction it should be moving in, even if it hasn't yet found the reward.
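
The contrast between the two reward functions can be made concrete with a minimal sketch. The 2D point environment, the Euclidean distance, and the tolerance `tol` are illustrative assumptions, not the paper's setup:

```python
import math

def sparse_reward(state, goal, tol=0.05):
    """Naive goal-reaching reward: 1 within `tol` of the goal, 0 everywhere else."""
    return 1.0 if math.dist(state, goal) <= tol else 0.0

def distance_reward(state, goal):
    """Shaped reward: negative distance to the goal, so it rises as the agent approaches."""
    return -math.dist(state, goal)

goal = (1.0, 1.0)
far, near = (0.0, 0.0), (0.9, 0.9)

# The sparse reward cannot tell these two states apart, while the
# distance-based reward prefers the state that is closer to the goal.
print(sparse_reward(far, goal), sparse_reward(near, goal))
print(distance_reward(far, goal), distance_reward(near, goal))
```

A hand-written distance like this is only available when the state space has an obvious geometry, which is the gap a learned distance is meant to fill.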

This paper is about how to learn a distance function in an unsupervised manner, such that it's useful for shaping the reward of an RL policy. Given an environment without a reward function, and starting with a random goal-directed policy, they alternate between (1) choosing a state s* to train the policy to reach, and (2) training a distance function d(s*, s') which measures the minimum number of environment steps it takes for the policy to reach a state s* from a different state s'. This distance function is trained with supervised learning using data collected by the policy acting in the environment, and is called the Dynamical Distance, as it measures the distance with respect to the environment dynamics and policy behaviour.
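
As a rough illustration of the supervised step, the sketch below turns trajectories into (start, goal, steps) training targets and fits a distance estimate. A lookup table that averages the observed step counts stands in for the learned function here; the table, the averaging, and the function names are assumptions of this sketch rather than details taken from the paper:

```python
from collections import defaultdict

def distance_targets(trajectory):
    """Yield (start, goal, steps) for every state and each state visited at or after it."""
    for i, start in enumerate(trajectory):
        for j in range(i, len(trajectory)):
            yield start, trajectory[j], j - i

def fit_tabular_distance(trajectories):
    """Average the observed step counts for each (goal, start) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in trajectories:
        for start, goal, steps in distance_targets(trajectory):
            sums[(goal, start)] += steps
            counts[(goal, start)] += 1
    # Keys follow the d(s*, s') argument order: (goal, start).
    return {pair: sums[pair] / counts[pair] for pair in sums}

# Two hypothetical rollouts of the current policy over states "a", "b", "c".
rollouts = [["a", "b", "c"], ["a", "c"]]
d = fit_tabular_distance(rollouts)
print(d[("c", "a")])  # average of 2 steps and 1 step
```

Because the targets come from the policy's own rollouts, the estimate changes as the policy improves, which is why the two steps are alternated.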

The key choice in implementing this algorithm is how states are chosen to train the policy (step 1). In the first implementation, the authors choose the state which is farthest from the current state or the starting state, to encourage better long-term planning and skills in the policy and better generalisation in the agent. In the second (and more relevant) implementation, the state is chosen from a selection of random states by a human who is trying to express a preference for a given goal state. This effectively trains the policy to be able to reach states which match human preferences. This second method outperforms Deep RL from Human Preferences in terms of sample efficiency of human queries in learning human preferences across a range of locomotion tasks.

Robert's opinion: What's most interesting about this paper (from an alignment perspective) is the increased sample efficiency of the learning of human preferences, by limiting the type of preferences that can be expressed to preferences over goal states in a goal-directed setting. While not all preferences could be captured this way, I think a large amount of them in a large number of settings could be - it might come down to creating a clever encoding of the task as goal-directed in a way an RL policy could learn.

Aligning Superhuman AI and Human Behavior: Chess as a Model System (Reid McIlroy-Young et al) (summarized by Rohin) (H/T Dylan Hadfield-Menell): Current AI systems are usually focused on some well-defined performance metric. However, as AI systems become more intelligent, we would presumably want to have humans learn from and collaborate with such systems. This is currently challenging since our superintelligent AI systems are quite hard to understand and don’t act in human-like ways.

The authors aim to study this general issue within chess, where we have access both to superintelligent AI systems and lots of human-generated data. (Note: I’ll talk about “ratings” below; these are not necessarily Elo ratings and should just be thought of as some “score” that functions similarly to Elo.) The authors are interested in whether AI systems play in a human-like way and can be used as a way of understanding human gameplay. One particularly notable aspect of human gameplay is that there is a wide range in skill: as a result we would like an AI system that can make predictions conditioned on varying skill levels.

For existing algorithms, the authors analyze the traditional Stockfish engine and the newer Leela (an open-source version of AlphaZero (AN #36)). They can get varying skill levels by changing the depth of the tree search (in Stockfish) or changing the amount of training (in Leela).

For Stockfish, they find that regardless of search depth, Stockfish action distributions monotonically increase in accuracy as the skill of the human goes up -- even when the depth of the search leads to a Stockfish agent with a similar skill rating as an amateur human. (In other words, if you take a low-Elo Stockfish agent and treat it as a predictive model of human players, it is never a great predictive model, but it is best at predicting human experts, not human amateurs.) This demonstrates that Stockfish plays very differently than humans.

Leela on the other hand is somewhat more human-like: when its rating is under 2700, its accuracy is highest on amateur humans; at a rating of 2700 its accuracy is about constant across humans, and above 2700 its accuracy is highest on expert humans. However, its accuracy is still low, and the most competent Leela model is always the best predictor of human play (rather than the Leela model with the most similar skill level to the human whose actions are being predicted).

The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.
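
To illustrate what "one model per skill level" means, here is a toy sketch of behavior cloning stratified by rating. A table of move frequencies stands in for the neural net, and the bucket width, the position labels, and the sample data are all hypothetical:

```python
from collections import Counter, defaultdict

def bucket_of(rating, width=100):
    """Map a rating to the lower edge of its bucket (e.g. 1150 -> 1100)."""
    return (rating // width) * width

def train_stratified(samples, width=100):
    """Fit one move-frequency table per rating bucket.

    samples: iterable of (rating, position, move) tuples.
    """
    models = defaultdict(lambda: defaultdict(Counter))
    for rating, position, move in samples:
        models[bucket_of(rating, width)][position][move] += 1
    return models

def predict(models, rating, position, width=100):
    """Most frequent move for this position among players in the same bucket."""
    counts = models.get(bucket_of(rating, width), {}).get(position)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical human moves: (player rating, position label, move played).
samples = [
    (1150, "start", "e4"), (1180, "start", "e4"), (1120, "start", "d4"),
    (1910, "start", "d4"), (1950, "start", "d4"), (1990, "start", "c4"),
]
models = train_stratified(samples)
print(predict(models, 1130, "start"), predict(models, 1975, "start"))
```

Each bucket only ever sees data from players in its own rating range, so the prediction for a given player depends on which bucket their rating falls into.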

They also investigate a bunch of other scenarios, such as decisions in which there is a clear best action and decisions where humans tend to make mistakes, and find that the models behave as you’d expect (for example, when there’s a clear best action, model accuracy increases across the board).

Rohin's opinion: While I found the motivation and description of this paper somewhat unclear or misleading (Maia seems to me to be identical to behavior cloning, in which case it would not just be a “connection”), the experiments they run are pretty cool and it was interesting to see the pretty stark differences between models trained on a performance metric and models trained to imitate humans.



Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems (Sergey Levine et al) (summarized by Zach): This paper gives an overview of offline reinforcement learning, with the aim of giving readers enough familiarity to start thinking about how to contribute to this area. The utility of a fully offline RL framework is significant: just as supervised learning methods have been able to utilize data for generalizable and powerful pattern recognition, offline RL methods could enable data to be funneled into decision-making machines for applications such as health-care, robotics, and recommender systems. The article is split into a section on formulation and another on benchmarks, followed by a section on applications and a general discussion.

In the formulation portion of the review, the authors give an overview of the offline learning problem and then discuss a number of approaches. Broadly speaking, the biggest challenge is the need for counterfactual reasoning, because the agent must learn using data collected by another agent. Thus, the agent is forced to reason about what would have happened if a different decision had been made. Importance sampling, approximate dynamic programming, and offline model-based approaches are discussed as possible approaches to this counterfactual reasoning problem. In the benchmarks section, the authors review evaluation techniques for offline RL methods. While the authors find that there are many domain-specific evaluations, general benchmarking is less well established. A major issue in creating benchmarks is deciding whether or not to use diverse trajectories/replay buffer data, or only the final expert policy.
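
As one concrete instance of this counterfactual reasoning, the sketch below implements the ordinary (trajectory-wise) importance sampling estimator: each logged return is reweighted by how much more or less likely the target policy was to take the logged actions than the behavior policy that collected the data. The function names and the toy one-step data are assumptions made for illustration, not code from the paper:

```python
def importance_sampling_estimate(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Estimate the target policy's expected return from behavior-policy data.

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob / behavior_prob: functions (state, action) -> action probability.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0       # product of per-step probability ratios
        discounted = 0.0   # discounted return of the logged trajectory
        for t, (state, action, reward) in enumerate(trajectory):
            weight *= target_prob(state, action) / behavior_prob(state, action)
            discounted += (gamma ** t) * reward
        total += weight * discounted
    return total / len(trajectories)

# Hypothetical one-step problem: action 1 pays 1.0, action 0 pays 0.0.
# The behavior policy chose uniformly; the target policy always picks action 1.
logged = [[("s", 0, 0.0)], [("s", 1, 1.0)], [("s", 1, 1.0)], [("s", 0, 0.0)]]
estimate = importance_sampling_estimate(
    logged,
    target_prob=lambda s, a: 1.0 if a == 1 else 0.0,
    behavior_prob=lambda s, a: 0.5,
)
print(estimate)
```

The weight is a product of one ratio per step, so its variance can grow quickly with the horizon.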

In the discussion, the authors argue that while importance sampling and dynamic programming work on low-dimensional and short-horizon tasks, they struggle to integrate well with function approximators. On the other hand, the authors see approaches that constrain the space of policies to be near the dataset as a promising direction to mitigate the effects of distributional shift. However, the authors acknowledge that it may ultimately take more systematic datasets to push the field forward.

Zach's opinion: This was a great overview of the state of the field. A recurring theme that the authors highlight is that offline RL requires counterfactual reasoning which may be fundamentally difficult to achieve because of distributional shift. Some results shown in the paper suggest that offline RL may just be fundamentally hard. However, I find myself sharing optimism with the authors on the subject of policy constraint techniques and the inevitable importance of better datasets.


State of AI Report 2020 (Nathan Benaich and Ian Hogarth) (summarized by Rohin): The third State of AI (AN #15) report is out! I won’t go into details here since there is really quite a lot of information, but I recommend scrolling through the presentation to get a sense of what’s been going on. I was particularly interested in their 8 predictions for the next year: most of them seemed like they were going out on a limb, predicting something that isn’t just “the default continues”. Of last year’s 6 predictions, 4 were correct, 1 was wrong, and 1 was technically wrong but quite close to being correct; even this 67% accuracy would be pretty impressive on this year’s 8 predictions. (It does seem to me that last year’s predictions were more run-of-the-mill, but that might just be hindsight bias.)


Hiring engineers and researchers to help align GPT-3 (Paul Christiano) (summarized by Rohin): The Reflection team at OpenAI is hiring ML engineers and ML researchers to push forward work on aligning GPT-3. Their most recent results are described in Learning to Summarize with Human Feedback (AN #116).


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.


The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.

Yeah, I think that's all they mean: the CNN and input/output are the same as in Leela, which are the same as in AlphaZero. But it does differ from behavioral cloning in that they stratify the samples - typically, behavior cloning dumps in all available expert samples (perhaps with a minimum cutoff rating, which is how AlphaGo filtered its KGS pretraining) and trains on them all equally.

Personally, I would've trained a single conditional model with a specified player-Elo for each move, instead of arbitrarily bucketing into 9 levels of Elo ranges, but perhaps they have so many games that each bucket is enough (12m each as they emphasize) and they preferred to keep it simple and spend data/compute instead of making the training & runtime more complicated.

But it does differ from behavioral cloning in that they stratify the samples

Fair point. In my ontology, "behavior cloning" is always with respect to some expert distribution, so I see the stratified samples as "several instances of behavior cloning with different expert distributions", but that isn't a particularly normal or accepted ontology.

Personally, I would've trained a single conditional model with a specified player-Elo for each move

Yeah it does seem like this would have worked better -- if nothing else, the predictions could be more precise (rather than specifying the bucket in which the current player falls, you can specify their exact ELO instead).

That Hartikainen et al. paper was really interesting! Unfortunately I don't know enough about the state of the art for unsupervised exploration - they compare DDLUS to a 2018 paper (DIAYN), but I'm not sure how either of these compares to other prominent exploration techniques (e.g. something like NGU).

I also wonder if different techniques do better on atari vs. mujoco environments for "unprincipled" reasons that make apples to apples comparisons difficult for techniques developed by different groups.

they compare DDLUS to a 2018 paper (DIAYN)

Note the paper itself is from July 2019. (Not everything in the newsletter is the latest news.)

I also wonder if different techniques do better on atari vs. mujoco environments for "unprincipled" reasons that make apples to apples comparisons difficult for techniques developed by different groups.

That seems quite likely to me, but one would hope that a good method also works in situations it wasn't designed for, so this still seems like a reasonable evaluation to me.