[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


AI alignment landscape (Paul Christiano) (summarized by Rohin): This post presents the following decomposition of how to make AI go well:

[Link to image below]

Rohin's opinion: Here are a few points about this decomposition that were particularly salient or interesting to me.

First, at the top level, the problem is decomposed into alignment, competence, and coping with the impacts of AI. The "alignment tax" (extra technical cost for safety) is only applied to alignment, and not competence. While there isn't a tax in the "coping" section, I expect that is simply due to a lack of space; I expect that extra work will be needed for this, though it may not be technical. I broadly agree with this perspective: to me, it seems like the major technical problem which differentially increases long-term safety is to figure out how to get powerful AI systems that are trying to do what we want, i.e. they have the right motivation (AN #33). Such AI systems will hopefully make sure to check with us before taking unusual irreversible actions, making e.g. robustness and reliability less important. Note that techniques like verification, transparency, and adversarial training (AN #43) may still be needed to ensure that the alignment itself is robust and reliable (see the inner alignment box); the claim is just that robustness and reliability of the AI's capabilities is less important.

Second, strategy and policy work here is divided into two categories: improving our ability to pay technical taxes (extra work that needs to be done to make AI systems better), and improving our ability to handle impacts of AI. Often, generically improving coordination can help with both categories: for example, the publishing concerns around GPT-2 (AN #46) have allowed researchers to develop synthetic text detection (the first category) as well as to coordinate on when not to release models (the second category).

Third, the categorization is relatively agnostic to the details of the AI systems we develop -- these only show up in level 4, where Paul specifies that he is mostly thinking about aligning learning, and not planning and deduction. It's not clear to me to what extent the upper levels of the decomposition make as much sense if considering other types of AI systems: I wouldn't be surprised if I thought the decomposition was not as good for risks from e.g. powerful deductive algorithms, but it would depend on the details of how deductive algorithms become so powerful. I'd be particularly excited to see more work presenting more concrete models of powerful AGI systems, and reasoning about risks in those models, as was done in Risks from Learned Optimization (AN #58).

Previous newsletters

Addendum to AI and Compute (Girish Sastry et al) (summarized by Rohin): Last week, I said that this addendum suggested that we don't see the impact of AI winters in the graph of compute usage over time. While true, this was misleading: the post is measuring compute used to train models, which was less important in past AI research (e.g. it doesn't include Deep Blue), so it's not too surprising that we don't see the impact of AI winters.

Technical AI alignment

Mesa optimization

Will transparency help catch deception? Perhaps not (Matthew Barnett) (summarized by Rohin): Recent (AN #70) posts (AN #72) have been optimistic about using transparency tools to detect deceptive behavior. This post argues that we may not want to use transparency tools, because then the deceptive model can simply adapt to fool the transparency tools. Instead, we need something more like an end-to-end trained deception checker that's about as smart as the deceptive model, so that the deceptive model can't fool it.

Rohin's opinion: In a comment, Evan Hubinger makes a point I agree with: the transparency tools don't need to be able to detect all deception; they just need to prevent the model from developing deception. If deception gets added slowly (i.e. the model doesn't "suddenly" become perfectly deceptive), then this can be way easier than detecting deception in arbitrary models, and could be done by tools.

Prerequisities: Relaxed adversarial training for inner alignment (AN #70)

More variations on pseudo-alignment (Evan Hubinger) (summarized by Nicholas): This post identifies two additional types of pseudo-alignment not mentioned in Risks from Learned Optimization (AN #58). Corrigible pseudo-alignment is a new subtype of corrigible alignment. In corrigible alignment, the mesa optimizer models the base objective and optimizes that. Corrigible pseudo-alignment occurs when the model of the base objective is a non-robust proxy for the true base objective. Suboptimality deceptive alignment is when deception would help the mesa-optimizer achieve its objective, but it does not yet realize this. This is particularly concerning because even if AI developers check for and prevent deception during training, the agent might become deceptive after it has been deployed.

Nicholas's opinion: These two variants of pseudo-alignment seem useful to keep in mind, and I am optimistic that classifying risks from mesa-optimization (and AI more generally) will make them easier to understand and address.

Preventing bad behavior

Vehicle Automation Report (NTSB) (summarized by Zach): Last week, the NTSB released a report on the Uber automated driving system (ADS) that hit and killed Elaine Herzberg. The pedestrian was walking across a two-lane street with a bicycle. However, the car didn't slow down before impact. Moreover, even though the environment was dark, the car was equipped with LIDAR sensors which means that the car was able to fully observe the potential for collision. The report takes a closer look at how Uber had set up their ADS and notes that in addition to not considering the possibility of jay-walkers, "...if the perception system changes the classification of a detected object, the tracking history of that object is no longer considered when generating new trajectories". Additionally, in the final few seconds leading up to the crash the vehicle engaged in action suppression, which is described as "a one-second period during which the ADS suppresses planned braking while the (1) system verifies the nature of the detected hazard and calculates an alternative path, or (2) vehicle operator takes control of the vehicle". The reason cited for implementing this was concerns of false alarms which could cause the vehicle to engage in unnecessary extreme maneuvers. Following the crash, Uber suspended its ADS operations and made several changes. They now use onboard safety features of the Volvo system that were previously turned off, action suppression is no longer implemented, and path predictions are held across object classification changes.

Zach's opinion: While there is a fair amount of nuance regarding the specifics of how Uber's ADS was operating it does seem as though there was a fair amount of incompetence in how the ADS was deployed. Turning off Volvo system fail-safes, not accounting for jaywalking, and trajectory reseting seem like unequivocal mistakes. A lot of people also seem upset that Uber was engaging in action suppression. However, given that randomly engaging in extreme maneuvering in the presence of other vehicles can indirectly cause accidents I have a small amount of sympathy for why such a feature existed in the first place. Of course, the feature was removed and it's worth noting that "there have been no unintended consequences—increased number of false alarms".

Read more: Jeff Kaufman writes a post summarizing both the original incident and the report. Wikipedia is also rather thorough in their reporting on the factual information. Finally, Planning and Decision-Making for Autonomous Vehicles gives an overview of recent trends in the field and provides good references for people interested in safety concerns.


Explicability? Legibility? Predictability? Transparency? Privacy? Security? The Emerging Landscape of Interpretable Agent Behavior (Tathagata Chakraborti et al) (summarized by Flo): This paper reviews and discusses definitions of concepts of interpretable behaviour. The first concept, explicability measures how close an agent's behaviour is to the observer's expectations. An agent that takes a turn while its goal is straight ahead does not behave explicably by this definition, even if it has good reasons for its behaviour, as long as these reasons are not captured in the observer's model. Predictable behaviour reduces the observer's uncertainty about the agent's future behaviour. For example, an agent that is tasked to wait in a room behaves more predictably if it shuts itself off temporarily than if it paced around the room. Lastly, legibility or transparency reduces observer's uncertainty about an agent's goal. This can be achieved by preferentially taking actions that do not help with other goals. For example, an agent tasked with collecting apples can increase its legibility by actively avoiding pears, even if it could collect them without any additional costs.

These definitions do not always assume correctness of the observer's model. In particular, an agent can explicably and predictably achieve the observer's task in a specific context while actually trying to do something else. Furthermore, these properties are dynamic. If the observer's model is imperfect and evolves from observing the agent, formerly inexplicable behaviour can become explicable as the agent's plans unfold.

Flo's opinion: Conceptual clarity about these concepts seems useful for more nuanced discussions and I like the emphasis on the importance of the observer's model for interpretability. However, it seems like concepts around interpretability that are not contingent on an agent's actual behaviour (or explicit planning) would be even more important. Many state-of-the-art RL agents do not perform explicit planning, and ideally we would like to know something about their behaviour before we deploy them in novel environments.

AI strategy and policy

AI policy careers in the EU (Lauro Langosco)

Other progress in AI

Reinforcement learning

Superhuman AI for multiplayer poker (Noam Brown et al) (summarized by Matthew): In July, this paper presented the first AI that can play six-player no-limit Texas hold’em poker better than professional players. Rather than using deep learning, it works by precomputing a blueprint strategy using a novel variant of Monte Carlo linear counterfactual regret minimization, an iterative self-play algorithm. To traverse the enormous game tree, the AI buckets moves by abstracting information in the game. During play, the AI adapts its strategy by modifying its abstractions according to how the opponents play, and by performing real-time search through the game tree. It used the equivalent of $144 of cloud compute to calculate the blueprint strategy and two server grade CPUs, which was much less hardware than what prior AI game milesones required.

Matthew's opinion: From what I understand, much of the difficulty of poker lies in being careful not to reveal information. For decades, computers have already had an upper hand in being silent, computing probabilities, and choosing unpredictable strategies, which makes me a bit surprised that this result took so long. Nonetheless, I found it interesting how little compute was required to accomplish superhuman play.

Read more: Let's Read: Superhuman AI for multiplayer poker

Meta learning

Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning (Tianhe Yu, Deirdre Quillen, Zhanpeng He et al) (summarized by Asya): "Meta-learning" or "learning to learn" refers to the problem of transferring insight and skills from one set of tasks to be able to quickly perform well on new tasks. For example, you might want an algorithm that trains on some set of platformer games to pick up general skills that it can use to quickly learn new platformer games.

This paper introduces a new benchmark, "Meta World", for evaluating meta-learning algorithms. The benchmark consists of 50 simulated robotic manipulation tasks that require a robot arm to do a combination of reaching, pushing and grasping. The benchmark tests the ability of algorithms to learn to do a single task well, learn one multi-task policy that trains and performs well on several tasks at once, and adapt to new tasks after training on a number of other tasks. The paper argues that unlike previous meta-learning evaluations, the task distribution in this benchmark is very broad while still having enough shared structure that meta-learning is possible.

The paper evaluates existing multi-task learning and meta-learning algorithms on this new benchmark. In meta-learning, it finds that different algorithms do better depending on how much training data they're given. In multi-task learning, it finds that the algorithm that performs best uses multiple "heads", or ends of neural networks, one for each task. It also finds that algorithms that are "off-policy"-- that estimate the value of actions other than the one that the network is currently planning to take-- perform better on multi-task learning than "on-policy" algorithms.

Asya's opinion: I really like the idea of having a standardized benchmark for evaluating meta-learning algorithms. There's a lot of room for improvement in performance on the benchmark tasks and it would be cool if this incentivized algorithm development. As with any benchmark, I worry that it is too narrow to capture all the nuances of potential algorithms; I wouldn't be surprised if some meta-learning algorithm performed poorly here but did well in some other domain.


CHAI 2020 Internships (summarized by Rohin): CHAI (the lab where I work) is currently accepting applications for its 2020 internship program. The deadline to apply is Dec 15.