Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


The Assistive Multi-Armed Bandit (Lawrence Chan et al) (summarized by Asya): Standard approaches for inverse reinforcement learning assume that humans are acting optimally according to their preferences, rather than learning about their preferences as time goes on. This paper tries to model the latter by introducing the assistive multi-armed bandit problem.

In the standard multi-armed bandit problem, a player repeatedly chooses one of several “arms” to pull, where each arm provides reward according to some unknown distribution. Imagine getting 1000 free plays on your choice of 10 different, unknown slot machines. This is a hard problem since the player must trade off between exploration (learning about some arm) and exploitation (pulling the best arm so far). In assistive multi-armed bandit, a robot is given the opportunity to intercept the player every round and pull an arm of its choice. If it does not intercept, it can see the arm pulled by the player but not the reward the player receives. This formalizes the notion of an AI with only partial information trying to help a learning agent optimize their reward.

The paper does some theoretical analysis of this problem as well as an experimental set-up involving a neural network and players acting according to a variety of different policies. It makes several observations about the problem:

- A player better at learning does not necessarily lead to the player-robot team performing better-- the robot can help a suboptimal player do better in accordance with how much information the player's arm pulls convey about the reward of the arm.

- A robot is best at assisting when it has the right model for how the player is learning.

- A robot that models the player as learning generally does better than a robot that does not, even if the robot has the wrong model for the player's learning.

- The problem is very sensitive to which learning model the player uses and which learning model the robot assumes. Some player learning models can only be effectively assisted when they are correctly modeled. Some robot-assumed learning models effectively assist for a variety of actual player learning models.

Asya's opinion: The standard inverse reinforcement learning assumption about humans acting optimally seems unrealistic; I think this paper provides an insightful initial step in not having that assumption and models the non-optimal version of the problem in a clean and compelling way. I think it's a noteworthy observation that this problem is very sensitive to the player's learning model, and I agree with the paper that this suggests that we should put effort into researching actual human learning strategies. I am unsure how to think about the insights here generalizing to other inverse reinforcement learning cases.

Technical AI alignment


Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence (David Manheim) (summarized by Flo): While Categorizing Variants of Goodhart’s Law explains failure modes that occur when a single agent’s proxy becomes decoupled from the true goal, this paper aims to characterize failures involving multiple agents:

Accidental steering happens when the combined actions of multiple agents facilitate single-agent failures. For example, catching more fish now is usually positively correlated with a fisherman's long term goals, but this relationship inverts once there are lots of fishermen optimizing for short term gains and the fish population collapses.

Coordination Failure occurs when agents with mutually compatible goals fail to coordinate. For example, due to incomplete models of other agent's goals and capabilities, two agents sharing a goal might compete for a resource even though one of them is strictly better at converting the resource into progress towards their goal.

Adversarial optimization is when an agent O steers the world into states where V's proxy goal is positively correlated with O's goal. For example, one could exploit investors who use short term volatility as a proxy for risk by selling them instruments that are not very volatile but still risky.

Input Spoofing is the act of one agent manipulating another learning agent's model, either by manufacturing false evidence or by filtering the received evidence systematically, as arguably happened with Microsoft’s Tay.

Finally, Goal co-option happens when agent O has (partial) control over the hardware agent V runs or relies on. This way, O can either modify the reward signal V receives to change what V optimizes for, or it can directly change V's outputs.

The difficulties in precisely modelling other sophisticated agents and other concerns related to embedded agency make it hard to completely avoid these failure modes with current methods. Slowing down the deployment of AI systems and focussing on the mitigation of the discussed failure modes might prevent limited near term catastrophes, which in turn might cause a slowdown of further deployment and prioritization of safety.

Flo's opinion: I like that this paper subdivides failure modes that can happen in multiparty optimization into several clear categories and provides various models and examples for each of them. I am unsure about the conclusion: on one hand, slowing down deployment to improve the safety of contemporary systems seems very sensible. On the other hand, it seems like there would be some failures of limited scope that are hard to reproduce "in the lab". Widely deployed AI systems might provide us with valuable empirical data about these failures and improve our understanding of the failure modes in general. I guess ideally there would be differential deployment with rapid deployment in noncritical areas like managing local parking lots, but very slow deployment for critical infrastructure.

Rohin's opinion: I'm particularly interested in an analysis of how these kinds of failures affect existential risk. I'm not sure if David believes they are relevant for x-risk, but even if so the arguments aren't presented in this paper.

Mesa optimization

Relaxed adversarial training for inner alignment (Evan Hubinger) (summarized by Matthew): Previously, Paul Christiano proposed creating an adversary to search for inputs that would make a powerful model behave "unacceptably" and then penalizing the model accordingly. To make the adversary's job easier, Paul relaxed the problem so that it only needed to find a pseudo-input, which can be thought of as predicate that constrains possible inputs. This post expands on Paul's proposal by first defining a formal unacceptability penalty and then analyzing a number of scenarios in light of this framework. The penalty relies on the idea of an amplified model inspecting the unamplified version of itself. For this procedure to work, amplified overseers must be able to correctly deduce whether potential inputs will yield unacceptable behavior in their unamplified selves, which seems plausible since it should know everything the unamplified version does. The post concludes by arguing that progress in model transparency is key to these acceptability guarantees. In particular, Evan emphasizes the need to decompose models into the parts involved in their internal optimization processes, such as their world models, optimization procedures, and objectives.

Matthew's opinion: I agree that transparency is an important condition for the adversary, since it would be hard to search for catastrophe-inducing inputs without details of how the model operated. I'm less certain that this particular decomposition of machine learning models is necessary. More generally, I am excited to see how adversarial training can help with inner alignment.

Learning human intent

Learning from Observations Using a Single Video Demonstration and Human Feedback (Sunil Gandhi et al) (summarized by Zach): Designing rewards can be a long and consuming process, even for experts. One common method to circumvent this problem is through demonstration. However, it might be difficult to record demonstrations in a standard representation, such as joint positions. In this paper, the authors propose using human feedback to circumvent the discrepancy between how demonstrations are recorded (video) and the desired standard representation (joint positions). First, humans provide similarity evaluations of short clips of an expert demonstration to the agent's attempt and a similarity function is learned by the agent. Second, this similarity function is used to help train a policy that can imitate the expert. Both functions are learned jointly. The algorithm can learn to make a Hopper agent back-flip both from a Hopper demonstration of a back-flip, and from a YouTube video of a human backflipping. Ultimately, the authors show that their method improves over another method that uses human feedback without direct comparison to desired behavior.

Zach's opinion: This paper seems like a natural extension of prior work. The imitation learning problem from observation is well-known and difficult. Introducing human feedback with a structured state space definitely seems like a viable way to get around a lot of the known difficulties with other methods such as a GAIL.

Handling groups of agents

Collaborating with Humans Requires Understanding Them (Micah Carroll et al) (summarized by Rohin): Note: I am second author on this paper. Self-play agents (like those used to play Dota (AN #13) and Starcraft (AN #43)) are very good at coordinating with themselves, but not with other agents. They "expect" their partners to be similar to them; they are unable to predict what human partners would do. In competitive games, this is fine: if the human deviates from optimal play, even if you don't predict it you will still beat them. (Another way of saying this: the minimax theorem guarantees a minimum reward regardless of the opponent.) However, in cooperative settings, things are not so nice: a failure to anticipate your partner's plan can lead to arbitrarily bad outcomes. We demonstrate this with a simple environment that requires strong coordination based on the popular game Overcooked. We show that agents specifically trained to play alongside humans perform much better than self-play or population-based training when paired with humans, both in simulation and with a real user study.

Rohin's opinion: I wrote a short blog post talking about the implications of the work. Briefly, there are three potential impacts. First, it seems generically useful to understand how to coordinate with an unknown agent. Second, it is specifically useful for scaling up assistance games (AN #69), which are intractable to solve optimally. Finally, it can lead to more ML researchers focusing on solving problems with real humans, which may lead to us finding and solving other problems that will need to be solved in order to build aligned AI systems.

Read more: Paper: On the Utility of Learning about Humans for Human-AI Coordination

Learning Existing Social Conventions via Observationally Augmented Self-Play (Adam Lerer and Alexander Peysakhovich) (summarized by Rohin): This paper starts from the same key insight about self-play not working when it needs to generalize to out-of-distribution agents, but then does something different. They assume that the test-time agents are playing an equilibrium policy, that is, each agent plays a best response policy assuming all the other policies are fixed. They train their agent using a combination of imitation learning and self-play: the self-play gets them to learn an equilibrium behavior, while the imitation learning pushes them towards the equilibrium that the test-time agents use. They outperform both vanilla self-play and vanilla imitation learning.

Rohin's opinion: Humans don't play equilibrium policies, since they are often suboptimal. For example, in Overcooked, any equilibrium policy will zip around the layout, rarely waiting, which humans are not capable of doing. However, when you have a very limited dataset of human behavior, the bias provided by the assumption of an equilibrium policy probably does help the agent generalize better than a vanilla imitation learning model, and so this technique might do better when there is not much data.

Adversarial examples

Adversarial Policies: Attacking Deep Reinforcement Learning (Adam Gleave et al) (summarized by Sudhanshu): This work demonstrates the existence of adversarial policies of behaviour in high-dimensional, two-player zero-sum games. Specifically, they show that adversarially-trained agents ("Adv"), who can only affect a victim's observations of their (Adv's) states, can act in ways that confuse the victim into behaving suboptimally.

An adversarial policy is trained by reinforcement learning in a single-player paradigm where the victim is a black-box fixed policy that was previously trained via self-play to be robust to adversarial attacks. As a result, the adversarial policies learn to push the observations of the victim outside the training distribution, causing the victim to behave poorly. The adversarial policies do not actually behave intelligently, such as blocking or tackling the victim, but instead do unusual things like spasming in a manner that appears random to humans, curling into a ball or kneeling.

Further experiments showed that if the victim's observations of the adversary were removed, then the adversary was unable to learn such an adversarial policy. In addition, the victim's network activations were very different when playing against an adversarial policy relative to playing against a random or lifeless opponent. By comparing two similar games where the key difference was the number of adversary dimensions being observed, they showed that such policies were easier to learn in higher-dimensional games.

Sudhanshu's opinion: This work points to an important question about optimisation in high dimension continuous spaces: without guarantees on achieving solution optimality, how do we design performant systems that are robust to (irrelevant) off-distribution observations? By generating demonstrations that current methods are insufficient, it can inspire future work across areas like active learning, continual learning, fall-back policies, and exploration.

I had a tiny nit-pick: while the discussion is excellent, the paper doesn't cover whether this phenomenon has been observed before with discrete observation/action spaces, and why/why not, which I feel would be an important aspect to draw out. In a finite environment, the victim policy might have actually covered every possible situation, and thus be robust to such attacks; for continuous spaces, it is not clear to me whether we can always find an adversarial attack.

In separate correspondence, author Adam Gleave notes that he considers these to be relatively low-dimensional -- even MNIST has way more dimensions -- so when comparing to regular adversarial examples work, it seems like multi-agent RL is harder to make robust than supervised learning.

Read more: Adversarial Policies website

Other progress in AI

Reinforcement learning

Solving Rubik’s Cube with a Robot Hand (OpenAI) (summarized by Asya): Historically, researchers have had limited success making general purpose robot hands. Now, OpenAI has successfully trained a pair of neural networks to solve a Rubik's cube with a human-like robot hand (the learned portion of the problem is manipulating the hand -- solving the Rubik's cube is specified via a classical algorithm). The hand is able to solve the Rubik's cube even under a variety of perturbations, including having some of its fingers tied together, or having its view of the cube partially occluded. The primary innovation presented is a new method called Automatic Domain Randomization (ADR). ADR automatically generates progressively more difficult environments to train on in simulation that are diverse enough to capture the physics of the real world. ADR performs better than existing domain randomization methods, which require manually specifying randomization ranges. The post speculates that ADR is actually leading to emergent meta-learning, where the network learns a learning algorithm that allows itself to rapidly adapt its behavior to its environment.

Asya's opinion: My impression is that this is a very impressive robotics result, largely because the problem of transferring training in simulation to real life ("sim2real") is extremly difficult. I also think it's quite novel if as the authors hypothesize, the system is exhibiting emergent meta-learning. It's worth noting that the hand is still not quite at human-level -- in the hardest configurations, it only succeeds 20% of the time, and for most experiments, the hand gets some of the state of the cube via Bluetooth sensors inside the cube, not just via vision.

Read more: Vox: Watch this robot solve a Rubik’s Cube one-handed


FHI DPhil Scholarships (summarized by Rohin): The Future of Humanity Institute will be awarding up to two DPhil scholarships for the 2020/21 academic year, open to students beginning a DPhil at the University of Oxford whose research aims to answer crucial questions for improving the long-term prospects of humanity. Applications will open around January or February, and decisions will be made in April.

Post-Doctoral Fellowship on Ethically Aligned Artificial Intelligence (summarized by Rohin) (H/T Daniel Dewey): Mila is looking for a postdoctoral fellow starting in Fall 2020 who would work on ethically aligned learning machines, towards building machines which can achieve specific goals while acting in a way consistent with human values and social norms. Applications are already being processed, and will continue to be processed until the position is filled.

New Comment