Alignment Newsletter #28



Motivating the Rules of the Game for Adversarial Example Research (Justin Gilmer, George E. Dahl et al) (summarized by Dan H): In this position paper, the authors argue that many of the threat models which motivate adversarial examples are unrealistic. They enumerate various previously proposed threat models, and then they show their limitations or detachment from reality. For example, it is common to assume that an adversary must create an imperceptible perturbation to an example, but often attackers can input whatever they please. In fact, in some settings an attacker can provide an input from the clean test set that is misclassified. Also, they argue that adversarial robustness defenses which degrade clean test set error are likely to make systems less secure since benign or nonadversarial inputs are vastly more common. They recommend that future papers motivated by adversarial examples take care to define the threat model realistically. In addition, they encourage researchers to establish “content-preserving” adversarial attacks (as opposed to “imperceptible” l_p attacks) and improve robustness to unseen input transformations.

Dan H's opinion: This is my favorite paper of the year as it handily counteracts much of the media coverage and research lab PR purporting "doom" from adversarial examples. While there are some scenarios in which imperceptible perturbations may be a motivation (consider user-generated privacy-creating perturbations to Facebook photos which stupefy face detection algorithms), much of the current adversarial robustness research optimizing small l_p ball robustness can be thought of as tackling a simplified subproblem before moving to a more realistic setting. Because of this paper, new tasks such as Unrestricted Adversarial Examples (AN #24) take an appropriate step toward increasing realism without appearing to make the problem too hard.

Technical AI alignment

Agent foundations

A Rationality Condition for CDT Is That It Equal EDT (Part 2) (Abram Demski)

Learning human intent

Learning under Misspecified Objective Spaces (Andreea Bobu et al): What can you do if the true objective that you are trying to infer is outside of your hypothesis space? The key insight of this paper is that in this scenario, the human feedback that you get will likely not make sense for any reward function in your hypothesis space, which allows you to notice when this is happening. This is operationalized using a Bayesian model in which a latent binary variable represents whether or not the true objective is in the hypothesis space. If it is, then the rationality constant β will be large (i.e. the human appears to be rational), whereas if it is not, then β will be small (i.e. the human appears to be noisy). The authors evaluate with real humans correcting the trajectory of a robotic arm.
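
To make the role of β concrete, here is a minimal sketch of the inference described above, not the paper's implementation. It assumes a Boltzmann-rational human who picks among a few discrete corrections, and two fixed values BETA_IN and BETA_OUT for the two cases; those values, the function names and the toy data are all assumptions made for illustration.

```python
import math

BETA_IN = 10.0   # assumed rationality if the true objective is in the hypothesis space
BETA_OUT = 0.1   # assumed rationality if it is not (human looks noisy)


def boltzmann_prob(choice, costs, beta):
    """Probability that a Boltzmann-rational human picks option `choice`."""
    weights = [math.exp(-beta * c) for c in costs]
    return weights[choice] / sum(weights)


def prob_in_hypothesis_space(observations, prior_in=0.5):
    """Posterior that the objective is in the hypothesis space.

    observations: list of (choice_index, costs), where costs are the option
    costs under the best-fitting objective in the hypothesis space.
    """
    log_in = math.log(prior_in)
    log_out = math.log(1.0 - prior_in)
    for choice, costs in observations:
        log_in += math.log(boltzmann_prob(choice, costs, BETA_IN))
        log_out += math.log(boltzmann_prob(choice, costs, BETA_OUT))
    # Normalize in log space to avoid underflow.
    m = max(log_in, log_out)
    p_in, p_out = math.exp(log_in - m), math.exp(log_out - m)
    return p_in / (p_in + p_out)


COSTS = [0.0, 1.0, 2.0]
consistent = [(0, COSTS)] * 5                      # always the cheapest option
erratic = [(c, COSTS) for c in (0, 2, 1, 2, 1)]    # choices ignore the costs
```

With this toy data, prob_in_hypothesis_space(consistent) is close to 1 and prob_in_hypothesis_space(erratic) is close to 0, which is the "notice when this is happening" signal.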

Adversarial Imitation via Variational Inverse Reinforcement Learning (Ahmed H. Qureshi et al): A short history of deep IRL algorithms: GAIL introduced the idea of training a policy that fools a discriminator that tries to distinguish a policy from expert demonstrations, GAN-GCL showed how to recover a reward function from the discriminator, and AIRL (AN #17) trains on (s, a, s') tuples instead of trajectories to reduce variance, and learns a reward shaping term separately so that it transfers better to new environments. This paper proposed that the reward shaping term be the empowerment of a state. The empowerment of a state is the maximum mutual information between a sequence of actions from a state, and the achieved next state. Intuitively, this would lead to choosing to go to states from which you can reach the most possible future states. Their evaluation shows that they do about as well as AIRL in learning to imitate an expert, but perform much better in transfer tasks (where the learned reward function must generalize to a new environment).
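
For reference, the definition of empowerment given in words above can be written out as follows. The notation is an assumption made here, not taken from the paper: w is the distribution over actions (or action sequences) a being maximized over, p is the environment's transition distribution, and actions are taken to be discrete so that the marginal is a sum.

```latex
\Phi(s) \;=\; \max_{w} \; I(a;\, s' \mid s)
        \;=\; \max_{w} \; \mathbb{E}_{a \sim w(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}
              \left[ \log \frac{p(s' \mid s, a)}
                               {\sum_{\tilde a} w(\tilde a \mid s)\, p(s' \mid s, \tilde a)} \right]
```

In a deterministic environment where every action leads to a distinct next state, the mutual information equals the entropy of w, so the maximum is log |A| regardless of s.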

Rohin's opinion: I'm confused by this paper, because they only compute the empowerment for a single action. I would expect that in most states, different actions lead to different next states, which suggests that the empowerment will be the same for all states. Why then does it have any effect? And even if the empowerment was computed over longer action sequences, what is the reason that this leads to learning generalizable rewards? My normal model is that IRL algorithms don't learn generalizable rewards because they mostly use the reward to "memorize" the correct actions to take in any given state, rather than learning the underlying true reward. I don't see why empowerment would prevent this from happening. Yet, their experiments show quite large improvements, and don't seem particularly suited to empowerment.

Task-Embedded Control Networks for Few-Shot Imitation Learning (Stephen James et al)

Adversarial examples

Motivating the Rules of the Game for Adversarial Example Research (Justin Gilmer, George E. Dahl et al): Summarized in the highlights!


Verification for Machine Learning, Autonomy, and Neural Networks Survey (Weiming Xiang et al)


Iterative Learning with Open-set Noisy Labels (Yisen Wang et al) (summarized by Dan H): Much previous research on corrupted learning signals deals with label corruption, but this CVPR 2018 paper considers learning with corrupted or irrelevant inputs. For example, they train a CIFAR-10 classifier on CIFAR-10 data mixed with out-of-class CIFAR-100 data; such a scenario can occur with flawed data curation or data scraping. They use a traditional anomaly detection technique based on the local outlier factor to weight training examples; the more out-of-distribution an example is, the less weight the example has in the training loss. This approach apparently helps the classifier cope with irrelevant inputs and recover accuracy.
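
The weighting idea can be sketched in a few lines. This is a plain local outlier factor computed on toy 2D points, not the paper's pipeline (which applies it inside an iterative training loop). The mapping from LOF score to loss weight in example_weights is a hypothetical choice made for this sketch.

```python
import math


def knn(points, i, k):
    """The k nearest neighbours of points[i], as (distance, index) pairs."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return dists[:k]


def local_outlier_factors(points, k=3):
    """Standard LOF: ratio of neighbours' local density to a point's own density."""
    neighbours = [knn(points, i, k) for i in range(len(points))]
    k_dist = [nb[-1][0] for nb in neighbours]
    # Local reachability density (assumes no k+1 exactly duplicated points).
    lrd = []
    for nb in neighbours:
        reach = [max(k_dist[j], d) for d, j in nb]
        lrd.append(len(reach) / sum(reach))
    return [sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
            for i, nb in enumerate(neighbours)]


def example_weights(points, k=3):
    """Hypothetical weighting: 1 for inliers (LOF <= 1), shrinking as LOF grows."""
    return [1.0 / max(1.0, score) for score in local_outlier_factors(points, k)]


# Five in-distribution points and one far-away point standing in for an
# out-of-class example.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
WEIGHTS = example_weights(POINTS)
```

Here the last entry of WEIGHTS is much smaller than the rest, so that example would contribute little to a weighted training loss.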

Making AI Safe in an Unpredictable World: An Interview with Thomas G. Dietterich (Thomas G. Dietterich and Jolene Creighton)

Read more: Open Category Detection with PAC Guarantees is the corresponding paper.

Miscellaneous (Alignment)

Standard ML Oracles vs Counterfactual ones (Stuart Armstrong): (Note: This summary has more of my interpretation than usual.) Consider the setting where an AI system is predicting some variable y = f(x), but we will use the AI's output to make decisions that could affect the true value of y. Let's call the AI's prediction z, and have y = g(x, z), where g captures how humans use z to affect the value of y. The traditional ML approach would be to find the function f that minimizes the distance between y_i and f(x_i) on past examples, but this does not typically account for y depending on z. We would expect that it would converge to outputting a fixed point of g (so that y = z = g(x, z)), since that would minimize its loss. This would generally perform well; while manipulative predictions z are possible, they are unlikely. The main issue is that since the system does not get to observe z (since that is what it is predicting), it cannot model the true causal formulation, and has to resort to complex hypotheses that approximate it. This can lead to overfitting that can't be simply solved by regularization or simplicity priors. Instead, we could use a counterfactual oracle, which reifies the prediction z and then outputs the z that minimizes the distance between z and y, which allows it to model the causal connection y = g(x, z).
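
As a small illustration of what "a fixed point of g" means here, consider the sketch below. The particular g is invented for the example, and the iteration is only a way of locating the fixed point. It is not a model of how an ML system would be trained.

```python
def g(x, z):
    """Hypothetical world: the outcome moves halfway toward whatever was predicted."""
    return x + 0.5 * z


def fixed_point_prediction(x, steps=50):
    """Find z with z = g(x, z) by repeatedly feeding the outcome back in."""
    z = 0.0
    for _ in range(steps):
        z = g(x, z)
    return z


Z_STAR = fixed_point_prediction(3.0)
```

For this g the fixed point is z = 2x, so Z_STAR is (numerically) 6.0 and g(3.0, Z_STAR) returns the same value: the prediction is exactly right once its own influence is accounted for. A predictor that ignored that influence and output z = 3.0 would observe y = 4.5 and be wrong.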

Rohin's opinion: This is an interesting theoretical analysis, and I'm surprised that the traditional ML approach seems to do so well in a context it wasn't designed for. I'm not sure about the part where it would converge to a fixed point of the function g; I've written a rambly comment on the post trying to explain more.

Misbehaving AIs can't always be easily stopped! (El Mahdi El Mhamdi)

AI strategy and policy

The Future of Surveillance (Ben Garfinkel): While we often think of there being a privacy-security tradeoff and an accountability-security tradeoff with surveillance, advances in AI and cryptography can push out the Pareto frontier. For example, automated systems could surveil many people but only report a few suspicious cases to humans, or they could be used to redact sensitive information (e.g. by blurring faces), both of which improve privacy and security significantly compared to the status quo. Similarly, automated ML systems can be applied consistently to every person, can enable collection of good statistics (e.g. false positive rates), and are more interpretable than a human making a judgment call, all of which improve accountability.

China’s Grand AI Ambitions with Jeff Ding (Jeff Ding and Jordan Schneider)

On the (In)Applicability of Corporate Rights Cases to Digital Minds (Cullen O’Keefe)

Other progress in AI


Episodic Curiosity through Reachability (Nikolay Savinov, Anton Raichuk, Raphael Marinier, Damien Vincent et al) (summarized by Richard): This paper addresses the "couch potato" problem for intrinsic curiosity - the fact that, if you reward an agent for observing novel or surprising states, it prefers to sit in front of a TV and keep changing channels rather than actually exploring. It proposes instead rewarding states which are difficult to reach from already-explored states (stored in episodic memory). Their agent has a separate network to estimate reachability, which is trained based on the agent's experiences (where observations a few steps apart are negative examples and those many steps apart are positive examples). This method significantly outperforms the previous state-of-the-art curiosity method on VizDoom and DMLab environments.
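
The episodic-memory mechanism can be sketched as follows. This is a heavily simplified illustration: the learned reachability network is replaced by a hypothetical distance threshold on 2D observations, and the bonus is a fixed constant rather than anything taken from the paper.

```python
import math


def reachable(a, b, threshold=1.0):
    """Stand-in for the learned reachability network: plain Euclidean closeness."""
    return math.dist(a, b) <= threshold


def run_episode(observations, bonus=1.0):
    """Give a curiosity bonus only for observations that are not reachable
    from anything already stored in this episode's memory."""
    memory, rewards = [], []
    for obs in observations:
        if any(reachable(m, obs) for m in memory):
            rewards.append(0.0)
        else:
            rewards.append(bonus)
            memory.append(obs)
    return rewards, memory


OBS = [(0, 0), (0.5, 0), (3, 0), (3.2, 0), (0, 0.1)]
REWARDS, MEMORY = run_episode(OBS)
```

In this toy run only the first and third observations earn a bonus, since every other observation is close to something already in memory. Whether flipping TV channels earns repeated bonuses then depends on what the reachability estimator considers close.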

Richard's opinion: This paper is a useful advance which does help address the couch potato problem, but it seems like it might still fail on similar problems. For example, suppose an agent were given a piece of paper on which it could doodle. Then states with lots of ink are far away from states with little ink, and so it might be rewarded for doodling forever (assuming a perfect model of reachability). My guess is that a model-based metric for novelty will be necessary to counter such problems - but it's also plausible that we end up using combinations of techniques like this one.

Reinforcement learning

Open Sourcing Active Question Reformulation with Reinforcement Learning (Michelle Chen Huebscher et al): Given a question-answering (QA) system, we can get better performance by reformulating a question into a format that is better processed by that system. (A real-world example is google-fu, especially several years ago when using the right search terms was more important.) This blog post and accompanying paper consider doing this using reinforcement learning -- try a question reformulation, see if it gives a good answer, and if so increase the probability of generating that reformulation. For this to work at all, the neural net generating reformulations has to be pretrained to output sensible questions (otherwise it is an extremely sparse reward problem). They do this by training an English-English machine translation system. The generated reformulations are quite interesting -- 99.8% start with "what is name", and many of them repeat words. Presumably the repetition of words is meant to tell the underlying QA system that the word is particularly important.
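
The "try a reformulation, reinforce it if the answer is good" loop is a policy-gradient update. Below is a minimal sketch over a fixed list of candidate reformulations. The candidates, the toy reward function and the hyperparameters are all made up for illustration, and the real system generates reformulations with a sequence model rather than choosing from a list.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reformulator(candidates, qa_reward, steps=200, lr=0.5, seed=0):
    """REINFORCE on a softmax policy over candidate reformulations."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(candidates)), weights=probs)[0]
        r = qa_reward(candidates[a])  # quality of the QA system's answer
        # Gradient of log pi(a) with respect to the logits is onehot(a) - probs.
        for i in range(len(logits)):
            logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)


def toy_qa_reward(question):
    """Hypothetical QA system that only answers well when a key word is repeated."""
    return 1.0 if "year year" in question else 0.0


CANDIDATES = [
    "in what year did india gain independence",
    "what is name india independence year year",
    "india independence",
]
PROBS = train_reformulator(CANDIDATES, toy_qa_reward)
```

After training, PROBS puts most of its mass on the second candidate, the only one the toy QA system rewards.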

Rohin's opinion: I like how this demonstrates the faults of our current QA systems -- for example, instead of understanding the semantic content of a question, they instead focus on terms that are repeated multiple times. In fact, this might be a great way to tell whether our systems are "actually understanding" the question (as opposed to, say, learning a heuristic of searching for sentences with similar words and taking the last noun phrase of that sentence and returning it as the answer). For a good QA system, one would hope that the optimal question reformulation is just to ask the same question again. However, this won't work exactly as stated, since the RL system could learn the answers itself, which could allow it to "reformulate" the question such that the answer is obvious, for example reformulating "In what year did India gain independence?" to "What is 1946 + 1?" Unless the QA system is perfectly optimal, there will be some questions where the RL system could memorize the answer this way to improve performance.

Learning Acrobatics by Watching YouTube (Xue Bin (Jason) Peng et al): To imitate human behavior in videos, it is sufficient to estimate the human pose for each frame, to smooth the poses across frames to eliminate any jittery artifacts or mistakes made by the pose estimator, and then to train the robot to match the motion exactly. This results in really good performance that looks significantly better than corresponding deep RL approaches, but of course it relies on having labeled poses to train the pose estimator in addition to the simulator.
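
To illustrate the smoothing step only, here is a centered moving average over per-frame pose vectors. This is a deliberately simple stand-in chosen for this sketch, not the method used in the paper, and the function name and window size are assumptions.

```python
def smooth_poses(frames, window=3):
    """Centered moving average over per-frame pose vectors (lists of joint angles).

    Frames near the start or end are averaged over the part of the window
    that exists.
    """
    half = window // 2
    smoothed = []
    for i in range(len(frames)):
        lo, hi = max(0, i - half), min(len(frames), i + half + 1)
        chunk = frames[lo:hi]
        smoothed.append([sum(joint) / len(chunk) for joint in zip(*chunk)])
    return smoothed


# One joint, with a single-frame spike such as a pose estimator glitch.
FRAMES = [[0.0], [0.0], [9.0], [0.0], [0.0]]
SMOOTHED = smooth_poses(FRAMES)
```

The single-frame spike is spread across its neighbours, which is the kind of jitter removal the summary refers to.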

Rohin's opinion: It's quite remarkable how some supervision (poses in this case) can lead to such large improvements in the task. Of course, the promise of deep RL is to accomplish tasks with very little supervision (just a reward function), so this isn't a huge breakthrough, but it's still better than I expected. Intuitively, this works so well because the "reward" during the imitation phase is extremely dense -- the reference motion provides feedback after each action, so you don't have to solve the credit assignment problem.

Reinforcement Learning for Improving Agent Design (David Ha): This paper explores what happens when you allow an RL agent to modify aspects of the environment; in this case, the agent's body. This allows you to learn asymmetric body designs that are better suited for the task at hand. There's another fun example of specification gaming -- the agent makes its legs so long that it simply falls forward to reach the goal.

Meta learning

CAML: Fast Context Adaptation via Meta-Learning (Luisa M Zintgraf et al)

Unsupervised learning

Unsupervised Learning via Meta-Learning (Kyle Hsu et al) (summarized by Richard): This paper trains a meta-learner on tasks which were generated using unsupervised learning. This is done by first learning an (unsupervised) embedding for a dataset, then clustering in that embedding space using k-means. Clustering is done many times with random scaling on each dimension; each meta-learning task is then based on one set of clusters. The resulting meta-learner is then evaluated on the actual task for that dataset, performing better than approaches based just on embeddings, and sometimes getting fairly close to the supervised-learning equivalent.
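
The task-generation step can be sketched as follows, assuming the embeddings have already been learned. The scaling range, the plain k-means implementation and the toy embeddings are choices made for this sketch rather than details from the paper.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def make_tasks(embeddings, k, num_tasks, seed=0):
    """Each task is a pseudo-labelling from k-means on a randomly rescaled embedding."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    tasks = []
    for _ in range(num_tasks):
        scale = [rng.uniform(0.1, 1.0) for _ in range(dim)]  # assumed range
        scaled = [tuple(s * x for s, x in zip(scale, e)) for e in embeddings]
        tasks.append(kmeans(scaled, k, rng))
    return tasks


EMB = [(0.0, 0.0), (0.1, 5.0), (5.0, 0.2), (5.1, 5.1), (0.2, 0.1), (4.9, 4.8)]
TASKS = make_tasks(EMB, k=2, num_tasks=3)
```

Because each rescaling emphasises different dimensions, different tasks can group the same points differently; the meta-learner is then trained to classify according to these pseudo-labels.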

Richard's opinion: This is a cool technique; I like the combination of two approaches (meta-learning and unsupervised learning) aimed at making deep learning applicable to many more real-world datasets. I can imagine promising follow-ups - e.g. randomly scaling embedding dimensions to get different clusters seems a bit hacky to me, so I wonder if there's a better approach (maybe learning many different embeddings?). It's interesting to note that their test-time performance is sometimes better than their training performance, presumably because some of the unsupervised training clusterings are "nonsensical", so there is room to improve here.


Learning Scheduling Algorithms for Data Processing Clusters (Hongzi Mao et al)

Miscellaneous (AI)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Jacob Devlin et al)

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation (Perttu Hämäläinen et al)


Internships and fellowships for 2019: There are a lot of AI internships and fellowships to apply for now, including the CHAI summer internship (focused on safety in particular), the OpenAI Fellows, Interns and Scholars programs, the Google AI Residency Program (highlights), the Facebook AI Research Residency Program, the Microsoft AI Residency Program, and the Uber AI Residency.

The AAAI's Workshop on Artificial Intelligence Safety