Alignment Newsletter #29

Rohin Shah

Highlights

Deep Imitative Models for Flexible Inference, Planning, and Control (Nicholas Rhinehart et al): It's hard to apply deep RL techniques to autonomous driving, because we can't simply collect a large amount of experience with collisions in order to learn. However, imitation learning is also hard, because as soon as your car deviates from the expert trajectories that you are imitating, you are out of distribution, and you could make more mistakes, leading to accumulating errors until you crash. Instead, we can model the expert's behavior, so that we can tell when we are moving out of distribution, and take corrective action.

They split up the problem into three different stages. First, they generate a set of waypoints along the path to be followed, which are about 20m away from each other, by using A* search on a map. Next, they use model-based planning using an imitative model to generate a plan (sequence of states) that would take the car to the next waypoint. Finally, they use a simple PID controller to choose low-level actions that keep the car on target towards the next state in the plan.

The key technical contribution is the imitative model, which is a probabilistic model P(s_{1:T}, G, φ), where φ is the current observation (e.g. LIDAR), s_{1:T} is the planned trajectory, and G is a goal. We can learn P(s_{1:T} | φ) from expert demonstrations. The goal G can be anything for which you can write down a specification P(G | s_{1:T}, φ). For example, if you simply want to reach a waypoint, you can use the normal distribution on the distance between the final state s_T and the waypoint. You can also incorporate a hand-designed cost on each state.

They evaluate in simulation on a static world (so no pedestrians, for example). They show decent transfer from one map to a second map, and also that they can avoid artificially introduced potholes at test time (despite not seeing them at training time), simply by adding a cost on states over a pothole (which they can take into account because they are performing model-based planning).
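To make the decomposition concrete, here is a toy sketch of scoring a plan by log P(s_{1:T} | φ) + log P(G | s_{1:T}, φ) minus a hand-designed cost. It is an illustration under loud assumptions, not the paper's implementation: the imitation prior is a hypothetical stand-in (an independent Gaussian around the lane centre) rather than a learned deep model, the planner picks from a fixed candidate set rather than optimizing over trajectories, and every name and constant is made up.

```python
import math

def log_normal(x, mean, std):
    """Log density of a 1-D normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_imitation_prior(traj, std=1.0):
    """Stand-in for log P(s_{1:T} | phi): 'experts' stay near the lane centre y = 0.
    (Hypothetical; the paper learns this density from demonstrations.)"""
    return sum(log_normal(y, 0.0, std) for _, y in traj)

def log_goal_likelihood(traj, waypoint, std=2.0):
    """log P(G | s_{1:T}, phi): normal on the distance from the final state to the waypoint."""
    (x, y), (wx, wy) = traj[-1], waypoint
    return log_normal(math.hypot(x - wx, y - wy), 0.0, std)

def pothole_cost(traj, potholes, radius=1.0, penalty=100.0):
    """Hand-designed per-state cost, introduced only at test time."""
    return sum(penalty for s in traj for p in potholes
               if math.hypot(s[0] - p[0], s[1] - p[1]) < radius)

def score(traj, waypoint, potholes=()):
    return (log_imitation_prior(traj) + log_goal_likelihood(traj, waypoint)
            - pothole_cost(traj, potholes))

def plan(candidates, waypoint, potholes=()):
    """Pick the highest-scoring candidate trajectory."""
    return max(candidates, key=lambda t: score(t, waypoint, potholes))

straight = [(float(x), 0.0) for x in range(0, 21, 5)]
swerve = [(0.0, 0.0), (5.0, 0.5), (10.0, 1.5), (15.0, 0.5), (20.0, 0.0)]
waypoint = (20.0, 0.0)

no_pothole_plan = plan([straight, swerve], waypoint)
pothole_plan = plan([straight, swerve], waypoint, potholes=[(10.0, 0.0)])
```

With no pothole the straight, more expert-like trajectory wins; adding a pothole at (10, 0) changes only the cost term, and the swerving trajectory is chosen without touching the "learned" prior.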

Rohin's opinion: I really like this paper: it showcases the benefits of both model-based planning and imitation learning. Since the problem has been decomposed into a predictive model, a goal G, and a planner, we can edit G directly to get new behavior at test time without any retraining (as they demonstrate with the pothole experiment). At the same time, they can get away with not specifying a full reward function, as many features of good driving, like passenger comfort and staying in the correct lane, are learned simply by imitating an expert.

That said, they initially state that one of their goals is to learn from offline data, even though offline data typically has no examples of crashes, and "A model ignorant to the possibility of a crash cannot know how to prevent it". I think the idea is that you never get into a situation where you could get in a crash, because you never deviate from expert behavior since that would have low P(s_{1:T} | φ). This is better than model-based planning on offline data, which would consider actions that lead to a crash and have no idea what would happen, outputting garbage. However, it still seems that situations could arise where a crash is imminent, which don't arise much (if at all) in the training data, and the car fails to swerve or brake hard, because it hasn't seen enough data.

Interpretability and Post-Rationalization (Vincent Vanhoucke): Neuroscience suggests that most explanations that we humans give for a decision are post-hoc rationalizations, and don't reflect the messy underlying true reasons for the decision. It turns out that decision making, perception, and all the other tasks we're hoping to outsource to neural nets are inherently complex and difficult, and are not amenable to easy explanation. We can aim for "from-without" explanations, which post-hoc rationalize the decisions a neural net makes, but "from-within" explanations, which aim for a mechanistic understanding, are intractable. We could try to design models that are more interpretable (in the "from-within" sense), but this would lead to worse performance on the actual task, which would hurt everyone, including the people calling for more accountability.

Rohin's opinion: I take a pretty different view from this post -- I've highlighted it because I think this is an important disagreement that's relevant for alignment. In particular, it's not clear to me that "from-within" interpretability is doomed -- while I agree that humans basically only do "from-without" rationalizations, we also aren't able to inspect a human brain in the same way that we can inspect a neural net. For example, we can't see the output of each individual neuron, we can't tell what input each neuron would respond maximally to, and we can't pose counterfactuals with slightly different inputs to see what changes. In fact, I think that "from-within" interpretability techniques, such as Building Blocks of Interpretability, have already seen successes in identifying biases that image classifiers suffer from and that we wouldn't have known about otherwise.

We could also consider whether post-hoc rationalization is sufficient for alignment. Consider a thought experiment where a superintelligent AI is about to take a treacherous turn, but there is an explainer AI system that post-hoc rationalizes the output of the AI that could warn us in advance. If the explainer AI only gets access to the output of the superintelligent AI, I'm very worried -- it seems way too easy to come up with some arbitrary rationalization for an action that makes it seem good, so you'd have to have a much more powerful explainer AI to have a hope. On the other hand, if the explainer AI gets access to all of the weights and activations that led to the output, it seems more likely that this could work -- as an analogy, I think a teenager could tell if I was going to betray them, if they could constantly eavesdrop on my thoughts.

Technical AI alignment

Learning human intent

Deep Imitative Models for Flexible Inference, Planning, and Control (Nicholas Rhinehart et al): Summarized in the highlights!

Addressing Sample Inefficiency and Reward Bias in Inverse Reinforcement Learning (Ilya Kostrikov et al): Deep IRL algorithms typically work by training a discriminator that distinguishes states and actions from the expert from those of the learned policy, and extracting a reward function from the discriminator. In any environment where the episode can end after a variable number of timesteps, this assumes that the reward is zero after the episode ends. The reward function from the discriminator often takes a form where it must always be positive, inducing a survival incentive, or a form where it must always be negative, inducing a living cost. For example, GAIL's reward is always positive, giving a survival incentive. As a result, without any reward learning at all, GAIL does better on Hopper than behavioral cloning, and fails to learn on a reaching or pushing task (where you want to do the task as quickly as possible, so you want the living cost). To solve this, they learn an "absorbing state reward", which is a reward given after the episode ends -- this allows the algorithm to learn for itself whether it should have a survival incentive or living cost.
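A small worked example of the bias may help. The sketch below is not the paper's algorithm; it only shows, with made-up reward sequences, that adding a constant to every reward flips which episode looks better when episodes have different lengths and the post-termination reward is implicitly zero, and that the ordering is restored once the absorbing state is given an explicit reward that is shifted too.

```python
def episode_return(rewards, gamma=1.0, absorbing_reward=0.0, horizon=None):
    """Discounted return. After termination the agent sits in an absorbing
    state and receives absorbing_reward each step until the horizon."""
    horizon = len(rewards) if horizon is None else horizon
    padded = list(rewards) + [absorbing_reward] * (horizon - len(rewards))
    return sum(r * gamma ** t for t, r in enumerate(padded))

def shift(rewards, c):
    """Add the constant c to every per-step reward."""
    return [r + c for r in rewards]

fast = [1.0, 1.0]      # finishes the task in 2 steps (hypothetical)
slow = [0.0] * 10      # wanders for 10 steps, achieving nothing

# True reward: the fast episode is better (2.0 vs 0.0).
true_gap = episode_return(fast) - episode_return(slow)

# Always-positive reward (+1 shift) with an implicit zero after termination:
# the slow episode now looks better (4.0 vs 10.0), i.e. a survival incentive.
biased_gap = episode_return(shift(fast, 1.0)) - episode_return(shift(slow, 1.0))

# Give the absorbing state the same shifted reward and compare over a
# common horizon: the original gap of 2.0 is recovered (12.0 vs 10.0).
fixed_gap = (episode_return(shift(fast, 1.0), absorbing_reward=1.0, horizon=10)
             - episode_return(shift(slow, 1.0), absorbing_reward=1.0, horizon=10))
```

A negative shift produces the mirror-image effect: long, genuinely good episodes are penalized relative to early termination, which is the living cost described above.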

They also introduce a version that keeps a replay buffer of experience and uses an off-policy algorithm to learn from the replay buffer in order to improve sample efficiency.

Rohin's opinion: The key insight that rewards are not invariant to additions of a constant when you have variable-length episodes is useful and I'm glad that it's been pointed out, and a solution proposed. However, the experiments are really strange -- in one case (Figure 4, HalfCheetah) their algorithm outperforms the expert (which has access to the true reward), and in another (Figure 5, right) the blue line implies that using a uniformly zero reward lets you achieve around a third of expert performance (!!).

Interpretability

Interpretability and Post-Rationalization (Vincent Vanhoucke): Summarized in the highlights!

Sanity Checks for Saliency Maps (Julius Adebayo et al)

Adversarial examples

Spatially Transformed Adversarial Examples (Chaowei Xiao et al) (summarized by Dan H): Many adversarial attacks perturb pixel values, but the attack in this paper perturbs the pixel locations instead. This is accomplished with a smooth image deformation which has subtle effects for large images. For MNIST images, however, the attack is more obvious and not necessarily content-preserving (see Figure 2 of the paper).

Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation (Chaowei Xiao et al) (summarized by Dan H): This paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentations are more inconsistent. They defend against one adaptive attack.

Uncertainty

On Calibration of Modern Neural Networks (Chuan Guo et al.) (summarized by Dan H): Models should not be unduly confident, especially when said confidence is used for decision making or downstream tasks. This work provides a simple method to make models more calibrated so that the confidence estimates are closer to the true correctness likelihood. (For example, if a calibrated model predicts “toucan” with 60% confidence, then 60% of the time the input was actually a toucan.) Before presenting their method, they observe that batch normalization can make models less calibrated, while unusually large weight decay regularization can increase calibration. However, their proposed approach to increase calibration does not impact accuracy or require substantive model changes. They simply adjust the temperature of the softmax to make the model’s “confidence” (here the maximum softmax probability) more calibrated. Specifically, after training they tune the softmax temperature to minimize the cross entropy (negative average log-likelihood) on validation data. They then measure model calibration with a measure which is related to the Brier score, but with absolute values rather than squares.
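The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the validation set is a made-up, deliberately overconfident toy, the temperature is found by a simple grid search (the grid itself is an assumption), and the calibration measure is the equal-width-bin estimate of Expected Calibration Error.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood (cross entropy) at a given temperature."""
    return -sum(math.log(softmax(row, temperature)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Grid search for the temperature that minimises validation NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]   # 0.25, 0.5, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def expected_calibration_error(logit_rows, labels, temperature=1.0, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        correct = probs.index(conf) == y
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    n = len(labels)
    return sum(len(b) / n * abs(sum(c for _, c in b) / len(b)
                                - sum(p for p, _ in b) / len(b))
               for b in bins if b)

# Toy validation set: the model is ~98% confident in class 0 every time,
# but class 0 is correct only 70% of the time.
val_logits = [[4.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3

T = fit_temperature(val_logits, val_labels)
ece_before = expected_calibration_error(val_logits, val_labels)
ece_after = expected_calibration_error(val_logits, val_labels, temperature=T)
```

Dividing all logits by the same temperature never changes the argmax, which is why accuracy is unaffected; only the confidence moves toward the observed 70% accuracy.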

Dan H's opinion: Previous calibration work in machine learning conferences would often focus on calibrating regression models, but this work has renewed interest in calibrating classifiers. For that reason I view this paper highly. That said, this paper’s evaluation measure, the “Expected Calibration Error”, is not a proper scoring rule, so optimizing this does not necessarily lead to calibration. In their approximation of the ECE, they use equally-wide bins when there is reason to use adaptively sized bins. Consequently I think Nguyen and O’Connor Sections 2 and 3 provide a better calibration explanation, better calibration measure, and better estimation procedure. They also suggest using a convex optimization library to find the softmax temperature, but at least libraries such as CVXPY require far more time and memory than a simple softmax temperature grid search. Finally, an understandable limitation of this work is that it assumes test-time inputs are in-distribution, but when inputs are out-of-distribution this method hardly improves calibration.

Miscellaneous (Alignment)

AI Alignment Podcast: On Becoming a Moral Realist with Peter Singer (Peter Singer and Lucas Perry): There's a fair amount of complexity in this podcast, and I'm not an expert on moral philosophy, but here's an oversimplified summary anyway. First, in the same way that we can reach mathematical truths through reason, we can also arrive at moral truths through reason, which suggests that they are true facts about the universe (a moral realist view). Second, preference utilitarianism has the problem of figuring out which preferences you want to respect, which isn't a problem with hedonic utilitarianism. Before and after the interview, Lucas argues that moral philosophy is important for AI alignment. Any strategic research "smuggles" in some values, and many technical safety problems, such as preference aggregation, would benefit from a knowledge of moral philosophy. Most importantly, given our current lack of consensus on moral philosophy, we should be very wary of locking in our values when we build powerful AI.

Rohin's opinion: I'm not convinced that we should be thinking a lot more about moral philosophy. While I agree that locking in a set of values would likely be quite bad, I think this means that researchers should not hardcode a set of values, or create an AI that infers some values and then can never change them. It's not clear to me why studying more moral philosophy helps us with this goal. For the other points, it seems not too important to get preference aggregation or particular strategic approaches exactly perfect as long as we don't lock in values -- as an analogy, we typically don't argue that politicians should be experts on moral philosophy, even though they aggregate preferences and have large impacts on society.

Near-term concerns

Fairness and bias

A new course to teach people about fairness in machine learning (Sanders Kleinfeld): Google has added a short section on fairness to their Machine Learning Crash Course (MLCC).

Privacy and security

Secure Deep Learning Engineering: A Software Quality Assurance Perspective (Lei Ma et al)

Other progress in AI

Reinforcement learning

Open sourcing TRFL: a library of reinforcement learning building blocks (Matteo Hessel et al) (summarized by Richard): DeepMind is open-sourcing a Tensorflow library of "key algorithmic components" used in their RL agents. They hope that this will allow less buggy RL code.

Richard's opinion: This continues the trend of being able to easily implement deep learning at higher and higher levels of abstraction. I'm looking forward to using it.

CURIOUS: Intrinsically Motivated Multi-Task, Multi-Goal Reinforcement Learning (Cédric Colas et al) (summarized by Richard): This paper presents an intrinsically-motivated algorithm (an extension of Universal Value Function Approximators) which learns to complete multiple tasks, each parameterised by multiple “goals” (e.g. the locations of targets). It prioritises replays of tasks which are neither too easy nor too hard, but instead allow maximal learning progress; this also helps prevent catastrophic forgetting by refocusing on tasks which it begins to forget.
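One way to picture prioritisation by learning progress is the sketch below. It is an assumption-laden illustration rather than the paper's exact scheme: learning progress is taken to be the absolute change in success rate between two recent windows, tasks are sampled in proportion to it with some uniform exploration mixed in, and the window size, epsilon, and task histories are all made up.

```python
def learning_progress(successes, window=10):
    """Absolute change in success rate between the two most recent windows."""
    if len(successes) < 2 * window:
        return 0.0
    recent = sum(successes[-window:]) / window
    older = sum(successes[-2 * window:-window]) / window
    return abs(recent - older)

def task_probabilities(history, epsilon=0.2, window=10):
    """Mix uniform exploration with sampling proportional to learning progress."""
    lps = {task: learning_progress(h, window) for task, h in history.items()}
    total = sum(lps.values())
    n = len(history)
    return {task: epsilon / n + (1 - epsilon) * (lp / total if total > 0 else 1.0 / n)
            for task, lp in lps.items()}

# Hypothetical per-task success histories (1 = success, 0 = failure).
history = {
    "mastered":   [1] * 20,                                    # flat at 1.0
    "too_hard":   [0] * 20,                                    # flat at 0.0
    "improving":  [0] * 10 + [1, 0, 1, 1, 0, 1, 1, 1, 1, 1],   # rising
    "forgetting": [1] * 10 + [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],   # falling
}
probs = task_probabilities(history)
```

Using the absolute value is what makes a task that is being forgotten attractive again: a falling success rate counts as progress to be recovered, just as a rising one counts as progress being made.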

Richard's opinion: While I don’t think this paper is particularly novel, it usefully combines several ideas and provides easily-interpretable results.

Deep learning

Discriminator Rejection Sampling (Samaneh Azadi et al): Under simplifying assumptions, GAN training should converge to the generator modelling the true data distribution while the discriminator always outputs 0.5. In practice, at the end of training the discriminator can still distinguish between images from the generator and images from the dataset. This suggests that we can improve the generated images by only choosing the ones that the discriminator thinks are from the dataset. However, if we use a threshold (rejecting all images where the discriminator is at least X% sure it comes from the generator), then we no longer model the true underlying distribution, since some low probability images could never be generated. They instead propose a rejection sampling algorithm that still recovers the data distribution under strict assumptions, and then relax those assumptions to get a practical algorithm, and show that it improves performance.
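The idealised case can be checked on a toy discrete distribution. The sketch below assumes a perfect discriminator D*(x) = p_data(x) / (p_data(x) + p_gen(x)) and a known maximum density ratio, which are exactly the strict assumptions that the practical algorithm has to relax; the distributions and function names are hypothetical.

```python
import random
from collections import Counter

p_data = {"a": 0.5, "b": 0.3, "c": 0.2}   # true distribution (toy)
p_gen = {"a": 0.2, "b": 0.3, "c": 0.5}    # imperfect generator (toy)

def optimal_discriminator(x):
    """D*(x) = p_data / (p_data + p_gen), the ideal discriminator for a fixed generator."""
    return p_data[x] / (p_data[x] + p_gen[x])

def density_ratio(x):
    """Recover p_data(x) / p_gen(x) from the discriminator output alone."""
    d = optimal_discriminator(x)
    return d / (1.0 - d)

def rejection_sample(n, rng):
    """Draw from the generator and accept with probability ratio / max_ratio."""
    max_ratio = max(density_ratio(x) for x in p_gen)   # known exactly here; estimated in practice
    items, weights = zip(*p_gen.items())
    accepted = []
    while len(accepted) < n:
        x = rng.choices(items, weights=weights)[0]
        if rng.random() < density_ratio(x) / max_ratio:
            accepted.append(x)
    return accepted

samples = rejection_sample(20000, random.Random(0))
freqs = {k: v / len(samples) for k, v in Counter(samples).items()}
```

Unlike a hard threshold, no sample is ruled out entirely: items the discriminator dislikes are merely accepted less often, so the accepted frequencies approach p_data rather than a truncated version of it.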

Meta learning

Meta-Learning: A Survey (Joaquin Vanschoren) (summarized by Richard): This taxonomy of meta-learning classifies approaches by the main type of meta-data they learn from:

1. Evaluations of other models on related tasks

2. Characterisations of the tasks at hand (and a similarity metric between them)

3. The structures and parameters of related models

Vanschoren explores a number of different approaches in each category.

Critiques (AI)

The 30-Year Cycle In The AI Debate (Jean-Marie Chauvet)

News

Introducing Stanford's Human-Centered AI Initiative (Fei-Fei Li et al): Stanford will house the Human-centered AI Initiative (HAI), which will take a multidisciplinary approach to understand how to develop and deploy AI so that it is robustly beneficial to humanity.

Rohin's opinion: It's always hard to tell from these announcements what exactly the initiative will do, but it seems to be focused on making sure that AI does not make humans obsolete. Instead, AI should allow us to focus more on the creative, emotional work that we are better at. Given this, it's probably not going to focus on AI alignment, unlike the similarly named Center for Human-Compatible AI (CHAI) at Berkeley. My main question for the author would be what she would do if we could develop AI systems that could replace all human labor (including creative and emotional work). Should we not develop such AI systems? Is it never going to happen?