Alignment Newsletter #24

Rohin Shah

Starting from this week, Richard Ngo will join me in writing summaries. His summaries are marked as such; I'm reviewing some of them now but expect to review less over time.

Highlights

Introducing the Unrestricted Adversarial Examples Challenge (Tom B. Brown et al): There's a new adversarial examples contest, after the one from NIPS 2017. The goal of this contest is to figure out how to create a model that never confidently makes a mistake on a very simple task, even in the presence of a powerful adversary. This leads to many differences from the previous contest. The task is a lot simpler -- classifiers only need to distinguish between bicycles and birds, with an option of saying "ambiguous". Instead of using the L-infinity norm ball to define what an adversarial example is, attackers are allowed to supply any image whatsoever, as long as a team of human evaluators agrees unanimously on the classification of the image. The contest has no time bound, and will run until some defense survives for 90 days without being broken even once. A defense is not broken if it says "ambiguous" on an adversarial example. Any submitted defense will be published, which means that attackers can specialize their attacks to that specific model (i.e. it is white box).

My opinion: I really like this contest format, it seems like it's actually answering the question we care about, for a simple task. If I were designing a defense, the first thing I'd aim for would be to get a lot of training data, ideally from different distributions in the real world, but data augmentation techniques may also be necessary, especially for eg. images of a bicycle against an unrealistic textured background. The second thing would be to shrink the size of the model, to make it more likely that it generalizes better (in accordance with Occam's razor or the minimum description length principle). After that I'd think about the defenses proposed in the literature. I'm not sure how the verification-based approaches will work, since they are intrinsically tied to the L-infinity norm ball definition of adversarial examples, or something similar -- you can't include the human evaluators in your specification of what you want to verify.

The What-If Tool: Code-Free Probing of Machine Learning Models (James Wexler): When you train an ML model, it is often hard to understand what your model is doing and why. This post introduces the What-If tool, which allows you to ask counterfactual queries about the decision rule implemented by your final trained model, for classification and regression tasks. For example, you can take a particular data point, edit it slightly, and see how that changes the model prediction. Or you can graph the data points by L2 distance from a particular point. For classification tasks, you can find the "closest counterfactual", that is, the data point closest to the current point where the decision of the model is reversed. I played around with some of the demos, and apparently for a particular person and a particular model trained on census data, the probability that they had a salary of over $50k depended much more strongly on their marital status than their age, which was the opposite of my prediction. I figured this out by choosing a point, finding the closest counterfactual, and then making each of the changes in the delta individually and seeing which affected the model probability most.

My opinion: I'm guessing this is limited to tasks where your data points have a reasonable number of features (< 1000, I'd guess) and you are only analyzing a small set of test data points (around tens of thousands), due to computational constraints. That said, for those tasks, this seems incredibly useful to actually get a good model that you can debug and eventually deploy.

It's worth noting that this is an engineering achievement. Researchers are considering even stronger (but more computationally difficult) techniques, such as finding which part of the training set most influenced a particular decision, whereas the What-If tool doesn't talk about the training set and training process at all, instead only allowing you to ask queries about the final trained model.

Preserving Outputs Precisely while Adaptively Rescaling Targets (Matteo Hessel et al): When an agent is trained on multiple tasks across which the sizes of rewards vary greatly, it usually focuses on tasks which provide the largest or most frequent rewards at the expense of performance on others. Previous work dealt with this by clipping rewards outside a certain range, but this changes the optimal policy (eg. in Pacman, eating pellets is just as rewarding as eating ghosts). This paper uses PopArt (introduced in this 2016 paper) to normalise rewards from each task before using them to update the policy in an actor-critic RL algorithm. The authors use PopArt to train a single IMPALA agent which can play all 57 Atari games, achieving a median performance slightly higher than human performance.

To delve into more detail about PopArt, let's consider training a policy with an actor-critic algorithm. In this case, we need a critic that produces estimates of values V, and an actor that produces probabilities of actions. Both of these networks are trained by taking gradients of their outputs, and weighting them based on the observed rewards. Now, a key empirical fact about deep learning is that it works better if all the things are normalized, especially the gradients. (If nothing else, it makes it easier to choose the learning rate.) For the actor, this is easy -- probabilities are already normalized, and the weight of gradient is proportional to the reward, so we can just rescale the weight of the gradient based on the mean and standard deviation of the rewards we have observed so far. This is a bit harder for the critic, since it has to predict values, so we have to normalize both the outputs and the gradient weights. We can normalize the gradient weights in the same way as before. However, normalizing the outputs is tricky, because as time goes on the means and standard deviations change. To do this, at every timestep we modify the weights of the critic that is equivalent to unnormalizing based on the old statistics and then normalizing based on the new statistics. This gives the PopArt method.

Here's a simple example where I butcher types, ignore the difference between states and trajectories, and throw away the standard deviation. Suppose the first reward we see is 10, so we say that our mean is 10 and train our net to output a normalized reward of 0 for this state and action. Then, we see a reward of 100, so we update our mean to 55. On our previous (state, action) pair, we still output a normalized reward of 0, which now corresponds to a real reward of 55, even though it should correspond to 10! We then do the unnormalize-and-renormalize trick. After unnormalization, the critic would output 10, and after renormalization, the network would output -45, which when combined with the mean of 55 would give us the desired 10 reward.

My opinion: This is an impressive result, since it's the first time a single agent has performed so well on a range of Atari games. It doesn't seem to have required any novel techniques except for a straightforward extension of PopArt to the multi-task setting, but this is still interesting since the results from the previous PopArt paper were very mixed (performance increased and decreased dramatically on different games, with the average remaining roughly stable).

One confusing aspect was that PopArt still benefitted slightly from being trained with reward clipping (110% vs. 101% in the unclipped case), even though the point of PopArt was to normalize rewards so that clipping wasn't necessary. I'm assuming the clipping happens after PopArt normalization, since if it happens before then you lose information as in the Pacman example. In this case, maybe it's that the reward distribution is fat-tailed, and so even after normalization there could be some extreme rewards that after normalization are still large enough that they would cause updates that are too large, and clipping alleviates this problem.

ML Writing Month May 2018 (Cody Wild): The author wrote up a summary of an ML paper every day in May, which have all been collected in this doc.

My opinion: These summaries seem really good to me (probably higher quality than a typical summary that I write), but are often on topics I'm not an expert in (eg. GANs) so it's hard for me to evaluate. The one paper I knew well (Inverse Reward Design) had a good summary.

Technical AI alignment

Technical agendas and prioritization

Comment on decision theory (Rob Bensinger): MIRI works on Agent Foundations because AGIs should be good reasoners, and we're currently confused about how to have good reasoning, and work on logical uncertainty and decision theory should help us resolve this confusion. If we don't resolve the confusion, it will be significantly harder to build AGI in a way that is clean, understandable and interpretable, and as a result it will be harder to understand what is happening and to fix it if anything goes wrong. This is analogous to how it seems really useful to understand Newton's law of gravitation before you try to build rockets, even though the work of figuring out Newton's law is very different from the rocket-building work.

My opinion: My basic opinion is that this makes sense and agrees with my model. On the other hand, I'm not planning to switch to working on decision theory now, so perhaps I should say why. Partly it's that I have a comparative advantage at ML work, but it's also an impression that Agent Foundations will not help much with the first powerful AI systems we build. On one axis, I wouldn't be surprised if the first powerful AI systems don't look like the good reasoners that MIRI studies, and so Agent Foundations research won't apply. On another axis, Agent Foundations seems like a hard problem that we may not solve before powerful AI systems are created. I do find it plausible that to build aligned AI systems that are much more powerful than humans, we must understand it at the level of Agent Foundations understanding. (Though I also find the opposite statement plausible.) However, I think we will first build powerful AI systems that are not that much more powerful than humans, and that direct alignment of ML techniques will be sufficient to make that safe (even though they do pose an x-risk). (I suspect this is where my main disagreement with people at MIRI is.) We can then use those systems to help us solve Agent Foundations before we scale up.

Iterated distillation and amplification

Disagreement with Paul: alignment induction (Stuart Armstrong): The amplification step in iterated distillation and amplification (IDA) requires an inductive argument saying that if the agent at level n is aligned, then so is the one at level n+1. However, in order to get the induction to work, you need to say not just that agent at level n won't take unaligned actions, but that it will also "assess the decisions of a higher agent in a way that preserves alignment (and preserves the preservation of alignment, and so on)". This seems like a much less intuitive criterion, and so getting the base case of an agent that a human can verify has this property may be too hard -- it probably has to have all of the friendly utility function in the base case itself, or perhaps it gets it after one or two iterations (if it needs to solve a few problems along the way).

My opinion: If I imagine each level A[n] as maximizing the expected value of some simple utility function, I agree that it would be surprising if the result was not one of your first three cases. Intuitively, either we already have all of the friendly utility function, and we didn't need induction, or we didn't and bad things happen, which corresponds to cases 1 and 3.

But it seems like one of the main points of iterated amplification is that at least the initial levels need not be maximizing the expected value of some simple utility. In that case, there seems to be a much wider space of possible designs.

For example, we could have a system that has the epistemic state of wanting to help humans but knowing that it doesn't know how best to do that, and so asking humans for feedback and deferring to them when appropriate. Such a system with amplification might eventually learn the friendly utility function and start maximizing that, but it seems like there could be many iterations before that point, during which it is corrigible in the sense of deferring to humans and not maximizing its current conception of what is best.

I don't have a strong sense at the moment what would happen, but it seems plausible that the induction will go through and will have "actually mattered".

Learning human intent

Active Inverse Reward Design (Sören Mindermann et al): (Note: I am an author on this paper.) Inverse Reward Design (IRD) introduced the idea that a hand-designed reward function should be treated as an observation about the intended reward function. Any particular hand-designed reward function (called a proxy reward) is likely if it incentivizes good behavior in the training environment, as measured by the true reward function. In this paper, instead of asking the reward designer to choose a proxy reward from the space of all possible reward functions, the designer is presented with a small subset of possible reward functions and asked to choose the best option from those. The subset is chosen such that the answer will be maximally informative about the true reward (in expectation). This is an easier query for the reward designer to answer, and can convey more information about the true reward function. To see this, we can imagine pairing each possible reward function with the trajectory that it incentivizes in the training environment. Then original IRD gets information about the best such trajectory, which could correspond to multiple true reward functions, whereas active IRD can get information about the best trajectory in any subset of trajectories, and so can get more information in total. In some cases, that extra information can narrow down the space of possible reward functions to a single true reward function. The paper discusses two kinds of queries that can be asked (discrete and continuous), and several optimizations for actually computing the best query to ask (which can be computationally intensive). The technique is demonstrated on a contextual bandits problem and gridworlds.

Prerequisities: Inverse Reward Design

Expert-augmented actor-critic for ViZDoom and Montezumas Revenge (Michał Garmulewicz et al) (summarized by Richard): The authors augment ACKTR (a natural gradient RL algorithm) with an additional term in the loss function which depends on expert data. In particular, policies which choose different actions from samples of 14 expert trajectories are penalised, with a coefficient that depends on the expert's advantage over the critic's current expectations. This allows the agent to perform well on Montezuma's Revenge and a ViZDoom maze, sometimes beating the experts it was trained on. It also discovered a new bug in Montezuma's Revenge which increases its score by a factor of 40.

My opinion: I'm not convinced that this paper's method of utilising expert data is an improvement on other approaches, such as this paper in which an agent learns to play Montezuma's revenge from watching a Youtube video. However, it does seem to learn faster than most others, probably due to using ACKTR. I'd also expect it to be overfitting to the expert trajectories, but can't determine the extent to which this is the case (the authors claim that their agent can continue gameplay into the second world of Montezuma's Revenge despite only having expert trajectories for the first world, but don't provide metrics of success in the second world).

Addressing Sample Inefficiency and Reward Bias in Inverse Reinforcement Learning (Ilya Kostrikov et al)

Interpretability

The What-If Tool: Code-Free Probing of Machine Learning Models (James Wexler): Summarized in the highlights!

Verification

Training for Faster Adversarial Robustness Verification via Inducing ReLU Stability (Kai Y. Xiao et al)

Miscellaneous (Alignment)

(A -> B) -> A (Scott Garrabrant): This blog post has some thoughts on the type (A -> B) -> A, which can be thought of as the type of an agent. Rather than summarize the post, which seems hard, I'm going to say things about this type inspired by the post, and then you can decide whether to read the post.

My opinion: Intuitively, this type says that given something that maps A to B, you can get an A. If you think of (A -> B) as an environment where A is the action and B is the effect, then this type is a function that says which actions you should take. Note that the goal, which would be an element of B, is not passed in explicitly, so the goal must be inside of the function of type (A -> B) -> A, similar to our typical views of what an "agent" is. If you only assume that you know what A and B are, but you know nothing else, to do anything interesting you would be doing black box optimization -- that is, you get some f of type (A -> B) that you know nothing about, and so you just keep computing f(a) for different a ∈ A, looking for a particular b ∈ B. Perhaps you build a model of the function f and then do something more abstract with your model to find a good a ∈ A. (The post mentions a similar idea by noting that argmax has this type signature.) The post also has some thoughts about game theory that are interesting.

Petrov corrigibility (Stuart Armstrong): There can be situations where a human asks an AI for guidance, and the AI's action then determines what the human's preferences are. In such a situation, taking either action is incorrigible and so the AI should say "there is no corrigible action to take". But what if saying that would itself predictably change the human's decision? In that case, even that would not be corrigible.

There is also a comment chain discussing whether how this notion of corrigibility/alignment differs from the notion Paul Christiano talks about here.

My opinion: I've written enough opinions about this version of corrigibility, I think I'd just be repeating myself. You can look through Stuart's recent posts and find my comments there if you really want (eg. here).

Near-term concerns

Adversarial examples

Introducing the Unrestricted Adversarial Examples Challenge (Tom B. Brown et al): Summarized in the highlights!

Are adversarial examples inevitable? (Ali Shafahi et al)

Other progress in AI

Reinforcement learning

Preserving Outputs Precisely while Adaptively Rescaling Targets (Matteo Hessel et al): Summarized in the highlights!

Challenges of Context and Time in Reinforcement Learning: Introducing Space Fortress as a Benchmark (Akshat Agarwal et al)

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning (Tom Zahavy, Matan Haroush, Nadav Merlis et al)

Improving On-policy Learning with Statistical Reward Accumulation (Yubin Deng et al)

Keep it stupid simple (Erik J Peterson et al)

ViZDoom Competitions: Playing Doom from Pixels (Marek Wydmuch et al)

Deep learning

Neural Guided Constraint Logic Programming for Program Synthesis (Lisa Zhang et al): In program synthesis from examples, we want to find a program consistent with a given set of input-output examples. One classic approach is to use logic programming. In logic programming, instead of writing functions that compute output = f(input), we write rules to compute relations. To encode standard functions, we would write the relation (f, i, o), which is interpreted as "computing f(i) gives o". In logic programming, you can let any variable be unknown, and the language will search for a solution. Using this you can eg. invert a function f on a specific output o, using the query (f, ?, o). To apply logic programming to program synthesis, we write an interpreter eval for the language we want to synthesize in, and pose the query (eval, ?, i, o). They consider the lambda calculus with pairs and lists as their language.

The algorithm that falls out is a recursive descent search over the possible structure of the program, that generates and checks partial constraints over the partial programs implied by the input-output examples during the search. The search has branching points where it must choose, for some as-yet-unknown part of the program, what language construct it should use (if, cons, variable, etc.) This paper attempts to use a neural net to predict what choice the search should make to find a solution, replacing some simple hand-tuned heuristics. It can be trained either using reinforcement learning (where the search choices are actions, the partial search trees are states, and the goal is to find a complete program), or through supervised learning since they know for training programs what choices are optimal. They also use a curriculum and experience replay. They evaluate against classical symbolic approaches (λ2, Escher, Myth) and RobustFill, and show that their method generalizes better to finding longer programs not seen in the training dataset.

My opinion: It always makes me happy to read a paper about making symbolic approaches faster using neural nets to learn heuristics. That said, I'm concerned about the evaluation in this paper -- their programs are fairly strange, often involving a huge mess of cons (make-pair), car (first) and cdr (second), and not including recursion. The symbolic approaches they evaluate against are aiming to synthesize recursive functions similar to what people write, and I wouldn't be surprised if they had heuristics that actively discouraged these big messes of cars and cdrs, since normal programs don't look like that. The programs are also primarily taking pieces of data out of an input, and then recombining them in some way -- this feels like a significantly easier task than most synthesis problems (in the sense that I could probably write a handcoded solution that performs very well on this domain only).

Miscellaneous (AI)

ML Writing Month May 2018 (Cody Wild): Summarized in the highlights!

[-]Rohin Shah7y20

Update: A reader suggested that in the open-source implementation of PopArt, the PopArt normalization happens after the reward clipping, counter to my assumption. I no longer understand why PopArt is helping, beyond "it's good for things to be normalized".

[-]Arthur Conmy4y00

What happened to the unrestricted adversarial examples challenge? The github [1] doesn't have an update since 2020, and that is only to the warmup challenge. Additionally, were there any takeaways from the contest?

[1] https://github.com/openphilanthropy/unrestricted-adversarial-examples