Alignment Newsletter #19

Rohin Shah

Highlights

OpenAI Five Benchmark: Results (OpenAI's Dota Team): The OpenAI Five benchmark happened last Sunday, where OpenAI Five won two matches against the human team, and lost the last one when their draft was adversarially selected. They are now planning to play at The International in a couple of weeks (dates to be finalized). That will be a harder challenge, since they will be playing against teams that play and train professionally, and so will be better at communication and coordination than the human team here.

Blitz (one of the human players) said: "The only noticeable difference in the mechanical skill aspect was the hex from the Lion, but even that was sorta irrelevant to the overall game flow. Got outdrafted and outmaneuvered pretty heavily, and from a strategy perspective it was just better then us. Even with the limitations in place it still 'felt' like a dota game, against a very good team. It made all the right plays I'd expect most top tier teams to make."

On the technical side, OpenAI implemented a brute-force draft system. With a pool of 18 heroes, you get some combinatorial explosion, but there are still only ~11 million possible matchups. You can then do a simple tree search over which hero to draft, where at the leaves (when you have a full draft) you choose which leaf you want based on the win probability (which OpenAI Five already outputs). Seeing this in action, it seems to me like it's a vanilla minimax algorithm, probably with alpha-beta pruning so that they don't have to evaluate all ~159 billion nodes in the tree. (Or they could have done the full search once, hardcoded the action it comes up with for the first decision, and run the full search for every subsequent action, which would have under 10 billion nodes in the tree.)

Besides the win probabilities, there are other ways to get insight into what the model is "thinking" -- for example, by asking the model to predict where the hero will be in 6 seconds, or by predicting how many last hits / denies / kills / deaths it will have.

The model that played the benchmark has been training since June 9th. Of course, in that time they've changed many things about the system (if for no other reason than to remove many of the restrictions in the original post). This is not a thing that you can easily do -- typically you would change your model architecture, which means your old parameters don't map over to the new architecture. I've been pretty curious about how they handle this, but unfortunately the blog post doesn't go into much detail, beyond saying that they can in fact handle these kinds of "surgery" issues.

They estimate that this particular model has used 190 petaflop/s-days of compute, putting it just below AlphaZero.

My opinion: I think this finally fell within my expectations, after two instances where I underestimated OpenAI Five. I expected that they would let the human team choose heroes in some limited way (~80%), that OpenAI Five would not be able to draft using just gradients via PPO (~60%), and (after having seen the first two games) that the human team would win after an adversarial draft (~70%). Of course, a draft did happen, but it was done by a tree search algorithm, not an algorithm learned using PPO.

The games themselves were pretty interesting (though I have not played Dota so take this with a grain of salt). It seemed to me like OpenAI Five had learned a particularly good strategy that plays to the advantages of computers, but hadn't learned some of the strategies and ideas that human players use to think about Dota. Since it uses the same amount of computation for each decision, it makes good decisions on all timescales, including ones where something surprising has occurred where humans would need some time to react, and also to coordinate. For example, as soon as a human hero entered within range of the bots (just to look and retreat), all of the bots would immediately unleash a barrage of attacks, killing the hero -- a move that humans could not execute, because of slower reaction times and worse communication and teamwork. Similarly, one common tactic in human gameplay is to teleport into a group of heroes and unleash an area-of-effect ability, but when they tried this against OpenAI Five, one of the bots hexed the hero as soon as he teleported in, rendering him unable to cast the spell. (That felt like the decisive moment in the first game.) On the other hand, there were some clear issues with the bots. At one point, two OpenAI bots were chasing Blitz, and Blitz used an ability that made him invisible while standing still. Any human player would have spammed area attacks, but the bots simply became confused and eventually left. Similarly, I believe (if I understood the commentary correctly) that a bot once used an ability multiple times, wasting mana, even though all uses after the first had no additional effect.

Other articles would have you believe that the games weren't even close, and if you look at the kill counts, that would seem accurate. I don't think that's actually right -- from what I understand, kills aren't as important as experience and gold, and you could see this in the human gameplay. OpenAI Five would often group most of its heroes together to push forward, which means they get less experience and gold. The human team continued to keep their heroes spread out over the map to collect resources -- and even though OpenAI Five got way more kills, the overall net worth of the two teams' heroes remained about equal for most of the early game. The big difference seemed to be that when the inevitable big confrontation between the two teams happened, OpenAI Five always came out on top. I'm not sure how, my Dota knowledge isn't good enough for that. Based on Blitz's comment, my guess is that OpenAI Five is particularly good at fights between heroes, and the draft reflects that. But I'd still guess that if you had pro human players who ceded control to OpenAI Five whenever a fight was about to happen, they would beat OpenAI Five (~70%). I used to put 80% on that prediction, but Blitz's comment updated me away from that.

One interesting thing was that the win probability seemed to be very strongly influenced by the draft, which in hindsight seems obvious. Dota is a really complicated game that is constantly tweaked to keep it balanced for humans, and even then the draft is very important. When you now introduce a new player (OpenAI Five) with very different capabilities (such as very good decision making under time pressure) and change the game conditions (such as a different pool of heroes), you should expect the game to become very imbalanced, with some teams far outshining others. And in fact we did see that Lion (the hero with the hexing ability) was remarkably useful (against humans, at least).

Certified Defenses against Adversarial Examples (Aditi Raghunathan et al) and A Dual Approach to Scalable Verification of Deep Networks (Krishnamurthy (Dj) Dvijotham et al): Even when defenses are developed to make neural nets robust against adversarial examples, they are usually broken soon after by stronger attacks. Perhaps we could prove once and for all that the neural net is robust to adversarial examples?

The abstract from the Raghunathan paper summarizes their approach well: "[W]e study this problem for neural networks with one hidden layer. We first propose a method based on a semidefinite relaxation that outputs a certificate that for a given network and test input, no attack can force the error to exceed a certain value. Second, as this certificate is differentiable, we jointly optimize it with the network parameters, providing an adaptive regularizer that encourages robustness against all attacks. On MNIST, our approach produces a network and a certificate that no attack that perturbs each pixel by at most \epsilon = 0.1 can cause more than 35% test error."

To compute the certificate, they consider the optimal attack A. Given a particular input x, the optimal attack A is the one that changes f(A(x)) to a different class, where f is the ML model, and A(x) is restricted to not change x too much. They leverage the structure of f (linear models and neural nets with one hidden layer) and the restrictions on A to compute a bound on f(A(x)) in terms of x. So, for each data point in the training set, the bound either says “guaranteed that it can’t be adversarially attacked” or “might be possible to adversarially attack it”. Averaging this over the training set or test set gives you an estimate of an upper bound on the optimal adversarial attack success rate.

The Dvijotham paper can work on general feedforward and recurrent neural nets, though they show the math specifically for nets with layers with componentwise activations. They start by defining an optimization problem, where the property to be verified is encoded as the optimization objective, and the mechanics of the neural net are encoded as equality constraints. If the optimal value is negative, then the property has been verified. The key idea to solving this problem is to break down the hard problem of understanding a sequence of linear layers followed by nonlinearities into multiple independent problems each involving a single layer and a nonlinearity. They do this by computing bounds on the values coming out of each layer (both before and after activations), and allowing the constraints to be satisfied with some slack, with the slack variables going into the objective with Lagrange multipliers. This dual problem satisfies weak duality -- the solution to the dual problem for any setting of the Lagrange multipliers constitutes an upper bound on the solution to the original problem. If that upper bound is negative, then we have verified the property. They show how to solve the dual problem -- this is easy now that the slack variables allow us to decouple the layers from each other. They can then compute a tighter upper bound by optimizing over the Lagrange multipliers (which is a convex optimization problem, and can be done using standard techniques). In experiments, they show that the computed bounds on MNIST are reasonably good for very small perturbations, even on networks with 2-3 layers.

My opinion: Lots of AI alignment researchers talk about provable guarantees from our AI system, that are quite broad and comprehensive, even if not a proof of "the AI is aligned and will not cause catastrophe". Both of these papers seem like an advance in our ability to prove things about neural nets, and so could help with that goal. My probably-controversial opinion is that in the long term the harder problem is actually figuring out what you want to prove, and writing down a formal specification of it in a form that is amenable to formal verification that will generalize to the real world, if you want to go down that path. To be clear, I'm excited about this research, both because it can be used both to solve problems that affect current AI systems (eg. to verify that a neural net on a plane will never crash under a mostly-realistic model of the world) and because it can be used as a tool for developing very capable, safer AI systems in the future -- I just don't expect it to be the main ingredient that gives us confidence that our AI systems are aligned with us.

On the methods themselves, it looks like the Raghunathan paper can achieve much tighter bounds if you use their training procedure, which can optimize the neural net weights in tandem with the certificate of robustness -- they compute a bound of 35% on MNIST with perturbations of up to size 26 (where the maximum is 256). However, there are many restrictions on the applicability of the method. The Dvijotham paper lifts many of these restrictions (multilayer neural nets instead of just one hidden layer, any training procedure allowed) but gets much looser bounds as a result -- the bounds are quite tight at perturbations of size 1 or 2, but by perturbations of size 10 the bounds are trivial (i.e. a bound of 100%). The training procedure that Raghunathan et al use is crucial -- without it, their algorithm finds non-trivial bounds on only a single small neural net, for perturbations of size at most 1.

Technical AI alignment

Problems

When Bots Teach Themselves How To Cheat (Tom Simonite): A media article about specification gaming in AI that I actually just agree with, and it doesn't even have a Terminator picture!

Agent foundations

Probabilistic Tiling (Preliminary Attempt) (Diffractor)

Logical Counterfactuals for Perfect Predictors and A Short Note on UDT (Chris Leong)

Handling groups of agents

Learning to Share and Hide Intentions using Information Regularization (DJ Strouse et al)

Interpretability

Techniques for Interpretable Machine Learning (Mengnan Du et al): This paper summarizes work on interpretability, providing a classification of different ways of achieving interpretability. There are two main axes -- first, whether you are trying to gain insight into the entire model, or its classification of a particular example; and second, whether you try to create a new model that is inherently interpretable, or whether you are post-hoc explaining the decision made by an uninterpretable model. The whole paper is a summary of techniques, so I'm not going to summarize it even further.

My opinion: This seems like a useful taxonomy that hits the kinds of interpretability research I know about, though the citation list is relatively low for a summary paper, and there are a few papers I expected to see that weren't present. On the other hand, I'm not actively a part of this field, so take it with a grain of salt.

Verification

Certified Defenses against Adversarial Examples (Aditi Raghunathan et al): Summarized in the highlights!

A Dual Approach to Scalable Verification of Deep Networks (Krishnamurthy (Dj) Dvijotham et al): Summarized in the highlights!

Near-term concerns

Adversarial examples

Adversarial Vision Challenge (Wieland Brendel et al): There will be a competition on adversarial examples for vision at NIPS 2018.

Motivating the Rules of the Game for Adversarial Example Research (Justin Gilmer, George E. Dahl et al) (H/T Daniel Filan)

Privacy and security

Security and Privacy Issues in Deep Learning (Ho Bae, Jaehee Jang et al)

AI capabilities

Reinforcement learning

OpenAI Five Benchmark: Results (OpenAI's Dota Team): Summarized in the highlights!

Learning Actionable Representations from Visual Observations (Debidatta Dwibedi et al): Prior work on Time Contrastive Networks (TCN)s showed that you can use time as an unsupervised learning signal, in order to learn good embeddings of states that you can then use in other tasks. This paper extends TCNs to work with multiple frames, so that it can understand motion as well. Consider any two short videos of a task demonstration. If they were taken at different times, then they should be mapped to different embedding vectors (since they correspond to different "parts" of the task). On the other hand, if they were taken at the same time (even if from different viewpoints), they should be mapped to the same embedding vector. The loss function based on this encourages the network to learn an embedding for these short videos that is invariant to changes in perspective (which are very large changes in pixel-space), but is different for changes in time (which may be very small changes in pixel-space). They evaluate with a bunch of different experiments.

My opinion: Unsupervised learning seems like the way forward to learn rich models of the world, because of the sheer volume of data that you can use.

ICML 2018 Notes (David Abel)

Deep learning

When Recurrent Models Don't Need to be Recurrent (John Miller): Recurrent neural networks (RNNs) are able to use and update a hidden state over an entire sequence, which means that in theory it is possible for them to learn very long term dependencies in a sequence, that a feedforward model would not be able to do. For example, it would be easy to assign weights to an RNN so that on input x_n it outputs n (the length of the sequence so far), whereas a feedforward model could not learn this function. Despite this, in practice feedforward methods match and exceed the performance of RNNs on sequence modeling tasks. This post argues that this is because of gradient descent -- any stable gradient descent on RNNs can be well approximated by gradient descent on a feedforward model (both at training and inference time).

My opinion: The post doesn't really explain why this is the case, instead referencing the theory in their paper (which I haven't read). It does sound like a cool result explaining a phenomenon that I do find confusing, since RNNs should be more expressive than feedforward models. It does suggest that gradient descent is not actually good at finding the optimum of a function, if that optimum involves lots of long-term dependencies.

Objects that Sound (Relja Arandjelović, Andrew Zisserman et al): The key idea behind this blog post is that there is a rich source of information in videos -- the alignment between the video frames and audio frames. We can leverage this by creating a proxy task that will force the neural net to learn good representations of the video, which we can then use for other tasks. In particular, we can consider the proxy task of deciding whether a short (~1 second) video clip and audio clip are aligned or not. We don't care about this particular task, but by designing our neural net in the right way, we can ensure that the net will learn good representations of video and audio. We pass the video clip through a convolutional net, the audio clip through another convolutional net, and take the resulting vectors and use the distance between them as a measure of how dissimilar they are. There is no way for video to affect the audio or vice versa before the distance -- so the net is forced to learn to map each of them to a shared space where the distance is meaningful. Intuitively, we would expect that this shared space would have to encode the cause of both the audio and video. Once we have these embeddings (and the neural nets that generate them), we can use them for other purposes. For example, their audio encoder sets the new state-of-the-art on two audio classification benchmarks. In addition, by modifying the video encoder to output embeddings for different regions in the image, we can compute the distance between the audio embedding and the video embedding at each region, and the regions where this is highest correspond to the object that is making the sound.

My opinion: Another great example of using unsupervised learning to learn good embeddings. Also, a note -- you might wonder why I'm calling this unsupervised learning even though there's a task, with a yes/no answer, a loss function, and an iid dataset, which are hallmarks of supervised learning. The difference is that the labels for the data did not require any human annotation, and we don't care about the actual task that we're learning -- we're after the underlying embeddings that it uses to solve the task. In the previous paper on learning actionable representations, time was used to define an unsupervised learning signal in a similar way.

MnasNet: Towards Automating the Design of Mobile Machine Learning Models (Mingxing Tan): Mobile phones have strong resource constraints (memory, power usage, available compute), which makes it hard to put neural nets on them. Previously, for image classification, researchers hand designed MobileNetV2 to be fast while still achieving good accuracy. Now, using neural architecture search, researchers have found a new architecture, MnasNet, which is 1.5x faster with the same accuracy. Using the squeeze-and-excitation optimization improves it even further.

My opinion: Neural architecture search is diversifying, focusing on computation time in addition to accuracy now. It seems possible that we'll run into the same problems with architecture search soon, where the reward functions are complex enough that we don't get them right on the first try. What would it look like to learn from human preferences here? Perhaps we could present two models from the search to humans, along with statistics about each, and see which ones the researchers prefer? Perhaps we could run tests on the model, and then have humans provide feedback on the result? Maybe we could use feature visualization to provide feedback on whether the network is learning the "right" concepts?

Neural Arithmetic Logic Units (Andrew Trask et al)

Generalization Error in Deep Learning (Daniel Jakubovitz et al)

Applications

The Machine Learning Behind Android Smart Linkify (Lukas Zilka): Android now has Smart Linkify technology, which allows it to automatically find pieces of text that should link to another app (for example, addresses should link to Maps, dates and times to Calendar, etc). There are a lot of interesting details on what had to be done to get this to actually work in the real world. The system has two separate nets -- one which generates candidate entities, and another which says what kind of entity each one is. In between these two nets, we have a regular program that takes the set of proposed entities, and prunes it so that no two entities overlap, and then sends it off to the entity classification net. There are a few tricks to get the memory requirements down, and many dataset augmentation tricks to get the nets to learn particular rules that it would not otherwise have learned.

My opinion: I take this as an example of what advanced AI systems will look like -- a system of different modules, each with its own job, passing around information appropriately in order to perform some broad task. Some of the modules could be neural nets (which can learn hard-to-program functions), while others could be classic programs (which generalize much better and are more efficient). OpenAI Five also has elements of this -- the drafting system is a classic program operating on the win probabilities from the neural net. It's also interesting how many tricks are required to get Smart Linkify to work -- I don't know whether to think that this means generally intelligent AI is further away, or that the generally intelligent AI that we build will rely on these sorts of tricks.

News

Human-Aligned AI Summer School: A Summary (Michaël Trazzi): A summary of the talks at the summer school that just happened, from one of the attendees, that covers value learning, agent foundations, bounded rationality, and side effects. Most of the cited papers have been covered in this newsletter, with the notable exceptions of Bayesian IRL and information-theoretic bounded rationality.

4