Alignment Newsletter #16: 07/23/18

Rohin Shah

Highlights

Seedbank — discover machine learning examples (Michael Tyka): Seedbank provides interactive machine learning examples in Colab notebooks (think Jupyter notebooks in the cloud). This makes it really easy to just run example code without any setup, and even to modify it to play around with it. Google even provides a free GPU to make the training and inference faster!

My opinion: I haven't explored it yet, but this seems great, especially if you want to learn ML. I have used Colab notebooks before and recommend them highly for small projects (maybe even large ones, I'm not sure), especially if you're familiar with Jupyter notebooks.

Announcement: AI alignment prize round 3 winners and next round (Zvi Mowshowitz and Vladimir Slepnev): The winners of the third round of the AI Alignment Prize have been announced! Vadim Kosoy wins the first prize of $7500 for The Learning-Theoretic AI Alignment Research Agenda, and Alexander Turner wins the second prize of $2500 for Worrying About the Vase: Whitelisting and Overcoming Clinginess in Impact Measures. The next round has started and will last until December 31, and each participant has been asked to submit a single entry (possibly in parts).

DeepMind hiring Research Scientist, Safety: Career opportunity!

Previous newsletters

Pascal’s Muggle Pays (Zvi) (H/T Alex Mennen): Last week I mentioned non-exploitability as a justification for not paying Pascal's mugger. Alex pointed me to this post, which makes this argument (which I had seen before), and more importantly to these comments that argue against it, which I hadn't seen. The basic idea is that the downside of being continuously exploited in the real world is still not bad enough to cancel out the potentially huge upside in the (very unlikely) world where the mugger is telling the truth.

My opinion: I'm convinced: non-exploitability doesn't save you from being Pascal's mugged. My current opinion on Pascal's mugging is ¯\_(ツ)_/¯

Technical AI alignment

Technical agendas and prioritization

Mechanism design for AI (Tobias Baumann): One cause of outcomes worse than extinction could be escalating conflicts between very capable AI systems (that could eg. threaten to simulate suffering beings). It is worth studying how we could have AI systems implement mechanism design in order to guide such systems into more cooperative behavior.

Agent foundations

Probability is Real, and Value is Complex (Abram Demski): If you interpret events as vectors on a graph, with probability on the x-axis and probability * utility on the y-axis, then any rotation of the vectors preserves the preference relation, so that you will make the same decision. This means that from decisions, you cannot distinguish between rotations, which intuitively means that you can't tell if a decision was made because it had a low probability of high utility, or medium probability of medium utility, for example. As a result, beliefs and utilities are inextricably linked, and you can't just separate them. Key quote: "Viewing [probabilities and utilities] in this way makes it somewhat more natural to think that probabilities are more like "caring measure" expressing how much the agent cares about how things go in particular worlds, rather than subjective approximations of an objective "magical reality fluid" which determines what worlds are experienced."

My opinion: I am confused. If you want to read my probably-incoherent confused opinion on it, it's here.

Prerequisites: Bayesian Utility: Representing Preference by Probability Measures

Buridan's ass in coordination games (jessicata): Suppose two agents have to coordinate to choose the same action, X or Y, where X gives utility 1 and Y gives utility u, for some u in [0, 2]. (If the agents fail to coordinate, they get zero utility.) If the agents communicate, decide on policies, then observe the value of u with some noise ϵ, and then execute their policies independently, there must be some u for which they lose out on significant utility. Intuitively, the proof is that at u = 0, you should say X, and at u = 2, you should say Y, and there is some intermediate value where you are indifferent between the two (equal probability of choosing X or Y), meaning that 50% of the time you will fail to coordinate. However, if you have a shared source of randomness (after observing the value of u), then you can correlate your decisions using the randomness in order to do much better.
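To make the role of shared randomness concrete, here is a minimal simulation sketch. It is not the construction from the post: the threshold policy, the uniform noise model, and the names `simulate`, `noise`, and `spread` are all assumptions chosen for illustration. Each agent picks Y when its own noisy observation of u exceeds a threshold; without shared randomness the threshold is fixed at 1, and with shared randomness both agents use the same randomly drawn threshold.

```python
import random

def simulate(u, noise=0.01, spread=0.0, trials=20000, seed=0):
    """Return the fraction of trials in which both agents pick the same action.

    Each agent sees u plus independent uniform noise and picks Y when its
    observation exceeds a threshold. With spread == 0 the threshold is fixed
    at 1; with spread > 0 both agents share one random threshold per trial,
    drawn uniformly from [1 - spread, 1 + spread].
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        threshold = 1.0 + rng.uniform(-spread, spread)  # shared by both agents
        obs_a = u + rng.uniform(-noise, noise)           # agent A's private view
        obs_b = u + rng.uniform(-noise, noise)           # agent B's private view
        if (obs_a > threshold) == (obs_b > threshold):
            agree += 1
    return agree / trials

# At u = 1 the fixed-threshold agents are each a coin flip, independently.
independent = simulate(1.0)
# A shared random threshold makes disagreement rare at every value of u.
shared = simulate(1.0, spread=0.1)
```

Under these assumptions the fixed-threshold agents coordinate only about half the time at u = 1, while the shared threshold makes miscoordination rare. The cost is that a wider `spread` sometimes picks the slightly worse action when u is near 1, so there is a trade-off between the two kinds of loss.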

My opinion: Cool result, and quite easy to understand. As usual I don't want to speculate on relevance to AI alignment because it's not my area.

Learning human intent

Generative Adversarial Imitation from Observation (Faraz Torabi et al)

Exploring Hierarchy-Aware Inverse Reinforcement Learning (Chris Cundy et al): One heuristic that humans use to deal with bounded computation is to make plans hierarchically, building long-term plans out of slightly smaller building blocks. How can we incorporate this knowledge into an IRL algorithm? This paper extends Bayesian IRL to the setting where the demonstrator has access to a set of options, which are (to a first approximation) policies that can be used to achieve some subgoal. Now, when you are given a trajectory of states and actions, it is no longer clear which options the demonstrator was using to generate that trajectory. The authors provide an algorithm that can enumerate all the options that are consistent with the trajectory, and assign probabilities to them according to the Boltzmann-rational model. They evaluate on a taxi driver gridworld often used in hierarchical planning, as well as on real human data from a game called Wikispeedia.
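As a small illustration of the Boltzmann-rational model mentioned above, the sketch below turns a set of values into choice probabilities proportional to exp(beta * value). The function name `boltzmann_policy`, the option names, and the numeric values are hypothetical; the paper's actual algorithm (enumerating options consistent with a trajectory) is not reproduced here.

```python
import math

def boltzmann_policy(q_values, beta=1.0):
    """Map {choice: value} to {choice: probability}, with P(c) proportional to exp(beta * Q(c)).

    beta is the rationality parameter: beta = 0 gives uniform random choice,
    and large beta approaches always picking the highest-value choice.
    """
    max_q = max(q_values.values())  # subtract the max for numerical stability
    weights = {c: math.exp(beta * (q - max_q)) for c, q in q_values.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

# Hypothetical values for two options in a taxi-style gridworld.
option_values = {"go_to_passenger": 2.0, "go_to_depot": 1.0}
probs = boltzmann_policy(option_values, beta=1.0)
uniform = boltzmann_policy(option_values, beta=0.0)
```

The same formula can be applied to primitive actions or to options; in a Bayesian IRL setting these probabilities would serve as the likelihood of the demonstrator's choices under a candidate reward.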

My opinion: Hierarchy seems to be a very important tool that humans use, so I'm glad to see work on it. Currently, the algorithm is very computationally expensive, can only be applied in small domains, and requires the options to be specified ahead of time, but it does lead to a benefit on the environments they consider, despite the inevitable misspecification from having to hardcode the options. I would be very interested to see an extension to high-dimensional data where the options are learned (analogous to Meta-Learning Shared Hierarchies for hierarchical RL). Not only would this be more realistic, it could perform better because the options would be learned, not hardcoded.

IBM researchers train AI to follow code of ethics (Ben Dickson): Parents want movie recommendation systems not to recommend particular kinds of movies to children, but we would also like the recommendation system to suggest movies that the children will actually like. Researchers solved this problem by first learning a model for what kinds of movies should not be recommended, and then combined that with a contextual bandit model that learns online from the child's data to provide good suggestions that follow the parent's constraints.

My opinion: We can look at this from an alignment perspective -- the child is giving the AI system a misspecified reward, relative to the parent's goal of "provide good suggestions that do not have inappropriate content". While the researchers solve it using contextual bandits, it could be interesting to consider how AI alignment approaches could deal with this situation.

Reward learning theory

Figuring out what Alice wants, parts I and II (Stuart Armstrong): Since it's not possible to infer human preferences without making some normative assumption about the human, we should try to learn the models that humans use to reason about each other that allow us to infer preferences of other humans. While we can't get access to these models directly, we can access fragments of them -- for example, whenever a person expresses regret, that can be taken as a mismatch between the model's expectation and the actual outcome. Part II goes through two example scenarios and what the internal human models might look like, and the challenges that arise in trying to learn them.

My opinion: It does seem like we should be able to learn the things that humans mostly agree on, and that this can help us a lot with inferring human preferences. I don't know if the goal is to use these models to infer broad human values, or something a lot simpler. Inferring broad human values seems very unlikely to work, since you are trying to get to superhuman ability at knowing what humans want by mimicking human models (which are tautologically not superhuman).

Preventing bad behavior

Shielded Decision-Making in MDPs (Nils Jansen et al): Given a model of an MDP, we can compute a shield, which restricts the actions available to an RL agent to only the ones that can achieve at least some fraction of the optimal value. This results in safe exploration (since catastrophes would fall under the level that the shield guarantees), and also improves sample efficiency, since you no longer have tons of episodes in which the agent gets a large negative reward which only serve to teach it what not to do. They evaluate their approach on Pacman.
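The shield as summarized above can be sketched in a few lines: given model-computed values for each action in the current state, keep only the actions within some fraction of the best one, and let the RL agent choose among those. The function name `shield`, the threshold `fraction`, and the Pacman values are hypothetical, and values are assumed to be non-negative; this is an illustration of the summary, not the paper's implementation.

```python
def shield(action_values, fraction=0.9):
    """Return the set of actions whose value is at least `fraction` of the best value.

    `action_values` maps each action to its model-computed value in the
    current state; values are assumed to be non-negative in this sketch.
    """
    best = max(action_values.values())
    return {a for a, v in action_values.items() if v >= fraction * best}

# Hypothetical values for one Pacman state: moving left runs into a ghost.
values = {"up": 0.95, "down": 0.90, "left": 0.10, "right": 0.80}
allowed = shield(values, fraction=0.9)

# The learner then explores only within the shielded action set.
strict = shield(values, fraction=1.0)
```

Lowering `fraction` gives the agent more freedom to explore, while raising it toward 1 collapses the allowed set onto the model's own best action.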

My opinion: They require quite a lot of modeling in order to do this -- I think that it's specific to a particular kind of MDP, where there is an agent, and adversaries (the ghosts in Pacman), that are traversing a graph (the maze), which can have tokens (the food pellets). In theory, you should just solve the MDP and not use RL at all. Also in theory, shielding would actually require you to do this (in order to calculate the optimal values of actions), in which case it seems pointless (just use the optimal policy instead). In practice, the shield is only computed over a few timesteps. So you can think of this as a way of combining explicit, computationally-expensive forward reasoning (as in value iteration, for example) with RL, which learns from experience and can scale to much longer time horizons.

From the perspective of safety, I would be a lot more interested in approaches based on formal verification if they could work with learned features, rather than requiring that the human accurately formally model the world. This seems doable using a framework similar to Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, except by getting a formal safety specification iteratively instead of learning to mimic the human shield with neural nets.

Verification

A Game-Based Approximate Verification of Deep Neural Networks with Provable Guarantees (Min Wu et al)

Miscellaneous (Alignment)

Compact vs. Wide Models (Vaniver): A compact model is one which is very general, and easy to prove things about, but doesn't inherently capture the messiness of the real world inside the model. Examples include Turing machines and utility functions. A wide model is one which still has a conceptually crisp core, but these crisp core units must then be combined in a complicated way in order to get something useful. Examples include the use of transistors to build CPUs, and the hierarchical control model of human psychology. The nice thing about wide models is that they start to engage with the messiness of the real world, and so make it clearer where the complexity is being dealt with. This is a useful concept to have when evaluating a proposal for alignment -- it asks the question, "where does the complexity reside?"

My opinion: I definitely support having models that engage more with the messiness of the real world. I'm not sure if I would have used "wide models" -- it seems like even the assumption of a crisp core makes it not as capable of handling messiness as I want. But if you're trying to get formal guarantees and you need to use some model, a wide model is probably a useful choice.

Discontinuity from the Eiffel Tower (Beth Barnes and Katja Grace): The Eiffel tower represented a 54-year discontinuity in the trend for "height of the tallest existing structure", and an 8000-year discontinuity in the trend for "height of the tallest structure ever". It's unclear what the cause of this discontinuity is, though the authors provide some speculation.

My opinion: I'm not sure if I should update without knowing the cause of the discontinuity, or how the search for discontinuities was conducted. If you're searching for discontinuities, I do expect you'll find some, even if in general I expect discontinuities not to arise, so it doesn't feel like strong evidence that discontinuities are probable.

Prerequisites: Discontinuous progress investigation or Likelihood of discontinuous progress around the development of AGI

Near-term concerns

Privacy and security

Model Reconstruction from Model Explanations (Smitha Milli et al): Many methods for providing explanations of why a neural net made the prediction it did rely on gradient information. However, the gradient encodes a lot of information about the model, and so we should expect it to be possible to easily reconstruct the model given gradients, which we might want to prevent (eg. if a company wants to protect trade secrets). In the case of a linear classifier, the gradient directly outputs the weights of the classifier. The authors provide an algorithm that can learn a two-layer neural net with ReLU activations, and prove that it learns the model with high probability from a small number of gradients. They also show experimental results with more complex models, in which they train a model to mimic another model based on gradient information; these show that it is easy to "steal" models in this way.
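The linear case is easy to check directly. In the sketch below, `gradient_explanation` is a hypothetical stand-in for an explanation service that returns the gradient of the model's score with respect to the input (computed here by finite differences so the example needs no libraries). The weights, bias, and query point are made-up values. One explanation plus one prediction is enough to recover the whole model.

```python
def make_linear_model(weights, bias):
    """Return a black-box scoring function f(x) = w . x + b."""
    def f(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return f

def gradient_explanation(f, x, h=1e-4):
    """Stand-in for a gradient-based explanation: d f / d x by finite differences."""
    base = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - base) / h)
    return grad

# The "secret" model; the attacker only sees its outputs and explanations.
secret_model = make_linear_model([0.5, -2.0, 3.0], bias=1.0)

x = [1.0, 1.0, 1.0]
stolen_weights = gradient_explanation(secret_model, x)  # equals w for a linear model
stolen_bias = secret_model(x) - sum(w * xi for w, xi in zip(stolen_weights, x))
```

For nonlinear models a single gradient no longer pins down the parameters, which is why the paper needs a dedicated algorithm for the two-layer ReLU case.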

My opinion: This problem seems very difficult -- even if you are just given predictions from a model, you can learn the model (though it takes many more samples than if you have gradients). One technical solution could be to add random noise to your predictions or gradients, but this could limit the utility of your model, and I suspect if you trained a model to mimic these noisy predictions or gradients, it would do as well as your model + noise, so you haven't gained anything. We could potentially solve this with social mechanisms (maybe patents in particular) or more boring technical approaches like rate-limiting users in how much they can query the model.

Machine ethics

How would you teach AI to be kind? (Nell Watson): The EthicsNet Guardians Challenge is looking for suggestions on how to create a dataset that could be used to teach prosocial behavior. This is not aimed to answer difficult philosophical questions, but to teach an AI system general, simple prosocial behaviors, such as alerting someone who dropped their wallet but didn't notice. They have some ideas for how to achieve this, but are looking for more ideas before they actually start collecting a dataset.

My opinion: One of the things I think about now is how to learn "common sense", and this seems very related (though not exactly the same). One of the hardest things to do with novel AI research is to collect a good dataset (if you don't have a simulator, anyway), so this seems like a great opportunity to get a good dataset for projects trying to tackle these sorts of issues, especially for somewhat fleshed out projects where you know what kind of dataset you'll need.

AI strategy and policy

AI Policy Challenges and Recommendations

AI capabilities

Reinforcement learning

The Bottleneck Simulator: A Model-based Deep Reinforcement Learning Approach (Iulian Vlad Serban et al)

Remember and Forget for Experience Replay (Guido Novati et al)

Visual Reinforcement Learning with Imagined Goals (Ashvin Nair, Vitchyr Pong et al): Hindsight Experience Replay (HER) introduced the idea of accelerating learning with sparse rewards, by taking trajectories where you fail to achieve the goal (and so get no reward, and thus no learning signal) and replacing the actual goal with an "imagined" goal chosen in hindsight such that you actually achieved that goal, which means you get reward and can learn. This requires that you have a space of goals such that for any trajectory, you can come up with a goal such that the trajectory achieves that goal. In practice, this means that you are limited to tasks where the goals are of the form "reach this goal state". However, if your goal state is an image, it is very hard to learn how to act in order to reach any possible image goal state (even if you restrict to realistic ones), since the space is so large and unstructured. The authors propose to first learn a structured latent representation of the space of images using a variational autoencoder (VAE), and then use that structured latent space as the space of goals which can be achieved. They also use Q-learning instead of DDPG (which is what HER used), so that they can imagine any goal with a minibatch (s, a, s') and learn from it (whereas HER/DDPG is limited to states on the trajectory).
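The hindsight relabeling step that this work builds on can be sketched as follows. This shows only the simplest variant, where the goal is replaced by the final state actually reached; the function names, the tuple layout, and the 1-D track are hypothetical, and the paper's own contribution (goals sampled in a learned VAE latent space) is not shown.

```python
def sparse_reward(state, goal):
    """Reward 1 only when the goal state is reached, 0 otherwise."""
    return 1.0 if state == goal else 0.0

def relabel_with_hindsight(trajectory):
    """Replace the goal with the state actually reached at the end of the trajectory.

    `trajectory` is a list of (state, action, next_state) tuples. Returns
    (state, action, next_state, goal, reward) tuples for a replay buffer.
    """
    achieved_goal = trajectory[-1][2]  # the final next_state becomes the goal
    return [
        (s, a, s_next, achieved_goal, sparse_reward(s_next, achieved_goal))
        for s, a, s_next in trajectory
    ]

# A failed episode on a hypothetical 1-D track: the intended goal was state 5.
episode = [(0, "right", 1), (1, "right", 2), (2, "right", 3)]
original_rewards = [sparse_reward(s_next, 5) for _, _, s_next in episode]
relabeled = relabel_with_hindsight(episode)
relabeled_rewards = [reward for *_, reward in relabeled]
```

With the original goal every transition has zero reward, so there is nothing to learn from; after relabeling, the last transition carries reward, which gives a goal-conditioned learner a signal.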

My opinion: This is a cool example of a relatively simple yet powerful idea -- instead of having a goal space over all states, learn a good latent representation and use that as your goal space. This enables unsupervised learning in order to figure out how to use a robot to generally affect the world, probably similarly to how babies explore and learn.

OpenAI Five Benchmark: The benchmark match for OpenAI Five will be a best-of-three match on August 5 at 2pm. They have already removed many of the restrictions on gameplay, including the two most important ones (wards and Roshan), and have widened the pool of heroes to choose from, from 5 to 18.

My opinion: I wonder if they are planning to play a game where both sides draft heroes, or where both sides get a randomly chosen team of 5 heroes. Previously I would have expected that they were choosing randomly, since it seems very difficult to learn solely from experience whether your team choice works well, given that the number of possible drafts is combinatorially large, and the way that the draft affects outcome is very complicated and long term and so hard to capture in a gradient. Now, I'm pretty uncertain -- if deep RL was enough to get this far, it could be good enough to deal with that as well. And it's possible that you can actually do well at drafting with some relatively simple heuristics -- I don't know Dota well enough to say.

Deep learning

Automatically Composing Representation Transformations as a Means for Generalization (Michael B. Chang et al)

Universal Transformers (Mostafa Dehghani, Stephan Gouws et al)

Seedbank — discover machine learning examples (Michael Tyka): Summarized in the highlights!

Deep Learning in the Wild (Thilo Stadelmann et al): Describes how deep learning is used to solve real-world problems (eg. in industry).

My opinion: The conclusions (section 8) contain a nice list of lessons learned from their case studies, emphasizing problems such as the difficulty of getting good data, the importance of reward shaping, etc.

AGI theory

Steps toward super intelligence (1, 2, 3, 4) (Rodney Brooks): Part 1 goes into four historical approaches to AI and their strengths and weaknesses. Part 2 talks about what sorts of things an AGI should be capable of doing, proposing two tasks to evaluate on to replace the Turing test (which simple not-generally-intelligent chatbots can pass). The tasks are an elder care worker (ECW) robot, that could assist the elderly and let them live their lives in their homes, and a services logistics planner (SLP), which should be able to design systems for logistics, such as the first-ever dialysis ward in a hospital. Part 3 talks about what sorts of things are hard now, but talks about rather high-level things such as reading a book and writing code. Part 4 has suggestions on what to work on right now, such as getting object recognition and manipulation capabilities of young children.

My opinion: Firstly, you may just want to skip it all because many parts drastically and insultingly misrepresent AI alignment concerns. But if you're okay with that, then part 2 is worth reading -- I really like the proposed tasks for AGI, they seem like good cases to think about. Part 1 doesn't actually talk about superintelligence so I would skip it. Part 3 was not news to me, and I suspect will not be news to readers of this newsletter (even if you aren't an AI researcher). I disagree with the intuition behind Part 4 as a method for getting superintelligent AI systems, but it does seem like the way we will make progress in the short term.

News

Solving the AI Race finalists — $15,000 of prizes (Marek Rosa)

Announcement: AI alignment prize round 3 winners and next round (Zvi Mowshowitz and Vladimir Slepnev): Summarized in the highlights!

DeepMind hiring Research Scientist, Safety: Summarized in the highlights!

Ought's Progress Update July 2018 (Andreas Stuhlmüller): A lot of organizational updates that I won't summarize here. There's a retrospective about the Predicting Slow Judgments project, and some updates on the Factored Cognition project. Two particularly interesting points -- first, they have not yet run into questions where it seemed impossible to make progress by decomposing the problem, making them slightly more optimistic; and second, they are now more confident that decomposition will take a large amount of work, such that experiments will require some amount of automation using ML in order to be feasible.

AI Alignment Podcast: AI Safety, Possible Minds, and Simulated Worlds with Roman Yampolskiy (Lucas Perry and Roman Yampolskiy)
