Alignment Newsletter #20

Rohin Shah

This week's newsletter is pretty light, I didn't find much. On one of the two days I checked, Arxiv Sanity had no recommendations for me at all, when usually it has over five.

Highlights

Large-Scale Study of Curiosity-Driven Learning (Yuri Burda, Harri Edwards, Deepak Pathak et al): One major challenge in RL is how to explore the environment sufficiently in order to find good rewards to learn from. One proposed method is curiosity, in which the agent generates an internal reward for taking any transition where the outcome was surprising, where surprisal is measured as the negative log probability assigned to the outcome by the agent. In this paper, a neural net that takes as input observation features φ(x) and action a, and predicts the features of the next state observation. The mean squared error with the actual features of the next state is then a measure of the surprisal, and is used as the curiosity reward. This is equivalent to treating the output of the neural net as the mean of a Gaussian distribution with fixed variance, and defining the reward to be the negative log probability assigned to the actual next state.

This still leaves the feature function φ undetermined. They consider using pixels directly, using a CNN with randomly chosen fixed weights, learned CNN features using a variational autoencoder (VAE) (which optimize for features that are useful for reconstructing the observation), and learned CNN features using inverse dynamics (IDF) (which optimize for features that are useful for reconstructing the action, biasing the features towards aspect of the environment that the agent can control). As you might expect, pixels don't work very well. However, random features do work quite well, often beating the VAE and IDF. This can happen because the random features stay fixed, leading to more stable learning, whereas with the VAE and IDF methods the features are changing over time, and the environment distribution is changing over time (as the agent explores more of it), leading to a harder learning problem.

Typically, curiosity is combined with an external reward. In this paper, the authors evaluate how well an agent can do with only curiosity and no external reward. Intuitively, in game environments designed by humans, the designer sets up a good curriculum for humans to learn, which would align well with a curiosity reward. In fact, this is what happens, with a curiosity based reward leading to great performance (as measured by the external reward) on Atari games, Super Mario, Unity mazes, and Roboschool Pong, when using random features or IDF features. (The VAE features sometimes work well but were very unstable.) They evaluate transfer between levels in Super Mario, and find that the learned features transfer in more cases than random ones. Looking at the graphs, this seems like a very small effect to me -- I'm not sure if I'd agree with the claim, but I'd want to look at the behavior in videos and what the reward function rewards before making that claim strongly. They also investigate Pong with both players being driven by curiosity, and the players become so good at rallying that they crash the emulator.

Finally, they note one downside -- in any stochastic environment, or any environment where there will be lots of uncertainty about what will happen (eg. in multiagent settings), at convergence the reward for any action will be equal to the entropy of the next state distribution. While they don't demonstrate this flaw in particular, they show a related one -- if you add a TV to a Unity maze, and an action to change the channel, then the agent learns to stand in front of the TV and change the channel forever, rather than solving the maze.

My opinion: I really like these empirical papers that compare different methods and show their advantages and disadvantages. I was pretty surprised to see random features do as well as they did, especially to see that they transferred as well as learned features in one of the two cases they studied. There was of course a neural net that could learn how to use the arbitrary representation induced by the features, but then why couldn't it do the same for pixels? Perhaps the CNN was useful primarily for reducing the dimensionality of the pixels by combining nearby pixels together, and it didn't really matter how that was done since it still retains all the important information, but in a smaller vector?

I'm glad that the paper acknowledges that the good performance of curiosity is limited to environments that human designers have created. In a real world task, such as a house-cleaning robot, there are many other sources of uncertainty in the world that are unrelated to the task, and you need some form of specification to focus on it -- curiosity alone will not be enough.

Technical AI alignment

Agent foundations

Logical Counterfactuals & the Cooperation Game (Chris Leong)

Learning human intent

Risk-Sensitive Generative Adversarial Imitation Learning (Jonathan Lacotte et al): This paper extends GAIL to perform imitation learning where we try to optimize a policy for the mean reward collected under the constraint that the policy is no more risky than the expert policy. Since we don't know the true cost function, we have to approximate this problem with another problem where we infer the cost function as well, and evaluate the risk profile relative to the inferred cost function. The algorithm ends up looking very similar to the original GAIL algorithm, where the gradient updates change in order to include terms dependent on the conditional value-at-risk (CVaR). They evaluate against GAIL and RAIL (another risk-sensitive imitation learning algorithm) and find that their method performs the best on the Hopper and Walker Mujoco environments.

My opinion: I only skimmed through the math, so I don't understand the paper well enough to have a good opinion on it. The overall objective of having more risk-sensitivity seems useful for safety. That said, I do find the VNM utility theorem compelling, and it suggests that risk aversion is a bad strategy. I currently resolve this by saying that while the VNM theorem is true, if you want to optimize expected reward over a long time horizon in an environment with high-downside actions but not high-upside actions, even if you are maximizing expected utility you would not take low-probability-of-high-downside actions. (Here a high-downside action is one that causes something like death/episode termination.) Since humans are (probably) scope-insensitive with respect to time, it becomes important for humans to have a heuristic of risk aversion in order to actually maximize expected utility in practice. I'd be interested in seeing experiments with current (risk neutral) RL algorithms in long-horizon environments with actions with high downside, and see if they automatically learn behavior that we would call "risk-averse".

Take this with a grain of salt -- it's a lot more speculative than most of my opinions, which can already be quite speculative. Most of the steps in that argument are handwavy intuitions I have that aren't based on any research that's been done (though I haven't looked for any such research). Though you can think of the argument for focusing on long-term AI safety at all as an instance of this idea, where the argument is that our risk-aversion heuristic is only sufficient for timescales on the orders of human lifetimes, not for cosmic timescales, and so we should explicitly be more risk-averse and focus on reducing existential risk.

Directed Policy Gradient for Safe Reinforcement Learning with Human Advice (Helene Plisnier et al): One way that you could get advice from humans for RL would be to have the human provide a policy, which can be treated as a suggestion. In this paper, the authors propose to take such a policy, and incorporate it into a policy gradient algorithm by simply multiplying it with the policy chosen by the neural net to get a new policy that is in between the two. You can then run any on-policy RL algorithms using that policy.

My opinion: I'm annoyed at some claims that this paper makes. First, they say that the algorithm can ignore wrong advice that the human gives, but in the deterministic case, it does not ignore the advice, it just learns that if it gets into situations where it has to follow the advice bad things happen, and so it avoids getting into such situations. (The stochastic case is a bit better, in that at convergence the agent will ignore the advice, but it will take much longer to converge, if at all.) Second, their experiment involves a gridworld with 5 macro-actions, and they call this a "complicated environment with sparse rewards" -- yet if you had a uniformly random policy, in expectation it would take 5^3 = 125 episodes before you found the optimal trajectory, which would then be strongly reinforced getting quick convergence.

I do like the idea of providing advice by shaping the policy towards parts of the space that are better -- this would lead to better sample efficiency and safer exploration. I'd be pretty excited to see a paper that ran with this idea and had a more compelling story for how to get the advice policy from a human (specifying a policy is hard!) and better experiments that test the feasibility of the idea in a more complex environment.

Entropic Regret I: Deterministic MDPs (Vadim Kosoy)

Miscellaneous (Alignment)

Building Safer AGI by introducing Artificial Stupidity (Michaël Trazzi et al)

Near-term concerns

Machine ethics

A developmentally-situated approach to teaching normative behavior to AI (gworley)

AI capabilities

Reinforcement learning

Large-Scale Study of Curiosity-Driven Learning (Yuri Burda, Harri Edwards, Deepak Pathak et al): Summarized in the highlights!

Applications

A major milestone for the treatment of eye disease (Mustafa Suleyman): DeepMind's partnership with Moorfields Eye Hospital has resulted in an AI system that can recognize features of eye disease and recommend treatment. Interestingly, in order to get interpretability, they train two networks instead of one, where one predicts the features of eye disease for all of the tissue (eg. haemorrhages, lesions and irregular fluids), and the other then makes a recommendation for treatment. This required them to label a subset of the dataset with feature markers in order to train the first model.

My opinion: As interpretability goes, using a modular model with human-interpretable intermediate representations seems quite good -- it decouples the problem of understanding the model's output into two smaller problems. The big downside is that it requires a lot more labeling (877 segmented images in this case), and that the human-interpretable representation may not be the best one for the job. For example, if there are other visual cues besides the specific features DeepMind used that help with recommending treatment, this model will not be able to take advantage of them, while an end-to-end trained system could.