Alignment Newsletter #51

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

You may have noticed that I've been slowly falling behind on the newsletter, and am now a week behind. I would just skip a week and continue -- but there are actually a lot of papers and posts that I want to read and summarize, and just haven't had the time. So instead, this week you're going to get two newsletters. This one focuses on all of the ML-based work that I have mostly been ignoring for the past few issues.

Highlights

Towards Characterizing Divergence in Deep Q-Learning (Joshua Achiam et al): Q-Learning algorithms use the Bellman equation to learn the Q*(s, a) function, which is the long-term value of taking action a in state s. Tabular Q-Learning collects experience and updates the Q-value for each (s, a) pair independently. As long as each (s, a) pair is visited infinitely often, and the learning rate is decayed properly, the algorithm is guaranteed to converge to Q*.

Once we get to complex environments where you can't enumerate all of the states, we can't explore all of the (s, a) pairs. The obvious approach is to approximate Q*(s, a). Deep Q-Learning (DQL) algorithms use neural nets for this approximation, and use some flavor of gradient descent to update the parameters of the net such that it is closer to satisfying the Bellman equation. Unfortunately, this approximation can prevent the algorithm from ever converging to Q*.

This paper studies the first-order Taylor expansion of the DQL update, and identifies three factors that affect the DQL update: the distribution of (s, a) pairs from which you learn, the Bellman update operator, and the neural tangent kernel, a property of the neural net that specifies how information from one (s, a) pair generalizes to other (s, a) pairs. The theoretical analysis shows that as long as there is limited generalization between (s, a) pairs, and each (s, a) pair is visited infinitely often, the algorithm will converge. Inspired by this, they design PreQN, which explicitly seeks to minimize generalization across (s, a) pairs within the same batch. They find that PreQN leads to competitive and stable performance, despite not using any of the tricks that DQL algorithms typically require, such as target networks.

Rohin's opinion: I really liked this paper: it's a rare instance where I actually wanted to read the theory in the paper because it felt important for getting the high level insight. The theory is particularly straightforward and easy to understand (which usually seems to be true when it leads to high level insight). The design of the algorithm seems more principled than others, and the experiments suggest that this was actually fruitful. The algorithm is probably more computationally expensive per step compared to other algorithms, but that could likely be improved in the future.

One thing that felt strange is that the proposed solution is basically to prevent generalization between (s, a) pairs, but the whole point of DQL algorithms is to generalize between (s, a) pairs since you can't get experience from all of them. Of course, since they are only preventing generalization within a batch, they still generalize between (s, a) pairs that are not in the batch, but presumably that was because they only could prevent generalization within the batch. Empirically the algorithm does seem to work, but it's still not clear to me why it works.

Technical AI alignment

Learning human intent

Deep Reinforcement Learning from Policy-Dependent Human Feedback (Dilip Arumugam et al): One obvious approach to human-in-the-loop reinforcement learning is to have humans provide an external reward signal that the policy optimizes. Previous work noted that humans tend to correct existing behavior, rather than providing an "objective" measurement of how good the behavior is (which is what a reward function is). They proposed Convergent Actor-Critic by Humans (COACH), where instead of using human feedback as a reward signal, they use it as the advantage function. This means that human feedback is modeled as specifying how good an action is relative to the "average" action that the agent would have chosen from that state. (It's an average because the policy is stochastic.) Thus, as the policy gets better, it will no longer get positive feedback on behaviors that it has successfully learned to do, which matches how humans give reinforcement signals.

This work takes COACH and extends it to the deep RL setting, evaluating it on Minecraft. While the original COACH had an eligibility trace that helps "smooth out" human feedback over time, deep COACH requires an eligibility replay buffer. For sample efficiency, they first train an autoencoder to learn a good representation of the space (presumably using experience collected with a random policy), and feed these representations into the control policy. They reward entropy so that the policy doesn't commit to a particular behavior, making it responsive to feedback, but select actions by always picking the action with maximal probability (rather than sampling from the distribution) in order to have interpretable, consistent behavior for the human trainers to provide feedback on. They evaluate on simple navigation tasks in the complex 3D environment of Minecraft, including a task where the agent must patrol the perimeter of a room, which cannot be captured by a state-based reward function.

Rohin's opinion: I really like the focus on figuring out how humans actually provide feedback in practice; it makes a lot of sense that we provide reinforcement signals that reflect the advantage function rather than the reward function. That said, I wish the evaluation had more complex tasks, and had involved human trainers who were not authors of the paper -- it might have taken an hour or two of human time instead of 10-15 minutes, but would have been a lot more compelling.

Before continuing, I recommend reading about Simulated Policy Learning in Video Models below. As in that case, I think that you get sample efficiency here by getting a lot of "supervision information" from the pixels used to train the VAE, though in this case it's by learning useful features rather than using the world model to simulate trajectories. (Importantly, in this setting we care about sample efficiency with respect to human feedback as opposed to environment interaction.) I think the techniques used there could help with scaling to more complex tasks. In particular, it would be interesting to see a variant of deep COACH that alternated between training the VAE with the learned control policy, and training the learned control policy with the new VAE features. One issue would be that as you retrain the VAE, you would invalidate your previous control policy, but you could probably get around that (e.g. by also training the control policy to imitate itself while the VAE is being trained).

From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following (Justin Fu et al): Rewards and language commands are more generalizable than policies: "pick up the vase" would make sense in any house, but the actions that navigate to and pick up a vase in one house would not work in another house. Based on this observation, this paper proposes that we have a dataset where for several (language command, environment) pairs, we are given expert demonstrations of how to follow the command in that environment. For each data point, we can use IRL to infer a reward function, and use that to train a neural net that can map from the language command to the reward function. Then, at test time, given a language command, we can convert it to a reward function, after which we can use standard deep RL techniques to get a policy that executes the command.

The authors evaluate on a 3D house domain with pixel observations, and two types of language commands: navigation and pick-and-place. During training, when IRL needs to be done, since deep IRL algorithms are computationally expensive they convert the task into a small, tabular MDP with known dynamics for which they can solve the IRL problem exactly, deriving a gradient that can then be applied in the observation space to train a neural net that given image observations and a language command predicts the reward. Note that this only needs to be done at training time: at test time, the reward function can be used in a new environment with unknown dynamics and image observations. They show that the learned rewards generalize to novel combinations of objects within a house, as well as to entirely new houses (though to a lesser extent).

Rohin's opinion: I think the success at generalization comes primarily because of the MaxEnt IRL during training: it provides a lot of structure and inductive bias that means that the rewards on which the reward predictor is trained are "close" to the intended reward function. For example, in the navigation tasks, the demonstrations for a command like "go to the vase" will involve trajectories through the state of many houses that end up in the vase. For each demonstration, MaxEnt IRL "assigns" positive reward to the states in the demonstration, and negative reward to everything else. However, once you average across demonstrations in different houses, the state with the vase gets a huge amount of positive reward (since it is in all trajectories) while all the other states are relatively neutral (since they will only be in a few trajectories, where the agent needed to pass that point in order to get to the vase). So when this is "transferred" to the neural net via gradients, the neural net is basically "told" that high reward only happens in states that contain vases, which is a strong constraint on the learned reward.

To be clear, this is not meant as a critique of the paper: indeed, I think when you want out-of-distribution generalization, you have to do it by imposing structure/inductive bias, and this is a new way to do it that I hadn't seen before.

Using Natural Language for Reward Shaping in Reinforcement Learning (Prasoon Goyal et al): This paper constructs a dataset for grounding natural language in Atari games, and uses it to improve performance on Atari. They have humans annotate short clips with natural language: for example, "jump over the skull while going to the left" in Montezuma's Revenge. They use this to build a model that predicts whether a given trajectory matches a natural language instruction. Then, while training an agent to play Atari, they have humans give the AI system an instruction in natural language. They use their natural language model to predict the probability that the trajectory matches the instruction, and add that as an extra shaping term in the reward. This leads to faster learning.

ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning (Andrew Silva et al)

Interpretability

Visualizing memorization in RNNs (Andreas Madsen): This is a short Distill article that showcases a visualization tool that demonstrates how contextual information is used by various RNN units (LSTMs, GRUs, and nested LSTMs). The method is very simple: for each character in the context, they highlight the character in proportion to the gradient of the logits with respect to that character. Looking at this visualization allows us to see that GRUs are better at using long-term context, while LSTMs perform better for short-term contexts.

Rohin's opinion: I'd recommend you actually look at and play around with the visualization, it's very nice. The summary is short because the value of the work is in the visualization, not in the technical details.

Other progress in AI

Exploration

Learning Exploration Policies for Navigation (Tao Chen et al)

Deep Reinforcement Learning with Feedback-based Exploration (Jan Scholten et al)

Reinforcement learning

Towards Characterizing Divergence in Deep Q-Learning (Joshua Achiam et al): Summarized in the highlights!

Eighteen Months of RL Research at Google Brain in Montreal (Marc Bellemare): One approach to reinforcement learning is to predict the entire distribution of rewards from taking an action, instead of predicting just the expected reward. Empirically, this works better, even though in both cases we choose the action with highest expected reward. This blog post provides an overview of work at Google Brain Montreal that attempts to understand this phenomenon. I'm only summarizing the part that most interested me.

First, they found that in theory, distributional RL performs on par with or worse than standard RL when using either a tabular representation or linear features. They then tested this empirically on Cartpole, and found similar results: distributional RL performed worse when using tabular or linear representations, but better when using a deep neural net. This suggests that distributional RL "learns better representations". So, they visualize representations for RL on the four-room environment, and find that distributional RL captures more structured representations. Similarly this paper showed that predicting value functions for multiple discount rates is an effective way to produce auxiliary tasks for Atari.

Rohin's opinion: This is a really interesting mystery with deep RL, and after reading this post I have a story for it. Note I am far from an expert in this field and it's quite plausible that if I read the papers cited in this post I could tell this story is false, but here's the story anyway. As we saw with PreQN earlier in this issue, one of the most important aspects of deep RL is how information about one (s, a) pair is used to generalize to other (s, a) pairs. I'd guess that the benefit from distributional RL is primarily that you get "good representations" that let you do this generalization well. With a tabular representation you don't do any generalization, and with a linear feature space the representation is hand-designed by humans to do this generalization well, so distributional RL doesn't help in those cases.

But why does distributional RL learn good representations? I claim that it provides stronger supervision given the same amount of experience. With normal expected RL, the final layer of the neural net need only be useful for predicting the expected reward, but with distributional RL they must be useful for predicting all of the quantiles of the reward distribution. There may be "shortcuts" or "heuristics" that allow you to predict expected reward well because of spurious correlations in your environment, but it's less likely that those heuristics work well for all of the quantiles of the reward distribution. As a result, having to predict more things enforces a stronger constraint on what representations your neural net must have, and thus you are more likely to find good representations. This perspective also explains why predicting value functions for multiple discount rates helps with Atari, and why adding auxiliary tasks is often helpful (as long as the auxiliary task is relevant to the main task).

The important aspect here is that all of the quantiles are forcing the same neural net to learn good representations. If you instead have different neural nets predicting each quantile, each neural net has roughly the same amount of supervision as in expected RL, so I'd expect that to work about as well as expected RL, maybe a little worse since quantiles are probably harder to predict than means. If anyone actually runs this experiment, please do let me know the result!

Diagnosing Bottlenecks in Deep Q-learning Algorithms (Justin Fu, Aviral Kumar et al): While the PreQN paper used a theoretical approach to tackle Deep Q-Learning algorithms, this one takes an empirical approach. Their results:

- Small neural nets cannot represent Q*, and so have undesired bias that results in worse performance. However, they also have convergence issues, where the Q-function they actually converge to is significantly worse than the best Q-function that they could express. Larger architectures mitigate both of these problems.

- When there are more samples, we get a lower validation loss, showing that we are overfitting. Despite this, larger architectures are better, because the performance loss from overfitting is not as bad as the performance loss from having a bad bias. A good early stopping criterion could help with this.

- To study how non-stationarity affects DQL algorithms, they study a variant where the Q-function is a moving average of the past Q-functions (instead of the full update), which means that the target values don't change as quickly (i.e. it is closer to a stationary target). They find that non-stationarity doesn't matter much for large architectures.

- To study distribution shift, they look at the difference between the expected Bellman error before and after an update to the parameters. They find that distribution shift doesn't correlate much with performance and so is likely not important.

- Algorithms differ strongly in the distribution over (s, a) pairs that the DQL update is computed over. They study this in the absence of sampling (i.e. when they simply weight all possible (s, a) pairs, rather than just the ones sampled from a policy) and find that distributions that are "close to uniform" perform best. They hypothesize that this is the reason that experience replay helps -- initially an on-policy algorithm would take samples from a single policy, while experience replay adds samples from previous versions of the policy, which should increase the coverage of (s, a) pairs.

To sum up, the important factors are using an expressive neural net architecture, and designing a good sampling distribution. Inspired by this, they design Adversarial Feature Matching (AFM), which like Prioritized Experience Replay (PER) puts more weight on samples that have high Bellman error. However, unlike PER, AFM does not try to reduce distribution shift via importance sampling, since their experiments found that this was not important.

Rohin's opinion: This is a great experimental paper, there's a lot of data that can help understand DQL algorithms. I wouldn't take the results too literally, since insights on simple environments may not generalize to more complex environments. For example, they found overfitting to be an issue in their environments -- it's plausible to me that with more complex environments (think Dota/StarCraft, not Mujoco) this reverses and you end up underfitting the data you have. Nonetheless, I think data like this is particularly valuable for coming up with an intuitive theory of how deep RL works, if not a formal one.

Simulated Policy Learning in Video Models (Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Błazej Osinski et al): This blog post and the associated paper tackle model-based RL for Atari. The recent world models (AN #23) paper proposed first learning a model of the world by interacting with the environment using a random policy, and then using the model to simulate the environment and training a control policy using those simulations. (This wasn't it's main point, but it was one of the things it talked about.) The authors take this idea and put it in an iterative loop: they first train the world model using experience from a random policy, then train a policy using the world model, retrain the world model with experience collected using the newly trained policy, retrain the policy, and so on. This allows us to correct any mistakes in the world model and let it adapt to novel situations that the control policy discovers. This allows them to train agents that can play Atari with only 100K interactions with the environment (corresponding to about two hours of real-time gameplay), though the final performance is lower than the state-of-the-art achieved with model-free RL. See Import AI for more details.

Rohin's opinion: This work follows the standard pattern where model-based RL is more sample efficient but reaches worse final performance compared to model-free RL. Let's try to explain this using the same story as in the rest of this newsletter.

The sample efficiency comes from the fact that they learn a world model that can predict the future, and then use that model to solve the control problem (which has zero sample cost, since you are no longer interacting with the environment). It turns out that predicting the future is "easier" than selecting the optimal action, and so the world model can be trained in fewer samples than it would take to solve the control problem directly. Why is the world model "easier" to learn? One possibility is that solving the control problem requires you to model the world anyway, and so must be a harder problem. If you don't know what your actions are going to do, you can't choose the best one. I don't find this very compelling, since there are lots of aspects of world modeling that are irrelevant to the control problem -- you don't need to know exactly how the background art will change in order to choose what action to take, but world modeling requires you to do this. I think the real reason is that world modeling benefits from much more supervision -- rather than getting a sparse reward signal over a trajectory, you get a full grid of pixels every timestep that you were supposed to predict. This gives you many orders of magnitude more "supervision information" per sample, and so it makes it easier to learn. (This is basically the same argument as in Yann Lecun's cake analogy.)

Why does it lead to worse performance overall? The policy is now being trained using rollouts that are subtly wrong, and so instead of specializing to the true Atari dynamics it will be specialized to the world model dynamics, which is going to be somewhat different and should lead to a slight dip in performance. (Imagine a basketball player having to shoot a ball that was a bit heavier than usual -- she'll probably still be good, but not as good as with a regular basketball.) In addition, since the world model is supervised by pixels, any small objects are not very important to the world model (i.e. getting them wrong does not incur much loss), even if they are very important for control. In fact, they find that bullets tend to disappear in Atlantis and Battle Zone, which is not good if you want to learn to play those games.

I'm not sure if they shared weights between the world model and the control policy. If they did, then they would also have the problem that the features that are useful for predicting the future are not the same as the features that are useful for selecting actions, which would also cause a drop in performance. My guess is that they didn't share weights for precisely this reason, but I'm not sure.

Unifying Physics and Deep Learning with TossingBot (Andy Zeng): TossingBot is a system that learns how to pick up and toss objects into bins using deep RL. The most interesting thing about it is that instead of using neural nets to directly predict actions, they are instead used to predict adjustments to actions that are computed by a physics-based controller. Since the physics-based controller generalizes well to new situations, TossingBot is also able to generalize to new tossing locations.

Rohin's opinion: This is a cool example of using structured knowledge in order to get generalization while also using deep learning in order to get performance. I also recently came across Residual Reinforcement Learning for Robot Control, which seems to have the same idea of combining deep RL with conventional control mechanisms. I haven't read either of the papers in depth, so I can't compare them, but a very brief skim suggests that their techniques are significantly different.

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (Kate Rakelly, Aurick Zhou et al)

Deep learning

Measuring the Limits of Data Parallel Training for Neural Networks (Chris Shallue and George Dahl): Consider the relationship between the size of a single batch and the number of batches needed to reach a specific performance bound when using deep learning. If all that mattered for performance was the total number of examples that you take gradient steps on (i.e. the product of these two numbers), then you would expect a perfect inverse relationship between these two quantities, which would look like a line with negative slope on a log-log plot. In this case, we could scale batch sizes up arbitrarily far, and distribute them across as many machines as necessary, in order to reduce wall clock training time. A 2x increase in batch size with twice as many machines would lead to a 2x decrease in training time. However, as you make batch sizes really large, you face the problem of stale gradients: if you had updated on the first half of the batch and then computed gradients on the second half of the batch, the gradients for the second half would be "better", because they were computed with respect to a better set of parameters. When this effect becomes significant, you no longer get the nice linear scaling from parallelization.

This post studies the relationship empirically across a number of datasets, architectures, and optimization algorithms. They find that universally, there is initially an era of perfect linear scaling as you increase batch size, followed by a region of diminishing marginal returns that ultimately leads to an asymptote where increasing batch size doesn't help at all with reducing wall-clock training time. However, the transition points between these regimes vary wildly, suggesting that there may be low hanging fruit in the design of algorithms or architectures that explicitly aim to achieve very good scaling.

Rohin's opinion: OpenAI found (AN #37) that the best predictor of the maximum useful batch size was how noisy the gradient is. Presumably when you have noisy gradients, a larger batch size helps "average out" the noise across examples. Rereading their post, I notice that they mentioned the study I've summarized here and said that their results can help explain why there's so much variance in the transition points across datasets. However, I don't think it can explain the variance in transition points across architectures. Noisy gradients are typically a significant problem, and so it would be weird if the variance in transition points across architectures were explained by the noisiness of the gradient: that would imply that two architectures reach the same final performance even though one had the problem of noisy gradients while the other didn't. So there seems to be something left to explain here.

That said, I haven't looked in depth at the data, so the explanation could be very simple. For example, maybe the transition points don't vary much across architecture and vary much more across datasets, and the variance across architecture is small enough that its effect on performance is dwarfed by all the other things that can affect the performance of deep learning systems. Or perhaps while the noisiness of the gradient is a good predictor of the maximum batch size, it still only explains say 40% of the effect, and so variance across architectures is totally compatible with factors other than the gradient noise affecting the maximum batch size.

11