[AN #82]: How OpenAI Five distributed their training computation


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


Dota 2 with Large Scale Deep Reinforcement Learning (OpenAI et al) (summarized by Nicholas): In April, OpenAI Five (AN #54) defeated the world champion Dota 2 team, OG. This paper describes its training process. OpenAI et al. hand-engineered the reward function as well as some features, actions, and parts of the policy. The rest of the policy was trained using PPO with an LSTM architecture at a massive scale. They trained this in a distributed fashion as follows:

- The Controller receives and distributes the updated parameters.

- The Rollout Worker CPUs simulate the game, send observations to the Forward Pass GPUs and publish samples to the Experience Buffer.

- The Forward Pass GPUs determine the actions to use and send them to the Rollout Workers.

- The Optimizer GPUs sample experience from the Experience Buffer, calculate gradient updates, and then publish updated parameters to the Controller.

The model trained over 296 days. In that time, OpenAI needed to adapt it to changes in the code and game mechanics. This was done via model “surgery”, in which they would try to initialize a new model to maintain the same input-output mapping as the old one. When this was not possible, they gradually increased the proportion of games played with the new version over time.

Nicholas's opinion: I feel similarly to my opinion on AlphaStar (AN #73) here. The result is definitely impressive and a major step up in complexity from shorter, discrete games like chess or go. However, I don’t see how the approach of just running PPO at a large scale brings us closer to AGI because we can’t run massively parallel simulations of real world tasks. Even for tasks that can be simulated, this seems prohibitively expensive for most use cases (I couldn’t find the exact costs, but I’d estimate this model cost tens of millions of dollars). I’d be quite excited to see an example of deep RL being used for a complex real world task without training in simulation.

Technical AI alignment

Technical agendas and prioritization

Just Imitate Humans? (Michael Cohen) (summarized by Rohin): This post asks whether it is safe to build AI systems that just imitate humans. The comments have a lot of interesting debate.

Agent foundations

Conceptual Problems with UDT and Policy Selection (Abram Demski) (summarized by Rohin): In Updateless Decision Theory (UDT), the agent decides "at the beginning of time" exactly how it will respond to every possible sequence of observations it could face, so as to maximize the expected value it gets with respect to its prior over how the world evolves. It is updateless because it decides ahead of time how it will respond to evidence, rather than updating once it sees the evidence. This works well when the agent can consider the full environment and react to it, and often gets the right result even when the environment can model the agent (as in Newcomblike problems), as long as the agent knows how the environment will model it.

However, it seems unlikely that UDT will generalize to logical uncertainty and multiagent settings. Logical uncertainty occurs when you haven't computed all the consequences of your actions and is reduced by thinking longer. However, this effectively is a form of updating, whereas UDT tries to know everything upfront and never update, and so it seems hard to make it compatible with logical uncertainty. With multiagent scenarios, the issue is that UDT wants to decide on its policy "before" any other policies, which may not always be possible, e.g. if another agent is also using UDT. The philosophy behind UDT is to figure out how you will respond to everything ahead of time; as a result, UDT aims to precommit to strategies assuming that other agents will respond to its commitments; so two UDT agents are effectively "racing" to make their commitments as fast as possible, reducing the time taken to consider those commitments as much as possible. This seems like a bad recipe if we want UDT agents to work well with each other.

Rohin's opinion: I am no expert in decision theory, but these objections seem quite strong and convincing to me.

A Critique of Functional Decision Theory (Will MacAskill) (summarized by Rohin): This summary is more editorialized than most. This post critiques Functional Decision Theory (FDT). I'm not going to go into detail, but I think the arguments basically fall into two camps. First, there are situations in which there is no uncertainty about the consequences of actions, and yet FDT chooses actions that do not have the highest utility, because of their impact on counterfactual worlds which "could have happened" (but ultimately, the agent is just leaving utility on the table). Second, FDT relies on the ability to tell when someone is "running an algorithm that is similar to you", or is "logically correlated with you". But there's no such crisp concept, and this leads to all sorts of problems with FDT as a decision theory.

Rohin's opinion: Like Buck from MIRI, I feel like I understand these objections and disagree with them. On the first argument, I agree with Abram that a decision should be evaluated based on how well the agent performs with respect to the probability distribution used to define the problem; FDT only performs badly if you evaluate on a decision problem produced by conditioning on a highly improbable event. On the second class of arguments, I certainly agree that there isn't (yet) a crisp concept for "logical similarity"; however, I would be shocked if the intuitive concept of logical similarity was not relevant in the general way that FDT suggests. If your goal is to hardcode FDT into an AI agent, or your goal is to write down a decision theory that in principle (e.g. with infinite computation) defines the correct action, then it's certainly a problem that we have no crisp definition yet. However, FDT can still be useful for getting more clarity on how one ought to reason, without providing a full definition.

Learning human intent

Learning to Imitate Human Demonstrations via CycleGAN (Laura Smith et al) (summarized by Zach): Most methods for imitation learning, where robots learn from a demonstration, assume that the actions of the demonstrator and robot are the same. This means that expensive techniques such as teleoperation have to be used to generate demonstrations. This paper presents a method to engage in automated visual instruction-following with demonstrations (AVID) that works by translating video demonstrations done by a human into demonstrations done by a robot. To do this, the authors use CycleGAN, a method to translate an image from one domain to another domain using unpaired images as training data. CycleGAN allows them to translate videos of humans performing the task into videos of the robot performing the task, which the robot can then imitate. In order to make learning tractable, the demonstrations had to be divided up into 'key stages' so that the robot can learn a sequence of more manageable tasks. In this setup, the robot only needs supervision to ensure that it's copying each stage properly before moving on to the next one. To test the method, the authors have the robot retrieve a coffee cup and make coffee. AVID significantly outperforms other imitation learning methods and can achieve 70% / 80% success rate on the tasks, respectively.

Zach's opinion: In general, I like the idea of 'translating' demonstrations from one domain into another. It's worth noting that there do exist methods for translating visual demonstrations into latent policies. I'm a bit surprised that we didn't see any comparisons with other adversarial methods like GAIfO, but I understand that those methods have high sample complexity so perhaps the methods weren't useful in this context. It's also important to note that these other methods would still require demonstration translation. Another criticism is that AVID is not fully autonomous since it relies on human feedback to progress between stages. However, compared to kinetic teaching or teleoperation, sparse feedback from a human overseer is a minor inconvenience.

Read more: Paper: AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos

Preventing bad behavior

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors (Stuart Armstrong) (summarized by Flo): Suppose we were uncertain about which arm in a bandit provides reward (and we don’t get to observe the rewards after choosing an arm). Then, maximizing expected value under this uncertainty is equivalent to picking the most likely reward function as a proxy reward and optimizing that; Goodhart’s law doesn’t apply and is thus not universal. This means that our fear of Goodhart effects is actually informed by more specific intuitions about the structure of our preferences. If there are actions that contribute to multiple possible rewards, optimizing the most likely reward does not need to maximize the expected reward. Even if we optimize for that, we have a problem if value is complex and the way we do reward learning implicitly penalizes complexity. Another problem arises if the correct reward is comparatively difficult to optimize: if we want to maximize the average, it can make sense to only care about rewards that are both likely and easy to optimize. Relatedly, we could fail to correctly account for diminishing marginal returns in some of the rewards.

Goodhart effects are a lot less problematic if we can deal with all of the mentioned factors. Independent of that, Goodhart effects are most problematic when there is little middle ground that all rewards can agree on.

Flo's opinion: I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.

Rohin's opinion: Note that this post considers the setting where we have uncertainty over the true reward function, but we can't learn about the true reward function. If you can gather information about the true reward function, which seems necessary to me (AN #41), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.


AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty (Dan Hendrycks, Norman Mu et al) (summarized by Dan H): This paper introduces a data augmentation technique to improve robustness and uncertainty estimates. The idea is to take various random augmentations such as random rotations, produce several augmented versions of an image with compositions of random augmentations, and then pool the augmented images into a single image by way of an elementwise convex combination. Said another way, the image is augmented with various traditional augmentations, and these augmented images are “averaged” together. This produces highly diverse augmentations that have similarity to the original image. Unlike techniques such as AutoAugment, this augmentation technique uses typical resources, not 15,000 GPU hours. It also greatly improves generalization to unforeseen corruptions, and it makes models more stable under small perturbations. Most importantly, even as the distribution shifts and accuracy decreases, this technique produces models that can remain calibrated under distributional shift.

Miscellaneous (Alignment)

Defining and Unpacking Transformative AI (Ross Gruetzemacher et al) (summarized by Flo): The notion of transformative AI (TAI) is used to highlight that even narrow AI systems can have large impacts on society. This paper offers a clearer definition of TAI and distinguishes it from radical transformative AI (RTAI).

"Discontinuities or other anomalous patterns in metrics of human progress, as well as irreversibility are common indicators of transformative change. TAI is then broadly defined as an AI technology, which leads to an irreversible change of some important aspects of society, making it a (multi-dimensional) spectrum along the axes of extremity, generality and fundamentality. " For example, advanced AI weapon systems might have strong implications for great power conflicts but limited effects on people's daily lives; extreme change of limited generality, similar to nuclear weapons. There are two levels: while TAI is comparable to general-purpose technologies (GPTs) like the internal combustion engine, RTAI leads to changes that are comparable to the agricultural or industrial revolution. Both revolutions have been driven by GPTs like the domestication of plants and the steam engine. Similarly, we will likely see TAI before RTAI. The scenario where we don't is termed a radical shift.

Non-radical TAI could still contribute to existential risk in conjunction with other factors. Furthermore, if TAI precedes RTAI, our management of TAI can affect the risks RTAI will pose.

Flo's opinion: I enjoyed this article and the proposed factors match my intuitions. There seem to be two types of problems: extreme beliefs and concave Pareto boundaries. Dealing with the second is more important since a concave Pareto boundary favours extreme policies, even for moderate beliefs. Luckily, diminishing returns can be used to bend the Pareto boundary. However, I expect it to be hard to find the correct rate of diminishing returns, especially in novel situations.

Six AI Risk/Strategy Ideas (Wei Dai) (summarized by Rohin): This post briefly presents three ways that power can become centralized in a world with Comprehensive AI Services (AN #40), argues that under risk aversion "logical" risks can be more concerning than physical risks because they are more correlated, proposes combining human imitations and oracles to remove the human in the loop and become competitive, and suggests doing research to generate evidence of difficulty of a particular strand of research.