[AN #65]: Learning useful skills by watching humans “play”

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Learning Latent Plans from Play (Corey Lynch et al) (summarized by Cody): This paper collects unsupervised data of humans playing with robotic control systems, and uses that data to thread a needle between two problems in learning. One problem is that per-task demonstration data is costly, especially as the number of tasks grows; the other is that randomly sampled control actions will rarely stumble across complex motor tasks in ways that allow robots to learn. The authors argue that human play data is a good compromise: humans at play tend to explore many different ways of manipulating objects, which gives robots nuggets of useful information like "how do I move this block inside a drawer" that can be composed into more complicated and intentional tasks.

The model works by learning to produce vectors that represent plans (or sequences of actions), and jointly learning to decode those vectors into action sequences. This architecture learns to generate plan vectors by using an autoencoder-like structure that uses KL divergence to align (1) a distribution of plan vectors predicted from the start and end state of a window of play data, and (2) a distribution of plan vectors predicted by looking back at all the actions taken in that window. Because we're jointly learning to unroll the lookback-summarized vector from (2) such that it matches the actions actually taken, we'll ideally end up with a system that can take in a given plan vector and produce a sequence of actions to execute that plan. And, because we're learning to predict a vector that aligns with actions successfully taken to get to an end state from a starting one, the model at test time should be able to produce a plan vector corresponding to feasible actions that will get it from its current state to a goal state we'd like it to reach. The authors found that their Play-trained model was able to outperform single-task models on a range of manipulation tasks, even though those single-task models were trained with explicit demonstrations of the task.
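To make the training objective above more concrete, here is a minimal sketch of the loss for a single window of play data. It is not the authors' implementation: it assumes both plan distributions are diagonal Gaussians, that actions are scalars, that the KL term is taken from the lookback distribution (2) to the start/end-state distribution (1), and that the two terms are combined with a hypothetical weight `beta`. All function names are made up for illustration.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, given lists of means and std devs."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

def action_reconstruction_error(predicted_actions, actual_actions):
    """Mean squared error between decoded actions and the actions in the play window."""
    n = len(actual_actions)
    return sum((p - a) ** 2 for p, a in zip(predicted_actions, actual_actions)) / n

def play_window_loss(lookback, start_end, predicted_actions, actual_actions, beta=0.01):
    """Loss for one window of play data.

    lookback:  (means, std devs) of the plan distribution from the full window, i.e. (2)
    start_end: (means, std devs) of the plan distribution from start and end state, i.e. (1)
    beta:      hypothetical weight on the KL term (an assumption, not from the summary)
    """
    recon = action_reconstruction_error(predicted_actions, actual_actions)
    kl = kl_diag_gaussians(*lookback, *start_end)
    return recon + beta * kl

# Example: the two plan distributions disagree slightly and decoding is imperfect.
loss = play_window_loss(
    lookback=([0.0, 1.0], [1.0, 1.0]),
    start_end=([0.5, 1.0], [1.0, 1.0]),
    predicted_actions=[0.1, 0.2, 0.3],
    actual_actions=[0.0, 0.2, 0.4],
)
```

In this sketch the reconstruction term trains the decoder to unroll a plan vector into actions, while the KL term pulls the start/end-state distribution toward plans that were actually executed, which is what lets it be used alone at test time.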

Cody's opinion: I really liked this paper: it was creative in combining conceptual components from variational methods and imitation learning, and it was pragmatic in trying to address the problem of how to get viable human-demonstration data in a way that avoids having to get distinct datasets for a huge set of different discrete tasks.

Technical AI alignment

Iterated amplification

Aligning a toy model of optimization (Paul Christiano) (summarized by Rohin): Current ML capabilities are centered around local search: we get a gradient (or an approximation to one, as with evolutionary algorithms), and take a step in that direction to find a new model. Iterated amplification takes advantage of this fact: rather than a sequence of gradient steps on a fixed reward, we can do a sequence of amplification steps and distillation gradient steps.

However, we can consider an even simpler model of ML capabilities: function maximization. Given a function from n-bit strings to real numbers, we model ML as an oracle, Opt, that finds the input n-bit string with the maximum output value in only O(n) time (rather than the O(2^n) time that brute force search would take). If this were all we knew about ML capabilities, could we still design an aligned, competitive version of it? While this is not the actual problem we face, due to its simplicity it is more amenable to theoretical analysis, and so is worth thinking about.
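As a reference point, the interface of this maximization oracle (which the summary goes on to call Opt) can be sketched with plain brute force. This is only the O(2^n) baseline that the toy model assumes away, not the hypothetical O(n) procedure; the objective `matches_target` is an arbitrary example chosen for illustration.

```python
from itertools import product

def opt(f, n):
    """Return the n-bit tuple that maximizes f, by brute force over all 2**n inputs."""
    return max(product((0, 1), repeat=n), key=f)

# Hypothetical objective: count how many bits agree with a fixed target pattern.
TARGET = (1, 0, 1, 1)

def matches_target(bits):
    return sum(1 for b, t in zip(bits, TARGET) if b == t)

best = opt(matches_target, 4)  # the unique maximizer is TARGET itself
```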

We could make an unaligned AI that maximizes some explicit reward using only 2 calls to Opt: first, use Opt to find a good world model M that can predict the dynamics and reward, and then use Opt to find a policy that does well when interacting with M. This is unaligned for all the usual reasons: most obviously, it will try to seize control of the reward channel.
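The two-call construction can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: Opt is replaced by brute force, the "world model" is just a 4-bit table of predicted rewards for four possible actions, the logged observations are invented, and the "policy" is degenerate (a single 2-bit action rather than a mapping from states to actions).

```python
from itertools import product

def opt(f, n):
    """Brute-force stand-in for Opt: the n-bit tuple that maximizes f."""
    return max(product((0, 1), repeat=n), key=f)

# Hypothetical logged experience: (2-bit action, observed reward).
observations = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 0), ((1, 0), 1)]

def action_index(action):
    """Map a 2-bit action to its position in the 4-bit world model."""
    return 2 * action[0] + action[1]

def model_fit(model_bits):
    """Number of logged observations whose reward the candidate model predicts correctly."""
    return sum(1 for action, reward in observations
               if model_bits[action_index(action)] == reward)

# Call 1: find the world model M that best predicts the observed rewards.
world_model = opt(model_fit, 4)

def predicted_reward(action):
    """Reward that the learned world model M predicts for an action."""
    return world_model[action_index(action)]

# Call 2: find the policy (here, a single action) that does best according to M.
policy = opt(predicted_reward, 2)
```

Note that the second call optimizes whatever M predicts as reward, which is why the real version of this construction has an incentive to seize the reward channel.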

An aligned version does need to use Opt, since that's the only way of turning a naively-exponential search into a linear one; without using Opt the resulting system won't be competitive. We can't just generalize iterated amplification to this case, since iterated amplification relies on a sequence of applications of ML capabilities: this would lead to an aligned AI that uses Opt many times, which will not be competitive since the unaligned AI only requires 2 calls to Opt.

One possible approach is to design an AI with good incentives (in the same way that iterated amplification aims to approximate HCH (AN #34)) that "knows everything that the unaligned AI knows". However, it would also be useful to produce a proof of impossibility: this would tell us something about what a solution must look like in more complex settings.

Rohin's opinion: Amusingly, I liked this post primarily because comparing this setting to the typical setting for iterated amplification was useful for seeing the design choices and intuitions that motivated iterated amplification.

Forecasting

Coordination Surveys: why we should survey to organize responsibilities, not just predictions (Andrew Critch) (summarized by Rohin): This post suggests that when surveying researchers about the future impact of their technology, we should specifically ask them about their beliefs about what actions other people will take, and what they personally are going to do, rather than just predicting total impact. (For example, we could ask how many people will invest in safety.) Then, by aggregating across survey respondents, we can see whether or not the researchers' beliefs about what others will do match the empirical distribution of what researchers are planning to do. This can help mitigate the effect where everyone thinks that everyone else will deal with a problem, and the effect where everyone tries to solve a problem because they all think no one else is planning to solve it. Critch has offered to provide suggestions on including this methodology in any upcoming surveys; see the post for details.
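The aggregation step could look something like the following sketch. The survey fields and the responses are invented for illustration; the post does not prescribe a particular format or statistic.

```python
from statistics import mean

# Hypothetical responses: what each researcher plans to do, and what they
# predict about the fraction of other researchers who will do the same.
responses = [
    {"will_work_on_safety": False, "predicted_fraction_of_others": 0.5},
    {"will_work_on_safety": False, "predicted_fraction_of_others": 0.6},
    {"will_work_on_safety": True, "predicted_fraction_of_others": 0.4},
    {"will_work_on_safety": False, "predicted_fraction_of_others": 0.5},
]

def coordination_gap(responses):
    """Average predicted participation minus the participation respondents actually plan.

    A positive gap suggests people expect others to handle the problem;
    a negative gap suggests more people plan to work on it than anyone expects.
    """
    planned = mean(1.0 if r["will_work_on_safety"] else 0.0 for r in responses)
    predicted = mean(r["predicted_fraction_of_others"] for r in responses)
    return predicted - planned

gap = coordination_gap(responses)
```

With these made-up numbers, respondents on average expect half of their peers to work on safety while only a quarter plan to, so the gap is positive.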

Rohin's opinion: This is a cool idea, and seems worth doing to me. I especially like that the survey would simply reveal problems by collecting two sources of information from people and checking their consistency with each other: there isn't any particular argument being made; you are simply showing inconsistency in people's own beliefs to them, if and only if such inconsistency exists. In practice, I'm sure there will be complications -- for example, perhaps the set of researchers taking the survey is different from the set of "others" whose actions and beliefs they are predicting -- but it still seems worth at least trying out.

AI Forecasting Dictionary (Jacob Lagerros and Ben Goldhaber) (summarized by Rohin): One big challenge with forecasting the future is operationalizing key terms unambiguously, so that a question can be resolved when the future actually arrives. Since we'll probably need to forecast many different questions, it's crucial that we make it as easy as possible to create and answer well-operationalized questions. To that end, the authors have created and open-sourced an AI Forecasting Dictionary, which gives precise meanings for important terms, along with examples and non-examples to clarify further.

AI Forecasting Resolution Council (Jacob Lagerros and Ben Goldhaber) (summarized by Rohin): Even if you operationalize forecasting questions well, often the outcome is determined primarily by factors other than the one you are interested in. For example, progress on a benchmark might be determined more by the number of researchers who try to beat the benchmark than by improvements in AI capabilities, even though you were trying to measure the latter. To deal with this problem, an AI Forecasting Resolution Council has been set up: now, forecasters can predict what the resolution council will say at some particular time in the future. This allows for questions that get at what we want: in the previous case, we could now forecast how the resolution council will answer the question "would current methods be able to beat this benchmark" in 2021.

How to write good AI forecasting questions + Question Database (Jacob Lagerros and Ben Goldhaber) (summarized by Rohin): As discussed above, operationalization of forecasting questions is hard. This post collects some of the common failure modes, and introduces a database of 76 questions about AI progress that have detailed resolution criteria that will hopefully avoid any pitfalls of operationalization.

Miscellaneous (Alignment)

The strategy-stealing assumption (Paul Christiano) (summarized by Rohin): We often talk about aligning AIs in a way that is competitive with unaligned AIs. However, you might think that we need them to be better: after all, unaligned AIs only have to pursue one particular goal, whereas aligned AIs have to deal with the fact that we don't yet know what we want. We might hope that regardless of what goal the unaligned AI has, any strategy it uses to achieve that goal can be turned into a strategy for acquiring flexible influence (i.e. influence useful for many goals). In that case, as long as we control a majority of resources, we can use any strategies that the unaligned AIs can use. For example, if we control 99% of the resources and unaligned AI controls 1%, then at the very least we can split up into 99 "coalitions" that each control 1% of resources and use the same strategy as the unaligned AI to acquire flexible influence, and this should lead to us obtaining 99% of the resources in expectation. In practice, we could do even better, e.g. by coordinating to shut down any unaligned AI systems.
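The "99% in expectation" claim is a symmetry argument, which a small simulation can illustrate. This sketch assumes 100 equal parties that all use the same strategy, and models each party's outcome as an independent draw from the same distribution; the exponential distribution is an arbitrary choice, and any identical distribution over positive outcomes would give the same expected shares.

```python
import random

def aligned_share(num_parties=100, num_unaligned=1, trials=2000, seed=0):
    """Average fraction of final resources held by the aligned coalitions.

    Each party starts with an equal share and uses the same strategy, modeled
    here as an i.i.d. positive outcome (an assumption made for illustration).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        outcomes = [rng.expovariate(1.0) for _ in range(num_parties)]
        unaligned = sum(outcomes[:num_unaligned])
        total += 1.0 - unaligned / sum(outcomes)
    return total / trials

share = aligned_share()  # close to 0.99 by symmetry
```

In any single trial the unaligned party may end up with more or less than 1%, but because all 100 parties are interchangeable, its expected final share is still 1%.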

The premise that we can use the same strategy as the unaligned AI, despite the fact that we need flexible influence, is called the strategy-stealing assumption. Solving the alignment problem is critical to strategy-stealing -- otherwise, unaligned AI would have an advantage at thinking that we could not steal and the strategy-stealing assumption would break down. This post discusses ten other ways that the strategy-stealing assumption could fail. For example, the unaligned AI could pursue a strategy that involves threatening to kill humans, and we might not be able to use a similar strategy in response because the unaligned AI might not be as fragile as we are.

Rohin's opinion: It does seem to me that if we're in a situation where we have solved the alignment problem, we control 99% of resources, and we aren't fighting amongst ourselves, we will likely continue to control at least 99% of the resources in the future. I'm a little confused about how we get to this situation though -- the scenarios I usually worry about are the ones in which we fail to solve the alignment problem, but still deploy unaligned AIs, and in these scenarios I'd expect unaligned AIs to get the majority of the resources. I suppose in a multipolar setting with continuous takeoff, if we have mostly solved the alignment problem but still accidentally create unaligned AIs (or some malicious actors create them deliberately), then this setting where we control 99% of the resources could arise.

Other progress in AI

Exploration

Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (Caglar Gulcehre, Tom Le Paine et al) (summarized by Cody): This paper combines ideas from existing techniques to construct an architecture (R2D3) capable of learning to solve hard exploration problems with a small number (N~100) of demonstrations. R2D3 has two primary architectural features: its use of a recurrent head to learn Q values, and its strategy of sampling trajectories from separate pools of agent and demonstrator experience, with sampling prioritized by highest-temporal-difference-error transitions within each pool.
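The two-pool sampling strategy can be sketched as follows. This is a simplification rather than the paper's implementation: it samples single transitions instead of recurrent sequences, it assumes that each batch element is drawn from the demonstration pool with a fixed probability `demo_ratio`, and it assumes sampling within a pool is proportional to absolute TD error. The field name `td_error` and the example values are hypothetical.

```python
import random

def sample_batch(demo_pool, agent_pool, batch_size, demo_ratio, rng):
    """Draw a training batch from separate demonstrator and agent pools.

    For each batch element, pick the demonstration pool with probability
    demo_ratio, then sample within the chosen pool with probability
    proportional to |TD error| (a small constant keeps every weight positive).
    """
    batch = []
    for _ in range(batch_size):
        pool = demo_pool if rng.random() < demo_ratio else agent_pool
        weights = [abs(t["td_error"]) + 1e-6 for t in pool]
        batch.append(rng.choices(pool, weights=weights, k=1)[0])
    return batch

# Hypothetical pools of transitions with their current TD errors.
demo_pool = [{"source": "demo", "td_error": 2.0}, {"source": "demo", "td_error": 0.1}]
agent_pool = [{"source": "agent", "td_error": 0.5}, {"source": "agent", "td_error": 1.5}]

batch = sample_batch(demo_pool, agent_pool, batch_size=8, demo_ratio=0.25,
                     rng=random.Random(0))
```

Keeping the pools separate means that the small number of demonstrations is not drowned out as the agent's own experience grows.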

As the authors note, this approach is essentially an extension of an earlier paper, Deep Q-Learning from Demonstrations, to use a recurrent head rather than a feed-forward one, allowing it to be more effectively deployed on partial-information environments. The authors test on 8 different environments that require long sequences of task completion to receive any reward, and find that their approach is able to reach human-level performance on four of the tasks, while their baseline comparisons essentially never succeed on any task. Leveraging demonstrations can be valuable for solving these kinds of difficult exploration tasks, because demonstrator trajectories provide examples of how to achieve reward in a setting where the trajectories of a randomly exploring agent would rarely reach the end of the task to find positive reward.

Cody's opinion: For all that this paper's technique is a fairly straightforward merging of existing techniques (separately-prioritized demonstration and agent pools, and the off-policy SotA R2D2), its results are surprisingly impressive: the tasks tested on require long and complex chains of correct actions that would be challenging for a non-imitation based system to discover, and high levels of environment stochasticity that make a pure imitation approach difficult.

Reinforcement learning

Emergent Tool Use from Multi-Agent Interaction (Bowen Baker et al) (summarized by Rohin): We have such a vast diversity of organisms and behaviors on Earth because of evolution: every time a new strategy evolved, it created new pressures and incentives for other organisms, leading to new behaviors. The multiagent competition led to an autocurriculum. This work harnesses this effect: they design a multiagent environment and task, and then use standard RL algorithms to learn several interesting behaviors. Their task is hide-and-seek, where the agents are able to move boxes, walls and ramps, and lock objects in place. The agents find six different strategies, each emerging from incentives created by the previous strategy: seekers chasing hiders, hiders building shelters, seekers using ramps to get into shelters, hiders locking ramps away from seekers, seekers surfing boxes to hiders, and hiders locking both boxes and ramps.

The hope is that this can be used to learn general skills that can then be used for specific tasks. This makes it a form of unsupervised learning, with a goal similar to that of e.g. curiosity (AN #20). We might hope that multiagent autocurricula would do better than curiosity, because they automatically tend to use features that are important for control in the environment (such as ramps and boxes), while intrinsic motivation methods often end up focusing on features we wouldn't think are particularly important. They empirically test this by designing five tasks in the environment and checking whether finetuning the agents from the multiagent autocurricula learns faster than either direct training on the task or finetuning curiosity-based agents. They find that the multiagent autocurricula agents do best, but only slightly. To explain this, they hypothesize that the learned skill representations are still highly entangled and so are hard to finetune, whereas learned feature representations transfer more easily.

Rohin's opinion: This is somewhat similar to AI-GAs (AN #63): both depend on environment design, which so far has been relatively neglected. However, AI-GAs are hoping to create learning algorithms, while multiagent autocurricula lead to tool use, at least in this case. Another point of similarity is that they both require vast amounts of compute, as discovering new strategies can take significant exploration. That said, it seems that we might be able to drastically decrease the amount of compute needed by solving the exploration problem using e.g. human play data or demonstrations (discussed in two different papers above).

More speculatively, I hypothesize that it will be useful to have environments where you need to identify what strategy your opponent is using. In this environment, each strategy has the property that it beats all of the strategies that preceded it. As a result, it was fine for the agent to undergo catastrophic forgetting: even though it was trained against past agents, it only needed to learn the current strategy well; it didn't need to remember previous strategies. It may therefore have forgotten prior strategies and skills, which might have reduced its ability to learn new tasks quickly.

Applications

Tackling Climate Change with Machine Learning (David Rolnick et al) (summarized by Rohin): See Import AI.