[AN #161]: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos (Annie S. Chen et al) (summarized by Sudhanshu): This work demonstrates a method that learns a generalizable multi-task reward function in the context of robotic manipulation; at deployment, this function can be conditioned on a human demonstration of an unseen task to generate reward signals for the robot, even in new environments.

A key insight here was to train a discriminative model that learned whether two given video clips were performing the same actions. These clips came from both a (large) dataset of human demonstrations and a relatively smaller set of robot expert trajectories, and each clip was labelled with a task-id. This training pipeline thus leveraged huge quantities of extant human behaviour from a diversity of viewpoints to learn a metric of 'functional similarity' between pairs of videos, independent of whether they were executed by human or machine.

Once trained, this model (called the 'Domain-agnostic Video Discriminator' or DVD) can determine if a candidate robotic behaviour is similar to a desired human-demonstrated action. Such candidates are drawn from an action-conditioned video predictor, and the best-scoring action sequence is selected for execution on the (simulated or real) robot.

Read more: Paper

Sudhanshu's opinion: Performance increased with the inclusion of human data, even that from unrelated tasks, so one intuition I updated on was "More data is better, even if it's not perfect". This also feels related to "Data as regularization": to some extent, noisy data combats model overconfidence, and perhaps this would play an important role in aligning future systems.

Another thing I like about such pipeline papers is the opportunity to look for where systems might break. For example, in this work, the robot does actually need (prior) experience in the test environments with which to train the video predictor to be able to generate candidate solutions at test time. So in spite of the given result -- that DVD itself needs limited robot trajectories and no data from the test environments -- there's a potential point-of-failure far sooner in the pipeline, where if the robot did not have sufficient background experience with diverse situations, it might not provide any feasible candidate actions for DVD's evaluation.

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation (Ajay Mandlekar et al) (summarized by Rohin): As you might expect from the title, this paper tests imitation learning and offline RL algorithms on a benchmark of robotic manipulation tasks in which the agent must learn to perform the task from human demonstrations. Most of the experiments were done in simulation, but they did do a final training run on a real robot using hyperparameters chosen in simulation, to demonstrate that their preferred algorithms could work in such a setting as well. Some findings I found particularly interesting:

1. It is important to have models with memory: behavioral cloning (BC) does significantly better on human demonstrations when it is training an RNN model (which has memory), especially on longer-horizon tasks. This is presumably because the humans providing the demonstrations chose actions based not only on the current state but also what had happened in the past, i.e. they were non-Markovian. To test this hypothesis, we could look at machine-generated demonstrations, where you get demonstrations from an expert agent trained using RL, which I think are guaranteed to be Markovian by construction. Unfortunately, we can only get reasonable RL experts on the shorter-horizon tasks where the effect is less pronounced; in these cases BC-RNN still outperforms BC without the RNN, weakly suggesting that it isn’t just about Markovian vs. non-Markovian data.

2. Offline RL algorithms work quite well on the machine-generated data, but don’t work very well on human demonstrations. It isn’t particularly clear why this is the case.

3. In addition, offline RL struggles when used on datasets where the demonstrations are of mixed quality; in comparison BC-RNN does quite well.

4. Policy selection is a challenging problem: in these settings, the training objective (e.g. predict the expert actions) is usually not the thing you actually care about (e.g. did you successfully pick up the cup). Ideally, you would evaluate many model checkpoints throughout the training process on the metric you actually care about and then choose the one that performs best. If you instead select the model checkpoint that achieved the lowest validation loss, performance on the correct metric can decrease by 50-100%; if you always use the last checkpoint (i.e. at the end of training), performance can decrease by 10-30%. This demonstrates that it is important to choose the right model during training – but there’s no clear way to do this, as often the evaluation of a policy is non-trivial.

5. The observation space (e.g. pixel observations vs. observations of joint angles and forces) and hyperparameters (e.g. learning rate) both matter quite a lot. For example, adding information about end effectors can drop performance by 49-88% (presumably due to overfitting).

6. For complex tasks, more data provides significant improvements.

Rohin's opinion: I like these sorts of empirical benchmark papers; it feels so much easier to learn what works from such papers (relative to reading the papers in which the algorithms were introduced). This paper in particular was also especially clear and easy to read; my summary of the results is in large part just a restatement of Section 5 of the paper.

VILD: Variational Imitation Learning with Diverse-quality Demonstrations (Voot Tangkaratt et al) (summarized by Rohin): We saw in the previous summary that existing methods struggle to cope with datasets of demonstrations of mixed quality. This paper aims to tackle exactly this problem. They consider a model in which there are k demonstrators with varying levels of quality. Each demonstrator is modeled as computing an action Boltzmann-rationally and then applying some Gaussian noise; the standard deviation of the Gaussian noise differs across the demonstrators (with higher standard deviation corresponding to lower quality).

They use variational inference to derive an algorithm for this problem that infers the reward function as well as an optimal policy to go along with it. In addition, they oversample data from the demonstrations that the model thinks are high quality in order to get more informative gradients. (They use an importance sampling correction in order to keep the gradient estimate unbiased.)

Their experiments on machine-generated data show significant improvement over existing imitation learning algorithms, both in the case where we synthetically add Gaussian noise (matching the model) and when we add time-signal-dependent (TSD) noise (in which case the model is misspecified).

Rohin's opinion: This seems like a reasonable approach. It has a similar ethos as Boltzmann rationality. In Boltzmann rationality, it seems like all you need to do is model the demonstrator as having some noise but still being more likely to choose higher-reward actions, and that’s enough to get decent performance; similarly here you just need to model different demonstrators as applying different amounts of Gaussian noise to the optimal policy and that’s enough to distinguish good from bad.

Note that, while the experimental results are good, the paper doesn’t have experiments with real human demonstrations; as we saw in the previous summary these can often be quite different (in ways that matter) from machine-generated demonstrations.

IQ-Learn: Inverse soft-Q Learning for Imitation (Divyansh Garg et al) (summarized by Zach): A popular way to view imitation learning is as a distribution matching problem. In this approach, the goal is to have the imitator induce a state-action distribution that closely matches that of the expert. Methods such as GAIL (AN #17) and Value-DICE (AN #98) propose adversarial methods, similar to GANs, to carry out the distribution matching. However, such methods can be difficult to train due to the difficulty of solving saddle-point problems. In this paper, the authors present a non-adversarial method that allows distribution matching to be carried out in a fully offline and non-adversarial fashion. They do this by building on Value-DICE and introducing a soft-Bellman operator which allows the saddle-point problem to be reduced to estimating a Q-function. In fact, the authors show this reduction is related to off-policy RL algorithms with the reward set to zero. In experiments, the method is shown to be competitive with other state-of-the-art methods in both the offline and image-based setting.

Zach's opinion: I found the experimental comparisons to be a bit misleading. If you compare the results in this paper with the results in the original ValueDICE and SQIL paper, the algorithms are closer in performance than this paper implies. It's also not clear that you need to use the soft-Bellman operator especially in the continuous-control setting which was what ValueDICE originally focused on. However, overall, non-adversarial methods are generally more stable so I found this paper to be a good contribution.

Learning the Preferences of Uncertain Humans with Inverse Decision Theory (Cassidy Laidlaw et al) (summarized by Zach): Human preference learning has been studied from various perspectives such as inverse reinforcement learning (IRL) and active learning. However, the IRL problem is underspecified, that is, even with access to the full behavioral policy, you cannot uniquely determine the preferences that led to that policy. Meanwhile, active learning often has a description-experience gap: the stated preferences in response to a question in active learning may not be the same as the preferences that would be revealed from demonstrations.

In this work, the authors study an alternative paradigm known as inverse decision theory (IDT) that aims to learn a loss function for binary classification using strictly observational data while returning unique solutions. (Such a loss function effectively specifies how good correct predictions are and how bad incorrect predictions are.) The authors show that preferences can be uniquely determined whenever there is uncertainty in the classification problem. This happens because we need observations predicting classes at different levels of certainty to identify a transition point where we switch from predicting one class over another. In contrast, without uncertainty, we won’t be able to precisely identify that threshold. The authors then strengthen this result by showing it holds even in cases where the underlying decision rule is sub-optimal.

The authors argue that since learning could be done efficiently in this setting, IDT could have broader applicability. For example, one application to fairness could be to collect a set of decisions from a trained classifier, split them across groups (e.g. race or gender), and compare the inferred loss functions to detect bias in the trained classifier.

Zach's opinion: The paper is organized well and I found the examples to be interesting in their own right. On the other hand, binary classification is a fairly restrictive setting and IDT in this paper seems to require access to class posterior probabilities. These probabilities generally are not easy to estimate. Moreover, if you have access to that function it seems you could elicit the loss function with exponentially fewer human observations by sorting/sub-sampling the class posterior values. Despite these shortcomings, I'm interested to see how this work can be extended further.

Reward Identification in Inverse Reinforcement Learning (Kuno Kim et al) (summarized by Rohin): As mentioned in the previous summary, a major challenge with inverse reinforcement learning is that rewards are unidentifiable: even given perfect knowledge of the policy, we cannot recover the reward function that produces it. This is partly for boring reasons like “you can add a constant to a reward function without changing anything”, but even if you exclude those kinds of reasons, others remain. For example, since every policy is optimal for the constant reward function, the zero reward function can rationalize any policy.

For this reason, the authors instead focus on the case where we assume the policy is a solution to the maximum entropy RL objective (you can think of this as Boltzmann rationality, if you’re more familiar with that). The solution to MaxEnt RL for a zero reward is a uniformly random policy, so the zero reward no longer rationalizes every policy. Perhaps rewards are identifiable in this case?

(You might have noticed that I neglected the question of whether the MaxEnt RL model was better than the regular RL model in cases that we care about. As far as I can tell the paper doesn’t address this. But if they did so, perhaps they might say that in realistic situations we are dealing with boundedly-rational agents, and Boltzmann rationality / MaxEnt RL is a common model in such situations.)

Well, we still need to deal with the “additive constant” argument. To address this, the authors define two reward functions to be equivalent if they agree up to an additive constant. There are actually two versions of this: “trajectory equivalence” means that they agree on the rewards for all feasible trajectories, while “state-action equivalence” means that they agree on the rewards for all state-action pairs. Correspondingly, “weak identifiability” means that you can identify rewards up to trajectory equivalence, while “strong identifiability” means you can identify them up to state-action equivalence. Strong identifiability implies weak identifiability, since if you know the rewards on state-action pairs, that determines the reward for any given trajectory.

All deterministic MDPs are weakly identifiable under the MaxEnt RL model, since in this case a trajectory τ is selected with probability p(τ) proportional to exp(r(τ)), so the probability p(τ) can then be inverted to get r(τ). However, stochastic MDPs need not be weakly identifiable. Imagine an MDP in which no matter what you do, you are teleported to a random state. In such an MDP, the agent has no control over the trajectory, and so the MaxEnt RL objective will choose a uniformly random policy, no matter what the reward is, and so the reward must be unidentifiable.

Now the question is, assuming you have weak identifiability (i.e. you can infer r(τ)), when do you also have strong identifiability (i.e. you can infer r(s, a))? Intuitively, there needs to be sufficient “diversity” of feasible trajectories τ that cover a wide variety of possible (s, a) pairs, so that you can use the r(τ) values to infer the r(s, a) values. The authors prove a sufficient condition called “coverage”: there exists some timestep T, such that for every state there is some feasible trajectory that reaches that state at timestep T. (They also require the horizon to be at least 2T.) Coverage can be a fairly easy property to have; for example, if you can get to any state from any other state in some number of steps, then all you need is a single self-loop somewhere in the MDP that allows you to “waste time” so that you reach the desired state at exactly timestep T (instead of reaching too early).

Read more: Identifiability in inverse reinforcement learning has the same motivation and studies a very similar setting, but has a few different results. It's also easier to read if you're not as familiar with MaxEnt methods.

NEWS

Cooperative AI Workshop 2021 (summarized by Rohin): The Cooperative AI (AN #133) NeurIPS workshop (AN #116) is running again this year! The paper submission deadline is September 25.

NIST AI Risk Management Framework (summarized by Rohin): The National Institute of Standards and Technology (NIST) has put out a formal Request For Information (RFI) in the process of developing an AI Risk Management Framework that is intended for voluntary use in order to improve trustworthiness and mitigate risks of AI systems. According to the legislative mandate, aspects of trustworthiness include “explainability, transparency, safety, privacy, security, robustness, fairness, bias, ethics, validation, verification, interpretability, and other properties related to artificial intelligence systems that are common across all sectors”. Multiple AI safety organizations are submitting responses to the RFI and would like additional AI safety researchers to engage with it. Responses are due September 15; if you'd like to help out, email Tony Barrett at tbambarrett@gmail.com.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.