[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Happy New Year!

Audio version here (may not be up yet).

Highlights

AI Alignment Podcast: On DeepMind, AI Safety, and Recursive Reward Modeling (Lucas Perry and Jan Leike) (summarized by Rohin): While Jan originally worked on theory (specifically AIXI), DQN, AlphaZero and others demonstrated that deep RL was a plausible path to AGI, and so now Jan works on more empirical approaches. In particular, when selecting research directions, he looks for techniques that are deeply integrated with the current paradigm, that could scale to AGI and beyond. He also wants the technique to work for agents in general, rather than just question answering systems, since people will want to build agents that can act, at least in the digital world (e.g. composing emails). This has led him to work on recursive reward modeling (AN #34), which tries to solve the specification problem in the SRA framework (AN #26).

Reward functions are useful because they allow the AI to find novel solutions that we wouldn't think of (e.g. AlphaGo's move 37), but often are incorrectly specified, leading to reward hacking. This suggests that we should do reward modeling, where we learn a model of the reward function from human feedback. Of course, such a model is still likely to have errors leading to reward hacking, and so to avoid this, the reward model needs to be updated online. As long as it is easier to evaluate behavior than to produce behavior, reward modeling should allow AIs to find novel solutions that we wouldn't think of.

However, we would eventually like to apply reward modeling to tasks where evaluation is also hard. In this case, we can decompose the evaluation task into smaller tasks, and recursively apply reward modeling to train AI systems that can perform those small helper tasks. Then, assisted by these helpers, the human should be able to evaluate the original task. This is essentially forming a "tree" of reward modeling agents that are all building up to the reward model for the original, hard task. While currently the decomposition would be done by a human, you could in principle also use recursive reward modeling to automate the decomposition. Assuming that we can get regular reward modeling working robustly, we then need to make sure that the tree of reward models doesn't introduce new problems. In particular, it might be the case that as you go up the tree, the errors compound: errors in the reward model at the leaves lead to slightly worse helper agents, which lead to worse evaluations for the second layer, and so on.

He recommends that rather than spending a lot of time figuring out the theoretically optimal way to address a problem, AI safety researchers should alternate between conceptual thinking and trying to make something work. The ML community errs on the other side, where they try out lots of techniques, but don't think as much about how their systems will be deployed in the real world. Jan also wants the community to focus more on clear, concrete technical explanations, rather than vague blog posts that are difficult to critique and reason about. This would allow us to more easily build on past work, rather than reasoning from first principles and reinventing the wheel many times.

DeepMind is taking a portfolio approach to AI safety: they are trying many different lines of attack, and hoping that some of them will pan out. Currently, there are teams for agent alignment (primarily recursive reward modeling), incentive theory, trained agent analysis, policy, and ethics. They have also spent some time thinking about AI safety benchmarks, as in AI Safety Gridworlds, since progress in machine learning is driven by benchmarks, though Jan does think it is quite hard to create a well-made benchmark.

Rohin's opinion: I've become more optimistic about recursive reward modeling since the original paper (AN #34), primarily (I think) because I now see more value in approaches that can be used to perform specific tasks (relative to approaches that try to infer "human values").

I also appreciated the recommendations for the AI safety community, and agree with them quite a lot. Relative to Jan, I see more value in conceptual work described using fuzzy intuitions, but I do think that more effort should be put into exposition of that kind of work.

Technical AI alignment

Learning human intent

Learning human objectives by evaluating hypothetical behaviours (Siddharth Reddy et al) (summarized by Rohin): Deep RL from Human Preferences updated its reward model by collecting human comparisons on on-policy trajectories where the reward model ensemble was most uncertain about what the reward should be. However, we want our reward model to be accurate off policy as well, even in unsafe states. To this end, we would like to train our reward model on hypothetical trajectories. This paper proposes learning a generative model of trajectories from some dataset of environment dynamics, such as safe expert demonstrations or rollouts from a random policy, and then finding trajectories that are "useful" for training the reward model. They consider four different criteria for usefulness of a trajectory: uncertain rewards (which intuitively are areas where the reward model needs training), high rewards (which could indicate reward hacking), low rewards (which increases the number of unsafe states that the reward model is trained on), and novelty (which covers more of the state space). Once a trajectory is generated, they have a human label it as good, neutral, or unsafe, and then train the reward model on these labels.

The authors are targeting an agent that can explore safely: since they already have a world model and a reward model, they use a model-based RL algorithm to act in the environment. Specifically, to act, they use gradient descent to optimize a trajectory in the latent space that maximizes expected rewards under the reward model and world model, and then take the first action of that trajectory. They argue that the world model can be trained on a dataset of safe human demonstrations (though in their experiments they use rollouts from a random policy), and then since the reward model is trained on hypothetical behavior and the model-based RL algorithm doesn't need any training, we get an agent that acts without us ever getting to an unsafe state.

Rohin's opinion: I like the focus on integrating active selection of trajectory queries into reward model training, and especially the four different kinds of active criteria that they consider, and the detailed experiments (including an ablation study) on the benefits of these criteria. These seem important for improving the efficiency of reward modeling.

However, I don't buy the argument that this allows us to train an agent without visiting unsafe states. In their actual experiments, they use a dataset gathered from a random policy, which certainly will visit unsafe states. If you instead use a dataset of safe human demonstrations, your generative model will only place probability mass on safe demonstrations, and so you'll never generate trajectories that visit unsafe states, and your reward model won't know that they are unsafe. (Maybe your generative model will generalize properly to the unsafe states, but that seems unlikely to me.) Such a reward model will either be limited to imitation learning (sticking to the same trajectories as in the demonstrations, and never finding something like AlphaGo's move 37), or it will eventually visit unsafe states.

Causal Confusion in Imitation Learning (Pim de Haan et al) (summarized by Asya): This paper argues that causal misidentification is a big problem in imitation learning. When the agent doesn't have a good model of what actions cause what state changes, it may mismodel the effects of a state change as a cause-- e.g., an agent learning to drive a car may incorrectly learn that it should turn on the brakes whenever the brake light on the dashboard is on. This leads to undesirable behavior where more information actually causes the agent to perform worse.

The paper presents an approach for resolving causal misidentification by (1) Training a specialized network to generate a "disentangled" representation of the state as variables, (2) Representing causal relationships between those variables in a graph structure, (3) Learning policies corresponding to each possible causal graph, and (4) Performing targeted interventions, either by querying an expert, or by executing a policy and observing the reward, to find the correct causal graph model.

The paper experiments with this method by testing it in environments artificially constructed to have confounding variables that correlate with actions but do not cause them. It finds that this method is successfully able to improve performance with confounding variables, and that it performs significantly better per number of queries (to an expert or of executing a policy) than any existing methods. It also finds that directly executing a policy and observing the reward is a more efficient strategy for narrowing down the correct causal graph than querying an expert.

Asya's opinion: This paper goes into detail arguing why causal misidentification is a huge problem in imitation learning and I find its argument compelling. I am excited about attempts to address the problem, and I am tentatively excited about the method the paper proposes for finding representative causal graphs, with the caveat that I don't feel equipped to evaluate whether it could efficiently generalize past the constrained experiments presented in the paper.

Rohin's opinion: While the conclusion that more information hurts sounds counterintuitive, it is actually straightforward: you don't get more data (in the sense of the size of your training dataset); you instead have more features in the input state data. This increases the number of possible policies (e.g. once you add the car dashboard, you can now express the policy "if brake light is on, apply brakes", which you couldn't do before), which can make you generalize worse. Effectively, there are more opportunities for the model to pick up on spurious correlations instead of the true relationships. This would happen in other areas of ML as well; surely someone has analyzed this effect for fairness, for example.

The success of their method over DAgger comes from improved policy exploration (for their environments): if your learned policy is primarily paying attention to the brake light, it's a very large change to instead focus on whether there is an obstacle visible, and so gradient descent is not likely to ever try that policy once it has gotten to the local optimum of paying attention to the brake light. In contrast, their algorithm effectively trains separate policies for scenarios in which different parts of the input are masked, which means that it is forced to explore policies that depend only on the brake light, and policies that depend only on the view outside the windshield, and so on. So, the desired policy has been explored already, and it only requires a little bit of active learning to identify the correct policy.

Like Asya, I like the approach, but I don't know how well it will generalize to other environments. It seems like an example of quality diversity, which I am generally optimistic about.

Humans Are Embedded Agents Too (John S Wentworth) (summarized by Rohin): Embedded agency (AN #31) is not just a problem for AI systems: humans are embedded agents too; many problems in understanding human values stem from this fact. For example, humans don't have a well-defined output channel: we can't say "anything that comes from this keyboard is direct output from the human", because the AI could seize control of the keyboard and wirehead, or a cat could walk over the keyboard, etc. Similarly, humans can "self-modify", e.g. by drinking, which often modifies their "values": what does that imply for value learning? Based on these and other examples, the post concludes that "a better understanding of embedded agents in general will lead to substantial insights about the nature of human values".

Rohin's opinion: I certainly agree that many problems with figuring out what to optimize stem from embedded agency issues with humans, and any formal account (AN #36) of this will benefit from general progress in understanding embeddedness. Unlike many others, I do not think we need a formal account of human values, and that a "common-sense" understanding will suffice, including for the embeddedness problems detailed in this post. (See also this comment thread and the next summary.)

What's the dream for giving natural language commands to AI? (Charlie Steiner) (summarized by Rohin): We could try creating AI systems that take the "artificial intentional stance" towards humans: that is, they model humans as agents that are trying to achieve some goals, and then we get the AI system to optimize for those inferred goals. We could do this by training an agent that jointly models the world and understands natural language, in order to ground the language into actual states of the world. The hope is that with this scheme, as the agent gets more capable, its understanding of what we want improves as well, so that it is robust to scaling up. However, the scheme has no protection against Goodharting, and doesn't obviously care about metaethics.

Rohin's opinion: I agree with the general spirit of "get the AI system to understand common sense; then give it instructions that it interprets correctly". I usually expect future ML research to figure out the common sense part, so I don't look for particular implementations (in this case, simultaneous training on vision and natural language), but just assume we'll have that capability somehow. The hard part is then how to leverage that capability to provide correctly interpreted instructions. It may be as simple as providing instructions in natural language, as this post suggests. I'm much less worried about instrumental subgoals in such a scenario, since part of "understanding what we mean" includes "and don't pursue this instruction literally to extremes". But we still need to figure out how to translate natural language instructions into actions.

Forecasting

Might humans not be the most intelligent animals? (Matthew Barnett) (summarized by Rohin): We can roughly separate intelligence into two categories: raw innovative capability (the ability to figure things out from scratch, without the benefit of those who came before you), and culture processing (the ability to learn from accumulated human knowledge). It's not clear that humans have the highest raw innovative capability; we may just have much better culture. For example, feral children raised outside of human society look very "unintelligent", The Secret of Our Success documents cases where culture trumped innovative capability, and humans actually don't have the most neurons, or the most neurons in the forebrain.

(Why is this relevant to AI alignment? Matthew claims that it has implications on AI takeoff speeds, though he doesn't argue for that claim in the post.)

Rohin's opinion: It seems very hard to actually make a principled distinction between these two facets of intelligence, because culture has such an influence over our "raw innovative capability" in the sense of our ability to make original discoveries / learn new things. While feral children might be less intelligent than animals (I wouldn't know), the appropriate comparison would be against "feral animals" that also didn't get opportunities to explore their environment and learn from their parents, and even so I'm not sure how much I'd trust results from such a "weird" (evolutionarily off-distribution) setup.

Walsh 2017 Survey (Charlie Giattino) (summarized by Rohin): In this survey, AI experts, robotics experts, and the public estimated a 50% chance of high-level machine intelligence (HLMI) by 2061, 2065, and 2039 respectively. The post presents other similar data from the survey.

Rohin's opinion: While I expected that the public would expect HLMI sooner than AI experts, I was surprised that AI and robotics experts agreed so closely -- I would have thought that robotics experts would have longer timelines.

Field building

What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address. (David Krueger) (summarized by Rohin): When making the case for work on AI x-risk to other ML researchers, what should we focus on? This post suggests arguing for three core claims:

1. Due to Goodhart's law, instrumental goals, and safety-performance trade-offs, the development of advanced AI increases the risk of human extinction non-trivially.

2. To mitigate this x-risk, we need to know how to build safe systems, know that we know how to build safe systems, and prevent people from building unsafe systems.

3. So, we should mitigate AI x-risk, as it is impactful, neglected, and challenging but tractable.

Rohin's opinion: This is a nice concise case to make, but I think the bulk of the work is in splitting the first claim into subclaims: this is the part that is usually a sticking point (see also the next summary).

Miscellaneous (Alignment)

A list of good heuristics that the case for AI x-risk fails (David Krueger) (summarized by Flo): Because human attention is limited and a lot of people try to convince us of the importance of their favourite cause, we cannot engage with everyone’s arguments in detail. Thus we have to rely on heuristics to filter out insensible arguments. Depending on the form of exposure, the case for AI risks can fail on many of these generally useful heuristics, eight of which are detailed in this post. Given this outside view perspective, it is unclear whether we should actually expect ML researchers to spend time evaluating the arguments for AI risk.

Flo's opinion: I can remember being critical of AI risk myself for similar reasons and think that it is important to be careful with the framing of pitches to avoid these heuristics from firing. This is not to say that we should avoid criticism of the idea of AI risk, but criticism is a lot more helpful if it comes from people who have actually engaged with the arguments.

Rohin's opinion: Even after knowing the arguments, I find six of the heuristics quite compelling: technology doomsayers have usually been wrong in the past, there isn't a concrete threat model, it's not empirically testable, it's too extreme, it isn't well grounded in my experience with existing AI systems, and it's too far off to do useful work now. All six make me distinctly more skeptical of AI risk.

Other progress in AI

Reinforcement learning

Procgen Benchmark (Karl Cobbe et al) (summarized by Asya): Existing game-based benchmarks for reinforcement learners suffer from the problem that agents constantly encounter near-identical states, meaning that the agents may be overfitting and memorizing specific trajectories rather than learning a general set of skills. In an attempt to remedy this, in this post OpenAI introduces Procgen Benchmark, 16 procedurally-generated video game environments used to measure how quickly a reinforcement learning agent learns generalizable skills.

The authors conduct several experiments using the benchmark. Notably, they discover that:

- Agents strongly overfit to small training sets and need access to as many as 10,000 levels to generalize appropriately.

- After a certain threshold, training performance improves as the training set grows, counter to trends in other supervised learning tasks.

- Using a fixed series of levels for each training sample (as other benchmarks do) makes agents fail to generalize to randomly generated series of levels at test time.

- Larger models improve sample efficiency and generalization.

Asya's opinion: This seems like a useful benchmark. I find it particularly interesting that their experiment testing non-procedurally generated levels as training samples implies huge overfitting effects in existing agents trained in video-game environments.

Adaptive Online Planning for Continual Lifelong Learning (Kevin Lu et al) (summarized by Nicholas): Lifelong learning is distinct from standard RL benchmarks because

1. The environment is sequential rather than episodic; it is never reset to a new start state.

2. The current transition and reward function are given, but they change over time.

Given this setup, there are two basic approaches: first, run model-free learning on simulated future trajectories and rerun it every time the dynamics change, and second, run model-based planning on the current model. If you ignore computational constraints, these should be equivalent; however, in practice, the second option tends to be more computationally efficient. The contribution of this work is to make this more efficient, rather than improving final performance, by starting with the second option and then using model-free learning to “distill” the knowledge produced by the model-based planner allowing for more efficient planning in the future.

Specifically, Adaptive Online Planning (AOP) balances between the model-based planner MPPI (a variant of MPC) and the model-free algorithm TD3. MPPI uses the given model to generate a trajectory up to a horizon and then uses an ensemble of value functions to estimate the cumulative reward. This knowledge is then distilled into TD3 for later use as a prior for MPPI. During future rollouts, the variance and Bellman error of the value function ensemble are used to determine how long the horizon should be, and therefore how much computation is used.

Nicholas's opinion: I agree that episodic training and fixed world dynamics seem like unlikely conditions for most situations we would expect agents to encounter in the real world. Accounting for them seems particularly important to ensure safe exploration and robustness to distributional shift, and I think that these environments could serve as useful benchmarks for these safety problems as well.