Alignment Newsletter #42

Rohin Shah

Cooperative IRL as a definition of human-AI group rationality, and an empirical evaluation of theory of mind vs. model learning in HRI

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

AI Alignment Podcast: Cooperative Inverse Reinforcement Learning (Lucas Perry and Dylan Hadfield-Menell) (summarized by Richard): Dylan puts forward his conception of Cooperative Inverse Reinforcement Learning as a definition of what it means for a human-AI system to be rational, given the information bottleneck between a human's preferences and an AI's observations. He notes that there are some clear mismatches between this problem and reality, such as the CIRL assumption that humans have static preferences, and how fuzzy the abstraction of "rational agents with utility functions" becomes in the context of agents with bounded rationality. Nevertheless, he claims that this is a useful unifying framework for thinking about AI safety.

Dylan argues that the process by which a robot learns to accomplish tasks is best described not just as maximising an objective function but instead in a way which includes the system designer who selects and modifies the optimisation algorithms, hyperparameters, etc. In fact, he claims, it doesn't make sense to talk about how well a system is doing without talking about the way in which it was instructed and the type of information it got. In CIRL, this is modeled via the combination of a "teaching strategy" and a "learning strategy". The former can take many forms: providing rankings of options, or demonstrations, or binary comparisons, etc. Dylan also mentions an extension of this in which the teacher needs to learn their own values over time. This is useful for us because we don't yet understand the normative processes by which human societies come to moral judgements, or how to integrate machines into that process.

On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al): In human-robot interaction (HRI), we often require a model of the human that we can plan against. Should we use a specific model of the human (a so-called "theory of mind", where the human is approximately optimizing some unknown reward), or should we simply learn a model of the human from data? This paper presents empirical evidence comparing three algorithms in an autonomous driving domain, where a robot must drive alongside a human.

The first algorithm, called Theory of Mind based learning, models the human using a theory of mind, infers a human reward function, and uses that to predict what the human will do, and plans around those actions. The second algorithm, called Black box model-based learning, trains a neural network to directly predict the actions the human will take, and plans around those actions. The third algorithm, model-free learning, simply applies Proximal Policy Optimization (PPO), a deep RL algorithm, to directly predict what action the robot should take, given the current state.

Quoting from the abstract, they "find that there is a significant sample complexity advantage to theory of mind methods and that they are more robust to covariate shift, but that when enough interaction data is available, black box approaches eventually dominate". They also find that when the ToM assumptions are significantly violated, then the black-box model-based algorithm will vastly surpass ToM. The model-free learning algorithm did not work at all, probably because it cannot take advantage of knowledge of the dynamics of the system and so the learning problem is much harder.

Rohin's opinion: I'm always happy to see an experimental paper that tests how algorithms perform, I think we need more of these.

You might be tempted to think of this as evidence that in deep RL, a model-based method should outperform a model-free one. This isn't exactly right. The first ToM and black box model-based algorithms use an exact model of the dynamics of the environment modulo the human, that is, they can exactly predict the next state given the current state, the robot action, and the human action. The model-free algorithm must learn this from scratch, so it isn't an apples-to-apples comparison. (Typically in deep RL, both model-based and model-free algorithms have to learn the environment dynamics.) However, you can think of the ToM as a model-based method and the Black-box model-based algorithm as a model-free algorithm, where both algorithms have to learn the human modelinstead of the more traditional environment dynamics. With that analogy, you would conclude that model-based algorithms will be more sample efficient and more performant in low-data regimes, but will be outperformed by model-free algorithms with sufficient data, which agrees with my intuitions.

This kind of effect is a major reason for my position that the first powerful AI systems will be modular (analogous to model-based systems), but that they will eventually be replaced by more integrated, end-to-end systems (analogous to model-free systems). Initially, we will be in a (relatively speaking) low-data regime, where modular systems excel, but over time there will be more data and compute and we will transition to regimes where end-to-end systems will solve the same tasks better (though we may then have modular systems for more difficult tasks).

Technical AI alignment

Iterated amplification sequence

The reward engineering problem (Paul Christiano): The reward engineering problem is the problem of designing some form of reward feedback such that if the agent learns to get high reward, then we are happy with the resulting behavior. We assume that an overseer H is providing reward for an agent A, with H smarter than A.

(Aside: Why are we assuming H is smarter, when we are typically worried about AI that is more intelligent than we are? Because we could hope to use AI tools to create this reward function, so that the human + AI team is smarter than the AI being trained. In iterated amplification, this corresponds to the distillation step.)

We could imagine three approaches: direct supervision (where H provides a reward for every action A takes), imitation learning (where A tries to mimic H, perhaps by trying to fool a discriminator that distinguishes between A and H), and inverse reinforcement learning (where we infer a reward function by observing H's behavior and use that to supervise A).

With all of these approaches, we might hope that since H is smarter than A, A will not be able to trick H with some behavior that gets high reward but is actually undesirable. However, this argument has a major flaw. When A chooses an action, the process or computation by which it chooses that action is important information, which by default (especially with neural nets) only A has access to, and so this can allow it to fool H. For example, if A is tasked with drawing original artwork, it might copy and slightly change existing artwork, which H may not realize if it cannot see how the artwork was made. We could hope to fix this problem with transparency or explanations, but this requires a lot more research.

Imitation learning and IRL have the problem that A may not be capable of doing what H does. In that case, it will be off-distribution and may have weird behavior. Direct supervision doesn't suffer from this problem, but it is very time-inefficient. This could potentially be fixed using semi-supervised learning techniques.

Rohin's opinion: The information asymmetry problem between H and A seems like a major issue. For me, it's the strongest argument for why transparency is a necessary ingredient of a solution to alignment. The argument against imitation learning and IRL is quite strong, in the sense that it seems like you can't rely on either of them to capture the right behavior. These are stronger than the arguments against ambitious value learning (AN #31) because here we assume that H is smarter than A, which we could not do with ambitious value learning. So it does seem to me that direct supervision (with semi-supervised techniques and robustness) is the most likely path forward to solving the reward engineering problem.

There is also the question of whether it is necessary to solve the reward engineering problem. It certainly seems necessary in order to implement iterated amplification given current systems (where the distillation step will be implemented with optimization, which means that we need a reward signal), but might not be necessary if we move away from optimization or if we build systems using some technique other than iterated amplification (though even then it seems very useful to have a good reward engineering solution).

Capability amplification (Paul Christiano): Capability amplification is the problem of taking some existing policy and producing a better policy, perhaps using much more time and compute. It is a particularly interesting problem to study because it could be used to define the goals of a powerful AI system, and it could be combined with reward engineering above to create a powerful aligned system. (Capability amplification and reward engineering are analogous to amplification and distillation respectively.) In addition, capability amplification seems simpler than the general problem of "build an AI that does the right thing", because we get to start with a weak policy A rather than nothing, and were allowed to take lots of time and computation to implement the better policy. It would be useful to tell whether the "hard part" of value alignment is in capability amplification, or somewhere else.

We can evaluate capability amplification using the concepts of reachability and obstructions. A policy C is reachable from another policy A if there is some chain of policies from A to C, such that at each step capability amplification takes you from the first policy to something at least as good as the second policy. Ideally, all policies would be reachable from some very simple policy. This is impossible if there exists an obstruction, that is a partition of policies into two sets L and H, such that it is impossible to amplify any policy in L to get a policy that is at least as good as some policy in H. Intuitively, an obstruction prevents us from getting to arbitrarily good behavior, and means that all of the policies in H are not reachable from any policy in L.

We can do further work on capability amplification. With theory, we can search for challenging obstructions, and design procedures that overcome them. With experiment, we can study capability amplification with humans (something which Ought is now doing).

Rohin's opinion: There's a clear reason for work on capability amplification: it could be used as a part of an implementation of iterated amplification. However, this post also suggests another reason for such work -- it may help us determine where the "hard part" of AI safety lies. Does it help to assume that you have lots of time and compute, and that you have access to a weaker policy?

Certainly if you just have access to a weaker policy, this doesn't make the problem any easier. If you could take a weak policy and amplify it into a stronger policy efficiently, then you could just repeatedly apply this policy-improvement operator to some very weak base policy (say, a neural net with random weights) to solve the full problem. (In other variants, you have a much stronger aligned base policy, eg. the human policy with short inputs and over a short time horizon; in that case this assumption is more powerful.) The more interesting assumption is that you have lots of time and compute, which does seem to have a lot of potential. I feel pretty optimistic that a human thinking for a long time could reach "superhuman performance" by our current standards; capability amplification asks if we can do this in a particular structured way.

Value learning sequence

Reward uncertainty (Rohin Shah): Given that we need human feedback for the AI system to stay "on track" as the environment changes, we might design a system that keeps an estimate of the reward, chooses actions that optimize that reward, but also updates the reward over time based on feedback. This has a few issues: it typically assumes that the human Alice knows the true reward function, it makes a possibly-incorrect assumption about the meaning of Alice's feedback, and the AI system still looks like a long-term goal-directed agent where the goal is the current reward estimate.

This post takes the above AI system and considers what happens if you have a distribution over reward functions instead of a point estimate, and during action selection you take into account future updates to the distribution. (This is the setup of Cooperative Inverse Reinforcement Learning.) While we still assume that Alice knows the true reward function, and we still require an assumption about the meaning of Alice's feedback, the resulting system looks less like a goal-directed agent.

In particular, the system no longer has an incentive to disable the system that learns values from feedback: while previously it changed the AI system's goal (a negative effect from the goal's perspective), now it provides more information about the goal (a positive effect). In addition, the system has more of an incentive to let itself be shut down. If a human is about to shut it down, it should update strongly that whatever it was doing was very bad, causing a drastic update on reward functions. It may still prevent us from shutting it down, but it will at least stop doing the bad thing. Eventually, after gathering enough information, it would converge on the true reward and do the right thing. Of course, this is assuming that the space of rewards is well-specified, which will probably not be true in practice.

Following human norms (Rohin Shah): One approach to preventing catastrophe is to constrain the AI system to never take catastrophic actions, and not focus as much on what to do (which will be solved by progress in AI more generally). In this setting, we hope that our AI systems accelerate our rate of progress, but we remain in control and use AI systems as tools that allow us make better decisions and better technologies. Impact measures / side effect penalties aim to define what not to do. What if we instead learn what not to do? This could look like inferring and following human norms, along the lines of ad hoc teamwork.

This is different from narrow value learning for a few reasons. First, narrow value learning also learns what to do. Second, it seems likely that norm inference only gives good results in the context of groups of agents, while narrow value learning could be applied in singe agent settings.

The main advantages of learning norms is that this is something that humans do quite well, so it may be significantly easier than learning "values". In addition, this approach is very similar to our ways of preventing humans from doing catastrophic things: there is a shared, external system of norms that everyone is expected to follow. However, norm following is a weaker standard than ambitious value learning (AN #31), and there are more problems as a result. Most notably, powerful AI systems will lead to rapidly evolving technologies, that cause big changes in the environment that might require new norms; norm-following AI systems may not be able to create or adapt to these new norms.

Agent foundations

CDT Dutch Book (Abram Demski)

CDT=EDT=UDT (Abram Demski)

Learning human intent

AI Alignment Podcast: Cooperative Inverse Reinforcement Learning (Lucas Perry and Dylan Hadfield-Menell): Summarized in the highlights!

On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al): Summarized in the highlights!

What AI Safety Researchers Have Written About the Nature of Human Values (avturchin): This post categorizes theories of human values along three axes. First, how complex is the description of the values? Second, to what extent are "values" defined as a function of behavior (as opposed to being a function of eg. the brain's algorithm)? Finally, how broadly applicable is the theory: could it apply to arbitrary minds, or only to humans? The post then summarizes different positions on human values that different researchers have taken.

Rohin's opinion: I found the categorization useful for understanding the differences between views on human values, which can be quite varied and hard to compare.

Risk-Aware Active Inverse Reinforcement Learning (Daniel S. Brown, Yuchen Cui et al): This paper presents an algorithm that actively solicits demonstrations on states where it could potentially behave badly due to its uncertainty about the reward function. They use Bayesian IRL as their IRL algorithm, so that they get a distribution over reward functions. They use the most likely reward to train a policy, and then find a state from which that policy has high risk (because of the uncertainty over reward functions). They show in experiments that this performs better than other active IRL algorithms.

Rohin's opinion: I don't fully understand this paper -- how exactly are they searching over states, when there are exponentially many of them? Are they sampling them somehow? It's definitely possible that this is in the paper and I missed it, I did skim it fairly quickly.

Other progress in AI

Reinforcement learning

Soft Actor-Critic: Deep Reinforcement Learning for Robotics (Tuomas Haarnoja et al)

Deep learning

A Comprehensive Survey on Graph Neural Networks (Zonghan Wu et al)

Graph Neural Networks: A Review of Methods and Applications (Jie Zhou, Ganqu Cui, Zhengyan Zhang et al)

News

Olsson to Join the Open Philanthropy Project (summarized by Dan H): Catherine Olsson‏, a researcher at Google Brain who was previously at OpenAI, will be joining the Open Philanthropy Project to focus on grant making for reducing x-risk from advanced AI. Given her first-hand research experience, she has knowledge of the dynamics of research groups and a nuanced understanding of various safety subproblems. Congratulations to both her and OpenPhil.

Announcement: AI alignment prize round 4 winners (cousin_it): The last iteration of the AI alignment prize has concluded, with awards of $7500 each to Penalizing Impact via Attainable Utility Preservation (AN #39) and Embedded Agency (AN #31, AN #32), and $2500 each to Addressing three problems with counterfactual corrigibility (AN #30) and Three AI Safety Related Ideas/Two Neglected Problems in Human-AI Safety (AN #38).

11