The Alignment Newsletter #12: 06/25/18

Rohin Shah

Highlights

Factored Cognition (Andreas Stuhlmuller): This is a presentation that Andreas has given a few times on Factored Cognition, a project by Ought that is empirically testing one approach to amplification on humans. It is inspired by HCH and meta-execution. These approaches require us to break down complex tasks into small, bite-sized pieces that can be solved separately by copies of an agent. So far Ought has built a web app in which there are workspaces, nodes, pointers etc. that can allow humans to do local reasoning to answer a big global question.

My opinion: It is unclear whether most tasks can actually be decomposed as required for iterated distillation and amplification, so I'm excited to see experiments that can answer that question! The questions that Ought is trying seem quite hard, so it should be a good test of breaking down reasoning. There's a lot of detail in the presentation that I haven't covered, so I encourage you to read it.

Summary: Inverse Reinforcement Learning

This is a special section this week summarizing some key ideas and papers behind inverse reinforcement learning, which seeks to learn the reward function an agent is optimizing given a policy or demonstrations from the agent.

Learning from humans: what is inverse reinforcement learning? (Jordan Alexander): This article introduces and summarizes the first few influential papers on inverse reinforcement learning. Algorithms for IRL attacked the problem by formulating it as a linear program, assuming that the given policy or demonstrations are optimal. However, there are many possible solutions to this problem -- for example, the zero reward makes any policy or demonstration optimal. Apprenticeship Learning via IRL lets you learn from an expert policy that is near-optimal. It assumes that the reward function is a weighted linear combination of features of the state. In this case, given some demonstrations, we only need to match the feature expectations of the demonstrations in order to achieve the same performance as the demonstrations (since the reward is linear in the features). So, the algorithm does not need to infer the underlying reward function (which may be ambiguous).
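
To make the feature-matching argument concrete, here is a minimal Python sketch. It assumes trajectories are given as lists of per-state feature vectors and uses a discount factor; the function names, the toy demonstrations, and the weight vector are all made up for illustration and are not taken from the papers. It checks that when the reward is linear in the features, the average return equals the weights dotted with the feature expectations.

```python
import numpy as np

def feature_expectations(trajectories, gamma=0.9):
    """Average discounted sum of per-state feature vectors over trajectories."""
    totals = []
    for traj in trajectories:
        feats = np.asarray(traj, dtype=float)        # shape (T, num_features)
        discounts = gamma ** np.arange(len(feats))   # shape (T,)
        totals.append(discounts @ feats)             # shape (num_features,)
    return np.mean(totals, axis=0)

def average_return(trajectories, w, gamma=0.9):
    """Average discounted return under the linear reward r(s) = w . features(s)."""
    returns = []
    for traj in trajectories:
        rewards = np.asarray(traj, dtype=float) @ w  # reward at each step
        discounts = gamma ** np.arange(len(rewards))
        returns.append(discounts @ rewards)
    return float(np.mean(returns))

# Hypothetical demonstrations: two trajectories, two binary features per state.
expert = [
    [[1, 0], [0, 1], [0, 1]],
    [[1, 0], [1, 0], [0, 1]],
]
w = np.array([0.5, -2.0])  # hypothetical reward weights

mu = feature_expectations(expert)
# Linearity: the average return depends on the demonstrations only through mu.
print(w @ mu, average_return(expert, w))
```

Because the return depends on the trajectories only through `mu`, any policy whose feature expectations match `mu` gets the same return for every possible weight vector, which is why the reward weights themselves never have to be pinned down.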

Maximum Entropy Inverse Reinforcement Learning (Brian D. Ziebart et al): While matching empirical feature counts helps to deal with the ambiguity of the reward functions, exactly matching feature counts will typically require policies to be stochastic, in which case there are many stochastic policies that get the right feature counts. How do you pick among these policies? We should choose the distribution using the principle of maximum entropy, which says to pick the stochastic policy (or alternatively, a probability distribution over trajectories) that has maximum entropy (and so the least amount of information). Formally, we’re trying to find a function P(ζ) that maximizes H(P), subject to E[features(ζ)] = empirical feature counts, and that P(ζ) is a probability distribution (sums to 1 and is non-negative for all trajectories). For the moment, we’re assuming deterministic dynamics.
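
Written out, the optimization problem described above looks as follows. The notation is chosen here for illustration: f(ζ) stands for the feature counts of trajectory ζ, and \tilde{f} for the empirical feature counts from the demonstrations.

```latex
\begin{aligned}
\max_{P}\quad & H(P) = -\sum_{\zeta} P(\zeta)\,\log P(\zeta) \\
\text{subject to}\quad & \sum_{\zeta} P(\zeta)\, f(\zeta) = \tilde{f}, \\
& \sum_{\zeta} P(\zeta) = 1, \qquad P(\zeta) \ge 0 \ \text{for all } \zeta .
\end{aligned}
```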

We solve this constrained optimization problem using the method of Lagrange multipliers. With simple analytical methods, we can get to the standard MaxEnt distribution, where P(ζ | θ) is proportional to exp(θ f(ζ)). But where did θ come from? It is the Lagrange multiplier for the constraint on expected feature counts. So we’re actually not done with the optimization yet, but this intermediate form is interesting in and of itself, because we can identify the Lagrange multiplier θ as the reward weights. Unfortunately, we can’t finish the optimization analytically -- however, we can compute the gradient for θ, which we can then use in a gradient descent algorithm. This gives the full MaxEnt IRL algorithm for deterministic environments. When you have (known) stochastic dynamics, we simply tack on the probability of the observed transitions to the model P(ζ | θ) and optimize from there, but this is not as theoretically compelling.
One warning -- when people say they are using MaxEnt IRL, they are usually actually talking about MaxCausalEnt IRL, which we'll discuss next.
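
The gradient step can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the paper's implementation: the set of possible trajectories is small enough to enumerate (the paper instead uses dynamic programming over states), dynamics are deterministic, and the feature matrix and demonstration frequencies are made up. It relies on the standard result that the gradient of the demonstrations' log-likelihood with respect to θ is the empirical feature counts minus the expected feature counts under P(ζ | θ), so the update below is gradient ascent on the log-likelihood (equivalently, gradient descent on its negative).

```python
import numpy as np

def maxent_distribution(theta, traj_features):
    """P(zeta | theta) proportional to exp(theta . f(zeta)) over enumerated trajectories."""
    logits = traj_features @ theta
    logits = logits - logits.max()   # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def fit_maxent_irl(traj_features, empirical_counts, lr=0.5, steps=2000):
    """Gradient ascent on the log-likelihood of the demonstrations."""
    theta = np.zeros(traj_features.shape[1])
    for _ in range(steps):
        probs = maxent_distribution(theta, traj_features)
        expected_counts = probs @ traj_features
        # Gradient = empirical feature counts - expected feature counts.
        theta = theta + lr * (empirical_counts - expected_counts)
    return theta

# Hypothetical toy problem: three possible trajectories, two features each.
traj_features = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])
demo_frequencies = np.array([0.6, 0.1, 0.3])   # made-up demonstration data
empirical_counts = demo_frequencies @ traj_features

theta = fit_maxent_irl(traj_features, empirical_counts)
fitted_counts = maxent_distribution(theta, traj_features) @ traj_features
print(theta, fitted_counts)
```

At convergence the gradient is zero, which is exactly the feature-matching constraint: the fitted distribution reproduces the empirical feature counts, and `theta` can be read as the reward weights.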

Modeling Interaction via the Principle of Maximum Causal Entropy (Brian D. Ziebart et al): When we have stochastic dynamics, MaxEnt IRL does weird things. It is basically trying to maximize the entropy H(A1, A2, ... | S1, S2, ...), subject to matching the feature expectations. However, when you choose the action A1, you don’t know what the future states are going to look like. What you really want to do is maximize the causal entropy, that is, you want to maximize H(A1 | S1) + H(A2 | S1, S2) + ..., so that each action’s entropy is only conditioned on the previous states, and not future states. You can then run through the same machinery as for MaxEnt IRL to get the MaxCausalEnt IRL algorithm.
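
As background that goes beyond the summary above: a well-known consequence of the maximum causal entropy formulation is that the resulting policy is a softmax over "soft" Q-values, computed by a Bellman backup in which the max over actions is replaced by a log-sum-exp. The Python sketch below illustrates that backup for a finite horizon. It assumes a known tabular transition model and a reward that depends only on the current state; the function name, the array layout, and the toy MDP are hypothetical, and fitting the reward weights is left out.

```python
import numpy as np

def soft_value_iteration(T, reward, horizon):
    """Finite-horizon soft Bellman backup.

    T: array of shape (S, A, S) with T[s, a, s2] = P(s2 | s, a).
    reward: array of shape (S,) giving the reward of each state.
    Returns a list of (S, A) policies, one per timestep, first timestep first.
    """
    num_states = T.shape[0]
    V = np.zeros(num_states)          # soft value after the final step
    policies = []
    for _ in range(horizon):
        # Q depends only on the expectation over next states, never on
        # which future states are actually realized.
        Q = reward[:, None] + T @ V                           # shape (S, A)
        q_max = Q.max(axis=1, keepdims=True)
        V = (q_max + np.log(np.exp(Q - q_max).sum(axis=1, keepdims=True)))[:, 0]
        policies.append(np.exp(Q - V[:, None]))               # softmax over actions
    policies.reverse()                # backups ran from the last step to the first
    return policies

# Hypothetical 2-state, 2-action MDP: action 1 tends to lead to state 1,
# which is the only state with reward.
T = np.zeros((2, 2, 2))
T[:, 0, :] = [0.9, 0.1]
T[:, 1, :] = [0.2, 0.8]
reward = np.array([0.0, 1.0])

policies = soft_value_iteration(T, reward, horizon=3)
print(policies[0])
```

Each action distribution here is computed from the current state and an expectation over next states, which mirrors the point that an action's entropy should not be conditioned on future states.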

A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress: This is a comprehensive survey of IRL that should be useful to researchers, or students looking to perform a deep dive into IRL. It's particularly useful because it can compare and contrast across many different IRL algorithms, whereas each individual IRL paper only talks about its own method and a few particular weaknesses of other methods. If you want to learn a lot about IRL, I would start with the previous readings, then read this one, and perhaps after that read individual papers that interest you.

Technical AI alignment

Iterated distillation and amplification

Factored Cognition (Andreas Stuhlmuller): Summarized in the highlights!

AI strategy and policy

AI Nationalism (Ian Hogarth): As AI becomes more important in the coming years, there will be an increasing amount of "AI nationalism". AI policy will be extremely important and governments will compete to keep AI talent. In particular, they are likely to start blocking company takeovers and acquisitions that cross national borders -- for example, the UK could have been in a much stronger position had it blocked the acquisition of DeepMind (which is UK-based) by Google (which is US-based).

AI capabilities

Reinforcement learning

RUDDER: Return Decomposition for Delayed Rewards (Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich et al)
