Alignment Newsletter #35

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

This week we don't have any explicit highlights, but remember to treat the sequences as though they were highlighted!

Technical AI alignment

Iterated amplification sequence

Corrigibility (Paul Christiano): A corrigible agent is one which helps its operator, even with tasks that would change the agent itself, such as correcting mistakes in AI design. Consider a good act-based agent, which chooses actions according to our preferences over that action. Since we have a short-term preference for corrigibility, the act-based agent should be corrigible. For example, if we are trying to turn off the agent, the agent will turn off because that's what we would prefer -- it is easy to infer that the overseer would not prefer that agents stop the overseer from shutting them down. Typically we only believe that the agent would stop us from shutting it down if it makes long-term plans, in which case being operational is instrumentally useful, but an act-based agent only optimizes for its overseer's short-term preferences. One potential objection is that the notion of corrigibility is not easy to learn, but it seems not that hard to answer the question "Is the operator being misled?", and in any case we can try this with simple systems, and the results should improve with more capable systems, since a smarter agent is better at predicting the overseer.

In addition, even if an agent has a slightly wrong notion of the overseer's values, it seems like it will improve over time. It is not hard to infer that the overseer wants the agent to make its approximation of the overseer's values more accurate. So, as long as the agent has enough of the overseer's preferences to be corrigible, it will try to learn about the preferences it is wrong about and will become more and more aligned over time. In addition, any slight value drifts caused by e.g. amplification will tend to be fixed over time, at least on average.

Rohin's opinion: I really like this formulation of corrigibility, which I find quite different from MIRI's paper. This seems a lot more in line with the kind of reasoning that I want from an AI system, and it seems like iterated amplification or something like it could plausibly succeed at achieving this sort of corrigible behavior.

Iterated Distillation and Amplification (Ajeya Cotra): This is the first in a series of four posts describing the iterated amplification framework in different ways. This post focuses on the repetition of two steps. In amplification, we take a fast aligned agent and turn it into a slow but more capable aligned agent, by allowing a human to coordinate many copies of the fast agent in order to make better decisions. In distillation, we take a slow aligned agent and turn it into a fast aligned agent (perhaps by training a neural net to imitate the judgments of the slow agent). This is similar to AlphaGo Zero, in which MCTS can be thought of as amplification, while distillation consists of updating the neural net to predict the outputs of the MCTS.
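
The alternation of the two steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than anything from the post: the "task" is summing a tuple of small integers, amplification is a fixed split-and-delegate rule standing in for the human coordinator, and distillation is a lookup table standing in for training a neural net. The names `base_agent`, `amplify`, `distill`, and `iterate` are hypothetical.

```python
from itertools import product

VALUES = (0, 1, 2)  # hypothetical tiny question domain: tuples over these values


def base_agent(question):
    """Fast but weak agent: can only 'sum' a tuple of length 1."""
    if len(question) == 1:
        return question[0]
    return None  # beyond its capability


def amplify(fast_agent):
    """Slow, more capable agent: an overseer splits the question in half
    and delegates each half to a copy of the fast agent."""
    def slow_agent(question):
        direct = fast_agent(question)
        if direct is not None:
            return direct
        mid = len(question) // 2
        left = fast_agent(question[:mid])
        right = fast_agent(question[mid:])
        if left is None or right is None:
            return None  # the copies could not handle the sub-questions
        return left + right
    return slow_agent


def distill(slow_agent, max_len):
    """Fast agent that imitates the slow one. A lookup table stands in
    for training a neural net on the slow agent's judgments."""
    table = {}
    for n in range(1, max_len + 1):
        for q in product(VALUES, repeat=n):
            answer = slow_agent(q)
            if answer is not None:
                table[q] = answer
    return lambda question: table.get(tuple(question))


def iterate(rounds):
    """Alternate amplification and distillation; in this toy setup the
    question length the fast agent can handle doubles each round."""
    agent, max_len = base_agent, 1
    for _ in range(rounds):
        max_len *= 2
        agent = distill(amplify(agent), max_len)
    return agent


if __name__ == "__main__":
    agent = iterate(2)
    print(agent((1, 2, 0, 2)))  # within reach after two rounds
    print(agent((1, 1, 1, 1, 1)))  # still out of reach, so None
```

In this sketch the distilled agent never answers anything the amplified agent did not, which loosely mirrors the hope that distillation preserves the behaviour of the slower system; it says nothing about whether real training would do so.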

This allows us to get both alignment and powerful capabilities, whereas usually the two trade off against each other. High capabilities implies a sufficiently broad mandate to search for good behaviors, allowing our AI systems to find novel behaviors that we never would have thought of, which could be bad if the objective was slightly wrong. On the other hand, high alignment typically requires staying within the realm of human behavior, as in imitation learning, which prevents the AI from finding novel solutions.

In addition to distillation and amplification robustly preserving alignment, we also need to ensure that given a human as a starting point, iterated distillation and amplification can scale to arbitrary capabilities. We would also want it to be about as cost-efficient as alternatives. This seems to be true at test time, when we are simply executing a learned model, but it could be that training is much more expensive.

Rohin's opinion: This is a great simple explanation of the scheme. I don't have much to say about the idea since I've talked about iterated amplification so much in this newsletter already.

Benign model-free RL (Paul Christiano): This post is very similar to the previous one, just with different language: distillation is now implemented through reward modeling with robustness. The point of robustness is to ensure that the distilled agent is benign even outside of the training distribution (though it can be incompetent). There's also an analysis of the costs of the scheme. One important note is that this approach only works for model-free RL systems -- we'll need something else for e.g. model-based RL, if it enables capabilities that we can't get with model-free RL.

Value learning sequence

Intuitions about goal-directed behavior and Coherence arguments do not imply goal-directed behavior (Rohin Shah) (summarized by Richard): Rohin discusses the "misspecified goal argument for AI risk": that even a small misspecification in goals can lead to adversarial behaviour in advanced AI. He argues that whether behaviour is goal-directed depends on whether it generalises to new situations in ways that are predictable given that goal. He also raises the possibility that thinking of an agent as goal-directed becomes less useful the more we understand about how it works. If true, this would weaken the misspecified goal argument.

In the next post, Rohin argues against the claim that "simply knowing that an agent is intelligent lets us infer that it is goal-directed". He points out that all behaviour can be rationalized as expected utility maximisation over world-histories - but this may not meet our criteria for goal-directed behaviour, and slightly misspecifying such a utility function may well be perfectly safe. What's more interesting - and dangerous - is expected utility maximisation over world-states - but he claims that we shouldn't assume that advanced AI will have this sort of utility function, unless we have additional information (e.g. that it has a utility function simple enough to be explicitly represented). There are plenty of intelligent agents which aren't goal-directed - e.g. ones which are very good at inference but only take trivial actions.
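
The claim that any behaviour can be rationalized as utility maximisation over world-histories can be made concrete with a small sketch. It assumes a deterministic toy setting in which a world-history is just the sequence of actions taken; `arbitrary_policy`, `rationalizing_utility`, and the other names are hypothetical and not taken from the posts.

```python
from itertools import product

ACTIONS = ("left", "right", "stay")
HORIZON = 3


def arbitrary_policy(history):
    """Any behaviour at all; this one just cycles through the actions."""
    return ACTIONS[len(history) % len(ACTIONS)]


def rollout(policy):
    """The world-history (here, the action sequence) the policy produces."""
    history = ()
    for _ in range(HORIZON):
        history += (policy(history),)
    return history


def rationalizing_utility(policy):
    """Utility over whole histories: 1 for exactly what the policy does,
    0 for every other history."""
    target = rollout(policy)
    return lambda history: 1.0 if history == target else 0.0


def utility_maximizer(utility):
    """Brute-force search for the history with the highest utility."""
    return max(product(ACTIONS, repeat=HORIZON), key=utility)


if __name__ == "__main__":
    u = rationalizing_utility(arbitrary_policy)
    print(utility_maximizer(u) == rollout(arbitrary_policy))
```

The constructed utility function carries no information beyond the behaviour itself, which is one way to see why this kind of rationalization need not match our criteria for goal-directedness.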

Richard's opinion: I broadly agree with Rohin's points in these posts, and am glad that he's making these arguments explicit. However, while goal-directedness is a tricky property to reason about, I think it's still useful to consider it a property of an agent rather than a property of our model of that agent. It's true that when we have a detailed explanation of how an agent works, we're able to think of cases in which its goal-directedness breaks down (e.g. adversarial examples). However, when these examples are very rare, they don't make much practical difference (e.g. knowing that AlphaGo has a blind spot in certain endgames might not be very helpful in beating it, because you can't get to those endgames).

Agent foundations

Robust program equilibrium (Caspar Oesterheld)

Bounded Oracle Induction (Diffractor)

Oracle Induction Proofs (Diffractor)

Learning human intent

Guiding Policies with Language via Meta-Learning (John D. Co-Reyes) (summarized by Richard): The authors train an agent to perform tasks specified in natural language, with a "correction" after each attempt (also in natural language). They formulate this as a meta-learning problem: for each instruction, several attempt-correction cycles are allowed. Each attempt takes into account previous attempts to achieve the same instruction by passing each previous trajectory and its corresponding correction through a CNN, then using the mean of all outputs as an input to a policy module.

In their experiments, all instructions and corrections are generated automatically, and test-time performance is evaluated as a function of how many corrections are allowed. In one experiment, the task is to navigate rooms to reach a goal, where the correction is the next subgoal required. Given 4 corrections, their agent outperforms a baseline which was given all 5 subgoals at the beginning of the task. In another experiment, the task is to move a block to an ambiguously-specified location, and the corrections narrow down the target area; their trained agent scores 0.9, as opposed to 0.96 for an agent given the exact target location.

Richard's opinion: This paper explores an important idea: correcting poorly-specified instructions using human-in-the-loop feedback. The second task in particular is a nice toy example of iterative preference clarification. I'm not sure whether their meta-learning approach is directly relevant to safety, particularly because each correction is only "in scope" for a single episode, and also only occurs after a bad attempt has finished. However, the broad idea of correction-based learning seems promising.

Interpretability

Deeper Interpretability of Deep Networks (Tian Xu et al)

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks (David Bau et al)

Please Stop Explaining Black Box Models for High Stakes Decisions (Cynthia Rudin)

Representer Point Selection for Explaining Deep Neural Networks (Chih-Kuan Yeh, Joon Sik Kim et al)

Adversarial examples

Robustness via curvature regularization, and vice versa (Moosavi-Dezfooli et al) (summarized by Dan H): This paper proposes a distinct way to increase adversarial perturbation robustness. They take an adversarial example generated with FGSM, compute the gradient of the loss at the clean example and the gradient of the loss at the adversarial example, and penalize the difference between the two gradients. Decreasing this penalty relates to decreasing the loss surface curvature. The technique works slightly worse than adversarial training.
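
A minimal sketch of the gradient-difference penalty as described in this summary, assuming a logistic-regression loss so that the input gradient can be written by hand. The model, the helper names, and the unnormalized squared-difference form are assumptions for illustration; the paper's exact regularizer may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect
    to the input x, for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y * sigmoid(-margin)
    return [scale * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, x, y, eps):
    """Fast gradient sign method: step of size eps along the sign of the
    input gradient, which increases the loss."""
    g = input_gradient(w, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def gradient_difference_penalty(w, x, y, eps):
    """Squared distance between the loss gradient at the FGSM example and
    the loss gradient at the clean example (assumed form of the penalty)."""
    x_adv = fgsm(w, x, y, eps)
    g_clean = input_gradient(w, x, y)
    g_adv = input_gradient(w, x_adv, y)
    return sum((a - c) ** 2 for a, c in zip(g_adv, g_clean))


if __name__ == "__main__":
    w, x, y = [2.0, -1.0], [0.5, 0.5], 1
    print(gradient_difference_penalty(w, x, y, eps=0.1))
```

During training this quantity would be added to the usual loss with some weight. If the loss were exactly linear in the input, the two gradients would coincide and the penalty would vanish, which is the link to curvature that the summary mentions.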

Uncertainty

Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings (Aviral Kumar et al)

Forecasting

How rapidly are GPUs improving in price performance? (gallabytes)

Time for AI to cross the human performance range in diabetic retinopathy (Aysja Johnson)

Near-term concerns

Fairness and bias

50 Years of Test (Un)fairness: Lessons for Machine Learning (Ben Hutchinson)

AI strategy and policy

Robust Artificial Intelligence and Robust Human Organizations (Thomas G. Dietterich)

Handful of Countries – Including the US and Russia – Hamper Discussions to Ban Killer Robots at UN

Other progress in AI

Exploration

Montezuma’s Revenge Solved by Go-Explore, a New Algorithm for Hard-Exploration Problems (Adrien Ecoffet et al) (summarized by Richard): This blog post showcases an agent which achieves high scores in Montezuma’s Revenge and Pitfall by keeping track of a frontier of visited states (and the trajectories which led to them). In each training episode, a state is chosen from the frontier, the environment is reset to that state, and then the agent randomly explores further and updates the frontier. The authors argue that this addresses the tendency of intrinsic motivation algorithms to forget about promising areas they've already explored. To make state storage tractable, each state is stored as a downsampled 11x8 image.
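
The exploration loop described above can be sketched on a toy problem. The sketch assumes a hypothetical deterministic one-dimensional environment (`ChainEnv`), a coarse `cell_of` key standing in for the downsampled image, and uniform random selection from the archive; the summary does not specify the selection rule, and none of these names come from the blog post.

```python
import random


class ChainEnv:
    """Hypothetical deterministic toy environment: walk along a line
    from position 0 towards GOAL."""
    GOAL = 30

    def __init__(self):
        self.pos = 0

    def restore(self, pos):
        """Reset the environment to a previously visited state."""
        self.pos = pos

    def step(self, action):
        """Apply an action of -1 or +1, clamped to the line."""
        self.pos = max(0, min(self.GOAL, self.pos + action))
        return self.pos


def cell_of(pos):
    """Coarse state key, standing in for the downsampled image."""
    return pos // 3


def go_explore(iterations=200, explore_steps=10, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    # The archive maps each cell to (state to restore, actions that reached it).
    archive = {cell_of(0): (0, [])}
    for _ in range(iterations):
        cell = rng.choice(list(archive))  # assumed: uniform choice
        state, trajectory = archive[cell]
        env.restore(state)
        trajectory = list(trajectory)
        for _ in range(explore_steps):
            action = rng.choice((-1, 1))  # random exploration
            pos = env.step(action)
            trajectory.append(action)
            key = cell_of(pos)
            # Keep a cell if it is new or was reached by a shorter trajectory.
            if key not in archive or len(trajectory) < len(archive[key][1]):
                archive[key] = (pos, list(trajectory))
    return archive


if __name__ == "__main__":
    archive = go_explore()
    print(sorted(archive))
```

Because the environment is deterministic, replaying a stored trajectory from the start reproduces the stored state, which is the property that the reset step relies on.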

The authors note that this solution exploits the determinism of the environment, which makes it brittle. So they then use imitation learning to learn a policy from demonstrations by the original agent. The resulting agents score many times higher than state-of-the-art on Montezuma’s Revenge and Pitfall.

Richard's opinion: I’m not particularly impressed by this result, for a couple of reasons. Firstly, I think that exploiting determinism by resetting the environment (or even just memorising trajectories) fundamentally changes the nature of the problem posed by hard Atari games. Doing so allows us to solve them in the same ways as any other search problem - we could, for instance, just use the AlphaZero algorithm to train a value network. In addition, the headline results are generated by hand-engineering features like x-y coordinates and room number, a technique that has been eschewed by most other attempts. When you take those features away, their agent’s total reward on Pitfall falls back to 0.

Read more: Quick Opinions on Go-Explore

Prioritizing Starting States for Reinforcement Learning (Arash Tavakoli, Vitaly Levdik et al)

Reinforcement learning

Learning Actionable Representations with Goal-Conditioned Policies (Dibya Ghosh)

Unsupervised Control Through Non-Parametric Discriminative Rewards (David Warde-Farley)

Hierarchical RL

Hierarchical visuomotor control of humanoids (Josh Merel, Arun Ahuja et al)