# Alignment Newsletter #45


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

## Highlights

Learning Preferences by Looking at the World (Rohin Shah and Dmitrii Krasheninnikov): The key idea with this project that I worked on is that the state of the world is already optimized for our preferences, and so simply by looking at the world we can infer these preferences. Consider the case where there is a vase standing upright on the table. This is an unstable equilibrium -- it's very easy to knock over the vase so it is lying sideways, or is completely broken. The fact that this hasn't happened yet suggests that we care about vases being upright and intact; otherwise at some point we probably would have let it fall.

Since we have optimized the world for our preferences, the natural approach is to model this process, and then invert it to get the preferences. You could imagine that we could consider all possible reward functions, and put probability mass on them in proportion to how likely they make the current world state if a human optimized them. Basically, we are simulating the past in order to figure out what must have happened and why. With the vase example, we would notice that in any reward function where humans wanted to break vases, or were indifferent to broken vases, we would expect the current state to contain broken vases. Since we don't observe that, it must be the case that we care about keeping vases intact.

Our algorithm, Reward Learning by Simulating the Past (RLSP), takes this intuition and applies it in the framework of Maximum Causal Entropy IRL (AN #12), where you assume that the human was acting over T timesteps to produce the state that you observe. We then show a few gridworld environments in which applying RLSP can fix a misspecified reward function.

Rohin's opinion: In addition to this blog post and the paper, I also wrote a post on the Alignment Forum expressing opinions about the work. There are too many disparate opinions to put in here, so I'd recommend reading the post itself. I guess one thing I'll mention is that to infer preferences with a single state, you definitely need a good dynamics model, and a good set of features. While this may seem difficult to get, it's worth noting that dynamics are empirical facts about the world, and features might be, and there is already lots of work on learning both dynamics and features.
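To make the "simulate the past, then invert" idea concrete, here is a minimal sketch of the vase example. It is not the RLSP algorithm from the paper: it brute-forces a posterior over three hand-written reward hypotheses in a two-state toy world. The states, actions, break probability, horizon, uniform prior, and the assumption that the vase started out intact are all illustrative choices made for this sketch, and the dynamics model is assumed to be known.

```python
import math

# Toy world; every number here is an illustrative assumption.
INTACT, BROKEN = 0, 1
CAREFUL, CARELESS = 0, 1
BREAK_PROB = 0.5  # chance that a careless step breaks an intact vase
HORIZON = 10      # T: number of steps the human has already acted


def transition(state, action):
    """Known dynamics model: returns {next_state: probability}."""
    if state == BROKEN or action == CAREFUL:
        return {state: 1.0}
    return {BROKEN: BREAK_PROB, INTACT: 1.0 - BREAK_PROB}


def soft_policy(reward, horizon):
    """Finite-horizon soft value iteration; returns policy[t][state][action]."""
    value = [reward[INTACT], reward[BROKEN]]
    policy = []
    for _ in range(horizon):
        new_value, step_policy = [], []
        for s in (INTACT, BROKEN):
            q = [
                reward[s] + sum(p * value[s2] for s2, p in transition(s, a).items())
                for a in (CAREFUL, CARELESS)
            ]
            v = math.log(sum(math.exp(x) for x in q))  # soft maximum over actions
            new_value.append(v)
            step_policy.append([math.exp(x - v) for x in q])
        value = new_value
        policy.append(step_policy)
    policy.reverse()  # computed backwards in time, so flip to forward order
    return policy


def likelihood_of(observed, reward, horizon=HORIZON):
    """Probability of ending in `observed` if the human optimized `reward`."""
    policy = soft_policy(reward, horizon)
    dist = {INTACT: 1.0, BROKEN: 0.0}  # assumption: the vase started intact
    for t in range(horizon):
        nxt = {INTACT: 0.0, BROKEN: 0.0}
        for s, prob_s in dist.items():
            for a in (CAREFUL, CARELESS):
                for s2, p in transition(s, a).items():
                    nxt[s2] += prob_s * policy[t][s][a] * p
        dist = nxt
    return dist[observed]


def posterior(observed, hypotheses):
    """Uniform prior over hypotheses, weighted by likelihood of the observed state."""
    weights = {name: likelihood_of(observed, r) for name, r in hypotheses.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Each hypothesis maps state index (INTACT, BROKEN) to a reward.
HYPOTHESES = {
    "likes_intact": [1.0, 0.0],
    "indifferent": [0.0, 0.0],
    "likes_broken": [0.0, 1.0],
}
post = posterior(INTACT, HYPOTHESES)
```

Observing an intact vase puts the most posterior mass on `likes_intact` and the least on `likes_broken`, which matches the intuition in the summary: under indifference or a preference for broken vases, the vase would probably have been broken by now.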

# Technical AI alignment

### Iterated amplification sequence

Security amplification (Paul Christiano): If we imagine humans as reasoners over natural language, there are probably some esoteric sentences that could cause "failure". For example, maybe there are unreasonably convincing arguments that cause the human to believe something, when they shouldn't have been convinced by the argument. Maybe they are tricked or threatened in a way that "shouldn't" have happened. The goal with security amplification is to make these sorts of sentences difficult to find, so that we will not come across them in practice. As with Reliability amplification (AN #44), we are trying to amplify a fast agent A into a slow agent A* that is "more secure", meaning that it is multiplicatively harder to find an input that causes a catastrophic failure.

You might expect that capability amplification (AN #42) would also improve security, since the more capable agent would be able to notice failure modes and remove them. However, this would likely take far too long.

Instead, we can hope to achieve security amplification by making reasoning abstract and explicit, with the hope that when reasoning is explicit it becomes harder to trigger the underlying failure mode, since you have to get your attack "through" the abstract reasoning. I believe a future post will talk about this more, so I'll leave the details till then. Another option would be for the agent to act stochastically; for example, when it needs to generate a subquestion, it generates many different wordings of the subquestion and chooses one randomly. If only one of the wordings can trigger the failure, then this reduces the failure probability.

Rohin's opinion: This is the counterpoint to Reliability amplification (AN #44) from last week, and the same confusions I had last week still apply, so I'm going to refrain from an opinion.

### Problems

Constructing Goodhart (johnswentworth): This post makes the point that Goodhart's Law is so common in practice because if there are several things that we care about, then we are probably at or close to a Pareto-optimal point with respect to those things, and so choosing any one of them as a proxy metric to optimize will cause the other things to become worse, leading to Goodhart effects.

Rohin's opinion: This is an important point about Goodhart's Law. If you take some "random" or unoptimized environment, and then try to optimize some proxy for what you care about, it will probably work quite well. It's only when the environment is already optimized that Goodhart effects are particularly bad.
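A small numeric sketch of this contrast, under assumptions that are not in the post: two things we care about, `x` and `y`, share a fixed budget (`x + y <= 1` is the Pareto frontier), and true utility is `sqrt(x) + sqrt(y)`. The proxy being optimized is `x` alone.

```python
import math


def true_utility(x, y):
    """Hypothetical 'real' objective: we care about both x and y."""
    return math.sqrt(x) + math.sqrt(y)


def push_proxy(x, y, step, budget=1.0):
    """Increase the proxy x by `step`; once the budget binds, x grows at y's expense."""
    new_x = min(x + step, budget)
    new_y = min(y, budget - new_x)
    return new_x, new_y


# Unoptimized world: slack in the budget, so raising x costs nothing.
interior_before = (0.2, 0.2)
interior_after = push_proxy(*interior_before, step=0.3)
interior_change = true_utility(*interior_after) - true_utility(*interior_before)

# Already-optimized world: on the frontier, so raising x must lower y.
frontier_before = (0.5, 0.5)
frontier_after = push_proxy(*frontier_before, step=0.3)
frontier_change = true_utility(*frontier_after) - true_utility(*frontier_before)
```

In this toy setup `interior_change` is positive while `frontier_change` is negative: the same proxy optimization helps in the slack world and hurts in the world that was already at a Pareto-optimal point.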

Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function) (Peter Eckersley) (summarized by Richard): This paper discusses some impossibility theorems related to the Repugnant conclusion in population ethics (i.e. theorems showing that no moral theory simultaneously satisfies certain sets of intuitively desirable properties). Peter argues that in the context of AI it's best to treat these theorems as uncertainty results, either by allowing incommensurate outcomes or by allowing probabilistic moral judgements. He hypothesises that "the emergence of instrumental subgoals is deeply connected to moral certainty", and so implementing uncertain objective functions is a path to making AI safer.

Richard's opinion: The more general argument underlying this post is that aligning AGI will be hard partly because ethics is hard (as discussed here). I agree that using uncertain objective functions might help with this problem. However, I'm not convinced that it's useful to frame this issue in terms of impossibility theorems and narrow AI, and would like to see these ideas laid out in a philosophically clearer way.

### Iterated amplification

HCH is not just Mechanical Turk (William Saunders): In Humans Consulting HCH (HCH) (AN #34) a human is asked a question and is supposed to return an answer. The human can ask subquestions, which are delegated to another copy of the human, who can ask subsubquestions, ad infinitum. This post points out that HCH has a free parameter -- the base human policy. We could imagine e.g. taking a Mechanical Turk worker and using them as the base human policy, and we could argue that HCH would give good answers in this setting as long as the worker is well-motivated, since they are using "human-like" reasoning. However, there are other alternatives. For example, in theory we could formalize a "core" of reasoning. For concreteness, suppose we implement a lookup table for "simple" questions, and then use this lookup table as the base policy. We might expect this to be safe because of theorems that we proved about the lookup table, or by looking at the process by which the development team created the lookup table. In between these two extremes, we could imagine that the AI researchers train the human overseers in how to corrigibly answer questions, and then the resulting human policy is used in HCH. This seems distinctly more likely to be safe than the first case.

Rohin's opinion: I strongly agree with the general point that we can get significant safety by improving the human policy (AN #43), especially with HCH and iterated amplification, since they depend on having good human overseers, at least initially.
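The "free parameter" point can be sketched in code: the recursion is fixed, and the base policy is passed in. Everything below is a hypothetical illustration rather than anything from the post. Questions are lists of integers to be summed, `splitting_policy` stands in for the base policy, and the depth limit is a practical stand-in for the idealized unbounded recursion.

```python
from typing import Callable, Sequence

Ask = Callable[[Sequence[int]], int]
Policy = Callable[[Sequence[int], Ask], int]


def hch(policy: Policy, question: Sequence[int], depth_limit: int = 50) -> int:
    """Answer `question` with `policy`, which may delegate subquestions to copies of itself."""

    def ask(subquestion: Sequence[int]) -> int:
        if depth_limit <= 0:
            raise RecursionError("HCH depth limit reached")
        return hch(policy, subquestion, depth_limit - 1)

    return policy(question, ask)


def splitting_policy(numbers: Sequence[int], ask: Ask) -> int:
    """Hypothetical base policy: answers 'what is the sum?' mostly by delegation."""
    if len(numbers) == 0:
        return 0
    if len(numbers) == 1:
        return numbers[0]  # a "simple" question answered directly
    mid = len(numbers) // 2
    return ask(numbers[:mid]) + ask(numbers[mid:])


answer = hch(splitting_policy, [1, 2, 3, 4, 5])
```

Swapping `splitting_policy` for a different function (a lookup table, or a policy distilled from trained overseers) changes what the whole tree computes without touching `hch` itself, which is why the choice of base policy matters so much for safety.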

Reinforcement Learning in the Iterated Amplification Framework (William Saunders): This post and its comments clarify how we can use reinforcement learning for the distillation step in iterated amplification. The discussion is still happening so I don't want to summarize it yet.

### Learning human intent

Learning Preferences by Looking at the World (Rohin Shah and Dmitrii Krasheninnikov): Summarized in the highlights!

### Preventing bad behavior

Test Cases for Impact Regularisation Methods (Daniel Filan): This post collects various test cases that researchers have proposed for impact regularization methods. A summary of each one would be far too long for this newsletter, so you'll have to read the post itself.

Rohin's opinion: These test cases and the associated commentary suggest to me that we haven't yet settled on what properties we'd like our impact regularization methods to satisfy, since there are pairs of test cases that seem hard to solve simultaneously, as well as test cases where the desired behavior is unclear.

### Interpretability

Neural Networks seem to follow a puzzlingly simple strategy to classify images (Wieland Brendel and Matthias Bethge): This is a blog post explaining the paper Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet, which was summarized in AN #33.

### Robustness

AI Alignment Podcast: The Byzantine Generals’ Problem, Poisoning, and Distributed Machine Learning (Lucas Perry and El Mahdi El Mhamdi) (summarized by Richard): Byzantine resilience is the ability of a system to operate successfully when some of its components have been corrupted, even if it's unclear which ones they are. In the context of machine learning, this is relevant to poisoning attacks in which some training data is altered to affect the batch gradient (one example being the activity of fake accounts on social media sites). El Mahdi explains that when data is very high-dimensional, it is easy to push a neural network into a bad local minimum by altering only a small fraction of the data. He argues that his work on mitigating this is relevant to AI safety: even superintelligent AGI will be vulnerable to data poisoning due to time constraints on computation, and the fact that data poisoning is easier than resilient learning.
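As a rough illustration of why poisoning affects the batch gradient, the sketch below compares plain averaging with a coordinate-wise median. The median is a simple, well-known robust aggregator swapped in here for illustration; it is not claimed to be the method discussed in the podcast, and the gradient values are made up.

```python
from statistics import fmean, median


def aggregate(gradients, reducer):
    """Combine per-worker gradients coordinate by coordinate with `reducer`."""
    return [reducer(coordinate) for coordinate in zip(*gradients)]


# Four honest workers roughly agree; one corrupted worker sends a huge gradient.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
poisoned = honest + [[1000.0, 1000.0]]

mean_gradient = aggregate(poisoned, fmean)     # dragged far from the honest values
median_gradient = aggregate(poisoned, median)  # stays with the honest majority
```

A single corrupted contribution moves the mean arbitrarily far, while the median stays at the honest values as long as corrupted workers are a minority in each coordinate. This toy case has only two dimensions, so it does not capture the high-dimensional difficulties that the summary mentions.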

Trustworthy Deep Learning Course (Jacob Steinhardt, Dawn Song, Trevor Darrell) (summarized by Dan H): This course, which is currently underway, covers AI safety topics for current deep learning systems. The course includes slides and videos.

# AI strategy and policy

How Sure are we about this AI Stuff? (Ben Garfinkel) (summarized by Richard): Ben outlines four broad arguments for prioritising work on superintelligent AGI: that AI will have a big influence over the long-term future, and more specifically that it might cause instability, lock-in or large-scale "accidents". He notes the drawbacks of each line of argument. In particular, the "AI is a big deal" argument doesn't show that we have useful leverage over outcomes (compare a Victorian trying to improve the long-term effects of the industrial revolution). He claims that the next two arguments have simply not been researched thoroughly enough to draw any conclusions. And while the argument from accidents has been made by Bostrom and Yudkowsky, there hasn't been sufficient elaboration or criticism of it, especially in light of the recent rise of deep learning, which reframes many ideas in AI.

Richard's opinion: I find this talk to be eminently reasonable throughout. It highlights a concerning lack of public high-quality engagement with the fundamental ideas in AI safety over the last few years, relative to the growth of the field as a whole (although note that in the past few months this has been changing, with three excellent sequences released on the Alignment Forum, plus Drexler's technical report). This is something which motivates me to spend a fair amount of time writing about and discussing such ideas.

One nitpick: I dislike the use of "accidents" as an umbrella term for AIs behaving in harmful ways unintended by their creators, since it's misleading to describe deliberately adversarial behaviour as an "accident" (although note that this is not specific to Ben's talk, since the terminology has been in use at least since the Concrete problems paper).

# Other progress in AI

### Reinforcement learning

The Hanabi Challenge: A New Frontier for AI Research (Nolan Bard, Jakob Foerster et al) (summarized by Richard): The authors propose the cooperative, imperfect-information card game Hanabi as a target for AI research, due to the necessity of reasoning about the beliefs and intentions of other players in order to win. They identify two challenges: firstly, discovering a policy for a whole team that allows it to win (the self-play setting); and secondly, discovering an individual policy that allows an agent to play with an ad-hoc team without previous coordination. They note that successful self-play policies are often very brittle in the ad-hoc setting, which makes the latter the key problem. The authors provide an open-source framework, an evaluation benchmark and the results of existing RL techniques.

Richard's opinion: I endorse the goals of this paper, but my guess is that Hanabi is simple enough that agents can solve it using isolated heuristics rather than general reasoning about other agents' beliefs.

Rohin's opinion: I'm particularly excited to see more work on ad hoc teamwork, since it seems very similar to the setting we are in, where we would like to deploy AI systems among groups of humans and have things go well. See Following human norms (AN #42) for more details.

A Comparative Analysis of Expected and Distributional Reinforcement Learning (Clare Lyle et al) (summarized by Richard): Distributional RL systems learn distributions over the value of actions rather than just their expected values. In this paper, the authors investigate the reasons why this technique improves results, by training distribution learner agents and expectation learner agents on the same data. They provide evidence against a number of hypotheses: that distributional RL reduces variance; that distributional RL helps with policy iteration; and that distributional RL is more stable with function approximation. In fact, distributional methods have similar performance to expectation methods when using tabular representations or linear function approximators, but do better when using non-linear function approximators such as neural networks (especially in the earlier layers of networks).

Richard's opinion: I like this sort of research, and its findings are interesting (even if the authors don't arrive at any clear explanation for them). One concern: I may be missing something, but it seems like the coupled samples method they use doesn't allow investigation into whether distributional methods benefit from generating better data (e.g. via more effective exploration).
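The tabular case can be illustrated with a minimal sketch that is not taken from the paper: a single state, a hypothetical three-point return support, and a fixed step size, with both learners fed the same rewards. The expectation learner tracks one number, while the distribution learner tracks a categorical distribution over returns.

```python
SUPPORT = [0.0, 1.0, 2.0]  # hypothetical set of possible returns
ALPHA = 0.1                # step size shared by both learners


def expected_update(q, reward):
    """Expectation learner: move the value estimate towards the sampled reward."""
    return q + ALPHA * (reward - q)


def distributional_update(probs, reward):
    """Distribution learner: move probability mass towards the sampled reward."""
    target = [1.0 if z == reward else 0.0 for z in SUPPORT]
    return [p + ALPHA * (t - p) for p, t in zip(probs, target)]


def mean(probs):
    return sum(p * z for p, z in zip(probs, SUPPORT))


rewards = [2.0, 0.0, 1.0, 2.0, 2.0, 0.0]  # the same data for both learners
probs = [1 / 3, 1 / 3, 1 / 3]
q = mean(probs)  # start both learners from the same expected value
for r in rewards:
    q = expected_update(q, r)
    probs = distributional_update(probs, r)
```

Because both updates are linear, the mean of `probs` equals `q` after every step, so in this toy setting the distributional learner gains nothing over the expectation learner when acting on expected values. Any difference would have to come from elsewhere, for example from non-linear function approximation, as the summary describes.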

Recurrent Experience Replay in Distributed Reinforcement Learning (Steven Kapturowski et al): See Import AI.

Visual Hindsight Experience Replay (Himanshu Sahni et al)

A Geometric Perspective on Optimal Representations for Reinforcement Learning (Marc G. Bellemare et al)

The Value Function Polytope in Reinforcement Learning (Robert Dadashi et al)

### Deep learning

A Conservative Human Baseline Estimate for GLUE: People Still (Mostly) Beat Machines (Nikita Nangia et al) (summarized by Dan H): BERT tremendously improves performance on several NLP datasets, such that it has "taken over" NLP. GLUE represents performance of NLP models across a broad range of NLP datasets. Now GLUE has human performance measurements. According to the current GLUE leaderboard, the gap between human performance and models fine-tuned on GLUE datasets is a mere 4.7%. Hence many current NLP datasets are nearly "solved."

# News

Governance of AI Fellowship (Markus Anderljung): The Center for the Governance of AI is looking for a few fellows to work for around 3 months on AI governance research. They expect that fellows will be at the level of PhD students or postdocs, though there are no strict requirements. The first round application deadline is Feb 28, and the second round application deadline is Mar 28.
