[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

AI Alignment Research Overview (Jacob Steinhardt) (summarized by Dan H): It has been over three years since Concrete Problems in AI Safety. Since that time we have learned more about the structure of the safety problem. This document represents an updated taxonomy of problems relevant for AI alignment. Jacob Steinhardt decomposes the remaining technical work into “technical alignment (the overcoming of conceptual or engineering issues needed to create aligned AI), detecting failures (the development of tools for proactively assessing the safety/alignment of a system or approach), methodological understanding (best practices backed up by experience), and system-building (how to tie together the three preceding categories in the context of many engineers working on a large system).”

The first topic under “technical alignment” is “Out-of-Distribution Robustness,” which receives more emphasis than it did in Concrete Problems. Out-of-Distribution Robustness is in part motivated by the fact that transformative AI will lead to substantial changes to the real world, and we would like our systems to perform well even under these large and possibly rapid data shifts. Specific subproblems include some work on adversarial examples and out-of-distribution detection. Next, the problem of Reward Learning is described. The challenges here include learning human values and ensuring that those lossily represented human values remain aligned under extreme optimization. While we have attained more conceptual clarity about reward learning since Concrete Problems, reward learning remains largely “uncharted,” and it is still not clear “how [to] approach the problem.” The next section on Scalable Reward Generation points out that, in the future, labeling meaning or providing human oversight will prove increasingly difficult. Next, he proposes that we ought to study how to make systems “act conservatively,” for example by endowing systems with the ability to activate a conservative fallback routine when they are uncertain. The final topic under technical alignment is Counterfactual Reasoning. Here one possible direction is generating a family of simulated environments to generate counterfactuals.
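To make the “act conservatively” idea concrete, here is a minimal sketch of a fallback rule. It is not taken from Steinhardt's document: the function names, the 0.8 threshold, and the use of the maximum softmax probability as a stand-in for uncertainty are all assumptions made for illustration, and that proxy is known to be unreliable under distribution shift.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of action logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def act_or_fallback(
    logits: Sequence[float],
    fallback_action: int,
    min_confidence: float = 0.8,  # hypothetical threshold, not from the source
) -> int:
    """Take the policy's preferred action only if it is confident enough.

    Otherwise return a designated conservative action (e.g. "stop" or
    "ask a human"), here represented by `fallback_action`.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda a: probs[a])
    return best if probs[best] >= min_confidence else fallback_action


FALLBACK = -1  # hypothetical id for the conservative routine
confident_choice = act_or_fallback([5.0, 0.0, 0.0], FALLBACK)
uncertain_choice = act_or_fallback([0.1, 0.0, 0.0], FALLBACK)
```

In this sketch the first call keeps the policy's action because one logit dominates, while the second hands control to the fallback because the three actions are nearly tied.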

The “technical alignment” section is the majority of this document. Later sections such as “Detecting Failures in Advance” highlight the importance of deep neural network visualization and recent model stress-test datasets. “Methodological Understanding” suggests that we are more likely to build aligned AI systems if we improve our best practices for building and evaluating models, and “System Building” speculates about how to do this for future multi-faceted ML systems.

Dan H's opinion: This is a welcome update to Concrete Problems since it is slightly more concrete and more current, and it discusses improving safety in both deep learning and RL rather than mostly RL. While the document mentions many problems, the set of problems retains precision and fortunately does not include every capabilities concern that may possibly one day impact safety. A takeaway is that value learning and model transparency still need groundwork, but fortunately other problems including out-of-distribution robustness are more concretized and mostly need time and continued effort.

Rohin's opinion: One thing I particularly like about this agenda is that the connection to AI alignment is significantly clearer than in Concrete Problems.

Technical AI alignment

Iterated amplification

Ought Progress Update October 2019 (Jungwon Byun and Andreas Stuhlmüller) (summarized by Rohin): While this update provides details about Ought as a whole, I will focus only on the research they've done. As a reminder, Ought's goal is to understand how we can delegate hard questions to machine and human experts. They initially focused on Factored Cognition (AN #36), where each question was recursively decomposed into subquestions that would help find the answer. They now call this "Factored Generation", and use "Factored Cognition" as the umbrella term for any approach that helps answer hard questions using some kind of decomposition.

While Ought has run some ML experiments aimed at automating decompositions, they have primarily worked on human experiments for Factored Evaluation (AN #62), where the decompositions are used to help evaluate answers. They are focusing on evaluating answers instead of finding answers for a few reasons. First, there may already exist experts whose answers we can use: this can be directly plugged into Factored Evaluation, but not Factored Generation. Second, evaluation is often easier than generation. Third, factored evaluation is more amenable to time-saving tricks like only focusing on a single branch of the tree (as in AI safety via debate (AN #5)). Finally, when training ML systems, if a model is unable to exactly imitate the experts, we need something that can distinguish between better and worse behavior, which is possible with Factored Evaluation but not Factored Generation. (See also Against Mimicry.)

The experiments involve a judge trying to decide which of two answers to a question is better, where one answer is provided by an honest expert and one by a malicious expert. (Subquestions may also get different answers from the two experts.) The goal is to find a setup where the judge reliably picks the honest answer, regardless of how powerful the malicious expert is: if this could be established, it would be strong evidence that the setup incentivizes honest answers. Of course, this requires us to quantify over all malicious strategies, but humans may not find the best malicious strategies. To mitigate this, the experiments provide the malicious expert with lots of advantages that the honest one doesn't have, to simulate very powerful malicious experts.

They have already learned from their experiments. Initially, they hoped participants would develop good meta-level strategies for determining the truth. (Meta-level here means that the strategies would generalize to other domains, e.g. a heuristic of always splitting an answer into separate claims and asking for the evidence for each claim separately.) They found that these strategies don't emerge organically, and so are planning to spend concentrated staff time on finding good strategies. They also found that malicious experts sometimes won due to avoidable mistakes, and are hoping to eliminate this by ensembling work from multiple people for increased robustness.

Rohin's opinion: This is distinct progress since the last update, though I think the experiments are still exploratory enough that it's hard to have any big takeaways. The difficulty of generating good strategies suggests that it's particularly important that we train our human overseers (as suggested in AI Safety Needs Social Scientists (AN #47)) to provide the right kind of feedback, for example if we would like them to reward only corrigible reasoning (AN #35). I'm particularly excited for the next update, where we could see experiments powerful enough to come to more solid conclusions.

Learning human intent

Norms, Rewards, and the Intentional Stance: Comparing Machine Learning Approaches to Ethical Training (Daniel Kasenberg et al) (summarized by Asya) (H/T Xuan Tan): This paper argues that norm inference is a plausible alternative to inverse reinforcement learning (IRL) for teaching a system what people want. Existing IRL algorithms rely on the Markov assumption: that the next state of the world depends only on the previous state of the world and the action that the agent takes from that state, rather than on the agent’s entire history. In cases where information about the past matters, IRL will either fail to infer the right reward function, or will be forced to make challenging guesses about what past information to encode in each state. By contrast, norm inference tries to infer what (potentially temporal) propositions encode the reward of the system, keeping around only past information that is relevant to evaluating potential propositions. The paper argues that norm inference results in more interpretable systems that generalize better than IRL -- systems that use norm inference can successfully model reward-driven agents, but systems that use IRL do poorly at learning temporal norms.
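As a toy illustration of why history can matter (this is not the paper's algorithm, which works with Linear Temporal Logic), the sketch below compares a reward defined on individual states with a temporal norm evaluated over a whole trajectory. The state names and the "badge before lab" norm are hypothetical.

```python
from typing import Dict, List

Trajectory = List[str]


def markov_return(trajectory: Trajectory, state_reward: Dict[str, float]) -> float:
    """Total reward when each step's reward depends only on the current state."""
    return sum(state_reward.get(state, 0.0) for state in trajectory)


def badge_before_lab(trajectory: Trajectory) -> bool:
    """Temporal norm: the agent may be in 'lab' only after visiting 'badge'."""
    has_badge = False
    for state in trajectory:
        if state == "badge":
            has_badge = True
        elif state == "lab" and not has_badge:
            return False
    return True


# Two trajectories that visit the same states in a different order.
compliant = ["hall", "badge", "lab"]
violating = ["hall", "lab", "badge"]

# Any per-state reward gives both trajectories the same return,
# because the return is just a sum over the states visited.
state_reward = {"hall": 0.0, "badge": 0.5, "lab": 1.0}
returns_match = markov_return(compliant, state_reward) == markov_return(violating, state_reward)

# The temporal norm tells them apart.
norm_distinguishes = badge_before_lab(compliant) and not badge_before_lab(violating)
```

A state-based learner could recover the distinction by adding a "has badge" flag to the state, which is exactly the kind of guess about what past information to encode that the summary describes.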

Asya's opinion: This paper presents an interesting novel alternative to inverse reinforcement learning and does a good job of acknowledging potential objections. Deciding whether and how to store information about the past seems like an important problem that inverse reinforcement learning has to reckon with. My main concern with norm inference, which the paper mentions, is that optimizing over all possible propositions is in practice extremely slow. I don't anticipate that norm inference will be computationally tractable unless a lot of computing power is available.

Rohin's opinion: The idea of "norms" used here is very different from what I usually imagine, as in e.g. Following human norms (AN #42). Usually, I think of norms as imposing a constraint upon policies rather than defining an optimal policy, (often) specifying what not to do rather than what to do, and being a property of groups of agents, rather than of a single agent. (See also this comment.) The "norms" in this paper don't satisfy any of these properties: I would describe their norm inference as performing IRL with history-dependent reward functions, with a strong inductive bias towards "logical" reward functions (which comes from their use of Linear Temporal Logic). Note that some inductive bias is necessary, as without inductive bias history-dependent reward functions are far too expressive, and nothing could be reasonably learned. I think despite how it's written, the paper should be taken not as a denouncement of IRL-the-paradigm, but a proposal for better IRL algorithms that are quite different from the ones we currently have.

Improving Deep Reinforcement Learning in Minecraft with Action Advice (Spencer Frazier et al) (summarized by Asya): This paper uses maze-traversal in Minecraft to look at the extent to which human advice can help with aliasing in 3D environments, the problem where many states share nearly identical visual features. The paper compares two advice-giving algorithms that rely on neural nets which are trained to explore and predict the utilities of possible actions they can take, sometimes accepting human advice. The two algorithms differ primarily in whether they provide advice for the current action, or provide advice that persists for several actions.

Experimental results suggest that both algorithms, but especially the one that applies to multiple actions, help with the problem of 3D aliasing, potentially because the system can rely on the movement advice it got in previous timesteps rather than having to discern tricky visual features in the moment. The paper also varies the frequency and accuracy of the advice given, and finds that receiving more advice significantly improves performance, even if that advice is only 50% accurate.
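The difference between advice that applies to one action and advice that persists can be sketched as follows. This is a simplified illustration with hypothetical names, not a reimplementation of either algorithm in the paper: the agent acts greedily on its predicted utilities unless it is currently holding a piece of advice.

```python
from typing import List, Optional, Sequence


def greedy(q_values: Sequence[float]) -> int:
    """Index of the action with the highest predicted utility."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_with_advice(
    q_per_step: Sequence[Sequence[float]],
    advice_per_step: Sequence[Optional[int]],
    persistence: int,
) -> List[int]:
    """Choose one action per step, following advice while it is held.

    persistence=1 means advice covers only the step on which it is given;
    larger values keep following the same advice on later steps.
    """
    actions: List[int] = []
    held: Optional[int] = None
    remaining = 0
    for q_values, advice in zip(q_per_step, advice_per_step):
        if advice is not None:
            held, remaining = advice, persistence
        if remaining > 0 and held is not None:
            actions.append(held)
            remaining -= 1
        else:
            actions.append(greedy(q_values))
    return actions


# The agent's own estimate always prefers action 0; a human advises action 1 once.
q_per_step = [[1.0, 0.0]] * 4
advice_per_step = [1, None, None, None]

one_step = run_with_advice(q_per_step, advice_per_step, persistence=1)
persistent = run_with_advice(q_per_step, advice_per_step, persistence=3)
```

With one-step advice the agent reverts to its own (possibly aliased) estimate immediately, while the persistent variant keeps following the advised action for three steps before reverting.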

Asya's opinion: I like this paper, largely because learning from advice hasn't been applied much to 3D worlds, and this is a compelling proof of concept. I think it's also a noteworthy though expected result that advice that sticks temporally helps a lot when the ground truth visual evidence is difficult to interpret.

Forecasting

Two explanations for variation in human abilities (Matthew Barnett) (summarized by Flo): How quickly might AI exceed human capabilities? One piece of evidence is the variation of intelligence within humans: if there isn’t much variation, we might expect AI not to stay at human level intelligence for long. It has been argued that variation in human cognitive abilities is small compared to such variation for arbitrary agents. However, the variation of human ability in games like chess seems to be quite pronounced, and it took chess computers more than forty years to transition from beginner level to beating the best humans. The blog post presents two arguments to reconcile these perspectives:

First, similar minds could have large variation in learning ability: If we break a random part of a complex machine, it might perform worse or stop working altogether, even if the broken machine is very similar to the unbroken one. Variation in human learning ability might be mostly explainable by lots of small "broken parts" like harmful mutations.

Second, small variation in learning ability can be consistent with large variation in competence, if the latter is explained by variation in another factor like practice time. For example, a chess match is not very useful to determine who's smarter, if one of the players has played a lot more games than the other. This perspective also reframes AlphaGo's superhumanity: the version that beat Lee Sedol had played around 2000 times as many games as him.

Flo's opinion: I liked this post and am glad it highlighted the distinction between learning ability and competence that seems to often be ignored in debates about AI progress. I would be excited to see some further exploration of the "broken parts" model and its implication about differing variances in cognitive abilities between humans and arbitrary intelligences.

Miscellaneous (Alignment)

Chris Olah’s views on AGI safety (Evan Hubinger) (summarized by Matthew): This post is Evan's best attempt to summarize Chris Olah's views on how transparency is a vital component for building safe artificial intelligence. Chris distinguishes four separate ways in which transparency can help:

First, we can apply interpretability to audit our neural networks, or in other words, catch problematic reasoning in our models. Second, transparency can help safety by allowing researchers to deliberately structure their models in ways that systematically work, rather than using machine learning as a black box. Third, understanding transparency allows us to directly incentivize transparency in model design and decisions -- similar to how we grade humans on their reasoning (not just the correct answer) by having them show their work. Fourth, transparency might allow us to reorient the field of AI towards microscope AI: AI that gives us new ways of understanding the world, enabling us to be more capable, without itself taking autonomous actions.

Chris expects that his main disagreement with others is whether good transparency is possible as models become more complex. He hypothesizes that as models become more advanced, they will counterintuitively become more interpretable, as they will begin using crisper, more human-relatable abstractions. Finally, Chris recognizes that his view implies that we might have to re-align the ML community, but he remains optimistic because he believes there's a lot of low-hanging fruit, research into interpretability allows low-budget labs to remain competitive, and interpretability is aligned with the scientific virtue of understanding our tools.

Matthew's opinion: Developing transparency tools is currently my best guess for how we can avoid deception and catastrophic planning in our AI systems. I'm most excited about applying transparency techniques via the first and third routes, which primarily help us audit our models. I'm more pessimistic about the fourth approach because it predictably involves restructuring the incentives for machine learning as a field, which is quite difficult. My opinion might be different if we could somehow coordinate the development of these technologies.

Misconceptions about continuous takeoff (Matthew Barnett) (summarized by Flo): This post attempts to clarify the author's notion of continuous AI takeoff, defined as the growth of future AI capabilities being in line with extrapolation from current trends. In particular, that means that no AI project is going to bring sudden large gains in capabilities compared to its predecessors.

Such a continuous takeoff does not necessarily have to be slow. For example, generative adversarial networks have become better quite rapidly during the last five years, but progress has still been piecemeal. Furthermore, exponential gains, for example due to recursive self-improvement, can be consistent with a continuous takeoff, as long as the gains from one iteration of the improvement process are modest. However, this means that a continuous takeoff does not preclude large power differentials from arising: slight advantages can compound over time and actors might use their lead in AI development to their strategic advantage even absent discontinuous progress, much like western Europe used its technological advantage to conquer most of the world.
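The claim that modest per-iteration gains can still produce exponential growth and large power differentials can be checked with a toy calculation. The growth rates and step count below are arbitrary assumptions for illustration, not figures from the post.

```python
from typing import List


def capability_trajectory(initial: float, gain_per_step: float, steps: int) -> List[float]:
    """Capability after each step when every step multiplies it by (1 + gain)."""
    values = [initial]
    for _ in range(steps):
        values.append(values[-1] * (1.0 + gain_per_step))
    return values


# Hypothetical actors: the leader improves 5% per iteration, the follower 4%.
leader = capability_trajectory(1.0, 0.05, 100)
follower = capability_trajectory(1.0, 0.04, 100)

# No single iteration is a large jump over its predecessor...
largest_jump = max(after / before for before, after in zip(leader, leader[1:]))

# ...yet total growth is large, and the small edge compounds into a real gap.
total_growth = leader[-1] / leader[0]
lead_ratio = leader[-1] / follower[-1]
```

Under these assumptions every step is only a 5% improvement, total growth over 100 steps is more than a hundredfold, and the leader ends up roughly 2.6 times as capable as the follower.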

Knowing whether or not AI takeoff happens continuously is important for alignment research: A continuous takeoff would allow for more of an attitude of "dealing with things as they come up" and we should shift our focus to specific aspects that are hard to deal with as they come up. If the takeoff is not continuous, an agent might rapidly gain capabilities relative to the rest of civilization and it becomes important to rule out problems long before they come up.

Flo's opinion: I believe that it is quite important to be aware of the implications that different forms of takeoff should have on our prioritization and am glad that the article highlights this. However, I am a bit worried that this very broad definition of continuous progress limits the usefulness of the concept. For example, it seems plausible that a recursively self-improving agent which is very hard to deal with once deployed still improves its capabilities slowly enough to fit the definition, especially if its developer has a significant lead over others.

AI strategy and policy

Special Report: AI Policy and China – Realities of State-Led Development

Other progress in AI

Reinforcement learning

Let's Discuss OpenAI's Rubik's Cube Result (Alex Irpan) (summarized by Rohin): This post makes many points about OpenAI's Rubik's cube result (AN #70), but I'm only going to focus on two. First, the result is a major success for OpenAI's focus on design decisions that encourage long-term research success. In particular, it relied heavily on the engineering-heavy model surgery and policy distillation capabilities that allow them to modify e.g. the architecture in the middle of a training run (which we've seen with OpenAI Five (AN #19)). Second, the domain randomization doesn't help as much as you might think: OpenAI needed to put a significant amount of effort into improving the simulation to get these results, tripling the number of successes on a face rotation task. Intuitively, we still need to put a lot of effort into getting the simulation to be "near" reality, and then domain randomization can take care of the last little bit needed to robustly transfer to reality. Given that domain randomization isn't doing that much, it's not clear if the paradigm of zero-shot sim-to-real transfer is the right one to pursue. To quote the post's conclusion: "I see two endgames here. In one, robot learning reduces to building rich simulators that are well-instrumented for randomization, then using ludicrous amounts of compute across those simulators. In the other, randomization is never good enough to be more than a bootstrapping step before real robot data, no matter what the compute situation looks like. Both seem plausible to me, and we’ll see how things shake out."
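For readers unfamiliar with the term, domain randomization in its simplest form means resampling simulator parameters for every training episode, so that the policy cannot overfit to one particular simulated world. The sketch below shows plain uniform randomization; the parameter names and ranges are hypothetical, and it is not a description of OpenAI's actual setup.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical simulator parameters and ranges, chosen only for illustration.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "friction": (0.5, 1.5),
    "cube_mass_kg": (0.05, 0.15),
    "motor_delay_s": (0.0, 0.04),
}


def sample_sim_params(
    rng: random.Random,
    ranges: Dict[str, Tuple[float, float]] = PARAM_RANGES,
) -> Dict[str, float]:
    """Draw one independent uniform sample for every simulator parameter."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
episode_params: List[Dict[str, float]] = [sample_sim_params(rng) for _ in range(5)]
```

The ranges themselves encode how close the simulator is assumed to be to reality: if the true friction lies outside the sampled interval, no amount of randomization inside it will cover the gap, which is one way to read the point about needing the simulation to be "near" reality first.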

Rohin's opinion: As usual, Alex's analysis is spot on, and I have nothing to add beyond strong agreement.
