AI Alignment Research Overview (by Jacob Steinhardt)

I'm really excited to see someone outline all the work they think needs solving in AI alignment - to describe what the problem looks like, what a solution looks like, and what work has been done so far. Especially from Jacob, who is a coauthor of the Concrete Problems in AI Safety paper.

Below, I've included some excerpts from doc. I've included the introduction, the following section describing the categories of technical work, and some high-level information from the long sections on 'technical alignment problem' and the 'detecting failures in advance'.

Introduction

This document gives an overview of different areas of technical work that seem necessary, or at least desirable, for creating safe and aligned AI systems. The focus is on safety and alignment of powerful AI systems, i.e. systems that may exceed human capabilities in a broad variety of domains, and which likely act on a large scale. Correspondingly, there is an emphasis on approaches that seem scalable to such systems.

By “aligned”, I mean that the actions it pursues move the world towards states that humans want, and away from states that humans don’t want. Some issues with this definition are that different humans might have different preferences (I will mostly ignore this issue), and that there are differences between stated preferences, “revealed” preferences as implied by actions, and preferences that one endorses upon reflection (I won’t ignore this issue).

I think it is quite plausible that some topics are missing, and I welcome comments to that regard. My goal is to outline a critical mass of topics in enough detail that someone with knowledge of ML and some limited familiarity with AI alignment as an area would have a collection of promising research directions, a mechanistic understanding of why they are promising, and some pointers for what work on them might look like.

To that end, below I outline four broad categories of technical work: technical alignment (the overcoming of conceptual or engineering issues needed to create aligned AI), detecting failures (the development of tools for proactively assessing the safety/alignment of a system or approach), methodological understanding (best practices backed up by experience), and system-building (how to tie together the three preceding categories in the context of many engineers working on a large system). These are described in more detail in the next section.

In each section I give examples of problems we might want to solve. I imagine these in the context of future powerful AI systems, which means that most of the concrete scenarios are speculative, vague, and likely incorrect if interpreted as a prediction about the future. If I were to give the strongest justification for the research topics below, I would instead focus on near-future and existing systems, which already exhibit many of the issues I discuss. Nevertheless, I think this imaginative exercise can be helpful both for stimulating research and for keeping the focus on scalable solutions.

Caveats. I found it difficult to write a research overview of a field as nascent as AI alignment, as anything I could write sounded either too authoritative relative to my confidence, or so full of caveats and qualifications as to be unreadable. I settled for eliding many of the qualifications and providing this single caveat up front: that this document reflects an imperfect snapshot of my current thinking, that it expresses many ideas more sloppily than I would usually feel comfortable putting into writing, and that I hope readers will forgive this sloppiness in the service of saying something about a topic that I feel is important.

This document is not meant to be a description of my personal interests, but rather of potentially promising topics within a field I care about. My own interests are neither a subset nor superset of the topics in this document, although there is high overlap. Even confined to AI alignment, this document is out-of-date and omits some of my recent thinking on economic aspects of ML.

Finally, I make a number of claims below about what research directions I think are promising or un-promising. Some of these claims are likely wrong, and I could even imagine changing my mind after 1 hour of conversation with the right person. I decided that this document would be more informative and readable if I gave my unfiltered take (rather than only opinions I thought I would likely defend upon consideration), but the flip side is that if you think I’m wrong about something, you should let me know!

Categories of technical work

In this document, I will discuss four broad categories of technical work:

Technical alignment problem. Research on the “technical alignment problem” either addresses conceptual obstacles to making AI aligned with humans (e.g. robustness, reward mis-specification), or creates tools and frameworks that aid in making AI aligned (e.g. scalable reward generation).

Detecting failures in advance. Independently of having solved various alignment problems, we would like to have ways of probing systems / blueprints of systems to know whether they are likely to be safe. Example topics include interpretability, red-teaming, or accumulating checklists of failure modes to watch out for.

Methodological understanding. There is relatively little agreement or first-hand knowledge of how to make systems aligned or safe, and even less about which methods for doing so will scale to very powerful AI systems. I am personally skeptical of our ability to get alignment right based on purely abstract arguments without also having a lot of methodological experience, which is why I think work in this category is important. An example of a methodology-focused document is Martin Zinkevich’s Rules of Reliable ML, which addresses reliability of existing large systems.

System-building. It is possible that building powerful AI will involve a large engineering effort (say, 100+ engineers, 300k+ lines of code). In this case we need a framework for putting many components together in a safe way.

Technical alignment problem

We would ideally like to build AI that acts according to some specification of human values, and that is robust both to errors in the specification and to events in the world. To achieve this robustness, the system likely needs to represent uncertainty about both its understanding of human values and its beliefs about the world, and to act appropriately in the face of this uncertainty to avoid any catastrophic events.

I split the technical alignment problem correspondingly into four sub-categories:

Scalable reward generation. Powerful AI systems will potentially have to make decisions in situations that are foreign to humans or otherwise difficult to evaluate---for instance, on scales far outside human experience, or involving subtle but important downstream consequences. Since modern ML systems are primarily trained through human-labeled training data (or more generally, human-generated reward functions), this presents an obstacle to specifying which decisions are good in these situations. Scalable reward generation seeks to build processes for generating a good reward function.

Reward learning. Many autonomous agents seek to maximize the expected value of some reward function (or more broadly, to move towards some specified goal state / set of states). Optimizing against the reward function in this way can cause even slight errors in the reward to lead to large errors in behavior--typically, increased reward will be well-correlated with human-desirability for a while, but will become anti-correlated after a point. Reward learning seeks to reason about differences between the observed (proxy) reward and the true reward, and to converge to the true reward over time.

Out-of-distribution robustness is the problem of getting systems to behave well on inputs that are very different from their training data. This might be done by a combination of transfer learning (so the system works well in a broader variety of situations) and having more uncertainty in the face of unfamiliar/atypical inputs (so the system can at least notice where it is likely to not do well).

Acting conservatively. Safe outcomes are more likely if systems can notice situations where it is unclear how to act, and either avoid encountering them, take actions that reduce the uncertainty, or take actions that are robustly good. This would, for instance, allow us to specify an ambiguous reward function that the system could clarify as needed, rather than having to think about every possible case up-front.

Acting conservatively interfaces with reward learning and out-of-distribution robustness, as the latter two focus on noticing uncertainty while the former focuses on what to do given the uncertainty. Unfortunately, current methods for constructing uncertainty estimates seem inadequate to drive such decisions, and even given a good uncertainty estimate little work has been done on how the system should use it to shape its actions.

A toy framework. Conceptually, it may be useful to think in terms of the standard rational agent model, where an agent has a value function or utility function $V$ , and beliefs $P$ , and then takes actions $A$ that maximize the expected value of $V$ under $P$ (conditioned on the action $A$ ). Failures of alignment could come from incorrect beliefs $P$ , or a value function $V$ that does not lead to what humans want. Out-of-distribution robustness seeks to avoid or notice problems with $P$ , while scalable reward generation seeks to produce accurate information about some value function $V$ that is aligned with humans. Reward learning seeks to correct for inaccuracies in the reward generation process, as well as the likely limited amount of total data about rewards. Finally, acting conservatively takes into account the additional uncertainty due to acting out-of-distribution and having a learned reward function, and seeks to choose actions in a correspondingly conservative manner.

In an RL setting where we take actions via a learned policy, we can tell the same story but with a slightly modified diagram. Instead of an action $A$ we have a learned policy $θ$ , and instead of $P *$ and $~ P$ denoting beliefs, they denote distributions over environments ( $P *$ is the true on-policy environment at deployment time, while $~ P$ is the distribution of training environments).

Other topics. Beyond the topics above, the problem of counterfactual reasoning cuts across multiple categories and seems worth studying on its own. There may be other important categories of technical work as well.

Detecting failures in advance

The previous section lays out a list of obstacles to AI alignment and technical directions for working on them. This list may not be exhaustive, so we should also develop tools for discovering new potential alignment issues. Even for the existing issues, we would like ways of being more confident that we have solved them and what sub-problems remain.

While machine learning often prefers to hew close to empirical data, much of the roadmap for AI alignment has instead followed from more abstract considerations and thought experiments, such as asking “What would happen if this reward function were optimized as far as possible? Would the outcome be good?” I actually think that ML undervalues this abstract approach and expect it to continue to be fruitful, both for pointing to useful high-level research questions and for analyzing concrete systems and approaches.

At the same time, I am uncomfortable relying solely on abstract arguments for detecting potential failures. Rigorous empirical testing can make us more confident that a problem is actually solved and expose issues we might have missed. Finding concrete instantiations of a problem can both more fruitfully direct work and convince a larger set of people to care about it (as in the case of adversarial examples for images). More broadly, empirical investigations have the potential to reveal new issues that were missed under purely abstract considerations.

Two more empirically-focused ways of detecting failures are model probing/visualization and red-teaming, discussed below. Also valuable is examining trends in ML. For instance, it looks to me like reward hacking in real deployed systems is becoming a bigger issue over time; this provides concrete instances of the problem to examine for insight, gives us a way to measure how well we’re doing at the problem, and helps rally a community around the problem. Examining trends is also a good way to take an abstract consideration and make it more concrete.

AI ALIGNMENT FORUM
AF