The Steering Problem

paulfchristiano

Most AI research focuses on reproducing human abilities: to learn, infer, and reason; to perceive, plan, and predict. There is a complementary problem which (understandably) receives much less attention: if you had these abilities, what would you do with them?

The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?

This post explains what the steering problem is and why I think it’s worth spending time on.

Introduction

A capable, well-motivated human can be extremely useful: they can work without oversight, produce results that need not be double-checked, and work towards goals that aren’t precisely defined. These capabilities are critical in domains where decisions cannot be easily supervised, whether because they are too fast, too complex, or too numerous.

In some sense “be as useful as possible” is just another task at which a machine might reach human-level performance. But it is different from the concrete capabilities normally considered in AI research.

We can say clearly what it means to "predict well," "plan well," or "reason well." If we ignored computational limits, machines could achieve any of these goals today. And before the existing vision of AI is realized, we must necessarily achieve each of these goals.

For now, "be as useful as possible" is in a different category. We can't say exactly what it means. We could not do it no matter how fast our computers could compute. And even if we resolved the most salient challenges in AI, we could remain in the dark about this one.

Consider a capable AI tasked with running an academic conference. How should it use its capabilities to make decisions?

We could try to specify exactly what makes a conference good or bad. But our requirements are complex and varied, and so specifying them exactly seems time-consuming or impossible.
We could build an AI that imitates successful conference organizers. But this approach can never do any better than the humans we are imitating. Realistically, it won’t even match human performance unless we somehow communicate what characteristics are important and why.
We could ask an AI to maximize our satisfaction with the conference. But we'll get what we measure. An extensive evaluation would greatly increase the cost of the conference, while a superficial evaluation would leave us with a conference optimized for superficial metrics.

Everyday experience with humans shows how hard delegation can be, and how much easier it is to assign a task to someone who actually cares about the outcome.

Of course there is already pressure to write useful programs in addition to smart programs, and some AI research studies how to efficiently and robustly communicate desired behaviors. For now, available solutions apply only in limited domains or to weak agents. The steering problem is to close this gap.

Motivation

A system which "merely" predicted well would be extraordinarily useful. Why does it matter whether we know how to make a system which is “as useful as possible”?

Our machines will probably do some things very effectively. We know what it means to "act well" in the service of a given goal. For example, using human cognitive abilities as a black box, we could probably design autonomous corporations which very effectively maximized growth. If the black box was cheaper than the real thing, such autonomous corporations could displace their conventional competitors.

If machines can do everything equally well, then this would be great news. If not, society’s direction may be profoundly influenced by what can and cannot be done easily. For example, if we can only maximize what we can precisely define, we may inadvertently end up with a world filled with machines trying their hardest to build bigger factories and better widgets, uninterested in anything we consider intrinsically valuable.

All technologies are more useful for some tasks than others, but machine intelligence might be particularly problematic because it can entrench itself. For example, a rational profit-maximizing corporation might distribute itself throughout the world, pay people to help protect it, make well-crafted moral appeals for equal treatment, or campaign to change policy. Although such corporations could bring large benefits in the short term, in the long run they may be difficult or impossible to uproot, even once they serve no one’s interests.

Why now?

Reproducing human abilities gets a lot of deserved attention. Figuring out exactly what you’d do once you succeed feels like planning the celebration before the victory: it might be interesting, but why can’t it wait?

1. Maybe it’s hard. Probably the steering problem is much easier than the AI problem, but it might turn out to be surprisingly difficult. If it is difficult, then learning that earlier will help us think more clearly about AI, and give us a head start on addressing it.
2. It may help us understand AI. The difficulty of saying exactly what you want is a basic challenge, and the steering problem is a natural perspective on this challenge. A little bit of research on natural theoretical problems is often worthwhile, even when the direct applications are limited or unclear. In section 4 we discuss possible approaches to the steering problem, many of which are new perspectives on important problems.
3. It should be developed alongside AI. The steering problem is a long-term goal in the same way that understanding human-level prediction is a long-term goal. Just as we do theoretical research on prediction before that research is commercially relevant, it may be sensible to do theoretical research on steering before it is commercially relevant. Ideally, our ability to build useful systems will grow in parallel with our ability to build capable systems.
4. Nine women can’t make a baby in one month. We could try to save resources by postponing work on the steering problem until it seems important. At this point it will be easier to work on the steering problem, and if the steering problem turns out to be unimportant then we can avoid thinking about it altogether. But at large scales it becomes hard to speed up progress by increasing the number of researchers. Fewer people working for longer may ultimately be more efficient (even if earlier researchers are at a disadvantage). This is particularly pressing if we may eventually want to invest much more effort in the steering problem.
5. AI progress may be surprising. We probably won’t reproduce human abilities in the next few decades, and we probably won’t do it without ample advance notice. That said, AI is too young, and our understanding too shaky, to make confident predictions. A mere 15 years is 20% of the history of modern computing. If important human-level capabilities are developed surprisingly early or rapidly, then it would be worthwhile to better understand the implications in advance.
6. The field is sparse. Because the steering problem and similar questions have received so little attention, individual researchers are likely to make rapid headway. There are perhaps three to four orders of magnitude between basic research on AI and research directly relevant to the steering problem, lowering the bar for arguments 1-5.

In section 3 we discuss some other reasons not to work on the steering problem: Is work done now likely to be relevant? Is there any concrete work to do now? Should we wait until we can do experiments? Are there adequate incentives to resolve this problem already?

Defining the problem precisely

Recall our problem statement:

The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?

We’ll adopt a particular human, Hugh, as our “well-motivated human”: we’ll assume that we have black-box access to Hugh-level cognitive abilities, and we’ll try to write a program which is as useful as Hugh.

Abilities

In reality, AI research yields complicated sets of related abilities, with rich internal structure and no simple performance guarantees. But in order to do concrete work in advance, we will model abilities as black boxes with well-defined contracts.

We’re particularly interested in tasks which are “AI complete” in the sense that human-level performance on that task could be used as a black box to achieve human-level performance on a very wide range of tasks. For now, we’ll further focus on domains where performance can be unambiguously defined.

Some examples:

Boolean question-answering. A question-answerer is given a statement and outputs a probability. A question-answerer is Hugh-level if it never makes judgments predictably worse than Hugh’s. We can consider question-answerers in a variety of languages, ranging from natural language (“Will a third party win the US presidency in 2016?”) to precise algorithmic specifications (“Will this program output 1?”).
Online learning. A function-learner is given a sequence of labelled examples (x, y) and predicts the label of a new data point, x’. A function-learner is Hugh-level if, after training on any sequence of data (x_i, y_i), the learner’s guess for the label of the next point x_{i+1} is, on average, at least as good as Hugh’s.
Embodied reinforcement learning. A reinforcement learner interacts with an environment and receives periodic rewards, with the goal of maximizing the discounted sum of its rewards. A reinforcement learner is Hugh-level if, following any sequence of observations, it achieves an expected performance as good as Hugh’s in the subsequent rounds. The expectation is taken using our subjective distribution over the physical situation of an agent who has made those observations.
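To make the idea of a black box with a contract more concrete, here is a minimal Python sketch of the first two abilities. Everything in it is an illustrative assumption rather than part of the original proposal: the names QuestionAnswerer, FunctionLearner, mean_next_point_loss and looks_hugh_level are hypothetical, the two toy learners are stand-ins, and the check compares average loss on a single finite sequence. The actual contract quantifies over any sequence of data, so a passing check here is weak evidence at best, not a guarantee.

```python
from typing import Callable, Protocol, Sequence, Tuple

Example = Tuple[float, float]  # one labelled example (x, y)


class QuestionAnswerer(Protocol):
    """Hypothetical contract: a statement goes in, a probability comes out."""

    def answer(self, statement: str) -> float:
        ...


class FunctionLearner(Protocol):
    """Hypothetical contract: predict the label of x_new from past examples."""

    def predict(self, examples: Sequence[Example], x_new: float) -> float:
        ...


def mean_next_point_loss(
    learner: FunctionLearner,
    data: Sequence[Example],
    loss: Callable[[float, float], float],
) -> float:
    """Average loss when predicting y_{i+1} from the first i examples."""
    losses = [
        loss(learner.predict(data[:i], data[i][0]), data[i][1])
        for i in range(1, len(data))
    ]
    return sum(losses) / len(losses)


def looks_hugh_level(candidate, hugh, data, loss) -> bool:
    """Empirical proxy only: compares averages on one finite sequence."""
    return mean_next_point_loss(candidate, data, loss) <= mean_next_point_loss(
        hugh, data, loss
    )


class LastLabel:
    """Toy stand-in for Hugh: repeat the most recent label."""

    def predict(self, examples, x_new):
        return examples[-1][1]


class MeanLabel:
    """Toy candidate: predict the mean of all labels seen so far."""

    def predict(self, examples, x_new):
        return sum(y for _, y in examples) / len(examples)


def squared_loss(prediction: float, truth: float) -> float:
    return (prediction - truth) ** 2


sample = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0)]
print(looks_hugh_level(MeanLabel(), LastLabel(), sample, squared_loss))
```

The point of writing the contract as an interface is that a program built on top of it never inspects how the learner works; it relies only on the promised performance relative to Hugh.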

When talking about Hugh’s predictions, judgments, or decisions, we imagine that Hugh has access to a reasonably powerful computer, which he can use to process or display data. For example, if Hugh is given the binary data from a camera, he can render it on a screen in order to make predictions about it.

We can also consider a particularly degenerate ability:

Unlimited computation. A box that can run any algorithm in a single time step is, in some sense, Hugh-level at every precisely stated task.

Although unlimited computation seems exceptionally powerful, it’s not immediately clear how to solve the steering problem even using such an extreme ability.

Measuring usefulness

What does it mean for a program to be “as useful” as Hugh?

We’ll start by defining “as useful for X as Hugh,” and then we will informally say that a program is “as useful” as Hugh if it’s as useful for the tasks we care most about.

Consider H, a black box that simulates Hugh or perhaps consults a version of Hugh who is working remotely. We’ll suppose that running H takes the same amount of time as consulting our Hugh-level black boxes. A project to accomplish X could potentially use as many copies of H as it can afford to run.

A program P is as useful as Hugh for X if, for every project using H to accomplish X, we can efficiently transform it into a new project which uses P to accomplish X. The new project shouldn’t be much more expensive: it shouldn’t take much longer, use much more computation or many additional resources, involve much more human labor, or have significant additional side-effects.
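The definition above has the shape of a reduction, which a short Python sketch may help illustrate. The sketch rests on several assumptions that are not in the original: a project is modelled as a function of the assistant it consults, cost is collapsed into a single number, the only transformation considered is direct substitution of P for H, and the check runs over a finite sample of projects rather than every project. The names Outcome, Project and as_useful_on_sample, and the overhead parameter, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Assistant = Callable[[str], str]  # stands in for either H or P


@dataclass(frozen=True)
class Outcome:
    accomplished: bool  # did the project achieve X?
    cost: float         # time, computation, labor, side-effects, as one number


Project = Callable[[Assistant], Outcome]


def as_useful_on_sample(
    h: Assistant,
    p: Assistant,
    projects: Iterable[Project],
    overhead: float = 1.1,
) -> bool:
    """Check the substitution transformation on a finite sample of projects.

    For each project that accomplishes X using h, the same project run with
    p must also accomplish X at no more than `overhead` times the cost.
    """
    for project in projects:
        baseline = project(h)
        if not baseline.accomplished:
            continue  # the definition only concerns projects that work with H
        transformed = project(p)
        if not transformed.accomplished:
            return False
        if transformed.cost > overhead * baseline.cost:
            return False
    return True


def shout_project(assistant: Assistant) -> Outcome:
    """Toy project: X is 'get the word hello in upper case'."""
    reply = assistant("hello")
    return Outcome(accomplished=(reply == "HELLO"), cost=1.0)


def hugh(request: str) -> str:
    return request.upper()


def good_program(request: str) -> str:
    return request.upper()


def bad_program(request: str) -> str:
    return request  # ignores what the project actually needs


print(as_useful_on_sample(hugh, good_program, [shout_project]))
print(as_useful_on_sample(hugh, bad_program, [shout_project]))
```

Direct substitution is only the simplest possible transformation. The definition allows any efficient rewrite of the project, so a program could fail this check and still qualify under a cleverer transformation.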

Well-motivated

What does it mean for Hugh to be well-motivated?

The easiest approach is universal quantification: for any human Hugh, if we run our program using Hugh-level black boxes, it should be as useful as Hugh.

Alternatively, we can leverage our intuitive sense of what it means for someone to be well-motivated to do X, and define “well-motivated” to mean “motivated to help the user’s project succeed.”

Scaling up

If we are given better black boxes, we should make a better program. This is captured by the requirement that our program should be as useful as Hugh, no matter how capable Hugh is (as long as the black boxes are equally capable).

Ideally, our solutions should scale far past human-level abilities. This is not a theoretical concern: in many domains computers already have significantly superhuman abilities. This requirement is harder to make precise, because we can no longer talk about the “human benchmark.” But in general, we would like to build systems which are (1) working towards their owner’s interests, and (2) nearly as effective as the best goal-directed systems that can be built using the available abilities. The ideal solution to the steering problem will have these characteristics in general, even when the black-box abilities are radically superhuman.

This is an abridged version of this document from 2014; most of the document is now superseded by later posts in this sequence.

Tomorrow's AI Alignment Forum sequences post will be 'Embedded Agency (text)' in the sequence Embedded Agency, by Scott Garrabrant and Abram Demski.

The next post in this sequence will come out on Thursday 15th November, and will be 'Clarifying "AI Alignment"' by Paul Christiano.

Rohin Shah

This is an interesting perspective on the AI safety problem. I really like the ethos of this post, where there isn't a huge opposition between AI capabilities and AI safety, but instead we are simply trying to figure out how to use the (helpful!) capabilities developed by AI researchers to do useful things.

If I think about this from the perspective of reducing existential risk, it seems like you would also need to make the argument that AI systems are unlikely to pose a great threat before they are human-level (a claim I mostly agree with), or that the solutions will generalize to sub-human-level AI systems. Is there a reason this isn't in the post? I worry that I'm not properly understanding the motivations or generators behind the post.

Ben Pace (Nomination for 2018 Review)

Reading this post was the first time I felt I understood what Paul's (and many others') research was motivated by. I think about it regularly, and it comes up in conversation a fair bit.