I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of *defining* the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is *not* the problem of actually *computing* behavior that we want to see -- that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would allow us to *in theory* compute the behavior that an AI system would take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.

We will consider a model in which our AI system is maximizing the expected utility of some *explicitly* represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be an explicit part of the AI that is then maximized.
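To make the model concrete, here is one way it could be written down. The notation ($h$, $\pi$, $U$, $R$, and the finite horizon $T$) is my own illustrative choice and is not fixed by anything else in this post. The second and third lines sketch why, under the finite-horizon assumption, history-dependent utility functions and history-dependent reward functions are interchangeable.

```latex
% Illustrative notation (an assumption, not taken from the post):
% h = (o_1, a_1, \dots, o_T, a_T) is a finite history of observations and actions,
% and h_{1:t} is its prefix up to time t.

% The AI picks a policy that maximizes expected utility over histories:
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{h \sim P(\cdot \mid \pi)} \big[ U(h) \big]

% A history-dependent reward R induces a utility by summation:
U(h) = \sum_{t=1}^{T} R(h_{1:t})

% Conversely, any U is induced by a reward that pays out only at the final step:
R(h_{1:t}) =
\begin{cases}
  U(h_{1:T}) & \text{if } t = T \\
  0          & \text{otherwise}
\end{cases}
```

A reward that depends only on the current state, as is typical in RL, is the special case in which $R(h_{1:t})$ ignores everything in the history except the most recent state.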

I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI. It is also the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards, although rewards in RL typically depend only on the current state (and sometimes the action) rather than on the full history.

A lot of conceptual arguments, as well as experiences with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in a specification can lead to bad results. However, machine learning is particularly good at using data to home in on the correct hypothesis within a vast space of possibilities, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning, where we *learn* an explicit utility function from human behavior for the AI to maximize.

This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at *imitation learning*, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious -- it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably than humans do can *exceed* human performance, making the world better for humans than they could have made it themselves.
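The difference between imitating the human and optimizing a learned utility function can be seen in a toy example. The sketch below is only an illustration under strong assumptions that are mine, not part of the proposal: the "human" chooses among three actions with Boltzmann-rational noise, the rationality parameter `beta` is known to the learner, and every name (`simulate_human`, `infer_utility`, and so on) is hypothetical.

```python
import math
import random


def softmax(values, beta):
    """Boltzmann choice probabilities for the given utilities."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_human(true_utility, beta, n, rng):
    """Sample n noisy choices from a Boltzmann-rational 'human' (assumed model)."""
    probs = softmax(true_utility, beta)
    return rng.choices(range(len(true_utility)), weights=probs, k=n)


def infer_utility(choices, n_actions, beta):
    """Estimate utilities from choice frequencies, up to an additive constant.

    Under the assumed Boltzmann model with known beta, choice probabilities
    are proportional to exp(beta * u), so u = log(frequency) / beta + const.
    Add-one smoothing avoids log(0) for actions that were never chosen.
    """
    counts = [0] * n_actions
    for c in choices:
        counts[c] += 1
    total = len(choices) + n_actions
    return [math.log((k + 1) / total) / beta for k in counts]


rng = random.Random(0)
true_utility = [1.0, 2.0, 3.0]  # hidden from the learner
beta = 1.0                      # assumed known to the learner

choices = simulate_human(true_utility, beta, 5000, rng)
learned = infer_utility(choices, len(true_utility), beta)

# Imitation reproduces the human's noisy behavior; its value is the human average.
human_avg = sum(true_utility[c] for c in choices) / len(choices)

# Optimizing the learned utility always picks the action estimated to be best.
ai_action = max(range(len(learned)), key=lambda a: learned[a])
ai_value = true_utility[ai_action]

print(f"human average utility: {human_avg:.3f}")
print(f"AI utility when maximizing learned function: {ai_value:.3f}")
```

In this toy setting the optimizer does better than the demonstrator because it never makes the human's noisy mistakes. Note how much work the assumptions are doing: the learner is simply handed the correct model of how the human's choices relate to their utility.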

It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning -- surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? The reason is that ambitious value learning by itself would be enough for safety only if we assume as much compute and data as we want. These are really powerful assumptions -- for example, I’m assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:

- Attempting to use the utility function to choose actions before it has converged
- Distributional shift causing the learned utility function to become invalid
- Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly

The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment *after* IRL has run).