Preface to the sequence on value learning

Rohin Shah

This is a meta-post about the upcoming sequence on Value Learning that will start to be published this Thursday. This preface will also be revised significantly once the second half of the sequence is fully written.

Purpose of the sequence

The first part of this sequence will be about the tractability of ambitious value learning, which is the idea of inferring a utility function for an AI system to optimize based on observing human behavior. After a short break, we will (hopefully) continue with the second part, which will be about why we might want to think about techniques that infer human preferences, even if we assume we won’t do ambitious value learning with such techniques.

The aim of this part of the sequence is to gather the current best public writings on the topic, and provide a unifying narrative that ties them into a cohesive whole. This makes the key ideas more discoverable and discussable, and provides a quick reference for existing researchers. It is meant to teach the ideas surrounding one specific approach to aligning advanced AI systems.

We’ll explore the specification problem, in which we would like to define the behavior we want to see from an AI system. Ambitious value learning is one potential avenue of attack on the specification problem, that assumes a particular model of an AI system (maximizing expected utility) and a particular source of data (human behavior). We will then delve into conceptual work on ambitious value learning that has revealed obstructions to this approach. There will be pointers to current research that aims to circumvent these obstructions.

The second part of this sequence is currently being assembled, and this preface will be updated with details once it is ready.

The first half of this sequence takes you near the cutting edge of conceptual work on the ambitious value learning problem, with some pointers to work being done at this frontier. Based on the arguments in the sequence, I am confident that the obvious formulation of ambitious value learning has major, potentially insurmountable conceptual hurdles given the ways that AI systems work currently, but it may be possible to pose a different formulation that does not suffer from these issues, or to add hardcoded assumptions to the AI system to avoid impossibility results. If you try to disprove the arguments in the posts, or to create formalisms that sidestep the issues brought up, you may very well generate a new interesting direction of work that has not been considered before.

There is also a community of researchers working on inverse reinforcement learning without focusing on its application to ambitious value learning; this is out of the scope of the first half of this sequence, even though such work may still be relevant to long term safety.

Requirements for the sequence

Understanding these posts will require at least a passing familiarity with the basic principles of machine learning (not deep learning), such as “the parameters of a model are chosen to maximize the log probability that the model assigns to the observed dataset”. No other knowledge about value learning is required. If you do not have this background, I am not sure how easy it will be to grasp the points made; many of the points feel intuitive to me even without an ML background, but this could be because I no longer remember what it was like to not have ML intuitions.

There are many different subcultures interested in AI safety, and the posts I have chosen to include involve linguistic choices and assumptions from different places. I have tried to make this sequence understandable to all people who are interested and who understand the basic principles of ML, and so if something seems odd/confusing, please do let me know, either in the comments or via the PM system.

Learning from this sequence

When collating this sequence, I tried to pick content that makes the most important points simply and concisely. I recommend reading through each post carefully, taking the time to understand each paragraph. The posts range from informal arguments to formal theorems, but even for the formal theorems the formalization of the problem could be changed to invalidate the theorem. Learn from this however you best learn; my preferred method is to try and disprove the argument in the post until I feel like I understand what the post actually conveys.

While this sequence as it stands has no exercises, what it does have is a surrounding forum and community. Here are a few actions you can take to aid both your and others’ understanding of the core concepts:

Leave a comment with a concise summary of what you understand to be the post/paper’s main point
Leave a comment outlining a confusion you have with paper/post
Respond to someone else’s comment to help them understand it better

While I can’t commit to responding to the majority of the comments, I am also excited to help readers understand the content, and please let me know if something I write is confusing.

Each post has a note at the top saying what the post covers and who should read it. You can read through these notes and decide whether they are important for you. That said, the posts are written and organized assuming that you have read prior posts in the sequence, and many points will not make sense if read out of order.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

21

Preface to the sequence on value learning

21

Purpose of the sequence

Requirements for the sequence

Learning from this sequence