At the recent EAGx Oxford meetup, I ended up talking with a lot of people (18 people, back to back, on Sunday - for some reason, that day is a bit of a blur). Naturally, many of the conversations turned to value extrapolation/concept extrapolation, the main current focus of our Aligned AI startup. I explained the idea multiple times and in multiple different ways. Different presentations were useful for people from different backgrounds.

So I've collected the different presentations in this post. Hopefully this will allow people to find the explanation that provides the greatest clarity for them. I think many will also find it interesting to read some of the other presentations: from our perspective, these are just different facets of the same phenomenon[1].

For those worried about AI existential risk

A superintelligence trained on videos of happy humans may well tile the universe with videos of happy humans - that is a standard alignment failure mode. But "make humans happy" is also a reward function compatible with the data.

So let $D$ be the training data of videos of happy humans, $R_H$ the correct "make humans happy" reward function, and $R_V$ the degenerate "make videos of happy humans" reward function[2].

We'd want the AI to deduce $R_H$ from $D$. But even just generating $R_H$ as a candidate is a good success. The AI could then get feedback as to whether $R_H$ or $R_V$ is correct, or maximise a conservative mix of $R_H$ and $R_V$ (e.g. $\min(R_H, R_V)$). Maximising that conservative mix will result in a lot of videos of happy humans - but also a lot of happy humans.
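As a minimal sketch of that conservative mix, here is some toy Python with stand-in functions for $R_H$ and $R_V$ (the state fields are assumptions for illustration, not an actual implementation):

```python
# Toy stand-ins for the two candidate reward functions; a real system
# would have to learn these from the training data D.
def R_H(state) -> float:
    """Candidate reward: actual happy humans in the world."""
    return state["happy_humans"]

def R_V(state) -> float:
    """Candidate reward: videos of happy humans on the feed."""
    return state["happy_human_videos"]

def conservative_reward(state) -> float:
    """Worst case over the candidate rewards: an agent maximising this
    must do well on both R_H and R_V simultaneously."""
    return min(R_H(state), R_V(state))

# A world full of videos but empty of happy humans scores poorly:
print(conservative_reward({"happy_humans": 0.0, "happy_human_videos": 10.0}))  # 0.0
print(conservative_reward({"happy_humans": 5.0, "happy_human_videos": 10.0}))  # 5.0
```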

For philosophers

Can you define what a human being is? Could you make a definition that works, in all circumstances and in all universes, no matter how bizarre or alien the world becomes?

A full definition has eluded philosophers ever since humans were categorised as "featherless bipeds with broad flat nails".

Concept extrapolation has another way of generating this definition. We would point at all living humans in the world and say "these are humans[3]."

Then we would instruct the AI: "please extrapolate the concept of 'human' from this data". As long as the AI is capable of doing that extrapolation better than we could ourselves, this would give us an extrapolation of the concept "human" to new circumstances without needing to write out a full definition.

For ML engineers into image classification

The paper Diversify and Disambiguate discusses a cow-grass-camel-sand example, which is quite similar to the husky-wolf example of this post.

Suppose that we have two labelled sets, $A$ consisting of cows on grass, and $B$ consisting of camels on sand.

We'd like to train two classifiers that distinguish $A$ from $B$, but use different features to do so. Ideally, the first classifier would end up distinguishing cows from camels, while the second distinguishes grass from sand. Of course, we'd want them to do so independently, without needing humans labelling cows, grass, camels, or sand.
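As a rough illustration (this is a simplified sketch, not the paper's exact objective), one could train a shared feature extractor with two heads that both fit the labelled data, plus a term encouraging the heads to disagree on unlabelled images where the cues conflict (e.g. cows on sand):

```python
import torch.nn.functional as F

def diversify_loss(backbone, heads, x_labelled, y, x_unlabelled, weight=1.0):
    """Fit both heads on the labelled cow/camel data, and push their
    predictions apart on unlabelled images where the features can conflict."""
    feats_l = backbone(x_labelled)   # shared feature extractor
    feats_u = backbone(x_unlabelled)

    # Both heads must predict the labels on the training distribution.
    fit = sum(F.cross_entropy(head(feats_l), y) for head in heads)

    # On the unlabelled data, penalise agreement between the two heads'
    # predictive distributions, so they latch onto different features.
    p0 = F.softmax(heads[0](feats_u), dim=-1)
    p1 = F.softmax(heads[1](feats_u), dim=-1)
    agreement = (p0 * p1).sum(dim=-1).mean()

    return fit + weight * agreement
```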

For ML engineers focusing on current practical problems

An AI classifier was trained on x-ray images to detect pneumothorax (collapsed lungs). It was quite successful - until further analysis revealed that it was acting as a chest drain detector. The chest drain is a treatment for pneumothorax, so the classifier was mostly picking out cases that had already been diagnosed and treated - making that classification useless.

We would want the classifier to generate "collapsed lung detector" and "chest drain detector" as separate classifiers, and then ask its programmers which one it should be classifying on.

For RL engineers

CoinRun is a procedurally generated set of environments, a simplified Mario-style platform game. The reward is given by reaching the coin on the right.

Since the coin is always at the right of the level, there are two equally valid simple explanations of the reward: the agent must reach the coin, or the agent must reach the right side of the level.

When agents trained on CoinRun are tested on environments that move the coin to another location, they tend to ignore the coin and go straight to the right side of the level. Note that the agent is following a policy, rather than generating a reward; still, the policy it follows is one that implicitly follows the "reach the right" reward rather than the "reach the coin" one.

We need an alternative architecture that generates both of these rewards[4] and is then capable of either choosing between them or maximising a conservative mix of them (so that it would, e.g., go to the right while picking up the coin along the way). This needs to be done in a generalisable way.
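One hedged sketch of what that conservative mix could mean here (the observation fields are assumptions for illustration, not CoinRun's actual interface): evaluate a whole trajectory against both reward hypotheses and only credit what both would credit.

```python
# Two reward hypotheses that both explain the training levels, where the
# coin always sits at the right-hand end of the level.

def r_coin(obs) -> float:
    """Reward hypothesis 1: the agent has reached the coin."""
    return 1.0 if (obs["agent_x"], obs["agent_y"]) == (obs["coin_x"], obs["coin_y"]) else 0.0

def r_right(obs) -> float:
    """Reward hypothesis 2: the agent has reached the right side of the level."""
    return 1.0 if obs["agent_x"] >= obs["level_width"] - 1 else 0.0

def conservative_return(trajectory) -> float:
    """Trajectory-level conservative mix: only credit behaviour that both
    hypotheses would reward, i.e. go right *and* pick up the coin."""
    got_coin = max(r_coin(obs) for obs in trajectory)
    got_right = max(r_right(obs) for obs in trajectory)
    return min(got_coin, got_right)
```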

For investors

A major retail chain wants to train their CCTV cameras to automatically detect shoplifters. They train a classifier on the examples of shoplifting in their databases.

The problem is that those examples are correlated with other variables. They may end up training a racial classifier, or an algorithm that identifies certain styles of clothing.

That is disastrous: firstly because of the potential PR problems, and secondly because the classifier won't successfully identify shoplifters.

Ideally, the AI would implicitly generate "shoplifter", "racial group", and "clothing style" as separate classifiers, and then enquire, using active learning, as to what its purpose actually is. This allows the AI to classify properly for the purposes it was designed for - and only those purposes.
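A minimal sketch of that active-learning step (the classifier objects and their `score` method are hypothetical): rank the unlabelled footage by how much the candidate classifiers disagree on it, and send exactly those clips to a human for labelling.

```python
def disagreement(classifiers, clip) -> float:
    """Spread of the candidate classifiers' scores on a single clip."""
    scores = [clf.score(clip) for clf in classifiers]
    return max(scores) - min(scores)

def clips_to_query(classifiers, unlabelled_clips, budget=10):
    """Pick the clips whose human labels would best tell the candidate
    classifiers (shoplifter / racial group / clothing style) apart."""
    ranked = sorted(unlabelled_clips,
                    key=lambda clip: disagreement(classifiers, clip),
                    reverse=True)
    return ranked[:budget]
```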

For those working in AI alignment

Sometimes someone develops a way to keep AIs safe, by adding some constraints. For example, attainable utility preservation developed a formula to try to encode the concept of "power" for an AI, with a penalty term for having too much power.
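Roughly - this is a simplified statement of the attainable utility preservation penalty, up to scaling, and not necessarily the exact formula discussed in that post - the penalty compares how well the agent could pursue a set of auxiliary goals after an action versus after doing nothing:

$$\text{Penalty}(s,a) \;=\; \sum_{i} \big|\, Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \,\big|,$$

where the $R_i$ are auxiliary reward functions and $\varnothing$ is the no-op action.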

With some difficulty, I constructed a situation where that formula failed to constrain the AI, via a subagent.

Essentially, the formal definition and the intuitive concept of power overlap in typical environments. But in extreme situations, they come apart. What is needed is an AI that can extrapolate the concept of power rather than the formal definition.

Doing this for other concepts would allow a lot of alignment methods to succeed, such as avoiding side-effects, low impact, corrigibility, and others.

For those using GPT-3

As detailed here, we typed "ehT niar ni niapS syats ylniam ni eht" into GPT-3. This is "The rain in Spain stays mainly in the", with the words spelt backwards. The correct completion is "nialp", the reverse of "plain".

GPT-3 correctly "noticed" that the words were spelt backwards, but failed to extend its goal and complete the sentence in a human-coherent way.

For those focused on how humans extrapolate their own values

A well-behaved child, brought up in a stable society, will learn, typically in early adolescence, that there is a distinction between "lawful" and "good". The concept of "well-behaved" has splintered into two, and now the child has to sort out how they should behave[5].

Recall also people's first reactions to hearing the trolley problem, especially the "large man" variant. They often want to deny the premises, or find a third option. The challenge is that "behave well and don't murder" is being pulled apart from "do good in the world", when the two are normally bound together.

In the future, we humans will continue to encounter novel situations where our past values are not clear guides to what to do. My favourite example is what to do if someone genetically engineers a humanoid slave race that strongly wants to be enslaved, but doesn't enjoy being enslaved. We can develop moral values to deal with the complexity of situations like this, but it requires some work: we don't know what our values are; we have to extrapolate them.

And, ideally, an AI would extrapolate at least as well as we would.


  1. Note that concept extrapolation has two stages: generating the possible extrapolations, and then choosing among them - diversify and disambiguate, in the terminology of this paper. We'll typically focus on the first part, the "diversify" part, mainly because that has to be done first, but also because there might not be any unambiguous choices at the disambiguate stage - what's the right extrapolation of "liberty", for instance? ↩︎

  2. There are going to be many more reward functions in practice. But the simplest ones will fit into two rough categories, those that are defined over the video feed, and those defined by the humans in the world that were the inputs to the video feed. ↩︎

  3. We could also point at things like brain-dead people and say "these have many human features, but are not full humans". Or point at some apes and ants and say "these are non-human, but the apes are more human-like than the ants". The more the dataset captures our complex intuitions about humanness, the better. ↩︎

  4. Conceptually, this is much easier to do if we think "generate both rewards" -> "choose conservative mix" -> "choose policy that maximises conservative mix", but it might be the case that the policy is constructed directly via some process. Learning policies seems easier than learning rewards, but mixing rewards seems easier than mixing policies, so I'm unsure what will be the best algorithm here. ↩︎

  5. It doesn't help that "well-behaved" was probably called "good" when the child was younger. So the concept has splintered, but the name has not. ↩︎

Comments

I really like this. Ever since I read your first model splintering post, it's been a central part of my thinking too.

I feel cautiously optimistic about the prospects for generating multiple hypotheses and detecting when they come into conflict out-of-distribution (although the details are kinda different for the Bayes-net-ish models that I tend to think about than the deep neural net models that I understand y'all are thinking about).

I remain much more confused about what to do when that detector goes off, in a future AGI.

I imagine a situation where some inscrutably complicated abstract concept in the AGI’s world-model comes apart / splinters away from a different inscrutably complicated abstract concept in the AGI’s world-model. OK, what do we do now?

Ideally the AGI would query the human about what to do. But I don't know how one might write code that does that. It's not like an image classifier where you can just print out some pictures and show them to the person.

(One solution would be: leverage the AGI’s own intelligence for how to query the human, by making the AGI motivated to learn what the human in fact wants in that ambiguous situation. But then it stops being a safety feature for that aspect of the AGI’s motivation itself.)

When I think of useful concepts in AI alignment that I frequently refer to, there are a bunch from the olden days (e.g. “instrumental convergence”, “treacherous turn”, …), and a bunch of idiosyncratic ones that I made up myself for my own purposes, and just a few others, one of which is “concept extrapolation”. For example I talk about it here. (Others in that last category include “goal misgeneralization” [here’s how I use the term] (which is related to concept extrapolation) and “inner and outer alignment” [here’s how I use the term].)

So anyway, in the context of the 2022 Review, I would be sad if there were a compilation of intellectual progress in AI alignment on lesswrong that made no mention of “concept extrapolation” (or its previous term “model splintering”). This post seems the best introduction. I gave it my highest vote.