At the recent EAGx Oxford meetup, I ended up talking with a lot of people (18 people, back to back, on Sunday - for some reason, that day is a bit of a blur). Naturally, many of the conversations turned to value extrapolation/concept extrapolation, the main current focus of our Aligned AI startup. I explained the idea multiple times and in multiple different ways. Different presentations were useful for people from different backgrounds.

So I've collected the different presentations in this post. Hopefully this will allow people to find the explanation that provides the greatest clarity for them. I think many will also find it interesting to read some of the other presentations: from our perspective, these are just different facets of the same phenomenon[1].

For those worried about AI existential risk

A superintelligence trained on videos of happy humans may well tile the universe with videos of happy humans - that is a standard alignment failure mode. But "make humans happy" is also a reward function compatible with the data.

So let $D$ be the training data of videos of happy humans, $R_H$ the correct "make humans happy" reward function, and $R_V$ the degenerate "make videos of happy humans" reward function[2].

We'd want the AI to deduce $R_H$ from $D$. But even just generating $R_H$ as a candidate is a good success. The AI could then get feedback as to whether $R_H$ or $R_V$ is correct, or maximise a conservative mix of $R_H$ and $R_V$ (e.g. $\min(R_H, R_V)$). Maximising that conservative mix will result in a lot of videos of happy humans - but also a lot of happy humans.
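As a minimal sketch of that conservative mix, here is some toy Python with stand-in functions for $R_H$ and $R_V$ (the state fields are assumptions for illustration, not an actual implementation):

```python
# Toy stand-ins for the two candidate reward functions; a real system
# would have to learn these from the training data D.
def R_H(state) -> float:
    """Candidate reward: actual happy humans in the world."""
    return state["happy_humans"]

def R_V(state) -> float:
    """Candidate reward: videos of happy humans on the feed."""
    return state["happy_human_videos"]

def conservative_reward(state) -> float:
    """Worst case over the candidate rewards: an agent maximising this
    must do well on both R_H and R_V simultaneously."""
    return min(R_H(state), R_V(state))

# A world full of videos but empty of happy humans scores poorly:
print(conservative_reward({"happy_humans": 0.0, "happy_human_videos": 10.0}))  # 0.0
print(conservative_reward({"happy_humans": 5.0, "happy_human_videos": 10.0}))  # 5.0
```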

For philosophers

Can you define what a human being is? Could you make a definition that works, in all circumstances and in all universes, no matter how bizarre or alien the world becomes?

A full definition has eluded philosophers ever since humans were categorised as "featherless bipeds with broad flat nails".

Concept extrapolation has another way of generating this definition. We would point at all living humans in the world and say "these are humans[3]."

Then we would instruct the AI: "please extrapolate the concept of 'human' from this data". As long as the AI is capable of doing that extrapolation better than we could ourselves, this would give us an extrapolation of the concept "human" to new circumstances without needing to write out a full definition.

For ML engineers into image classification

The paper Diversify and Disambiguate discusses a cow-grass-camel-sand example, which is quite similar to the husky-wolf example of this post.

Suppose that we have two labelled sets, $A$ consisting of cows on grass, and $B$ consisting of camels on sand.

We'd like to train two classifiers that distinguish $A$ from $B$, but use different features to do so. Ideally, the first classifier would end up distinguishing cows from camels, while the second distinguishes grass from sand. Of course, we'd want them to do so independently, without needing humans labelling cows, grass, camels, or sand.
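As a rough illustration (this is a simplified sketch, not the paper's exact objective), one could train a shared feature extractor with two heads that both fit the labelled data, plus a term encouraging the heads to disagree on unlabelled images where the cues conflict (e.g. cows on sand):

```python
import torch.nn.functional as F

def diversify_loss(backbone, heads, x_labelled, y, x_unlabelled, weight=1.0):
    """Fit both heads on the labelled cow/camel data, and push their
    predictions apart on unlabelled images where the features can conflict."""
    feats_l = backbone(x_labelled)   # shared feature extractor
    feats_u = backbone(x_unlabelled)

    # Both heads must predict the labels on the training distribution.
    fit = sum(F.cross_entropy(head(feats_l), y) for head in heads)

    # On the unlabelled data, penalise agreement between the two heads'
    # predictive distributions, so they latch onto different features.
    p0 = F.softmax(heads[0](feats_u), dim=-1)
    p1 = F.softmax(heads[1](feats_u), dim=-1)
    agreement = (p0 * p1).sum(dim=-1).mean()

    return fit + weight * agreement
```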

For ML engineers focusing on current practical problems

An AI classifier was trained on x-ray images to detect pneumothorax (collapsed lungs). It was quite successful - until further analysis revealed that it was acting as a chest drain detector. The chest drain is a treatment for pneumothorax, so the classifier was mostly picking out cases that had already been diagnosed and treated - making that classification useless.

We would want the classifier to generate "collapsed lung detector" and "chest drain detector" as separate classifiers, and then ask its programmers which one it should be classifying on.

For RL engineers

CoinRun is a procedurally generated set of environments, a simplified Mario-style platform game. The reward is given by reaching the coin on the right.

Since the coin is always at the right of the level, there are two equally valid simple explanations of the reward: the agent must reach the coin, or the agent must reach the right side of the level.

When agents trained on CoinRun are tested on environments that move the coin to another location, they tend to ignore the coin and go straight to the right side of the level. Note that the agent is following a policy, rather than generating a reward; still, the policy it follows is one that implicitly follows the "reach the right" reward rather than the "reach the coin" one.

We need an alternative architecture that generates both of these rewards[4] and is then capable of either choosing between them or maximising a conservative mix of them (so that it would, e.g., go to the right while picking up the coin along the way). This needs to be done in a generalisable way.
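One hedged sketch of what that conservative mix could mean here (the observation fields are assumptions for illustration, not CoinRun's actual interface): evaluate a whole trajectory against both reward hypotheses and only credit what both would credit.

```python
# Two reward hypotheses that both explain the training levels, where the
# coin always sits at the right-hand end of the level.

def r_coin(obs) -> float:
    """Reward hypothesis 1: the agent has reached the coin."""
    return 1.0 if (obs["agent_x"], obs["agent_y"]) == (obs["coin_x"], obs["coin_y"]) else 0.0

def r_right(obs) -> float:
    """Reward hypothesis 2: the agent has reached the right side of the level."""
    return 1.0 if obs["agent_x"] >= obs["level_width"] - 1 else 0.0

def conservative_return(trajectory) -> float:
    """Trajectory-level conservative mix: only credit behaviour that both
    hypotheses would reward, i.e. go right *and* pick up the coin."""
    got_coin = max(r_coin(obs) for obs in trajectory)
    got_right = max(r_right(obs) for obs in trajectory)
    return min(got_coin, got_right)
```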

For investors

A major retail chain wants to train their CCTV cameras to automatically detect shoplifters. They train a classifier on the examples of shoplifting in their databases.

The problem is that those examples are correlated with other variables. They may end up training a racial classifier, or an algorithm that identifies certain styles of clothing.

That is disastrous: firstly because of the potential PR problems, and secondly because the classifier won't successfully identify shoplifters.

Ideally, the AI would implicitly generate "shoplifter", "racial group", and "clothing style" as separate classifiers, and then enquire, using active learning, as to what its purpose actually is. This allows the AI to classify properly for the purposes it was designed for - and only those purposes.
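A minimal sketch of that active-learning step (the classifier objects and their `score` method are hypothetical): rank the unlabelled footage by how much the candidate classifiers disagree on it, and send exactly those clips to a human for labelling.

```python
def disagreement(classifiers, clip) -> float:
    """Spread of the candidate classifiers' scores on a single clip."""
    scores = [clf.score(clip) for clf in classifiers]
    return max(scores) - min(scores)

def clips_to_query(classifiers, unlabelled_clips, budget=10):
    """Pick the clips whose human labels would best tell the candidate
    classifiers (shoplifter / racial group / clothing style) apart."""
    ranked = sorted(unlabelled_clips,
                    key=lambda clip: disagreement(classifiers, clip),
                    reverse=True)
    return ranked[:budget]
```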

For those working in AI alignment

Sometimes someone develops a way to keep AIs safe, by adding some constraints. For example, attainable utility preservation developed a formula to try to encode the concept of "power" for an AI, with a penalty term for having too much power.
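Roughly - this is a simplified statement of the attainable utility preservation penalty, up to scaling, and not necessarily the exact formula discussed in that post - the penalty compares how well the agent could pursue a set of auxiliary goals after an action versus after doing nothing:

$$\text{Penalty}(s,a) \;=\; \sum_{i} \big|\, Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \,\big|,$$

where the $R_i$ are auxiliary reward functions and $\varnothing$ is the no-op action.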

With some difficulty, I constructed a situation where that formula failed to constrain the AI, via a subagent.

Essentially, the formal definition and the intuitive concept of power overlap in typical environments. But in extreme situations, they come apart. What is needed is an AI that can extrapolate the concept of power rather than the formal definition.

Doing this for other concepts would allow a lot of alignment methods to succeed, such as avoiding side-effects, low impact, corrigibility, and others.

For those using GPT-3

As detailed here, we typed "ehT niar ni niapS syats ylniam ni eht" into GPT-3. This is "The rain in Spain stays mainly in the", with the words spelt backwards. The correct completion is "nialp", the reverse of "plain".

GPT-3 correctly "noticed" that the words were spelt backwards, but failed to extend its goal and complete the sentence in a human-coherent way.

For those focused on how humans extrapolate their own values

A well-behaved child, brought up in a stable society, will learn, typically in early adolescence, that there is a distinction between "lawful" and "good". The concept of "well-behaved" has splintered into two, and now the child has to sort out how they should behave[5].

Recall also people's first reactions to hearing the trolley problem, especially the "large man" variant. They often want to deny the premises, or find a third option. The challenge is that "behave well and don't murder" is being pulled apart from "do good in the world", when the two are normally bound together.

In the future, we humans will continue to encounter novel situations where our past values are not clear guides to what to do. My favourite example is what to do if someone genetically engineers a humanoid slave race that strongly wants to be enslaved, but doesn't enjoy being enslaved. We can develop moral values to deal with the complexity of situations like this, but it requires some work: we don't know what our values are; we have to extrapolate them.

And, ideally, an AI would extrapolate at least as well as we would.


  1. Note that concept extrapolation has two stages: generating the possible extrapolations, and then choosing among them - diversify and disambiguate, in the terminology of this paper. We'll typically focus on the first part, the "diversify" part, mainly because that has to be done first, but also because there might not be any unambiguous choices at the disambiguate stage - what's the right extrapolation of "liberty", for instance? ↩︎

  2. There are going to be many more reward functions in practice. But the simplest ones will fit into two rough categories, those that are defined over the video feed, and those defined by the humans in the world that were the inputs to the video feed. ↩︎

  3. We could also point at things like brain-dead people and say "these have many human features, but are not full humans". Or point at some apes and ants and say "these are non-human, but the apes are more human-like than the ants". The more the dataset captures our complex intuitions about humanness, the better. ↩︎

  4. Conceptually, this is much easier to do if we think "generate both rewards" -> "choose conservative mix" -> "choose policy that maximises conservative mix", but it might be the case that the policy is constructed directly via some process. Learning policies seems easier than learning rewards, but mixing rewards seems easier than mixing policies, so I'm unsure what will be the best algorithm here. ↩︎

  5. It doesn't help that "well-behaved" was probably called "good" when the child was younger. So the concept has splintered, but the name has not. ↩︎

Comments

I really like this. Ever since I read your first model splintering post, it's been a central part of my thinking too.

I feel cautiously optimistic about the prospects for generating multiple hypotheses and detecting when they come into conflict out-of-distribution (although the details are kinda different for the Bayes-net-ish models that I tend to think about than the deep neural net models that I understand y'all are thinking about).

I remain much more confused about what to do when that detector goes off, in a future AGI.

I imagine a situation where some inscrutably complicated abstract concept in the AGI’s world-model comes apart / splinters away from a different inscrutably complicated abstract concept in the AGI’s world-model. OK, what do we do now?

Ideally the AGI would query the human about what to do. But I don't know how one might write code that does that. It's not like an image classifier where you can just print out some pictures and show them to the person.

(One solution would be: leverage the AGI’s own intelligence for how to query the human, by making the AGI motivated to learn what the human in fact wants in that ambiguous situation. But then it stops being a safety feature for that aspect of the AGI’s motivation itself.)

When I think of useful concepts in AI alignment that I frequently refer to, there are a bunch from the olden days (e.g. “instrumental convergence”, “treacherous turn”, …), and a bunch of idiosyncratic ones that I made up myself for my own purposes, and just a few others, one of which is “concept extrapolation”. For example I talk about it here. (Others in that last category include “goal misgeneralization” [here’s how I use the term] (which is related to concept extrapolation) and “inner and outer alignment” [here’s how I use the term].)

So anyway, in the context of the 2022 Review, I would be sad if there were a compilation of intellectual progress in AI alignment on lesswrong that made no mention of “concept extrapolation” (or its previous term “model splintering”). This post seems the best introduction. I gave it my highest vote.