Models Modeling Models

Steven Byrnes (4y)

I think you're taking the perspective that the task at hand is to form an (external-to-the-human) model of what a human wants and is trying to do, and there are different possible models which tend to agree in-distribution but not OOD.

My suggestion is that for practically everything you say in this post, you can say a closely-analogous thing where you throw out the word "model" and just say "the human has lots of preferences, and those preferences don't always agree with each other, especially OOD". For example, in the trolley problem, I possess an "I don't like killing people" preference, and I also possess an "I like saving lives" preference, and here they've come into conflict. This is basically a "subagent" perspective.

Do you agree?

I bring this up because "I have lots of preferences and sometimes they come into conflict" is a thing that I think about every day, whereas "my preferences can be fit by different models and sometimes those models come into conflict" is slightly weird to me. (Not that there's anything wrong with that.)

Charlie Steiner (4y)

My suggestion is that for practically everything you say in this post, you can say a closely-analogous thing where you throw out the word "model" and just say "the human has lots of preferences, and those preferences don't always agree with each other, especially OOD".

Yes, I'm fine with this rephrasing. But I wouldn't write a post using only the "the human has the preferences" way of speaking, because lots of different ways of thinking about the world use that same language.

This is basically a "subagent" perspective.

I think this post is pretty different from how people typically describe humans in terms of subagents, but it does contain that description.

Any physical system can have multiple descriptions of it; it doesn't have to act like it's made of subagents. (By "act like it's made of subagents," I include some things people do, like psych themselves up, or reward themselves for doing chores, or try to hide objects of temptation from themselves.) You can have several different models of a thermostat, for instance. Reconciling the different models of a thermostat might look a bit like bargaining between subagents, but if so these are atrophied male anglerfish subagents; they don't model each other and bargain on their own behalf, they are just dumb inputs in a bigger, smarter process.
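
As a toy illustration of "several different models of a thermostat" (not from the post: both models, the 20 C target, and the 0 to 40 C rated range are made-up assumptions), here is a rough sketch of two descriptions that agree on ordinary temperatures and come apart only far outside them. Neither model negotiates with the other; the disagreement is just something a bigger process would have to reconcile.

```python
def goal_model(temp_c: float) -> bool:
    """Thermostat described as 'wanting' the room at 20 C (hypothetical target).

    Predicts the heater is on whenever the room is below the target.
    """
    return temp_c < 20.0


def mechanism_model(temp_c: float) -> bool:
    """Thermostat described as a switch with a rated range (hypothetical: 0-40 C).

    Inside the range it behaves like the goal model; outside it, the switch
    is assumed to be stuck open, so the heater is predicted to be off.
    """
    if temp_c < 0.0 or temp_c > 40.0:
        return False
    return temp_c < 20.0


def disagreements(temps):
    """Return the temperatures at which the two models predict different behavior."""
    return [t for t in temps if goal_model(t) != mechanism_model(t)]


# Ordinary operating conditions vs. conditions the models were never fit to.
in_distribution = [10.0, 15.0, 19.0, 21.0, 25.0, 30.0]
out_of_distribution = [-30.0, -5.0, 45.0, 60.0]

print(disagreements(in_distribution))      # []
print(disagreements(out_of_distribution))  # [-30.0, -5.0]
```

Both functions are "dumb inputs" in the sense above: deciding what to do at -30 C is the job of whatever aggregates them, not of either model.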

If we make a bunch of partial models of a human, some of these models are going to look like subagents, or drive subagenty behavior. But a lot of other ones are going to look like simple patterns, or bigger models that contain the subagent bargaining within themselves and hold aggregated preferences, or psychological models that are pretty complicated and interesting but don't have anything to do with subagenty behavior.

And maybe a value learning AI would capture human subagenty behavior, not only in the models that contain subagent interactions as parts of themselves, but in the learned meta-preferences that determine how different models that we'd think of as human subagents get aggregated into one big story about what's good. Such an AI might help humans psych themselves up, or reward them for doing chores.

But I'd bet that most of the preference aggregation work would look about as subagenty as aggregating the different models of a thermostat. In the trolley problem my "save people" and "don't kill people" preferences don't seem subagenty at all - I'm not about to work out some internal bargain where I push the lever in one direction for a while in exchange for pushing it the other way the rest of the time, for instance.

In short, even though I agree that in a vacuum you could call each model a "subagent," what people normally think of when they hear that word is about a couple dozen entities, mostly distinct. And what's going on in the picture I'm promoting here is more like 10^4 entities, mostly overlapping.

Steven Byrnes (4y)

Hmm. I think you missed my point…

There are two different activities:

ACTIVITY A: Think about how an AI will form a model of what a human wants and is trying to do.

ACTIVITY B: Think about the gears underlying human intelligence and motivation.

You're doing Activity A every day. I'm doing Activity B every day.

My comment was trying to say: "The people like you, doing Activity A, may talk about there being multiple models which tend to agree in-distribution but not OOD. Meanwhile, the people like me, doing Activity B, may talk about subagents. There's a conceptual parallel between these two different discussions."

And I think you thought I was saying: "We both agree that the real ultimate goal right now is Activity A. I'm leaving a comment that I think will help you engage in Activity A, because Activity A is the thing to do. And my comment is: (something about humans having subagents)."

Does that help?

Charlie Steiner (4y)

This was a whole 2 weeks ago, so all I can say for sure is that I was at least unclear on your point.

But I feel like I kind of gave a reply anyway - I don't think the parallel with subagents is very deep. But there's a very strong parallel (or maybe not even a parallel, maybe this is just the thing I'm talking about) with generative modeling.

At least, up to some finite amount of shuffling that's like a choice of prior, or universal Turing machine, or definition of "agent-shaped."

You may recognize a resemblance to inferring human values.

That would lead to unpalatable positions like "whatever the human did, that's what they wanted" or "the human wants to follow the laws of physics."

Comparing preferences across models is currently an open problem. If you take this post's picture of inferring human preferences literally (rather than e.g. imagining we'll be able to train a big neural network that does all this internally), we had better figure out how to translate between ontologies better.

And as with Extremal, we would rather not go to the part of phase space where the models of us all disagree with each other.

My addition of the variance-seeking pressure under the umbrella of Regressional Goodhart really highlights the similarities between it and Extremal Goodhart. Both are simplifications of the same overarching math, it's just that in the Regressional case we're doing even more simplification (requiring there to be a noise term with nice properties), allowing for a more specific picture of the optimization process.

AI ALIGNMENT FORUM
AF
