A sketch of a value-learning sovereign

paulfchristiano, 10y

In this post, it seems like ontology identification is acting as a partial solution to the easy goal inference problem, by explaining how the human's behavior would change if their model of the world changed. I would be more interested in seeing any credible strategy for the easy goal inference problem. I don't really see how we are going to solve that problem in a way that doesn't incidentally address ontology identification, and I don't really see how a solution to ontology identification is going to help with that problem.

This may reflect a more general methodological difference. My impression is that the MIRI view is something like "obviously any solution will have to address X, so X is a reasonable way to approach the big problem," for X = ontology identification, tiling, "realistic world models," etc.

My impression is more like "obviously any solution will have to address the ontology identification problem, so it doesn't matter whether we find some special-case solution to that."

Both situations come up a lot in computer science, and I do both of them a ton in my own research. I don't really have a clean story about when one or the other is a more appropriate response, but in these particular cases I have pretty strong intuitions.

Vanessa Kosoy, 10y

Regarding the methodological difference: my perspective is that head-on attempts to solve AI safety are not very promising, since we lack the tools to answer basic questions in the general area of the problem such as "what is intelligence?", "what is the computational resource cost of constructing an agent of given intelligence?" or "what is the growth curve of self-improving agents?" Therefore, what we should be doing is constructing a general theory capable of answering questions of this type (I would call it abstract intelligence theory). Thinking about problems such as naturalized induction and Vingean reflection seems to me a useful way to approach this, not because they are subproblems of the AI safety problem but because they are handles for getting a mathematical grip on the entire area.

jessicata, 10y

Good point. In this post I was trying to solve something like the easy goal inference problem by factoring it into some different problems, including ontology identification, but it's not clear whether this is the right factoring.

It seems like your intuition is something like: "any way of correctly modelling human mistakes when the human and AI share an ontology will also correctly handle mistakes arising from the human and AI having a different ontology". I think I mostly agree with this intuition. My motivation for working on ontology identification despite this intuition is some combination of (a) "easy" versions of ontology identification seem useful outside the domain of value learning (e.g. making a genie for concrete physical tasks), and (b) I don't see many promising approaches for directly attacking the easy goal inference problem.

But after writing this, I think I have updated towards looking for approaches to the easy goal inference problem that avoid ontology identification. The most promising thing I can think of right now seems to be some variation on planning algorithm 2, but with an adjustment so that the planning can take into account the AI's different predictions (but not the AI's internal representation). It does seem plausible that something in this space would work without directly solving the ontology identification problem.

paulfchristiano, 10y

"I don’t see many promising approaches for directly attacking the easy goal inference problem."

I would agree with that. But I don't see how this situation will change no matter what you learn about ontology identification. It looks to me like the easy goal inference problem is probably just ill-posed/incoherent, and we should avoid any approach that rests on us solving it. The kind of insight that would be required to change this view looks very unlike the kind required to solve ontology identification, and also unlike the kind required to make conventional progress in AI, so to the extent that there are workable approaches to the easy goal inference problem it seems like we could work towards them now. If we can't see how to attack the problem, then by the same token that leads me to be pessimistic.

On that perspective we might ask: how are we avoiding the problem? The dodges I know would also dodge ontology identification, by cashing everything out in terms of human behavior. It's harder for me to know what the situation is like for solutions to the goal inference problem--because I don't yet see any plausible solution strategies--but I would guess that the situation will turn out to be similar.

paulfchristiano, 10y

Both 1. "mimic the human" and 2. "maximize according to the human's ontology" will only work well if the human can actually develop a world model (and in the case of 1 also plans) as well as the AI can. If you can do this, then we are probably set on value learning (at least at the level of detail in this post). Moreover, if we can produce world models as good as the AI then we can probably also produce plans as good as the AI, so probably we can just focus on [1]. I'm obviously much more optimistic about this than about approach [3].

Note: I think that the only reason to be interested in approval-directed agents rather than straightforward imitation learners is that it may be harder to effectively imitate behavior than to solve the same task in a very different way. So it seems wrong to say that imitation is most useful as an input into approval-directed agents.

paulfchristiano, 10y

If you concede that you need some kind of "multi-level" model of the world to capture human beliefs about their environment, and in particular if you think that this is necessary for value learning, it seems like you must agree that the game doesn't stop there. Much human knowledge can't be simply described as facts about the world at any level of coarse-graining, at least not in any stronger sense than facts about my dog are facts about the underlying physical data.

That is, facts about my dog can definitely be cashed out as logical facts about the relationship between the underlying physical data + the laws of physics. But they are definitely not explicitly represented as such or conveniently understood as such.

It may be that coarse graining is literally the only way that complex beliefs of this kind work. I would find that surprising in the extreme.

Is anyone defending a position like this, or is the view more something like "well, we know that this is at least one thing that humans do, so we will either (a) address this and then address the next thing and so on, or (b) learn something important about the representation of beliefs/etc. in the course of understanding multi-level models"? Or something very different?

It seems to me like the game probably doesn't stop anywhere sane, so the only option is really for it to stop immediately (probably before you even assume the human is making non-trivial ontological assumptions).

jessicata, 10y

Yeah, I think just having coarse-grained facts would not be enough. I'm referring to a more general idea when I say "multi-level models": something that can represent concepts at different levels of abstraction, probably not with the high-level facts being a function of the low-level facts. My goal would be to have at least some concrete model for a multi-level model that, for example, preserves the "diamond" concept as it learns new physics. I think (a) and (b) are both reasons why I want to do this; I want to know if it's possible to create an AI with goals related to concrete physical things (which requires something like multi-level models, but maybe not much more?), and I also want to have a better understanding of more abstract concepts to see if it's possible to have an AI do anything useful with them.

Could our disagreement be stated as: I think it is plausible that, with a few years of work, a small number of researchers could make useful models for things like diamond-maximization; whereas you don't think it is plausible?

Vanessa Kosoy, 10y

Regarding acting under uncertainty, there is another natural approach: applying the Nash bargaining solution to an imaginary collection of agents with utility functions sampled from the given ensemble.
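To make the bargaining idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a method from the post: the option names, the payoff table, and the zero disagreement point are all hypothetical, and the search is restricted to a finite set of pure options (the full Nash bargaining solution is defined over a convex feasible set, e.g. lotteries over options). It picks the option maximizing the Nash product of each sampled agent's gain over its disagreement payoff, and contrasts that with simply maximizing the summed utility.

```python
import math

def nash_bargaining_choice(options, utilities, disagreement):
    """Pick the option maximizing the Nash product prod_i (u_i(x) - d_i).

    Simplifying assumption: only a finite list of pure options is searched.
    Options that leave any agent at or below its disagreement payoff are skipped.
    """
    best_option, best_score = None, -math.inf
    for option in options:
        gains = [u(option) - d for u, d in zip(utilities, disagreement)]
        if any(g <= 0 for g in gains):
            continue  # not individually rational for some agent
        # Maximizing the sum of logs is equivalent to maximizing the product.
        score = sum(math.log(g) for g in gains)
        if score > best_score:
            best_option, best_score = option, score
    return best_option

# Hypothetical payoffs: one column per utility function sampled from the ensemble.
payoffs = {
    "A": (10.0, 0.5, 0.5),
    "B": (3.0, 3.0, 3.0),
    "C": (0.5, 9.0, 0.5),
}
utilities = [lambda option, i=i: payoffs[option][i] for i in range(3)]
disagreement = [0.0, 0.0, 0.0]  # assumed disagreement point

choice = nash_bargaining_choice(list(payoffs), utilities, disagreement)
# For comparison: maximize the plain sum of utilities (uniform weights).
utilitarian_choice = max(payoffs, key=lambda option: sum(payoffs[option]))
```

In this toy table the Nash product favors the compromise option "B", while the summed utility favors "A", which is very good for one sampled utility function and poor for the others. Note that the result depends on the choice of disagreement point, which the comment above does not specify.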

paulfchristiano, 10y

It's not really clear to me what one would want out of "more explicit models of this instrumental preference for autonomy." It's a complex and messy preference that is tied up with other similarly complex and messy preferences. It probably doesn't have a simple or natural definition in any reasonable ontology.

What concrete questions about this preference would you hope to answer?

To the extent this preference causes a system to have good behavior, it will be because it affects humans' behavior, e.g. a human would predictably and systematically decline actions that significantly reduce their own autonomy. So we need to set up our system so that these effects on human behavior lead to it also avoiding actions that significantly reduce the user's autonomy.

You seem to have in mind a particular version, where the agent infers some latent structure which can then be used to correctly evaluate situations unlike those that appear in the training data (and in particular, to compare the plans put forth by an AI rather than those put forth by a human). So maybe you want to know something about what kind of concepts can robustly transfer from one domain to another quite different domain. It feels to me like you are only going to find bad news here, unless you first make some significant conceptual contributions in AI. So it seems like the first step would be to look for some good news anywhere and see what kind of good news you have to work with. (Or to wait until the AI community produces some good news on its own, and work on other problems in the meantime.)

A somewhat different angle:

The "instrumental goal pursuer" is no more or less dynamically inconsistent than the human. The human wouldn't lock the current goal in place, and so obviously any preferences that successfully explain the human's short-term behavior won't lock the current goal in place. This is a simple observation that already appears in the training set.

This doesn't require learning a complex concept of autonomy. It just requires learning a model of human preferences that roughly reproduces human behavior. If you don't get this kind of thing right, then it seems pretty clear that you aren't going to get useful behavior out of the system in general. Now you may take this as a general argument against value learning, or that value learning will be difficult, but it doesn't seem like we should consider these kinds of preferences as any different from normal preferences.

jessicata, 10y

A model I might want to make would be something like a hierarchical planning algorithm. It would have some supergoal, and then find subgoals of the supergoal. If the system just naively optimized for the subgoals, then it might do silly things like lock its subgoals in place. Instead, this algorithm should prefer plans that maximize the agent's autonomy (in case the agent changes subgoals). If this model works, then maybe we can use it to derive a partial solution to the hard problem of corrigibility. So the real question I want to answer is something like "what kind of AI would an agent with a preference for autonomy choose to build"; I suspect that this AI design will be corrigible in some way. I think this is even useful if the agent is much simpler than a human.
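As a toy illustration of the planner described above (a sketch under loud assumptions, not a worked-out model): the `Plan` type, the plan names, and the scoring rule below are all hypothetical. Autonomy is crudely proxied by the number of alternative subgoals that remain reachable after executing a plan, and `autonomy_weight` trades that off against progress on the current subgoal.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Plan:
    name: str
    subgoal_value: float                 # progress on the current subgoal
    reachable_subgoals: FrozenSet[str]   # subgoals still pursuable afterwards

def score(plan: Plan, autonomy_weight: float) -> float:
    # Assumed scoring rule: subgoal progress plus a bonus per subgoal kept open.
    return plan.subgoal_value + autonomy_weight * len(plan.reachable_subgoals)

def choose_plan(plans: List[Plan], autonomy_weight: float) -> Plan:
    return max(plans, key=lambda plan: score(plan, autonomy_weight))

plans = [
    # Slightly better for the current subgoal, but forecloses every alternative.
    Plan("lock_in_subgoal", 10.0, frozenset({"get_key"})),
    # Slightly worse now, but keeps other subgoals available if the agent
    # later changes which subgoal serves the supergoal.
    Plan("stay_flexible", 9.0, frozenset({"get_key", "pick_lock", "find_window"})),
]

naive = choose_plan(plans, autonomy_weight=0.0)
autonomy_aware = choose_plan(plans, autonomy_weight=1.0)
```

With the weight at zero the planner locks its subgoal in place; with a positive weight it gives up a little subgoal value to keep its options open. Counting reachable subgoals is only one possible proxy for autonomy, and the choice of weight is arbitrary here.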

"You seem to have in mind a particular version, where the agent infers some latent structure which can then be used to correctly evaluate situations unlike those that appear in the training data (and in particular, to compare the plans put forth by an AI rather than those put forth by a human). So maybe you want to know something about what kind of concepts can robustly transfer from one domain to another quite different domain."

Yeah, this seems accurate. I think this goes back to you being slightly more pessimistic than me about making progress on ontology identification (though I'm still somewhat pessimistic).

"This doesn’t require learning a complex concept of autonomy. It just requires learning a model of human preferences that roughly reproduces human behavior."

Right, a good supervised learner should learn this. This is more of a problem if we're using the model's internal representation, not just its predictions.

paulfchristiano, 10y

"This is more of a problem if we're using the model's internal representation, not just its predictions."

But you aren't directly using the model's internal representation, are you? You are using it only to make predictions about the human's preferences in some novel domain (e.g. over the consequences of novel kinds of plans).

It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing.

Are you hoping to do transfer learning for human preferences in a way that depends on having a detailed understanding of those preferences (e.g. that depends in particular on a detailed understanding of the human preference for autonomy)? I would be very surprised by that. It seems like if you succeed you must be able to robustly transfer lots of human judgments to unfamiliar situations. And for that kind of solution, it's not clear how an understanding of particular aspects of human preferences really helps.

jessicata, 10y

"It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing."

I agree with this. Problems with learning this preference should cause the system to make bad predictions (I think I was confused when I wrote that this problem only shows up with internal representations). Now that I think about it, it seems like you're right that a system that correctly learns abstract human preferences would also learn the preference for autonomy. So this is really a special case of zero-shot transfer learning of abstract preferences. My main motivation for specifically studying the preference for autonomy is that maybe you can turn a simple version of it into a model for corrigibility.

"Are you hoping to do transfer learning for human preferences in a way that depends on having a detailed understanding of those preferences (e.g. that depends in particular on a detailed understanding of the human preference for autonomy)?"

I think I mostly want some story for why the preference for autonomy is even in the model's hypothesis space. It seems that if we're already confident that the system can learn abstract preferences, then we could also be confident that the system can learn the preference for autonomy; but maybe it's more of a problem if we aren't confident of this (e.g. the system is only supposed to learn and optimize for fairly concrete preferences).

paulfchristiano, 10y

I don't think the difference is pessimism about ontology identification per se. Your overall approach (if successful) seems like it would do zero-shot transfer learning. My perspective would be something like: OK, let's try to understand when we can do zero-shot transfer learning at all, and what assumptions we need to rely on (incidentally, I am also pessimistic about this). You are instead focusing on a different simplification of the problem, one which (I feel) is less likely to be connected to the most important underlying difficulties, and less likely to quickly provide information about whether the overall approach can work.

paulfchristiano, 10y

Note that "non-deterministic" usually means something different than "stochastic," especially in the context of transition functions.
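The distinction can be shown with a small Python sketch. The state and action names are hypothetical; the point is only the difference in return types. A non-deterministic transition function returns the set of possible successor states with no probabilities attached, while a stochastic one returns a probability distribution over successors.

```python
from typing import Dict, Set

State = str
Action = str

def nondeterministic_step(state: State, action: Action) -> Set[State]:
    """Non-deterministic: which successors are possible, with no probabilities."""
    table = {("s0", "go"): {"s1", "s2"}}
    return table.get((state, action), {state})

def stochastic_step(state: State, action: Action) -> Dict[State, float]:
    """Stochastic: a probability distribution over successor states."""
    table = {("s0", "go"): {"s1": 0.9, "s2": 0.1}}
    return table.get((state, action), {state: 1.0})

def support(distribution: Dict[State, float]) -> Set[State]:
    """Forget the probabilities, keeping only the states that can occur."""
    return {s for s, p in distribution.items() if p > 0}

possible = nondeterministic_step("s0", "go")
distribution = stochastic_step("s0", "go")
```

Taking the support of a stochastic transition yields a non-deterministic one, but not the other way around: a set of possible successors does not determine their probabilities.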

jessicata, 10y

Good point, fixed.

