I saw a presentation covering a bunch of this back in February, and the graphs I found most informative were those showing the training-FLOP distributions before updating against already-achievable levels. There is one graph along these lines on page 13 of part 1 in the Google doc, but it doesn't show the combined distribution without the update against already-achievable FLOP.
Am I correct in remembering that the combined distribution before that update was distinctly bimodal? That was one of my main takeaways from the presentation, and I want to make sure I'm remembering it correctly.
Setting up the "locality of goals" concept: let's split the variables $X$ in the world model into observables $X_O$, action variables $X_A$, and latent variables $X_L$. Note that there may be multiple stages of observations and actions, so only subsets $X_{S_O}$ and $X_{S_A}$ of the observation/action variables appear in the decision problem. The Bayesian utility maximizer then chooses $X_{S_A}$ to maximize

$$\mathbb{E}\left[u(X) \mid X_{S_O}, X_{S_A}\right]$$

... but we can rewrite that as

$$\mathbb{E}\left[\,\mathbb{E}_{X_L}\left[u(X) \mid X_O, X_A\right] \mid X_{S_O}, X_{S_A}\right]$$

Defining a new utility function $u'(X_O, X_A) = \mathbb{E}_{X_L}\left[u(X) \mid X_O, X_A\right]$, the original problem is equivalent to: choose $X_{S_A}$ to maximize

$$\mathbb{E}\left[u'(X_O, X_A) \mid X_{S_O}, X_{S_A}\right]$$
In English: given the original utility function on the ("non-local") latent variables, we can integrate out the latents to get a new utility function defined only on the ("local") observation & decision variables. The new utility function yields completely identical agent behavior to the original.
So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the "local" observation & decision variables.
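The integration-out argument can be checked numerically. Here's a minimal sketch with an invented toy model (one binary latent, observable, and action; none of these numbers come from the original comment): the objective defined directly on the latent and the "localized" objective $u'$ pick the same action for every observation.

```python
# Hypothetical toy model: latent L, observable O, action A, all binary.
p_L = {0: 0.3, 1: 0.7}                 # prior over the latent
p_O_given_L = {0: {0: 0.9, 1: 0.1},    # p(O | L): O is a noisy reading of L
               1: {0: 0.2, 1: 0.8}}

def u(L, O, A):
    # utility depends on the *latent*: reward for matching A to L
    return 1.0 if A == L else 0.0

def posterior_L(O):
    # p(L | O) by Bayes; A is chosen freely, so it carries no extra info about L
    joint = {L: p_L[L] * p_O_given_L[L][O] for L in (0, 1)}
    Z = sum(joint.values())
    return {L: joint[L] / Z for L in (0, 1)}

def u_prime(O, A):
    # integrate out the latent: u'(O, A) = E_L[u(L, O, A) | O, A]
    post = posterior_L(O)
    return sum(post[L] * u(L, O, A) for L in (0, 1))

# The two objectives pick the same action for every observation.
for O in (0, 1):
    best_original = max((0, 1), key=lambda A: sum(
        posterior_L(O)[L] * u(L, O, A) for L in (0, 1)))
    best_local = max((0, 1), key=lambda A: u_prime(O, A))
    assert best_original == best_local
```

From the outside, an observer who sees only the chosen actions has no way to tell which of the two utility functions the agent is running.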
This makes a lot of sense.
I had been weakly leaning towards the idea that a solution to the pointers problem should be a solution to deferral - i.e. it tells us when the agent defers to the AI's world model, and what mapping it uses to translate AI-variables to agent-variables. This makes me lean more in that direction.
What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)
I see a couple different claims mixed together here:
The main thing I disagree with is the idea that there probably isn't a general way to find correspondences between models. There are clearly cases where correspondence fails outright (like the ghosts example), but I think the problem is probably solvable if we allow for error-cases (by which I mean cases where the correspondence throws an error, not cases in which the correspondence returns an incorrect result). Furthermore, assuming that natural abstractions work the way I think they do, I think the problem is solvable in practice with relatively few error-cases, potentially even using "prosaic" AI world-models. It's the sort of thing which would dramatically improve the success chances of alignment by default.
I absolutely do agree that we still need the metaphilosophical stuff for a first-best solution. In particular, there is not an obviously-correct way to handle the correspondence error-cases, and of course anything else in the whole setup can also be close-but-not-exactly-right. I do think that combining a solution to the pointers problem with something like the communication prior strategy, plus some obvious tweaks like partially-ordered preferences and some model of logical uncertainty, would probably be enough to land us in the basin of convergence (assuming the starting model was decent), but even then I'd prefer metaphilosophical tools to be confident that something like that would work.
Well, if there were unique values, we could say "maximize the unique values." Since there aren't, we can't. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we're going to have to do with that wrong-seeming.
Before I get into the meat of the response... I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human's world-model, which still gives rise to all the same problems as a total order in the human's world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)
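One way to make the subagent framing concrete (a toy sketch with invented outcomes and utilities, not from the original): represent the partial order as a committee of ordinary utility-maximizing subagents, where one outcome strictly beats another exactly when every subagent agrees.

```python
# Hypothetical example: a partial preference order as the unanimous
# agreement of several utility-maximizing subagents.
# Each subagent is an ordinary total order, given by a utility function
# (which, in the OP's setting, would itself be a function of latents).
subagent_utilities = [
    {"a": 2, "b": 1, "c": 0},   # subagent 1: a > b > c
    {"a": 2, "b": 0, "c": 1},   # subagent 2: a > c > b
]

def strictly_preferred(x, y):
    # x > y in the partial order iff every subagent strictly prefers x to y
    return all(u[x] > u[y] for u in subagent_utilities)

assert strictly_preferred("a", "b")        # unanimous: a beats b
assert strictly_preferred("a", "c")        # unanimous: a beats c
assert not strictly_preferred("b", "c")    # subagents disagree...
assert not strictly_preferred("c", "b")    # ...so b and c are incomparable
```

Each committee member is a normal Bayesian utility maximizer, so the pointers problem recurs for every member separately.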
Anyway, I don't think that's the main disconnect here...
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
Ok, I think I see what you're saying now. I am of course on board with the notion that e.g. human values do not make sense when we're modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between "the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction", and "there is not a unique, well-defined referent of 'human values'". The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human's model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its "goals" do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different "goals" to the robot - e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. "The utility function I hard-coded into the robot" is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that "the utility function I hard-coded into the robot" would correspond to some unambiguous thing in the real world - although specifying exactly what that thing is, is an instance of the pointers problem.
Does that make sense? Am I still missing something here?
Could you uncompress this comment a bit please?
This comment seems wrong to me in ways that make me think I'm missing your point.
Some examples and what seems wrong about them, with the understanding that I'm probably misunderstanding what you're trying to point to:
> we're non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn't make sense
I have no idea why this would be tied to non-Cartesian-ness.
> But in the real world, humans don't have a unique set of True Values or even a unique model of the world
There are certainly ways in which humans diverge from Bayesian utility maximization, but I don't see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that's quite different from having multiple distinct world-models.
> Thus in the real world we cannot require that the AI has to maximize humans' True Values, we can only ask that it models humans [...] and satisfy the modeled values.
How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say "just pick one set of values/one world model and satisfy that", which seems wrong.
One way to interpret all this is that you're pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don't see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).
At this point, I think that I personally have enough evidence to be reasonably sure that I understand abstraction well enough that it's not a conceptual bottleneck. There are still many angles to pursue - I still don't have efficient abstraction learning algorithms, there's probably good ways to generalize it, and of course there's empirical work. I also do not think that other people have enough evidence that they should believe me at this point, when I claim to understand well enough. (In general, if someone makes a claim and backs it up by citing X, then I should assign the claim lower credence than if I stumbled on X organically, because the claimant may have found X via motivated search. This leads to an asymmetry: sometimes I believe a thing, but I do not think that my claim of the thing should be sufficient to convince others, because others do not have visibility into my search process. Also I just haven't clearly written up every little piece of evidence.)
Anyway, when I consider what barriers are left assuming my current model of abstraction and how it plays with the world are (close enough to) correct, the problems in the OP are the biggest. One of the main qualitative takeaways from the abstraction project is that clean cross-model correspondences probably do exist surprisingly often (a prediction which neural network interpretability work has confirmed to some degree). But that's an answer to a question I don't know how to properly set up yet, and the details of the question itself seem important. What criteria do we want these correspondences to satisfy? What criteria does the abstraction picture predict they satisfy in practice? What criteria do they actually satisfy in practice? I don't know yet.
I like to think that I influenced your choice of subject.
Yup, you did.
> it seems that "head-state" is what would usually be called "state" in TMs.
Correct. Really, the "state" of a TM (as the word is used most often in other math/engineering contexts) is the head-state plus the head's position plus whatever's on the tape.
In a technical sense, the "state" of a system is usually whatever information forms a Markov blanket between future and past - i.e. the interaction between everything in the future and everything in the past should be completely mediated by the system state. There are lots of exceptions to this, and the word isn't used consistently everywhere, but that's probably the most useful heuristic.
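To illustrate with an invented toy machine (a sketch, not any standard TM encoding): the triple (head-state, tape, head position) screens off the past from the future. Given that triple, stepping the machine forward never needs to consult how the machine got there.

```python
# Toy Turing machine: the full "state" in the Markov-blanket sense is
# (head_state, tape, head_position). The transition function consumes
# only that triple, so the future is independent of history given it.

# Transition table: (head_state, symbol) -> (new_head_state, write, move).
# This invented machine flips 1s to 0s until it reads a 0, then halts.
delta = {
    ("flip", 1): ("flip", 0, +1),
    ("flip", 0): ("halt", 0, 0),
}

def step(state):
    head_state, tape, pos = state
    if head_state == "halt":
        return state                      # halted machines stay put
    new_head, write, move = delta[(head_state, tape[pos])]
    tape = tape[:pos] + (write,) + tape[pos + 1:]
    return (new_head, tape, pos + move)

# Any two histories arriving at the same full state evolve identically:
# from ("flip", (1, 1, 0), 0), three steps always land in the same place.
s = ("flip", (1, 1, 0), 0)
for _ in range(3):
    s = step(s)
assert s == ("halt", (0, 0, 0), 2)
```

Dropping any component of the triple (say, the head position) breaks the Markov blanket: the remaining information no longer determines the machine's future on its own.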
I don't think it can be significantly harder for behavior-space than reward-space. If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing". I don't think that's actually the right way to do things, but it should at least give a reduction of the problem.
Anyway, I'd say the most important difference between this and various existing strategies is that we can learn "at the outermost level". We can treat the code as message, so there can potentially be a basin of attraction even for bugs in the code. The entire ontology of the agent-model can potentially be wrong, but still end up in the basin. We can decide to play an entirely different game. Some of that could potentially be incorporated into other approaches (maybe it has and I just didn't know about it), though it's tricky to really make everything subject to override later on.
Of course, the trade-off is that if everything is subject to override then we really need to start in the basin of attraction - there's no hardcoded assumptions to fall back on if things go off the rails. Thus, robustness tradeoff.
In theory I could treat myself as a black box, though even then I'm going to need at least a functional self-model (i.e. a model of what outputs result from what inputs) in order to get predictions out of the model for anything in my future light cone.
But usually I do assume that we want a "complete" world model, in the sense that we're not ignoring any parts by fiat. We can be uncertain about what my internal structure looks like, but that still leaves us open to update if e.g. we see some FMRI data. What I don't want is to see some FMRI data and then go "well, can't do anything with that, because this here black box is off-limits". When that data comes in, I want to be able to update on it somehow.