Ah, looks like I missed this question for quite a while!
I agree that it's not quite one or the other. I think that like wireheading, we can split delusion box into "the easy problem" and "the hard problem". The easy delusion box is solved by making a reward/utility which is model-based, and so, knows that the delusion box isn't real. Then, much like observation-utility functions, the agent won't think entering into the delusion box is a good idea when it's planning -- and also, won't get any reward even if it enters into the delusion box accidentally (so long as it knows this has happened).
But the hard problem of delusion box would be: we can't make a perfect model of the real world in order to have model-based avoidance of the delusion box. So how do we guarantee that an agent avoids "generalized delusion boxes"?
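To make the "easy problem" concrete, here is a minimal toy sketch. Everything in it (the `WorldState` fields, the two actions, the number 100.0) is made up for illustration, and it assumes exactly what the hard problem denies us: a world-model that perfectly tracks whether the agent is in the box.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldState:
    real_value: float  # how good the real world actually is (hypothetical)
    in_box: bool       # whether the agent is inside the delusion box

def step(state, action):
    """Toy transition model: the agent's (assumed perfect) model of the world."""
    if action == "enter_box":
        return WorldState(real_value=state.real_value, in_box=True)
    if action == "work":
        return WorldState(real_value=state.real_value + 1.0, in_box=state.in_box)
    raise ValueError(f"unknown action: {action}")

def observe(state):
    """Inside the box, observations are fabricated to look as good as possible."""
    return 100.0 if state.in_box else state.real_value

def observation_reward(state):
    """Reward computed from what the agent sees."""
    return observe(state)

def model_based_utility(state):
    """Utility computed from the modelled real state, ignoring what is seen."""
    return state.real_value

def plan(state, score):
    """One-step lookahead: pick the action whose predicted outcome scores best."""
    return max(["work", "enter_box"], key=lambda action: score(step(state, action)))
```

Starting from `WorldState(0.0, False)`, the observation-based planner picks `"enter_box"` and the model-based planner picks `"work"`. The model-based agent also gets no credit if it ends up in the box anyway, since `model_based_utility` never looks at the fabricated observation.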
Why not both? ;3
I have nothing against justifications being circular (IE the same ideas recurring on many levels), just as I have nothing in principle against finding a foundationalist explanation. A circular argument is just a particularly simple form of infinite regress.
But my main argument against only the circular reasoning explanation is that attempted versions of it ("coherentist" positions) don't seem very good when you get down to details.
Pure coherentist positions tend to rely on a stipulated notion of coherence (such as probabilistic coherence, or weighted constraint satisfaction, or something along those lines). These notions are themselves fixed. This could be fine if the coherence notions were sufficiently "assumption-lite" so as to not be necessarily Goodhart-prone etc, but so far it doesn't seem that way to me.
I'm predicting that you'll agree with me on that, and grant that the notion of coherence should itself be up for grabs. I don't actually think the coherentist/foundationalist/infinitist trilemma is that good a characterization of our disagreement here. My claim isn't so much the classical claim that there's an infinite regress of justification as a claim that there's an infinite regress of uncertainty -- that we're uncertain at all the levels, and need to somehow manage that. This fits the ship-of-theseus picture just fine.
In other words, one can unroll a ship of theseus into an infinite hierarchy where each level says something about how the next level down gets re-adjusted over time. The reason for doing this is to achieve the foundationalist goal of understanding the system better, without the foundationalist method of fixing foundational assumptions. The main motive here is amplification. Taking just a ship of theseus, it's not obvious how to make it better besides running it forward faster (and even this has its risks, since the ship may become worse). If we unroll the hierarchy of wanting-to-become better, we can EG see what is good and bad about merely running it forward faster, and try to run it forward in good ways rather than bad ways (as well as other, more radical departures from simple fast-forward amplification).
One disagreement I have with your story is the argument "given the finitude of human brain architecture". The justification of a belief/norm/algorithm needn't be something already present in the head. A lot of what we do is given to us by evolution. We can notice those things and question whether they make sense by our current standards. Calling this process finite is kind of like calling a Turing machine finite. There's a finite core to it, but we can be surprised by what this core does given more working tape.
I think one issue which this post sort of dances around, and which maybe a lot of discussion of inner optimizers leaves implicit or unaddressed, is the difference between having a loss function which you can directly evaluate vs one which you must estimate via some sort of sample.
The argument in this post, that inner-optimizer misbehavior is necessarily behavioral and is therefore best addressed by behavioral loss functions, misses the point that these misbehaviors occur on examples we don't check. As such, it comes off as:
Now, I personally think that "distributional shift" is a misleading framing, because in learning in general (EG Solomonoff induction) we don't have an IID distribution (unlike in EG classification tasks), so we don't have a "distribution" to "shift".
But to the extent that we can talk in this framing, I'm kinda like... what are you saying here? Are you really proposing that we should just check instances more thoroughly or something like that?
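A toy sketch of the evaluate-vs-estimate distinction, with a hypothetical `model` and input domain invented purely for illustration: the loss is perfectly behavioral, but an estimate from the inputs we happen to check can read zero while the loss over the whole domain does not.

```python
# Hypothetical input space, small enough here to enumerate exhaustively.
DOMAIN = range(10_000)

def model(x):
    """Hypothetical learned policy: behaves well except on one rare input."""
    return -1 if x == 7919 else 1

def loss(x):
    """Behavioral loss: 0.0 for good behavior on x, 1.0 for bad behavior."""
    return 0.0 if model(x) == 1 else 1.0

def exact_loss():
    """Loss evaluated directly over every input in the domain."""
    return sum(loss(x) for x in DOMAIN) / len(DOMAIN)

def sampled_loss(sample):
    """Loss estimated from only the inputs we actually check."""
    return sum(loss(x) for x in sample) / len(sample)

# The inputs we check: every hundredth point, which never includes 7919.
checked = list(range(0, 10_000, 100))
```

Here `sampled_loss(checked)` is exactly zero while `exact_loss()` is positive. In realistic settings we can't enumerate the domain, so `exact_loss` is the thing we don't have; the sketch only works because the domain is tiny.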
There seems to be a bit of a tension here. What you're outlining for most of the post still requires a formal system with assumptions within which to take the fixed point, but then that would mean that it can't change its mind about any particular thing. Indeed it's not clear how such a totally self-revising system could ever be a fixed point of constraints of rationality: since it can revise anything, it could only be limited by the physically possible.
It's sort of like the difference between a programmable computer vs an arbitrary blob of matter. A programmable computer provides a rigid structure which can't be changed, but the set of assumptions imposed really is quite light. When programming language designers aim for "totally self-revising systems" (languages with more flexibility in their assumptions, such as Lisp), they don't generally attack the assumption that the hardware should be fixed. (Although occasionally they do go as far as asking for FPGAs.)
(A finite approximation of) Solomonoff induction can be said to make "very few assumptions", because it can learn a wide variety of programs. Certainly it makes fewer assumptions than more special-case machine learning systems. But it also makes a lot more assumptions than the raw computer. In particular, it has no allowance for updating against the use of Bayes' Rule for evaluating which program is best.
I'm aiming for something between the Solomonoff induction and the programmable computer. It can still have a rigid learning system underlying it, but in some sense it can learn any particular way of selecting hypotheses, rather than being stuck with one.
Now if we're foundationalists, we say that that's because you didn't actually believe anything, and that that was just a linguistic token passed around your head but failing to be meaningful, because you didn't implement The Laws correctly. But if we want to have a theory like yours, it treats this cognitively, and so such beliefs must be meaningful in some sense. I'm very curious what this would look like.
This seems like a rather excellent question which demonstrates a high degree of understanding of the proposal.
I think the answer from my not-necessarily-foundationalist but not-quite-pluralist perspective (a pluralist being someone who points to the alternative foundations proposed by different people and says "these are all tools in a well-equipped toolbox") is:
The meaning of a confused concept such as "the real word for X" is not ultimately given by any rigid formula, but rather, established by long deliberation on what it can be understood to mean. However, we can understand a lot of meaning through use. Pragmatically, what "the real word for X" seems to express is that there is a correct thing to call something, usually uniquely determined, which can be discovered through investigation (EG by asking parents). This implies that other terms are incorrect (EG other languages, or made-up terms). "Incorrect" here means normatively incorrect, which is still part of our map; but to cash out what that means to a greater degree, it means you can EG scold people who use wrong terms, and you should teach them better terms, etc.
To sum up, meaning in this view is broadly more inferentialist and less correspondence-based: the meaning of a thing is more closely tied with the inferences around that thing, than with how that thing corresponds to a territory.
So if you solved this it would probably just solve anthropics as well.
I'm not seeing that implication at all! The way I see it, the framework "stands over" decision-theoretic issues such as anthropics, providing no answers (only providing an epistemic arena in which uncertainty about them can be validly expressed rather than requiring some solution in order to define correct reasoning in the first place).
I mean, that's fair. But what if your belief system justified almost everything ultimately in terms of "making ancestors happy", and relied on a belief that ancestors are still around to be happy/sad? There are several possible responses which a real human might be tempted to make:
So we can fix the scenario to make a more real ontological crisis.
It also bears mentioning -- the reason to be concerned about ontological crisis is, mostly, a worry that almost none of the things we express our values in terms of are "real" in a reductionistic sense. So an AI could possibly view the world through much different concepts and still be predictively accurate. The question then is, what would it mean for such an AI to pursue our values?
Why is this? As I argued in learning normativity, I think there are some problems which we can more easily point out structurally. For example, Paul's proposal of relaxed adversarial training is one possible method (look for "pseudo-inputs" which lead to bad behavior, such as activations of some internal nodes which seem like plausible activation patterns, even if you don't know how to hit them with data).
The argument in the post seems to be "you can't incentivize virtue without incentivizing it behaviorally", but this seems untrue.
Right. I mean, I would clarify that the whole point isn't to learn to go up the hierarchy; in some sense, most of the point is learning at a few levels. But yeah.
Because the quote says that at some level you could stop caring (which means we can keep going meta until there's not significant improvement, and stop there).
Hmmm, that's not quite what I meant. It's not about stopping at some meta-level, but rather, stopping at some amount of learning in the system. The system should learn not just level-specific information, but also cross-level information (like overall philosophical heuristics), which means that even if you stop teaching the machine at some point, it can still produce new reasoning at higher levels which should be similar to feedback you might have given.
The point is that human philosophical taste isn't perfectly defined, and even if we also teach the machine everything we can about how to interpret human philosophical taste, that'll still be true. However, at some point our uncertainty and the machine's uncertainty will be close enough to the same that we don't care. (Note: what it even means for them to be closely matched depends on the question of what it means for humans to have specific philosophical taste, which, if we could answer, we would have perfectly defined human philosophical taste -- the thing we can't do. Yet, in some good-enough sense, our own uncertainty eventually becomes well-represented by the machine's uncertainty. That's the stopping point at which we no longer need to provide additional explicit feedback to the machine.)
This seems like it's only true if the humans would truly cling to their belief in spite of all evidence (IE if they believed in ghosts dogmatically), which seems untrue for many things (although I grant that some humans may have some beliefs like this). I believe the idea of the ghost example is to point at cases where there's an ontological crisis, not cases where the ontology is so dogmatic that there can be no crisis (though, obviously, both cases are theoretically important).
However, I agree with you in either case -- it's not clear there's "nothing to be done" for the ghost case (in either interpretation).
I definitely endorse this as a good explanation of the same pointers problem I was getting at. I particularly like the new framing in terms of a direct conflict between (a) the fact that what we care about can be seen as latent variables in our model, and (b) we value "actual states", not our estimates -- this seems like a new and better way of pointing out the problem (despite being very close in some sense to things Eliezer talked about in the sequences).
What I'd like to add to this post would be the point that we shouldn't be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don't think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don't need to.)
One reason is that finding a correspondence and applying it isn't what the agent should want. In this simple setup, where we suppose a perfect Bayesian agent, it's reasonable to argue that the AI should just use the agent's beliefs. That's what would maximize expected utility from the perspective of the agent -- as opposed to using the agent's utility function but substituting the AI's beliefs for the agent's. You mention that the agent may not have a perfect world-model, but this isn't a good argument from the agent's perspective -- certainly not an argument for just replacing the agent's model with some AI world-model.
This can be a real alignment problem for the agent (not just a mistake made by an overly dogmatic agent): if the AI believes that the moon is made of blue cheese, but the agent doesn't trust that belief, then the AI can make plans which the agent doesn't trust even if the utility function is perfect.
And if the agent does trust the AI's machine-learning-based model, then an AI which used the agent's prior would also trust the machine-learning model. So, nothing is lost by designing the AI to use the agent's prior in addition to its utility function.
So this is an argument that prior-learning is a part of alignment just as much as value-learning.
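Here is a toy version of the moon example, as a sketch only. The plans, payoffs, and probabilities are all invented for illustration; the one thing held fixed is that the AI and the agent share the same (perfect) utility function and differ only in beliefs.

```python
# Shared utility function over (plan, world) pairs; the numbers are made up.
UTILITY = {
    ("mine_cheese", "cheese"): 10.0,
    ("mine_cheese", "rock"): -5.0,
    ("stay_home", "cheese"): 0.0,
    ("stay_home", "rock"): 0.0,
}
PLANS = ["mine_cheese", "stay_home"]

def expected_utility(plan, beliefs):
    """Expected utility of a plan under a belief distribution over worlds."""
    return sum(prob * UTILITY[(plan, world)] for world, prob in beliefs.items())

def best_plan(beliefs):
    """The plan that maximizes expected utility under the given beliefs."""
    return max(PLANS, key=lambda plan: expected_utility(plan, beliefs))

# Hypothetical beliefs: the agent thinks cheese is very unlikely, the AI
# is nearly certain of it.
agent_prior = {"cheese": 0.01, "rock": 0.99}
ai_beliefs = {"cheese": 0.99, "rock": 0.01}
```

With these numbers, `best_plan(ai_beliefs)` is `"mine_cheese"`, while `best_plan(agent_prior)` is `"stay_home"`, and `expected_utility("mine_cheese", agent_prior)` is negative. So the AI's plan looks bad from the agent's perspective even though nothing is wrong with the utility function.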
We don't usually think this way because when it comes to humans, well, it sounds like a terrible idea. Human beliefs -- as we encounter them in the wild -- are radically broken and irrational, and inadequate to the task. I think that's why I got a lot of push-back on my post about this:
I mean, I REALLY don't want that or anything like that. - jbash
But I think normativity gives us a different way of thinking about this. We don't want the AI to use "the human prior" in the sense of some prior we can extract from human behavior, or extract from the brain, or whatever. Instead, what we want to use is "the human prior" in the normative sense -- the prior humans reflectively endorse.
This gives us a path forward on the "impossible" cases where humans believe in ghosts, etc. It's not as if humans don't have experience dealing with things of value which turn out not to be a part of the real world. We're constantly forming and reforming ontologies. The AI should be trying to learn how we deal with it -- again, not quite in a descriptive sense of how humans actually deal with it, but rather in the normative sense of how we endorse dealing with it, so that it deals with it in ways we trust and prefer.