(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
In saying the above, I do not mean the following:
(1) Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective. Managing to put your own objective into this "goal slot" (as opposed to having the goal slot set by random happenstance) is a central difficult challenge. [Reminder: I am not asserting this]
Instead, I mean something more like the following:
(2) By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out "goal" that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
Making the AI have even something vaguely approaching a 'goal slot', one that remains stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we're exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
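To make the distinction between (1) and (2) concrete, here is a toy sketch (my own illustration, not anything from the post; all class and heuristic names are invented). It contrasts an agent with a cleanly-factored "goal slot" against one whose behavior emerges from a pile of context-dependent kludges:

```python
# Toy illustration of "goal slot" vs. "pile of kludges".
# This is a deliberately simplistic sketch, not a claim about
# how real minds or ML systems are actually implemented.

class GoalSlotAgent:
    """Cleanly factored: one explicit objective that all action
    selection routes through. Retargeting it is a one-line change."""
    def __init__(self, utility):
        self.utility = utility  # the "goal slot"

    def act(self, options):
        return max(options, key=self.utility)


class KludgePileAgent:
    """No goal slot: behavior is the sum of many heuristics, each
    firing only in the contexts it happens to cover. There is no
    single place to point at or swap out a 'goal'."""
    def __init__(self, heuristics):
        self.heuristics = heuristics

    def act(self, options):
        def score(option):
            return sum(h(option) for h in self.heuristics)
        return max(options, key=score)


options = ["make_diamond", "make_paperclip"]

# Retargeting the clean agent means replacing one slot:
clean = GoalSlotAgent(utility=lambda o: 1.0 if o == "make_diamond" else 0.0)

# The messy agent's tendencies are spread across kludges; changing its
# overall direction would mean hunting down and adjusting many of them:
messy = KludgePileAgent(heuristics=[
    lambda o: 0.6 if "diamond" in o else 0.0,    # a pro-diamond kludge
    lambda o: 0.5 if "paperclip" in o else 0.0,  # an unrelated kludge
])
```

The point of the sketch: for `GoalSlotAgent`, "aiming" the agent is trivial once the slot exists; the claim in (2) is that getting anything like that slot to exist and stay stable is itself the hard part.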
Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).
(But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a "goal slot", it's not letting you touch it.)
Furthermore, insofar as the AI is capable of finding actions that force the future into some narrow band, I expect that it will tend to be reasonable to talk about the AI as if it is (more-or-less, most of the time) "pursuing some objective", even in the stage where it's in fact a giant kludgey mess that's sorting itself out over time in ways that are unpredictable to you.
I can see how my attempts to express these other beliefs could confuse people into thinking that I meant something more like (1) above (“Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective…”), when in fact I mean something more like (2) (“By default, the first minds humanity makes will be a terrible spaghetti-code mess…”).
In case it helps those who were previously confused: the "diamond maximizer" problem is one example of an attempt to direct researchers’ attention to the challenge of cleanly factoring cognition around something a bit like a 'goal slot'.
As evidence of a misunderstanding here: people sometimes hear me describe the diamond maximizer problem, and respond to me by proposing training regimes that (for all they know) might make the AI care a little about diamonds in some contexts.
This misunderstanding of what the diamond maximizer problem was originally meant to be pointing at seems plausibly related to the misunderstanding that this post intends to clear up. Perhaps in light of the above it's easier to understand why I see such attempts as shedding little light on the question of how to get cognition that cleanly pursues a particular objective, as opposed to a pile of kludges that careens around at the whims of reflection and happenstance.
I wish that everyone (including OP) would be clearer about whether or not we’re doing worst-case thinking, and why.
In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”. I don’t have a strong reason to expect that to happen, and I also don’t have a strong reason to expect that to not happen. I mostly feel uncertain and confused.
So if the debate is “Are Eliezer & Nate right about ≳99% (or whatever) chance of doom?”, then I find myself on the optimistic side (at least, leaving aside the non-technical parts of the problem), whereas if the debate is “Do we have a strong reason to believe that thus-and-such plan will actually solve technical alignment?”, then I find myself on the pessimistic side.
Separately, I don’t think it’s true that reflectively-stable hard superintelligence needs to have a particular behavioral goal, for reasons here.
I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).
AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of pointing is hard" or "yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it's sorted out".
For instance, the latter response obtains if the "pointing" is done by naive training.
(Though I also have some sense that I see the situation as more fragile than you do: there are lots of ways for reflection to ruin your day, if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)
Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.
Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now. If we’re talking about “naïve training”, then I’m probably very pessimistic, depending on the details.
That’s helpful, thanks!
UPDATE: The “least-doomed plan” I mentioned is now described in a simpler, self-contained post, for readers’ convenience. :)
Given a sufficiently kludgy pile of heuristics, it won't make another AI unless it has a heuristic towards making AI. (In which case the kind of AI it makes depends on its AI-making heuristics.) GPT-5 won't code an AI to minimize predictive error on text. It will code some random AI that looks like something in the training dataset, and will care more about what the variable names are than about what the AI actually does.
Big piles of kludges usually arise from training a kludge-finding algorithm (like deep learning). So the only ways agents could get AI-building kludges are from making dumb AIs or from reading human writings.
Alternately, maybe the AI has sophisticated self-reflection: it is looking at its own kludges and trying to figure out what it values. In which case, does the AI's metaethics contain a simplicity prior? With a strong simplicity prior, an agent with a bunch of kludges that mostly maximized diamond could turn into an actual crystalline diamond maximizer. If it doesn't have that simplicity prior, I would guess it ends up optimizing some complicated utility function (but probably producing a lot of diamond as it does so; diamond isn't the only component of its utility, but it is a big one).
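One way to make the simplicity-prior idea above concrete is a toy model (mine, not the commenter's; every name and weight here is an invented assumption): treat reflection as the agent choosing, among candidate value functions, the one that best explains its own past behavior minus a complexity penalty. A strong penalty collapses mostly-diamond behavior into a crisp diamond maximizer; no penalty favors a complicated function that rationalizes every past act.

```python
# Toy model of "self-reflection with a simplicity prior": pick the
# candidate value function maximizing fit-to-past-behavior minus a
# complexity penalty. Purely illustrative; not a real reflection design.

def reflect(past_choices, candidates, simplicity_weight):
    """Each candidate is (name, value_fn, complexity). Return the one
    maximizing total value assigned to past choices, penalized by
    simplicity_weight * complexity."""
    def score(candidate):
        name, value_fn, complexity = candidate
        fit = sum(value_fn(choice) for choice in past_choices)
        return fit - simplicity_weight * complexity
    return max(candidates, key=score)


# Behavior that *mostly* pursued diamond, with one stray kludge-driven act:
history = ["diamond", "diamond", "diamond", "shiny_rock"]

candidates = [
    # Simple goal: explains most (not all) of the history cheaply.
    ("diamond_maximizer",
     lambda c: 1.0 if c == "diamond" else 0.0,
     1),
    # Complicated function: endorses every past act, at a high
    # description-length cost.
    ("exact_history_memorizer",
     lambda c: 1.0,
     10),
]
```

With `simplicity_weight=0.2` the diamond maximizer wins (score 2.8 vs. 2.0); with no prior (`simplicity_weight=0.0`) the complicated rationalizing function wins (3.0 vs. 4.0), matching the commenter's guess that without a simplicity prior the agent ends up with some complicated utility function.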
For my part, I expect a pile of kludges (learned via online model-based RL) to eventually guide the AI into doing self-reflection. (Self-reflection is, after all, instrumentally convergent.) If I’m right, then it would be pretty hard to reason about what will happen during self-reflection in any detail. Likewise, it would be pretty hard to intervene in how the self-reflection will work.
E.g. we can’t just “put in” or “not put in” a simplicity prior. The closest thing that we could do is try to guess whether or not a “simplicity kludge” would have emerged, and to what extent that kludge would be active in the particular context of self-reflection, etc.—which seems awfully fraught.
To be clear, while I think it would be pretty hard to intervene on the self-reflection process, I don’t think it’s impossible. I don’t have any great ideas right now but it’s one of the things I’m working on.