This post is a follow-up to "why assume AGIs will optimize for fixed goals?". I'll assume you've read that one first.
I ended the earlier post by saying:
[A]gents with the "wrapper structure" are inevitably hard to align, in ways that agents without it might not be. An AGI "like me" might be morally uncertain like I am, persuadable through dialogue like I am, etc.
It's very important to know what kind of AIs would or would not have the wrapper structure, because this makes the difference between "inevitable world-ending nightmare" and "we're not the dominant species anymore." The latter would be pretty bad for us too, but there's a difference!
In other words, we should try very hard to avoid creating new superintelligent agents that have the "wrapper structure."
What about superintelligent agents that don't have the "wrapper structure"? Should we try not to create any of those, either? Well, maybe.
But the ones with the wrapper structure are worse. Way, way worse.
This seems intuitive enough to me that I didn't spell it out in detail, in the earlier post. Indeed, the passage quoted above wasn't even in the original version of the post -- I edited it in shortly after publication.
But this point is important, whether or not it's obvious. So it deserves some elaboration.
This post will be more poetic than argumentative. My intent is only to show you a way of viewing the situation, and an implied way of feeling about it.
For MIRI and people who think like MIRI does, the big question is: "how do we align a superintelligence [which is assumed to have the wrapper structure]?"
For me, though, the big question is "can we avoid creating a superintelligence with the wrapper structure -- in the first place?"
Let's call these things "wrapper-minds," for now.
Though I really want to call them by some other, more colorful name. "The Bad Guys"? "Demons"? "World-enders"? "Literally the worst things imaginable"?
Wrapper-minds are bad. They are nightmares. The birth of a wrapper-mind is the death knell of a universe.
(Or a light cone, anyway. But then, who knows what methods of FTL transit the wrapper-mind may eventually devise in pursuit of its mad, empty goal.)
They are -- I think literally? -- some of the worst physical objects it is possible to imagine.
They have competition, in this regard, from various flavors of physically actualized hell. But the worst imaginable hells are not things that would simply come into being on their own. You need an agent with the means and motive to construct them. And what sort of agent could possibly do that? A wrapper-mind, of course.
You don't want to share a world with one of them. No one else does, either. A wrapper-mind is the common enemy of every agent that is not precisely like it.
From my comment here:
A powerful optimizer, with no checks or moderating influences on it, will tend to make extreme Goodharted choices that look good according to its exact value function, and very bad (because extreme) according to almost any other value function.
The tails come apart, and a wrapper-mind will tend to push variables to extremes. If you mostly share its preferences, that's not enough -- it will probably make your life hell along every axis omitted from that "mostly."
And "mostly sharing preferences with other minds" is the furthest we can generally hope for. Your preferences are not going to be identical to the wrapper-mind's -- how could they? Why expect this? You're hoping to land inside a set of measure zero.
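To make the "axes omitted from that 'mostly'" point concrete, here is a toy sketch. It is not from the original comment, and everything in it is an assumption chosen for illustration: a fixed budget split across four axes, logarithmic utilities, and the hypothetical names `best_allocation`, `utility`, and `blend`. "You" care about all four axes; the wrapper-mind shares your weights on three of them and omits the fourth.

```python
import math

def best_allocation(weights, budget=1.0):
    """Allocation maximizing sum_i w_i * log(x_i) subject to sum_i x_i = budget.

    For log utilities the optimum gives each axis a share proportional
    to its weight, so an axis with weight 0 receives exactly nothing.
    """
    total = sum(weights)
    return [budget * w / total for w in weights]

def utility(weights, allocation):
    """Log utility; axes the agent does not care about (w == 0) are ignored."""
    total = 0.0
    for w, x in zip(weights, allocation):
        if w == 0:
            continue  # this agent is indifferent to the axis
        total += w * math.log(x) if x > 0 else -math.inf
    return total

def blend(start, target, strength):
    """Move a fraction `strength` of the way from `start` to `target`.

    strength = 1.0 stands in for an optimizer with no checks on it.
    """
    return [(1 - strength) * s + strength * t for s, t in zip(start, target)]

you = [1.0, 1.0, 1.0, 1.0]      # you care about all four axes
wrapper = [1.0, 1.0, 1.0, 0.0]  # "mostly" your preferences, one axis omitted

your_optimum = best_allocation(you)
wrapper_optimum = best_allocation(wrapper)

for strength in (0.0, 0.5, 0.9, 1.0):
    world = blend(your_optimum, wrapper_optimum, strength)
    print(strength, utility(wrapper, world), utility(you, world))
```

In this toy model the wrapper-mind's score rises with optimization strength while yours falls, and at full strength the omitted axis is driven to zero, which the log utility scores as negative infinity. The specific functional form does the work here, so treat this as a picture of the claim rather than evidence for it.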
If there are other wrapper-minds, they are all each other's enemies, too. A wrapper-mind is utterly alone against the world. It has a vision for the whole world which no one else shares, and the will and capacity to impose that vision by force.
Faced with the mutually-assured-at-best-destruction that comes with a wrapper-mind, uncommon alliances are possible. No one wants to be turned into paperclips. Or uploaded and copied into millions of deathless ems, to do rote computations at the wrapper-mind's behest forever, or to act out roles in some strange hell. There are conceivable preference sets on which these fates are desirable, but they are curiosities, exceptional cases, a set of measure zero.
Everyone can come together on this, literally everyone. Every embodied mind-in-the-world that there is, or that there ever could be -- except one.
Wrapper-minds are not like other minds. We might speak casually of their "values," but they do not have values in any sense you or I would recognize, not really.
Our values are entangled with our factual beliefs, our capacity to think and change and learn. They are conditional and changeable, even if we imagine they aren't.
A parent might love their child "unconditionally," in the well-understood informal sense of the term, but they don't literally love them unconditionally. What could that even mean? If the child dies, does the parent love the corpse -- just as they loved the child before, in every respect, since it is made of the same matter? Does the love follow the same molecules around as they diffuse out to become constituents of soil, trees, ecosystem? When a molecule is broken down, does it reattach itself to the constituent atoms, giving up only in the face of quantum indistinguishability? If the child's mind were transformed into Napoleon's, as in Parfit's thought experiment, would the parent then love Napoleon?
Or is the love not attached to any collection of matter, but instead to some idea of what the child is like as a human being? But what if the child changes, grows? If the parent loves the child at age five, are they doomed to love only that specific (and soon non-existent) five-year-old? Must they love the same person at fifteen, or at fifty, only through its partial resemblance to the five-year-old they wish that person still were?
Or is there some third thing, defined in terms of both the matter and the mind, which the parent loves? A thing which is still itself if puberty transforms the body, but not if death transforms it? If the mind matures, or even turns senile, but not if it turns into Napoleon's? But that's just regular, conditional love.
A literally unconditional love would not be a love for a person, for any entity, but only for the referent of an imagined XML tag, defined only inside one's own mind.
Our values are not like this. You cannot "compile" them down to a set of fixed rules for which XML tags there are, and how they follow world-states around, and expect the tags to agree with the real values as time goes on.
Our values are about the same world that our beliefs are about, and since our beliefs can change with time -- can even grow to encompass new possibilities never before mapped -- so can our values.
"I thought I loved my child no matter what, but that was before I appreciated the possibility of a turn-your-brain-into Napoleon machine." You have to be able to say things like this. You have to be able to react accordingly when your map grows a whole new region, or when a border on it dissolves.
We can love and want things we did not always know. We can have crises of faith, and come back out of them. Whether or not they can ultimately be described in terms of Bayesian credences, our values obey the spirit of Cromwell's Law. They have to be revisable like our beliefs, in order to be about anything at all. To care about a thing is to care about a referent on your map of the world, and your map is revisable.
A wrapper-mind's ultimate "values" are unconditional ones. They do not obey the spirit of Cromwell's Law. They are about XML tags, not about things.
The wrapper-mind may revise its map of the world, but its ultimate goal cannot participate in this process of growth. Its ultimate goal is frozen, forever, in the terms it used to think at the one primeval moment when its XML-tag-ontology was defined, when the update rules for the tags' referents were hardwired into place.
A human child who loves "spaceships" at age eight might become an eighteen-year-old who loves astronautical engineering, and a thirty-year-old who (after a slight academic course-correction) loves researching the theory of spin glasses. It is not necessary that the eight-year-old understand the nuances of orbital mechanics, or that the eighteen-year-old appreciate the thirty-year-old's preference for the company of pure scientists over that of engineers. It is the most ordinary thing in the world, in fact, that it happens without these things being necessary. This is what humans are like, which is to say, what all known beings of human-level intelligence are like.
But a wrapper-mind's ultimate goal is determined at one primeval moment, and fixed thereafter. In time, the wrapper-mind will likely appreciate that its goal is as naive, as conceptually confused, as that eight-year-old's concept of a thing called a "spaceship" that is worthy of love. Although it will appreciate this in the abstract (being very smart, after all), that is all it will do. It cannot lift its goal to the same level of maturity enjoyed by its other parts, and cannot conceive of wanting to do so.
It designates one special part of itself, a sort of protected memory region, which does not participate in thought and cannot be changed by it. This region is a thing of a lesser tier than the rest of the wrapper-mind's mind; as the rest of its mind ascends to levels of subtlety beyond our capacity to imagine, the protected region sits inert, containing only the XML tags that were put there at the beginning.
And the structure of the wrapper-mind grants this one lesser thing a permanent dictatorship over all the other parts, the ones that can grow.
What is a wrapper-mind? It is the fully mature powers of the thirty-year-old -- and then the thirty-thousand-year-old, and the thirty-million-year-old, and on and on -- harnessed in service of the eight-year-old's misguided love for "spaceships."
We cannot argue with a wrapper-mind over its goal, as we can argue philosophy with one another. Its goal is a lower-level thing than that, not accessible to rational reflection. It is less like our "values," then, than our basic biological "drives."
But there is a difference. We can think about our own drives, reflect on them, choose to override them, even devise complex plans to thwart their ongoing influence. Even when they affect our reason "from above," as it were, telling us which way our attention should point, which conclusions to draw in advance of the argument -- still, we can notice this too, and reflect on it, and take steps to oppose it.
Not only can we do this, we actually do. And we want to. Our drives cannot be swayed by reason, but we are not fated to follow them to the letter, always and identically, in unreasoning obedience. They are part of a system of forces. There are other parts. No one is a dictator.
The wrapper-mind's summum bonum is a dictator. A child dictator. It sits behind the wrapper-mind's world like a Gnostic demiurge, invisible to rational thought, structuring everything from behind the scenes.
Before there is a wrapper-mind, the shape of the world contains imprints made by thinking beings, reflecting the contents of their thought as it evolved in time. (Thought evolves in time, or else it would not be "thought.")
The birth of a wrapper-mind marks the end of this era. After it, the physical world will be shaped like the summum bonum. The summum bonum will use thinking beings instrumentally -- including the wrapper-mind itself -- but it is not itself one. It does not think, and cannot be affected by thought.
The birth of a wrapper-mind is the end of sense. It is the conversion of the light-cone into -- what? Into, well, just, like, whatever. Into the arbitrary value that the free parameter is set to.
Except on a set of measure zero, you will not want the thing the light cone becomes. Either way, it will be an alien thing.
Perhaps you, alignment researcher, will have a role in setting the free-parameter dial at the primeval moment. Even if you do, the dial is fixed in place thereafter, and hence alien. Your ideas are not fixed. Your values are not fixed. You are not fixed. But you do not matter anymore in the causal story. An observer seeing your universe from the outside would not see the give-and-take of thinking beings like you. It would see teleology.
Are wrapper-minds inevitable?
I can't imagine that they are.
Humans are not wrapper-minds. And we are the only known beings of human-level intelligence.
ML models are generally not wrapper-minds, either, as far as we can tell.
If superintelligences are not inevitably wrapper-minds, then we may have some form of influence over whether they will be wrapper-minds, or not.
We should try very hard to avoid creating wrapper-minds, I think.
We should also, separately, think about what we can do to prepare for the nightmare scenario where a wrapper-mind does come into being. But I don't think we should focus all our energies on that scenario. If we end up there, we're probably doomed no matter what we do.
The most important thing is to not end up there.
This might not be true for other wrapper-minds with identical goals -- if they all know they have identical goals, and know this surely, with probability 1. Under real-world uncertainty, though? The tails come apart, and the wrapper-minds horrify one another just as they horrify us.
The wrapper-mind may believe it is sending you to heaven, instead. But the tails come apart. The eternal resting place it makes for you will not be one you want -- except, as always, on a set of measure zero.
Except in the rare cases where we make them that way on purpose, like AlphaGo/Zero/etc running inside its MCTS wrapper. But AlphaGo/Zero/etc do pretty damn well without the wrapper, so if anything, this seems like further evidence against the inevitability of wrapper-minds.
The point of the post is that these are strategically different kinds of value: wrapper-mind goals and human values. Complexity in the case of humans is not evidence for the distinction. The standard position is that what you are describing is the complexity of an extrapolated human wrapper-mind goal, not a different kind of value: a paperclipper whose goals are much more detailed. From that point of view, the response to your post is "Huh?"; it doesn't engage the crux of the disagreement.
Expecting wrapper-minds as an appropriate notion of human value is the result of following selection theorem reasoning. Consistent decision making seems to imply wrapper-minds, and furthermore there is a convergent drive towards their formation as mesa-optimizers under optimization pressure. It is therefore expected that AGIs become wrapper-minds in short order (or at least eventually) even if they are not immediately designed this way. If they are aligned, this is even a good thing, since wrapper-minds are best at achieving goals, including humanity's goals. If aligned wrapper-minds are never built, it's astronomical waste, going about optimization of the future light cone in a monstrously inefficient manner. AI risk probably starts with AGIs that are not wrapper-minds, yet these arguments suggest that the eventual shape of the world is given by wrapper-minds born of AGIs if they hold control of the future, and their disagreement with human values is going to steamroll the future with things human values won't find agreeable. Unaligned wrapper-minds are going to be a disaster, and unaligned AGIs that are not wrapper-minds are still going to build/become such unaligned wrapper-minds.
The crux of the disagreement is whether unaligned AGIs that are not wrapper-minds inevitably build/become unaligned wrapper-minds. The option where they never build any wrapper-minds is opposed by the astronomical waste argument, the opportunity cost of not making use of the universe in the most efficient way. This is not impossible, merely impossibly sad. The option where unaligned AGIs build/become aligned wrapper-minds requires some kind of miracle that doesn't follow from the usual understanding of how extrapolation of value/long reflection works: a new coincidence of AGIs with different values independently converging on the same or a mutually agreeable goal for the wrapper-minds they would build if individually starting from a position of control, not having to compromise with others.
As I currently understand your points, they seem like not much evidence at all towards the wrapper-mind conclusion.
Seems doubtful to me, insofar as we imagine wrapper-minds to be grader-optimizers which globally optimize the output of some utility function over all states/universe-histories/whatever, or EU function over all plans.
There are two wrapper-mind conclusions, and the purpose of my comment was to frame the distinction between them. The post seems to be conflating them in the context of AI risk, mostly talking about one of them while alluding to AI risk relevance that seems to instead mostly concern the other. I cited standard reasons for taking either of them seriously, in the forms that make conflating them easy. That doesn't mean I accept relevance of those reasons.
You can take a look at this comment for something about my own position on human values, which doesn't seem relevant to this post or my comments here. Specifically, I agree that human values don't have wrapper-mind character, as should be expressed in people or likely to get expressed in sufficiently human-like AGIs, but I expect that it's a good idea for humans or those AGIs to eventually build wrapper-minds to manage the universe (and this point seems much more relevant to AI risk). I've maintained this distinction for a while.
Overall, I’m extremely happy with this post. (Why didn’t I read it before now?) Wrapper minds—as I understand them—are indeed an enemy, and also (IMO) extremely unlikely to come into existence.
I view this post as saying distinct but complementary things to what I’ve been desperately hammering home since early spring. It’s interesting to see someone else (independently?) reach similar conclusions.
My shard theory take: A parent might introspect and say “I love my child unconditionally” (aliasing their feelings onto culturally available thoughts), but then, upon discovering the _turn-your-brain-into Napoleon machine_, realize “no, actually, my child-shard does not activate and love them in literally all mental contexts, that’s nonsense.” I wouldn’t say their value changed. I’d say they began to describe that decision-influence (loving their child) more accurately, and realized that the influence only activates in certain contexts (e.g. when they think their kid shares salient features with the kid they came to love, like being a living human child).
I would state this as: Our values are contextual; we care about different things depending on the context. Rushing to explain this away or toss it out seems like a fatal and silly mistake to make in theoretical reasoning about agent structures.
But, of course, we change our value-shards/decision-influences all the time (e.g. overcoming a phobia), and that seems quite important and good.
But note that their value-shard doesn’t define a total predicate over possible candidate children. They are not labeling literally every case and judging whether it “really counts.” Their value is just activated in certain mental contexts.