I have two strong intuitions about human values that'd seemed utterly irreconcilable to me, up until recently.
I believe I see a way to unify the two. The crucial insights have been supplied by the Shard Theory, and the final speculative picture is broadly supported by it. This result mostly dissolved all of my high-level confusions about human values and goal-directed behavior, in addition to satisfying a lot of other desiderata.
Disclaimer: This summary does not represent the views of Team Shard, but only my subjective understanding. For the official summary, see this.
According to the Shard Theory, in the course of brain development, humans jointly learn two things:
The latter are "shards". In their most primitive form, they're just activation patterns. You see a lollipop enter your field of vision, you grab it. You see a flashlight pointed at your face, you close your eyes.
As the world-model grows more advanced, the shards can grow more sophisticated as well. Instead of only attaching to observations, they can attach to things in the world-model. If you're modelling the world as containing a lollipop the next room over, your lollipop-shard will bid for a plan to go grab it. If your far-future model says that becoming a salaried professional will give you enough income to buy a lot of lollipops, your lollipop-shard will bid for it.
A lot of other values and habits are implemented the same way. The desire to do nice things for people you like, the avoidance of life-threatening situations, the considerations that go into the choice of career — all of those are just shard-implemented reaction patterns, which react to things in your world-model and bid for particular responses to them. If you expect someone you like to be unhappy, a shard activates, bidding for an action-sequence that changes that prediction. If you expect to be in a life-threatening situation, a whole bunch of shards rebel against that vision. If you're considering career choices, you're choosing between different models of the future, and whichever wins the "popularity contest" among the shards is what ends up implemented.
Shards can conflict. Some values are mutually contradictory; the preference for lollipops might conflict with preferences for health and being attractive and avoiding dentists, so plans the lollipop-shard bids for may be overruled by other shards. If the lollipop-shard is suppressed too many times, it'll atrophy and die out.
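To make the bidding picture concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than part of Shard Theory: the `Shard` class, the `strength` weights, the idea of representing a plan as a dictionary of predicted features, and the specific numbers. It only shows how simple if-then reactions over predicted outcomes can overrule one another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "plan" is summarized by the world-model's predicted outcome,
# written as feature name -> predicted amount.
Plan = Dict[str, float]


@dataclass
class Shard:
    name: str
    strength: float                 # hypothetical: how much sway this shard has
    react: Callable[[Plan], float]  # >0 bids for the plan, <0 bids against it


def total_bid(shards: List[Shard], plan: Plan) -> float:
    """Sum of every shard's strength-weighted reaction to one plan."""
    return sum(s.strength * s.react(plan) for s in shards)


def choose_plan(shards: List[Shard], plans: Dict[str, Plan]) -> str:
    """Return the name of the plan that wins the 'popularity contest'."""
    return max(plans, key=lambda name: total_bid(shards, plans[name]))


# Two toy shards: one likes lollipops, one dislikes cavities.
lollipop = Shard("lollipop", 1.0, lambda p: p.get("lollipops", 0.0))
health = Shard("health", 2.0, lambda p: -p.get("cavities", 0.0))

plans = {
    "eat_lollipops": {"lollipops": 3.0, "cavities": 2.0},
    "skip_lollipops": {"lollipops": 0.0, "cavities": 0.0},
}

print(choose_plan([lollipop], plans))          # lollipop-shard alone
print(choose_plan([lollipop, health], plans))  # health-shard overrules it
```

With only the lollipop-shard present, the lollipop plan wins; once the stronger health-shard is added, its negative bid outweighs the lollipop-shard's positive one. The sketch leaves out atrophy, which would amount to lowering a shard's `strength` each time it is overruled.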
Shards have a self-preservation instinct. For some it is indirect: they see that certain changes to personality will decrease the amount of things they value in the future, and will bid against such value-drift plans (you don't want to self-modify to hate your loved ones, because that will make you do things that will make them unhappy). For others it is direct: these shards can identify themselves in the world-model, and directly bid against plans that eliminate them. (You might inherently like some aspects of your personality, and protest against changes to them, not because of outside-world outcomes, but because that's who you like to be. Conversely, imagine a non-reflectively-stable shard, like a crippling fear of spiders or drug addiction. You don't value valuing this, so you can implement plans that eliminate the corresponding shards via e.g. therapeutic interventions.)
All together, a mind like this would resemble humans pretty well. In particular, it crisply defines what "human flourishing" is. It's the state of the world which minimizes constituent shards' resistance to it; a picture of the world that the maximum number of shards approve. And in addition to satisfying our values on the object-level, it'll also need to satisfy shards' preferences for self-perpetuation.
Hence our preference for diverse, dynamic futures in which we remain ourselves.
Sidebar: Note an important thing here: most of the complexity in a mind like this comes from the world-model. Shards can be very simple if-then functions, but the mere fact that they're implemented over a very sophisticated cross-temporal world model can give rise to some very complex behaviors. This, in part, is why I think the Shard Theory is compelling — it fits very well with various stories of incremental development of goals.
But. That's clearly not a complete story of how humans work, is it?
The shard economy as presented in Part 1 is too rigid. According to it, a human's policy is a relatively shallow function of that human's constituent shards, and significant changes to it imply correspondingly significant changes in the shard economy. And such changes would be rare: ancient, deeply-established shards would have a lot of sway, and their turnover would be low.
But that's not what we often observe. Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.
Suppose a human has a bunch of deeply ingrained values, like a) "donate to the local community" or b) "eat pork" or c) "don't kill".
None of these are knock-out rebuttals. Indeed, even in the last two examples, the new action patterns are not implacable. A sufficiently strong trigger/shard — like a deep trauma, or a very strong value like the love for a child — can break past the life-preservation act in (4) or the ideological takeover in (5).
But this doesn't fully gel with the basic shard-centred picture either. It implies circumstances in which a human's behavior is mainly explained and controlled by some isolated deliberative process, not their entire set of ingrained values. Some part of the human logically reasons out a new policy and then implements it; not as the result of stochastic shard negotiation, but in circumvention of it.
Another issue is the sheer generalizability of human behavior this implies. I can imagine responding to any event my world-model can model in any way I can model. I don't need a special shard for every case — my collection of shards is already somehow fully generalizable. And if I were trapped in a dystopia, I'd be able to spoof the existence of whatever shards my captors want me to have, regardless of my actual shard makeup.
So what's up with that?
We clearly need to introduce some mechanism of planning/search. The exact implementation is a source of some disagreement:
Regardless of the specifics, however, what we get is an advanced, consequentialist plan-making mechanism whose goal is to come up with plans that satisfy the weighted sum of the preferences of the shard economy.
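As a rough formalization of "plans that satisfy the weighted sum of the preferences of the shard economy", one could write the planner's objective as below. The notation is purely illustrative and not taken from Shard Theory: $\Pi$ is the set of plans the planner can consider, $\hat{s}(\pi)$ is the world-model's prediction of what plan $\pi$ leads to, $b_i$ is the bid of shard $i$ on that prediction, and $w_i$ is that shard's weight.

$$\pi^{*} = \arg\max_{\pi \in \Pi} \sum_{i=1}^{n} w_i \, b_i\big(\hat{s}(\pi)\big)$$

This assumes every $b_i$ is a well-defined function over all predicted outcomes, which is exactly the assumption that fails for shards that are only if-then patterns or are defined over a narrow part of the world-model.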
This, I argue, is what we are: that planner mechanism, a fairly explicit mesa-optimizer algorithm running on our brains. And our terminal value is to satisfy our shards' preferences.
Which is... a pretty difficult proposition, actually. Because many of these shards do not actually codify preferences, and certainly not universal ones. Some of our goals might be defined over specific environments/segments of the world-model, in ways that are difficult to translate/generalize to other environments. Some others might not be "goals" at all, just if-then activation patterns. To do our job, we essentially have to compile our own values, routing around various type errors.
To illustrate what I mean, a few examples:
Hence all of our problems with value reflection: there are often multiple "valid" ways to bootstrap any specific shard to value status.
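A toy sketch of why there can be several "valid" generalizations: below, a shard's past reactions are consistent with two different candidate values, which only come apart in a situation the shard never encountered. The scenario, the feature names, and both candidate functions are hypothetical illustrations, not claims about how any real shard generalizes.

```python
# Hypothetical record of when a "be nice to this person" shard fired.
# In every observed context, "friend" and "nearby" happened to coincide.
observed = [
    ({"is_friend": True, "is_nearby": True}, 1.0),
    ({"is_friend": False, "is_nearby": False}, 0.0),
]


def value_friends(ctx):
    """Candidate generalization A: 'I value the well-being of my friends.'"""
    return 1.0 if ctx["is_friend"] else 0.0


def value_neighbors(ctx):
    """Candidate generalization B: 'I value the well-being of people near me.'"""
    return 1.0 if ctx["is_nearby"] else 0.0


def fits(candidate, data):
    """True if the candidate value reproduces every recorded shard reaction."""
    return all(candidate(ctx) == reaction for ctx, reaction in data)


# Both candidates explain the shard's history equally well...
print(fits(value_friends, observed), fits(value_neighbors, observed))

# ...but they disagree about a novel case: a friend who has moved far away.
novel = {"is_friend": True, "is_nearby": False}
print(value_friends(novel), value_neighbors(novel))
```

Nothing in the recorded reactions picks between the two candidates; whichever one the deliberative process settles on becomes the "compiled" value.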
Hence the various pitfalls we could fall into. These processes of interpretation or generalization are conducted by a deliberative and logical process. And that process can be mistaken, can be fooled by logical or logical-sounding arguments. Hence our propensity to adopt flawed-but-neat ideologies, or become mistaken about what we really want.
Hence our ability to self-modify. The planner can become convinced (either rightly or not) that certain shards need to be created or destroyed for the good of the whole shard economy, then implement plans that do so (build/destroy good/bad habits, remove values that contradict others). At the same time, we also have preferences for retaining our ability to self-modify — both because we're not sure our current model of our desires is accurate, and maybe because we have a shard-implemented preference for mutability.
Thus: We are approximations of idealized utility-maximizers over an inchoate mess of a thousand shards of desire.
Of note: Consider the reversal happening here. Shards began as heuristics optimized by the credit-assignment mechanism to collect a lot of reward. Up to a point, the human's cognitive capabilities were implemented as shards; shards were the optimization process. At that stage, the human wasn't a proper optimizer. In particular, they weren't retargetable.
Over time, however, some components of that system — be that an external planner algorithm or a coalition of planner-shards — developed universal problem-solving capacity. That made the whole shard economy obsolete. But because of the developmental path the human mind took to get there, that mechanism didn't end up optimizing reward. Instead, it was developed to assist shards, and so it re-interpreted shards as its mesa-objectives, in all their messiness.
And it seems very plausible that AIs would follow a similar developmental path.
This framework, in conjunction with my previous toy model, essentially dissolves my main confusions about goal-directedness, human values, and development thereof.
The question to tackle, now, seems to be goal translation/value compilation. How do we adapt the values/goals defined over one environment for another? How do we bootstrap things that do not have the type "value" to the status of a value? What algorithms, in general, exist for doing this? How many possible "solutions" (final value distributions) do such procedures tend to have, and how can the space of solutions be constrained?
In a way, this is just a reformulation of the ontology-shift problem, but this framing seems to make it easier to reason about. And easier to investigate.
Thanks to TurnTrout, Charles Foster, and Quintin Pope for productive discussions and critique.
Or, if we've experienced an ontology break so serious as to invalidate all of our constituent shards, as long as there's the potential for new shards to be formed, which will be adapted to the new world-model.