When we talk of aiming for the good future for humanity – whether by aligning AGI or any other way – it's implicit that there are some futures that "humanity" as a whole would judge as good. That in some (perhaps very approximate) sense, humanity could be viewed as an agent with preferences, and that our aim is to satisfy said preferences.

But is there a theoretical basis for this? Could there be? What would it look like?

Is there a meaningful frame in which humanity can be viewed as optimizing for its purported preferences across history?

Is it possible or coherent to imagine a wrapper-mind set to the task of maximizing for the utopia, whose activity we'd actually endorse?

This post aims to sketch out answers to these questions. In the process, it also outlines how my current models of basic value reflection and extrapolation work.

Informal Explanation

Basic Case

Is a utopia that'd be perfect for everyone possible?

The short and obvious answer is no. Our civilization contains omnicidal maniacs and true sadists, whose central preferences are directly at odds with the preferences of most other people. Their happiness is diametrically opposed to other people's.

Less extremely, it's likely that most individuals' absolutely perfect world would fail to perfectly satisfy most others. As a safe example, we could imagine someone who loves pizza, yet really, really hates seafood, to such an extent that they're offended by the mere knowledge that seafood exists somewhere in the world. Their utopia would not have any seafood anywhere – and that would greatly disappoint seafood-lovers. If we now postulate the existence of a pizza-hating seafood-lover... Well, it would seem that their utopias are directly at odds.[1]

Nevertheless, there are worlds that would make both of them happy enough. A world in which everyone is free to eat food that's tasty according to their preferences, and is never forced to interact with the food they hate. Both people would still dislike the fact that their hated dishes exist somewhere. But as long as food-hating is not their core value that's dominating their entire personality, they'd end up happy enough.

Similarly, it intuitively feels that worlds which are strictly better according to most people's entire arrays of preferences are possible. Empowerment is one way to gesture at it – a world in which each individual is simply given more instrumental resources, a greater ability to satisfy whatever preferences they happen to have. (With some limitations on impacting other people, etc.)

But is it possible to arrive at this idea from first principles? By looking at humanity and somehow "eliciting"/"agglomerating" its preferences formally? A process like CEV? A target to hit that's "objectively correct" according to humanity's own subjective values, rather than your subjective interpretation of its values?

Paraphrasing, we're looking for a utility function such that the world-state maximizing it is ranked as very high by the standards of most humans' preferences; a utility function that's correlated with the "agglomeration" of most humans' preferences.

Let's consider what we did in the foods example. We discovered two disparate preferences, and then we abstracted up from them – from concrete ideas like "seafood" and "pizza", to an abstraction over them: food-in-general. And we've discovered that, although the individuals' preferences disagreed on the concrete level, they ended up basically the same at the higher level. Trivializing, it turned out that a seafood-optimizer and a pizza-optimizer could both be viewed as tasty-food-optimizers.

The hypothesis, then, would go as follows: at some very high abstraction level, the level of global matters and fundamental philosophy, most humans' preferences converge to the same utility function over some variable. For example, "maximize eudaimonia" or "human empowerment" or "human flourishing".

There's a counting argument that slightly supports this. Higher abstraction levels are less expressive: they include fewer objects/variables (fewer countries than people, fewer stars than atoms, fewer galaxies than stars) and these objects have fewer states (fewer moods than the ways your brain's atoms could be arranged). So the mapping-up of values to them isn't injective. Thus, some conflicting low-level preferences would map to the same preference over the same high-level variable.

That is, of course, a hypothesis. Nevertheless, the mere fact that we can coherently state it is reassuring regarding our ability to eventually test it.


Is Humanity a Utopia-Maximizer?

Maybe. I don't strongly believe in this, but here's a sketch:

If human values indeed converge like this, then perhaps humanity can be viewed as an approximate agent that's been approximately optimizing for building a utopia for its entire history. But those "approximately" do a lot of work; there's plenty of noise involved.

The primary issue is that the distribution of power between its constituents is non-uniform and changes dynamically. At different times, people with different preferences amass disproportionate amounts of resources (often by orders of magnitude), and "deviate" humanity's path away from the hypothetical averaged-out course, in their individual preferred directions.

But the balance of power frequently changes, and how technologies change it is relatively unpredictable. So potentially these effects actually cancel out on average, and humanity stays roughly on-target? (Pizza-lovers being in charge for 100 years are replaced by seafood-lovers ruling for 100 years; and while they cancel out their specific preferences, both end up advancing humanity towards having tasty food around.)

It would explain why our world, like, actually does mostly get better over time. As well as provide some grounding to the ideas of "moral progress".

Nevertheless, the approximations there may be extremely noisy, to the point that looking at things this way may not be useful.


Is a Utopia-Maximizer Desirable?

Assuming this hypothetical utopian utility function exists and we derive it, would it be possible to then plug it into some idealized agent/wrapper-mind, and not be horrified at the results?

On my view, the answer is obviously yes. There's a bunch of confusions around this idea that I'd like to address; mainly around what "a fixed goal" implies.

Consider a paperclip-maximizer. It wants the universe to be full of paperclips. If it gets its way, it'd reassemble all matter, including itself, into them.

Note, however, that it would not necessarily aim to freeze them in time. Intuitively, it would be fine with the paperclips still orbiting each other, impacting each other, and so on. Moreover, by the very definition of "a paperclip", there'd be all sorts of subatomic processes happening within them. The paperclip-maximizer would want those to run their natural course. Its utility would stay constant as that happens; invariant under these transformations of the world-state.

Similarly, the maximum of a utopia-maximizer would be defined over an enormous equivalence class of world-states. It would not aim to freeze humanity in time, or impose some specific unchanging social order, or tile the universe with copies of specific people that it deemed most optimal for experiencing happiness, etc.

Its utility would be invariant under individual humans changing over time, under them forging new relationships, under societal structures changing and events generally moving forward. As long as those processes don't wander into some nightmarish outcomes. That's the main function it'd provide: a sort of "safety net", lower-bounding how bad things could get. (And currently, they are very, very bad.)
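The invariance point can be made concrete with a small sketch. Everything in it is a made-up illustration: the world is a list of objects, the hypothetical `step` function stands in for "events moving forward", and `paperclip_utility` only looks at how many paperclips exist.

```python
def paperclip_utility(world):
    """Utility depends only on how many paperclips exist, not on where they are."""
    return sum(1 for obj in world if obj["kind"] == "paperclip")


def step(world):
    """Hypothetical dynamics: every object drifts one unit along the x axis."""
    return [{**obj, "pos": (obj["pos"][0] + 1, obj["pos"][1])} for obj in world]


world = [
    {"kind": "paperclip", "pos": (0, 0)},
    {"kind": "paperclip", "pos": (3, 1)},
]
later = step(step(world))

# The low-level state has changed, the utility has not.
print(later != world, paperclip_utility(world), paperclip_utility(later))
```

The two world-states differ at the low level but sit in the same equivalence class as far as the utility function is concerned, which is the sense in which a fixed goal need not freeze anything in place.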

Indeed, being a wrapper-mind doesn't even disqualify you from being a person (contrary to what nostalgebraist's post claims). Your utility can be invariant and maximal under many possible internal states. You can grow and change as a person, even if you have a fixed hard-wired goal that you ultimately serve.

Similarly, it's not unreasonable to suggest that most humans are (effectively isomorphic to) wrapper-minds.

Formal Model

Suppose that on your hands, you have an agent with a vast array of disparate preferences. It's a mess. They're stored in different formats (explicit vs. implicit, deontological vs. consequentialist, instrumental vs. terminal...), defined on different abstraction levels, and often conflict with each other.

You want to optimize them, straighten them out. Resolve whatever conflicts they have, translate them to whatever domains you're working in, extrapolate them (to plan for the long-term), concretize them (to figure out what specific actions a philosophy demands of you), agglomerate them...

Why? Performance optimization. Sure, you could just do babble-and-prune search on your world-model, figuring out what would satisfy those preferences by brute force. But that'd be ruinously compute-intensive. You'd like to cache some of them, derive some heuristics from them, resolve conflicts to stop wasting time on those, etc.

How can you sort it out? What target are you even aiming at?

Well, the purpose of utility functions/preferences is to recommend what actions to take. Indeed, that's their main contribution: they define a preference ordering over candidate plans/actions, either directly (deontology), or by way of looking at what worlds a given action would bring about (consequentialism).

Thus, the correct process of value-system performance-optimization would be made up of transformations such that the preference ordering over actions is invariant under them. I. e., the value-optimized agent would always take the same actions in any given situation as the initial agent (if the latter were given sufficient time to think).

Let's see where that can get us.


Deontological Preferences

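To start off, deontological preferences are isomorphic to utility functions, and utility functions are isomorphic to deontological preferences. They're related by the softmax function:

$$P(a) = \frac{e^{U(a)}}{\sum_{a'} e^{U(a')}}$$

where $U$ is a utility function over actions $a$ and $P$ is the corresponding probability distribution over actions.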

Take a given deontological rule, like "killing is bad". Let's say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that "predicts" that you're very likely/unlikely to take specific actions. The above transform would let us translate it into a utility function over actions.

The other way around, a utility function can be viewed as defining some "target distribution" for the variable over which it's defined. Maximizing expected utility would then be equivalent to minimizing the cross-entropy between that target distribution and the real distribution.
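A minimal sketch of the two directions of this translation, assuming a softmax temperature of 1 and made-up probabilities for a rule like "killing is bad". The helper names (`softmax`, `to_utilities`, `ranking`) are illustrative and not taken from any library. Note that the round trip only pins utilities down up to an additive constant, so "isomorphic" should be read in that sense here.

```python
import math


def softmax(utilities):
    """Map {action: utility} to {action: probability} (temperature 1)."""
    m = max(utilities.values())  # subtract the max for numerical stability
    exps = {a: math.exp(u - m) for a, u in utilities.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def to_utilities(policy):
    """Inverse direction: log-probabilities act as utilities (up to a constant)."""
    return {a: math.log(p) for a, p in policy.items()}


def ranking(scores):
    """Actions sorted from most to least preferred."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical numbers: the rule as a distribution over available actions.
rule = {"help": 0.70, "ignore": 0.29, "kill": 0.01}

utility = to_utilities(rule)   # deontological rule -> utility function
recovered = softmax(utility)   # utility function -> distribution again

print(ranking(rule) == ranking(utility))
print(all(abs(recovered[a] - rule[a]) < 1e-9 for a in rule))
```

Both translations preserve the preference ordering over actions, which is the property the rest of this section relies on.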

And that's not simply an overly abstract trick: it's how human minds are actually hypothesized to work. See Friston's predictive-processing framework in neuroscience (you can start from these comments).

This also covers shards. They're self-executing heuristics bidding for specific actions over others. Thus, each could be transformed into a utility function without loss of information.

That's not at odds with how deontology is usually presented, either. Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their conscious intelligence. But similar dynamics can still be at play "under the hood".


Value Conflict Resolution

Imagine an agent having two utility functions, $U_1$ and $U_2$. It's optimizing for their sum, $U_1 + U_2$. If the values are in conflict, if taking an action that maximizes $U_1$ hurts $U_2$ and vice versa – well, one of them almost surely spits out a higher value, so the maximization of $U_1 + U_2$ is still well-defined.

That's roughly how humans do work in practice. If we face a value conflict, we hesitate a bit (calculating the sum, the "winner"), but ultimately end up taking some action that we endorse.

... unless we hesitate too long, and time chooses for us. Or if we know we have to take action fast, and so decide to use some very rough approximations – and potentially make a mistake which we later regret.

Thus, there's purely practical value in reducing the number of internal conflicts. Finding a value $U_3$ such that, for all situations, it has the same preference ordering as $U_1 + U_2$, but its computational complexity is much lower.
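As a rough illustration of what "same preference ordering, lower complexity" could mean, here is a toy check. The situations, actions, and utility numbers are all invented for the example, and `u_cached` is a hypothetical stand-in for the simpler replacement value; nothing here is meant as the actual procedure a mind runs.

```python
# Hypothetical toy setting: a few situations, two available actions.
SITUATIONS = ["wallet_on_street", "shop_unattended"]
ACTIONS = ["take", "leave"]


def u_good_person(situation, action):
    return 3.0 if action == "leave" else -3.0


def u_thrill(situation, action):
    return 1.0 if action == "take" else 0.0


def u_sum(situation, action):
    """The conflicted agent: evaluates both values and adds them up."""
    return u_good_person(situation, action) + u_thrill(situation, action)


def u_cached(situation, action):
    """Candidate simpler value: one cached rule instead of two evaluations."""
    return 1.0 if action == "leave" else 0.0


def ordering(u, situation):
    """Preference ordering over actions that `u` induces in a situation."""
    return sorted(ACTIONS, key=lambda a: u(situation, a), reverse=True)


def same_orderings(u, v):
    """True if `u` and `v` rank the actions identically in every situation."""
    return all(ordering(u, s) == ordering(v, s) for s in SITUATIONS)


print(same_orderings(u_sum, u_cached))   # the cached rule is a valid replacement
print(same_orderings(u_sum, u_thrill))   # one of the original values alone is not
```

In a realistic setting the set of situations cannot be enumerated, so the equivalence would have to be argued rather than checked exhaustively; the sketch only shows what is being preserved.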


Value Extrapolation

Value extrapolation seems to be straightforward: it's just the reflection of the fact that the world can be viewed as a series of hierarchical ever-more-abstract models.

  1. Suppose we have a low-level model of reality $M^0$, with $n_0$ variables (atoms, objects, whatever).
  2. Suppose we "abstract up", deriving a simpler model of the world $M^1$, with $n_1$ variables. Each variable $x^1_i$ in it is an abstraction over some set of lower-level variables $X^0_i$, such that $x^1_i = f(X^0_i)$.
    • Recap: Higher-level variables are, by definition, less expressive, i. e. the number of states they could be in is lower than the number of states the underlying system can be in. By the counting argument, that means their states are defined over (very large in practice) equivalence classes of low-level states.
    • Example: "I'm happy" is a high-level state that corresponds to a combinatorially large number of configurations my body's atoms can be in. Stipulating "I'm happy" only constrains my low-level state up to that equivalence class.
  3. We iterate, to $M^2$, ..., $M^k$. We derive increasingly more abstract models of the world.
    • Note: $n_0 > n_1 > \dots > n_k$. Since each subsequent level is simpler, it contains fewer variables. People to social groups to countries to the civilization; atoms to molecules to macro-scale objects to astronomical objects; etc.
  4. Let's define the function $P(X^l_i \mid x^{l+1}_i)$. I. e.: it returns a probability distribution over the low-level variables given the state of a high-level variable that abstracts over them.
    • Note: As per (2), that only constrains the low-level system to a (very large) equivalence class of states. (Though the distribution needn't be uniform.)
    • Example: If the world economy is in this state, how happy is my grandmother likely to be?
  5. If we view our values as a utility function $U$, we can "translate" our utility function from any $M^l$ to $M^{l+1}$ roughly as follows: $U^{l+1}(x^{l+1}_i) = \mathbb{E}_{X^l_i \sim P(\cdot \mid x^{l+1}_i)}\left[U^l(X^l_i)\right]$
    • (There's a ton of complications there, but this expression conveys the core idea.)

... and then value extrapolation just naturally falls out of this.

Suppose we have a bunch of values at the $l$-th abstraction level. Once we start frequently reasoning at the $(l+1)$-th level, we "translate" our values to it, and cache the resultant functions. Since the $(l+1)$-th level likely has fewer variables than the $l$-th, the mapping-up is not injective: some values defined over different low-level variables end up translated to the same higher-level variable ("I like pizza and seafood" -> "I like tasty food", "I like Bob and Alice" -> "I like people"). This effect only strengthens as we go up higher and higher. At $M^k$, we can plausibly end up with only one variable we value (as previously speculated, "eudaimonia" or something).
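The translation step described above can be sketched as an expectation over low-level states. The following is a toy illustration under heavy assumptions: a single low-level variable (which dish is served), a single high-level variable ("tasty" vs. "bland"), and a hand-written conditional distribution `P_LOW_GIVEN_HIGH` whose numbers are invented. The function name `translate_up` is hypothetical.

```python
# Assumed P(low-level state | high-level state); the probabilities are made up.
P_LOW_GIVEN_HIGH = {
    "tasty": {"pizza": 0.5, "sushi": 0.5},
    "bland": {"gruel": 0.8, "stale_bread": 0.2},
}


def translate_up(u_low, p_low_given_high):
    """U_high(h) = sum over low-level states s of P(s | h) * U_low(s)."""
    return {
        h: sum(p * u_low[s] for s, p in dist.items())
        for h, dist in p_low_given_high.items()
    }


# Two different low-level utility functions...
u_pizza = {"pizza": 10.0, "sushi": 0.0, "gruel": 0.0, "stale_bread": 0.0}
u_sushi = {"pizza": 0.0, "sushi": 10.0, "gruel": 0.0, "stale_bread": 0.0}

# ...that end up as the same high-level utility function.
up_pizza = translate_up(u_pizza, P_LOW_GIVEN_HIGH)
up_sushi = translate_up(u_sushi, P_LOW_GIVEN_HIGH)
print(up_pizza, up_sushi)
```

The exact coincidence depends on the symmetric numbers chosen here. With a skewed conditional distribution the two translated functions would differ in magnitude, though they could still agree on the ordering of high-level states, which is what matters for the argument.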


Putting It Together

Suppose we have a human on our hands, and we want to compile all of their values into a highly abstract utility function that the human would endorse. To do so, we:

  • Transform all values into the same format. (Either utility functions or probability distributions; doesn't really matter.)
  • Translate them around to reveal value conflicts.
  • Resolve those conflicts by finding equivalent-but-simpler utility functions.
  • Extrapolate them upwards, to the highest abstraction level.
  • We end up with[2] a distillation/compilation of that human's entire selfhood, in the format isomorphic to a utility function. The endpoint of their moral philosophy.

... if only it were this easy.


Major Problem: Meta-Preferences

Humans have preferences not only about object-level stuff, but also about the way they do the whole value-compilation process. The above model assumed an idealized process, in the sense of deriving a utility function that would always recommend the same actions as the initial array of values, but have dramatically lower computational complexity.

However, humans have meta-values that can express arbitrarily custom preferences regarding the process of value reflection itself. We might have preferences over...

  • ... basic translations. E. g., a deontologist's refusal to take money into account when choosing whose life to save. (Refusing to translate and account for that preference.)
  • ... how we extrapolate things up the abstraction levels. E. g., "I'm not going to let my petty preferences impact the future of humanity", such that you ignore your preference for pizza when defining the AGI's utility function (rather than biasing it towards it).
  • ... how we resolve value conflicts. E. g., if we have $U_1$ = "I want to be a good person" and $U_2$ = "I'd get a thrill out of stealing something", we often wouldn't just tweak $U_2$ such that it still fires, but only when stealing something wouldn't be against the society's interests. No: we just flat-out delete $U_2$.
  • Etc.

These complications currently have me worried that there's basically no way to elicit and compile a given human's preferences except directly simulating their mind. No shortcuts whatsoever. (And then that simulation would be path-dependent, such that, depending on what stimuli you show the human in what order, they might end up at vastly-different-yet-equally-legitimate endpoints. But that's a whole separate topic.)

Regardless, this doesn't kill the core idea. I'm reasonably sure (something like) the procedures I've defined are still what humans use most of the time. But there are more complex cases where meta-preferences are involved, they're often crucial, and I'm not sure there are elegant ways to handle them.


Egalitarian Agglomeration

Now onto the last step: how do we agglomerate values between different people? That is, suppose we've "compiled" the preferences of all individual people into a set of utility functions, and then picked just their most-abstract components, getting this set: $\{U_1, U_2, \dots, U_N\}$. How do we transform that into a single $U_{\text{humanity}}$?

Well, ideally, it'll turn out that $U_1 = U_2 = \dots = U_N$. That's the "strong" version of the "human value convergence hypothesis".

What if not, though?

The naive idea would be to just proceed as we had before, and find a simpler function that recommends the same actions as the individual functions' sum. But that has some undesirable properties, like a sensitivity to "utility monsters". The Geometric Rationality sequence has made that point rather well.

Thus, a better target would be a function that's equivalent to the product of individual humans' utility functions. Maximizing it is equivalent to maximizing the expected logarithm of the utility of a randomly-chosen human, which heavily penalizes leaving anyone near zero; thus, it pushes towards distributing utility evenly across everyone. (I really recommend reading the Geometric Rationality sequence.)
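A small sketch of the difference between the two aggregation rules, with invented numbers. It assumes all utilities are strictly positive (the product rule is not well-behaved otherwise), and the outcome names and the "monster" agent are purely illustrative.

```python
import math

# Hypothetical outcomes: each maps a person to their (positive) utility.
OUTCOMES = {
    "even_split":   {"alice": 5.0, "bob": 5.0, "monster": 5.0},
    "feed_monster": {"alice": 0.1, "bob": 0.1, "monster": 100.0},
}


def arithmetic_score(utilities):
    """Sum of individual utilities."""
    return sum(utilities.values())


def geometric_score(utilities):
    """Product of individual utilities, computed via logs."""
    return math.exp(sum(math.log(u) for u in utilities.values()))


def best(score):
    """Name of the outcome that maximizes the given aggregate score."""
    return max(OUTCOMES, key=lambda name: score(OUTCOMES[name]))


print(best(arithmetic_score))  # the sum is dominated by the utility monster
print(best(geometric_score))   # the product prefers the even split
```

Under the sum, one agent with an enormous utility can outweigh everyone else; under the product, driving anyone's utility towards zero drags the whole score down.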

And that result is, theoretically,

  • A utility function that humanity-as-a-whole could be said to have been (very roughly) maximizing throughout its history.
  • A utility function that something like CEV might spit out.
  • A utility function whose maximization would rank high by most individual humans' preferences/utility functions.
  • A utility function we could hook up to a wrapper-mind, and then be happy with the result.
  1. ^

    I'm sure you can come up with less tame examples from, say, politics or social issues. Fill them in as needed.

  2. ^

    Well, that was a simplified description of the process. In practice, you'd need to mix these steps up repeatedly.

9 comments

Sure, every time you go more abstract there are fewer degrees of freedom. But there's no free lunch - there are degrees of freedom in how the more-abstract variables are connected to less-abstract ones.

People who want different things might make different abstractions. E.g. if you're calling some high level abstraction "eat good food," it's not that this is mathematically the same abstraction made by someone who thinks good food is pizza and someone else who thinks good food is fish. Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

Yes, at high levels of abstraction, humans can all recommend the same abstract action. But I don't care about abstract actions, I care about real-world actions.

E.g. suppose we abstract the world to an ontology where there are two states, "good" and "bad," and two actions - stay or swap. Lo and behold, ~everyone who abstracts the world to this ontology will converge to the same policy in terms of abstract actions: make the world good rather than bad. But if two people disagree utterly about which low-level states get mapped onto the "good" state, they'll disagree utterly about which low-level actions get mapped onto the "swap from bad to good" action, and this abstraction hasn't really bought us anything.

People who want different things might make different abstractions

That's a direct rejection of the natural abstractions hypothesis. And some form of it increasingly seems just common-sensically true.

It's indeed the case that one's choice of what system to model is dependent on what they care about/where their values are housed (whether I care to model the publishing industry, say). But once the choice to model a given system is made, the abstractions are in the territory. They fall out of noticing to which simpler systems a given system can be reduced.

(Imagine you have a low-level description of a system defined in terms of individual gravitationally- and electromagnetically-interacting particles. Unbeknownst to you, the system describes two astronomical objects orbiting each other. Given some abstracting-up algorithm, we can notice that this system reduces to these two bodies orbiting each other (under some definition of approximation).

It's not value-laden at all: it's simply a true mathematical fact about the system's dynamics.

The NAH is that this generalizes, very widely.)

Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones.

suppose we abstract the world to an ontology where there are two states, "good" and "bad," 

If your model is assumed, i. e. that abstractions are inherently value-laden, then yes, this is possible. But that's not how it'd work under the NAH and on my model, because "good" and "bad" are not objective high-level states a given system could be in.

It'd be something like State A and State B. And then the "human values converge" hypothesis is that all human values would converge to preferring one of these states.

Not even if those people independently keep going higher in the abstraction hierarchy - they'll never converge to the same object, because there's always that inequivalence in how they're translated back to the low level description.

I mean, that's clearly not how it works in practice? Take the example in the post literally: two people disagree on food preferences, but can agree on the "food" abstraction and on both of them having a preference for subjectively tasty ones.

I agree with the part of what you just said that's the NAH, but disagree with your interpretation.

Both people can recognize that there's a good abstraction here, where what they care about is subjectively tasty food. But this interpersonal abstraction is no longer an abstraction of their values, it simply happens to be about their values, sometimes. It can no longer be cashed out into specific recommendations of real-world actions in the way someone's values can[1].

  1. ^

    For certain meanings of "values," ofc.

Okay, let's build a toy model.

  1. We have some system with a low-level state $L$, which can take on one of six values: $\{a, b, c, d, e, f\}$.
  2. We can abstract over this system's state and get a high-level state $H$, which can take on one of two states: $\{x, y\}$.
  3. We have an objective abstracting-up function $F: L \to H$.
  4. We have the following mappings between states: $F(a) = F(b) = F(c) = x$ and $F(d) = F(e) = F(f) = y$.
  5. We have a utility function $U_A$, with a preference ordering of $a > b > c > \{d, e, f\}$, and a utility function $U_B$, with a preference ordering of $c > b > a > \{d, e, f\}$.
  6. We translate both utility functions to $H$, and get the same utility function: $U_H$, whose preference ordering is $x > y$.

Thus, both $U_A$ and $U_B$ can agree on which high-level state they would greatly prefer. No low-level state would maximally satisfy both of them, but they both would be happy enough with any low-level state that gets mapped to the high-level state of $x$. ($b$ is the obvious compromise.)

Which part of this do you disagree with?

I disagree that translating to x and y lets you "reduce the degrees of freedom" or otherwise get any sort of discount lunch. At the end you still had to talk about the low level states again to say they should compromise on b (or not compromise and fight it out over c vs. a, that's always an option).

At the end you still had to talk about the low level states again to say they should compromise on b 

"Compromising on $b$" is a more detailed implementation that can easily be omitted. The load-bearing part is "both would be happy enough with any low-level state that gets mapped to the high-level state of $x$".

For example, the policy of randomly sampling any low-level state $l$ such that the abstracting-up function maps it to $x$ is something both utility functions can agree on, and doesn't require doing any additional comparisons of low-level preferences, once the high-level state has been agreed upon. Rising tide lifts all boats, etc.

Suppose the two agents are me and a flatworm.
a = ideal world according to me
b = status quo
c = ideal world according to the flatworm
d, e, f = various deliberately-bad-to-both worlds

I'm not going to stop trying to improve the world just because the flatworm prefers the status quo, and I wouldn't be "happy enough" if we ended up in flatworm utopia.

What bargains I would agree to, and how I would feel about them, are not safe to abstract away.

I wouldn't be "happy enough" if we ended up in flatworm utopia

You would, presumably, be quite happy compared to "various deliberately-bad-to-both worlds".

I'm not going to stop trying to improve the world just because the flatworm prefers the status quo

Because you don't care about the flatworm and the flatworm is not perceived by you as having much bargaining power for you to bend to its preferences.

In addition, your model rules out more fine-grained ideas like "the cubic mile of terrain around the flatworm remains unchanged while I get the rest of the universe". Which is plausibly what CEV would result in: everyone gets their own safe garden, with the only concession the knowledge that everyone else's safe gardens also exist.

It's interesting that part of human value might be having our actions matter. But if you build an AI that can give you all the things, or even if you could've built such an AI but chose not to, then objectively your actions no longer matter much after that. I've no idea how even CEV could approach this problem.

Edit: I think I've figured it out. The AI shouldn't try to build the best world according to CEV, it should take the best action for an AI to take according to CEV. So if the AI notices that humans strongly prefer to be left alone with their problems, it'll just shut down. Or find some other way to ensure that humans can't rely on AIs for everything.