I have been thinking about Stuart Armstrong's preference synthesis research agenda, and have long had the feeling that there's something off about the way that it is currently framed. In this post I try to describe why. I start by describing my current model of human values and how I interpret Stuart's implicit assumptions to conflict with it, and then talk about my confusion with regard to reconciling the two views.

The two-layer/ULM model of human values

In Player vs. Character: A Two-Level Model of Ethics, Sarah Constantin describes a model where the mind is divided, in game terms, into a "player" and a "character". The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:

In many games, such as Magic: The Gathering, Hearthstone, or Dungeons and Dragons, there’s a two-phase process. First, the player constructs a deck or character from a very large sample space of possibilities.  This is a particular combination of strengths and weaknesses and capabilities for action, which the player thinks can be successful against other decks/characters or at winning in the game universe.  The choice of deck or character often determines the strategies that deck or character can use in the second phase, which is actual gameplay.  In gameplay, the character (or deck) can only use the affordances that it’s been previously set up with.  This means that there are two separate places where a player needs to get things right: first, in designing a strong character/deck, and second, in executing the optimal strategies for that character/deck during gameplay. [...]
The idea is that human behavior works very much like a two-level game. [...] The player determines what we find rewarding or unrewarding.  The player determines what we notice and what we overlook; things come to our attention if it suits the player’s strategy, and not otherwise.  The player gives us emotions when it’s strategic to do so.  The player sets up our subconscious evaluations of what is good for us and bad for us, which we experience as “liking” or “disliking.”
The character is what executing the player’s strategies feels like from the inside.  If the player has decided that a task is unimportant, the character will experience “forgetting” to do it.  If the player has decided that alliance with someone will be in our interests, the character will experience “liking” that person.  Sometimes the player will notice and seize opportunities in a very strategic way that feels to the character like “being lucky” or “being in the right place at the right time.”
This is where confusion often sets in. People will often protest “but I did care about that thing, I just forgot” or “but I’m not that Machiavellian, I’m just doing what comes naturally.”  This is true, because when we talk about ourselves and our experiences, we’re speaking “in character”, as our character.  The strategy is not going on at a conscious level. In fact, I don’t believe we (characters) have direct access to the player; we can only infer what it’s doing, based on what patterns of behavior (or thought or emotion or perception) we observe in ourselves and others.

I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine:

The universal learning hypothesis proposes that all significant mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and mechanisms), the initial rough architecture (equivalent to a prior over mindspace), and a small library of simple innate circuits (analogous to the operating system layer in a computer).  In this view the mind (software) is distinct from the brain (hardware).  The mind is a complex software system built out of a general learning mechanism. [...]
An initial untrained seed ULM can be defined by 1.) a prior over the space of models (or equivalently, programs), 2.) an initial utility function, and 3.) the universal learning machinery/algorithm.  The machine is a real-time system that processes an input sensory/observation stream and produces an output motor/action stream to control the external world using a learned internal program that is the result of continuous self-optimization. [...]
The key defining characteristic of a ULM is that it uses its universal learning algorithm for continuous recursive self-improvement with regards to the utility function (reward system).  We can view this as second (and higher) order optimization: the ULM optimizes the external world (first order), and also optimizes its own internal optimization process (second order), and so on.  Without loss of generality, any system capable of computing a large number of decision variables can also compute internal self-modification decisions.
Conceptually the learning machinery computes a probability distribution over program-space that is proportional to the expected utility distribution.  At each timestep it receives a new sensory observation and expends some amount of computational energy to infer an updated (approximate) posterior distribution over its internal program-space: an approximate 'Bayesian' self-improvement.

Rephrasing these posts in terms of each other, in a person's brain "the player" is the underlying learning machinery, which is searching the space of programs (brains) in order to find a suitable configuration; the "character" is whatever set of emotional responses, aesthetics, identities, and so forth the learning program has currently hit upon.
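To make the player/character mapping a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the three candidate "programs", the `innate_reward` function, and the exponential-weights update are stand-ins that I picked because they are simple, not the algorithm described in the ULM post. The "player" is the update rule that reweights programs by reward; the "character" is whichever program currently has the most weight.

```python
import math

# Hypothetical candidate "programs": each maps an observation to an action.
PROGRAMS = {
    "always_approach": lambda obs: "approach",
    "always_avoid": lambda obs: "avoid",
    "avoid_if_threat": lambda obs: "avoid" if obs == "threat" else "approach",
}


def innate_reward(observation, action):
    """Assumed reward machinery: approaching food is good, approaching a threat is bad."""
    if action == "approach":
        return 1.0 if observation == "food" else -1.0
    return 0.0


def player_update(weights, observation, learning_rate=0.5):
    """The 'player': reweight every program by the reward it would have earned.

    This is an exponential-weights update, used here only as a simple stand-in
    for 'a distribution over program-space shaped by utility'.
    """
    updated = {}
    for name, program in PROGRAMS.items():
        r = innate_reward(observation, program(observation))
        updated[name] = weights[name] * math.exp(learning_rate * r)
    total = sum(updated.values())
    return {name: w / total for name, w in updated.items()}


def current_character(weights):
    """The 'character': the program the learning machinery currently favors."""
    return max(weights, key=weights.get)


# Start from a uniform prior over programs and feed in a few observations.
weights = {name: 1.0 / len(PROGRAMS) for name in PROGRAMS}
for observation in ["food", "threat", "food", "threat"]:
    weights = player_update(weights, observation)
character = current_character(weights)
```

In this toy setup the character is never consulted about which program it would like to be; it is just whatever the update rule has settled on so far, and a different observation stream would have produced a different character from the same machinery.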

Many of the things about the character that seem fixed can in fact be modified by the learning machinery. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable. Unlocking the Emotional Brain describes a number of such updates, such as - in these terms - the ULM eliminating subprograms blocking confidence after receiving an update saying that the consequences of expressing confidence will not be as bad as previously predicted.

Another example of this kind of a thing was the framework that I sketched in Building up to an Internal Family Systems model: if a system has certain kinds of bad experiences, it makes sense for it to spawn subsystems dedicated to ensuring that those experiences do not repeat. Moral psychology's social intuitionist model claims that people often have an existing conviction that certain actions or outcomes are bad, and that they then level seemingly rational arguments for the sake of preventing those outcomes. Even if you rebut the arguments, the conviction remains. This kind of a model is compatible with an IFS/ULM style model, where the learning machinery sets the goal of preventing particular outcomes, and then applies the "reasoning module" for that purpose.

Qiaochu Yuan notes that once you see people being upset at their coworker for criticizing them and you do therapy approaches with them, and this gets to the point where they are crying about how their father never told them that they were proud of them... then it gets really hard to take people's reactions to things at face value. Many of our consciously experienced motivations actually have nothing to do with our real motivations. (See also: Nobody does the thing that they are supposedly doing, The Elephant in the Brain, The Intelligent Social Web.)

Preference synthesis as a character-level model

While I like a lot of the work that Stuart Armstrong has done on synthesizing human preferences, I have a serious concern about it which is best described as: everything in it is based on the character level, rather than the player/ULM level.

For example, in "Our values are underdefined, changeable, and manipulable", Stuart - in my view, correctly - argues for the claim stated in the title... except that it is not clear to me to what extent the things we intuitively consider our "values" are actually our values. Stuart opens with this example:

When asked whether "communist" journalists could report freely from the USA, only 36% of 1950 Americans agreed. A follow up question about American journalists reporting freely from the USSR got 66% agreement. When the order of the questions was reversed, 90% were in favour of American journalists - and an astounding 73% in favour of the communist ones.

From this, Stuart suggests that people's values on these questions should be thought of as underdetermined. I think that this has a grain of truth to it, but that calling these opinions "values" in the first place is misleading.

My preferred framing would rather be that people's values - in the sense of some deeper set of rewards which the underlying machinery is optimizing for - are in fact underdetermined, but that is not what's going on in this particular example. The order of the questions does not change those values, which remain stable under this kind of a consideration. Rather, consciously-held political opinions are strategies for carrying out the underlying values. Receiving the questions in a different order caused the system to consider different kinds of information when it was choosing its initial strategy, causing different strategic choices.

Stuart's research agenda does talk about incorporating meta-preferences, but as far as I can tell, all the meta-preferences are about the character level too. Stuart mentions "I want to be more generous" and "I want to have consistent preferences" as examples of meta-preferences; in actuality, these meta-preferences might exist because of something like "the learning system has identified generosity as a socially admirable strategy and predicts that to lead to better social outcomes" and "the learning system has formulated consistency as a generally valuable heuristic and one which affirms the 'logical thinker' identity, which in turn is being optimized because of its predicted social outcomes".

My confusion about a better theory of values

If a "purely character-level" model of human values is wrong, how do we incorporate the player level?

I'm not sure and am mostly confused about it, so I will just babble & boggle at my confusion for a while, in the hopes that it would help.

The optimistic take would be that there exists some set of universal human values which the learning machinery is optimizing for. There exist various therapy frameworks which claim to have found something like this.

For example, the NEDERA model claims that there exist nine negative core feelings whose avoidance humans are optimizing for: people may feel Alone, Bad, Helpless, Hopeless, Inadequate, Insignificant, Lost/Disoriented, Lost/Empty, and Worthless. And pjeby mentions that in his empirical work, he has found three clusters of underlying fears which seem similar to these nine:

For example, working with people on self-image problems, I've found that there appear to be only three critical "flavors" of self-judgment that create life-long low self-esteem in some area, and associated compulsive or avoidant behaviors:
Belief that one is bad, defective, or malicious (i.e. lacking in care/altruism for friends or family)
Belief that one is foolish, incapable, incompetent, unworthy, etc. (i.e. lacking in ability to learn/improve/perform)
Belief that one is selfish, irresponsible, careless, etc. (i.e. not respecting what the family or community values or believes important)
(Notice that these are things that, if you were bad enough at them in the ancestral environment, or if people only thought you were, you would lose reproductive opportunities and/or your life due to ostracism. So it's reasonable to assume that we have wiring biased to treat these as high-priority long-term drivers of compensatory signaling behavior.)
Anyway, when somebody gets taught that some behavior (e.g. showing off, not working hard, forgetting things) equates to one of these morality-like judgments as a persistent quality of themselves, they often develop a compulsive need to prove otherwise, which makes them choose their goals, not based on the goal's actual utility to themself or others, but rather based on the goal's perceived value as a means of virtue-signalling. (Which then leads to a pattern of continually trying to achieve similar goals and either failing, or feeling as though the goal was unsatisfactory despite succeeding at it.)

So - assuming for the sake of argument that these findings are correct - one might think something like "okay, here are the things the brain is trying to avoid, we can take those as the basic human values".

But not so fast. After all, emotions are all computed in the brain, so "avoidance of these emotions" can't be the only goal any more than "optimizing happiness" can. It would only lead to wireheading.

Furthermore, it seems like one of the things that the underlying machinery also learns is the set of situations in which it should trigger these feelings. E.g. feelings of irresponsibility can be used as an internal carrot and stick scheme, in which the system comes to predict that if it feels persistently bad, this will cause parts of it to pursue specific goals in an attempt to make those negative feelings go away.

Also, we are not only trying to avoid negative feelings. Empirically, it doesn't look like happy people end up doing less than unhappy people, and guilt-free people may in fact do more than guilt-driven people. The relationship is nowhere near linear, but it seems like there are plenty of happy, energetic people who are happy in part because they are doing all kinds of fulfilling things.

So maybe we could look at the inverse of negative feelings: positive feelings. The current mainstream model of human motivation and basic needs is self-determination theory, which explicitly holds that there exist three separate basic needs:

Autonomy: people have a need to feel that they are the masters of their own destiny and that they have at least some control over their lives; most importantly, people have a need to feel that they are in control of their own behavior.
Competence: another need concerns our achievements, knowledge, and skills; people have a need to build their competence and develop mastery over tasks that are important to them.
Relatedness (also called Connection): people need to have a sense of belonging and connectedness with others; each of us needs other people to some degree.

So one model could be that the basic learning machinery is, first, optimizing for avoiding bad feelings; and then, optimizing for things that have been associated with good feelings (even when doing those things is locally unrewarding, e.g. taking care of your children even when it's unpleasant). But this too risks running into the wireheading issue.

A problem here is that while it might make intuitive sense to say "okay, if the character's values aren't the real values, let's use the player's values instead", the split isn't actually anywhere near that clean. In a sense the player's values are the real ones - but there's also a sense in which the player doesn't have anything that we could call values. It's just a learning system which observes a stream of rewards and optimizes it according to some set of mechanisms, and even the reward and optimization mechanisms themselves may end up getting at least partially rewritten. The underlying machinery has no idea about things like "existential risk" or "avoiding wireheading" or necessarily even "personal survival" - thinking about those is a character-level strategy, even if it is chosen by the player using criteria that it does not actually understand.

For a moment it felt like looking at the player level would help with the underdefinability and mutability of values, but the player's values seem like they could be even less defined and even more mutable. It's not clear to me that we can call them values in the first place, either - any more than it makes meaningful sense to say that a neuron in the brain "values" firing and releasing neurotransmitters. The player is just a set of code, or going one abstraction level down, just a bunch of cells.

To the extent that there exists something that intuitively resembles what we call "human values", it feels like it exists in some hybrid level which incorporates parts of the player and parts of the character. That is, assuming that the two can even be very clearly distinguished from each other in the first place.

Or something. I'm confused.

3 comments

I definitely agree that the player vs character distinction is meaningful, although I would define it a bit differently.

I would identify it with cortical vs subcortical, a.k.a. neocortex vs everything else. (...with the usual footnotes, e.g. the hippocampus counts as "cortical" :-D)

(ETA: See my later post Inner alignment on the brain for a better discussion of some of the below.)

The cortical system basically solves the following problem:

Here is (1) a bunch of sensory & other input data, in the form of spatiotemporal patterns of spikes on input neurons, (2) occasional labels about what's going on right now (e.g. "something good / bad / important is happening"), (3) a bunch of outgoing neurons. Your task is to build a predictive model of the inputs, and use that to choose signals to send into the outgoing neurons, to make more good things happen.

The result is our understanding of the world, our consciousness, imagination, memory, etc. Anything we do that requires understanding the world is done by the cortical system. This is your "character".

The subcortical system is responsible for everything else your brain does to survive, one of which is providing the "labels" mentioned above (that something good / bad / important / whatever is happening right now).

For example, take the fear-of-spiders instinct. If there is a black scuttling blob in your visual field, there's a subcortical vision system (in the superior colliculus) that pattern-matches that moving blob to a genetically-coded template, and thus activates a "Scary!!" flag. The cortical system sees the flag, sees the spider, and thus learns that spiders are scary, and it can plan intelligent actions to avoid spiders in the future.
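The spider example can be sketched as a toy program. All names and features here are hypothetical: `subcortical_flag` stands in for a hard-coded template that only sees crude features, and `CorticalLearner` stands in for a learned world model that associates its own concepts with the innate flag. It is a sketch of the division of labor under those assumptions, not a claim about how the superior colliculus or cortex actually compute.

```python
from collections import defaultdict


def subcortical_flag(percept):
    """Hard-coded template (assumed): a dark, small, scuttling blob raises the 'Scary!!' flag."""
    return percept["dark"] and percept["small"] and percept["scuttling"]


class CorticalLearner:
    """Learns how often each of its own concepts co-occurs with the innate flag."""

    def __init__(self):
        self.seen = defaultdict(int)
        self.flagged = defaultdict(int)

    def observe(self, concept, flag):
        """Record one episode: the concept the cortex recognized, and whether the flag fired."""
        self.seen[concept] += 1
        if flag:
            self.flagged[concept] += 1

    def predicted_scariness(self, concept):
        """Fraction of past episodes with this concept in which the flag fired."""
        if self.seen[concept] == 0:
            return 0.0
        return self.flagged[concept] / self.seen[concept]


cortex = CorticalLearner()
episodes = [
    ("spider", {"dark": True, "small": True, "scuttling": True}),
    ("spider", {"dark": True, "small": True, "scuttling": True}),
    ("leaf", {"dark": True, "small": True, "scuttling": False}),
]
for concept, percept in episodes:
    cortex.observe(concept, subcortical_flag(percept))

# A motionless spider does not match the innate template,
# but the learned concept "spider" still predicts the flag.
still_spider = {"dark": True, "small": True, "scuttling": False}
```

The point of the toy is that the template never learns what a "spider" is; only the learner holds that concept, which is why it can treat a motionless spider as scary even though `subcortical_flag(still_spider)` is false.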

I have a lot of thoughts on how to describe these two systems at a computational level, including what the neocortex is doing, and especially how the cortical and subcortical systems exchange information. I am hoping to write lots more posts with more details about the latter, especially about emotions.

even the reward and optimization mechanisms themselves may end up getting at least partially rewritten.

Well, there is such a thing as subcortical learning, particularly for things like fine-tuning motor control programs in the midbrain and cerebellum, but I think most or all of the "interesting" learning happens in the cortical system, not subcortical.

In particular, I'm not really expecting the core emotion-control algorithms to be editable by learning or thinking (if we draw an appropriately tight boundary around them).

More specifically: somewhere in the brain is an algorithm that takes a bunch of inputs and calculates "How guilty / angry / happy / smug / etc. should I feel right now?" The inputs to this algorithm come from various places, including from the body (e.g. pain, hunger, hormone levels), and from the cortex (what emotions am I expecting or imagining or remembering?), and from other emotion circuits (e.g. some emotions inhibit or reinforce each other). The inputs to the emotion calculation can certainly change, but I don't expect that the emotion calculation itself changes over time.

It feels like emotion-control calculations can change, because the cortex can be a really dominant input to those calculations, and the cortex really can change, including by conscious effort. Why is the cortex such a dominant input? Think about it: the emotion-calculation circuits don't know whether I'm likely to eat tomorrow, or whether I'm in debt, or whether Alice stole my cookie, or whether I just got promoted. That information is all in the cortex! The emotion circuits get only tiny glimpses of what's going on in the world, particularly through the cortex predicting & imagining emotions, including in empathetic simulation of others' emotions. If the cortex is predicting fear, well, the amygdala obliges by creating actual fear, and then the cortex sees that and concludes that its prediction was right all along! There's very little "ground truth" that the emotion circuits have to go on. Thus, there's a wide space of self-reinforcing habits of thought. It's a terrible system! Totally under-determined. Thus we get self-destructive habits of thought that linger on for decades.
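A minimal sketch of that feedback loop, under heavy simplifying assumptions: the emotion circuit is a fixed function that simply adds a bodily signal to the cortex's prediction and clamps the result to [0, 1], and the cortex updates its prediction to whatever fear it then observes. The function names and the additive form are invented for illustration.

```python
def emotion_circuit(body_signal, cortex_prediction):
    """Fixed calculation (never modified by learning): combine the two inputs and clamp to [0, 1]."""
    return min(1.0, max(0.0, body_signal + cortex_prediction))


def run_loop(initial_prediction, body_signal=0.0, steps=20):
    """Iterate: the cortex predicts fear, the circuit produces it, the cortex sees it as confirmation."""
    prediction = initial_prediction
    for _ in range(steps):
        fear = emotion_circuit(body_signal, prediction)
        prediction = fear  # the cortex takes the observed fear as its new prediction
    return prediction


anxious_habit = run_loop(initial_prediction=0.8)
calm_habit = run_loop(initial_prediction=0.1)
```

With no bodily "ground truth" coming in, both starting predictions are self-confirming: the loop that starts anxious stays anxious and the one that starts calm stays calm, even though `emotion_circuit` itself is identical in both runs. That is one way to picture how the inputs can change a great deal while the calculation stays fixed.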

Anyway, I have this long-term vision of writing down the exact algorithm that each of the emotion-control circuits is implementing. I think AGI programmers might find those algorithms helpful, and so might people trying to pin down "human values". I have a long way to go in that quest :-D

there's also a sense in which the player doesn't have anything that we could call values ...

I basically agree; I would describe it by saying that the subcortical systems are kinda dumb. Sure, the superior colliculus can recognize scuttling spiders, and the emotion circuits can "dislike" pain. But any sophisticated concept like "flourishing", "fairness", "virtue", etc. can only be represented in the form of something like "Neocortex World Model Entity ID #30962758", and these things cannot have any built-in relationship to subcortical circuits.

So the player's "values" are going to be (1) simple things like "less pain is good", and (2) things that don't have an obvious relation to the outside world, like complicated "preferences" over the emotions inside our empathetic simulations of other people.

If a "purely character-level" model of human values is wrong, how do we incorporate the player level?

Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P

But either way, I'm all for trying to get a better understanding of how I (the character / cortical system) am "built" by the player / subcortical system. :-)

Great comment, thanks!

Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P

Well, to make up a silly example, let's suppose that you have a conscious belief that you want there to be as much cheesecake as possible. This is because you are feeling generally unsafe, and a part of your brain has associated cheesecakes with a feeling of safety, so it has formed the unconscious prediction that if only there was enough cheesecake, then you would finally feel good and safe.

So you program the AI to extract your character-level values, it correctly notices that you want to have lots of cheesecake, and goes on to fill the world with cheesecake... only for you to realize that now that you have your world full of cheesecake, you still don't feel as happy as you were on some level expecting to feel, and all of your elaborate rational theories of how cheesecake is the optimal use of atoms start feeling somehow hollow.

There is a mismatch in saying cortex=character and subcortex=player.

If I understand the player-character model right, then unconscious coping strategies would be a player-level tactic. But these are learned behaviours, and would therefore be part of the cortex.

In Kaj's example, the idea that cheesecake will make the bad go away exists in the cortex's world model.

According to Steven's model of how the brain works (which I think is probably true), the subcortex is part of the game the player is playing. Specifically, the subcortex provides the reward signal, and some other important game stats (stamina level, hit-points, etc). The subcortex is also sort of like a tutorial, drawing your attention to things that the game creator (evolution) thinks might be useful, and occasional cut scenes (acting out pre-programmed behaviour).

ML comparison:
* The character is the pre-trained neural net
* The player is the backprop
* The cortex is the neural net and backprop
* Subcortex is the reward signal and sometimes supervisory signal.
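The comparison above can be illustrated with a very small toy, assuming a one-parameter "network" and a plain gradient-ascent step standing in for backprop. The reward function and all names are hypothetical, and a real network would have many parameters and a sampled rather than exact gradient.

```python
import math


def subcortex_reward(action):
    """Subcortex (assumed): a fixed reward signal; action 'a' is rewarded, 'b' is not."""
    return 1.0 if action == "a" else 0.0


def policy_prob_a(theta):
    """Character: the network as it currently stands, giving P(action 'a')."""
    return 1.0 / (1.0 + math.exp(-theta))


def player_step(theta, learning_rate=1.0):
    """Player: one gradient-ascent step on expected reward (a stand-in for backprop)."""
    p = policy_prob_a(theta)
    # d/dtheta [p * r_a + (1 - p) * r_b] = p * (1 - p) * (r_a - r_b)
    grad = p * (1.0 - p) * (subcortex_reward("a") - subcortex_reward("b"))
    return theta + learning_rate * grad


# Cortex: the network together with the update rule that keeps reshaping it.
theta = 0.0
for _ in range(50):
    theta = player_step(theta)
trained_prob_a = policy_prob_a(theta)
```

Here the policy never sees the update rule that shapes it; it only ends up preferring the rewarded action. Whether that invisibility carries over to real coping strategies is exactly what the next paragraph disputes.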

Also, I don't like the player-character model much. Like all models it is at best a simplification, and it does catch some of what is going on, but I think it is more wrong than right and I think something like a multi-agent model is much better. I.e. there are coping mechanisms and other less conscious strategies living in your brain side by side with who you think you are. But I don't think these are completely invisible the way the player is invisible to the character. They are predictive models (e.g. "cheesecake will make me safe"), and it is possible to query them for predictions. And almost all of these models are in the cortex.