I have previously advocated for finding a mathematically precise theory for formally approaching AI alignment. Most recently I couched this in terms of predictive coding, and longer ago I was thinking in terms of a formalized phenomenology, but further discussions have helped me realize that, while I consider those approaches useful and they helped me discover my position, they are not the heart of what I think is important. The heart, modulo additional paring down that may come as a result of discussions sparked by this post, is that human values are rooted in valence, and thus if we want to build AI aligned with human values we must be able to understand how values arise from valence.

Peter Carruthers has kindly and acausally done me the favor of laying out large parts of the case for a valence theory of value ("Valence and Value", Philosophy and Phenomenological Research, Vol. XCVII No. 3, Nov. 2018, doi:10.1111/phpr.12395). He sets out to do two things in the linked paper. One is to make the case that valence is a "unitary natural-psychological kind" (another way of saying it parsimoniously cuts the reality of human minds at the joints). The other is to give an account of how it is related to value, arguing that valence represents value against the position that valence is value. He calls these positions the representational and the hedonic accounts, respectively.

I agree with him on some points and disagree on others. I mostly agree with section 1 of his paper, and then proceed to disagree with parts of the rest, largely because I think his representational account of valence flips the relationship between valence and value. Nonetheless, he has provided a strong jumping-off point and explores many important considerations, so let's start from there before moving towards a formal model of values in terms of valence and saying how that model could be used in formally specifying what it would mean for two agents to be aligned.

The Valence-Value Connection

In the first section he offers evidence that valence and value are related. I recommend you read his arguments for yourself (the first section is only a few pages), but I'll point out several highlights:

It is widely agreed, however, that all affective states share two dimensions of valence and arousal (Russell, 1980, 2003; Reisenzein, 1994; Rolls, 1999). All affective states have either positive or negative valence (positive for orgasm, negative for fear); and all can be placed along a continuum of bodily arousal (high or low heart-rate, speed of breathing, tensing of muscles, and so on).
Valence-processing appears to be underlain by a single (albeit multicomponent) neurobiological network, involving not just subcortical evaluative regions in the basal ganglia, but also the anterior insula and anterior cingulate, together especially with orbitofrontal and ventromedial prefrontal cortex (Leknes & Tracey, 2008; FitzGerald et al., 2009; Plassmann et al., 2010; Bartra et al., 2013). The latter regions are the primary projection areas for valence signals in the cortex. These signals are thought to provide an evaluative "common currency" for use in affectively-based decision making (Levy & Glimcher, 2012). Valence produced by many different properties of a thing or event can be summed and subtracted to produce an overall evaluative response, and such responses can be compared to enable us to choose among options that would otherwise appear incommensurable.
Moreover, not only can grief and other forms of social suffering be blunted by using Tylenol, just as can physical pain (Lieberman & Eisenberger, 2009; Lieberman, 2013), but so, too, is pleasure blunted by the same drugs (Durso et al., 2015). In addition, both pain and pleasure are subject to top–down placebo and nocebo effects that seemingly utilize the same set of mechanisms. Just as expecting a pain to be intense (or not) can influence one’s experience accordingly, so can expectations of pleasure increase or decrease the extent of one’s enjoyment (Wager, 2005; Plassmann et al., 2008; Ellingsen et al., 2013). Indeed, moderate pain that is lesser than expected can even be experienced as pleasant, suggesting the involvement of a single underlying mechanism (Leknes et al., 2013).
It is widely believed by affective scientists that valence is intrinsically motivating, and plays a fundamental role in affectively-based decision making (Gilbert & Wilson, 2005; Levy & Glimcher, 2012). When we engage in prospection, imagining the alternatives open to us, it is valence-signals that ultimately determine choice, generated by our evaluative systems responding to representations of those alternatives. The common currency provided by these signals enables us to compare across otherwise incommensurable alternatives and combine together the values of the different attributes involved. Indeed, there is some reason to think that valence might provide the motivational component underlying all intentional action, either directly or indirectly.
Moreover, intentions can constrain and foreclose affect-involving practical reasoning. Likewise, one’s goals can issue in behavior without requiring support from one’s affective states. Notably, both intentions and goals form parts of the brain’s control network, located especially in dorsolateral prefrontal cortex (Seeley et al., 2007). Note that this network is distinct from—although often interacting with, of course—the affective networks located in ventromedial prefrontal cortex and subcortically in the basal ganglia.
More simply, however, beliefs about what is good can give rise to affective responses directly. This is because of the widespread phenomenon of predictive coding (Clark, 2013), which in this case leads to an influence of top–down expectations on affective experience. We know that expecting an image to depict a house can make it appear more house-like than it otherwise would (Panichello et al., 2013). And likewise, expecting something to be good can lead one to experience it as more valuable than one otherwise would. This is the source of placebo-effects on affective experience (Wager, 2005; Plassmann et al., 2008; Ellingsen et al., 2013). Just as expecting a stimulus to be a house can cause one to experience it as house-like even if it is, in fact, completely neutral or ambiguous, so believing something to be good may lead one to experience it as good in the absence of any initial positive valence.
It may be, then, that the valence component of affect plays a fundamental and psychologically-essential role in motivating intentional action. It is the ultimate source of the decisions that issue in intentions for the future and the adoption of novel goals. And it is through the effects of evaluative beliefs on valence-generating value systems that the former can acquire a derivative motivational role. If these claims are correct, then understanding the nature of valence is crucial for understanding both decision-making and action.

Credit where credit is due: the Qualia Research Institute has been pushing this sort of perspective for a while. I didn't believe that valence was a natural kind, though, until I understood it as the signaling mechanism in predictive coding; other lines of evidence may be convincing to other folks on that point, or you may still not be convinced. In my estimation, Carruthers does a much better job of presenting the evidence to a skeptical, academic audience than either QRI or I have done, although I expect there are still many gaps to be covered which could prove to unravel the theory. Regardless, it should be clear that something is going on that relates valence to value, so even if you don't think the relationship is fundamental, it should still be valuable to learn what we can from how valence and value relate to help us become less confused about values.

How Valence and Value Interact

Carruthers takes the position that valence is representative of value (he calls this the "representational account") and argues it against the position that valence and value are the same thing (the "hedonic account"). By "representative of" he seems to mean that value exists and valence is something that partially or fully encodes value into a form that brains can work with. Here's how he describes it, in part:

On one view (the view I advocate) the valence component of affective states like pain and pleasure is a nonconceptual representation of badness or goodness. The valence of pain is a fine-grained perception-like representation of seeming badness and the valence of pleasure is a similarly fine-grained representation of seeming goodness, where both exist on a single continuum of seeming value. However, these phrases need to be understood in a way that does not presuppose any embedding within the experience of the concepts BAD and GOOD. One has to use these concepts in describing the content of a state of valence, of course (just as one has to use color-concepts in describing the content of color experience), but that doesn’t mean that the state in question either embeds or presupposes the relevant concept.
For comparison, consider nonconceptual representations of approximate numerosity, of the sort entertained by infants and nonhuman animals (and also by adult humans) (Barth et al., 2003; Jordan et al., 2008; Izard et al., 2009). In describing the content of such a representation one might say something like: the animal sees that there are about thirty dots on the screen. This needs to be understood in a way that carries no commitment to the animal possessing the concept THIRTY, however. Rather, what we now know is that the representation is more like a continuous curve centered roughly on thirty that allows the animal to discriminate thirty dots from forty dots, for example, but not thirty from thirty-five.

At first I thought I agreed with him on the representational account because he rightly, in my view, notices that valence need not contain within it nor be built on our conceptual, ontological understanding of goodness, badness, and value. Reading closer and given his other arguments, though, it seems to me that he is saying that although valence is not representational of a conceptualization of value, he does mean it is representational of real values, whatever those be. I take this to be a wrong-way reduction: he is taking a simpler thing (valence) and reducing it into terms of a more complex thing (value).

I'm also not convinced by his arguments against the "hedonic account" since, to my reading, they often reflect a simplistic interpretation of how valence signals might function in the brain to produce behavior. This is forgivable, of course, because complex dynamic systems are hard to reason about, and if you don't have first-hand experience with them you might not fully appreciate the way simple patterns of interaction can give rise to complex behavior. That said, his arguments against identifying value with valence fail, in my mind, to make his point, because they all leave open the escape route of "complex interactions that behave differently than the simple interactions they are made of", sort of like failing to realize that a solar-centric, elliptical-orbit planetary model can account for retrograde motion even though it doesn't contain any parts that "move backwards", or that evolution by differential reproduction can give rise to beings that do things that do not contribute to differential reproductive success.

Yet I don't think the hedonic account, as he calls it, is quite right either, because he defines it such that there is no room between valence and value for computation to occur. Based on the evidence for a predictive-coding-like mechanism at play in the human brain (cf. academic papers on the first page of Googling "predictive coding evidence" for: 1, 2, 3, 4; against: 1), that mechanism using valence to send feedback signals, and the higher prior likelihood that values are better explained by reducing them to something simpler rather than vice versa, I'm inclined to explain the value-valence connection as the result of our reifying as "values" the self-experience of having brains semi-hierarchically composed of homeostatic mechanisms that use valence to send feedback signals. Or with less jargon: values are the experience of computing the aggregation of valence signals. Against the representational and hedonic accounts, we might call this the constructive account, because it suggests that value is constructed by the brain from valence signals.

My reasoning constitutes only a sketch of an argument for the constructive account. A more complete argument would need to address, at a minimum, the various cases Carruthers considers and much else besides. I might do that in the future if it proves instrumental to my ultimate goal of seeing the creation of safe, superintelligent AI, but for now I'll leave it at this sketch to move on to offering a mathematical model of the constructive account and using it to formalize what it would mean to construct aligned AI. Hereon I'll assume the constructive account, making the rest of this post conditional on that account's as yet unproven correctness.

A Formal Model of Human Values in Terms of Valence

The constructive account implies that we should be able to create a formal model of human values in terms of valence. I may not manage to create a perfect or perfectly defensible model, but my goal is to make it at least precise enough to squeeze out any wiggle room where, if the theory is wrong, the error might try to hide. Thus we can either expose fundamental flaws in the core idea (values constructed from valence) or expose flaws in the model in order to move towards a better model that correctly captures the idea in a precise enough way that we can safely use it when reasoning about alignment of superintelligent AI.

Let's start by recalling some existing models of human values and work from there to create a model of values grounded in valence. This will mean a slight shift in terminology, from talking about values to preferences. I will here consider these terms interchangeable, but not everyone would agree. Some people insist values are not quantifiable or formally modelable. I'm going to, perhaps unfairly, completely ignore this class of objection, as I doubt many of my readers believe it. Others use "value" to mean the processes that generate preferences, or they might only consider meta-preferences to be values. This is a disagreement on definitions, so know that I am not making this kind of distinction, and instead lump everything like value, preference, affinity, taste, etc. into a single category and use these terms interchangeably, since I think they are all of the same kind: things that generate answers to questions of the form "what should one do?".

The standard model of human values is the weak preference ordering model. Given the set of all possible world states W, a person's values are defined by a weak order ≾ over W. This model has several variations, such as replacing ≾ with a strict total order, removing the ability to equally value two different world states, or relaxing ≾ to a partial order, permitting incomparable world states. The benefit of using a weak order is that it's sufficient for modeling rational agents: a total order is overkill and a partial order is not enough to make expected utility theory work.
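To make the weak-order machinery concrete, here is a minimal Python sketch (the states and their scores are invented for illustration): any real-valued scoring function induces a weak order that is complete over all pairs while still allowing ties, which is exactly what distinguishes it from a strict total order.

```python
from itertools import combinations

# Hypothetical toy world: each state is a label with an assumed score.
# A weak order (total preorder) is induced by the scoring function:
# w1 is weakly dispreferred to w2 iff score(w1) <= score(w2).
scores = {"rainy": 1.0, "cloudy": 2.0, "sunny": 3.0, "breezy": 2.0}

def weakly_prefers(w1, w2):
    """True iff w2 is at least as preferred as w1 (ties allowed)."""
    return scores[w1] <= scores[w2]

# Completeness: every pair of states is comparable one way or the other.
assert all(weakly_prefers(a, b) or weakly_prefers(b, a)
           for a, b in combinations(scores, 2))

# Indifference between distinct states is permitted ("cloudy" ~ "breezy"),
# which a strict total order would forbid.
assert weakly_prefers("cloudy", "breezy") and weakly_prefers("breezy", "cloudy")
```

A partial order, by contrast, would allow some pairs to be incomparable, breaking the completeness assertion above.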

Unfortunately humans aren't rational agents, so the weak preference ordering model fails to completely describe human values. Or at least so it seems at first. One response is to throw out the idea that there is even a preference ordering, instead replacing it with a preference relation that sometimes gives a comparison between two world states, sometimes doesn't, and sometimes produces loops (an "approximate order"). Although I previously endorsed this approach, I no longer do, because most of the problems with weak ordering can be solved by Stuart Armstrong's approach of realizing that world states are all conditional on their causal history (that is, time-invariant preferences don't actually exist; it just sometimes looks like they do) and treating human preferences as partial (held over not necessarily disjoint subsets of W, reflecting that humans only model a subset of possible world states). This means that having a weak preference ordering may not in itself be a challenge to giving a complete description of human values, so long as what constitutes a world state and how the preferences form over them are adequately understood.
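Armstrong's move can be illustrated with a toy sketch (all names and numbers below are illustrative assumptions, not taken from his model): an apparent preference loop dissolves once states are distinguished by their causal history.

```python
def naive_prefers(x, y):
    # An apparently cyclic "preference": B over A, C over B, A over C.
    return (x, y) in {("A", "B"), ("B", "C"), ("C", "A")}

# No single score over the bare states {A, B, C} can represent this cycle:
assert naive_prefers("A", "B") and naive_prefers("B", "C") and naive_prefers("C", "A")

# Conditioning each state on how it was reached splits the loop: "A reached
# after C" is a different, history-laden state than "A reached from nothing".
history_scores = {
    ("A", None): 0,   # start at A
    ("B", "A"): 1,    # B, having come from A
    ("C", "B"): 2,    # C, having come from B
    ("A", "C"): 3,    # A again, but now a distinct state
}
path = [("A", None), ("B", "A"), ("C", "B"), ("A", "C")]

# Preference along the path is now monotone in a single score: no loop remains.
assert all(history_scores[path[i]] < history_scores[path[i + 1]]
           for i in range(len(path) - 1))
```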

Getting to adequate understanding is non-trivial, though. For example, even if the standard model describes how human preferences function, it doesn't explain how to learn what they are. The usual approach to finding preferences is behaviorist: observe the behavior of an agent and infer the values from there with the necessary help of some normative assumptions about human behavior. This is the approach in economic models of revealed preferences, in inverse reinforcement learning, and in much of how humans model other humans. Stuart Armstrong's model of partial preferences avoids making normative assumptions about behavior by instead making assumptions about how to define preferences, but it ends up requiring solving the symbol grounding problem. I think we can do better by assuming that preferences are computed from valence, since valence is in theory observable, is correlated with values, and requires solving problems in neuroscience rather than philosophy.

So, without further ado, here's the model.

The Model

Let H be a human embedded in the world and W the set of all possible world states. Let W_H be the set of all world states as perceived by H, and w_H ∈ W_H the world state w ∈ W as H perceives it. Let V = {v_1, …, v_n} be the set of valence functions v_i : W_H → ℝ of the brain that generate real-valued valence when in a perceived world state. Let g be the aggregation function from the outputs of all v_i on w_H to a single real-valued aggregate valence, which for H is denoted P(w) = g(v_1(w_H), …, v_n(w_H)) and is called the preference function. Then the weak preference ordering ≾ of H is given by w ≾ w′ ⟺ P(w) ≤ P(w′).
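One way the model might be instantiated in code is sketched below; everything concrete here (summation as the aggregation function g, random weights standing in for control systems, rounding standing in for lossy perception) is an assumption made for illustration, not part of the model itself.

```python
import random

random.seed(0)

# n_v valence functions v_i, each standing in for one control system.
n_v = 5
weights = [random.uniform(-1, 1) for _ in range(n_v)]

def perceive(w):
    """w_H: H's lossy perception of world state w (here: mere rounding)."""
    return [round(x, 1) for x in w]

def v(i, w_H):
    """Valence signal from the i-th control system in perceived state w_H."""
    return weights[i] * sum(w_H)

def P(w):
    """Preference function: aggregation g (here, a sum) of all valence signals."""
    w_H = perceive(w)
    return sum(v(i, w_H) for i in range(n_v))

# The induced weak preference ordering over a sample of world states:
states = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
ordering = sorted(states, key=P)  # least to most preferred

assert all(P(a) <= P(b) for a, b in zip(ordering, ordering[1:]))
```

The point of the sketch is only that once the v_i and g are pinned down, the weak ordering falls out mechanically from P.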

Some notes. The functions v_i are, as I envision them, meant to correspond to partial interaction with w and to represent how individual control systems in the brain generate valence signals in response to being in a state that is part of w_H, although I aim for the v_i to make sense as an abstraction even if the control system model is wrong, so long as the constructive valence account is right. There might be a better formalism than making each v_i a function from all of W_H to the reals, but given the interdependence of everything within a Hubble volume, it would likely be to reconstitute each v_i with a W_H as accessible from each particular control system, or from each physical interaction within each control system, even if in practice each control system ignores most of the available state information and sees the world not much differently from every other control system in a particular brain. Thus for humans with biological brains as they exist today a shared W_H is probably adequate unless future neuroscience suggests greater precision is needed.

However, maybe we don't need W_H and can simply use W directly, with the v_i entirely accounting for the subjective aspects of calculating P.

I'm uncertain if producing a complete ordering of W is a feature or a bug. On the standard model it would be a feature because rational choice theory expects this, but on Stuart's model it might be a bug because now we're creating more ordering than any human is computing. More generally we should expect, lacking hypercomputation, that any embedded (finite) agent cannot actually compute a complete order on W because, even if W is finite, it's so large as to require more compute than will ever exist in the entire universe to consider each member once. But then again maybe this is fine and we can capture the partial computation of ≾ via an additional mechanism while leaving this part of the model as is.
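The compute-bound intuition can be made slightly more concrete with a back-of-the-envelope check (the 10^120 figure for total operations performable within the observable universe is a commonly cited estimate, used here as an assumption):

```python
# Even a crudely discretized world of 1000 binary features already has more
# states than any plausible bound on total computation available, so a full
# pairwise ordering of W is hopeless long before W gets realistic.
n_features = 1000
n_states = 2 ** n_features

# Assumed upper bound on operations performable in the observable universe.
universe_ops_bound = 10 ** 120

assert n_states > universe_ops_bound
```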

Regardless, we should recognize that the completeness of ≾ is itself a manifestation of a more general limitation of the model: the model doesn't reflect how humans compute value from valence, because it supposes the possibility of simultaneously computing P over all of W and determining the best world state in O(1) time. Otherwise it would need to account for considering a subset of W that might shift as the world state in which H is embedded transitions from w_1 to w_2 to w_3 and so on (that is, even if we restrict W_H to possible world states that are causal successors of the present state, W_H will change in the course of computing P). The model presented is timeless, and that might be a problem: values exist at particular times because they are features of an embedded agent, so letting them float free of time fails to completely constrain the model to reality. I'm not sure if this is a practical problem or not.

Further, this model has many of the limitations that Stuart Armstrong's value model has: it provides a slice-in-time view of values rather than persistent values, it doesn't say how to get ideal or best values, and it doesn't deal with questions of identity. Those might or might not be real limitations: maybe there are no persistent values and the notion of a persistent value is a post hoc reification in human ontology; maybe ideal or best values don't exist or are at least uncomputable; and maybe identity is also a post hoc reification that isn't, in a certain sense, real. Clearly, I and others need to think about this more.

Despite all these limitations, I'm excited about this model because it provides a starting point for building a more complete model that addresses these limitations while capturing the important core idea of a constructive valence account of values.

Formally Stating Alignment

Using this model, I can return to my old question of how to formally specify the alignment problem. Rather than speaking in terms of phenomenological constructs, as I did in my previous attempt, I can simply talk in terms of valence and preference ordering.

Consider two agents, a human H and an AI A, in a world with possible states W. Let P_H and U_A be the preference function of H and the utility function of A, respectively. Then A is aligned with H if, for all w, w′ ∈ W, P_H(w) ≤ P_H(w′) ⟺ U_A(w) ≤ U_A(w′), i.e. P_H and U_A induce the same weak ordering over W.
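On this reading, alignment is just agreement of induced orderings, which is easy to sketch (the particular P_H and U_A below are invented for illustration): any monotone transform of P_H is aligned, while an ordering-reversing transform is not.

```python
from itertools import combinations

def P_H(w):
    """Human preference function (illustrative)."""
    return w[0] + 2 * w[1]

def U_A(w):
    """AI utility function: a monotone transform of P_H (illustrative)."""
    return 3 * (w[0] + 2 * w[1]) + 7

def aligned(pref, util, states):
    """True iff pref and util agree on every pairwise comparison."""
    return all((pref(a) <= pref(b)) == (util(a) <= util(b))
               for a, b in combinations(states, 2))

W_sample = [(0, 0), (1, 0), (0, 1), (2, 1), (1, 3)]

assert aligned(P_H, U_A, W_sample)                     # monotone transform: aligned
assert not aligned(P_H, lambda w: -P_H(w), W_sample)   # ordering reversed: not aligned
```

Note that only the induced ordering matters here: U_A's scale and offset are irrelevant, which is consistent with utility functions being defined up to positive affine (indeed, any monotone) transformation in this ordinal setting.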

In light of my previous work, I believe this is sufficient because, even though it does not explicitly mention how H and A model each other, that is not necessary: it is already captured by the subjective nature of P_H and U_A, i.e. H's and A's ontologies are already computed within P_H and U_A, so we don't need to make them explicit at this level of the model.


I very much think of this as a work in progress that I'm publishing in order to receive feedback. Although it's the best of my current thinking given the time and energy I have devoted to it, my thinking is often made better by collaboration with others, and I think the best way to make that happen is by doing my best to explain my ideas so others can interact with them. I hope eventually to evolve these ideas towards something I'm sure enough of that I would want to see it published in a journal, but first I would want to be much more sure it describes reality in a way useful for understanding human values as necessary to building aligned AGI.

As briefly mentioned earlier, I also think of this work as conditional on future neuroscience proving correct the constructive valence account of values, and I would be happy to get it to a point where I was more certain of it being conditionally correct even if I can't be sure it is correct because of that conditionality. Another way to put this is that I'm taking a bet with this work that the constructive account will be proven correct. Thus I'm most interested in comments that poke at this model conditional on the constructive account being correct, medium interested in comments that poke at the constructive account, and least interested in comments that poke at the fact that I'm taking this bet or that I think specifying human values is important for alignment (we've previously discussed that last topic elsewhere).

Comments (2)

This was definitely an interesting and persuasive presentation of the idea. I think this goes to the same place as learning from behavior in the end, though.

For behavior: In the ancestral environment, we behaved like we wanted nourishing food and reproduction. In the modern environment we behave like we want tasty food and sex. Given a button that pumps heroin into our brain, we might behave like we want heroin pumped into our brains.

For valence, the set of preferences that optimizing valence cashes out to depends on the environment. We, in the modern environment, don't want to be drugged to maximize some neural signal. But if we were raised on super-heroin, we'd probably just want super-heroin. Even assuming this single-neurological-signal hypothesis, we aren't valence-optimizers, we are the learned behavior of a system whose training procedure relies on the valence signal.

Ex hypothesi, we're going to have learned preferences that won't optimize valence, but might still be understandable in terms of a preference maturation process that is "trying" to optimize valence but ran into distributional shift or adversarial optimization or something. These preferences (like refusing the heroin) are still fully valid human preferences, and you're going to need to look at human behavior to figure out what they are (barring big godlike a priori reasoning), which entails basically similar philosophical problems as getting all values from behavior without this framework.

These preferences (like refusing the heroin) are still fully valid human preferences, and you're going to need to look at human behavior to figure out what they are (barring big godlike a priori reasoning), which entails basically similar philosophical problems as getting all values from behavior without this framework.

I'm hopeful that this won't be true in a certain, limited way. That is, scanning brains and observing how neurons operate to determine the behavior of a human is a very different sort of operation from observing behavior "from the outside" the way we observe people's behavior today. Much of the difficulty is that observing behavior only with our unaided senses, and without a deep model of the brain, forces us to make very large normative assumptions to get the power needed to infer how a human values things; but if we have a model like this one and it appears to be correct, then we can, practically speaking, make "smaller", less powerful normative assumptions, because we understand and can work out the details of more of the gears of the mind.

The result is that in a certain sense we are still concerned with behavior, but because the level of detail is so much higher and the model so much richer we are less likely to find ourselves making mistakes from having taken large inferential leaps as we would if we observed behavior in the normal sense.