Deconfusing Human Values Research Agenda v1

by G Gordon Worley III · 3 min read · 23rd Mar 2020 · 10 comments


On Friday I attended the 2020 Foresight AGI Strategy Meeting. Eventually a report will come out summarizing some of what was talked about, but for now I want to focus on what I talked about in my session on deconfusing human values. For that session I wrote up some notes summarizing what I've been working on and thinking about. None of it is new, but it is newly condensed in one place and in convenient list form, and it provides a decent summary of the current state of my research agenda for building beneficial superintelligent AI; a version 1 of my agenda, if you will. Thus, I hope this will make it a bit clearer what I'm working on, why I'm working on it, and what direction my thinking is moving in. As always, if you're interested in collaborating on things, whether that be discussing ideas or something more, please reach out.

Problem overview

  • I think we're confused about what we really mean when we talk about human values.
  • This is a problem because:
  • What are values?
    • We don't have an agreed upon precise definition, but loosely it's "stuff people care about".
      • When I talk about "values" I mean the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology.
    • Importantly, what people care about is used to make decisions, and this has had implications for existing approaches to understanding values.
  • Much research on values tries to understand the content of human values or why humans value what they value, but not what the structure of human values is such that we could use it to model arbitrary values. This research unfortunately does not appear very useful to this project.
  • The best attempts we have right now are based on the theory of preferences.
    • In this model a preference is a statement located within an order (weak, partial, total, etc.), often written like A > B > C to mean A is preferred to B, which is preferred to C.
    • Problems:
      • Goodhart effects are robust, and preferences in formal models are proxy measures, not the thing we care about itself
      • Stated vs. revealed preferences: we generally favor revealed preferences, but this approach has its own problems
      • General vs. specific preferences: do we look for context-independent preferences ("essential" values) or context-dependent preferences?
        • generalized preferences, e.g. "I like cake better than cookies", can lead to irrational preferences (e.g. non-transitive preferences)
        • contextualized preferences, e.g. "I like cake better than cookies at this precise moment", limit our ability to reason about what someone would prefer in new situations
    • See Stuart Armstrong's work for an attempt to address these issues so we can turn preferences into utility functions.
  • Preference based models look to me to be trying to specify human values at the wrong level of abstraction. But what would the right level of abstraction be?
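To make the intransitivity worry above concrete, here's a toy sketch (function name and examples are mine, purely illustrative): it brute-force checks whether a set of pairwise preferences can be extended to a consistent total order, i.e. whether the agent's generalized preferences are free of cycles.

```python
from itertools import permutations

def is_transitive(prefs):
    """Check whether pairwise preferences (a, b), meaning 'a is preferred
    to b', can be extended to a consistent total order (no cycles)."""
    items = {x for pair in prefs for x in pair}
    for order in permutations(items):
        rank = {x: i for i, x in enumerate(order)}
        # An order is consistent if every preferred item ranks higher.
        if all(rank[a] < rank[b] for a, b in prefs):
            return True
    return False

# Acyclic preferences admit a consistent ordering:
print(is_transitive({("cake", "cookies"), ("cookies", "pie")}))  # True
# A cycle ("money pump") does not:
print(is_transitive({("cake", "cookies"), ("cookies", "pie"), ("pie", "cake")}))  # False
```

An agent holding the cyclic preference set can be exploited by repeatedly trading it "up" the cycle, which is the standard argument for why such preferences are irrational.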

Solution overview

  • What follows is a summary of what I so far think moves us closer to less confusion about human values. I hope to come away thinking some of this is wrong or insufficient by the end of the discussion!
  • Assumptions:
    • Humans are embedded agents.
    • Agents have fuzzy but definable boundaries.
      • Everything in every moment causes everything in every next moment (up to the limit of the speed of light), but we can find clusters of stuff that interact with themselves in "aligned" ways, such that it makes sense to model the stuff in a cluster as an agent separate from the stuff outside it.
  • Basic model:
    • Humans (and other agents) cause events. We call this acting.
    • The process that leads to taking one action rather than another possible action is deciding.
    • Decisions are made by some decision generation process.
    • Values are the inputs to the decision generation process that determine its decisions and hence actions.
    • Preferences and meta-preferences are statistical regularities we can observe over the actions of an agent.
  • Important differences from preference models:
    • Preferences are causally after, not causally before, decisions, contrary to the standard preference model.
      • This is not 100% true. Preferences can be observed by self-aware agents, like humans, and influence the decision generation process.
  • So then what are values, these inputs to the decision generation process?
    • My best guess: valence
    • This leaves us with new problems. Now rather than trying to infer preferences from observations of behavior, we need to understand the decision generation process and valence in humans, i.e. this is now a neuroscience problem.
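The basic model above can be sketched in code (a toy illustration with made-up valence numbers, not a claim about the actual decision generation process): hidden valence weights drive a noisy decision process, and "preferences" are only read off afterwards as statistical regularities over the resulting actions.

```python
import random
from collections import Counter

# Hypothetical hidden values: a valence weight per option.
values = {"cake": 2.0, "cookies": 1.0, "pie": 0.5}

def decide(options, values, noise=0.5, rng=random):
    """Decision generation process: pick the option whose valence plus
    Gaussian noise is highest. Values sit causally before the decision."""
    return max(options, key=lambda o: values[o] + rng.gauss(0, noise))

random.seed(0)
actions = [decide(["cake", "cookies", "pie"], values) for _ in range(1000)]
observed = Counter(actions)

# "Preferences" are read off the action statistics, causally after deciding:
preference_order = [option for option, _ in observed.most_common()]
print(preference_order)
```

Note the direction of inference: an outside observer only ever sees `actions`, and recovering `values` from them is underdetermined, which is the point of treating preferences as downstream of the decision process rather than as its specification.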


  • Underdetermination due to noise: many models are consistent with the same data.
    • This makes it easy for us to get confused, even when we're trying to deconfuse ourselves.
    • This makes it hard to know if our model is right, since we're often in the situation of explaining rather than predicting.
  • Is this a descriptive or causal model?
    • Both: descriptive of what we see, but trying to find the causal mechanism of what we reify as "values" at the human level in terms of "gears" at the neuron level.
  • What is valence?
  • Complexities of going from neurons to human-level notions of values.
    • There are many layers of different systems interacting on the way from neurons to values, and we don't understand enough about almost any of them, or even know for sure what systems are in the causal chain.
  • Valence in human-computer interaction research.
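The underdetermination point is easy to demonstrate in a toy setting (all numbers hypothetical): two quite different valence assignments can produce exactly the same choice behavior on every pairwise menu, so choice data alone cannot distinguish them.

```python
from itertools import combinations

options = ["cake", "cookies", "pie"]

# Two very different hypothetical "value" models...
model_a = {"cake": 3.0, "cookies": 2.0, "pie": 1.0}
model_b = {"cake": 100.0, "cookies": 0.1, "pie": 0.0}

def choices(values):
    """Deterministic choice on every two-option menu."""
    return {frozenset(pair): max(pair, key=values.get)
            for pair in combinations(options, 2)}

# ...that are indistinguishable from choice behavior alone:
print(choices(model_a) == choices(model_b))  # True
```

Only the ordinal structure survives into behavior here; the cardinal structure (how much more cake is valued) is exactly the kind of "further down the stack" information that would require other sorts of evidence to recover.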


Thanks to Dan Elton, De Kai, Sai Joseph, and several other anonymous participants of the session for their attention, comments, questions, and insights.