NB: Kaj recently said some similar and related things while I was on hiatus from finishing this post. I recommend reading Kaj's post for a different take on what I view as a line of thinking generated by similar insights.
One of the challenges with developing a theory of human values is dealing with the apparent non-systematic nature of human decision making, which makes it seem that human values are not consistent, coherent, or rational. One solution is to build or discover mechanisms by which they can be made legible and systematic. Another is to embrace the illegibility and inconsistency and find ways of working with it. This is a short start towards doing the latter because I believe the former cannot be made to work well enough to stand up against Goodhart effects under extreme optimization by superintelligent AGI that we want to align with human values.
I've been thinking a lot about what values are, and in particular looking for phenomena that naturally align with the category we variously call values, preferences, affinity, taste, aesthetics, or axiology. The only thing I have found that looks like a natural kind (viz. a model that cuts reality at its joints) is valence.
Valence on its own doesn't fully explain all the phenomena we want to categorize as values, especially things like meta-preferences or "idealized" values that are abstracted away from the concrete, embedded process of a human making a choice at a point in time. Instead it gives us a mechanism by which we can understand why a human makes one choice over another at some point in their causal history. And decisions are not themselves preferences, because decisions are embedded actions taken by an agent in an environment whereas preferences are, as typically considered, generators of decisions. I think we need to flip this notion of preferences as generators on its head, and in so doing we move towards becoming less confused about preferences.
So let me describe my current model of how this works, and let's see if it explains the world we find ourselves in any better than existing theories of values and preferences.
Humans are embedded agents. They carry out processes that result in them causing events. We call this process of an agent causing events acting and the process that leads to taking one action rather than any other possible action a decision. Although I believe valence describes much of how humans decide what actions to take, we need not consider that detail here and instead consider the abstraction of a decision generation process that is, importantly, inseparable from its implementation up to the limit of functional equivalence and conditioned on the causal history of the agent. Another way to say this is that the algorithm that makes the decision can be reasoned about and modeled but there is no simplification of the algorithm that produces exactly the same result in all cases unless it is a functionally equivalent algorithm and the decision is situated in time such that it cannot be separated from its embedding in the environment (which includes the entire past of the universe).
NB: I think there are a lot of interesting things to say about how the decision generation process seems to work in humans—how it comes up with the set of choices it chooses between, how it makes that choice, how it is modified, etc.—however I am going to leave off considerations of that for now so we can consider the theory at a more abstract level without getting bogged down in the implementation details of one of the gears.
Additionally, all of this is described in terms of things like agents that don't exist until they are reified into existence: prior to that reification into ontology all we have is stuff happening. Let's try not to get hung up on things like where to draw the boundary of an agent right now and treat the base concepts in this model as useful handles for bootstrapping understanding of a model that I expect can be reduced.
Preferences are then statistical regularities (probability distributions) over decisions. Importantly they come causally after decisions. Consequently preferences may predict decisions but they don't generate them. Meta-preferences are then probability distributions over preferences. Values, aesthetics, axiology, etc. are abstractions for talking about this category of probability distributions over decisions (and decisions about decisions, etc.).
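To make the direction of the arrows concrete, here is a minimal Python sketch of the claim above. Everything in it is a hypothetical illustration rather than part of the model itself: `decide` is a toy stand-in for the decision generation process (a habit-like rule that depends on the agent's own history), `preferences` is simply the empirical distribution over decisions already made, and `meta_preferences` is one crude reading of "distributions over preferences" (how often each option comes out as the most frequent choice across several decision logs).

```python
import random
from collections import Counter


def decide(options, history, rng):
    """Toy decision generation process (an assumption for illustration).

    The choice depends on the agent's causal history: each past choice
    of an option adds weight to choosing it again.
    """
    weights = [1 + history.count(option) for option in options]
    return rng.choices(options, weights=weights, k=1)[0]


def run_agent(options, steps, seed):
    """Generate a log of decisions; no preferences are consulted."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        history.append(decide(options, history, rng))
    return history


def preferences(decisions):
    """Preferences as a statistical regularity computed after the fact."""
    counts = Counter(decisions)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}


def meta_preferences(decision_logs):
    """Distribution over which option is the top preference in each log."""
    tops = [max(preferences(log).items(), key=lambda kv: kv[1])[0]
            for log in decision_logs]
    return preferences(tops)


if __name__ == "__main__":
    logs = [run_agent(["tea", "coffee"], steps=50, seed=s) for s in range(5)]
    print(preferences(logs[0]))
    print(meta_preferences(logs))
```

Note that `decide` never reads the output of `preferences`: the distribution can predict the next decision, but in this sketch it plays no part in producing it.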
Here's a pictorial representation of the model if that helps make it clearer:
This is as opposed to the standard or "old" model where preferences are the decision generators, which I'll stylize thusly, keeping in mind there's a lot of variation in how these "old" models work that I'm glossing over:
Note that in the new model preferences can still end up causally prior to decisions to the extent that they are discerned by an agent as features of their environment, but this is different from saying that preferences or meta-preferences are primary to the decision generation process. Thus when I say that preferences are causal postcedents of decisions I mean that if an agent did not know about or otherwise "have" preferences they would still make decisions by the decision generation process.
Although backwards from the standard model, this should not be too surprising: all animals manage to make decisions regardless of how aware they are of themselves or their actions, so we should expect our model of values to function in the absence of decision-generating preferences. Nonetheless, my guess is that this knowledge of preferences, especially knowledge of meta-preferences, feels like knowledge of the decision generation process from the inside. That provides an important clue for understanding how humans might come to develop fixed points in their decision generation processes, even if it really is all just valence calculations, and why humans have grasped onto the idea that preferences are a good model for the decision generation process.
You might object that I've just rearranged the terms or that this is just a more detailed model of revealed preferences, and to some extent those things are true, but I also think I've done it in a way that pulls apart concepts that were previously confounded such that we get something more useful for addressing AI alignment, which we'll explore in more detail now.
When we think of preferences as the generators of decisions, we run into all sorts of confusions. For example, if we equate preferences with revealed preferences people object that their revealed preferences leave something out about the process that generated their behavior and that generalizing from their observed behavior might not work as they would expect it to when applied to novel situations. This appears to be a general problem with most attempts at having computers learn human values today: they conflate behavior with the generators of behavior, find the generators only by making normative assumptions, and then end up with something that almost but doesn't quite match the generator.
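A toy Python sketch of this failure mode, under loudly stated assumptions: the generator `decide` and its 2 km threshold are invented for illustration, and `fit_revealed_preference` is a deliberately naive summary of observed behavior, not any particular value learning method.

```python
from collections import Counter


def decide(distance_km):
    """Hypothetical generator of behavior: walk short trips, drive long ones."""
    return "walk" if distance_km < 2 else "drive"


def fit_revealed_preference(observed_decisions):
    """Naive revealed preference: the most frequently observed choice."""
    return Counter(observed_decisions).most_common(1)[0][0]


# We only ever observe the agent on short trips.
observed_distances = [0.3, 0.8, 1.2, 1.5]
observed = [decide(d) for d in observed_distances]
revealed = fit_revealed_preference(observed)

# A novel situation outside the observed range.
novel_distance = 10.0
predicted = revealed               # generalizing from observed behavior
actual = decide(novel_distance)    # what the generator actually does

print(f"revealed preference: {revealed}")
print(f"predicted: {predicted}, actual: {actual}")
```

The summary fits every observation perfectly and still mispredicts the novel case, because it describes the pattern of past decisions rather than the process that produced them.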
But if we don't pay attention to revealed preferences we are also misled about people's preferences, since, for example, what people claim to be their preferences (their stated preferences) also doesn't seem to do a very good job of predicting their behavior. Maybe that's because people incorrectly assume their partial preferences are "full" preferences in a total order, which may be related to scope insensitivity; maybe it's because people are deceiving themselves about their preferences for various reasons. Whatever the reason, stated preferences and revealed preferences both result in models with errors more than large enough for them to fall apart under superintelligent optimization.
Another problem, much commented upon by me at least, with treating preferences as generators of decisions is that this places the descriptive strength of preferences at odds with the normative demands we would like to place on preferences. For example, there's a lot to recommend rational preferences and preferences that can be described by a utility function, so people have put a lot of work into trying to find ways that these might also explain observed human behavior, even if it's to consider human behavior degraded from an ideal that it might approach if only we thought longer, knew more, etc. But if we can create some space in our models between the process of making decisions and the pattern of decisions made, this would ease much of that tension between our models' abilities to explain reality and to serve our purposes.
Perhaps the solution lies at some synthesis of stated and revealed preferences, but that looks to me like trying to patch a broken system or put lipstick on a pig, and at the end of the day such a model may work a little better by papering over the faults of the two submodels but will also be a kludge of epicycles that will crack if a comet comes screaming through. Alternatively we could look for some other method of identifying preferences, like brain scans, but at this point I think we are just arguing terminology. I could probably be convinced that calling the decision generation process "preferences" has some strong value, but from where I stand now it seems to cause more confusion than it resolves, so I'd rather see preferences treated solely as causally after decisions and talk some other way about whatever is causally before.
What are the consequences of understanding preferences as causally downstream of actions rather than causally upstream of them? And does it make any difference since we still have something—the decision generation process—doing the work that we previously asked preferences, perhaps or perhaps not modeled with a utility function, to do? In other words, how does this model help us?
One of the big things it does is clear up confused thinking from getting the causal relationship between decision generation and preferences backwards. Rather than trying ever harder to find a theory that serves the two masters of accurately describing human behavior and obeying mathematical criteria that make our models behave in useful ways, we can let them operate independently. Yes, we may still want to, for example, modify human behavior to match norms, such as by increasing the rationality of human preferences, but also understand that the change doesn't come from changing preferences directly, but from changing decision generation processes such that, as a consequence, preferences are changed. And we may still want to design machines aligned with human values, but understand that aligning a machine with human preferences is not the same thing as aligning a machine with human decision generation processes since only the latter stands to capture all that humans value.
Another advantage of this model is that it is more explicitly embedded in the world. Preferences are intentionally an abstraction away from many of the messy details of how decisions are made, but as a result they lose some of their grip on reality. Said another way, preferences are a leaky abstraction, and while they may be adequate for addressing questions in microeconomics, they seem inadequate for helping us build aligned AI. There is no leakless abstraction, but by realizing that preferences are higher up the abstraction stack and thus more leaky we can realize the need to go down the stack and get nearer the territory to find a model with more gears that better captures what matters, maybe even up to a limit where superintelligent optimization is no longer a threat but an opportunity.
In short I think the main thing this new model does is free us from the constrictions of trying to make the preference model work with humans and accounts for the embeddedness of humans. It still doesn't say enough about how decisions are generated, but it gives us a better shaped model into which an abstraction of the implementation details can be slotted than the old model provided.
I feel like what I have described in this post is only one aspect of the model that is slowly coalescing in my mind, and it is able to crystalize into something communicable only by having germs to form around provided by interacting with others. So, what have I missed, or what would you like to know that would test this theory/reframing? What, if it were true, would invalidate it? I'd love to know!
When you say the human decision procedure causes human values, what I hear is that the human decision procedure (and its surrounding way of describing the world) is more ontologically basic than human values (and their surrounding way of describing the world).
Our decision procedure is "the reason for our values" in the same way that the motion of electric charge in your computer is the reason it plays videogames (even though "the electric charge is moving" and "it's playing a game" might be describing the same physical event). The arrow between them isn't the most typical causal arrow between two peers in a singular way of describing the world, it's an arrow of reduction/emergence, between things at different levels of abstraction.
I think I basically agree with this and think it's right. In some ways you might say focusing too much on "values" acts like a barrier to deeper investigation of the mechanisms at work here, and I think looking deeper is necessary because I expect that optimization against the value abstraction layer alone will result in Goodharting.
It looks like the idea of human values is very contradictory. Maybe we should dissolve it? What about "AI safety" without human values?
In some sense that's a direction I might be moving in with my thinking, but there is still something that humans identify as values that they care about, so I expect there to be some real phenomenon going on that needs to be considered to get good outcomes, since I expect the default remains a bad outcome if we don't pay attention to whatever it is that makes humans care about stuff. I expect most work today on value learning is not going to get us where we want to go because it's working with the wrong abstractions, and my goal in this work is to dissolve those abstractions to find better ones for our long-term purposes.