As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day (more or less - I'm getting back on track!) for 25 days. Or until I run out of hot takes.
When considering AI alignment, you might be tempted to talk about "the human's utility function," or "the correct utility function." Resist the temptation when at all practical. That abstraction is junk food for alignment research.
As you may already know, humans are made of atoms. Collections of atoms don't have utility functions glued to them a priori - instead, we assign preferences to humans (including ourselves!) when we model the world, because it's a convenient abstraction. But because there are multiple ways to model the world, there are multiple ways to assign these preferences; there's no "the correct utility function."
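To make the "multiple ways to assign these preferences" point concrete, here is a minimal toy sketch. It is not from any real system: the names `maximizer`, `minimizer`, `likes_tea` and `hates_tea` are hypothetical, and the setup (a human choosing drinks from menus) is an assumption made purely for illustration. It models the same observed choices two ways, as a utility maximizer with one utility function and as a utility minimizer with the negated one, and checks that the two models agree on every menu.

```python
from itertools import combinations

# Hypothetical toy world: a human picks one drink from whatever menu is offered.
OPTIONS = ["tea", "coffee", "water"]


def maximizer(utility, menu):
    """Model A's decision procedure: pick the highest-utility option."""
    return max(menu, key=lambda option: utility[option])


def minimizer(utility, menu):
    """Model B's decision procedure: pick the lowest-utility option."""
    return min(menu, key=lambda option: utility[option])


# Two incompatible preference assignments (illustrative numbers only).
likes_tea = {"tea": 1.0, "coffee": 0.5, "water": 0.0}
hates_tea = {option: -value for option, value in likes_tea.items()}


def all_menus(options):
    """Yield every non-empty subset of the options as a list."""
    for size in range(1, len(options) + 1):
        for menu in combinations(options, size):
            yield list(menu)


# Predicted behavior under each model, across every possible menu.
choices_a = [maximizer(likes_tea, menu) for menu in all_menus(OPTIONS)]
choices_b = [minimizer(hates_tea, menu) for menu in all_menus(OPTIONS)]

# The two models are behaviorally indistinguishable here.
print(choices_a == choices_b)
```

Both models predict identical behavior, so the observations alone don't pick out which utility function is "the" human's. Which assignment you end up with depends on the modeling assumptions you bring, and this toy case only shows the most extreme version of that.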
Maybe you understand all that, and still talk about an AI "learning the human's utility function" sometimes. I get it. It makes things way easier to assume there's some correct utility function when analyzing the human-AI system. Maybe you're writing about inner alignment and want to show that some learning procedure is flawed because it wouldn't learn the correct utility function even if humans had one. Or that some learning procedure would learn that correct utility function. It might seem like this utility function thing is a handy simplifying assumption, and once you have the core argument you can generalize it to the real world with a little more work.
That seeming is false. You have likely just shot yourself in the foot.
Because the human-atoms don't have a utility function glued to them, building aligned AI has to do something that's actually materially different than learning "the human's utility function." Something that's more like learning a trustworthy process. If you're not tracking the difference and you're using "the human's utility function" as a target of convenience, you can all too easily end up with AI designs that aren't trying to solve the problems we're actually faced with in reality - instead they're navigating their own strange, quasi-moral-realist problems.
Another way of framing that last thought might be that wrapper-minds are atypical. They're not something that you actually get in reality when trying to learn human values from observations in a sensible way, and they have alignment difficulties that are idiosyncratic to them (though I don't endorse the extent to which nostalgebraist takes this).
What to do instead? When you want to talk about getting human values into an AI, try to contextualize discussion of the human values with the process the AI is using to infer them. Take the AI's perspective, maybe - it has a hard and interesting job trying to model the world in all its complexity, if you don't short-circuit that job by insisting that actually it should just be trying to learn one thing (that doesn't exist). Take the humans' perspective, maybe - what options do they have to communicate what they want to the AI, and how can they gain trust in the AI's process?
Of course, maybe you'll try to consider the AI's value-inference process, and find that its details make no difference whatsoever to the point you were trying to make. But in that case, the abstraction of "the human's utility function" probably wasn't doing any work anyhow. Either way, you lose nothing by dropping it.
A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.
I think this is still wrong, because of the Goodhart scope of AGIs and the corrigibility of humans. An agent's Goodhart scope is the space of situations where it has good proxies for its preference. An agent whose decisions are governed by a utility function can act in arbitrary situations; it always has good proxies for its utility function, and logical uncertainty doesn't put practical constraints on its behavior. But for an aligned AGI that seems unlikely: CEV seems complicated and the possible configurations of matter superabundant, so there are always intractable possibilities outside the current Goodhart scope. So it can at best be said to have a utility function over its Goodhart scope, not over all physically available possibilities. Thus the only utility function it could have is itself a proxy for some preference that's not in practice a utility function, because the agent can never actually make decisions according to a global utility function. Conversely, any AGI that acts according to a global utility function is not aligned, because its preference is way too simple.
Corrigibility is in part the modification of an agent's preference based on what happens in its environment. The abstraction of an agent usually puts its preference firmly inside its boundaries, so that we can consider the same agent, with the same preference, placed in an arbitrary environment. But a corrigible agent is not like that: its preference depends on the environment, and in the limit it's determined by the environment, not just by the agent. The environment doesn't just present the situations for an agent to choose from, it also influences the way the agent makes its decisions. So it becomes impossible to move a corrigible agent to a different environment while preserving its preference, unless we package its whole original environment as part of the agent that's being moved.
Humans are not at all classical agent abstractions that carry the entirety of their preference inside their heads. They are eminently corrigible: their preference depends on the environment. As a result, an aligned AGI must be corrigible not just temporarily, because it needs to pay attention to humans to grow up correctly, but permanently, because its preference must also continually incorporate the environment in order to remain the same kind of thing as human preference. Thus, even putting aside the logical uncertainty that keeps an AGI's Goodhart scope relatively small, an aligned AGI can't have a utility function because of observational/indexical uncertainty: it doesn't know everything in the world (including the future), and so doesn't have the data that defines its aligned preference.