[Epistemic status: ¯\_(ツ)_/¯ ]
Armstrong and Mindermann write about a no free lunch theorem for inverse reinforcement learning (IRL): the same action can reflect many different combinations of values and (irrational) planning algorithms.
I think that even if humans were fully rational expected utility maximizers, there would still be an important underdetermination problem with IRL and with all other approaches that infer human preferences from actual behavior. This is probably obvious if and only if it's correct, and I don't know if any non-straw people disagree, but I'll expand on it anyway.
Consider two rational expected utility maximizing humans, Alice and Bob.
Alice is, herself, a value learner. She wants to maximize her true utility function, but she doesn't know what it is, so in practice she uses a probability distribution over several possible utility functions to decide how to act.
If Alice received further information (from a moral philosopher, maybe), she'd start maximizing a specific one of those utility functions instead. But we'll assume that her information stays the same while her utility function is being inferred, and she's not doing anything to get more; perhaps she's not in a position to.
Bob, on the other hand, isn't a value learner. He knows what his utility function is: it's a weighted sum of the same several utility functions. The relative weights in this mix happen to be identical to Alice's relative probabilities.
Alice and Bob will act the same. They'll maximize the same linear combination of utility functions, for different reasons. But if you could find out more than Alice knows about her true utility function, then you'd act differently if you wanted to truly help Alice than if you wanted to truly help Bob.
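To make the Alice/Bob point concrete, here is a minimal toy sketch. Everything in it is a made-up assumption for illustration: the three actions, the two candidate utility functions, and the 50/50 weights are not from any real model. It only shows that the two agents pick the same action, while a helper who learns Alice's true utility function would act differently for her than for Bob.

```python
# Toy illustration only: the actions, utilities, and weights are hypothetical.
ACTIONS = ["a", "b", "c"]

# Two candidate utility functions over the same actions.
CANDIDATES = [
    {"a": 1.0, "b": 0.0, "c": 0.6},  # U1
    {"a": 0.0, "b": 1.0, "c": 0.6},  # U2
]

# Alice reads these as probabilities that each candidate is her true utility.
# Bob reads the same numbers as fixed weights in his known utility function.
WEIGHTS = [0.5, 0.5]


def mixture_value(action):
    """Weighted sum of the candidate utilities for one action."""
    return sum(w * u[action] for w, u in zip(WEIGHTS, CANDIDATES))


def alice_act():
    # Alice maximizes expected utility under her moral uncertainty.
    return max(ACTIONS, key=mixture_value)


def bob_act():
    # Bob maximizes his known utility, which is the same weighted sum.
    return max(ACTIONS, key=mixture_value)


def help_alice(true_index):
    # A helper who learns which candidate is Alice's true utility
    # should maximize that candidate directly.
    return max(ACTIONS, key=lambda action: CANDIDATES[true_index][action])


def help_bob():
    # Nothing the helper learns changes Bob's utility: it is still the mix.
    return max(ACTIONS, key=mixture_value)


print(alice_act(), bob_act())     # identical observed behavior
print(help_alice(0), help_bob())  # different ways of "truly helping"
```

In this toy setup the hedged action "c" wins for both agents, but a helper who finds out that U1 is Alice's true utility function would pick "a" for her and still pick "c" for Bob.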
So in some cases, it's not enough to look at how humans behave. Humans are Alice on some points and Bob on some points. Figuring out the details will require explicitly addressing human moral uncertainty.
"If Alice received further information (from a moral philosopher, maybe), she'd start maximizing a specific one of those utility functions instead."
This is the key fact about Alice's behavior, which distinguishes it from Bob's behavior, so the question is whether an AI can learn that fact.
Of course the AI could if it ever observed Alice in a situation where she learned anything about morality.
Or any case that has any mutual information with how Alice would respond to moral facts. (For a sufficiently smart reasoner that includes everything; e.g. watching Alice eat breakfast gives you lots of general information about her brain, which in turn lets you make better predictions about how she would behave in other cases.)
And of course the AI would tend to create situations where Alice learned moral facts, since that's a very natural response to uncertainty about how she'd respond to moral facts.
So overall it seems like you'd have to restrict the behavior of the IRL agent quite far before this becomes a problem.
I think that this might not end up being a problem if the value learning agent can communicate with Alice (e.g. in the context of CIRL). If they don't get any info from moral philosophers, then they should probably maximise something like the expectation of her utility function, for the same reason that Alice does. If they do get info, they can just give Alice that info, see what she does, and act accordingly. I think the real problem arises in the realistic case where Alice isn't handling moral uncertainty perfectly, so the value learning agent shouldn't actually maximise the weighted sum of the utility functions she's uncertain over.
Huh, not sure why I didn't say this when I first read this post, but there is a difference between Alice and Bob: Alice will seek out information about her utility function, while Bob will not.
Certainly any value learning method will have to account for the fact that humans do not in fact know their own values, but it's not the case that such behavior is indistinguishable from behavior that maximizes a utility function.
I meant to assume that away:
"But we'll assume that her information stays the same while her utility function is being inferred, and she's not doing anything to get more; perhaps she's not in a position to."
In cases where you're not in a position to get more information about your utility function (e.g. because the humans you're interacting with don't know the answer), your behavior won't depend on whether or not you think it would be useful to have more information about your utility function, so someone observing your behavior can't infer the latter from the former.
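A rough way to see this is to compare the value of information for the two agents in a toy setup. The numbers below are hypothetical and chosen only for illustration. Alice would gain from learning which candidate utility function is her true one and Bob would gain nothing, but if no information-gathering action is available, that difference never shows up in what either of them does.

```python
# Toy illustration only: the actions, utilities, and weights are hypothetical.
ACTIONS = ["a", "b", "c"]
CANDIDATES = [
    {"a": 1.0, "b": 0.0, "c": 0.6},  # U1
    {"a": 0.0, "b": 1.0, "c": 0.6},  # U2
]
WEIGHTS = [0.5, 0.5]  # Alice's probabilities, Bob's fixed weights


def mixture_value(action):
    """Weighted sum of the candidate utilities for one action."""
    return sum(w * u[action] for w, u in zip(WEIGHTS, CANDIDATES))


def value_without_info():
    # Best achievable value when acting on the mixture alone.
    # This is the same number for Alice and for Bob.
    return max(mixture_value(action) for action in ACTIONS)


def alice_value_with_info():
    # If Alice learned her true utility function before acting, she would
    # pick the best action for it; average that over her current beliefs.
    return sum(
        w * max(u[action] for action in ACTIONS)
        for w, u in zip(WEIGHTS, CANDIDATES)
    )


def bob_value_with_info():
    # Bob already knows his utility function (the mixture itself),
    # so there is nothing further for him to learn about it.
    return value_without_info()


alice_voi = alice_value_with_info() - value_without_info()
bob_voi = bob_value_with_info() - value_without_info()

print("Alice's value of information:", alice_voi)
print("Bob's value of information:", bob_voi)
```

Note that ACTIONS contains no "ask a moral philosopher" option, so both agents simply choose the mixture-maximizing action, and an observer of that choice alone cannot recover the difference in value of information.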
Maybe practical cases aren't like this, but it seems to me like they'd only have to be like this with respect to at least one aspect of the utility function for it to be a problem.
Paul above seems to think it would be possible to reason from actual behavior to counterfactual behavior anyway, I guess because he's thinking in terms of modeling the agent as a physical system and not just as an agent. I'm confused about that, though, so I haven't responded, and I don't claim he's wrong.
Oh yeah, I agree with Paul's comment and it's saying the same thing as what I'm saying. Didn't see it because I was reading on the Alignment Forum instead of LessWrong. I've moved that comment to the Alignment Forum now.