Written quickly after a CHAI meeting on the topic, haven't thought through it in depth.

If we write down an explicit utility function and have an AI optimize that, we expect that a superintelligent AI would end up doing something catastrophic, not because it misunderstands what humans want, but because it doesn't care -- it is trying to optimize the function that was written down. It is doing what we said instead of doing what we meant.

An approach like Inverse Reward Design instead says that we should treat the human's written-down utility function as an observation about the true reward function, and infer a distribution over true reward functions. This agent is "doing what we mean" instead of "doing what we say".
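
To make this concrete, here is a minimal sketch of IRD-style inference, assuming rewards are linear in trajectory features. The function names, the `beta` rationality parameter, and the likelihood itself are all illustrative simplifications of the actual IRD model; the key idea it tries to capture is that the designer is assumed to pick a proxy reward that produces good behavior *in the training environment* under the true reward.

```python
import numpy as np

def best_traj_features(w, candidate_traj_features):
    """Feature counts of the trajectory an agent optimizing weights w would pick."""
    returns = candidate_traj_features @ w          # (num_trajectories,)
    return candidate_traj_features[np.argmax(returns)]

def posterior_over_true_rewards(proxy_w, candidate_true_ws, train_traj_features, beta=5.0):
    """P(w_true | proxy_w) ∝ exp(beta * value, under w_true, of the behavior the
    proxy induces in the training environment), with a uniform prior over candidates."""
    proxy_traj = best_traj_features(proxy_w, train_traj_features)
    log_lik = beta * (candidate_true_ws @ proxy_traj)   # (num_candidates,)
    log_lik -= log_lik.max()                             # numerical stability
    post = np.exp(log_lik)
    return post / post.sum()
```

In a deployment environment with a feature the designer never encountered during training (the lava cells in the IRD paper's example), a posterior like this stays uncertain about how to weigh that feature, rather than confidently optimizing the literal proxy.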

This suggests a potential definition -- in a "do what we mean" system, the thing being optimized is a latent variable, whereas in a "do what we say" system, it is explicitly specified. Note that "latent" need not mean that you have a probability distribution over it; it just needs to be hidden information. For example, if I had to categorize iterated distillation and amplification, it would be as a "do what we mean" system where the thing being optimized is implicit in the human's policy and is never fully pinned down.
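
A toy illustration of the contrast, reusing the linear-feature setup from the sketch above (again, the names are purely illustrative): a "do what we say" agent scores behavior with the fixed, explicitly specified reward, while a "do what we mean" agent scores it by its expected value under a posterior over the latent true reward.

```python
def do_what_we_say_score(traj_features, proxy_w):
    # The objective is explicitly specified: just the written-down reward.
    return traj_features @ proxy_w

def do_what_we_mean_score(traj_features, candidate_true_ws, posterior):
    # The objective is a latent variable: average the candidate true rewards,
    # weighted by the posterior inferred from what the human said.
    return posterior @ (candidate_true_ws @ traj_features)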

However, this doesn't imply that we want to build a system that exclusively does what we mean. For example, with IRD, if the true reward function is not in the space of reward functions we consider (perhaps because it depends on a feature we didn't include), you can get arbitrarily bad outcomes (see the problem of fully updated deference). One idea would be to have a "do what we mean" core, which we expect will usually do good things, plus a "do what we say" subsystem that adds an extra layer of safety. For example, even if the "do what we mean" part is completely sure about the human utility function and knows we are making a mistake, the AI will still shut down if we ask it to, because of the "do what we say" part. This seems to be the idea in MIRI's version of corrigibility.
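
A rough sketch of the proposed layering (hypothetical class and method names, not any existing system's API): the "do what we say" layer sits outside the "do what we mean" planner and unconditionally honors a shutdown command, even when the planner's posterior says the command is a mistake.

```python
class ShutdownOverrideAgent:
    """A "do what we mean" planner wrapped in a "do what we say" safety layer."""

    def __init__(self, dwwm_policy):
        self.dwwm_policy = dwwm_policy  # e.g. maximizes expected reward under a posterior

    def act(self, observation, human_command=None):
        if human_command == "shut down":
            return "shut_down"                    # hard override: obey what was said
        return self.dwwm_policy(observation)      # otherwise defer to inferred intent
```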

I'd be interested to see disagreements with the definition of "do what we mean" as optimizing a latent variable. I'd also be interested to hear how "corrigibility" and "alignment" relate to these concepts, if at all. For example, it seems like MIRI's corrigibility is closer to "do what we say" while Paul's corrigibility is closer to "do what we mean".

Comments (2)

I liked this post when it came out, and I like it even more now. This also brings to mind Paul's more recent Inaccessible Information.

Thanks! I like it less now, but I suppose that's to be expected (I expect I publish posts when I'm most confident in the ideas in them).

I do think it's aged better than my other (non-public) writing at the time, so at least past-me was calibrated on which of my thoughts were good, at least according to current-me?

The main way in which my thinking differs now is that I'm less optimistic about defining things in terms of what kind of "optimizing" is happening -- it seems like such a definition would be too vague / fuzzy / filled with edge cases to be useful for AI alignment. I do think the definition could be used to construct formal models that can be analyzed (as has already been done in assistance games / CIRL and the off-switch game).

The definition is also flawed: it clearly can't be just about optimizing a latent variable, since that's true of any POMDP. What the agent ends up optimizing for depends entirely on how the latent variable connects to the agent's observations, so optimizing a latent variable is clearly not enough for "do what we mean". I think the better version is Stuart's three principles in Human Compatible (summary).