All of Leon Lang's Comments + Replies

Thanks a lot for these pointers!

Neither of your interpretations is what I was trying to say. It seems like I expressed myself not well enough.

What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also everyone else), is a priori impossible since no physically realizable reward function that is defined solely based on observations rewards only actions that would be chosen by a competent, well-motivated AI. It always also rewards actions that lead to corrupted observations that are consistent with the actions of a benevolent AI. These rewarded action... (read more)

5Rohin Shah5mo
Oh, I see. I'm not interested in "solving outer alignment" if that means "creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations" (because as you point out it seems far too stringent a requirement). You could look at ascription universality [] and ELK []. The general mindset is roughly "ensure your reward signal captures everything that the agent knows"; I think the mindset is well captured in mundane solutions to exotic problems [].

To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.


I assume you would agree with the following rephrasing of your last sentence:

The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI. 

If so, I would ... (read more)

3Rohin Shah5mo
I'm confused about what you're trying to say in this comment. Are you saying "good feedback as defined here does not solve alignment"? If so, I agree, that's the entire point of goal misgeneralization (see also footnote 1). Perhaps you are saying that in some situations a competent, well-motivated AI would choose some action it thinks is good, but is actually bad, because e.g. its observations were faked in order to trick it? If so, I agree, and I see that as a feature of the definition, not a bug (and I'm not sure why you think it is a bug).

Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to argue stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text.