Let's generously suppose that $q$ is some fixed distribution of questions that we want the AI system to ask humans. Some manipulative action may only change the answers on $q$ by a little bit but may yet change the consequences of acting on those responses by a lot.
Consider an AI system that optimizes a utility function that includes this kind of term for regularizing against manipulation. The actions that best fulfill this utility function may be ones that manipulate humans a lot (and repurposes their resources for some other function) and coerces them into answering questions in a "natural way". i.e. maybe impact is more like distance traveled (i.e. a path integral) than displacement.

Reply

[-]orthonormal9y00

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.

Reply

[-]orthonormal9y00

Re #1, an obvious set of questions to include in $q$ are questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)

Reply

[-]orthonormal9y00

There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"

Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.

Reply