AI ALIGNMENT FORUM

Humans consulting HCH
Personal Blog


HCH as a measure of manipulation

by orthonormal
11th Mar 2017
1 min read

6 comments, sorted by top scoring
Ryan Carey · 8y

I can think of two problems:

  1. Let's generously suppose that q is some fixed distribution of questions that we want the AI system to ask humans. Some manipulative action may only change the answers on q by a little bit but may yet change the consequences of acting on those responses by a lot.
  2. Consider an AI system that optimizes a utility function that includes this kind of term for regularizing against manipulation. The actions that best fulfill this utility function may be ones that manipulate humans a lot (repurposing their resources for some other function) and coerce them into answering questions in a "natural" way. I.e., maybe impact is more like distance traveled (a path integral) than displacement.
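The displacement-versus-path-integral distinction in point 2 can be made concrete with a toy numeric sketch (hypothetical, not from the original comment): a trajectory that pushes the human's state far away and then coerces it back to "natural" has zero net displacement but a large path integral.

```python
# Toy illustration (hypothetical): a state trajectory that ends where it
# started looks impact-free to a displacement-style penalty, while a
# path-integral measure still registers the intermediate manipulation.
def displacement(traj):
    # Net change between final and initial state.
    return abs(traj[-1] - traj[0])

def path_length(traj):
    # Total distance traveled along the trajectory.
    return sum(abs(b - a) for a, b in zip(traj, traj[1:]))

# Manipulate the human a lot, then return them to the starting state.
trajectory = [0, 5, 9, 4, 0]

print(displacement(trajectory))  # 0: displacement penalty sees no impact
print(path_length(trajectory))   # 18: path-integral measure catches it
```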
orthonormal · 8y

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.

orthonormal · 8y

Re #1, an obvious set of questions to include in q are questions of approval for various aspects of the AI's policy. (In particular, if we want the AI to later calculate a human's HCH and ask it for guidance, then we would like to be sure that HCH's answer to that question is not manipulated.)

orthonormal · 8y

There's the additional objection of "if you're doing this, why not just have the AI ask HCH what to do?"

Overall, I'm hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human's HCH via certain informational content, than for the AI to reliably calculate the human's HCH. But I don't have strong arguments for this intuition.

Jessica Taylor · 8y

"Having a well-calibrated estimate of HCH" is the condition you want, not "being able to reliably calculate HCH".

orthonormal · 8y

I should have said "reliably estimate HCH"; I'd also want quite a lot of precision in addition to calibration before I trust it.


A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi:

We'd like to have a straightforward way to define "manipulation", so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that takes manipulation as a particularly important impact.

We could initially define manipulation in terms of a human's expected actions, or more robustly, in terms of effects on a human's policy distribution across a wide array of plausible environments. However, we'd like to have our AI still be able to tell us information (in a non-manipulative manner) instead of hiding from us in an effort to avoid all influence!
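The policy-distribution idea can be sketched as follows (a minimal sketch; `policy` is an assumed oracle returning the human's action distribution, not anything defined in the post): score an AI action by the average distance between the human's policy with and without that action, across a set of plausible environments.

```python
# Hypothetical sketch: manipulation as the average total-variation
# distance between the human's action distribution with and without the
# AI's action, over plausible environments. `policy(env, ai_action)` is
# an assumed oracle returning a dict {human_action: probability}.
def total_variation(p, q):
    actions = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)

def manipulation_score(policy, ai_action, environments):
    return sum(
        total_variation(policy(env, ai_action), policy(env, None))
        for env in environments
    ) / len(environments)
```

The problem the next paragraph raises shows up immediately here: this score also penalizes honestly informing the human, since true information shifts their policy too.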

The title of course spoils the next idea: if the AI can reason about some suitable model of HCH, then we can define the notion of "action a has very low influence on a human, as compared to the null action, apart from conveying information x": that over a distribution of questions q,

HCH(q) | a ≈ HCH(x, q) | null

where HCH is defined relative to that human; we're conditioning the distribution on whether the AI takes action a or the null action; and x,q is the input consisting of statement x followed by question q.
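This comparison can be sketched in code, under the assumption that the AI has some callable model `hch(question, conditioned_on)` returning a distribution over answers (every name here is hypothetical; the post does not specify an interface):

```python
# Hypothetical sketch of the check HCH(q)|a ≈ HCH(x, q)|null.
# `hch(question, conditioned_on)` is an assumed oracle returning a
# dict {answer: probability} for the human's HCH under that condition.
def low_influence_apart_from(hch, action, info_x, questions, tol=0.05):
    """True if `action` changes HCH's answers no more than conveying
    `info_x` as a text input would, up to `tol` in total variation."""
    def tv(p, q):
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    worst = max(
        tv(hch(q, conditioned_on=action),          # HCH(q) | a
           hch((info_x, q), conditioned_on=None))  # HCH(x, q) | null
        for q in questions
    )
    return worst <= tol
```

Here `tol` stands in for whatever notion of ≈ we settle on, and `questions` for samples from the distribution q; the hard part, of course, is obtaining a trustworthy model of HCH to query at all.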

This of course does not exclude the use of manipulative statements x, but it at least could allow us to reduce forms of manipulation to those that would happen with the text input to HCH.

I'd prefer to have the AI reason about HCH rather than just (e.g.) the human's actions in a one-hour simulation, because HCH can in principle capture a human's long-term and extrapolated preferences, and these are the ones I most want to ensure don't get manipulated.

Is there an obvious failure of this approach, an obvious improvement to it, or something simpler that it reduces to?

Mentioned in
Comparing AI Alignment Approaches to Minimize False Positive Risk