Abstract model of human bias

Vanessa Kosoy

Stuart Armstrong


I think it might be possible to get somewhere with a model of this type if we formalize the idea that manipulation requires considerable optimization power. For example, we can assume that a random description has a low probability of being manipulative. Or, consider the following stronger assumption: for any algorithm that takes one description as input and produces another description of the same choice as output, if the computing resources used by the algorithm are sufficiently few, then for most inputs it will not produce a manipulative output.
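One way these two assumptions might be written down (the notation here is entirely mine, not from the comment): let $T$ range over programs mapping a description to a logically equivalent description of the same choice, with $\mathrm{time}(T)$ its computing resources.

```latex
% Random descriptions are rarely manipulative:
\Pr_{D \sim \mathrm{Uniform}(D_{A,B})}\left[\, D \text{ is manipulative} \,\right] \le \epsilon

% Resource-bounded rewriting rarely produces manipulation:
\mathrm{time}(T) \le t \;\Longrightarrow\;
\Pr_{D \sim \mathrm{Uniform}(D_{A,B})}\left[\, T(D) \text{ is manipulative} \,\right] \le \epsilon(t)
```

with $\epsilon$ and $\epsilon(t)$ small for small resource bounds $t$.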

Those are some of the lines I was thinking along. But it's not clear if the peak of the distribution is close to accuracy, human bias and poor understanding being what they are.

I agree that even without manipulation, human reasoning is wildly inaccurate. But perhaps we can use a model where human reasoning *asymptotically converges* to something accurate *unless* subjected to some sort of "destructive manipulation" which is unlikely to happen by chance.

The following is one (simplistic) model which might be a useful starting point.

Consider a human and a robot playing a stochastic game like in CIRL. Suppose that each of them is an oracle machine plugged into a reflective oracle, like in the recent paper of Jan, Jessica and Benya. Let the robot have the following prior over the program implemented by the human: the human implements a random program (i.e. a random string of bits for some prefix-free universal oracle machine), conditional on this program being asymptotically optimal in mean for the class of all robot policies that avoid producing some set of "manipulative action sequences". Here, the "manipulative sequences" can be any set $M$ of action sequences s.t. $\frac{\#\{a \in M : |a| = n\}}{d^n} \le \epsilon$ for every $n$, where $n$ is the length of the action sequence $a$, $d$ is the number of possible actions, and $\epsilon$ is a parameter on which the prior depends.
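As a toy illustration of the sparsity constraint on the manipulative set (the symbols and function names here are my own labels for the quantities in the comment), a small check that a candidate set of "manipulative" action sequences covers at most an epsilon fraction of the sequences at every length:

```python
from collections import Counter

def is_sparse(manipulative_seqs, num_actions, eps):
    """Check that, at every sequence length n, the candidate set of
    manipulative action sequences covers at most an eps fraction of
    the num_actions**n possible sequences of that length."""
    by_length = Counter(len(s) for s in manipulative_seqs)
    return all(count / num_actions**n <= eps
               for n, count in by_length.items())

# Toy example: 2 possible actions, sequences over {0, 1}.
bad = [(0, 0, 0), (1, 1, 1)]                    # 2 of the 8 length-3 sequences
print(is_sparse(bad, num_actions=2, eps=0.25))  # True: 2/8 <= 0.25
print(is_sparse(bad, num_actions=2, eps=0.10))  # False: 2/8 > 0.10
```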

A putative new idea for AI control; index here.

Any suggestions for refining this model are welcome!

Somewhat inspired by the previous post, this is a model of human bias that can be used to test theories that want to compute the "true" human preferences. The basic idea is to formalise the question:

## The AI's influence

The AI has access to an algorithm H, representing the human. It can either interact with H or simulate the interaction correctly.

The interaction consists of describing the outcome of choice A versus choice B, and then asking the human which option is better. The set of possible binary choices is $C$ (thus $(A,B) \in C$). The set of descriptions is $D$; the set of possible descriptions for $(A,B)$ is $D_{A,B}$.
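A minimal sketch of this setup (all names here are my own; $H$ is treated as a black-box function from a choice and a description to an answer):

```python
from typing import Callable, Dict, List, Tuple

Choice = Tuple[str, str]   # a binary choice (A, B)
Description = str

# The set C of binary choices, and for each choice the set D_{A,B}
# of admissible descriptions of its outcomes.
C: List[Choice] = [("tea", "coffee")]
D: Dict[Choice, List[Description]] = {
    ("tea", "coffee"): [
        "tea is warm; coffee is warm and caffeinated",
        "a ten-volume comparison of hot beverages",
    ],
}

def query(H: Callable[[Choice, Description], str],
          choice: Choice, description: Description) -> str:
    """Interact with (or correctly simulate) the human algorithm H:
    present the description of the outcomes of A vs. B and return
    H's answer about which option is better."""
    assert choice in C and description in D[choice]
    return H(choice, description)

# A stand-in human who always picks the first option.
stub_H = lambda choice, description: choice[0]
print(query(stub_H, ("tea", "coffee"),
            "tea is warm; coffee is warm and caffeinated"))  # tea
```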

Then we have the assumption that humans can be manipulated:

Note that $D_1$ could be a paragraph while $D_2$ could be a ten-volume encyclopedia; all that's required is that they be logically equivalent.
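The assumption this paragraph refers to (its formal statement is not shown in this extract) presumably has roughly the following shape, reconstructed from the notation above:

```latex
\exists (A,B) \in C,\; \exists D_1, D_2 \in D_{A,B} :\quad
D_1 \equiv D_2 \;\wedge\; H(A, B, D_1) \ne H(A, B, D_2)
```

i.e. there exist two logically equivalent descriptions of the same choice on which $H$ gives different answers.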

But manipulating human answers in the immediate sense is not the only way the AI can influence them. Our values can change through interactions, reflection, and even through being given true and honest information, and the AI can influence this:

## The grounding assumptions

So far, we've just made the task hopeless: the AI can get any answer from H, and can make H into whatever algorithm it feels like. Saying H has preferences is meaningless.

However, we're building from a human world where the potential for humans manipulating humans is limited, and somewhat recognisable. Thus:

Basically, these are examples of interactions that are agreed to be fair, honest, and informative. The more abstract the choices, the harder it is to be sure of this.

Of course, we'd also allow the AI to learn from examples of negative interactions as well:

Finally, we might want a way to encode human meta-preferences:
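The grounding data described above might be packaged, very schematically, as labeled examples the AI can learn from, plus a separate store of meta-preferences (all structure here is my assumption, not part of the post):

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Interaction:
    """One recorded interaction, labeled by whether it is agreed
    to be fair/informative or manipulative."""
    description: str
    label: Literal["fair", "manipulative"]

@dataclass
class MetaPreference:
    """A human meta-preference: a constraint on how first-order
    preferences may legitimately be formed or changed."""
    statement: str

training_data: List[Interaction] = [
    Interaction("an honest, balanced summary of both outcomes", "fair"),
    Interaction("a one-sided scare story about option B", "manipulative"),
]
meta: List[MetaPreference] = [
    MetaPreference("preference changes via true information are acceptable"),
]

print(sum(i.label == "fair" for i in training_data))  # 1
```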

## Building more assumptions in

This still feels like a bare-bones description, unlikely to converge to anything good. For one, I haven't even defined what "logically equivalent" means. But that's the challenge for those constructing solutions to the problem of human preferences. Can they construct sufficiently good $D'_{A,B}$ and $D''_{A,B}$ to converge to some sort of "true" values for $H$? Or, more likely, what extra assumptions and definitions are needed to get such a convergence? And finally, is the result reflective of what we would want?