
Personal Blog

A putative new idea for AI control; index here.

This post is just an initial foray into modelling human irrationality, for the purpose of successful value learning. Its purpose is not to be a full model, but to have enough detail that various common situations can be successfully modelled. The important thing is to model humans in ways that humans can understand (as it's our definition which determines what's a bias and what's a preference in humans).

## Humans, actions, and joint distributions

The human themselves is simply modelled as their brain (thus various human sense organs can be observed by the AI rather than being part of the description).

Let be the set of possible reward functions the human may be maximising. Let be the set of policies the human may be following. We'll assume that is closed under the taking of mixed strategies.

The AI has a joint probability distribution over , and events in the world. By conditioning on any element , defines a map from to probability distributions over . Since is closed under the taking of mixed strategies, this means that can be seen as a map from to .

The map and the marginal distribution ( restricted to ) define entirely. Note that is what relates human actions to their explanation in terms of the reward .
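As a rough illustration of the decomposition above, here is a minimal sketch on a toy discrete example. Everything in it is an assumption made for illustration: the reward and policy names, the numbers, and the helper functions are hypothetical, and events in the world are left out to keep it small. It shows that the reward-to-policy map, together with the marginal over rewards, is enough to rebuild the joint distribution.

```python
from collections import defaultdict

# Hypothetical toy joint distribution over (reward, policy) pairs.
# Names and numbers are illustrative only; world events are omitted.
joint = {
    ("likes_tea", "make_tea"): 0.45,
    ("likes_tea", "make_coffee"): 0.05,
    ("likes_coffee", "make_tea"): 0.10,
    ("likes_coffee", "make_coffee"): 0.40,
}


def marginal_over_rewards(joint):
    """Sum out the policy to get the marginal distribution over rewards."""
    marg = defaultdict(float)
    for (reward, _policy), p in joint.items():
        marg[reward] += p
    return dict(marg)


def reward_to_policy_map(joint):
    """Condition on each reward to get a mixed policy (a distribution over policies).

    Assumes every reward in the joint has non-zero marginal probability.
    """
    marg = marginal_over_rewards(joint)
    cond = defaultdict(dict)
    for (reward, policy), p in joint.items():
        cond[reward][policy] = p / marg[reward]
    return dict(cond)


def rebuild_joint(marg, cond):
    """Recombine the marginal and the conditional map into a joint distribution."""
    return {
        (reward, policy): marg[reward] * q
        for reward, mixed in cond.items()
        for policy, q in mixed.items()
    }
```

Because the set of policies is assumed closed under mixed strategies, each conditional distribution returned by `reward_to_policy_map` can itself be read as a single (mixed) policy.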

## Basic properties of P

Here are a few properties P could have:

1. The distribution is historical if is independent of any action the AI takes.
2. An AI's action overwrites the reward if is constant, conditional on , while is still "broad" ("broad" is not fully defined, but is certainly broad enough if it assigns non-zero probability to both an and ).
3. The distribution is -rational if there exists a prior distribution over the universe such that maps to the optimal policy for an -maximising agent with prior .

It's clear that if P is historical, the AI will treat the human's reward function as something it has to discover, and can't influence. An action that overwrites the reward means that the human's policy is fixed by action , independently of whatever reward it might have. This is bad because a) the human's actions are no longer informative to the AI about their reward, and b) the human's actions are likely suboptimal with respect

Note that stratification can be seen as taking a non-historical distribution, and making it historical via counterfactual.
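The first two properties can be checked mechanically on small discrete examples. The sketch below is only an illustration under assumptions: distributions are plain dictionaries, the function names are hypothetical, and "broad" is replaced by a crude stand-in (at least two rewards with non-zero probability), since the post leaves that term only partly defined.

```python
def is_historical(reward_marginal_given_action, tol=1e-9):
    """True if the marginal over rewards is the same for every AI action.

    `reward_marginal_given_action` maps each AI action to a dict
    {reward: probability}. Assumes at least one action is given.
    """
    marginals = list(reward_marginal_given_action.values())
    first = marginals[0]
    rewards = set().union(*(m.keys() for m in marginals))
    return all(
        abs(m.get(r, 0.0) - first.get(r, 0.0)) <= tol
        for m in marginals
        for r in rewards
    )


def overwrites_reward(policy_given_reward, reward_marginal, tol=1e-9):
    """True if, under a given AI action, every supported reward maps to the
    same mixed policy while the reward marginal is still "broad".

    "Broad" is approximated here (an assumption) as: at least two rewards
    have non-zero probability.
    """
    supported = [r for r, p in reward_marginal.items() if p > tol]
    if len(supported) < 2:
        return False
    reference = policy_given_reward[supported[0]]
    actions = set().union(*(policy_given_reward[r].keys() for r in supported))
    return all(
        abs(policy_given_reward[r].get(a, 0.0) - reference.get(a, 0.0)) <= tol
        for r in supported
        for a in actions
    )
```

When `overwrites_reward` returns true, observing the human's behaviour tells the AI nothing about which reward is in play, which is point a) above.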

Those basic properties can define a basic model of a human. But humans have far more biases and irrationalities. Though these are multiple and complicated, we'll focus here on a few general properties that can capture a lot of these irrationalities in relatively "natural" ways.

By "natural", we mean human understandable properties that encode biases in ways that are not too complicated and are close to how we understand them.

Humans are not perfect logical reasoners who fully and immediately know all the infinite implications of any statement. Now, modelling bounded rationality or logical uncertainty is going to be tricky, but we can for the moment simply assume that humans only partially update their probabilities when new data comes in.

Specifically, there is a function which maps an observation and previous history to the set of statements that get updated.
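A minimal sketch of what such partial updating might look like, assuming beliefs are stored as a dictionary from statements to probabilities. The names `updated_statements` and `new_probability` are hypothetical placeholders for the selection function described above and for whatever rule revises a selected statement.

```python
def partial_update(beliefs, observation, history, updated_statements, new_probability):
    """Revise only the statements selected for this observation and history.

    `updated_statements(observation, history)` returns the set of statements
    that get updated; every other belief is left untouched.
    `new_probability(statement, observation)` gives the revised probability.
    Both callables are illustrative assumptions.
    """
    to_update = updated_statements(observation, history)
    return {
        statement: (
            new_probability(statement, observation)
            if statement in to_update
            else probability
        )
        for statement, probability in beliefs.items()
    }
```

A fully logical reasoner would correspond to a selection function that always returns every statement; the bounded human is modelled by one that returns a small subset.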

Humans don't tend to update their beliefs in a fully Bayesian fashion. Thus define an update function which is used instead of Bayesian updates. If we update the odds ratio of event according to evidence , the correct Bayesian update is

• .

Then could be a function of one variable:

• .

Or of two:

• .

Or could also be a function of the observation and prior history .

Now, humans update some probabilities better than others, but we'll defer that to when we talk of multi-agent models.
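To make the contrast concrete, here is a small sketch comparing the standard Bayesian odds update (posterior odds equal prior odds times the likelihood ratio) with one possible non-Bayesian alternative. The damped form, where the likelihood ratio is raised to a power `alpha`, is purely an illustrative assumption for the one-variable case and is not taken from the post.

```python
def bayes_odds_update(prior_odds, p_evidence_given_event, p_evidence_given_not_event):
    """Standard Bayesian update: posterior odds = prior odds * likelihood ratio."""
    likelihood_ratio = p_evidence_given_event / p_evidence_given_not_event
    return prior_odds * likelihood_ratio


def damped_odds_update(prior_odds, p_evidence_given_event, p_evidence_given_not_event,
                       alpha=0.5):
    """Hypothetical non-Bayesian update that depends only on the likelihood ratio.

    alpha < 1 under-reacts to evidence, alpha > 1 over-reacts,
    and alpha == 1 recovers the Bayesian update.
    """
    likelihood_ratio = p_evidence_given_event / p_evidence_given_not_event
    return prior_odds * likelihood_ratio ** alpha
```

A two-variable version would take the two conditional probabilities separately rather than only their ratio, and a richer version could also depend on the observation and prior history.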

# Bounded rationality: inconceivable actions

Humans don't fully explore the space of possible actions and policies, preferring to stick to those that are the most easily accessible. So most actions are literally inconceivable to us.

This can be modelled by a function which maps and/or to a set of possible actions. This map will determine a subset of possible human policies (these are the policies that only ever take actions compatible with ).
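A minimal sketch of this restriction, assuming a finite list of actions. The predicate `accessible` and the scoring function `value` are hypothetical stand-ins: the first plays the role of the map to conceivable actions, the second of whatever the agent is trying to maximise.

```python
def conceivable_actions(all_actions, observation, history, accessible):
    """Keep only the actions the human can conceive of in this situation.

    `accessible(action, observation, history)` is an assumed predicate
    standing in for the map described above.
    """
    return [a for a in all_actions if accessible(a, observation, history)]


def best_conceivable_action(all_actions, observation, history, accessible, value):
    """Pick the highest-value action among the conceivable ones only.

    Assumes at least one action is conceivable.
    """
    options = conceivable_actions(all_actions, observation, history, accessible)
    return max(options, key=value)
```

The agent can be perfectly rational within the restricted set and still look irrational from outside, because the best overall action was never considered.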

# Multi-agent models

Multi-agent models are useful for modelling the various contradictions in the human psyche (system 1 vs system 2, conscious vs subconscious, short- vs long-term preferences, etc.).

This can be modelled as seeing the human as consisting of different agents , , ... , each of them with their own possible biases as described above (though they must share a common function ). There is a function which, taking and possibly into account, weights the verdicts of the different subagents and outputs the ultimate action.

These multiple agents may be optimising for different reward functions, so can break down into multiple , one for each agent.
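One simple way to picture the weighting function is as a weighted vote over the subagents' preferred actions. This is only one possible aggregation rule, assumed here for illustration; the subagent names and the fixed weights are hypothetical, and in the model above the weights could also depend on the observation and history.

```python
def combined_action(subagent_choices, weights):
    """Return the action with the largest total weight across subagents.

    `subagent_choices` maps each subagent to the action it prefers;
    `weights` maps each subagent to a non-negative weight.
    Ties are broken by whichever action was encountered first.
    """
    totals = {}
    for agent, action in subagent_choices.items():
        totals[action] = totals.get(action, 0.0) + weights[agent]
    return max(totals, key=totals.get)
```

For example, a "system 1" subagent with a high weight under time pressure could outvote a "system 2" subagent that would win in calmer conditions.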

# Recursion and introspection

This is the most complex situation here. Humans have explicit meta-preferences ("I don't want to be racist", "I want to be rational", "I want to be right", etc.) that influence how they update their beliefs. Generally these meta-preferences view the human as an integrated whole.

We can try and model this by positing meta-preferences which are desirable properties of the agent as a whole, and an introspection function which will trigger occasionally, and map some feature of the agent closer to . This is deliberately very vague, and I'll try and flesh it out and formalise it as needed.

## The modelled human

Thus we can define as representing an irrational human if:

1. The prior is -consistent if the human's actions are chosen by applying the agent weighting function to , where maps any to the action chosen by an agent with reward , bounded rationality , partial update , and available actions .
2. The prior is -consistent if it is as above, plus the action of the introspection function on the meta-preferences .