Here I'll expand on my observation that anchoring bias is formally similar to taste-based preferences, and develop some more formalism for learning the values/preferences/reward functions of a human.

Anchoring or taste

An agent $H$ (think of them as a simplified human) confronts one of two scenarios:

In scenario I, the agent sees a movie scene where someone wonders how much to pay for a bar of chocolate, spins a wheel, and gets either £0.01 or £100. Then $H$ is asked how much they would spend for the same bar of chocolate.
In scenario II, the agent sees a movie scene in which someone eats a bar of chocolate, which reveals that the bar has nuts, or doesn't. Then $H$ is asked how much they would spend for the same bar of chocolate.

In both scenarios, $H$ gives one of two answers: they will spend £1 for the bar (after seeing £0.01, or no nuts) or £3 (after seeing £100, or nuts).

We want to say that scenario I is due to anchoring bias, while scenario II is due to taste differences. Can we?

Looking into the agent

We can't directly say anything about $H$ just by their actions, of course - even with simplicity priors. But we can make some assumptions if we look inside their algorithm, and see how they model the situation.

Assume that $H$'s internal structure consists of two pieces: a modeller $M$ and an assessor $A$. Any input $i$ is streamed to both $M$ and $A$. Then $M$ can interrogate $A$ by sending an internal variable $v$, receive another variable in return, and then output $o$.

Each variable is indexed by the timestep at which it is transmitted: the input $i_{1}$ goes to both $M$ and $A$, then $M$ sends $v_{2}$ to $A$, $A$ returns $v_{3}$ to $M$, and $M$ outputs $o_{4}$.

Here the input $i_{1}$ decomposes into $m$ (the movie) and $q$ (the question). Assume that these variables are sufficiently well grounded that when I describe them ("the modeller", "the movie", "the key variables", and so on), these descriptions mean what they seem to.

So the modeller $M$ will construct a list of all the key variables, and pass these on to the assessor $A$ to get an idea of the price. The price will return in $v_{3}$ , and then $M$ will simply output that value as $o_{4}$ .
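The message flow just described can be sketched in a few lines of Python. This is only an illustrative sketch: the names `run_agent`, `pass_question_only` and `flat_price`, and the choice of dictionaries to hold the variables, are my assumptions and not part of the model itself.

```python
from typing import Any, Callable, Dict

Variables = Dict[str, Any]  # hypothetical container for i_1 and v_2


def run_agent(
    modeller: Callable[[Variables], Variables],
    assessor: Callable[[Variables, Variables], float],
    i1: Variables,
) -> float:
    """Run one pass of H: the input goes to both M and A, then M queries A."""
    v2 = modeller(i1)      # timestep 2: M sends the key variables to A
    v3 = assessor(i1, v2)  # timestep 3: A (which also saw i_1) returns a price
    o4 = v3                # timestep 4: M simply outputs that price
    return o4


# Toy components, purely for illustration.
def pass_question_only(i1: Variables) -> Variables:
    return {"q": i1["q"]}


def flat_price(i1: Variables, v2: Variables) -> float:
    return 1.0


example_output = run_agent(
    pass_question_only,
    flat_price,
    {"m": "some movie", "q": "how much is a bar of chocolate worth?"},
)
```

The point of giving `assessor` both `i1` and `v2` is that $A$ can be influenced by parts of the input that $M$ never chose to pass on.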

A human-like agent

First we'll design $H$ to look human-like. In scenario I the modeller $M$ will pass $v_{2} = q$ to the assessor $A$: only the question $q =$ "how much is a bar of chocolate worth?" will be passed on (in a real-world scenario, more details about what kind of chocolate it is would be included, but let's ignore those details here). The answer $v_{3}$ will be £1 or £3, as indicated above, depending on $m$ (which is also an input into $A$).

In scenario II, the modeller will pass on $v_{2} = \{q, n\}$ where $n$ is a boolean that indicates whether the chocolate contains nuts or not. The response $v_{3}$ will be £1 if $n = 0$ (false) or £3 if $n = 1$ (true).

Can we now say that anchoring is a bias but the taste of nuts is a preference? Almost, we're nearly there. To complete this, we need to make the normative assumption:

$α$ : key variables that are not passed on by $M$ are not relevant to the agent's reward function.

Now we can say that anchoring is a bias (because the variable that changes the assessment, the movie, affects $A$ but is not passed on via $M$ ), while taste is likely a preference (because the key taste variable is passed on by $M$ ).
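To make the use of $α$ concrete, here is a minimal sketch of the human-like $H$ together with a check that applies $α$ to it. The encoding is my own assumption: the movie is a dictionary with a `kind` (`"wheel"` for scenario I, `"nuts"` for scenario II) and a boolean `value` (true for £100 or for nuts), and the function names are hypothetical.

```python
Q = "how much is a bar of chocolate worth?"


def modeller(i1):
    """M for the human-like H: it passes n on only in the nuts scenario."""
    m, q = i1["m"], i1["q"]
    if m["kind"] == "nuts":
        return {"q": q, "n": m["value"]}
    return {"q": q}


def assessor(i1, v2):
    """A returns a price; it can see the raw input as well as v_2."""
    if "n" in v2:
        return 3.0 if v2["n"] else 1.0
    # Anchoring: the movie sways A directly, without going through M.
    return 3.0 if i1["m"]["value"] else 1.0


def classify(kind):
    """Apply alpha to the variable that differs within one scenario."""
    prices, passed_on = [], []
    for value in (False, True):
        i1 = {"m": {"kind": kind, "value": value}, "q": Q}
        v2 = modeller(i1)
        prices.append(assessor(i1, v2))
        passed_on.append("n" in v2)
    if prices[0] == prices[1]:
        return "irrelevant"  # the variable does not change the price at all
    return "preference" if all(passed_on) else "bias"


wheel_label = classify("wheel")
nuts_label = classify("nuts")
```

Under these assumptions, `classify` only inspects what $M$ passed on, which is exactly the information that $α$ says matters.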

A non-human agent

We can also design an $H^{'}$ with the same behaviour as $H$, but clearly non-human. For $H^{'}$, $v_{2}^{'} = q$ in scenario II, while $v_{2}^{'} = \{q, n\}$ in scenario I, where $n$ is a boolean encoding whether the movie-chocolate was bought for £0.01 or for £100.

In that case, $α$ will assess anchoring as a demonstration of preference, while the presence of nuts is clearly an irrational bias. And I'd agree with this assessment - but I wouldn't call $H^{'}$ a human, for reasons explained here.
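The contrast between the two agents can be sketched as follows: both produce identical prices, but $α$ labels them in opposite ways because their modellers pass on different variables. As before, this is an illustration under my own assumptions (a movie encoded as a `kind` plus a boolean `value`, and hypothetical names such as `make_agent` and `alpha_label`).

```python
Q = "how much is a bar of chocolate worth?"
KINDS = ("wheel", "nuts")  # scenario I, scenario II


def make_agent(passed_kind):
    """Build (modeller, assessor) where only `passed_kind` is forwarded as n."""

    def modeller(i1):
        m = i1["m"]
        if m["kind"] == passed_kind:
            return {"q": i1["q"], "n": m["value"]}
        return {"q": i1["q"]}

    def assessor(i1, v2):
        # Use n if M passed it on; otherwise read the movie from the raw input.
        high = v2["n"] if "n" in v2 else i1["m"]["value"]
        return 3.0 if high else 1.0

    return modeller, assessor


def price(agent, kind, value):
    modeller, assessor = agent
    i1 = {"m": {"kind": kind, "value": value}, "q": Q}
    return assessor(i1, modeller(i1))


def alpha_label(agent, kind):
    """Under alpha, a price-changing variable is a preference iff M passes it on."""
    modeller, _ = agent
    i1 = {"m": {"kind": kind, "value": True}, "q": Q}
    return "preference" if "n" in modeller(i1) else "bias"


H = make_agent("nuts")         # human-like: nuts go through M
H_prime = make_agent("wheel")  # non-human: the wheel price goes through M

same_behaviour = all(
    price(H, k, v) == price(H_prime, k, v) for k in KINDS for v in (False, True)
)
labels_H = {k: alpha_label(H, k) for k in KINDS}
labels_H_prime = {k: alpha_label(H_prime, k) for k in KINDS}
```

Since `same_behaviour` depends only on outputs while the labels depend only on $v_{2}$, the sketch shows why behaviour alone cannot separate the two agents.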

AI ALIGNMENT FORUM

Anchoring vs Taste: a model
