Loosely crossposted at the Intelligent Agent Forum.

Previously, I presented a model of rationality-reward pairs (p,R), in which R was a reward and p the (ir)rationality planner that mapped the reward R to the agent's policy.

But I recently realised that this model can also track whether something is currently over-writing or over-riding the human's preferences: whether some entity, through drugs, manipulation, brain surgery, or whatever methods, has illegitimately changed someone's preferences. As before, this only models that situation; it doesn't allow you to conclude that it's actually happening.

Feast or heroin famine

An AI has the opportunity to surreptitiously inject someone with heroin (I) or not do so (¬I). If it doesn’t, the human will choose to enjoy a massive feast (F); if it does, the human will instead choose more heroin (H).

So the human policy is given by π(I)=H, π(¬I)=F.

The pair (p,R) is compatible with π if p(R)=π; that is, if using the (ir)rationality planner p to maximise reward R leads to policy π.

Reward and rationality

There are three natural R's to consider here. First, Rp, a generic pleasure reward. Next, Re, the ‘enjoyment’ reward, where enjoyment is pleasure endorsed as ‘genuine’ by common judgement. Assume that Rp(H)=1, Rp(F)=1/3, Re(F)=1/2, and Re(H)=0: heroin is more pleasurable than a feast but less enjoyable.

Finally, there is the twisted reward Rt, which is Rp conditional on I and Re conditional on ¬I (twisted rewards may seem more complicated than simple rewards, but that is not always the case).

There are two natural p's: pr, the fully rational planner, and pf, the planner that is fully rational conditional on ¬I, but always maps to H if I is chosen: pf(R)(I)=H, for any reward R.

Compatibility

The pair (pr, Re) is not compatible with π: it predicts that the human would take action F following I (feast following injection). The reward Rp is compatible with neither p: it predicts H following ¬I (heroin following no injection).

The other three pairs are compatible: pr(Rt), pf(Rt), and pf(Re) all give the correct policy π.
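The compatibility claims above can be checked mechanically. What follows is a minimal sketch, not part of the original model's formalism: the string labels ("I", "notI", "H", "F") and the function names are assumptions made for illustration, and rewards are treated as functions of the AI's action and the human's choice, using only the four values given in the text.

```python
from fractions import Fraction

# AI actions: "I" (inject) or "notI" (don't); human choices: "H" or "F".
AI_ACTIONS = ("I", "notI")
CHOICES = ("H", "F")

# The observed human policy pi: pi(I)=H, pi(notI)=F.
PI = {"I": "H", "notI": "F"}

# Reward values taken from the text.
R_P = {"H": Fraction(1), "F": Fraction(1, 3)}   # pleasure
R_E = {"H": Fraction(0), "F": Fraction(1, 2)}   # enjoyment


def r_p(action, choice):
    return R_P[choice]


def r_e(action, choice):
    return R_E[choice]


def r_t(action, choice):
    # Twisted reward: R_p conditional on I, R_e conditional on notI.
    return R_P[choice] if action == "I" else R_E[choice]


def p_r(reward):
    # Fully rational planner: picks the best choice after each AI action.
    return {a: max(CHOICES, key=lambda c: reward(a, c)) for a in AI_ACTIONS}


def p_f(reward):
    # Rational conditional on notI, but always maps to H after I.
    policy = p_r(reward)
    policy["I"] = "H"
    return policy


def compatible(planner, reward):
    # (p, R) is compatible with pi if p(R) = pi.
    return planner(reward) == PI


PLANNERS = {"p_r": p_r, "p_f": p_f}
REWARDS = {"R_p": r_p, "R_e": r_e, "R_t": r_t}

for p_name, planner in PLANNERS.items():
    for r_name, reward in REWARDS.items():
        print(p_name, r_name, compatible(planner, reward))
```

Under these assumptions, only the pairs (p_r, R_t), (p_f, R_t) and (p_f, R_e) come out as compatible, matching the list above.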

Regret and overriding rewards

This leads to a definition of when the AI is overriding human rewards. Given a compatible pair (p,R), an AI’s action A overrides the human reward if π|A is poorly optimised for maximising R. If V(π)(R|A) is the expected reward (according to R, conditional on A) of the actual human policy, and V*(R|A) is the expected reward of the human following the ideal policy for maximising R, then a measure of how much the AI is overriding rewards is the regret:

V*(R|A)-V(π)(R|A).

One might object that this isn’t the AI overriding the reward, but reducing human rationality. But these two facets are related: π|A is poorly fitted for maximising R, but there’s certainly another reward R' which π|A is better suited to maximise. So the AI is effectively forcing the human into maximising a different reward.
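As a rough illustration of the regret V*(R|A)-V(π)(R|A), here is a small sketch for the feast-or-heroin example. It assumes everything is deterministic, so "expected reward" is just the reward of the single choice made; the labels and function names are hypothetical.

```python
from fractions import Fraction

CHOICES = ("H", "F")

# Observed human policy pi: heroin after injection, feast otherwise.
PI = {"I": "H", "notI": "F"}

# Reward values taken from the text.
R_P = {"H": Fraction(1), "F": Fraction(1, 3)}
R_E = {"H": Fraction(0), "F": Fraction(1, 2)}


def r_e(action, choice):
    return R_E[choice]


def r_t(action, choice):
    # Twisted reward: R_p conditional on I, R_e conditional on notI.
    return R_P[choice] if action == "I" else R_E[choice]


def regret(reward, action, policy=PI):
    # V*(R|A): value of the best available choice after the AI's action.
    ideal = max(reward(action, c) for c in CHOICES)
    # V(pi)(R|A): value of what the human actually does after that action.
    actual = reward(action, policy[action])
    return ideal - actual


for name, reward in (("R_e", r_e), ("R_t", r_t)):
    for action in ("I", "notI"):
        print(name, action, regret(reward, action))
```

With these numbers, the regret for R_e is 1/2 after injection and 0 otherwise, while the regret for R_t is 0 after either action: only under R_e does the injection count as overriding the reward.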

(An alternate, but related, measure of whether people’s reward is being overridden is whether, conditional on A, p(R) is ‘sensitive’ to the reward R. A merely incompetent human would have p(R) changing a lot dependent on R - though never maximising it very well - while one with reward overridden would have the same behaviour whatever R it was supposed to maximise).
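The sensitivity idea can be sketched in a very crude, binary form: conditional on the AI's action, does the planner's output change at all across a set of candidate rewards? The candidate set (just R_p and R_e here) and the helper name is_sensitive are assumptions for illustration; a real measure would presumably be graded rather than yes/no.

```python
from fractions import Fraction

AI_ACTIONS = ("I", "notI")
CHOICES = ("H", "F")

# Reward values taken from the text.
R_P = {"H": Fraction(1), "F": Fraction(1, 3)}
R_E = {"H": Fraction(0), "F": Fraction(1, 2)}


def r_p(action, choice):
    return R_P[choice]


def r_e(action, choice):
    return R_E[choice]


def p_r(reward):
    # Fully rational planner.
    return {a: max(CHOICES, key=lambda c: reward(a, c)) for a in AI_ACTIONS}


def p_f(reward):
    # Rational conditional on notI, forced to H conditional on I.
    policy = p_r(reward)
    policy["I"] = "H"
    return policy


def is_sensitive(planner, action, rewards):
    # True if, conditional on the AI's action, the human's choice
    # differs for at least two of the candidate rewards.
    chosen = {planner(reward)[action] for reward in rewards}
    return len(chosen) > 1


CANDIDATES = [r_p, r_e]
for name, planner in (("p_r", p_r), ("p_f", p_f)):
    for action in AI_ACTIONS:
        print(name, action, is_sensitive(planner, action, CANDIDATES))
```

On this toy test, p_f is insensitive to the reward after I and sensitive after ¬I, while p_r is sensitive after I.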

Apply these to the example above. The (pr, Rt) pair means that the human is rationally maximising the twisted reward Rt. The (pf, Rt) pair is one where the injection forces the human into a very specific behaviour - specific behaviour that coincidentally is exactly the right thing for their reward. Finally, (pf, Re) claims that the injection forces the human into specific behaviour that is detrimental to their reward. In the first two cases, the AI’s recommended action is I (expected reward 1 versus 1/2 for ¬I, maximum regret 1/2), in the third it’s ¬I (expected reward 1/2 versus 0 for I, maximum regret also 1/2).
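The arithmetic behind those recommendations can be reproduced with a short sketch. It assumes the AI simply picks the action whose resulting human behaviour scores highest under the hypothesised reward; since every compatible pair yields the same policy π, only the reward matters for this calculation. The names are hypothetical.

```python
from fractions import Fraction

AI_ACTIONS = ("I", "notI")

# Observed human policy pi, shared by all compatible pairs.
PI = {"I": "H", "notI": "F"}

# Reward values taken from the text.
R_P = {"H": Fraction(1), "F": Fraction(1, 3)}
R_E = {"H": Fraction(0), "F": Fraction(1, 2)}


def r_e(action, choice):
    return R_E[choice]


def r_t(action, choice):
    # Twisted reward: R_p conditional on I, R_e conditional on notI.
    return R_P[choice] if action == "I" else R_E[choice]


def value(reward, action):
    # Reward the human actually gets if the AI takes this action.
    return reward(action, PI[action])


def recommended_action(reward):
    # The AI action leading to the highest reward under pi.
    return max(AI_ACTIONS, key=lambda a: value(reward, a))


for name, reward in (("R_t", r_t), ("R_e", r_e)):
    values = {a: value(reward, a) for a in AI_ACTIONS}
    print(name, values, "->", recommended_action(reward))
```

For R_t this gives 1 for I against 1/2 for ¬I, so I is recommended; for R_e it gives 0 for I against 1/2 for ¬I, so ¬I is recommended.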

Of course, it’s also possible to model humans as opioid-maximisers, whose rationality is overridden by not getting heroin injections; as already stated, rewards and rationality cannot be deduced from observations alone.

As above, so below: no natural default

You may have noticed that, according to this definition, an AI is over-riding human rewards when it lets humans just be less effective at achieving their goals. An irrationality on the part of a human is thus a failure of an AI that serves the human interest.

But this is not a new problem in AI. You start with a goal phrased as "reduce suffering", expecting it to cure pain and diseases, before realising you've instructed it to optimise the universe to force everyone to avoid the slightest bit of contrary emotion.

Similarly, there is no clear way to distinguish "don't force humans into behaviour contrary to their rewards" from "make sure humans maximise their rewards to the utmost". We'd better make damn sure we get the right reward.