Goal completion: noise, errors, bias, prejudice, preference and complexity

Stuart_Armstrong

A putative new idea for AI control; index here.

This is a preliminary look at how an AI might assess and deal with various types of errors and uncertainties, when estimating true human preferences. I'll be using the circular rocket model to illustrate how these might be distinguished by an AI. Recall that the rocket can accelerate by -2, -1, 0, 1, and 2, and the human wishes to reach the space station (at point 0 with velocity 0) and avoid accelerations of ±2. In what follows, there will generally be some noise, so to make the whole thing more flexible, assume that the space station is a bit bigger than usual, covering five squares. So "docking" at the space station means reaching {-2,-1,0,1,2} with 0 velocity.
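For concreteness, here is a minimal sketch of the rocket model as described above. The circle size (`TRACK_LENGTH`) and the update order (velocity first, then position) are assumptions made for illustration; the post does not specify them, and all names are hypothetical.

```python
from dataclasses import dataclass

ACCELERATIONS = (-2, -1, 0, 1, 2)
STATION = frozenset({-2, -1, 0, 1, 2})  # the enlarged five-square station
TRACK_LENGTH = 100  # hypothetical circumference; not fixed by the post


@dataclass(frozen=True)
class State:
    position: int
    velocity: int


def wrap(position: int) -> int:
    """Map a position onto the circle, centred on 0 (range -50..49 here)."""
    half = TRACK_LENGTH // 2
    return (position + half) % TRACK_LENGTH - half


def step(state: State, acceleration: int) -> State:
    """Advance one turn. Assumed order: velocity changes, then position."""
    if acceleration not in ACCELERATIONS:
        raise ValueError(f"unsupported acceleration: {acceleration}")
    velocity = state.velocity + acceleration
    return State(wrap(state.position + velocity), velocity)


def is_docked(state: State) -> bool:
    """Docked means: inside the station squares with zero velocity."""
    return state.velocity == 0 and state.position in STATION
```

Under these assumptions, a rocket at position -3 with velocity 2 docks after two turns of -1 acceleration.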

The purpose of this exercise is to distinguish true preferences from other things that might seem to be preferences from the outside, but aren't. Ultimately, if this works, we should be able to construct an algorithm that identifies preferences, such that anything it rejects is at least arguably not a preference.

So, at least initially, I'll be identifying terms like "bias" with "fits into the technical definition of bias used in this model"; once the definitions are refined, we can then check whether they capture enough of the concepts we want.

So I'm going to use the following terms to distinguish various technical concepts in this domain:

1. Noise
2. Error
3. Bias
4. Prejudices
5. Known prejudices
6. Preferences

Here, noise is when the action selected by the human doesn't lead to the correct action output. Error is when the human selects the wrong action. Bias is when the human is following the wrong plan. A prejudice is a preference the human has that they would not endorse if it were brought to their conscious notice. A known prejudice is a prejudice the human knows about, but can't successfully correct within themselves.

And a preference is a preference.

What characteristics would allow the AI to distinguish between these? Note that part of the reason that the terms are non-standard is that I'm not starting with perfectly clear concepts and attempting to distinguish them; instead, I'm finding ways of distinguishing various concepts, and seeing if these map on well to the concepts we care about.

Noise versus preference and complexity

In this model, noise is a 5% chance of under-accelerating, i.e. a desired acceleration of ±2 will, 5% of the time, give an acceleration of ±1. And a desired acceleration of ±1 will, 5% of the time, give an acceleration of zero.
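The noise model just described can be sketched as a small function. The name `apply_noise` and its signature are illustrative only; the behaviour (a 5% chance of losing one step of magnitude, with a desired 0 left untouched) is taken from the description above.

```python
import random

NOISE_PROBABILITY = 0.05


def apply_noise(desired: int, rng: random.Random,
                p: float = NOISE_PROBABILITY) -> int:
    """Return the realised acceleration for a desired one.

    With probability p the magnitude drops one step towards zero
    (+-2 becomes +-1, +-1 becomes 0). A desired 0 is never altered.
    """
    if desired == 0 or rng.random() >= p:
        return desired
    return desired - 1 if desired > 0 else desired + 1
```

Note that noise only ever reduces the magnitude of an acceleration: an observed -2 can only come from a desired -2, while an observed -1 is ambiguous.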

The human starts from a position where, to reach the space station at zero velocity, the best plan is to accelerate for a long while and then decelerate (by -1) for two turns. Accelerating by -2 once would also do the trick, though the human prefers not to do that, obviously.

As in the previous post, the AI has certain features φi to explain the human's behaviour. They are:

φ0({-2,-1,0,1,2},0;-)=1
φ1(-,-;-)=1
φ2(-,-;+2)=1
φ3(-,-;-2)=1

The first feature indicates that the five-square space-station is in a special position (if the velocity is 0). The second feature (a universal feature) is used to show that wasting time is not beneficial. The third feature is used to rule out accelerations of +2, the last feature those of -2.
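One way to read these four features is as indicator functions of (position, velocity; action). The sketch below writes them that way and combines them into a reward as a weighted sum. The linear combination and the example weights are assumptions for illustration, not something the model above specifies.

```python
STATION = frozenset({-2, -1, 0, 1, 2})


def phi0(position, velocity, action):
    """Fires when docked: in the station with zero velocity, any action."""
    return 1.0 if position in STATION and velocity == 0 else 0.0


def phi1(position, velocity, action):
    """Universal feature: fires every turn, so a negative weight
    penalises wasted time."""
    return 1.0


def phi2(position, velocity, action):
    """Fires on an acceleration of +2."""
    return 1.0 if action == 2 else 0.0


def phi3(position, velocity, action):
    """Fires on an acceleration of -2."""
    return 1.0 if action == -2 else 0.0


FEATURES = (phi0, phi1, phi2, phi3)


def reward(weights, position, velocity, action):
    """Assumed linear reward: sum of weight * feature value."""
    return sum(w * f(position, velocity, action)
               for w, f in zip(weights, FEATURES))
```

With hypothetical weights such as `(10, -1, -5, -5)`, docking is rewarded, every turn costs a little, and both ±2 accelerations are penalised. Setting the last weight to 0 gives the three-feature explanation, in which -2 is no worse than any other action.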

Given the trajectory it's seen, the AI can confidently fit φ0, φ1 and φ2 to some estimate of true rewards (the human rushes to the space station, without using +2 accelerations). However, it doesn't know what to do with φ3. The human had an opportunity to use -2 acceleration, but went for two -1s instead. There are two options: the human actually wants to avoid -2 accelerations, and everything went well. Or the human doesn't want to avoid them, but the noise forced their desired -2 acceleration down to -1.

Normally there would be a complexity prior here, with a three-feature explanation being the most likely - possibly still the most likely after multiplying it by 5% to account for the noise. However, there is a recurring risk that the AI will underestimate the complexity of human desires. One way of combating this is to not use a complexity prior, at least up to some "reasonable" size of human desires. If the four-feature explanation has more prior weight than the three-feature one, then φ3 is likely to be used and the AI will see the -1,-1 sequence as deliberate, not noise.
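The effect of the prior can be made concrete with a rough Bayesian sketch. It looks only at a single observed -1 where -2 was available, and assumes that a human without φ3 would have chosen -2 there. Both are simplifying assumptions, and the function name is hypothetical.

```python
def posterior_four_feature(prior_four: float, noise: float = 0.05) -> float:
    """Posterior probability that the human avoids -2 (phi3 is active),
    given one observed -1 where -2 was available.

    Four-feature hypothesis: the human chose -1, which is realised as -1
    with probability 1 - noise.
    Three-feature hypothesis: the human chose -2, which noise degraded
    to -1 with probability noise.
    """
    prior_three = 1.0 - prior_four
    evidence_four = prior_four * (1.0 - noise)
    evidence_three = prior_three * noise
    return evidence_four / (evidence_four + evidence_three)
```

With equal prior weight (`prior_four=0.5`) the posterior for the four-feature explanation is about 0.95. The two explanations only break even when the prior puts the four-feature explanation at 0.05, i.e. a complexity prior has to favour the three-feature explanation by more than 19 to 1 for it to remain the most likely after this one observation.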

A warning, however: humans have complex preferences, but those preferences are not relevant to every single situation. What about φ4, the human preference for chocolate, φ5, the human preference for dialogue in movies, and φ6, the human preference for sunlight? None of them would appear directly in this simple model of the rocket equation. And though φ0-φ1-φ2-φ3 is a four-feature model of human preferences, so is φ0-φ1-φ2-φ6 (which is indistinguishable, in this example, from the three-feature model φ0-φ1-φ2).

So we can't say "models with four features are more likely than models with three"; at best we could say "models with three relevant features are as likely as models with four relevant features". But, given that, the AI will still converge on the correct model of human preferences.

Note that as the amount of data/trajectories increases, the ability of the AI to separate preference from noise increases rapidly.
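To see how quickly this happens, here is a sketch under the assumption that ambiguous observations (a -1 where -2 was available) are independent. Each one multiplies the odds in favour of the four-feature explanation by 0.95/0.05, roughly 19. The function name and the default target odds of 100 to 1 are illustrative choices.

```python
def observations_needed(prior_odds_three: float, noise: float = 0.05,
                        target_odds_four: float = 100.0) -> int:
    """Count the ambiguous observations needed before the four-feature
    explanation is favoured by at least target_odds_four to 1.

    prior_odds_three is the prior odds in favour of the three-feature
    explanation (e.g. 1000.0 means 1000:1 against phi3).
    """
    if not 0.0 < noise < 0.5:
        raise ValueError("noise must lie strictly between 0 and 0.5")
    if prior_odds_three <= 0.0:
        raise ValueError("prior odds must be positive")
    ratio = (1.0 - noise) / noise  # likelihood ratio per observation
    odds_four = 1.0 / prior_odds_three
    count = 0
    while odds_four < target_odds_four:
        odds_four *= ratio
        count += 1
    return count
```

Even a prior of a million to one against φ3 is overturned after seven such observations (`observations_needed(1e6)` returns 7). In the other direction, a single observed -2 would settle the matter at once, since in this model noise can never turn a desired -1 into a -2.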

Error versus bias versus preference

First, set noise to zero. Now imagine that the rocket is moving at a relative velocity such that the ideal strategy to reach the space station is to accelerate by +1 for three more turns, and then decelerate by -1 for several turns until it reaches the station.

Put the noise back up to 5%. Now, the optimum strategy is to start decelerating immediately (since there is a risk of under-accelerating during the deceleration phase). Instead, the human starts accelerating by +1.

There are three possible explanations for this. Firstly, the human may not actually want to dock at the space station. Secondly, the human may be biased - overconfident, in this case. The human may believe there is no noise (or that it can overcome it through willpower and fancy flying?) and therefore is following the ideal strategy in the no-noise situation. Or the human may simply have made an error, doing +1 when it meant to do -1.

These options can be distinguished by observing subsequent behaviour (and behaviour on different trajectories). If we assume the preferences are correct, then a biased trajectory involves the human following the ideal plan for an incorrect noise value, and then desperately adjusting at the end when they realise their plan won't work. An error, on the other hand, should result in the human trying to undo their action as best they can (say, by decelerating next turn rather than following the +1,+1,-1,-1,... strategy of the no-noise world).
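A very crude heuristic version of this test might look as follows. It treats observed actions as the chosen ones (ignoring noise), takes the preferences as given, and uses an arbitrary threshold `min_match`. All names and thresholds are assumptions, so this is an illustration of the idea rather than a workable classifier.

```python
from typing import Sequence


def classify_deviation(observed: Sequence[int],
                       no_noise_plan: Sequence[int],
                       min_match: int = 3) -> str:
    """Label a suspicious first action as 'bias', 'error' or 'undetermined'.

    observed: actions from the suspicious turn onwards.
    no_noise_plan: the ideal plan from that turn if noise were zero.
    """
    if len(observed) < 2:
        return "undetermined"
    # Count how long the human keeps following the no-noise plan.
    matched = 0
    for seen, planned in zip(observed, no_noise_plan):
        if seen != planned:
            break
        matched += 1
    if matched >= min_match:
        return "bias"  # persistently following the plan for the wrong noise value
    # An immediate reversal looks like an attempt to undo a slip.
    if observed[0] * observed[1] < 0:
        return "error"
    return "undetermined"
```

For a no-noise plan of `[1, 1, 1, -1, -1, -1]`, the observation `[1, 1, 1, -1, -1]` is labelled as bias, while `[1, -1, -1]` is labelled as an error.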

These are not sharp distinctions (especially on a single trajectory or a small set of them). Maybe the human has a preference for odd manoeuvres as it approaches the space station. Maybe it makes a mistake every turn, and purely coincidentally follows the right trajectory. And so on.

But this is about the most likely (simplest) explanation. Does the human show all the signs of being a competent seeker of a particular goal, except that sometimes they seem to do completely random things, which they then try and repair (or shift to a completely alternate strategy if the random action is not repairable)? Most likely an error.

Is the human behaviour best explained by simple goals, but a flaw in the strategy? This could happen if the overconfident human always accelerated too fast, and then did some odd manoeuvres back and forth to dock with the station. This could be odd preferences for the docking procedure, but a larger set of trajectories could rule this out: sometimes, the overconfident human will arrive perfectly at the station. In that case, they will not perform the back and forth dance, revealing that that behaviour was a result of a flawed strategy (bias) rather than odd preferences.

A subtlety in distinguishing bias is when the human (or maybe the system it's in) uses meta-rationality to correct for the bias. Maybe the human is still overconfident, but has picked up a variety of habits that compensate for that overconfidence. How would the AI interpret some variant of an overly prudent approach phase, followed by wildly reckless late manoeuvring (when errors are easier to compensate for)? This is not clear, and requires more thought.

Preference versus prejudice (and bias)

This is the most tricky distinction of all - how would you distinguish a prejudice from a true preference? One way of approaching it is to see if presenting the same information in different ways makes a difference.

This can be attempted with bias as well. Suppose the human's reluctance for ±2 accelerations is due to a bias that causes them to fear that the rocket will fall apart at those accelerations, but that bias isn't accurate. Then the AI can report either "we have an acceleration of +2" or "we have the highest safe acceleration". Both are saying the same thing, but the human will behave differently in either, revealing something about what is preference and what is bias.

What about prejudice? Racism and sexism are obvious examples, but it's more common than that. Suppose the pilot listens to opera music while flying, and unconsciously presses down harder on the accelerator while listening to "Ride of the Valkyries". This fits perfectly into the prejudice format: it's a preference that the pilot would want to remove if they were informed about it.

To test this, the AI could offer to inform the human pilot of the music selection when the pilot was planning the flight (possibly at some small price). If the pilot had a genuine preference for "flying fast when listening to Wagner", then this music selection is relevant for their planning, and they'd certainly want it. If the prejudice was unconscious, however, they would see no interest in seeing the music selection at this point.

Once a prejudice is identified, the AI then has the option of asking the human directly if they agree with it (thus upgrading it to a true but unknown preference).
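The two-stage test described here (offer the information at planning time, then ask directly) can be written as a small decision procedure. The function and its labels are hypothetical, and the sketch assumes the AI can reliably observe each of the three inputs, which is a strong assumption.

```python
from typing import Optional


def classify_influence(affects_behaviour: bool,
                       wants_information: bool,
                       endorses_when_asked: Optional[bool] = None) -> str:
    """Classify a factor (e.g. the music selection) that may drive behaviour.

    affects_behaviour: the factor measurably changes the human's actions.
    wants_information: the human accepts the offer to learn the factor
        when planning (possibly at a small price).
    endorses_when_asked: the human's answer when asked directly whether
        they agree with the influence; None if not yet asked.
    """
    if not affects_behaviour:
        return "no influence"
    if wants_information:
        return "preference"
    if endorses_when_asked is None:
        return "candidate prejudice"
    return "unknown preference" if endorses_when_asked else "prejudice"
```

In the Wagner example, a pilot who flies faster to "Ride of the Valkyries" but sees no interest in knowing the playlist while planning would be a candidate prejudice, to be resolved by asking them.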

Known prejudices

Sometimes, people have prejudices, know about them, don't like them, but can't avoid them. They might then have very complicated meta-behaviours to avoid falling prey to them. To use the Wagner example, someone trying to repress that would seem to have the double preferences "I prefer to never listen to Wagner while flying" and "if, however, I do hear Wagner, I prefer to fly faster", when in fact neither of these is a preference.

It would seem that the simplest approach would be to have people list their undesired prejudices. But apart from the risk that they could forget some of them, their statements might be incorrect. They could say "I don't want to want to fly faster when I hear opera", while in reality only Wagner causes that in them. So further analysis is required beyond simply collecting these statements.

Revisiting complexity

In a previous post, I explored the idea of giving the AI some vague idea of the size and complexity of human preferences, and having it aim for explanations of that size. However, I pointed out a tradeoff: if the size was too large, the AI would label prejudices or biases as preferences, while if the size was too small, it would ignore genuine preferences.

If there are ways of distinguishing biases and prejudices from genuine preferences, though, then there is no trade-off. Just put the expected complexity for combined human preferences+prejudices+biases at some number, and let the algorithm sort out what is preference and what isn't. It is likely much easier to estimate the size of human preferences+pseudo-preferences than it is to identify the size of true preferences (which might vary more from human to human, for a start).

I welcome comments, and will let you know if this research angle goes anywhere.