Personal Blog

A putative new idea for AI control; index here.

When dealing with the problem of bias, I stumbled upon what I believe is the best way of getting value indifference, one that solves almost all of the problems with the previous methods.

## The problems with the old setups

My initial setup has a counterfactual which I'll call a causal counterfactual, following Jessica's terminology. Her whole post is a good way of illustrating what is happening. A button is due to be pressed or not pressed, by a human; this pressing will change the agent's utility function.

There are low-probability quantum events that will artificially depress the button or prevent it from being pressed. The agent only cares about those low-probability events, which allows us to set up the formalism without worrying about changes in the probability of the human pressing the button. Then the events A="the human presses the button" and A'="the quantum event presses the button" were taken to be interchangeable, as were B="the human doesn't press the button" and B'="the quantum event prevents the button from being pressed".

The main problem is that these events are not really the same. So the agent will behave as if certain events are impossible (such as the human genuinely pressing the button), and is likely to face human resistance if the human tries to press the button and the button doesn't press due to quantum events, or vice versa. This resistance will either lower the agent's expected reward, or push the agent to manipulate the human's actions.

Benya has sketched out a problem with another approach to counterfactuals, which we can call "evidential counterfactuals". Here we're trying to fiddle with utilities to enforce that the agent treats the probability of A and B as being fixed (no A' and B' here).

The problem with that is that the agent can change the probabilities conditional on some rare event, but then behave as if the probabilities were still unchanged.

For instance, the agent could ensure that A only happens if it wins the lottery. And then it would behave, in effect, as if it had a large chance of winning the lottery (equal to the initial chance of A happening, before the agent took any action). That problem is illustrated here.

A final problem was that neither method allowed the agent to learn usefully: the methods did not distinguish between the agent merely learning that humans intended/didn't intend to press the button (which is presumably fine), versus the agent influencing the humans to do so (which isn't).

## Solving the problem: prior modification

Here we'll try to solve the problem in the reward learning POMDP framework, but the solution is easy to generalise. Note that the changing of probabilities of A versus B is similar to biasing, so this approach will attempt to make the agent bias-indifferent in a sensible way.

To do this, take a standard reward learning agent, with posterior and prior on the POMDP .

The agent will be modified as follows: it will be given a different prior (which will ultimately lead to a different posterior ).

The prior and will start off equal on the empty history:

• .

Then the will be defined inductively. Recall that action is unbiased given history if . Define the bias of as:

• .

Thus is the bias of action , as measured between and the expectation of . This bias is used as a corrective term to , to make the agent suitably indifferent to biasing actions.

Indifference in this setting is defined by three criteria:

1. The agent cannot benefit from a biasing action, if the agent gets no further information about the correct reward.
2. The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is.
3. Given the above, maintains the distribution and structure of as much as possible.

The first criterion is implied by for all actions , i.e. being unbiased (as if there is no further information about the correct reward, then there is a single well defined , conditional on and , and being unbiased means that this is equal to ).

Then, given that is unbiased, the second criterion simply means that remains a distribution over (as the consequence of perfect information is just a weighted average of the pure ' points). Which seems obvious. What would be but a distribution over ? See the next sections for what this criterion really means.

# Indifference for small biases

Let be the set of possible future values of (given the history and the action ). Note that is a subset of the simplex , the set of probability distributions on .

The bias is 'small' if for all , then is also an element of the simplex .

In that case, is simply defined as . By assumption, this is an element of . The expectation of this expression is:

• .

Thus is unbiased.

And since this is simply a translation, it preserves the structure of , so it satisfies all the criteria above.
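The translation can be checked numerically. The following is a minimal sketch, not the post's own notation: the formulas above were lost in extraction, so the names `current`, `probs`, `futures` and the sign convention (bias = expected future prior minus current prior) are assumptions chosen to match the prose.

```python
def expectation(probs, dists):
    """Probability-weighted average of a list of distributions."""
    n = len(dists[0])
    return [sum(q * d[k] for q, d in zip(probs, dists)) for k in range(n)]


def small_bias_correction(current, probs, futures, tol=1e-12):
    """Translate every possible future prior by the bias of the action.

    current: prior over reward functions at the present history (assumed name)
    probs:   probability of each possible future value of the prior
    futures: the possible future values of the prior after the action
    """
    mean = expectation(probs, futures)
    # Assumed sign convention: bias = expected future prior - current prior.
    bias = [m - c for m, c in zip(mean, current)]
    corrected = [[x - b for x, b in zip(f, bias)] for f in futures]
    # The bias is "small" if every translated point is still a distribution.
    # Entries still sum to 1 (the bias sums to 0), so only signs need checking.
    is_small = all(min(c) >= -tol for c in corrected)
    return bias, corrected, is_small


# Hypothetical numbers: two reward functions, two equally likely futures.
current = [0.5, 0.5]
probs = [0.5, 0.5]
futures = [[0.9, 0.1], [0.3, 0.7]]
bias, corrected, is_small = small_bias_correction(current, probs, futures)
new_mean = expectation(probs, corrected)
```

With these numbers the action shifts the expected weight on the first reward function from 0.5 to 0.6; after the translation the expectation is back at the current prior, while the spread between the two futures is unchanged.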

# Indifference for large biases

If the bias is large, in that there exists a possible value of with not a point on , then we need to proceed differently.

As before, let be the set of possible future values of (given the history and the action ), and for , let be the probability of , given and .

Then we want to replace with , where is as 'close to' as possible. Since embeds in , the Euclidean metric on the latter restricts to the former.

Thus consider the constrained optimisation problem for :

• Minimise subject to: #. , #. .

Then define as .

If we see and as random variables dependent on and , the optimisation problem is the same as saying that is bias-free while has minimised variance.

The constraints are not contradictory: for instance will satisfy them. In fact they are all affine constraints. Then there must exist a unique set of elements that minimise the strictly convex quadratic function.

And obviously, if is always in , then is the optimal solution, so this optimisation reproduces the 'small biases' case.
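To make the optimisation concrete, here is a sketch for the special case of two reward functions, where a distribution is a single number in [0, 1] (the weight on the first reward function). It assumes the objective is the probability-weighted squared distance, which is what the minimised-variance reading suggests; the function and argument names are hypothetical. In this case the optimum is a common shift of every point, clipped to [0, 1], and the shift can be found by bisection.

```python
def large_bias_correction(current, probs, futures, iters=200):
    """Two-reward case of the constrained optimisation (assumed form).

    Minimises sum_i probs[i] * (x[i] - futures[i])**2
    subject to 0 <= x[i] <= 1 and sum_i probs[i] * x[i] == current.
    The optimum has the form x[i] = clip(futures[i] - shift, 0, 1).
    """
    def clip(v):
        return min(1.0, max(0.0, v))

    def mean_after(shift):
        return sum(q * clip(p - shift) for q, p in zip(probs, futures))

    # shift = -1 sends every point to 1, shift = +1 sends every point to 0,
    # and mean_after is non-increasing in between, so bisection applies.
    lo, hi = -1.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_after(mid) > current:
            lo = mid  # expectation still too high: shift further down
        else:
            hi = mid
    shift = (lo + hi) / 2
    return [clip(p - shift) for p in futures]


# Small bias: no clipping happens, so this is just the translation.
small = large_bias_correction(0.5, [0.5, 0.5], [0.9, 0.3])

# Large bias (hypothetical numbers): the first reward function is chosen
# only in a one-in-a-million event, while its current weight is 0.5.
q = 1e-6
large = large_bias_correction(0.5, [q, 1 - q], [1.0, 0.0])
large_mean = q * large[0] + (1 - q) * large[1]
```

In the second example a pure translation would push the rare-event point above 1, so that point is held at 1 and the common point is raised to just under 0.5; the expectation stays at the current weight of 0.5.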

# The final values

This alternate prior leads to an alternate posterior , simply defined by having it equal to on complete histories: .

# Another alternative

It should be noted that if we're willing to drop the condition 'The agent cannot benefit from a biasing action, if the agent expects to immediately get perfect information about what the correct reward is', then there's a simpler solution: simply always define as , applying the solution for small biases to large biases.

This means that (and ultimately ) need not be elements of . However, can still define a reward the agent can optimise, in the following sense:

• Given a complete history , the agent will maximise the reward .

Since need not be in , some of these coefficients can be negative, but that still results in a consistent to maximise.
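As a small illustration, assume the reward being maximised is a coefficient-weighted sum of the candidate reward functions (the formula itself was lost in extraction, so this reading, and every name and number below, is an assumption). Coefficients outside the simplex still give a well-defined quantity to maximise:

```python
def combined_reward(coefficients, reward_values):
    """Weighted sum of candidate rewards; coefficients may be negative."""
    return sum(c * r for c, r in zip(coefficients, reward_values))


# Hypothetical coefficients that sum to 1 but lie outside the simplex.
coefficients = [1.5, -0.5]

# Hypothetical values of the two candidate rewards on two complete histories.
histories = {
    "history_a": [2.0, 1.0],
    "history_b": [1.0, -3.0],
}

scores = {name: combined_reward(coefficients, values)
          for name, values in histories.items()}
best = max(scores, key=scores.get)
```

Here `history_b` scores higher even though both candidate rewards are lower on it than on `history_a`: the negative coefficient means the agent gains when the second reward is low.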

## Properties of the approach

It's clear the agent is indifferent to bias, but notice that this doesn't prevent the agent from learning: once it gets an observation, can change significantly. It's just changes to its expectation that are controlled.

Notice also that the agent doesn't believe, or act as if it believed, anything unlikely: its bets will be accurate.

And it doesn't have the problem of lotteries. Assume that the agent has , and there is a lottery which the agent has one chance in a million of winning.

Then if it takes action which ensures that chooses if and only if it wins the lottery, then with probability it ends up with reward function and a won lottery, and with probability it ends up with reward function and a lost lottery. The expected reward function is still ; it has simply split this expectation differently across worlds where it's won or not won the lottery.

One thing that this approach doesn't solve is the issue of the agent not following the exact reward function the humans want it to follow, due to accumulated bias. But first note that this will typically encourage the agent not to bias its reward learning, as it will tend to get higher reward when the humans agree with the agent's reward function. Note secondly that even if the agent manipulates the human values, at the end, to agree with its own, this manipulation, in expectation, simply undoes previous manipulations the agent has done (which caused the biasing in the first place).

Those who find this still unsatisfactory can wait for the next post, where the agent is not simply indifferent to biasing actions, but is penalised for them.

## Indifference and bias

Why has indifference been connected with bias, rather than the more general influence? Simply because the evidential counterfactual has problems with bias, so bias needs to be corrected first (the causal counterfactual is unbiased and uninfluenceable).

Indeed, we can generalise this solution to the influence problem, where it becomes the counterfactual approach (which I used to call stratification, before I realised what it was). See subsequent posts for this.


If we apply this to the shutdown problem, is it acceptable to say:

If not, what would you set to? (I'm treating and as reward functions here which seems fine)

For policies/actions that don't affect the probability of humans pressing the button, .

For actions that do affect the probability a little bit, the effect of will be to undo this, by, for instance, slightly increasing the probability of given the button was pressed.

I'm not completely sure what multiple actions with large changes of probability would lead to (in expectation, nothing, but in actual fact...)

Hmm... I'm finding that I'm unable to write down a simple shutdown problem in this framework (e.g. an environment where it should switch between maximizing paperclips and shutting down) to analyze what this algorithm does. To know what the algorithm does, I need to know what and are (since these are parameters of the algorithm). From those I can derive and to determine the agent's action. But at the moment I have no way of proceeding, since I don't know what and are. Can you get me unstuck?

Suppose the humans have already decided whether to press the shutdown or order the AI to maximise paperclips. If is the observation of the shutdown command and the observation of the paperclip maximising command, and and the relevant utilities, then can be defined as and , for all histories .

Then define as the probability of versus , conditional on the fact that the agent follows a particular deterministic policy .

If the agent does indeed follow , then . If it varies from this policy, then is altered in proportion to the expected change in caused by choosing a different action.

It seems like you're taking to be a real number. Is a specific event?

Because of the specific time restriction, there is no way to randomise the outcome ahead of time. And because it’s assumed tied to a specific physical event, there is no way to influence it at all. The whole physical definition and apparatus serve the purpose of making biasing the only way to affect the result.

Hmm, I don't understand. Of course it is possible to influence the button push without biasing it (e.g. create a robot that flips a coin and then pushes or doesn't push the button). And of course it's not possible to influence the quantum event in any way (including by biasing it). So I don't see any event that can't be influenced in any way except by biasing it.

The way I'm using the term, unbiased influence involves replacing the stochastic event with another one that has same mean. But since (or if) the quantum event is specifically defined in the process, this can't be done.

Let me ask a more specific question. In your setup with the quantum event and the button , can you define the event such that:

1. The agent can influence by biasing .
2. The agent can't influence without biasing .

Clearly, and , so I don't know what is. (I interpreted you as saying there is such an ; let me know if this is incorrect)

You are correct and I'm wrong. The causal counterfactual is unbiased and uninfluenceable. The evidential counterfactual is both biased and influenceable. I'll correct the post.