This is a minor post, put up so that I can reference it in other posts.
You're an agent, with potential uncertainty over your reward function. You know you have to maximise

0.5R1 + 0.5R2,

where R1 and R2 are reward functions. What do you do?
Well, how do we interpret the 0.5s? Are they probabilities, saying which reward function is likely to be the right one? Or are they weights, giving the relative importance of each one? In fact, it makes no difference.
Thus, if you don't expect to learn any more reward function-relevant information, maximising reward given P(R1)=P(R2)=0.5 is the same as maximising the single reward function R3=0.5R1+0.5R2.
So, if we denote probabilities in bold, then maximising any of the following (given no reward-function learning) is equivalent:

- **0.5**R1 + **0.5**R2 (both 0.5s as probabilities),
- 0.5R1 + 0.5R2 (both 0.5s as weights),
- **0.5**R1 + 0.5R2 (one probability, one weight),
- **1**(0.5R1 + 0.5R2) (certainty about the single averaged reward R3).
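As a minimal sketch with made-up numbers (the rewards and actions below are hypothetical, not from the post), we can check that whether the 0.5s are read as probabilities over reward functions or as weights in a single combined reward, the optimal action is the same:

```python
# Rewards for each action under R1 and R2 (made-up toy values).
R1 = {"a": 3.0, "b": 1.0}
R2 = {"a": 0.0, "b": 4.0}
actions = ["a", "b"]

# Reading 1: probabilities P(R1) = P(R2) = 0.5 over which reward is "right",
# so the agent maximises expected reward.
def expected_reward(action):
    return 0.5 * R1[action] + 0.5 * R2[action]

# Reading 2: a single weighted reward R3 = 0.5*R1 + 0.5*R2.
R3 = {a: 0.5 * R1[a] + 0.5 * R2[a] for a in actions}

best_under_probabilities = max(actions, key=expected_reward)
best_under_weights = max(actions, key=lambda a: R3[a])
print(best_under_probabilities, best_under_weights)  # same action either way
```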
Now, given a probability distribution pR over reward functions, we can take its expectation E(pR). You can define this by talking about affine spaces and so on, but the simple version is: to take an expectation, rewrite every probability as a weight. So if pR = **p1**R1 + … + **pn**Rn, the result becomes

E(pR) = p1R1 + … + pnRn.
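The "rewrite every probability as a weight" operation can be sketched as follows. This is an illustrative representation I'm assuming (distributions as lists of (probability, reward) pairs, rewards as state-to-value dicts), not notation from the post:

```python
def expectation(p_R):
    """Collapse a distribution over reward functions into a single reward
    function, by turning each probability into a weight and summing."""
    states = {s for _, R in p_R for s in R}
    return {s: sum(p * R.get(s, 0.0) for p, R in p_R) for s in sorted(states)}

# Toy reward functions over two states.
R1 = {"s0": 1.0, "s1": 0.0}
R2 = {"s0": 0.0, "s1": 1.0}

# pR puts probability 0.5 on each; E(pR) uses those 0.5s as weights.
p_R = [(0.5, R1), (0.5, R2)]
print(expectation(p_R))  # {'s0': 0.5, 's1': 0.5}
```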
We've defined an unriggable learning process as one that respects conservation of expected evidence.
Now, conservation of expected evidence is about expectations. It basically says that, if π1 and π2 are two policies the agent could take, then for the probability distribution pR,
E(pR ∣π1)=E(pR ∣π2).
Suppose that pR is in fact riggable, and that we wanted to "correct" it to make it unriggable. Then we would want to add a correction term for each policy π. If we took π0 as a "default" policy, we could add a correction term to pR∣π:

(pR∣π) + E(pR∣π0) − E(pR∣π).
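At the level of expectations (which is all unriggability cares about), the corrected process has the same expectation under every policy, namely E(pR∣π0). A small numerical sketch, with made-up policies and reward values:

```python
def combine(*terms):
    """Weighted sum of reward functions, given as (weight, reward_dict) pairs."""
    states = {s for _, R in terms for s in R}
    return {s: sum(w * R.get(s, 0.0) for w, R in terms) for s in sorted(states)}

# Toy reward functions (made-up values at a single state).
R1 = {"s": 2.0}
R2 = {"s": 6.0}

# A riggable learning process: the expectation of pR depends on the policy.
E_pR = {
    "pi_0": combine((0.5, R1), (0.5, R2)),   # default policy pi_0
    "pi_1": combine((0.9, R1), (0.1, R2)),   # a policy that rigs the process
}

# Expectation of the corrected process:
# E(pR|pi) + E(pR|pi_0) - E(pR|pi) = E(pR|pi_0), for every policy pi.
def corrected(pi):
    return combine((1.0, E_pR[pi]), (1.0, E_pR["pi_0"]), (-1.0, E_pR[pi]))

print(corrected("pi_1") == corrected("pi_0"))  # True: no policy can shift it
```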
This would have the required unriggability property. But how do you add to a probability distribution, and how do you subtract from one?
But recall that unriggability only cares about expectations, and expectations treat probabilities as weights. Adding weighted reward functions is perfectly fine. Generally there will be multiple ways of doing this, mixing probabilities and weights.
For example, if (pR∣π) = **0.5**R1 + **0.5**R2 and (pR∣π0) = 0.75(R1−R2) + 0.25R2, then the correction term is E(pR∣π0) − E(pR∣π) = 0.25R1 − R2, and we can map (pR∣π) to, among other options,

**0.5**R1 + **0.5**R2 + (0.25R1 − R2),

keeping the probabilities and adding the correction purely as a weight, or to

**0.5**(1.25R1 − R2) + **0.5**(0.25R1),

folding the correction into the rewards inside each probability. Both have expectation 0.75R1 − 0.5R2 = E(pR∣π0).
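One can check numerically that different mixes of probabilities and weights all land on the same expectation. The two decompositions below are illustrative choices of mine, evaluated at made-up reward values:

```python
# Made-up values of R1 and R2 at some fixed state.
r1, r2 = 2.0, 6.0

# Target expectation: E(pR|pi_0) = 0.75*(R1 - R2) + 0.25*R2 = 0.75*R1 - 0.5*R2.
target = 0.75 * (r1 - r2) + 0.25 * r2

# Way 1: keep probabilities 0.5/0.5 on R1 and R2, and add the correction
# term E(pR|pi_0) - E(pR|pi) = 0.25*R1 - R2 as a pure weight.
way_1 = (0.5 * r1 + 0.5 * r2) + (0.25 * r1 - 1.0 * r2)

# Way 2: fold the correction into the rewards inside each probability,
# keeping the 0.5/0.5 probabilities but changing what they point at.
way_2 = 0.5 * (1.25 * r1 - r2) + 0.5 * (0.25 * r1)

print(way_1, way_2, target)  # all three agree
```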
This multiplicity of possibilities is what I was trying to deal with in my old post about reward function translations.