This is a minor post, put up so that I can reference it in other posts.
You're an agent, with potential uncertainty over your reward function. You know you have to maximise

0.5R1 + 0.5R2,

where R1 and R2 are reward functions. What do you do?
Well, how do we interpret the 0.5s? Are they probabilities, saying which reward function is likely to be the right one? Or are they weights, giving the relative importance of each one? In fact, it makes no difference.
Thus, if you don't expect to learn any more reward function-relevant information, maximising reward given P(R1)=P(R2)=0.5 is the same as maximising the single reward function R3=0.5R1+0.5R2.
So, if we denote probabilities in bold, then maximising any of the following (given no reward-function learning) is equivalent:

- **0.5**R1 + **0.5**R2 (both 0.5s as probabilities),
- 0.5R1 + 0.5R2 (both 0.5s as weights),
- **0.5**R1 + 0.5R2 (one probability, one weight),
- **1**(0.5R1 + 0.5R2) (certainty about the single averaged reward R3).
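As a minimal sketch with made-up numbers (the rewards and actions below are hypothetical, not from the post), we can check that whether the 0.5s are read as probabilities over reward functions or as weights in a single combined reward, the optimal action is the same:

```python
# Rewards for each action under R1 and R2 (made-up toy values).
R1 = {"a": 3.0, "b": 1.0}
R2 = {"a": 0.0, "b": 4.0}
actions = ["a", "b"]

# Reading 1: probabilities P(R1) = P(R2) = 0.5 over which reward is "right",
# so the agent maximises expected reward.
def expected_reward(action):
    return 0.5 * R1[action] + 0.5 * R2[action]

# Reading 2: a single weighted reward R3 = 0.5*R1 + 0.5*R2.
R3 = {a: 0.5 * R1[a] + 0.5 * R2[a] for a in actions}

best_under_probabilities = max(actions, key=expected_reward)
best_under_weights = max(actions, key=lambda a: R3[a])
print(best_under_probabilities, best_under_weights)  # same action either way
```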
Now, given a probability distribution pR over reward functions, we can take its expectation E(pR). You can define this by talking about affine spaces and so on, but the simple version is: to take an expectation, rewrite every probability as a weight. So if pR = **p1**R1 + … + **pn**Rn, the result becomes

E(pR) = p1R1 + … + pnRn.
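The "rewrite every probability as a weight" operation can be sketched as follows. This is an illustrative representation I'm assuming (distributions as lists of (probability, reward) pairs, rewards as state-to-value dicts), not notation from the post:

```python
def expectation(p_R):
    """Collapse a distribution over reward functions into a single reward
    function, by turning each probability into a weight and summing."""
    states = {s for _, R in p_R for s in R}
    return {s: sum(p * R.get(s, 0.0) for p, R in p_R) for s in sorted(states)}

# Toy reward functions over two states.
R1 = {"s0": 1.0, "s1": 0.0}
R2 = {"s0": 0.0, "s1": 1.0}

# pR puts probability 0.5 on each; E(pR) uses those 0.5s as weights.
p_R = [(0.5, R1), (0.5, R2)]
print(expectation(p_R))  # {'s0': 0.5, 's1': 0.5}
```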
We've defined an unriggable learning process as one that respects conservation of expected evidence.
Now, conservation of expected evidence is about expectations. It basically says that, if π1 and π2 are two policies the agent could take, then for the probability distribution pR,
E(pR ∣π1)=E(pR ∣π2).
Suppose that pR is in fact riggable, and that we wanted to "correct" it to make it unriggable. Then we would want to add a correction term for each policy π. If we took π0 as a "default" policy, we could add a correction term to pR∣π:

(pR∣π) + E(pR∣π0) − E(pR∣π).
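At the level of expectations (which is all unriggability cares about), the corrected process has the same expectation under every policy, namely E(pR∣π0). A small numerical sketch, with made-up policies and reward values:

```python
def combine(*terms):
    """Weighted sum of reward functions, given as (weight, reward_dict) pairs."""
    states = {s for _, R in terms for s in R}
    return {s: sum(w * R.get(s, 0.0) for w, R in terms) for s in sorted(states)}

# Toy reward functions (made-up values at a single state).
R1 = {"s": 2.0}
R2 = {"s": 6.0}

# A riggable learning process: the expectation of pR depends on the policy.
E_pR = {
    "pi_0": combine((0.5, R1), (0.5, R2)),   # default policy pi_0
    "pi_1": combine((0.9, R1), (0.1, R2)),   # a policy that rigs the process
}

# Expectation of the corrected process:
# E(pR|pi) + E(pR|pi_0) - E(pR|pi) = E(pR|pi_0), for every policy pi.
def corrected(pi):
    return combine((1.0, E_pR[pi]), (1.0, E_pR["pi_0"]), (-1.0, E_pR[pi]))

print(corrected("pi_1") == corrected("pi_0"))  # True: no policy can shift it
```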
This would have the required unriggability property. But how do you add to a probability distribution, and how do you subtract from one?
But recall that unriggability only cares about expectations, and expectations treat probabilities as weights. Adding weighted reward functions is perfectly fine. Generally there will be multiple ways of doing this, mixing probabilities and weights.
For example, if (pR∣π) = **0.5**R1 + **0.5**R2 and (pR∣π0) = 0.75(R1−R2) + 0.25R2, then the correction term is E(pR∣π0) − E(pR∣π) = 0.25R1 − R2, and we can map (pR∣π) to, among other options,

**0.5**R1 + **0.5**R2 + (0.25R1 − R2),

keeping the probabilities and adding the correction purely as a weight, or to

**0.5**(1.25R1 − R2) + **0.5**(0.25R1),

folding the correction into the rewards inside each probability. Both have expectation 0.75R1 − 0.5R2 = E(pR∣π0).
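One can check numerically that different mixes of probabilities and weights all land on the same expectation. The two decompositions below are illustrative choices of mine, evaluated at made-up reward values:

```python
# Made-up values of R1 and R2 at some fixed state.
r1, r2 = 2.0, 6.0

# Target expectation: E(pR|pi_0) = 0.75*(R1 - R2) + 0.25*R2 = 0.75*R1 - 0.5*R2.
target = 0.75 * (r1 - r2) + 0.25 * r2

# Way 1: keep probabilities 0.5/0.5 on R1 and R2, and add the correction
# term E(pR|pi_0) - E(pR|pi) = 0.25*R1 - R2 as a pure weight.
way_1 = (0.5 * r1 + 0.5 * r2) + (0.25 * r1 - 1.0 * r2)

# Way 2: fold the correction into the rewards inside each probability,
# keeping the 0.5/0.5 probabilities but changing what they point at.
way_2 = 0.5 * (1.25 * r1 - r2) + 0.5 * (0.25 * r1)

print(way_1, way_2, target)  # all three agree
```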
This multiplicity of possibilities is what I was trying to deal with in my old post about reward function translations.