There are three natural reward functions that are plausible:
R(0), which is linear in the number of times B(0) is pressed.
R(1), which is linear in the number of times B(1) is pressed.
R(2)=I(E,X)R(0)+I(O,X)R(1), where I(E,X) is the indicator function for X being pressed an even number of times, and I(O,X)=1−I(E,X) is the indicator function for X being pressed an odd number of times.
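To make the three definitions concrete, here is a minimal Python sketch. It assumes "linear" means unit slope with zero offset, which the text does not specify; the function and argument names are hypothetical and chosen only for illustration.

```python
def indicator_even(presses_x: int) -> int:
    """I(E,X): 1 if X has been pressed an even number of times, else 0."""
    return 1 if presses_x % 2 == 0 else 0


def r0(presses_b0: int) -> float:
    """R(0): linear in the number of B(0) presses (assumed slope 1, offset 0)."""
    return float(presses_b0)


def r1(presses_b1: int) -> float:
    """R(1): linear in the number of B(1) presses (assumed slope 1, offset 0)."""
    return float(presses_b1)


def r2(presses_b0: int, presses_b1: int, presses_x: int) -> float:
    """R(2) = I(E,X) * R(0) + I(O,X) * R(1), with I(O,X) = 1 - I(E,X)."""
    even = indicator_even(presses_x)
    odd = 1 - even
    return even * r0(presses_b0) + odd * r1(presses_b1)
```

Under these assumptions, R(2) coincides with R(0) whenever X has been pressed an even number of times (including zero) and with R(1) otherwise, so each press of X switches which button the reward tracks.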
Why are these reward functions "natural" or more plausible than, say, R(3)=c (some constant, independent of button presses) or R(4)=R(0)+R(1) (the total number of button presses)?