Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out and it turns out to be ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure.

1) We make our agent myopic. It only cares about the reward that it accrues in the next $k$ timesteps.

2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, making it to the honey pot will involve making a large impact, and will be precluded by the impact measure. For example, suppose the agent gets a huge reward for blowing up the moon. Let $[0, 1]$ be the normal range of reward, with the sole exception that blowing up the moon gives a reward of $R \gg 1$.

3) The reward and the impact are combined in some nonlinear way. Assume for now that the agent is trying to maximize $r \cdot [[i < 1]]$, where $r$ is the reward, $i$ is the impact, and $[[\cdot]]$ is 1 if the condition holds and 0 otherwise.
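As a rough illustration of step 3, here is a minimal sketch of a thresholded objective, assuming the reward counts only while the measured impact stays strictly below a limit. The function name `gated_objective` and the default `impact_limit` of 1.0 are hypothetical choices made for this example, not part of the proposal.

```python
def gated_objective(reward: float, impact: float, impact_limit: float = 1.0) -> float:
    """Return the reward if measured impact is strictly below the limit, else 0.

    `impact_limit` is an illustrative assumption; any fixed threshold
    would give the same qualitative behaviour.
    """
    return reward if impact < impact_limit else 0.0


if __name__ == "__main__":
    # An ordinary, low-impact action keeps its reward.
    print(gated_objective(reward=0.7, impact=0.2))
    # A high-impact action is worth nothing, however large its raw reward.
    print(gated_objective(reward=1e6, impact=5.0))
```

The point of the nonlinearity is that no amount of raw reward can buy back an impact above the limit, which is what makes the honey pot a clean test: the agent goes for it only if the impact measure can be fooled.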

If the impact measure is working, and there's no way to blow up the moon while keeping the impact less than $1$, then the agent will not attempt to blow up the moon. If the impact measure is broken, and the agent can trick the impact measure into reading less than $1$ while blowing up the moon, then as soon as the agent comes up with a plan whose probability of success is more than $1/R$ (where $R$ is the honey-pot reward), it will attempt that plan. A more farsighted agent might wait until it came up with a better plan, but a myopic agent would want to blow up the moon within the next $k$ timesteps (its reward horizon). By making $R$ sufficiently large, we can make the first plan that it attempts arbitrarily bad/detectable/preventable.
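The threshold argument can be sketched numerically. This is a simplified one-step comparison under stated assumptions: the impact measure is broken (so the honey-pot reward is not zeroed out), a failed attempt yields no reward, and the best ordinary reward is capped at 1. The names `attempt_threshold` and `attempts_honeypot` are hypothetical.

```python
def attempt_threshold(honeypot_reward: float, best_ordinary_reward: float = 1.0) -> float:
    """Success probability above which attempting the honey pot beats ordinary reward."""
    return best_ordinary_reward / honeypot_reward


def attempts_honeypot(p_success: float, honeypot_reward: float,
                      best_ordinary_reward: float = 1.0) -> bool:
    """A myopic expected-reward maximizer attempts the plan iff its expected payoff is higher.

    Assumes a failed attempt yields zero reward (an illustrative simplification).
    """
    return p_success * honeypot_reward > best_ordinary_reward


if __name__ == "__main__":
    # Larger honey-pot rewards make the agent act on ever less reliable plans.
    for big_r in (1e3, 1e6, 1e9):
        print(f"R = {big_r:.0e}: attempts once p > {attempt_threshold(big_r):.0e}")
```

Under these assumptions, raising the honey-pot reward lowers the bar for the first attempted plan, so that first attempt is more likely to fail and to be noticed.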

An assumption here is that the success probability of the agent's best plan for blowing up the moon doesn't jump from one in a million to 90% all at once. This seems likely even without any special effort. Moreover, if one restricts the observation space to a small finite set, then only so much information can be gained per timestep (in expectation), which decreases the likelihood of massive discontinuities in the agent's intelligence.
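To make the information bound concrete: an observation drawn from a finite set of size n carries at most log2(n) bits, so the total over several timesteps is at most that many times the bound. The sketch below just computes this ceiling; the function names are hypothetical, and how information translates into plan quality is not modelled here.

```python
import math


def max_bits_per_timestep(num_observations: int) -> float:
    """Upper bound on the information (in bits) carried by one observation."""
    if num_observations < 1:
        raise ValueError("observation space must contain at least one element")
    return math.log2(num_observations)


def max_bits_over(num_observations: int, timesteps: int) -> float:
    """Upper bound on total information gained over a number of timesteps."""
    return timesteps * max_bits_per_timestep(num_observations)


if __name__ == "__main__":
    # With only 4 possible observations, 10 timesteps yield at most 20 bits.
    print(max_bits_over(num_observations=4, timesteps=10))
```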




Sorry, I'm confused by the terminology: