A putative new idea for AI control; index here.

A simple and easy design for a $u$-maximising agent that turns into a $u$-minimising one.

Let $X$ be some boolean random variable outside the agent's control, that will be determined at some future time $t$ (based on a cosmic event, maybe?). Set it up so that $P(X=1)=\epsilon$, and for a given utility $u$ consider the utility:

  • $u^\# = (2/\epsilon)Xu - u$.

Before $t$, the expected value of $(2/\epsilon)X$ is $2$, so $u^\# = u$ in expectation. Hence the agent is a $u$-maximiser. After $t$, the most likely option is $X=0$, hence a little bit of evidence to that effect is enough to turn the $u^\#$-maximising agent into a $u$-minimiser.
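To make the switch explicit, here is a short worked calculation. It is only a sketch: it assumes the utility has the form $u^\# = (2/\epsilon)Xu - u$, that $X$ is independent of $u$ given the agent's information, and it writes $p$ (a symbol introduced here, not in the original) for the agent's current probability that $X=1$.

```latex
\begin{align*}
\mathbb{E}[u^\#] &= \frac{2}{\epsilon}\,\mathbb{E}[X]\,\mathbb{E}[u] - \mathbb{E}[u]
                  = \left(\frac{2p}{\epsilon} - 1\right)\mathbb{E}[u], \\
p = \epsilon \;&\Rightarrow\; \mathbb{E}[u^\#] = \mathbb{E}[u]
  \quad\text{(a $u$-maximiser, before $t$)}, \\
p < \frac{\epsilon}{2} \;&\Rightarrow\; \frac{2p}{\epsilon} - 1 < 0
  \quad\text{(a $u$-minimiser)}, \\
p = 0 \;&\Rightarrow\; \mathbb{E}[u^\#] = -\mathbb{E}[u].
\end{align*}
```

Under these assumptions, the agent's behaviour flips as soon as its probability of $X=1$ drops below half of its prior value.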

This isn't perfect corrigibility: the agent would be willing to sacrifice a bit of $u$-value (before $t$) in order to maintain its flexibility after $t$. To combat this effect, we could instead use:

  • $u^\# = \Omega(2/\epsilon)Xu - u$.

If $\Omega$ is large, then the agent is willing to pay very little $u$-value to maintain flexibility. However, the amount of evidence of $X=0$ that it needs to become a $u$-minimiser also grows in proportion to $\Omega$, so $X$ had better be a clear and convincing event.
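The trade-off can be checked numerically with a minimal sketch. It assumes the utility has the form $u^\# = \Omega(2/\epsilon)Xu - u$ and that $X$ is independent of $u$. The function names and the parameter `p` (the agent's current probability that $X=1$) are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def expected_coefficient(p, eps, omega=1):
    """Return c such that E[u#] = c * E[u] when P(X=1) = p.

    Assumes u# = omega * (2/eps) * X * u - u, with X independent of u.
    A positive c means the agent acts as a u-maximiser, a negative c
    means it acts as a u-minimiser.
    """
    return omega * Fraction(2) / eps * p - 1


def flip_threshold(eps, omega=1):
    """Probability of X=1 below which the coefficient turns negative."""
    return eps / (2 * omega)


eps = Fraction(1, 100)  # illustrative prior P(X=1); not a value from the post

for omega in (1, 10, 100):
    before = expected_coefficient(eps, eps, omega)  # coefficient at the prior
    after = expected_coefficient(0, eps, omega)     # coefficient once X=0 is certain
    # Factor by which P(X=1) must fall before the agent flips.
    evidence_factor = eps / flip_threshold(eps, omega)
    print(omega, before, after, evidence_factor)
```

In this sketch the coefficient before $t$ is $2\Omega - 1$, while certainty that $X=0$ always gives $-1$, so the post-$t$ stakes shrink relative to the pre-$t$ stakes as $\Omega$ grows. The factor by which the probability of $X=1$ must fall before the agent flips is $2\Omega$, which is the proportionality to $\Omega$ noted above.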
