A simple and easy design for a u-maximising agent that turns into a u-minimising one.

Let X be some boolean random variable outside the agent's control, that will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ϵ, and for a given utility u consider the utility:

u# = (2/ϵ)·X·u − u.

Before t, the expected value of (2/ϵ)X is (2/ϵ)·ϵ = 2, so in expectation u# = 2u − u = u. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0, so a little evidence to that effect (anything pushing P(X=1) below ϵ/2) makes the expected coefficient on u negative, turning u# into a u-minimising utility.
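As a quick numerical check (my own illustration, not from the post), we can compute the expected weight on u in u# = (2/ϵ)·X·u − u as a function of the agent's credence P(X=1):

```python
def u_sharp_weight(p_x1: float, eps: float) -> float:
    """Return c such that E[u#] = c * u when the agent assigns P(X=1) = p_x1,
    for u# = (2/eps)*X*u - u."""
    return (2.0 / eps) * p_x1 - 1.0

eps = 0.01
print(u_sharp_weight(eps, eps))      # before t, P(X=1)=eps: weight 1.0, a u-maximiser
print(u_sharp_weight(eps / 4, eps))  # modest evidence for X=0: weight -0.5, a u-minimiser
```

The weight crosses zero exactly at P(X=1) = ϵ/2, which is why only a small amount of evidence for X=0 is needed after t.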

This isn't perfect corrigibility --- the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:

u# = Ω·(2/ϵ)·X·u − u.

If Ω is large, then the agent is willing to pay only very little u-value to maintain flexibility. However, the amount of evidence for X=0 that it needs to become a u-minimiser also grows in proportion to Ω --- the posterior P(X=1) must now drop below ϵ/(2Ω) --- so X had better be a clear and convincing event.
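The trade-off can be made concrete (this sketch is my own; the threshold ϵ/(2Ω) follows from setting the expected weight Ω·(2/ϵ)·P(X=1) − 1 to zero):

```python
def flip_threshold(eps: float, omega: float) -> float:
    """Posterior P(X=1) below which u# = omega*(2/eps)*X*u - u
    becomes u-minimising in expectation."""
    return eps / (2.0 * omega)

eps = 0.01
print(flip_threshold(eps, 1))    # 0.005  -- the original design (Omega = 1)
print(flip_threshold(eps, 100))  # 5e-05  -- large Omega: much stronger evidence required
```

So Ω tunes a single dial: larger values mean less u-value sacrificed before t, but a proportionally harder-to-reach evidence threshold after t.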
