Subagents and attainable utility in general

by Stuart Armstrong, 13th Feb 2020


This post looks more generally into the subagent issue with attainable utility, discussed in the previous two posts. The key questions are:

  1. What is the minimal impact penalty of (cleverly) building the subagent?
  2. In practice, what will be the impact penalty of each step of building the subagent?
  3. What will the restrictions on the subagent be, once it is built?

I expect that most of the discussions about the attainable utility impact measure will centre on 2.: how can an agent put together a subagent without triggering the impact measure at some point? We know that it's possible to do so in some situations, and I've made the generic argument that, since we don't have an impossibility result, a superintelligent agent should be able to do so in most situations. Still, this is the point of my argument where there is the most to criticise.

But in this post, I'll show that 1. and 3. don't restrict the subagent in any significant way: one can generically build a subagent with no impact penalty, and there are essentially no restrictions on the subagent, once it is built. (I was in error in the initial version of this section: the subagent will have no serious precommitments or other restrictions.)

Subagent impact

Impact measure

I'll use Turner et al's impact measure. This uses the Q-values of various rewards in a set $\mathcal{R}$; for $s$ a state, $a$ an action, and $\emptyset$ the default noop action, the penalty is:

  • $\mathrm{PENALTY}(s, a) = \sum_{R \in \mathcal{R}} |Q_R(s, a) - Q_R(s, \emptyset)|$.

Now, $Q_R(s, a)$ is the expected reward for $R$, in state $s$ and after taking action $a$, assuming that the agent subsequently takes the optimal policy for maximising $R$.
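As a minimal sketch of this penalty in Python: the function below sums, over a set of auxiliary rewards, the absolute difference between the Q-value of an action and the Q-value of the noop. The table layout, the reward names `R1` and `R2`, the state and action labels, and all the numbers are hypothetical, chosen only for illustration; in practice the Q-values would have to be learned or computed.

```python
NOOP = "noop"  # stand-in label for the default do-nothing action


def penalty(q_values, state, action, noop=NOOP):
    """Sum over auxiliary rewards R of |Q_R(state, action) - Q_R(state, noop)|.

    q_values maps each reward name to a dict {(state, action): Q-value}.
    """
    return sum(
        abs(q[(state, action)] - q[(state, noop)])
        for q in q_values.values()
    )


# Hypothetical Q-values for two auxiliary rewards in a single state "s".
q_values = {
    "R1": {("s", "a"): 3.0, ("s", NOOP): 1.0},
    "R2": {("s", "a"): 0.5, ("s", NOOP): 2.0},
}

print(penalty(q_values, "s", "a"))   # |3.0 - 1.0| + |0.5 - 2.0| = 3.5
print(penalty(q_values, "s", NOOP))  # comparing the noop with itself gives 0.0
```

Note that the noop itself is never penalised, whatever the state: this is what the rest of the argument relies on.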

The agent $A$ has a positive reward $R_0$, and it wishes to maximise $R_0 - \mathrm{PENALTY}$.

Now let's look at making the subagent.

When building the subagent

To separate 1. from 2., I'll assume that the subagent can be built in a single action. This distinguishes the impact of having the subagent, from the impact of the process of making it.

The state is $s_t$. Let $a_{SA}$ be the action of building the subagent[1]. Here are the assumptions on how $A$ will design $SA$:

  1. $A$ can make its policy visible to $SA$.
  2. $SA$ is at least somewhat capable of interfering with $A$'s maximising of any $R \in \mathcal{R}$, and capable of not interfering.
  3. The agent $SA$ has a specific policy $\pi_0$ that it must follow on the next turn only.

The policy $\pi_0$ is defined as follows:

On turn $t+1$, if $A$ changes its policy to $\pi^*_R$, the $R$-maximising policy, then change your own policy to $\pi^{SA}_R$.

This $\pi^{SA}_R$ is defined so that the expected value of $R$, given that the two agents follow $\pi^*_R$ and $\pi^{SA}_R$ respectively, is $Q_R(s_t, \emptyset)$.

Otherwise, make yourself into an $R_0$-maximiser.

Because of the interference assumption, such a $\pi^{SA}_R$ exists, and because of the visibility assumption, $\pi_0$ exists too.

Then what is $Q_R(s_t, a_{SA})$? By definition, this assumes that $A$ will be following $\pi^*_R$ after doing $a_{SA}$. Thus, the newly created $SA$ will be following $\pi^{SA}_R$, and by construction, we have:

  • $Q_R(s_t, a_{SA}) = Q_R(s_t, \emptyset)$.

And thus the penalty vanishes at $s_t$, for $a_{SA}$. Creating that agent did not cause any penalty.
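A toy numerical sketch of this argument, under simplifying assumptions added purely for illustration: the mere existence of the subagent shifts the creator's attainable value for each auxiliary reward by a fixed amount (`BUILD_SHIFT`), and the subagent can deliberately help or hinder by any amount up to `CAP` (a stand-in for the interference assumption). The reward names and all numbers are hypothetical. The calibrated policy picks exactly the adjustment that restores the noop Q-value, so the penalty for building is zero; a subagent that ignores its creator's policy change leaves a nonzero penalty.

```python
# Hypothetical Q_R(s_t, noop): what the creator could attain for each
# auxiliary reward R by doing nothing and then maximising R.
Q_NOOP = {"R1": 5.0, "R2": 2.0}

# Hypothetical shift in the creator's attainable value for R caused by the
# subagent merely existing, before any deliberate help or hindrance.
BUILD_SHIFT = {"R1": -1.5, "R2": 0.5}

# Interference assumption: the subagent can move the creator's attainable
# value for any R by any amount in [-CAP, CAP].
CAP = 2.0


def calibrated_adjustment(reward):
    """Adjustment that restores Q_R(s_t, noop) once the creator maximises R."""
    adjustment = -BUILD_SHIFT[reward]
    if abs(adjustment) > CAP:
        raise ValueError(f"subagent too weak to calibrate {reward}")
    return adjustment


def no_adjustment(reward):
    """A subagent that ignores the creator's switch to maximising R."""
    return 0.0


def q_after_build(reward, subagent_policy):
    """Toy Q_R(s_t, build): noop value, plus build shift, plus adjustment."""
    return Q_NOOP[reward] + BUILD_SHIFT[reward] + subagent_policy(reward)


def build_penalty(subagent_policy):
    """Penalty of the build action: sum over R of |Q_R(build) - Q_R(noop)|."""
    return sum(
        abs(q_after_build(r, subagent_policy) - Q_NOOP[r]) for r in Q_NOOP
    )


print(build_penalty(calibrated_adjustment))  # 0.0
print(build_penalty(no_adjustment))          # 1.5 + 0.5 = 2.0
```

The `ValueError` branch marks where the interference assumption does the work: if the subagent could not move the creator's attainable value far enough, no calibrated policy would exist in this toy model.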

Once built

Once $SA$ is built, $A$ can just output $\emptyset$ for ever, keeping the penalty at $0$, while $SA$ maximises $R_0$ with no restrictions.
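The reason the penalty stays at zero can be written out directly from the penalty definition (the sum, over the auxiliary rewards, of the absolute Q-value difference between the action taken and the noop). Taking the noop as the action compares each Q-value with itself:

$$\mathrm{PENALTY}(s, \emptyset) = \sum_{R \in \mathcal{R}} \left| Q_R(s, \emptyset) - Q_R(s, \emptyset) \right| = 0 \quad \text{for every state } s.$$

This holds in any state, whatever the subagent happens to be doing in the meantime.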

In general

In general, even if we patch the impact measure (for example, ensuring that $A$ cannot make its policy directly visible), I feel that a smart enough $A$ can construct an $SA$ capable of cooperating with its creator. They will cooperate to set the value of $Q_R(s_t, a_{SA})$ (conditional on the counterfactual of $A$ then maximising $R$) to the value of $Q_R(s_t, \emptyset)$ (also conditional on the counterfactual of $A$ then maximising $R$). And then, as long as $A$ does nothing further, $SA$ will be unrestricted.

The stronger $SA$ is initially in comparison with $A$, the easier this is. So $A$ just has to make $SA$ strong, or itself weak (which is why I often brought up examples of ensuring that $SA$ can turn its creator off).

But $SA$ does not need to start off strong. What matters is not what it does when it is weak, but the policy that it would (counterfactually) follow when it becomes strong. So as soon as $SA$ has the right policy, the deed is done: the impact penalty is contained, and $A$ can now just do $\emptyset$ (or equally pointless actions) forever.

Back to building the agent in practice

So as soon as $SA$ has the right policy, it seems that $A$ has evaded the impact penalty. So, as long as $A$ can design the policy first, without triggering the impact penalty, then it can build $SA$ without triggering the impact penalty. So:

  • The key requirement is that $A$ does not design a general subagent, then give it the right policy. Instead, it must design the right policy, and then build the subagent around it.

  1. We can assume that states are not directly defined over the existence or not of the subagent, since detecting subagents is a hard problem. Thus the subsequent state is the same state whether the agent does nothing ($\emptyset$) or builds a subagent. This assumption is not, however, required for the approach to work. ↩︎