Stuart Armstrong

Stuart Armstrong's Comments

Attainable Utility Preservation: Scaling to Superhuman

More seriously, the reason I'm sceptical of impact measures is that it feels like they all fail for the same reason. Unfortunately, I can't articulate that reason; it's the result of a long history of trying to build impact measures and trying to break them. I just have a feel for where the weaknesses are. So I knew that subagents would be a problem for AUP, long before I could articulate it formally.

But, as I said, I unfortunately can't formalise this feeling; it remains personal.

For this example, it was harder than usual to come up with a counter-example. And I was surprised that half of AUP survived fine - I would not have expected that a restriction against lowering your power would be unhackable. So consider these mild positives for your approach.

But my instinctive feeling remains: I hope that AUP can be made to work for superintelligences, but I expect that it won't :-(

Attainable Utility Preservation: Scaling to Superhuman

But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.

But if you look back at the history of impact measures, you'll see a lot of skulls; people say "this impact measure will work", and I say "I think you're overconfident; I'm sure I can find a way it fails". It then ends up being a solvable "break the impact measure" problem. So by Laplace’s Rule of Succession, I’d be surprised if this approach were The Ultimate Impact Measure That Works.
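For concreteness: Laplace's Rule of Succession estimates the probability of success on the next trial as (s + 1) / (n + 2) after s successes in n trials. Here is a minimal sketch of that formula. The counts in the usage lines are made up for illustration; they are not a tally of actual impact measure proposals or of actual objections to AUP.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of P(next trial succeeds) after `successes` out of `trials`."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# Hypothetical counts, for illustration only.
# Eight proposals, none of which survived: chance the next one works.
p_next_works = rule_of_succession(0, 8)
# Eight objections, all of which turned out solvable: chance the next one is too.
p_next_solvable = rule_of_succession(8, 8)
print(p_next_works, p_next_solvable)
```

Both arguments apply the same rule to different track records, which is why they can point in opposite directions.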


(In)action rollouts

I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What will do: forever (maybe with some extra info for the ).

What would counterfactually do (to define ): for turns (the rollout, to define ) followed by (to define the aspect). Call this .

The penalty is defined in terms of these , entirely counterfactual except for .

What must do: iff follows , then must ensure that .

So a key question is: how early can detect if is following ? If it can detect this immediately, then it can immediately implement the policy for ensuring that (for the stricter impact measure) or the much easier (for the original impact measure).

If it has to wait turns to be sure, then it needs to preserve its power to control many different .

Now, will never actually follow , so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of will determine the penalty; is rewarded for quickly gaining control of the rewards.

In practice, will signal early on that it's not following . Maybe there's a very small impact action, , with a tiny penalty. Then would like to use that to signal to that it's not in the counterfactual world where it is following . So long as 's first action isn't , that signal is enough (of course, if 's first action isn't , then is a sufficient signal).

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in.

The one mild constraint remaining comes from , which is not a counterfactually defined state (though is a counterfactually defined property of that state).

I'll try and write up a further clarification of this.

Stepwise inaction and non-indexical impact measures

For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.

This requires identifying what a subagent is in general, a very tricky unsolved problem (which I feel is unsolvable).

There's another issue: it's not enough to show that the subagent triggers a penalty. We need to show the penalty is larger than the penalty for not creating the subagent. Since the penalty is zero after the subagent is created, and since the subagent has very fine control over the rewards (much finer than actions that don't include creating an intelligent being), creating a subagent might incur a lower penalty than almost any other action.

It won't be a lower penalty than the agent doing nothing forever, of course. But we typically want the agent to do something, so we will calibrate the penalty or R_0 for that. And it's plausible that creating the subagent will have lower penalty (and/or higher R_0) than any safe "something".
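To make the calibration worry concrete, here is a toy sketch. Everything in it is an assumption for illustration: the option names, the R_0 and penalty numbers, and the scoring rule (R_0 minus a weight times the penalty) are hypothetical, not taken from AUP or any actual impact measure.

```python
# Hypothetical options; all numbers are invented for illustration.
CANDIDATES = {
    "do_nothing": {"r0": 0.0, "penalty": 0.0},
    "safe_something": {"r0": 1.0, "penalty": 0.8},
    "create_subagent": {"r0": 1.5, "penalty": 0.3},
}


def score(option: dict, penalty_weight: float) -> float:
    """Assumed objective: primary reward minus the weighted impact penalty."""
    return option["r0"] - penalty_weight * option["penalty"]


def best_action(options: dict, penalty_weight: float) -> str:
    """Return the name of the option with the highest score (first one wins ties)."""
    return max(options, key=lambda name: score(options[name], penalty_weight))


# Sweep the penalty weight, as a designer calibrating the measure might.
for weight in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(weight, best_action(CANDIDATES, weight))
```

With these particular numbers, no penalty weight ever selects "safe_something": a small weight picks "create_subagent" and a large one picks "do_nothing". That is the shape of the problem, though whether real impact measures produce numbers like these is exactly what is in question.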

Stepwise inaction and non-indexical impact measures

The inaction rollouts effectively transform the stepwise inaction baseline into an inaction baseline (starting from the moment the subagent is created; thus the agent has a bit more control than in a true inaction baseline).

Therefore the results on the inaction baseline apply ( ).

This means that restrictions on increased power for the agent ("make sure you never have the power to increase the rewards") become restrictions on the actual policy followed for the subagent ("make sure you never increase these rewards").

Roughly, attainable utility becomes twenty billion questions.

For the original example, this means that the agent cannot press the red button nor gain the ability to teleport. But while the subagent cannot press the red button, it can gain the ability to teleport.

Subagents and impact measures: summary tables

AU and RR have the same problem, formally, in terms of excess power; it's just that AU wants low power and RR wants high power, so they don't have the same problem in practice.

Stepwise inaction and non-indexical impact measures

It needs to benefit for ; not necessarily for or alone.

It seems to me that, generically, since is supposed to be a hindrance to the agent, taking a few turns to then neutralise should be beneficial.

Appendix: how a subagent could get powerful

Another relevant post: it seems that the subagent need not be constrained at all, except on the first action.
