Stuart Armstrong

Stuart Armstrong's Comments

Thinking About Filtered Evidence Is (Very!) Hard

Is there any meaningful distinction between filtered evidence and lying? I know that in toy models these can be quite different, but in the expansive setting here, where the speaker can select the most misleading technically true fact, is there any major difference?

And how would the results here look if we expanded it to allow the speaker to lie?

ACDT: a hack-y acausal decision theory

There are some minor differences; your approach learns the whole model, whereas mine assumes the model is given, and learns only the "acausalish" aspects of it. But they are pretty similar.

One problem you might have is learning the acausal stuff in the mid-term. If the agent learns that causality exists, and then that in the Newcomb problem it seems to have a causal effect, then it may search a lot for the causal link. Eventually this won't matter (see here), but in the mid-term it might be a problem.

Or not. We need to test more ^_^

Writeup: Progress on AI Safety via Debate

Very impressive work, both the output and how you iterate on it.

Some thoughts about the cross-examination issue, prompted by your "Implementation 2 for human debaters: teams of two". It occurred to me that B* could win if it could predict A and B's future behaviour, and match up its answer with B's.

I'd prefer that such an option not exist; that B could answer the question directly, without needing to rewind. Hence prediction won't help.

Cross-examination still helps: A can cross-examine as soon as they suspect B is sheltering behind an ambiguity. This means that A might have to abandon their current line of questioning, and start again on the other one. This seems more secure (if longer).

And the AI would have got away with it too, if...

I mainly mentioned child-rulers because Robin was using that example; and I used "getting deposed" as an example of agency problems that weren't often (ever?) listed in the economics literature.

Attainable Utility Preservation: Scaling to Superhuman

More seriously, the reason I'm sceptical of impact measures is that they all seem to fail for the same reason. Unfortunately, I can't articulate that reason; it's the result of a long history of trying to build impact measures and trying to break them. I just have a feel for where the weaknesses are. So I knew that subagents would be a problem for AUP, long before I could articulate it formally.

But, as I said, I unfortunately can't formalise this feeling; it remains personal.

For this example, it was harder than usual to come up with a counter-example. And I was surprised that half of AUP survived fine - I would not have expected that a restriction against lowering your power would be unhackable. So consider these mild positives for your approach.

But my instinctive feeling remains: I hope that AUP can be made to work for superintelligences, but I expect that it won't :-(

Attainable Utility Preservation: Scaling to Superhuman

But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.

But if you look back at the history of impact measures, you'll see a lot of skulls; people say "this impact measure will work", and I say "I think you're overconfident; I'm sure I can find a way it fails". It then ends up being a solvable "break the impact measure" problem. So by Laplace’s Rule of Succession, I’d be surprised if this approach were The Ultimate Impact Measure That Works.
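For reference, Laplace's Rule of Succession estimates the chance that the next trial succeeds, after s successes in n trials, as (s + 1) / (n + 2). A minimal sketch of how both versions of the argument above use it; the tallies are hypothetical placeholders, not counts taken from the actual history of impact measures:

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate that the next trial succeeds: (s + 1) / (n + 2)."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# Hypothetical tally: 5 past objections, all of which turned out to be solvable.
p_next_objection_solvable = rule_of_succession(5, 5)

# Mirror-image hypothetical tally: 5 past impact measures, none of which survived.
p_next_measure_works = rule_of_succession(0, 5)

print(p_next_objection_solvable, p_next_measure_works)
```

With equal hypothetical track records, the two estimates are mirror images of each other, so the rule by itself does not settle which reference class is the right one.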


(In)action rollouts

I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What will do: forever (maybe with some extra info for the ).

What would counterfactually do (to define ): for turns (the rollout, to define ) followed by (to define the aspect). Call this .

The penalty is defined in terms of these , entirely counterfactual except for .

What must do: iff follows , then must ensure that .

So a key question is: how early can detect if is following ? If it can detect this immediately, then it can immediately implement the policy for ensuring that (for the stricter impact measure) or the much easier (for the original impact measure).

If it has to wait turns to be sure, then it needs to preserve its power to control many different .

Now, will never actually follow , so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of will determine the penalty; is rewarded for quickly gaining control of the rewards.

In practice, will signal early on that it's not following . Maybe there's a very small impact action, , with a tiny penalty. Then would like to use that to signal to that it's not in the counterfactual world where it is following . So long as 's first action isn't , that signal is enough (of course, if 's first action isn't , then is a sufficient signal).

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in.

The one mild constraint remaining comes from , which is not a counterfactually defined state (though is a counterfactually defined property of that state).

I'll try and write up a further clarification of this.

Stepwise inaction and non-indexical impact measures

For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.

This requires identifying what a subagent is in general, a very tricky unsolved problem (which I feel is unsolvable).

There's another issue; it's not enough to show that the subagent triggers a penalty. We need to show the penalty is larger than the penalty for not creating the subagent. Since the penalty is zero after the subagent is created, and since the subagent has very fine control over the rewards (much finer than actions that don't include creating an intelligent being), creating a subagent might incur a lower penalty than almost any other action.

It won't be a lower penalty than the agent doing nothing forever, of course. But we typically want the agent to do something, so we will calibrate the penalty or R_0 for that. And it's plausible that creating the subagent will have a lower penalty (and/or higher R_0) than any safe "something".
