This is how I design my counterfactuals: take some stochastic event that the AI cannot manipulate. This could be a (well-defined) chaotic process, the result of a past process that has been recorded but not yet revealed, or perhaps something to do with the AI's own decisions, calibrated so that the AI cannot access the information.

Then I set up the world so that what we care about depends on that stochastic event. So, for instance, the output of an oracle is erased (before being read) depending on the outcome, or the AI's utility gets changed if one particular value comes up (in conjunction with something else).

I then define the counterfactual on the stochastic process. So if X=1 implies the AI changes its utility, then the counterfactual is simply X=0. We can set the probability so that X=1 is almost certain, but X=0 is still possible.
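A minimal sketch of the setup, in Python (the names, the probability, and the toy utilities are my own illustrative assumptions, not part of the proposal itself):

```python
import random

P_X1 = 0.999999  # X=1 is almost certain, but X=0 is possible

def original_utility(action):
    # Placeholder for the AI's utility before any change.
    return {"a": 1.0, "b": 0.0}[action]

def modified_utility(action):
    # Placeholder for the utility the AI switches to when X=1.
    return {"a": 0.0, "b": 1.0}[action]

def sample_X():
    """The stochastic event the AI cannot manipulate or observe in advance."""
    return 1 if random.random() < P_X1 else 0

def realised_utility(action, X):
    """The world is wired so the utility change only happens when X=1."""
    return modified_utility(action) if X == 1 else original_utility(action)

def counterfactual_evaluation(action):
    """The counterfactual X=0 world: evaluate the action as if the
    utility change never occurs, even though X=1 is almost certain."""
    return realised_utility(action, X=0)
```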

This seems to me the cleanest way of defining counterfactuals. Any obvious (or less obvious) problems?


As Nate pointed out, using these counterfactuals to make decisions gives you CDT. I posted an implementation of CDT using this idea a while ago. Here is a more general idea along these lines. It's also used here and here.

In general, conditioning on extremely low probability events can lead to strange beliefs (e.g. skeptical hypotheses), which places lower bounds on the safe probability of X=0. If you try to give the AI a really confident model of the world, to eliminate skeptical hypotheses, then stuff can get weird fast.
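To make the skeptical-hypothesis point concrete, here is a toy Bayes calculation (the specific numbers are illustrative assumptions of mine, not anything from the thread):

```python
# A "skeptical" hypothesis H says the setup is broken, so X=0 would not be rare.
p_H = 1e-6            # prior on the skeptical hypothesis
p_M = 1 - p_H         # prior on the intended world model
p_X0_given_H = 0.5    # under H, X=0 is unsurprising
p_X0_given_M = 1e-8   # under the intended model, X=0 is extremely rare

posterior_H = (p_X0_given_H * p_H) / (p_X0_given_H * p_H + p_X0_given_M * p_M)
print(posterior_H)    # ~0.98: after conditioning on X=0, the skeptical
                      # hypothesis dominates despite its tiny prior
```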

More seriously, the behavior of these agents is very sensitive to the behavior of humans in extremely improbable cases. For example, you proposed having a button that operates the explosives with probability 99.9999%, and an AI who erroneously believes that the explosives will never go off (or who just makes decisions based on what will happen if the explosives never go off; since the two systems have formally identical behavior, it doesn't seem worth distinguishing). But what do the humans actually do when the button that is supposed to work 99.9999% of the time doesn't? Without careful measures, they probably panic and turn off the machine anyway.

In general, it seems better to avoid conditioning on extremely unlikely events for both of these reasons. But conditioning on reasonably likely counterfactuals is a natural thing to do. And for some use cases, this may not be a problem.

Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you're suggesting) it basically just implements CDT.

For example, in Newcomb's problem, suppose X=1 implies that Omega's prediction is correct and X=0 implies that the agent won't necessarily act as predicted. If the agent acts conditioned on X=0, then it will two-box.
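A rough payoff sketch of that case (my own illustration; conditioning on X=0 is modelled here as holding Omega's prediction, and hence box B's contents, fixed while the action varies):

```python
# Newcomb payoffs: box B contains $1M iff Omega predicted one-boxing.
def payoff(action, box_b_full):
    big = 1_000_000 if box_b_full else 0
    small = 1_000 if action == "two-box" else 0
    return big + small

# With box B's contents held fixed, two-boxing dominates either way,
# so the X=0-conditioned agent two-boxes (the CDT answer).
for box_b_full in (True, False):
    print(box_b_full, payoff("one-box", box_b_full), payoff("two-box", box_b_full))
```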

I'm not sure I understand this.

The example I was thinking of was that, instead of e.g. conditioning on "the button wasn't pressed" in corrigibility, you have corrigibility implemented only if the button is pressed AND X=1. Then the counterfactual is just X=0.
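Concretely, something like the following sketch (illustrative names and toy utilities; it is the same construction as in the post, but with the button conjoined to X):

```python
def normal_utility(action):
    return {"work": 1.0, "shut_down": 0.0}[action]

def shutdown_utility(action):
    return {"work": 0.0, "shut_down": 1.0}[action]

def effective_utility(action, button_pressed, X):
    # Corrigibility only takes effect if the button is pressed AND X=1.
    if button_pressed and X == 1:
        return shutdown_utility(action)
    return normal_utility(action)

def counterfactual_evaluation(action, button_pressed):
    # The counterfactual is just X=0: evaluate as if the utility swap
    # never triggers, whatever happens with the button.
    return effective_utility(action, button_pressed, X=0)
```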

Is there a CDT angle to that?

We might be talking about different things when we talk about counterfactuals. Let me be more explicit:

Say an agent is playing a prisoner's dilemma against a copy of itself. It must evaluate what happens if it cooperates, and what happens if it defects. To do so, it needs to be able to predict what the world would look like "if it took action A". That prediction is what I call a "counterfactual", and it's not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to 'defect', or is it held constant?)

In this scenario, how do you use a stochastic event to "construct a counterfactual"? (I can think of some easy ways of doing this, some of which are essentially equivalent to using CDT, but I'm not quite sure which one you want to discuss.)
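One way to see the ambiguity concretely (a toy illustration of my own, not from the thread): compare evaluating an action with the copy's move held constant against evaluating it with the copy mirroring the choice.

```python
# Standard prisoner's dilemma payoffs for the row player.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def value_copy_held_fixed(my_action, copy_action="C"):
    # CDT-style counterfactual: the copy's action is held constant
    # while our own action varies.
    return PAYOFF[(my_action, copy_action)]

def value_copy_mirrors(my_action):
    # Counterfactual in which the exact copy makes the same choice we do.
    return PAYOFF[(my_action, my_action)]

print(value_copy_held_fixed("C"), value_copy_held_fixed("D"))  # 3, 5 -> defect
print(value_copy_mirrors("C"), value_copy_mirrors("D"))        # 3, 1 -> cooperate
```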

I see now why you think this gives CDT! I didn't intend this to be used for counterfactuals about the agent's own decisions, but about an event (possibly a past event) that "could have" turned out some other way.

The point of the example was to replace the "press" with something harder to hack.