Which counterfactuals should an AI follow?

I believe that we need to take a Conceptual Engineering approach here. That is, I don't see counterfactuals as intrinsically part of the world, but rather as something we construct. The question to answer is: for what purpose are we constructing them? Once we've answered this question, we'll be 90% of the way towards constructing them.

As far as I can see, the answer is that we imagine a set of possible worlds and we notice that agents that use certain notions of counterfactuals tend to perform better than agents that don't. Of course, this raises the question of which possible worlds to consider, at which point we notice that this whole thing is somewhat circular.

However, this is less problematic than people think. Just as we can only talk about what things are true after already having taken some assumptions to be true (see Where Recursive Justification hits Bottom), it seems plausible that we might only be able to talk about possibility after having already taken some things to be possible.

The Counterfactual Prisoner's Dilemma

"The problem is that principle F elides" - Yeah, I was noting that principle F doesn't actually get us there and I'd have to assume a principle of independence as well. I'm still trying to think that through.

The Counterfactual Prisoner's Dilemma

Hmm... that's a fascinating argument. I've been having trouble figuring out how to respond to you, so I'm thinking that I need to make my argument more precise and then perhaps that'll help us understand the situation.

Let's start from the objection I've heard against Counterfactual Mugging. Someone might say: I understand that if I don't pay, then I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

Now let's consider Counterfactual Prisoner's Dilemma. If the coin comes up HEADS, then principle F tells us that the counterfactuals need to have the COIN coming up HEADS as well. However, it doesn't tell us how to handle the impact of the agent's policy if they had seen TAILS. I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

You justify your construction by noting that the agent can figure out that it will make the same decision in both the HEADS and TAILS case. In contrast, my tendency is to exclude information about our decision making procedures. So, if you knew you were a utility maximiser this would typically exclude all but one counterfactual and prevent us saying choice A is better than choice B. Similarly, my tendency here is to suggest that we should be erasing the agent's self-knowledge of how it decides so that we can imagine the possibility of the agent choosing PAY/NOT PAY or NOT PAY/PAY.

But I still feel somewhat confused about this situation.

How do we prepare for final crunch time?

One of the biggest considerations would be the process for activating "crunch time". In what situations should crunch time be declared? Who decides? How far out would we want to activate and would there be different levels? Are there any downsides of such a process including unwanted attention?

If these aren't discussed in advance, then I imagine that far too much of the available time could be taken up by whether to activate crunch time protocols or not.

PS. I actually proposed here that we might be able to get a superintelligence to solve most of the problem of embedded agency by itself. I'll try to write it up into a proper post soon.

The Counterfactual Prisoner's Dilemma

You're correct that paying in Counterfactual Prisoner's Dilemma doesn't necessarily commit you to paying in Counterfactual Mugging.

However, it does appear to provide a counter-example to the claim that we ought to make decisions by only considering the branches of reality that are consistent with our knowledge, as this principle would result in us refusing to pay in Counterfactual Prisoner's Dilemma regardless of the coin flip result.

(Interestingly enough, blackmail problems also seem to demonstrate that this principle is flawed).

This seems to suggest that we need to consider policies rather than completely separate decisions for each possible branch of reality. And while, as I already noted, this doesn't get us all the way, it does make the argument for paying much more compelling by defeating the strongest objection.
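To make the contrast between policies and branch-by-branch decisions concrete, here is a minimal Python sketch. It assumes the usual statement of the problem, which is not spelled out in this comment: after the coin flip you are asked to pay, and you are rewarded only if you would have paid had the coin landed the other way. The amounts `COST` and `REWARD` and the helper names are hypothetical, chosen purely for illustration.

```python
from itertools import product

COST = 100        # hypothetical amount the agent is asked to pay
REWARD = 10_000   # hypothetical reward for paying in the *other* branch


def payoff(policy, coin):
    """Return the agent's payoff in the branch where `coin` came up.

    `policy` maps "heads"/"tails" to True (pay) or False (don't pay).
    """
    other = "tails" if coin == "heads" else "heads"
    total = 0
    if policy[coin]:
        total -= COST      # paying costs money in the observed branch
    if policy[other]:
        total += REWARD    # reward depends on the unobserved branch
    return total


def gain_from_refusing(policy, coin):
    """Gain from not paying in the observed branch, other branch held fixed."""
    pays = dict(policy, **{coin: True})
    refuses = dict(policy, **{coin: False})
    return payoff(refuses, coin) - payoff(pays, coin)


ALWAYS_PAY = {"heads": True, "tails": True}
NEVER_PAY = {"heads": False, "tails": False}
ALL_POLICIES = [{"heads": h, "tails": t} for h, t in product([True, False], repeat=2)]

for coin in ("heads", "tails"):
    print(coin, payoff(ALWAYS_PAY, coin), payoff(NEVER_PAY, coin))
```

Under these assumptions, refusing always looks better by `COST` when only the observed branch is considered, yet the always-pay policy comes out ahead of the never-pay policy whichever way the coin lands.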

Why 1-boxing doesn't imply backwards causation

I think the best way to explain this is to characterise the two views as slightly different functions, both of which return sets. Of course, the exact type representation isn't the point. Instead, the types are just there to illustrate the difference between two slightly different concepts.

possible_world_pure() returns {x} where x is either <study & pass> or <beach & fail>, but we don't know which one it will be

possible_world_augmented() returns {<study & pass>, <beach & fail>}

Once we've defined possible worlds, it naturally provides us a definition of possible actions and possible outcomes that matches what we expect. So for example:

size(possible_world_pure()) = size(possible_action_pure()) = size(possible_outcome_pure()) = 1

size(possible_world_augmented()) = size(possible_action_augmented()) = size(possible_outcome_augmented()) = 2

And if we have a decide function that iterates over all the counterfactuals in the set and returns the one with the highest utility, we need to call it on possible_world_augmented() rather than possible_world_pure().

Note that they aren't always this similar. For example, for Transparent Newcomb they are:

possible_world_pure() returns {<1-box, million>}

possible_world_augmented() returns {<1-box, million>, <2-box, thousand>}

The point is that if we remain conscious of the type differences then we can avoid certain errors.

For example, possible_outcome_pure() = {"PASS"} doesn't mean that possible_outcome_augmented() = {"PASS"}. It's the latter which would imply it doesn't matter what the student does, not the former.
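The distinction can be sketched in Python for the student example. This is only an illustration under stated assumptions: the factual world is hard-coded as <study & pass> (in practice we wouldn't know which one it is), the `UTILITY` values are hypothetical, and `possible_action` / `possible_outcome` take the set of worlds as an argument rather than existing in separate pure and augmented versions.

```python
from typing import FrozenSet, Tuple

World = Tuple[str, str]  # (action, outcome)

# Assumption for illustration: the factual world is <study & pass>.
FACTUAL_WORLD: World = ("study", "pass")
COUNTERFACTUAL_WORLDS: FrozenSet[World] = frozenset({("beach", "fail")})

# Hypothetical utilities over outcomes.
UTILITY = {"pass": 1, "fail": 0}


def possible_world_pure() -> FrozenSet[World]:
    """Only the one world that is actually fixed by the past."""
    return frozenset({FACTUAL_WORLD})


def possible_world_augmented() -> FrozenSet[World]:
    """The factual world plus the counterfactuals we construct."""
    return possible_world_pure() | COUNTERFACTUAL_WORLDS


def possible_action(worlds: FrozenSet[World]) -> FrozenSet[str]:
    return frozenset(action for action, _ in worlds)


def possible_outcome(worlds: FrozenSet[World]) -> FrozenSet[str]:
    return frozenset(outcome for _, outcome in worlds)


def decide(worlds: FrozenSet[World]) -> str:
    """Return the action of the world whose outcome has the highest utility."""
    best_action, _ = max(worlds, key=lambda world: UTILITY[world[1]])
    return best_action


print(len(possible_world_pure()), len(possible_world_augmented()))
print(decide(possible_world_augmented()))
```

Calling `decide` on `possible_world_pure()` is legal but uninformative, since there is only one element to compare. The comparison between studying and going to the beach only exists in the augmented set.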

Why 1-boxing doesn't imply backwards causation

Excellent question. Maybe I haven't framed this well enough.

We need a way of talking about the fact that both your outcome and your action are fixed by the past.

We also need a way of talking about the fact that we can augment the world with counterfactuals (Of course, since we don't have complete knowledge of the world, we typically won't know which is the factual and which are the counterfactuals).

And that these are two distinct ways of looking at the world.

I'll try to think about a cleaner way of framing this, but do you have any suggestions?

(For the record, the term I used before was Raw Counterfactuals - meaning consistent counterfactuals - and that's a different concept than looking at the world in a particular way).

(Something that might help is that if we are looking at multiple possible pure realities, then we've introduced counterfactuals as only one is true and "possible" is determined by the map rather than the territory)

Alignment By Default

Also, I have another strange idea that might increase the probability of this working.

If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of "true human values"?

I don't think it's likely to work, but thought I'd share anyway.

Alignment By Default


Is this why you put the probability as "10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values"? Or have you updated your probabilities since writing this post?

Alignment By Default

I guess the main issue that I have with this argument is that an AI system that is extremely good at prediction is unlikely to just have a high-level concept corresponding to human values (if it does contain such a concept). Instead, it's likely to also include a high-level concept corresponding to what people say about values - or rather several, corresponding to what various different groups would say about human values. If your proxy is based on what people say, then these concepts which correspond to what people say will match much better - and the probability of at least one of these concepts being the best match is increased by the large number of them. So I don't put a very high weight on this scenario at all.
