Where does ADT Go Wrong?

abramdemski

The main success of asymptotic decision theory is that it correctly solves Agent Simulates Predictor, as well as a swath of related problems. This is a challenging problem. Unfortunately, ADT hasn't generalized well to other problems. It seems like perhaps the property which allows it to get ASP right is precisely the property which blocks it from generalizing further.

There are two algorithms in ADT. The first, SADT, is given a decision problem in functional form so that counterfactuals are well-defined. This is called an embedder. SADT has a set of strategies which it can choose from, and it chooses whichever strategy gets the highest expectation when plugged into the embedder. This can be thought of as imitating whichever agent does best on the problem.
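As a rough sketch of that selection rule, here is a toy version in Python. It assumes the embedder can be treated as an ordinary function from a strategy to an expected utility, which ignores the machinery ADT actually relies on to estimate expectations. The names `sadt`, `Embedder`, `toy_embedder` and the payoff numbers are illustrative assumptions, not part of ADT itself.

```python
from typing import Callable, Iterable

Strategy = str
# Toy stand-in: an embedder maps a strategy to its expected utility.
Embedder = Callable[[Strategy], float]


def sadt(embedder: Embedder, strategies: Iterable[Strategy]) -> Strategy:
    """Pick the strategy whose expected utility under the embedder is highest."""
    return max(strategies, key=embedder)


# Hypothetical decision problem with made-up payoffs.
toy_payoffs = {"left": 1.0, "right": 3.0}


def toy_embedder(strategy: Strategy) -> float:
    return toy_payoffs[strategy]


choice = sadt(toy_embedder, ["left", "right"])
print(choice)  # right
```

The point of the sketch is only that SADT never asks "what if I acted differently?"; it asks "which strategy scores best when plugged into this function?".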

The second algorithm, DDT, is for learning the embedder to give SADT. DDT has a set of possible embedders, and chooses optimistically from the embedders which it thinks could be equivalent to the true decision problem it faces. This way, it rules out overly optimistic embedders and settles on the best of the embedders which are equivalent to the true problem.
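A minimal sketch of this two-step procedure (filter, then choose optimistically) might look like the following. The reality check is passed in as an opaque predicate `seems_equivalent`, since how DDT actually judges equivalence is not spelled out here; the embedder names and payoffs are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Strategy = str
Embedder = Callable[[Strategy], float]


def ddt(
    embedders: Dict[str, Embedder],
    strategies: List[Strategy],
    seems_equivalent: Callable[[str], bool],
) -> Tuple[str, Strategy]:
    """Filter embedders by a reality check, then choose optimistically."""
    plausible = [name for name in embedders if seems_equivalent(name)]

    def best_value(name: str) -> float:
        return max(embedders[name](s) for s in strategies)

    # Optimism: among plausible embedders, take the one promising the most.
    chosen = max(plausible, key=best_value)
    # Hand the chosen embedder to an SADT-style argmax over strategies.
    strategy = max(strategies, key=embedders[chosen])
    return chosen, strategy


# Hypothetical embedders with made-up payoffs.
embedders = {
    "modest": lambda s: {"a": 1.0, "b": 2.0}[s],
    "generous": lambda s: {"a": 4.0, "b": 0.0}[s],
    "fantasy": lambda s: 100.0,  # promises everything, fails the check
}
passes_check = {"modest", "generous"}
result = ddt(embedders, ["a", "b"], lambda name: name in passes_check)
print(result)  # ('generous', 'a')
```

Everything interesting is hidden inside `seems_equivalent`: if the filter lets through an embedder that promises more than reality can deliver, optimism will pick it.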

The main sign of a problem with SADT is that it will choose to crash into itself in a game of chicken. Consider the embedder which works as follows: it inserts the chosen policy into the slot of player 1. Player 2 is a new instance of SADT which looks at the problem via an embedder in which it plays the role of player 2 against player 1's strategy. If player 1 chooses a strategy which always goes straight rather than swerving, then SADT expects player 2 to learn to swerve -- that is what SADT would do if it saw itself up against a go-straight bot. Unfortunately, the real player 2 is thinking the same thing. So, both players go straight, and crash into each other; the worst possible outcome.

In short, SADT thinks that it moves first in games against other SADT agents. Since each copy thinks it moves first, the outcome can be very bad.
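The crash can be reproduced in a toy model. The sketch below assumes only two constant strategies and a made-up payoff table for chicken, and it models player 2 inside the embedder as simply best-responding to whatever constant strategy is plugged in. None of the numbers or function names come from ADT itself.

```python
STRAIGHT, SWERVE = "straight", "swerve"
ACTIONS = [STRAIGHT, SWERVE]

# Made-up chicken payoffs for the row player: (my action, other's action).
PAYOFF = {
    (STRAIGHT, STRAIGHT): -10.0,  # crash
    (STRAIGHT, SWERVE): 1.0,
    (SWERVE, STRAIGHT): -1.0,
    (SWERVE, SWERVE): 0.0,
}


def best_response(opponent_action: str) -> str:
    """What a payoff maximizer does against a bot that always plays opponent_action."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, opponent_action)])


def embedder(my_action: str) -> float:
    """Player 2 believes it faces a constant bot playing my_action and responds to that."""
    return PAYOFF[(my_action, best_response(my_action))]


def sadt_choice() -> str:
    return max(ACTIONS, key=embedder)


player_1 = sadt_choice()
player_2 = sadt_choice()  # the real player 2 runs the same reasoning
predicted = embedder(player_1)
actual = PAYOFF[(player_1, player_2)]
print(player_1, player_2, predicted, actual)  # straight straight 1.0 -10.0
```

Each copy expects the "I go straight, you swerve" payoff, and both receive the crash payoff instead.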

The simple analysis of what went wrong is that this game is outside of the optimality conditions for SADT. That's not very interesting, though. Why doesn't SADT work well outside those conditions?

One possible analysis of what goes wrong is that SADT thinks it can get the outcome of any other agent just by copying that agent's strategy. Unlike SADT, LIDT (logical inductor decision theory) considers what happens if LIDT itself took a specified action. We don't have many optimality results for LIDT, because this is hard to analyze -- it relies on reasoning about counterfactual occurrences based on history. SADT instead uses a well-defined notion of counterfactual, but seemingly at the cost of a poor fit to reality.

If this is true, any fix to ADT would likely stop solving ASP. ADT gets ASP right because it thinks it'll get the reward which other agents get if it switches to those other agents' strategies. Switching to a strategy which always two-boxes looks like a bad idea, because the predictor would not be giving money to that other agent. If ADT were thinking more realistically, it would reason like LIDT, seeing that it gets more money on this particular instance by two-boxing. But, of course, this "realistic" thinking makes it get less money in the end.
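To make the contrast concrete, here is a toy comparison of the two ways of evaluating two-boxing. It flattens ASP into a Newcomb-style payoff table and uses the conventional amounts of 1,000,000 and 1,000, which are an assumption and do not appear above. The function names are likewise hypothetical.

```python
ONE_BOX, TWO_BOX = "one-box", "two-box"
ACTIONS = [ONE_BOX, TWO_BOX]
BIG, SMALL = 1_000_000, 1_000  # conventional amounts, assumed for illustration


def payoff(action: str, prediction: str) -> int:
    """The big box is full only if one-boxing was predicted; two-boxing adds the small box."""
    big = BIG if prediction == ONE_BOX else 0
    small = SMALL if action == TWO_BOX else 0
    return big + small


def strategy_swap_value(strategy: str) -> int:
    """ADT-style: evaluate a strategy by what an agent running it would get,
    so the prediction tracks the strategy being evaluated."""
    return payoff(strategy, prediction=strategy)


def fixed_instance_value(action: str, prediction: str) -> int:
    """'Realistic' style: hold this instance's prediction fixed and vary only the action."""
    return payoff(action, prediction)


adt_pick = max(ACTIONS, key=strategy_swap_value)
instance_pick = max(ACTIONS, key=lambda a: fixed_instance_value(a, ONE_BOX))
print(adt_pick, strategy_swap_value(adt_pick))                # one-box 1000000
print(instance_pick, fixed_instance_value(instance_pick, ONE_BOX))  # two-box 1001000
```

With the prediction held fixed, two-boxing always looks better by the small amount, but an agent that reasons this way is the kind of agent the predictor leaves with an empty big box.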

I don't think this analysis is quite right, though. It's true that the example with the game of chicken goes wrong because SADT thinks it can do as well as some other agent by copying that agent's strategy, but that isn't SADT's fault. It's only using the counterfactual it is given.

So, my analysis of where SADT goes wrong in this example is that the embedder is wrong.

Consider three possible embedders:

Copy: Takes an agent as input, and uses that agent as both player 1 and player 2.
Spoofer: Takes the input agent as player 1. Player 2 is an ADT agent who thinks it is playing against the input agent.
Delusion: Takes the input agent as player 1. Player 2 is an ADT agent who thinks it is playing against an ADT agent.

"Spoofer" is the embedder which I described before. It "spoofs" the other player into thinking it is up against whatever strategy it's trying out. "Delusion", on the other hand, has a player 2 who thinks it's up against another SADT agent no matter who it's actually up against. (I didn't come up with the name for the delusion embedder -- I would have called it the real embedder. I preserved the name choice here to represent the fact that it's very debatable which embedder better represents the real situation.)

If SADT used the copy or delusion embedder, it would not crash into itself in a game of chicken. So why did I say ADT crashes into itself?

DDT learns to use the spoofer. It considers all three embedders to be equivalent to the real situation, because if you feed in the ADT agent itself, you recover the true scenario. So, it makes the most optimistic choice among these -- which is the spoofer.
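Here is a toy rendering of that selection step. It assumes a made-up payoff table, treats the possible inputs as two constant bots plus a token `ADT` standing for the agent itself, and takes as given that real ADT agents end up going straight. That last assumption matters: the delusion embedder's numbers depend on it. All names are hypothetical.

```python
STRAIGHT, SWERVE, ADT = "straight", "swerve", "ADT"
BOTS = [STRAIGHT, SWERVE]

# Made-up chicken payoffs for player 1: (player 1 action, player 2 action).
PAYOFF = {
    (STRAIGHT, STRAIGHT): -10.0,
    (STRAIGHT, SWERVE): 1.0,
    (SWERVE, STRAIGHT): -1.0,
    (SWERVE, SWERVE): 0.0,
}

# Assumption: in the real game, ADT agents end up going straight.
ADT_ACTION = STRAIGHT
REAL_VALUE = PAYOFF[(ADT_ACTION, ADT_ACTION)]


def act(agent: str) -> str:
    return ADT_ACTION if agent == ADT else agent


def best_response(bot_action: str) -> str:
    return max(BOTS, key=lambda a: PAYOFF[(a, bot_action)])


def copy_embedder(agent: str) -> float:
    a = act(agent)
    return PAYOFF[(a, a)]


def spoofer(agent: str) -> float:
    # Player 2 believes it faces `agent`: it best-responds to a bot,
    # and against ADT itself it is simply in the real game.
    a = act(agent)
    p2 = ADT_ACTION if agent == ADT else best_response(a)
    return PAYOFF[(a, p2)]


def delusion(agent: str) -> float:
    # Player 2 always believes it faces ADT, whatever the input is.
    return PAYOFF[(act(agent), ADT_ACTION)]


EMBEDDERS = {"copy": copy_embedder, "spoofer": spoofer, "delusion": delusion}

# Reality filter: feeding in the agent itself must recover the real value.
passes_filter = [n for n, f in EMBEDDERS.items() if f(ADT) == REAL_VALUE]
# Optimism: compare the best value each surviving embedder promises.
optimistic = {n: max(EMBEDDERS[n](b) for b in BOTS) for n in passes_filter}
chosen = max(optimistic, key=optimistic.get)
print(passes_filter)  # ['copy', 'spoofer', 'delusion']
print(optimistic)     # {'copy': 0.0, 'spoofer': 1.0, 'delusion': -1.0}
print(chosen)         # spoofer
```

All three embedders agree with reality at the one input the filter checks, so the filter cannot tell them apart, and optimism then favors the one with the most flattering counterfactuals.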

So, it seems like DDT is to blame. It's not totally clear what part of DDT to blame. Maybe the optimistic choice doesn't make as much sense as it seems. But, I claim that DDT's reality filter is faulty. It checks that the embedder amounts to reality if you feed in the DDT agent. This doesn't seem like the right check to perform. I think it should be requiring that the embedder amounts to reality if you feed in the policy which will be selected in the end. This would rule out the spoofing embedder in a game of chicken, since the policy which looks best according to that embedder implies more reward than you can really get. Furthermore, it seems like this is the right reason to rule out that embedder.
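One way to read the proposed check, as a toy sketch: for each embedder, find the policy it would lead SADT to select, and compare the value the embedder promises for that policy with what happens when the policy is actually played. The sketch assumes a made-up payoff table and models "reality" as a symmetric game in which the other player runs the same procedure and selects the same action. It covers only the copy and spoofer embedders, since how the delusion embedder should be scored under this check is less clear.

```python
STRAIGHT, SWERVE = "straight", "swerve"
ACTIONS = [STRAIGHT, SWERVE]

# Made-up chicken payoffs for player 1: (player 1 action, player 2 action).
PAYOFF = {
    (STRAIGHT, STRAIGHT): -10.0,
    (STRAIGHT, SWERVE): 1.0,
    (SWERVE, STRAIGHT): -1.0,
    (SWERVE, SWERVE): 0.0,
}


def best_response(bot_action: str) -> str:
    return max(ACTIONS, key=lambda a: PAYOFF[(a, bot_action)])


def copy_embedder(action: str) -> float:
    return PAYOFF[(action, action)]


def spoofer(action: str) -> float:
    return PAYOFF[(action, best_response(action))]


def passes_modified_check(embedder) -> bool:
    """Does the embedder match reality at the policy it causes to be selected?"""
    selected = max(ACTIONS, key=embedder)
    promised = embedder(selected)
    # Toy reality: the opponent is a copy that selects the same action.
    realized = PAYOFF[(selected, selected)]
    return promised == realized


print(passes_modified_check(copy_embedder))  # True
print(passes_modified_check(spoofer))        # False
```

The spoofer fails because it promises the "I go straight, you swerve" payoff for a policy that, once both copies select it, yields the crash payoff.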

I think such a modified reality check might just turn it into an LIDT-like thing. I haven't figured out very much about it, though.

As you can see, many of the explanations here were sketchy. I've omitted some details, especially with respect to the optimality conditions of ADT. I encourage you to treat this as pointers to arguments which might be made, and think through what's going on with ADT yourself.

gallabytes

When considering an embedder $F$, in universe $U$, in response to which SADT picks policy $π$, I would be tempted to apply the following coherence condition:

$E[F(\pi)] = E[F(\mathrm{DDT})] = E[U]$

(all approximately of course)

I'm not sure if this would work though. This is definitely a necessary condition for reasonable counterfactuals, but not obviously sufficient.
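For concreteness, the condition can be written as a small predicate with an explicit tolerance standing in for "approximately". The function name, the tolerance, and the example numbers are all hypothetical; in particular the numbers are not computed from any real embedder.

```python
import math


def coherent(f_pi: float, f_ddt: float, u: float, tol: float = 1e-6) -> bool:
    """Check E[F(pi)] ~ E[F(DDT)] ~ E[U] up to an absolute tolerance."""
    return math.isclose(f_pi, f_ddt, abs_tol=tol) and math.isclose(f_ddt, u, abs_tol=tol)


# Hypothetical embedder that matches reality when the agent is plugged in,
# but promises more for the selected policy than the universe delivers.
print(coherent(f_pi=1.0, f_ddt=-10.0, u=-10.0))  # False

# Hypothetical embedder whose promise for the selected policy is also met.
print(coherent(f_pi=0.0, f_ddt=0.0, u=0.0))      # True
```

As noted, passing this test is at best necessary: an embedder could satisfy all three equalities and still have unreasonable counterfactuals at policies that are never selected.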