All of LGS's Comments + Replies

It does do (a variant of) MCTS. Check it out for yourself. The paper is here:

Appendix B, page 12:

"We now describe the search algorithm used by MuZero. Our approach is based upon Monte-Carlo tree search with upper confidence bounds, an approach to planning that converges asymptotically to the optimal policy in single agent domains and to the minimax value function in zero sum games [22]."

No. I am very familiar with the paper, and MuZero does not use MCTS, nor does it support the claims of OP. First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.* Second, using MCTS for the planning is not necessary. As they note, any kind of planning algorithm, not just MCTS would work ("For example, a naive search could simply select the k step action sequence that maximizes the value function. More generally, we may apply any MDP planning algorithm to the internal rewards and state space induced by the dynamics function.") Third, NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion. What more do you want? I feel that we should at least acknowledge that TreeQN exists, wasn't insuperably hard to create, and, inasmuch as it runs on current hardware at all, doesn't seem to entail 'a factor of a million slowdown'. (VIN/VPN/Predictron might count as examples here too? There's a lot of model-based RL work which make the NN learn part of the planning process, like Imagination-based Planner or MCTSnets.) Fourth, planning is not necessary at all for the NN to compute results just as strong as tree search would: just like regular AlphaZero, the policy network on its own, with no rollouts or trees involved of any sort, is very strong, and they show that it increases greatly in strength over training. We also have the scaling law work of Andy Jones, verifying the intuition that anything tree search does can be efficiently distilled into a n

You are aware that MuZero has tree search hardcoded into it, yes? How does that contradict claim 1?

MuZero does not do MCTS and still outperforms.