Why is a chess game the opposite of an ideal gas? On short timescales an ideal gas is described by elastic collisions. And a single move in chess can be modeled by a policy network.
The difference is in long timescales: If we simulated elastic collisions for a long time, we'd end up with a complicated distribution over the microstates of the gas. But we can't run simulations for a long time, so we have to make do with the Boltzmann distribution, which is a lot less accurate.
Similarly, if we rolled out our policy network to get a distribution over chess game outcomes (win/loss/draw), we'd get the distribution of outcomes under self-play. But if we're observing a game between two players who are both better than us, we have access to a more accurate model based on their Elo ratings.
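For concreteness, here is a minimal sketch of the Elo-based long-term model, using the standard Elo expected-score formula. Note that Elo predicts an expected score (win = 1, draw = 0.5, loss = 0) rather than a full win/loss/draw distribution, so this is only a stand-in for the outcome model. The function name is made up for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo formula.

    The score counts a win as 1, a draw as 0.5 and a loss as 0, so this is
    not a full win/loss/draw distribution.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    print(elo_expected_score(1500, 1500))  # equal ratings give 0.5
    print(elo_expected_score(1900, 1500))  # a 400-point edge gives about 0.909
```

The point is that this prediction needs only two numbers and no knowledge of how either player chooses moves.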
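Can we formalize this? Suppose we're observing a chess game. Our beliefs about the next move are conditional probabilities of the form $P_1(x_{k+1} \mid x_0 \cdots x_k)$, and our beliefs about the next $n$ moves are conditional probabilities of the form $P_n(x_{k+1} \cdots x_{k+n} \mid x_0 \cdots x_k)$. We can transform beliefs of one type into the other using the operators

$$(\Pi_n P_1)(x_{k+1} \cdots x_{k+n} \mid x_0 \cdots x_k) := \prod_{i=1}^{n} P_1(x_{k+i} \mid x_0 \cdots x_{k+i-1})$$

$$(\Sigma_n P_n)(x_{k+1} \mid x_0 \cdots x_k) := \sum_{x_{k+2} \cdots x_{k+n}} P_n(x_{k+1} \cdots x_{k+n} \mid x_0 \cdots x_k)$$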
If we're logically omniscient, we'll have $\Pi_n P_1 = P_n$ and $\Sigma_n P_n = P_1$. But in general we will not. A chess game is short enough that $\Pi_n$ is easy to compute, but $\Sigma_n$ is too hard because it has exponentially many terms. So we can have a long-term model $P_n$ that is more accurate than the rollout $\Pi_n P_1$, and a short-term model $P_1$ that is less accurate than $\Sigma_n P_n$. This is a sign that we're dealing with an intelligence: We can predict outcomes better than actions.
If instead of a chess game we're predicting an ideal gas, the relevant timescales are so long that we can't compute $\Pi_n$ or $\Sigma_n$. Our long-term thermodynamic model $P_n$ is less accurate than a simulation $\Pi_n P_1$. This is often a feature of reductionism: Complicated things can be reduced to simple things that can be modeled more accurately, although more slowly.
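To make the two operators concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the two-move "game", the one-step model `p1`, and the helper names `rollout` (for Π) and `marginalize` (for Σ). It only checks the consistent case, where marginalizing a rollout gives back the one-step model.

```python
from itertools import product
from typing import Callable, Dict, Tuple

History = Tuple[str, ...]
MOVES = ("a", "b")  # hypothetical two-move "game", not chess


def p1(history: History) -> Dict[str, float]:
    """Toy one-step model: repeat the previous move with probability 0.7."""
    if not history:
        return {m: 1.0 / len(MOVES) for m in MOVES}
    last = history[-1]
    return {m: (0.7 if m == last else 0.3) for m in MOVES}


def rollout(one_step: Callable[[History], Dict[str, float]], n: int):
    """Pi_n: build an n-step model by chaining the one-step model."""
    def pn(history: History, future: Tuple[str, ...]) -> float:
        assert len(future) == n
        prob, h = 1.0, tuple(history)
        for move in future:
            prob *= one_step(h)[move]  # P1(next move | everything so far)
            h = h + (move,)
        return prob
    return pn


def marginalize(n_step: Callable[[History, Tuple[str, ...]], float], n: int):
    """Sigma_n: recover a one-step model by summing out the last n - 1 moves."""
    def one_step(history: History) -> Dict[str, float]:
        # The sum runs over all len(MOVES) ** (n - 1) continuations.
        return {
            first: sum(n_step(history, (first,) + rest)
                       for rest in product(MOVES, repeat=n - 1))
            for first in MOVES
        }
    return one_step


if __name__ == "__main__":
    p3 = rollout(p1, 3)
    print(marginalize(p3, 3)(("a",)))  # matches p1(("a",)) up to rounding
```

The rollout costs n one-step evaluations per sequence, while the marginalization sums over a number of terms exponential in n. That asymmetry is why Σ becomes infeasible long before Π does.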
In general, we can have several models at different timescales, and Π and Σ operators connecting all the levels. For example, we might have a short-term model describing the physics of fundamental particles; a medium-term model describing a person's motor actions; and a long-term model describing what that person accomplishes over the course of a year. The medium-term model will be less accurate than a rollout of the short-term model, and the long-term model may be more accurate than a rollout of the medium-term model if the person is smarter than us.
To me it seems like for the considerations you bring up in this post, the difference between the ideal gas and the chess game is that we have a near-exact short-timescale model for the ideal gas, but we don't have a near-exact short-timescale model for the chess game.
If we knew the source code for both of the players in the chess game, we could simulate the game until someone wins and get an accurate prediction of the outcome, one that would be better than just using the Elo ratings.
Running the argument through, we would conclude that intelligence is characterized by us not having a reductionist model of it. Which seems not as ridiculous as it first sounds -- we ascribed intelligent design to e.g. rain and evolution before we understood how they worked. Also, if we could near-exactly simulate chess players (in our brains, not using computers), I doubt we would see them as very intelligent.
Another disanalogy is that for an ideal gas you want to predict the microstate (which the long-timescale model doesn't get) but for the chess game you want to predict the macrostate (who wins).
Yeah, I think the fact that Elo only models the macrostate makes this an imperfect analogy. I think a better analogy would involve a hybrid model, which assigns a probability to a chess game based on whether each move is plausible (using a policy network), and whether the higher-rated player won.
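One way such a hybrid could be scored, as a rough sketch: add the log-probabilities the policy network assigns to each move to the log-probability the rating-based model assigns to the final result. This is an unnormalized score rather than a proper distribution over games, and the function name and inputs are assumptions for illustration.

```python
import math
from typing import Sequence


def hybrid_log_score(move_probs: Sequence[float], outcome_prob: float) -> float:
    """Unnormalized log-score of one complete game under a hybrid model.

    move_probs: probability the policy network gave to each move actually played.
    outcome_prob: probability the rating-based model gave to the actual result.
    All probabilities must be strictly positive.
    """
    return sum(math.log(p) for p in move_probs) + math.log(outcome_prob)


if __name__ == "__main__":
    # Same moves, but the second game ends in a result the ratings consider unlikely.
    print(hybrid_log_score([0.5, 0.4, 0.6], outcome_prob=0.9))
    print(hybrid_log_score([0.5, 0.4, 0.6], outcome_prob=0.1))
```

Turning this into a real probability would need a normalizing constant over all games, which runs into the same exponential sum as Σ.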
I don't think the distinction between near-exact and nonexact models is essential here. I bet we could introduce extra entropy into the short-term gas model and the rollout would still be better at predicting the microstate than the Boltzmann distribution.
The notation for the sum operator is unclear. I'd advise writing the sum as $i = k+2, \ldots, k+n$ and using an $i$ subscript inside the sum so it's clearer what is being substituted where.
The sum isn't over $i$, though, it's over all possible tuples of length $n-1$. Any ideas for how to make that clearer?
I find the current notation fine, but if you want to make it more explicit, you could do
Thanks, I made this change to the post.
My initial inclination is to introduce $X_n$ as the space of events on turn $n$, and define $X_{a:b} := \prod_{i=a}^{b} X_i$, and then you can express it as $\sum_{\sigma \in X_{k+2:k+n}} P_n(x_{k+1}, \sigma \mid x_0 \ldots x_k)$.