Adam Scherlis

Wiki Contributions


Image layout is a little broken. I'll try to fix it tomorrow.

Sorry if this is a spoiler for your next post, but I take issue with the heading "Standard measures of information theory do not work" and the implication that this post contains the pre-Crutchfield state of the art.

The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn't to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post), it's to treat the generative model as a distribution for each bit conditional on the previous bits and use the cross-entropy of that distribution under the data distribution as the loss function or measure of goodness of the generative model.

So in this example, "look at the previous bits, identify the current position relative to the 01x01x pattern, and predict 0, 1, or [50-50 distribution] as appropriate" is the best you can do (given sufficient data for the 50-50 proportion to be reasonably accurate) and is indeed an accurate model of the process that generated the data.

We can see the pattern and take the current position into account because the distribution is conditional on previous bits.

Predicting 011011011... doesn't do as well because cross-entropy penalizes unwarranted overconfidence.

Predicting 50-50 for each bit doesn't do as well because cross-entropy still cares about successful predictions.

(Formally, cross-entropy is an expectation over the data distribution instead of an empirical average over a bunch of sampled data, but the term is used in both cases in practice. "Log[-likelihood] loss" and "the log scoring rule" are other common terms for the empirical version.)

As I said above, this isn't just a standard information theory approach to this, it's actually how GPT-3 and other LLMs were trained.

I'm curious about Crutchfield's thing, but so far not convinced that standard information theory isn't adequate in this context.

(I think Kolmogorov complexity is also relevant to LLM interpretability, philosophically if not practically, but that's beyond the scope of this comment.)