Sorry if this is a spoiler for your next post, but I take issue with the heading "Standard measures of information theory do not work" and the implication that this post contains the pre-Crutchfield state of the art.
The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn't to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post). Instead, it treats the generative model as a distribution over each bit conditional on the previous bits, and uses the cross-entropy of that distribution under the data distribution as the loss function or measure of goodness of the generative model.
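In code, that loss looks something like this (a minimal sketch; the `model` argument is a hypothetical stand-in for any conditional predictor):

```python
import math

def autoregressive_log_loss(bits, model, eps=1e-12):
    """Average number of bits the model 'pays' per symbol: the empirical
    cross-entropy of the data under the model's conditional predictions."""
    total = 0.0
    for i, b in enumerate(bits):
        p_one = model(bits[:i])          # model's P(next bit = 1 | all previous bits)
        p = p_one if b == 1 else 1.0 - p_one
        p = min(max(p, eps), 1.0 - eps)  # clamp so a fully confident wrong guess is
                                         # hugely penalized rather than infinitely
        total += -math.log2(p)
    return total / len(bits)
```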
So in this example, "look at the previous bits, identify the current position relative to the 01x01x pattern, and predict 0, 1, or [50-50 distribution] as appropriate" is the best you can do (given sufficient data for the 50-50 proportion to be reasonably accurate) and is indeed an accurate model of the process that generated the data.
We can see the pattern and take the current position into account because the distribution is conditional on previous bits.
Predicting 011011011... doesn't do as well because cross-entropy penalizes unwarranted overconfidence.
Predicting 50-50 for each bit doesn't do as well because cross-entropy still cares about successful predictions.
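To make the comparison concrete, here's a sketch reusing `autoregressive_log_loss` from above, and assuming the process in the post is "0, 1, fair coin, repeating":

```python
import random

random.seed(0)
bits = []
for _ in range(10_000):
    bits += [0, 1, random.randint(0, 1)]   # 01x01x...: two fixed bits, then a fair coin

# Each predictor returns P(next bit = 1) given the previous bits.
cycle_predictors = {
    "optimal":       lambda prefix: [0.0, 1.0, 0.5][len(prefix) % 3],  # 0, 1, then 50-50
    "overconfident": lambda prefix: [0.0, 1.0, 1.0][len(prefix) % 3],  # predicts 011011011...
    "uniform":       lambda prefix: 0.5,                               # 50-50 everywhere
}
for name, model in cycle_predictors.items():
    print(f"{name}: {autoregressive_log_loss(bits, model):.3f} bits/bit")
```

The optimal predictor averages about 1/3 bit per bit, the uniform predictor exactly 1 bit per bit, and the overconfident predictor several bits per bit (without the clamp its loss would be infinite every time the random bit comes up 0).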
(Formally, cross-entropy is an expectation over the data distribution instead of an empirical average over a bunch of sampled data, but the term is used in both cases in practice. "Log[-likelihood] loss" and "the log scoring rule" are other common terms for the empirical version.)
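In symbols, with p the data distribution and q the model:

$$H(p, q) \;=\; -\mathbb{E}_{x \sim p}\left[\log q(x)\right] \;\approx\; -\frac{1}{N}\sum_{i=1}^{N}\log q(x_i), \quad x_i \sim p.$$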
As I said above, this isn't just a standard information-theoretic approach to the problem; it's actually how GPT-3 and other LLMs were trained.
I'm curious about Crutchfield's thing, but so far not convinced that standard information theory isn't adequate in this context.
(I think Kolmogorov complexity is also relevant to LLM interpretability, philosophically if not practically, but that's beyond the scope of this comment.)
This post makes the excellent point that the paradigm that motivated SAEs -- the superposition hypothesis -- is incompatible with widely known and easily demonstrated properties of SAE features (and feature vectors in general). The superposition hypothesis assumes that feature vectors have nonzero cosine similarity only because there isn't enough space for them all to be orthogonal, in which case the cosine similarities themselves shouldn't be meaningful. But in fact, cosine similarities between feature vectors have rich semantic content, as shown by circular embeddings (in several contexts) and feature splitting / dimensionality-reduction visualizations. Features aren't just crammed together arbitrarily; they're grouped with similar features.
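As a gut check on what "not meaningful" would look like: if features really were packed randomly, their pairwise cosine similarities would be mean-zero noise of size roughly 1/sqrt(d). A minimal sketch, with random vectors standing in for feature directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024                                # more "features" than dimensions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors, so V @ V.T is cosine sims

cos = V @ V.T
offdiag = cos[~np.eye(n, dtype=bool)]
print(offdiag.mean(), offdiag.std())            # ~0 and ~1/sqrt(256) = 0.0625: pure noise
```

Structured geometry -- circles, clusters of split features -- stands out sharply against that noise baseline, which is exactly the evidence the post presents.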
I didn't properly appreciate this point before reading this post (actually: before someone summarized the post to me verbally), at which point it became blindingly obvious.
There are some earlier blog posts that point out that superposition is probably only part of the story, e.g. https://transformer-circuits.pub/2023/superposition-composition/index.html on compositionality, but this one presents the relevant empirical evidence and its implications very clearly.
This post holds up pretty well: SAEs are still popular (although they've lost some followers in the last ~year), and the point isn't specific to SAEs anyway (circular feature embeddings are ubiquitous). Superposition is also still an important idea, although I've been thinking about it less, so I'm not sure what the state of the art is.
My only complaint is that "maybe if I'm being more sophisticated, I can specify the correlations between features" is giving the entire game away -- the full set of correlations is nearly equivalent to the embeddings themselves, and has all of the interesting parts.
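To spell out "nearly equivalent": the full set of correlations is a Gram matrix, and a Gram matrix determines the vectors up to a global rotation/reflection. A sketch of the recovery (hypothetical random features for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(50, 16))                   # 50 hypothetical feature vectors in 16 dims
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize: Gram matrix = cosine sims

G = V @ V.T                                     # "the full set of correlations"

# Eigendecompose G and take square roots to get embeddings W with W @ W.T = G.
eigvals, eigvecs = np.linalg.eigh(G)
W = eigvecs * np.sqrt(np.clip(eigvals, 0, None))

assert np.allclose(W @ W.T, G)                  # same pairwise geometry as the originals
```

So specifying all the correlations is specifying the embedding, minus only an overall rotation that carries no interpretable content.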
But I think the rest of the post demonstrates an important tension between theory and experiment, which an improved theory has to be able to account for, and I don't think I've heard of an improved theory yet.