Probability as Minimal Map

johnswentworth

What’s the difference between a map and the raw data used to generate that map?

Suppose, for example, that google sends a bunch of cars out driving around New York City, snapping photos every few feet, then uses the photos to reconstruct a streetmap. In an informational sense, what’s the difference between the giant database full of photos and the streetmap?

Obvious answer: the map throws away a huge amount of information which we don’t care about, at least for the map’s use-case.

Let’s try to formalize this intuition a little. We want the ability to run queries against the map - e.g. things like “How far apart are Times Square and the Met?” or questions about street network topology. Let’s call the whole class of queries we want to support . We also have some input data $X$ - e.g. the database full of images from the streets of NYC. The goal of the map-synthesizing process is to throw out as much information as possible from $X$ , while still keeping any and all information relevant to queries in $Q$ .

Natural next question: given the query class $Q$ and some data-generating process, can we characterize “optimal” map-making processes which maximize the amount of information “thrown out” (measured in an information-theoretic sense) while still keeping all information relevant to $Q$ ?

For the very simplest query classes, we can answer this question quite neatly.

Suppose that our query class contains only a single query $q$ , and the query is a true/false question. For instance, maybe the only query we care about at all is $q$ = “Is the distance between Times Square and the Met greater than one mile?”, and we want to throw out as much information as possible without losing any info relevant to the query.

Claim: any solution to this problem is isomorphic to the probability $P [q | X]$ . In other words, $P [q | X]$ is itself a minimal “lossless” map (where by “lossless” we mean it does not throw out any info from $X$ relevant to the query, not that it perfectly predicts $q$ ), and any other minimal “lossless” map is a reversible transformation of $P [q | X]$ .

Let’s prove it.

We’ll formalize our claim in terms of mutual information $I$ . We have two pieces:

$P [q | X]$ throws away no information relevant to $q$ : $I (q; X) = I (q; P [q | X])$
For any function $g (X)$ , if $g$ throws away no information relevant to $q$ , then $P [q | X]$ can be computed from $g (X)$ : $I (q; X) = I (q; g (X))$ implies $P [q | X] = f (g (X))$ for some $f$ .

One important note: our notation $P [q | X]$ is really shorthand for $P [q = T r u e | X]$ , so $P [q | X]$ is not a function of the “outcome” of $q$ - it’s only a function of $X$ . Also note that the second condition implies that the entropy of $g (X)$ is at least as high as $P [q | X]$ , so $P [q | X]$ is indeed minimal in terms of information content.

In the next two sections, we’ll separately prove our two pieces.

Probability-Computing Circuits

We want to show that $P [q | X]$ throws away no information relevant to $q$ : $I (q; X) = I (q; P [q | X])$ .

We’ll start with a short side-quest to prove a lemma: $P [q | P [q | X] = p] = p$ . Here’s an interpretation: suppose you work late trying to calculate the probability of $q$ based on some giant pile of data $X$ . You finally compute the result: $P [q | X] = p$ . You go to bed feeling victorious, but wake up to find that your computer has died and your giant pile of data is lost. However, you still remember the final result $p$ - even though you don’t know $X$ , you do know that $P [q | X] = p$ . Based on this information, what probability should you assign to $q$ ? Presumably the answer should be $p$ .

We can directly prove this in a few lines:

$P [q | P [q | X] = p] = (P [P [q | X] = p])^{- 1} P [q, P [q | X] = p] = (\sum_{x} I [P [q | X = x] = p] P [X = x])^{- 1} \sum_{x} I [P [q | X = x] = p] P [q | X = x] P [X = x]$

Note that the indicator function $I [.]$ in the main sum is zero whenever $P [q | X = x]$ is not $p$ , so we can replace $P [q | X = x]$ with $p$ in that sum: $= (\sum_{x} I [P [q | X = x] = p] P [X = x])^{- 1} \sum_{x} I [P [q | X = x] = p] p P [X = x] = p$

We can also write this in the more compact form $P [q | P [q | X]] = P [q | X]$ , which we’ll use later.

To build a bit more intuition, let’s expand on this result a bit. Imagine a circuit (aka deterministic causal diagram or computational DAG) which takes in data $X$ and outputs $P [q | X]$ . We’ll think about some arbitrary cross-section of that circuit - some set of internal nodes ${g_{i}}$ which cut the circuit in two, so that $g$ mediates the interaction between $X$ and $P [q | X]$ . We can write this as $P [q | X] = f (g (X))$ for some function $f$ .

Claim: for any cross-section $g$ , $P [q | g (X)] = P [q | X]$ . Intuitively, we can imagine that we’re partway through calculating $P [q | X]$ , we’ve calculated all the $g$ ’s but haven’t finished the full computation yet, when suddenly our computer dies and we lose all the original data $X$ . As long as we still have the $g$ ’s around, we’re good - we just shrug and continue the calculation, since the $g$ ’s were all we needed to finish up anyway.

I won’t write out the proof for this one, but it works exactly like the previous proof.

Notice that our cross-section theorem has two interesting corner cases: all of $X$ is a cross-section, and $P [q | X]$ by itself is a cross-section. The former yields the trivial statement $P [q | X] = P [q | X]$ , while the latter yields our lemma from earlier: $P [q | P [q | X] = p] = p$ .

Armed with our lemma, let’s return to the main problem: show that $P [q | X]$ throws away no information relevant to $q$ , formalized as $I (q; X) = I (q; P [q | X])$ .

We’ll start by applying an identity from information theory:

$I (q, X) = H (q) - H (q | X)$
$I (q, P [q | X]) = H (q) - H (q | P [q | X])$

… where $H$ is the Shannon entropy. So to show that $I (q, X) = I (q, P [q | X])$ , we can show $H (q | X) = H (q | P [q | X])$ . Our lemma makes this trivial:

$H (q | P [q | X]) = - \sum_{q} P [q | P [q | X]] l n P [q | P [q | X]] = - \sum_{q} P [q | X] l n P [q | X] = H (q | X)$ … so $P [q | X]$ contains all the information in $X$ relevant to $Q$ .

The Data Processing Inequality

Next, we’d like to show that $P [q | X]$ is a minimal representation of the information in $X$ relevant to $Q$ . Formally, for any $g (X)$ , if $g$ throws away no information relevant to $q$ , then $P [q | X]$ is a function of $g (X)$ : $I (q; X) = I (q; g (X))$ implies $P [q | X] = f (g (X))$ for some $f$ .

In particular, we’ll show that $P [q | g (X)] = P [q | X]$ , so we can choose $f (g *) = P [q | g (X) = g *]$ . Intuitively, this says that if $g (X)$ throws out no information from $X$ relevant to $q$ , then it yields the same probability for $q$ .

To prove this, we’ll use a theorem from information theory called the data processing inequality (DPI). Intuitively, the DPI says that if we start with the data $X$ , we cannot gain any new information about $q$ just by processing $X$ . Formally, for any function $g$ :

$I (q; X) \geq I (q; g (X))$

… with equality iff $q$ and $X$ are independent conditional on $g (X)$ (so that $X$ tells us nothing else about $q$ once we know $g (X)$ ). The proof and a bit more discussion can be found in Cover & Thomas’ textbook on page 34.

We’re interested in the equality case $I (q; X) = I (q; g (X))$ , so the DPI tells us that $q$ and $X$ are independent conditioned on $g (X)$ :

$P [q, X | g (X)] = P [q | g (X)] P [X | g (X)]$

But we can always just expand via the chain rule instead:

$P [q, X | g (X)] = P [q | X, g (X)] P [X | g (X)] = P [q | X] P [X | g (X)]$

Equate these two expressions for $P [q, X | g (X)]$ , and we find

$P [q | g (X)] = P [q | X]$

… which is what we wanted.

Why Is This Interesting?

We’ve shown that the probability $P [q | X]$ summarizes all the information in $X$ relevant to $q$ , and throws out as much irrelevant information as possible. Why does this matter?

First, hopefully this provides some intuition for interpreting a probability $P [q | X]$ as a representation of the information in $X$ relevant to $q$ . In short: probabilities directly represent information. This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations - we can talk about simple mechanical processes which produce “maps” of the world, and the probabilistic calculations embedded in those processes.

Second, this is a natural first step in characterizing map-making processes. The data processing inequality also suggests a natural next step: sufficient statistics, which serve as minimal “lossless” maps when our data are independent samples from a distribution, and our queries can ask anything at all about the parameters of the distribution. From there, we could further generalize to approximate sufficient statistics - maps which throw out a small amount of information relevant to the queries, in cases where there is no “lossless” map smaller than $X$ .

That said, this still only talks about half of the full map-making process: the data-processing half. There’s still the data-collection half to address - the process which generates $X$ in the first place, and somehow ties it to the territory. And of course, when map-making processes are embedded in the real world, there might not be a clean separation between data-collection and data-processing, so we need to deal with that somehow.

Third, the interpretation of probabilities as a minimal map - or a representation of information - suggests possible avenues to relax traditional probability to deal with things which are time-like separated from us, e.g. for handling self-reference and advanced decision-theoretic problems. In particular, it suggests strategically throwing away information to deal with self-referential tomfoolery - but continuing to use probabilities to represent the reduced information remaining.

I hope that I'm not becoming that annoying person who redirects every conversation towards their own pet theory even when its of minimal relevance, but I think there are some very strong links here with my theory of logical counterfactuals as forgetting or erasing information (https://www.lesswrong.com/posts/BRuWm4GxcTNPn4XDX/deconfusing-logical-counterfactuals).

In fact, I wasn't quite aware of how strong these similarities were before I read this post. Both probabilities* and logical counterfactuals don't intrinsically exist in the universe; instead they seem to draw their existence from our epistemic situation. Zoom into the universe enough everything is deterministic and there is only one thing that an agent ever could have chosen; zoom out and there are multiple possibilities.

I'm currently writing a post where I'll argue that logical counterfactuals are an answer and not a question and I think this applies to probability too. Both exist in the map and not in the territory; this means that they are human created terms that exist for human purposes. Arguing what they are or aren't in the abstract indicates confusion; instead, we need to choose a purpose (or question) and then we can choose to notion appropriate to that. This is part of why these notions come apart with questions like Sleeping Beauty and Counterfactual Mugging.

*Okay, it kind of does under some theories of quantum mechanics, but lots of the discussion of probability uses probability to deal with a lack of complete knowledge as opposed to intrinsically existing uncertainty, which is of a different kind and can't be explained by quantum mechanics

That is highly relevant, and is basically where I was planning to go with my next post. In particular, see the dice problem in this comment - sometimes throwing away information requires randomizing our probability distribution. I suspect that this idea can be used to rederive Nash equilibria in a somewhat-more-embedded-looking scenario, e.g. our opponent making its decision by running a copy of us to see what we do.

Thanks for pointing out the relevance of your work, I'll probably use some of it.

We’ve shown that the probability summarizes all the information in $X$ relevant to $q$ , and throws out as much irrelevant information as possible.

This seems correct.

Lets say two different points in the data configuration space, X_1 and X_2, provide equal evidence for q. Then P[q|X_1] = P[q|X_2]. The two different data possibilities are mapped to the same point in this compressed map. So far so good.

(I assume that I should interpret the object P[q|X] as a function over X, not as a point probability for a specific X.)

First, hopefully this provides some intuition for interpreting a probability $P [q | X]$ as a representation of the information in $X$ relevant to $q$ . In short: probabilities directly represent information. This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations - we can talk about simple mechanical processes which produce “maps” of the world, and the probabilistic calculations embedded in those processes.

I don't think this works.

The map P[q|X] have gotten rid of all the irrelevant information in the map, but it still contains some information that never came from the map. I.e. P[q|X] is not generated only from the information in X relevant for q.

E.g. from P[q|X] we can get

P[q] = sum_X P[q|X]

i.e. the prior probability of q. And if the prior of q where different P[q|X] would be different too.

The way you can't (shouldn't) get rid of priors here, feels similar to how you can't (shouldn't) get rid of coordinates in physics. In this analogy, the choice of prior is analogues to the choice of the origin. Your choice of origin is completely subjective (even more so than the prior). Technically you can represent position in a coordinate free way (only relative positions), but no one does it, because doing so destroys other things.

(I'm being maximally critical, because you asked for it)

Yeah, I basically buy that complaint. None of this was intended to get rid of priors, because a correct model shouldn't get rid of priors.

In that case I'm confused about this statement

This interpretation makes sense even in the absence of “agents” with “beliefs”

What is priors in the absence of something like agents with beliefs?

Frequencies are one example.

Here's what you wrote:

This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations.

Do you still agree with yourself?

I'm no longer sure. Thanks!