on MIRI’s “Logical Induction” paper

When I first saw this paper, I said it looked impressive, and seemed a lot more substantial than MIRI’s other work, but I never really looked at it in much detail.  In the past week I’ve been having an extended conversation with a friend about it, and that spurred me to read it more closely.  I now think it’s much less impressive than it seems at first glance.

It occurred to me that the criticisms I made to my friend might be of wider interest, so I’ll write a post about them.

Summary:

The authors state something they call the “logical induction criterion,” meant to formalize a kind of ideal inference.  To satisfy this criterion, an inference procedure only needs to have certain asymptotic properties in the infinite-time limit.  Rather than being sufficient for good inference but too strong for practical computation (as the paper suggests), the criterion is too weak for good inference: it tolerates arbitrarily bad performance for arbitrarily long times, and more generally, provides no guarantees whatsoever about the quality of the inference procedure’s output at any finite time.  Moreover (see remarks below about pointwise convergence), there is a principled reason for the lack of guarantees which is baked deep into the paper’s approach, and which I don’t think could be fixed without completely overhauling that approach.

The easiest way to see why the LI criterion is problematic in practice is to consider the criterion-satisfying procedure constructed in the paper, called LIA.  Speaking very roughly, LIA makes a countably-infinite list (in unspecified/arbitrary order) of all the “mistakes” it could possibly make, and at discrete time n, does a brute force search for a set of beliefs which avoids the first n mistakes in the list.

Depending on the ordering of the list, LIA can make an arbitrarily bad mistake for an arbitrarily long time (some mistake has to go in slot number 3^^^3, etc.)  Nonetheless, LIA can be proven to asymptotically avoid every mistake, since for every mistake there is some time N at which it begins to avoid it.  Thus, LIA enjoys a very large number of nice-sounding asymptotic properties, but it converges for very different reasons than most algorithms one is used to hearing about: rather than converging to nice properties because it moves toward them, it simply exploits the fact that these properties can only fail in countably many ways, and ticks off those ways one by one.
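
To make this convergence pattern concrete, here is a toy sketch (my own invention for illustration, not LIA itself): an agent handed an enumerated list of possible mistakes, which at time n guards against only the first n of them.

```python
# Toy illustration (NOT LIA itself): an agent that "asymptotically avoids
# every mistake" in an enumerated list, simply by guarding against the
# first n mistakes at step n.  The enumeration order is arbitrary, so any
# particular mistake can be committed for an arbitrarily long time.

def make_agent(mistake_list):
    """mistake_list: an (arbitrarily ordered) list standing in for a
    countable enumeration of possible mistakes."""
    def avoided_at(n):
        # At time n, only the first n mistakes are avoided; every later
        # mistake in the enumeration is still being committed.
        return {m: (i < n) for i, m in enumerate(mistake_list)}
    return avoided_at

agent = make_agent(["mistake_A", "mistake_B", "mistake_C"])

# Asymptotically, each mistake is avoided: for every mistake there is
# some N after which it is avoided forever...
assert agent(3)["mistake_C"] is True
# ...but at any finite time, mistakes placed later in the enumeration
# are still being made.
assert agent(1)["mistake_C"] is False
```

The point of the sketch: every "nice asymptotic property" here comes from ticking off a countable list, not from any mechanism that moves the agent toward good behavior at finite times.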

Thus, LIA is like an “author-bot” which generates all possible character strings in some arbitrary order, and at each step consults a panel of readers to weigh in on its best work so far.  One could argue that this bot’s “asymptotic best work” will have any aspect of writerly brilliance imaginable, but that does not mean that the bot satisfies some “ideal writing criterion,” and indeed we’d expect its output in practice to lack any aspects of writerly brilliance.  (LIA differs from author-bot in that it has an objective, if very slow, way to find its “best work so far.”  But garbage in means garbage out.)

EDIT 6/28/17: I’ve added some remarks about pointwise vs. uniform convergence.  The relevance of this distinction to the LI paper was originally brought to my attention by @jadagul​ in a response to this post, and I think it’s important enough that it should be discussed here for completeness’ sake.  To find the part I added, search for “pointwise.”

More detail under the cut

The paper is long and complicated and I won’t even try to summarize it here in a way that will provide full technical understanding.  But I’ll sketch some things.  (N.B. I’ll often use “LI” below to abbreviate both “logical induction” and “logical inductor.”)

The paper is (roughly) concerned with the question of assigning probabilities (representing degrees of belief) to logical/mathematical statements.  The probabilities ought to be consistent with the results of pure deduction, but must somehow cope with the fact that at any given time, many theorems which are true are not yet proven.  This is a hard problem.

The paper’s basic framing seems good and interesting to me.  The authors use the relation between probabilities and (expected-utility-maximizing) betting odds, and think of the probabilities assigned to logical sentences at time n as being the “prices” of “shares” of those statements in a stock trading market.  This nicely cuts through philosophical issues like “what does it mean to believe that an unproven conjecture is probably true?”  Whatever else it means, it would seem to mean that one thinks a bet on the truth of that conjecture would eventually pay off, and that may be all one needs.

Given the above, it seems like we could get a good set of probabilities (and thus solve the problem) if we had an idealized efficient market in predictions about the truth-value of logical sentences.  This idea has the appeal of “outsourcing” the question of how to actually figure out the probabilities: if there are enough (sufficiently sensible/distinct) traders that arbitrage opportunities rarely persist, then we can read the right probabilities off of the prices, even if no single trader knows how to compute them.

If we try to turn this into an algorithm, we run into the problem that this market doesn’t actually exist, and that simulating it would mean simultaneously doing everything that all of the zillions of traders do.  Efficient markets can “know” information that no one person knows, but we are trying to specify a procedure usable by one person (or machine, or other idealized “agent”), so it is not clear how the magic of efficient markets will help.

The authors deal with this by giving up on having efficient prices at any given time, and instead requiring that any arbitrage opportunity will eventually be removed, although it may take an arbitrarily long time.


The technical setup of the paper is quite complicated, but here are some of the ingredients.  There is a discrete time variable, “n.”  As time elapses, a “deductive process” D proves more and more theorems.  Meanwhile, at time n, our induction procedure assigns probabilities/prices P_n to logical sentences, including those which D has not proved true or false.  There may also be “traders,” which place buy and sell orders on “shares” of sentences, and receive payouts if the sentences are later proven true.

Given this setup, the authors state the “logical induction criterion,” which is meant to identify price assignments P_n (for all n) which reflect ideal reasoning.  A pricing mechanism satisfies the logical induction criterion if the prices cannot be “exploited” by any “efficiently computable” trader.  The authors call such a pricing mechanism a “logical inductor.”

Two terms here require further notice: “exploited” and “efficiently computable.”

By “efficiently computable” (“e.c.”) the authors mean “computable in polynomial time, as a function of n.”  Thus the market doesn’t have to worry about strategies whose runtime blows up too fast as n increases.  This seems reasonable, and the authors cite some stuff to the effect that “can’t be done in polytime” basically means “can’t be done in the real world,” a claim I’ve heard before and which seems insightful.  (Although it is a bit odd in conjunction with LIA’s slow runtime; if non-e.c. traders are effectively impossible, is LIA impossible too?)

The really worrisome definition is “exploitable.”  This is defined asymptotically: a trader can make money from arbitrage for an arbitrarily long time without being said to “exploit” the prices, so long as the amount of money they can make has an upper bound.  This is very different from the usual notion of market (in)efficiency: any arbitrage opportunity is permissible by this definition at any given time n, so long as it will disappear at some later time N.
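
A toy numeric illustration of why this definition is so permissive (the trader and the cutoff N are invented for illustration): a trader who pockets $1 of arbitrage per step until some huge time N, and nothing afterward, has bounded cumulative profit and therefore does not “exploit” the market in the paper’s sense.

```python
# Toy illustration of the paper's asymptotic notion of (non-)exploitation:
# a hypothetical trader makes $1 of arbitrage per step up to step N, and
# nothing afterward.  Its cumulative profit is bounded above (by N
# dollars), so by the definition it does NOT "exploit" the prices --
# no matter how large N is.

def cumulative_profit(n, N=1_000_000):
    """Profit through time n for a trader who arbitrages $1/step until N."""
    return min(n, N)

# Bounded for all n, hence "no exploitation" in the paper's sense...
assert all(cumulative_profit(n) <= 1_000_000 for n in [10, 10**7, 10**9])
# ...even though the market leaves free money on the table for a million steps.
assert cumulative_profit(500_000) == 500_000
```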


Because the definition of exploitability is asymptotic, the results which follow from the LI criterion are all asymptotic.  The LI criterion does not guarantee you that anything will happen by any particular time n you can name – just that certain things will eventually happen.  The authors prove that any convergence rate formula for a logical inductor’s performance must be uncomputable, which sounds bad, but actually seems to me like a red herring: the real problem is that if your inductor has any nice convergence rate properties, these constitute an extra bonus perk not ensured by the LI criterion itself.

Putting it the other way around: the authors prove that an LI asymptotically makes many smart choices because otherwise a smart e.c. trader could fleece them forever – but for any mistake, there is an LI which makes that mistake until time n = 3^^^3, or n = 3^^^(3^^^3), or whatever.

The authors in fact prove a theorem driving this point home, to which they give the modest name “Closure under Finite Perturbations” (Theorem 4.6.1).  It says that an LI is (intuitively enough) still an LI if you change its P_n for only finitely many n.  As the authors say:

This means that we can take a logical inductor, completely ruin its beliefs on the 23rd day (e.g., by setting P_{23}(phi) = 0 for all phi), and it will still be a logical inductor.

One could also set P_n = 0 for all n within the lifetime of the universe, etc.
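
A small numeric analogue of the perturbation theorem (the sequences here are my own stand-ins, not anything from the paper): corrupting any finite prefix of a convergent sequence leaves its limit, and every other asymptotic property, untouched.

```python
# Analogue of "Closure under Finite Perturbations" (Theorem 4.6.1):
# a sequence's limit is unchanged if you ruin any finite prefix of it.
# Here p(n) stands in for a belief sequence converging to 1.

def p(n):
    return 1 - 1 / (n + 1)          # converges to 1

def ruined_p(n, ruined_through=23):
    # "Completely ruin its beliefs" through day 23, as in the quote above.
    return 0.0 if n <= ruined_through else p(n)

# Both sequences have the same limit, so any purely asymptotic criterion
# cannot tell them apart...
assert abs(p(10**6) - 1) < 1e-5
assert abs(ruined_p(10**6) - 1) < 1e-5
# ...even though one of them is completely wrong on day 23.
assert ruined_p(23) == 0.0
```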


Interestingly, the authors are still able to prove some things about the “speed” (in some sense) at which any LI proceeds.  This involves something kind of like the definition of a Cauchy sequence, where a certain “speed” property can be guaranteed in the “tail” of the sequence, where the “tail” is defined as n > N for some N depending on epsilon.

For instance, they prove (Theorem 4.2.4) that any LI will “outpace deduction,” in the sense that if we have a sequence of theorems (phi_0, phi_1, …) which can be efficiently generated, then even if the difficulty of proving phi_k grows fast with k (say phi_k is proven at time n = f(k) for some fast-growing f), the LI will eventually start assigning high probability to phi_k by time n = k.  “Eventually,” here, means “for n > N”: if I give you an epsilon > 0, you can guarantee that P_n(phi_n) > 1 - epsilon, but only after some epsilon-dependent cutoff time.

This is sort of a recurring theme of the paper: the authors mention desirable properties that we intuitively think of as holding “for all x” at some given time (e.g. efficient market prices resist all arbitrage attempts simultaneously), then assign each x its own distinct N and guarantee the property for x only after its corresponding N.

(EDIT 6/28/17) As @jadagul brought to my attention, the distinction referred to in the previous paragraph is precisely the distinction between pointwise and uniform convergence.

A sequence of functions f_n(x) converges to the function f(x) pointwise if, for each x and each epsilon > 0, there is an N such that |f_n(x) - f(x)| < epsilon if n > N.  Note that N is allowed to depend on x as well as epsilon.

The sequence f_n(x) converges to f(x) uniformly if for each epsilon > 0, there is an N such that for all x, |f_n(x) - f(x)| < epsilon if n > N.  Here, we do not allow each x to have “its own N.”  Instead, we require f_n to be close to f globally once n > N.  This is a stronger requirement than pointwise convergence.

In analysis, the notion of uniform convergence is useful because it guarantees that various properties of f_n will carry over to f.  For instance, if the f_n are continuous and they converge uniformly to f, then f is also continuous.  But if the f_n only converge pointwise, f may not be continuous.  (@jadagul​ gives the example f_n(x) = x^n on [0,1], which converges pointwise to a discontinuous function.  If you are familiar with Fourier series, the Gibbs phenomenon is another example.)
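
@jadagul’s example is easy to check numerically:

```python
# @jadagul's example: f_n(x) = x**n on [0, 1] converges pointwise to a
# discontinuous limit function: 0 for x < 1, but 1 at x = 1.

def f(n, x):
    return x ** n

# For any fixed x < 1, x**n -> 0 (each x just needs its own large-enough n)...
assert f(10_000, 0.5) < 1e-9
assert f(10_000, 0.99) < 1e-9   # 0.99**10000 = e^(-100.5...), vanishingly small
# ...but at x = 1 the limit is 1, so the limit function has a jump there.
assert f(10_000, 1.0) == 1.0
```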

In the case at hand – logical inductors – we have theorems about the LI’s performance on some class of objects, such as “e.c. sequences of theorems.”  The equivalent of “x” is some variable indexing objects from the class.

For instance, Theorem 4.2.4 is a “pointwise convergence” result in the sense that each e.c. sequence of theorems is allowed to have its own distinct N (for a given epsilon).  It doesn’t give us any guarantees of the form “by some time N, the LI will [have some good property] for all e.c. sequences of theorems.”

This is important, because we care about whether or not properties of the limit P_{\infty} will hold (perhaps in approximate form) for the finite-time results P_n.  Our interest in proving convergence results relies on this kind of transfer: since we will never actually reach P_{\infty}, its properties are only of interest insofar as they carry over somehow to P_n.  Typically, we need uniform convergence for this to happen, not just pointwise convergence.

In the case of Theorem 4.2.4 (with some epsilon chosen), we can always find an N that will work for some finite set of e.c. sequences, simply by finding the N for each individual sequence in the set and taking the maximum.  But if we are interested in an infinite set of e.c. sequences, there may be no finite N that works for all of them.  More generally, when the paper proves a convergence result for each object of some type, we cannot assume this will extend to infinite sets of such objects.
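
The finite-versus-infinite point can be made quantitative with the x^n example: for a given tolerance, each x has its own smallest workable N, the max over any finite set of x’s is finite, but the required N blows up as x approaches 1, so no single N covers all of [0, 1).

```python
import math

# For f_n(x) = x**n and a tolerance eps, the smallest N with x**N < eps
# is N(x) = ceil(log(eps) / log(x)).  Over any FINITE set of points we
# can take the max of these N's; but N(x) -> infinity as x -> 1, so no
# single N works for all x in [0, 1): the convergence is not uniform.

def N_for(x, eps=0.01):
    return math.ceil(math.log(eps) / math.log(x))

finite_set = [0.5, 0.9, 0.99]
N = max(N_for(x) for x in finite_set)   # one N covers the whole finite set
assert all(x ** N < 0.01 for x in finite_set)

# But the required N grows without bound as x approaches 1:
assert N_for(0.999) > N_for(0.99) > N_for(0.9)
```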


The authors believe that the LI criterion is strong, not weak, and thus they are concerned with the issue that it might be infeasible in one strong sense or another.  So they show that it can be satisfied computably, by constructing a computable procedure which satisfies it, called LIA.

One ingredient of LIA (pp. 54-5) is T^k, a complete list of e.c. traders (i.e. every e.c. strategy for trading is in the list somewhere).  The order of this list is left arbitrary, so any such list will give rise to a version of LIA, no matter the order.  At time n, LIA “is aware of” the trading strategies T^1, T^2, …, T^n, and devises its probabilities P_n so that none of these n traders can do any arbitrage.

Doing this is not trivial, and most of the talk about LIA in the paper is about the tricks needed to do so.  First (in the steps “Budgeter” and “TradingFirm”), LIA does some tricks to combine T^1, T^2, …, T^n into a single “supertrader,” which can do arbitrage if any of the constituent traders can.  Then (in the step “MarketMaker”), now tasked only with evading the supertrader, LIA does a brute force search to find prices that block the supertrader from arbitrage in all possible “worlds” (roughly: no matter how the truth values of any unsolved conjectures turn out).

LIA was clearly designed with the goal of enabling proofs: it is not supposed to be an especially effective induction procedure, just one which is especially amenable to proof.  Indeed, what LIA actually does is pretty weird.  Each timestep, it adds one new e.c. trader to the market.   At time n, it laboriously constructs (truly) efficient prices for the n traders currently in the market.  The traders can “arrive” in any order.
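
The outer loop just described can be sketched schematically.  This is my own paraphrase of the structure on pp. 54-5: the names `enumerate_ec_traders` and `find_unexploitable_prices` are invented stand-ins for the paper’s machinery, not its actual pseudocode.

```python
# Schematic of LIA's outer loop -- my paraphrase, not the paper's code.
# `enumerate_ec_traders` stands in for the arbitrary-order enumeration
# T^1, T^2, ... of all e.c. traders; `find_unexploitable_prices` stands
# in for Budgeter + TradingFirm (combine the active traders into one
# supertrader) followed by MarketMaker (brute-force search for prices
# the supertrader cannot arbitrage in any possible world).

def LIA_step(n, enumerate_ec_traders, find_unexploitable_prices):
    # At time n, only the first n traders in the enumeration are "in the
    # market"; the countably many later traders are ignored entirely.
    active = [enumerate_ec_traders(k) for k in range(1, n + 1)]
    return find_unexploitable_prices(active)

# Dummy stand-ins so the schematic runs:
traders = lambda k: f"T^{k}"
prices = lambda active: {"blocked": active}

assert LIA_step(3, traders, prices)["blocked"] == ["T^1", "T^2", "T^3"]
```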

There’s a weird optical-illusion quality to this stuff, where either you see some nice property or its nasty opposite.  The whole motivation of the LI criterion was resistance to arbitrage, as motivated by ideas about efficient markets.  LIA is an LI because it is “eventually” resistant to each and every (e.c.) arbitrage opportunity.  But LIA works by listing the countably-many ways it could be arbitraged, and (at any finite time n) ignoring countably many of them, armoring itself against only a finite number.  All of the shiny, “good” properties of LIs derive in some way from the concept of outwitting every e.c. arbitrage opportunity, and there are countably many of those; is it therefore equally “bad” that LIA always ignores countably many e.c. arbitrage opportunities?


In the conclusion of the paper (near the middle of the PDF, since there’s a big proof appendix), the authors say some things about “ensemble methods” which seem very confused to me.

They are fully aware that their results are not “practical” in several senses.  LIA’s time complexity is very bad, you can’t compute convergence rates even for a specific LI, etc.  But they are optimistic that the ideal (!) represented by their LI criterion may be a fruitful generator of practical methods, which try to approximately satisfy it.  They justify their optimism by claiming that LIs are “ensemble methods.”  This is encouraging, they say, because (1) ensemble methods are great in practice, often the state of the art, and (2) ensemble methods are (they claim) inspired by the “ideal” of Solomonoff induction, which came before them, so maybe LI will fill the gaps left by the Solomonoff ideal and generate the corresponding practical methods.

The authors cite a good paper (Dietterich 2000) which clearly delineates the several reasons why ensemble methods work well in practice, but they do not explain how LI relates to these reasons, and in fact the paper is at odds with the idea that ensemble methods derive from Solomonoff.  Dietterich lists and visualizes three ways that ensembles can help (see his pp. 3-4).  One of them involves computational approximation procedures that can be avoided by ideal reasoners like Solomonoff inductors, and one of them involves limitations on the class of models under consideration, where Solomonoff is supposed to consider all computable models at once.

Similar remarks apply to much of the material in the paper.  All of LIA’s computations are exact (if it ever appears to “get stuck in local minima,” this will be due to a bad ordering of T^k, not some computing shortcut), and the pleasant properties of LIs come not from combining traders to evade the limits of the “trader class,” but from the fact that the trader class eventually includes everything.

It is not really clear to me what the authors actually mean by “ensemble method.”  They refer to LIs aggregating information from multiple traders and thus going beyond each of the individual traders.  But the notion of aggregation/combination here is very loose.  Ensemble methods in machine learning train multiple models and then actually average their results (or take the mode, etc.), and this distinguishes them from methods which do not do this.  They’re a way of taking a finite and limited model class and making something better from it.  The flavor of the LI theorems (and of Solomonoff) is more like “wouldn’t it be great if every model is in your model class?”  The authors seem to be calling the set of all models an “ensemble,” but it’s just a model class (the biggest one); an ensemble method needs not just a model class but something ensemble-y to do with it.  (Linear regression has a model class, but least squares linear regression is not an ensemble method.)
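
For contrast, here is a minimal sketch of what “ensemble method” standardly denotes in machine learning (the models are toy stand-ins I made up): a finite, limited set of trained models whose predictions are actually aggregated.

```python
# What "ensemble method" usually means in machine learning: take a
# finite, limited collection of trained models and aggregate their
# predictions (here, by simple averaging).  Nothing in this pattern
# requires -- or resembles -- a model class containing every computable
# model.

def ensemble_predict(models, x):
    """Average the predictions of a finite collection of models."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Three deliberately crude toy "models" of the target function y = 2x:
models = [lambda x: 1.8 * x, lambda x: 2.1 * x, lambda x: 2.4 * x]

# The average (effectively 2.1x) can beat most individual members, even
# though every member comes from the same small model class.
assert abs(ensemble_predict(models, 10) - 21.0) < 1e-9
```

The ensemble-y step is the aggregation itself; a model class alone, however large, is not an ensemble method.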

This is a (marginally) more subjective point than the above, so for documentation, here are the passages I’m objecting to:

Solomonoff’s theory involves a predictor that considers all computable hypotheses about their observations, weighted by simplicity, and uses Bayesian inference to zero in on the best computable hypothesis. This (uncomputable) algorithm is impractical, but has nevertheless been of theoretical use: its basic idiom—consult a series of experts, reward accurate predictions, and penalize complexity—is commonplace in statistics, predictive analytics, and machine learning. These “ensemble methods” often perform quite well in practice. Refer to Opitz and Maclin (1999); Dietterich (2000) for reviews of popular and successful ensemble methods. (p. 68)

Logical inductors are far from efficient, but they do raise an interesting empirical question. While the theoretically ideal ensemble method (the universal semimeasure [Li and Vitányi 1993]) is uncomputable, practical ensemble methods often make very good predictions about their environments. It is therefore plausible that practical logical induction-inspired approaches could manage logical uncertainty well in practice. Imagine we pick some limited domain of reasoning, and a collection of constant- and linear-time traders. Imagine we use standard approximation methods (such as gradient descent) to find approximately-stable market prices that aggregate knowledge from those traders. Given sufficient insight and tweaking, would the resulting algorithm be good at learning to respect logical patterns in practice? This is an empirical question, and it remains to be tested. (p. 73)

That second paragraph (the last one in the paper) is jarring: suddenly, just as the curtain is closing, we are back in the world of practical algorithms, and all the finite-time considerations which were carefully ignored in all the LI stuff (which traders should we consider first? what if we have to compute efficiently?) rear their heads.  Why might this conjectured method work?  Well, because it approximates an LI, and all LIs are good.

Is this line of reasoning sensible supposing the premises are true?  No.

Are all LIs good?  Not as far as I can see.
