Some Criticisms of the Logical Induction paper

by IAFF-User-101 min read1st Jul 2017No comments


Personal Blog
9 comments, sorted by Highlighting new comments since Today at 1:40 AM
New Comment

The substance of this criticism mostly reduces to "all results proven are only asymptotic." This is not very useful criticism. Yes, the results are asymptotic, but this is true for half of theoretical computer science: complexity theory, statistical learning theory, you name it... The LI is obviously not a practical algorithm, but it is not supposed to be a practical algorithm. What it's supposed to be, IMO, is an existence proof that certain (asymptotic!) desiderata can be satisfied. To make an analogy, the fundumental theorem of statistical learning theory tells us that the ERM principle is guaranteed to converge to the right answer whenever the VC dimension is finite. Now, if we want to make few domain-specific assumptions, we want to take a hypothesis space as large as possible and, voila, feed-forward NNs have a finite VC dimension but also allow approximating any Boolean circuit of given size, so, like in Solomonoff induction, we can approximate anything computable. On the other hand, applying the ERM principle to NNs is NP-hard, so we do gradient descend instead and nobody knows why it works. If we live long enough, I believe we will eventually find a mathematical characterization for a large natural class of problems where gradient descent for ANNs provably approximates the global optimum. Similarly, Solomonoff induction and the LI (in the abstract setting, not necessarily in the formal logical setting) should also have efficient analogues that work for some large natural class of problems. We don't know these analogues yet (which is probably a good thing), but this doesn't mean they don't exit. Moreover, the notion that "LIA is like an author-bot which generates all possible character strings in some arbitrary order" misses the point that we expect traders with low description complexity to be more useful for the usual Occam's razor reason. So, to sum up, yes, the LI paper doesn't solve all problems in the world, but neither does it pretend to do it, and it seems a decent result by any realistic standard.

I think the asymptotic nature of LI is much more worrying than the things you compare it to. In order for an asymptotic result to be encouraging for practical applications, we need it to be an asymptotic result such that in most cases of interest when it holds, you tend to also have good finite results. If you give a algorithm for some problem, this is an encouraging sign that maybe you can develop a practical algorithm. But if a problem is known not to be in , and then you give an exponential-time algorithm for it, and argue that it can probably be solved efficiently in practice because you've established an asymptotic bound, then no one will take this claim very seriously. It seems to me that this is roughly the position logical induction is in, but lifted from efficiency all the way up to computability.

The probabilities given by logical induction converge to coherent probabilities, but there is no computable function that will tell you how long you have to run logical induction to know the eventual probability of a certain sentence within a certain precision. We know logical induction doesn't do this because it cannot be done; if it could be, then there would be a computable set containing all provable sentences and no disprovable sentences (since every sentence would eventually be shown either not to have probability 0 or not to have probability 1), but there is no such computable set. So we know for sure that in the naive sense, there is no way, even in principle, to run logical induction long enough that you have probabilities you can trust to be reasonably accurate. Maybe a better complexity theory analogy would be if you give a algorithm for a problem known not to be in (although there would need to be exciting advancements in complexity theory before that could happen); this gives good reasons to believe that improvements to the algorithm would make it practical to run, and if you run it enough times the majority vote will probably be correct, but it will never be practical to run it enough times for that to happen. Likewise, efficient approximations to logical induction should be able to complete each step in a reasonable amount of time, but will not be able to complete enough steps to give you accurate probabilities.

In order to take the probabilities given by logical induction at very large finite steps seriously, you would need effective asymptotic results, and since this cannot be done for the probabilities converging to coherent probabilities, there would need to be subtler senses in which the probabilities given at finite times can be taken seriously even if they are not close to the limiting probabilities. Now, the logical induction paper does give interesting subtler senses in which the probabilities given by logical induction can be taken seriously before they approach their limiting values, but the fundamental problem is that all of those results say that some pattern will eventually hold, for the same sense of "eventually" in which the probabilities eventually converge to coherent probabilities, so this does not give strong evidence that those desirable patterns can be made to emerge in any reasonable amount of time.

Alex, the difference between your point of view and my own, is that I don't care that much about the applications of LI to formal logic. I'm much more interested in its applications to forecasting. If Fletcher is worried about the same issues as you, IMO it is not very clear from 'eir essay. On the other hand, the original paper doesn't talk much about applications outside of formal logic, and my countercriticism might be slightly unfair since it's grounded in a perspective that Fletcher doesn't know about.

Replying to "240"

First, yes, the convergence property is not uniform, but it neither should nor can be uniform. This is completely analogous to usual Bayesian inference where the speed of convergence depends on the probability of the correct hypothesis within the prior, or to the Structural Risk Minimization principle in PAC learning. This is unavoidable: if two models only start to differ at a very late point in time, there no way to distinguish between them before this point. In particular, human brains have the same limitation: it would take you more time to notice a complex pattern than a simple pattern.

Second, of course we can and should work on results about convergence rates, it is just that there is only that much ground you can reasonably cover in one paper. For example, the classical paper on merging of opinions by Blackwell and Dubins also doesn't analyze convergence rates, and nevertheless it has 500+ citations. To make one concrete guess, if we consider a sequence of observations in and look at the deviation between the probability the incomplete model forecaster assigns to the next observation and the probability interval that follows from a given correct incomplete model, then I expect to have some sort of cumulative bound on these deviations analogous to standard regret bounds in online learning. Of course the bound will include some parameter related to the "prior probability" .

Third, the only reason the sum in equation (31) is finite is because I wanted to state Theorem 2 in greater generality, without requiring the gamblers to be bounded. However, the gamblers that I actually use (see section 5) are bounded, so for this purpose I might as well have taken an infinite sum over all gamblers.

Finally, there are two important differences between dominant forecasting and your root finding example: one is that, in the root finding example, proving the possibility of convergence is trivial whereas here it is non-trivial so the result is valuable even if the desideratum is not strong enough by itself; another is that enumerating all rational numbers is contrived in a way that the dominant forecaster construction is not, although obviously in itself this is a subjective claim and it's probably not productive to argue about it if we disagree.

Replying to 240 (I can't reply directly because of some quirk of the forum).

(I'm not sure that I know your name, is it Robert?)

I'm actually not interested in discussing the verbal arguments in the paper. Who reads the verbal arguments anyway? I go straight for the equations ;) The verbal arguments might or might not misrepresent the results, I don't care either way.

I am interested in discussing the mathematical content of the LI paper and whether LI is a valuable mathematical discovery.

I agree that there is a relevant sense in which the dominance property in itself is too weak. The way I see it, LI (more generally, dominant forecasting, since I don't care much about the application to formal logic in particular) is a generalization of Bayesian inference, and the dominance property is roughly analogous to merging of opinions (in my paper there is a more precise analogy, but it doesn't matter). In the Bayesian case, we can also say that merging of opinions is a too weak property in itself. However:

  • Proving this weak property is already non-trivial, so it is a valuable and necessary step towards proving stronger properties.

  • Besides the property, we have an actual construction and I expect this construction to have stronger properties (even though I don't expect it to be a practical algorithm), like Bayesian inference has stronger properties than just merging of opinions.

Now, nobody currently has a proof of stronger properties or even a formulation (I think), but IMO this is not a reason to disregard all work done so far. Rome wasn't built in a day :)

Replying to Rob.

Actually, I do know a stronger property of LI / dominant forecasting. In the notation of my paper, we have the following:

Let be a family of gamblers, probability distributions supported on . Then, there exists a forecaster s.t. for any () and , if the first condition holds then the second condition holds:

The above is not hard to see from the proof of Theorem 2, if you keep track of the finite terms. Now, this is not really a bound on the time of convergence, it is something like a bound on the number of mistakes made, which is completely analogous to standard guarantees in online learning.

If you're running LIA and stop enumerating traders at , you will get the same dominance property, only for traders up to .

The fact that VC dimension is non-decreasing for a family of nested classes is a tautology. There is nothing special about the order of the hypotheses in SRM: the order (and weighting) of hypotheses is external information that reflects your prior about the correct hypothesis. Similarly, in LIA we can order the traders by description complexity, same as in Solomonoff induction, because we expect simpler patterns to be more important than complex patterns. This is nothing but the usual Occam's razor. Or, we can consider more specialized dominant forecasting, with gamblers and ordering selected according to domain-specific considerations.

Replying to Rob.

I don't think that stopping at does some kind of "disproportionate" damage to the LI. For example, Theorem 4.2.1 in the LI paper requires one trader per each and sequence of theorems. If this trader is in the selection, then the probabilities of the theorems will converge to 1 within . Similarly, in my paper you need the gambler to ensure your forecaster converges to incomplete model within .

You can do SRM for any sequence of nested classes of finite VC dimension. For example, if you have a countable set of hypotheses , you can take classes to be . This is just as arbitrary as in LI. The thing is, the error bound that SRM satisfies depends on the actual class in which the reference hypothesis lies. So, compared to a hypothesis in a very high class, SRM can converge very slowly (require very large sample size). This is completely analogous to the inequality I gave before, where appears in the denominator of the bound, so gamblers that are "far away" in the "prior" can win a lot of bets before the forecaster catches up. SRM is useful iff you a priori expect "low" hypotheses to be good approximations. For example, suppose you want to fit a polynomial to some data but you don't know what degree to use. SRM gives you a rule that determines the degree automatically, from the data itself. However, the reason this rule has good performance is because we expect most functions in the real world to be relatively "smooth" and therefore well-approximable by a low degree polynomial.

Ordering by description complexity is perfectly computable in itself, it just means that we fix a UTM, represent traders by programs (on which we impose a time bound, otherwise it really is uncomputable), and weight each trader by 2^{-program length}. It would be interesting to find some good property this thing has. If we don't impose a time bound (so we are uncomputable), then the Bayesian analogue is Solomonoff induction, which has the nice property that it only "weakly" depends on the choice of UTM. Will different UTMs gives "similar" LIs? Off the top of my head, I have no idea! When we add the time bound it gets more messy since time is affect by translation between UTMs, so I'm not sure how to formulate an "invariance" property even in the Bayesian case. Looking through Schmidhuber's paper on the speed prior, I see ey do have some kind of invariance (section 4) but I'm too lazy to understand the details right now.

Yeah. An asymptotic thing like Solomonoff induction can still have all sorts of mathematical goodness, like multiple surprisingly equivalent definitions, uniqueness/optimality properties, etc. It doesn't have to be immediately practical to be worth studying. I hope LI can also end up like that.

A few thoughts:

I agree that the LI criterion is "pointwise" in the way that you describe, but I think that this is both pretty good and as much as could actually be asked. A single efficiently computable trader can do a lot. It can enforce coherence on a polynomially growing set of sentences, search for proofs using many different proof strategies, enforce a polynomially growing set of statistical patterns, enforce reflection properties on a polynomially large set of sentences, etc. So, eventually the market will not be exploitable on all these things simultaneously, which seems like a pretty good level of accurate beliefs to have.

On the other side of things, it would be far to strong to ask for a uniform bound of the form "for every , there is some day such that after step , no trader can multiply its wealth by a factor more than ". This is because a trader can be hardcoded with arbitrarily many hard-to-compute facts. For every , there must eventually be a day on which the belief of your logical inductor assign probability less than to some true statement, at which point a trader who has that statement hardcoded can multiply its wealth by . (I can give a construction of such a sentence using self-reference if you want, but it's also intuitively natural - just pick many mutually exclusive statements with nothing to break the symmetry.)

Thus, I wouldn't think of traders as "mistakes", as you do in the post. A trader can gain money on the market if the market doesn't already know all facts that will be listed by the deductive process, but that is a very high bar. Doing well against finitely many traders is already "pretty good".

What you can ask for regarding uniformity is for some simple function such that any trader can multiply its wealth by at most a factor . This is basically the idea of the mistake bound model in learning theory; you bound how many mistakes happen rather than when they happen. This would let you say a more than the one-trader properties I mentioned in my first paragraph. In fact, has this propery; is just the initial wealth of the trader. You may therefore want to do something like setting traders' initial wealths according to some measure of complexity. Admittedly this isn't made explicit in the paper, but there's not much additional that needs to be done to think in this way; it's just the combination of the individual proofs in the paper with the explicit bounds you get from the initial wealths of the traders involved.

I basically agree completely on your last few points. The traders are a model class, not an ensemble method in any substantive way, and it is just confusing to connect them to the papers on ensemble methods that the LI paper references. Also, while I use the idea of logical induction to do research that I hope will be relevant to practical algorithms, it seems unlikely than any practical algorithm will look much like a LI. For one thing, finding fixed points is really hard without some property stronger than continuity, and you'd need a pretty good reason to put it in the inner loop of anything.