Asymptotic Logical Uncertainty: Solomonoff Induction Inspired Approach

by Scott Garrabrant3 min read22nd Jun 2015No comments

3

Personal Blog

This is post is part of the Asymptotic Logical Uncertainty series. This is about a failed attempt to satisfy the Benford test. I will not prove things about this approach. The only reason I am including this approach is because the fact that it does not work is surprising and insightful.

Let Geometric() be the function which outputs with probability .

Fix a prefix-free encoding of probabilistic Turing machines, and let denote the number of bits used to encode . Let be the function which outputs the Turing machine with probability proportional to .

Fix , a deterministic Turing Machine (for example ). Fix a time complexity function and another time complexity function . Consider the following algorithm.

SIA(E,T,R)
 1: n=0
 2: N=1
 3: M=RTM(3)
 4: g=Geometric(1/2)
 5: loop
 6:     if 4^n*R(N)*log(R(N))*N<T(N) then
 7:         run E for one more time step
 8:         if E outputs another bit then
 9:             n=n+1
10:             repeat
11:                 M=RTM(3)
12:                 g=Geometric(1/2)
13:             until M^{gR}(i)=E(i) for all i<=n
14:     output M^{gR}(N)
15:     N=N+1

This code is inspired by Solomonoff Induction. We cannot directly apply Solomonoff induction, because the lack of a time bound. Solomonoff induction would pick out the program , but that will not allow us to predict quickly. We have to restrict the run times of the programs to ensure that they can compute their th bit in time T(N).

When trying to predict , we compute the first values of for some much smaller than . We then sample probabilistic Turing machines until we find one that quickly gets all of these first bits correct. We use that Turing machine to compute our guess at .

The purpose of the geometric random variable is to make the time we allow the sampled Turing machines to run more flexible. The sampled Turing machines can take any amount of time in but get an extra penalty for the size of the constant. The fact that we use instead of is to make the proof that it satisfies the weak Benford test work, would probably work too. Line 6 is only to make sure it runs in time .

passes the weak Benford test, but probably fails the Benford test.

First, let me explain why we thought this approach would work. If you sample a Turing machine that is very likely to get all of the bits correct, but gives suboptimal probabilities on the Benford test sentences, then you can modify that program to form a program that outputs 1 with probability for the Benford test sentences, and follows otherwise.

This modified program will be just as likely to get all the non-Benford sentences correct, and more likely to get the the Benford sentences correct. Further, this improvement on the Benford sentences is unbounded as you consider more and more sentences, and therfore eventually pays for all of the cost associated with replacing with .

The problem is that while is more likely to get the Benford sentences correct, we are not optimizing for programs that are likely to get the Benford sentences correct. Instead, we are optimizing for programs that are likely to get the Benford sentences correct conditioned on getting all the other questions correct.

At first, this may not seem like much of problem. The programs are entangling their results on some inputs with their results on other inputs. The problem comes from the fact that sometimes the program tries to entangle its results with a much much earlier sentence, and when the program was answering that earlier sentence it had much less time to think.

To give a concrete example of why this is a problem, consider the sentence: ="The first digit of is a 1 if and only if the first digit of is a 1."

The probability we should assign to this sentence is . However, by the time we have to assign a probability to the th Benford sentence, we may have already calculated the first digit of and seen that it is a 1. In which case, to maximize the probability of getting all the bits correct, we have to have our answer to the th Benford sentence match our answer to .

The correct thing to do would be to update the probability assigned to when observing that the first digit of is a 1. However, that is not possible, so the next best thing is to change the answer to the th Benford sentence, which causes it to give the wrong answer.

I believe that the fact that this program does not work can be extended to an impossibility result saying that any program which loops through Turing machines, gives them a score based on how many answers they get correct, and samples according to their complexity and their score, must fail the Benford test.

Personal Blog

3

6 comments, sorted by Highlighting new comments since Today at 10:28 AM
New Comment

I'm not sure the counterexample given here works as stated. What follows is my attempt to go through details. (I still thing SIA is doing something wrong and is unlikely to work, but it is important to get the details right to be sure.)

As I understood it, the problem was as follows. (Most of this is just re-wording of things you said in the post.)

We wanted to design SIA such that if there is an optimal heuristic for a quickly computable sub-sequence of , then it would learn to apply that heuristic to those problems. In particular, if the Benford sentences are embedded in a larger sequence such as , it should use Benford probabilities on that subset.

SIA fails to achieve this sub-sequence optimality because the "objective function" is not decoupled: Bayesian updating "incentivizes" a high joint score, not a high score on individual questions. In particular the program "wants" to condition on its getting all previous questions in sequence correct.

As we've discussed extensively in person, this incentive gives an advantage to programs which answer with 1s and 0s deterministically rather than probabilistically. (The stochastic Turing machines will learn to act like deterministic TMs.) The programs want badly to be able to remember their previous answers, to be able to update on them. They can do this by using extra bits in their description length to "memorize" answers, rather than generating answers stochastically. This is worthwhile for the programs even as we make program-description bits more expensive (using rather than ), because a memorized logical fact can be used to get correct answers on so many future logical facts. In effect, giving 0s and 1s rather than stochastic answers is such a good strategy that we cannot incentivize programs out of this behavior (without totally washing out the effect of learning).

Rather than gaining traction by giving answers with Benford probabilities, programs gain traction by using appropriate description languages in memory such that the prior on programs will assign Benford probabilities to the different extensions of a program, purely as a matter of program length. This allows to give good probabilities even though the programs in its mixture distributions are learning not to do so.

Having understood this part of the problem, let's discuss the example you give in the post.

You consider three sentences: , , and . We assume that these are interspersed in , and that has already been trained up to a large on this kind of problem; we wish to show that the answers for the subsequence depend in a problematic way on the answers for subsequence .

The argument seems to be: suppose that the question ordering is such that and are considered long before . Now, when considering , the programs will have a lot more time; in particular, they have time to compute the actual answer to from scratch, and also have time to call themselves on and to see what their earlier selves answered for those questions.

We note that the probability we could independently give to is a specific quantity based on Benford probabilities, but not a Benford probability itself.

Now I'm going to start filling in parts of the argument which I don't see spelled out.

We assume that when considering , has enough time to calculate as an explicit training example. It turns out that this sentence is true. All programs which guessed that it was false are eliminated from the set of possible programs to use on .

Now, when calling themselves to see what they said for simpler problems, the programs can potentially see that they guessed 1 for . They can also see their guess for . Some of the programs will have guessed true and some false. Assume that they answer true and false with proportion . I'll call this assumption (*) for later reference.

The programs have enough information to deduce the answer to which is consistent with their answers so far. is true, so they can simply check what they replied on and reply the same thing for . This is better than guessing with Benford probability, because programs which guessed the wrong answer for will be eliminated later anyway. By assumption (*), we conclude that the probability approaches as increases.

Can we justify assumption (*)? Suppose that when the program considers , it does not have time to look up the answer it gave for . Then its best bet is to answer using the Benford assumption on and , resulting in the probability estimate .

But this seems potentially unrealistic. The program has learned to memorize . It can find what it answered in the time it takes to do a lookup. In this case, the programs are better off having guessed in Benford proportion, conditioned on being true. (They also guess in reverse of Benford proportion for the case when is false, but remember that (*) was an assumption specifically about the proportion conditioning on being true.)

I believe your concern comes from the fact that at the time the program has to assign a probability to , it has not only deduced the truth of but it also earlier guessed at the truth value of . When it guesses here, it loses some probability mass, but it can lose some of that probability mass in a way that is correlated to the answer it gave to . This way, it can still give the correct probability on .

Here is my fix: Instead of consider the case where we are trying to guess only sentences of the form and for some . Meaning we modify to reject any sentence not of that form. Both of these sub sequences are indistinguishable from coin flips with fixed probabilities. In this case, SIA will not get the correct probabilities on both subsequences, because it has an incentive to make its answers to match its answers to (not match when is false), and any program that does not make them match will be trumped by one that will.

This does not mean that we have this property when we consider all of , but the code in no way depends on , and I see no reason to think that it will work for , but not the modified .

I agree, this version works.

To walk through it in a bit more detail:

Now we are only considering two sentence schemas, and . (Also, ignore the (rare) case where is an Ackermann number.)

I'll call the Benford probability , and (as before) the probability .

At the time when considers , we assume it does not give its sampled programs enough time to solve either or . (This assumption is part of the problem setup; it seems likely that cases like this cannot be ruled out by a simple rule, though.) The best thing the programs can do is treat the like coin flips with probability .

At the time when is considered, the program has enough time to compute (again as part of the problem setup). It can also remember what guess it made on . The best thing it can do now is to logically combine those to determine . This causes it to not treat like a random coin. For cases where , the population of sampled programs will guess with frequency approaching . For cases where , the frequency will be .

This behavior is the optimal response to the problem as given to , but is suboptimal for what we actually wanted. The Bayes score of on the sub-sequence consisting of only the is suboptimal. It will average out to probability , but continue to be higher and lower for individual cases, without actually predicting those cases more effectively; is acting like it thinks there is a correlation between and when there is none. (This is especially odd considering that isn't even being asked to predict in general, in this case!)

This is still not a proof, but it looks like it could be turned into one.

I'm hoping writing it out like this unpacks some of the mutual assumptions Scott and I share as a result of talking about this.

No we cannot justify (*). In fact, (*) will not even hold. However if (*) does not hold, I think that is just as bad as failing the Berford test. The sentences are themselves a sequence that is indistinguishable from coming a sequence coming from a weighted coin. Therefore failing to provide probability to the sentences is a strong sign that the code will also give the wrong probability to . The two are not qualitatively different.

A formal proof of why it fails is not written up, but if it is, the conclusion will be that either OR will have incorrect limiting probabilities.

I'm worried about whether we can even formalize the argument that the algorithm passes the weak Benford test, let alone that it fails the strong Benford test. Yes, any particular non-Benfordian would eventually get outclassed by , but this may happen slowly enough that the winner at every particular stage is a different non-Benfordian algorithm...

The fact that it passes weak Benford should not be clear from this post at all, and you are correct to be skeptical from what I showed you. The complete proof is not written up in a nice way yet, because the other algorithm I will share soon is much more important, and I have been focusing on that.

The argument you present is something I am very aware of. The answer is that If there was a sequence of different non-Benfordian algorithms that did increasingly well, then if you consider the algorithm A that picks a random algorithm according to complexity, and runs that algorithm, then A will also do better than Benford (or at least not in big O of how well Benford does), just by being able to sample the algorithms in the infinite sequence.

Making the above argument work is actually the reason I use RTM(3), not RTM(1). I need the hypothetical random algorithm A to discount later algorithms significantly less than the sampling procedure.

I think the fact that this fails strong Bendord is very interesting, and I want to write more about that. I agree that I have not formally shown that it fails strong Benford, and I dont even have a proof of this. All I know is that if you replace L with something that looks at a particular subset of sentences that contains the Benford test sentences, that this approach fails strong Benford in that domain.

However, I do not think the proof that this satisfies weak Benford is all that important. Weak Benford really is a weak test, and passing it is not that impressive.