LBIT Proofs 6: Propositions 39-47

by Diffractor22 min read16th Dec 2020No comments

4

AI
Frontpage

 

Proposition 39: Given a crisp infradistribution  over , an infrakernel  from  to infradistributions over , and suggestively abbreviating  as (hypothesis ) and  as  (your infraprior where you have Knightian uncertainty over how to mix the hypotheses), then

Proof: Assume that  and  are functions of type  and  respectively, ie, likeliehood and utility doesn't depend on which hypothesis you're in, just what happens. First, unpack our abbreviations and what an update means. 

Then use the definition of an infrakernel pushforward.

For the next thing, we're just making types a bit more explicit,  only depend on , not .

Then we pack the semidirect product back up.

And pack the update back up.

At this point, we invoke the Infra-Disintegration Theorem.

We unpack what our new modified prior is, via the Infra-Disintegration Theorem.

and unpack the semidirect product.

Now we unpack  and .

And unpack what  is

And reabbreviate  as ,

And then pack it back up into a suggestive form as a sort of expectation.

And we're done.

Proposition 40: If a likelihood function  is 0 when , and  and , then 

And then we apply Markov's inequality, that for any probability distribution,

Also, (because  is 0 when ), so monotonicity means that

So, we can get:

And we're done.

 

Proposition 41: The IKR-metric is a metric.

So, symmetry is obvious, as is one direction of identity of indiscernibles (that the distance from an infradistribution to itself is 0). That just leaves the triangle inequality and the other direction of identity of indiscernibles. For the triangle inequality, observe that for any particular  (instead of the supremum), it would fulfill the triangle inequality, and then it's an easy exercise for the reader to verify that the same property applies to the supremum, so the only tricky part is the reverse direction of identity of indiscernibles, that two infradistributions which have a distance of 0 are identical.

First, if , then  and  must perfectly agree on all the Lipschitz functions. And then, because uniformly continuous functions are the uniform limit of Lipschitz functions,  and  must perfectly agree on all the uniformly continuous functions.

Now, we're going to need a somewhat more sophisticated argument. Let's say that the sequence  is uniformly bounded and limits to  in  equipped with the compact-open topology (ie, we get uniform convergence of  to  on all compact sets). Then, for any infradistributions,  will limit to . Here's why. For any , there's some compact set  that accounts for almost all of why a function inputted into an infradistribution has the value it does. Then, what we can do is realize that  will, in the limit, be incredibly close to , due to  and  disagreeing by a bounded amount outside the set  and only disagreeing by a tiny amount on the set , and the Lipschitzness of .

Further, according to this mathoverflow answer, uniformly continuous functions are dense in the space of all continuous functions when  is equipped with the compact-open topology, so given any function , we can find a sequence of uniformly continuous functions  limiting to  in the compact-open topology, and then,

And so,  and  agree on all continuous functions, and are identical, if they have a distance of 0, giving us our last piece needed to conclude that  is a metric.

 

Proposition 42: The IKR-metric for infradistributions is strongly equivalent to the Hausdorff distance (w.r.t. the KR-metric) between their corresponding infradistribution sets.

Let's show both directions of this. For the first one, if the Hausdorff-distance between  is , then for all a-measures  in , there's an a-measure  in  that's only  or less distance away, according to the KR-metric (on a-measures).

Now, by LF-duality, a-measures in H correspond to hyperplanes above . Two a-measures being  apart means, by the definition of the KR-metric for a-measures, that they will assign values at most  distance apart for 1-Lipschitz functions in .

So, translating to the concave functional view of things,  and  being  apart means that every hyperplane above h has another hyperplane above  that can only differ on the 1-Lipschitz 1-bounded functions by at most , and vice-versa.

Let's say we've got a Lipschitz function . Fix an affine functional/hyperplane  that touches the graph of  at . Let's try to set an upper bound on what  can be. If  is 1-Lipschitz and 1-bounded, then we can craft a  above  that's nearby, and

Symmetrically, we can swap  and  to get , and put them together to get:

For the 1-Lipschitz functions.

Let's tackle the case where  is either more than 1-Lipschitz, or strays outside of . In that case,  is 1-Lipschitz and bounded in . We can craft a  that only differs on 1-Lipschitz functions by  or less. Then, since, for affine functionals,  and using that  and  are close on 1-Lipschitz functions, which  and 0 are, we can go:

And then we swap out  for  with a known penalty in value, we're taking an overestimate at this point.

This argument works for all . And, even though we just got an upper bound, to rule out  being significantly below , we could run through the same upper bound argument with  instead of , to show that  can't be more than  above .

So, for all Lipschitz . Thus, for all Lipschitz ,

And therefore,

This establishes one part of our inequalities. Now for the other direction.

Here's how things are going to work. Let's say we know the IKR-distance between  and . Our task will be to stick an upper bound on the Hausdorff-distance between  and . Remember that the Hausdorff-distance being low is equivalent to "any hyperplane above  has a corresponding hyperplane above  that attains similar values on the 1-or-less-Lipschitz functions".

So, let's say we've got , and a . Our task is, knowing , to craft a hyperplane above  that's close to  on the 1-Lipschitz functions. Then we can just swap  and , and since every hyperplane above  is close (on the 1-Lipschitz functions) to a hyperplane above , and vice-versa,  and  can be shown to be close. We'll use Hahn-Banach separation for this one.

Accordingly, let the set  be the set of  where , and:

That's... quite a mess. It can be thought of as the convex hull of the hypograph of , and the hypograph of  restricted to the 1-Lipschitz functions in  and shifted down a bit. If there was a  that cuts into  and scores lower than it, ie , we could have , and  to observe that  cuts into the set . Conversely, if an affine functional doesn't cut into the set , then it lies on-or-above the graph of .

Similarly, if  undershoots  over the 1-or-less-Lipschitz functions in , it'd also cut into . Conversely, if the hyperplane  doesn't cut into , then it sticks close to  over the 1-or-less-Lipschitz functions.

This is pretty much what  is doing. If we don't cut into it, we're above  and not too low on the functions with a Lipschitz norm of 1 or less.

For Hahn-Banach separation, we must verify that  is convex and open. Convexity is pretty easy.

First verification: Those numbers at the front add up to 1 (easy to verify), are both in  (this is trivial to verify), and + isn't 1 (this is a mix of two numbers that are both below , so this is easy). Ok, that condition is down. Next up: Is our mix of  and  1-Lipschitz and in ? Yes, the mix of 1-Lipschitz functions in that range is 1-Lipschitz and in that range too. Also, is our mix of  and  still in ? Yup.

That leaves the conditions on the b terms. For the first one, just observe that mixing two points that lie strictly below  (a hyperplane) lies strictly below it as well. For the second one, since  is concave, mixing two points that lie strictly below its graph also lies strictly below its graph. Admittedly, there may be divide-by-zero errors, but only when  is 0, in which case, we can have our new  and  be anything we want as long as it fulfills the conditions, it still defines the same point (because that term gets multiplied by 0 anyways). So  is convex.

But... is  open? Well, observe that the region under the graph of  on  is open, due to Lipschitzness of . We can wiggle  and  around a tiny tiny little bit in any direction without matching or exceeding the graph of . So, given a point in , fix your tiny little open ball around . Since  can't be 1, when you mix with , you can do the same mix with your little open ball instead of the center point, and it just gets scaled down (but doesn't collapse to a point), making a little tiny open ball around your arbitrarily chosen point in . So  is open.

Now, let's define a  that should be convex, so we can get Hahn-Banach separation going (as long as we can show that  and  are disjoint). It should be chosen to forbid our separating hyperplane being too much above  over the 1-or-less Lipschitz functions. So, let  be:

Obviously, cutting into this means your hyperplane is too far above  over the 1-or-less-Lipschitz functions in . And it's obviously convex, because 1-or-less-Lipschitz functions in  are a convex set, and so is the region above a hyperplane .

All we need to do now for Hahn-Banach separation is show that the two sets are disjoint. We'll assume there's a point in both of them and derive a contradiction. So, let's say that  is in both  and . Since it's in ,

But also,  with the 's and 's and  fulfilling the appropriate properties, because it's in . Since  and , we'll write  as  and  as , where  and  are nonzero. Thus, we rewrite as:

We'll be folding  into a single  term so I don't have to write as much stuff. Also,  is an affine function, so we can split things up with that, and make:

Remember,  because . So, we get:

And, if , we get a contradiction straightaway because the left side is negative, and the right side is nonnegative. Therefore, , and we can rewrite as:

And now, we should notice something really really important. Since  can't be  does consistute a nonzero part of , because .

However,  is a 1-or-less Lipschitz function, and bounded in , due to being in ! If  wasn't Lipschitz, then given any slope, you could find areas where it's ascending faster than that rate. This still happens when it's scaled down, and  can only ascend or descend at a rate of 1 or slower there since it's 1-Lipschitz as well. So, in order for  to be 1-or-less Lipschitz,  must be Lipschitz as well.  Actually, we get something stronger, if  has a really high Lipschitz constant, then  needs to be pretty high. Otherwise, again,  wouldn't be 1-or-less Lipschitz, since  of it is composed of , which has areas of big slope. Further, if  has a norm sufficiently far away from 0, then  needs to be pretty high, because otherwise f wouldn't be in , since  of it is composed of  which has areas distant from 0.

Our most recent inequality (derived under the assumption that there's a point in  and ) was:

Assuming hypothetically were were able to show that

then because , we'd get a contradiction, showing that  and  are disjoint. So let's shift our proof target to trying to show

Let's begin. So, our first order of business is that

This should be trivial to verify, remember that 

Now, , and  is 1-Lipschitz, and so is . Our goal now is to impose an upper bound on the Lipschitz constant of . Let us assume that said Lipschitz constant of  is above 1. We can find a pair of points where the rise of  from the first point to the next, divided by the distance between the points is exceptionally close to the Lipschitz constant of , or equal. If we're trying to have  slope up as hard as it possibly can while mixing to make , which is 1-Lipschitz, then the best case for that is one where  is sloping down as hard as it can, at a rate of -1. Therefore, we have that

Ie, mixing  sloping up as hard as possible and  sloping down as hard as possible had better make something that slopes up at a rate of 1 or less. Rearranging this equation, we get:

We can run through almost the same exact argument, but with the norm of . Let us assume that said norm is above 1. We can find a point where  attains its maximum/minimum, whichever is further from 0. Now, if you're trying to have  be as negative/positive as it possibly can be, while mixing to make , which lies in , then the best case for that is one where  is as positive/negative as it can possibly be there, ie, has a value of -1 or 1. In both cases, we have:

Now we can proceed. Since we established that all three of these quantities (1, Lipschitz constant, and norm) are upper bounded by , we have:

And we have exactly our critical

inequality necessary to force a contradiction. Therefore,  and  must be disjoint. Since  is open and convex, and  is convex, we can do Hahn-Banach separation to get something that touches  and doesn't cut into .

Therefore, we've crafted a  that lies above , and is within  of  over the 1-or-less-Lipschitz functions in , because it doesn't cut into  and touches .

This same argument works for any , and it works if we swap  and . Thus, since hyperplanes above the graph of an infradistribution function  or  correspond to points in the corresponding  and , and we can take any point in /affine functional above  and make a point in /affine functional above  (and same if the two are swapped) that approximately agree on , there's always a point in the other infradistribution set that's close in KR-distance and so  and  have 

And with that, we get

And we're done! Hausdorff distance between sets is within a factor of 2 of the IKR-distance between their corresponding infradistributions.

 

Proposition 43: A Cauchy sequence of infradistributions converges to an infradistribution, ie, the space  is complete under .

So, the space of closed subsets of  is complete under the Hausdorff metric. Pretty much, by proposition 42, a Cauchy sequence of infradistributions  in the IKR-distance corresponds to a Cauchy sequence of infradistribution sets  converging in Hausdorff-distance, so to verify completeness, we merely need to double-check that the Hausdorff-limit of the  sets fulfills the various different properties of an infradistribution. Every point in , the limiting set, has the property that there exists some Cauchy sequence of points from the  sets that limit to it, and also every Cauchy sequence of points from the  sets has its limit point be in .

So, for nonemptiness, you have a sequence of nonempty sets of a-measures limiting to each other in Hausdorff-distance, so the limit is going to be nonempty.

For upper completion, given any point , and any  a-measure, you can fix a Cauchy sequence  limiting to , and then consider the sequence , which is obviously Cauchy (you're just adding the same amount to everything, which doesn't affect the KR-distance), and limits to , certifying that , so  is upper-complete.

For closure, the Hausdorff limit of a sequence of closed sets is closed.

For convexity, given any two points  and  in  , and any , we can fix a Cauchy sequence  and  converging to those two points, respectively, and then consider the sequence , which lies in  (due to convexity of all the ), and converges to , witnessing that this point is in , and we've just shown convexity.

For normalization, it's most convenient to work with the positive functionals, and observe that, because all the  and all the  because of normalization, the same property must apply to the limit, and this transfers over to get normalization for your infradistribution set.

Finally, there's the compact-projection property. We will observe that the projection of the a-measures in  to just their measure components, call the set , must converge in Hausdorff-distance. The reason for this is because if they didn't, then you could find some  and arbitrarily late pairs of inframeasures where  and  have Hausdorff-distance , and then pick a point in  (or ) that's  KR-distance away from the other projection. Then you can pair that measure with some gigantic  term to get a point in  (or , depending on which one you're picking from), and there'd be no point in  (or ) within  distance of it, because the measure component would only be able to change by  if you moved that far, and you need to change the measure component by  to land within  (or ).

Because this situation occurs infinitely often, it contradicts the Cauchy-sequence-ness of the  sequence, so the projections  must converge in Hausdorff distance on the space of measures over . Further, they're precompact by the compact-projection property for the  (which are infradistributions), so their closures are compact. Further, the Hausdorff-limit of a series of compact sets is compact, so the Hausdorff limit of the projections  (technically, their closures) is a compact set of measures. Further, any sequence  which converges to some , has its projection being , which limits to show that  is in this Hausdorff limit. Thus, all points in  project down to be in a compact set of measures, and we have compact-projection for , which is the last condition we need to check to see if it's an infradistribution.

So, the Hausdorff-limit of a Cauchy sequence of infradistribution sets is an infradistribution set, and by the strong equivalence of the infra-KR metric and Hausdorff-distance, a Cauchy limit of the infra-KR metric must be an infradistribution, and the space  is complete under the infra-KR metric.

 

Proposition 44: If a sequence of infradistributions converges in the IKR distance for one complete metric that  is equipped with, it will converge in the IKR distance for all complete metrics that  could be equipped with.

So, as a brief recap,  could be equipped with many different complete metrics that produce the relevant topology. Each choice of metric affects what counts as a Lipschitz function, affecting the infra-KR metric on infradistributions, as well as the KR-distance between a-measures, and the Hausdorff-distance. So, we need to show that regardless of the metric on , a sequence of convergent infradistributions will still converge. Use  for the original metric on  and  for the modified metric on , and similarly,  and  for the KR-metrics on measures, and  for the Hausdorff distance induced by the two measures. 

Remember, our infradistribution sets are closed under adding  to them, and converge according to  to the set .

What we'll be doing is slicing up the sets in a particular way. In order to do this, the first result we'll need is that, for all , the set

converges, according to , to the set

So, here's the argument for this. We know that the projection sets

are precompact, ie, have compact closure, and Hausdorff-limit according to  to the set

(well, actually, they limit to the closure of that set)

According to our Lemma 3, this means that the set

(well, actually, its closure) is a compact set in the space of measures. Thus, it must have some maximal amount of measure present, call that quantity , the maximal Lipschitz constant of any of the infradistributions in the sequence. It doesn't depend on the distance metric  is equipped with.

Now, fix any . There's some timestep  where, for all greater timesteps, .

Now, picking a point  in  with , we can travel  distance according to  and get a point in , and the  term can only change by  or less when we move our a-measure a little bit, so we know that our nearby point lies in

But, what if our point  in  has ? Well then, we can pick some arbitrary point  (by normalization for ), and go:

And then we have to be a little careful.  by assumption. Also, we can unpack the distance to get

And the worst-case for distance, since all the measures have their total amount of measure bounded above by , would be  being 1 on one of the measures and -1 on another one of the measures, producing:

So, the distance from  to

according to  is at most 

And then, because this point has a  value of at most

Because , the  value upper bound turns into 

Which is a sufficient condition for that mix of two points to be only  distance from a point in  with a  upper bound on the  term, so we have that the distance from

to

is at most