Theorem 4: lim_{α→1} H_α(h) exists for all infradistributions h over a finite set X, and equals −sup_q dh(1; ln(q))
So, as a recap, here's how it works. Take α>1. The Renyi entropy for α>1 is:
H_α(h) = (α/(1−α)) · ln( sup_q h(q^{(α−1)/α}) )
Where q is taken over functions with ∑_i q_i = 1 (we have finitely many points, so we're summing over all of them). And our task is to take the limit as α tends to 1.
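As a sanity check on the statement, here is a minimal Python sketch. It assumes that, writing α=1+ϵ, the Renyi entropy takes the form −((1+ϵ)/ϵ)·ln(sup_q h(q^{ϵ/(1+ϵ)})), and it uses a hypothetical toy h (the worst-case expectation over a finite list of distributions on a two-point X), for which dh(1;g) is simply the worst-case expectation of g. The names `h`, `renyi`, `claimed_limit` and the grid search are illustrative choices, not part of the proof.

```python
import math

def h(f, dists):
    """Toy stand-in for h: worst-case expectation of f over a finite list of distributions."""
    return min(sum(p_i * f_i for p_i, f_i in zip(p, f)) for p in dists)

def renyi(dists, eps, grid=2000):
    """-((1+eps)/eps) * ln(sup_q h(q^(eps/(1+eps)))), with the sup taken by grid search
    over q = (t, 1 - t) on a two-point X (assumed form of the entropy)."""
    s = eps / (1.0 + eps)
    best = max(h((t ** s, (1.0 - t) ** s), dists)
               for t in (i / grid for i in range(1, grid)))
    return -(1.0 + eps) / eps * math.log(best)

def claimed_limit(dists, grid=2000):
    """-sup_q dh(1; ln q). For this toy h, dh(1; g) equals h(g)."""
    return -max(h((math.log(t), math.log(1.0 - t)), dists)
                for t in (i / grid for i in range(1, grid)))

single = [(0.3, 0.7)]                 # an ordinary probability distribution
crisp = [(0.2, 0.8), (0.3, 0.7)]      # worst case over two distributions
shannon = -(0.3 * math.log(0.3) + 0.7 * math.log(0.7))

print(renyi(single, 1e-3), claimed_limit(single), shannon)
print(renyi(crisp, 1e-3), claimed_limit(crisp))
```

For a single distribution the claimed limit is the Shannon entropy, which is what one would hope for; the two-distribution case only illustrates the statement on one example and proves nothing.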
Our first step is to let α=1+ϵ. With this reindexing, we can instead take the limit as ϵ tends to 0 and get the same answer. So, we're trying to find:
lim_{ϵ→0} −((1+ϵ)/ϵ) · ln( sup_q h(q^{ϵ/(1+ϵ)}) )
Well, we don't even necessarily know it exists. Looking at this, in the ϵ→0 limit, if q is such that ∀i: q_i>0, then q_i^{ϵ/(1+ϵ)} will tend to 1 for all i. And so, h of that will tend to 1, and ln of that will tend to 0 from below, and hopefully the really tiny negative number cancels out with the fact that we're also dividing by a really tiny negative number. To do our approximation work, we're going to need q to be bounded above 0 by something. We can show that the limit exists if we're dealing with a family of q that are bounded above 0.
Accordingly, our proof strategy is going to be getting some bounds (for all ϵ sufficiently close to 0) on how much the results differ if we're taking the sup over all appropriate q, vs if we're looking purely at the q that are bounded above 0 by some (small, but not varying with ϵ) number. Then, we can get some upper bounds and lower bounds, which we will be able to show converge in the ϵ→0 limit. The problem is that the bounds won't be tight enough to produce the same limit, because we're blowing things up with a division by ϵ. So, all we're going to be able to conclude is that the relevant ϵ→0 sequence has all of its limit points lying in an interval. However, the lower our "bound above 0" cutoff is, the tighter the interval will be. So, if we take the limit as our "bound above 0" cutoff shrinks to 0, the interval will shrink to a point, and we will have shown that the limit does in fact exist.
Let's get started on our upper bounds and lower bounds, respectively. Fix some δ which, importantly, won't vary with ϵ (yes, this will lead to a bit of weirdness where we start looking at behavior when ϵ<<δ). For notation, sup_{q≥δ} is the supremum over those q which fulfill the property ∀i: q_i≥δ. Accordingly, δ should be less than 1 divided by the number of outcomes, 1/|X|. So, we can start establishing our upper bound as:
sup_q h(q^{ϵ/(1+ϵ)}) ≥ sup_{q≥δ} h(q^{ϵ/(1+ϵ)})
And the natural logarithm is monotonic, so
ln( sup_q h(q^{ϵ/(1+ϵ)}) ) ≥ ln( sup_{q≥δ} h(q^{ϵ/(1+ϵ)}) )
And then we multiply both sides by a negative number, flipping the sign, so:
−((1+ϵ)/ϵ) · ln( sup_q h(q^{ϵ/(1+ϵ)}) ) ≤ −((1+ϵ)/ϵ) · ln( sup_{q≥δ} h(q^{ϵ/(1+ϵ)}) )
And bam, we have an upper bound. The lower bound is significantly more tricky. Let's start with our first question. Given a q≱δ, how can we adjust it to a q′ that fulfills that property? Well, just have:
We should check that A: this still adds up to 1, and B: that when qi=0, q′i≥δ. For later, we'll also need to check C: what's the critical value where this transformation switches from increasing the value to decreasing the value? Let's hit A first.
Excellent. For B, we can quickly check that this function is still in range by inspection if qi is 1, because we're dividing a number by a slightly bigger number. For being above δ, observe that:
And finally, let's check the threshold where this starts moving the value of the function down instead of up. Instead of having to do a bunch of fancy equations, let q be the uniform function, with 1/|X| function-mass on each outcome. We're adding an equal amount to everything, and then dividing by some amount, so we get another uniform function. Which, to sum up to 1, must be unchanged. So 1/|X| is the critical value: entries of q above it decrease, and entries below it increase.
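The three checks A, B and C can be spot-checked on one concrete transformation of the shape described (add the same amount to every entry, then divide by the new total). The sketch below is an assumption, not necessarily the exact formula used here: it picks the shift c = δ/(1−|X|δ), the smallest one that sends q_i=0 to exactly δ, and the name `lift` is hypothetical.

```python
def lift(q, delta):
    """Hypothetical rescaling: add c to every entry, then renormalise by 1 + |X| * c."""
    n = len(q)
    if not 0 < delta < 1.0 / n:
        raise ValueError("delta must lie strictly between 0 and 1/|X|")
    c = delta / (1.0 - n * delta)  # assumed shift: q_i = 0 lands exactly on delta
    return [(q_i + c) / (1.0 + n * c) for q_i in q]

q = [0.0, 0.1, 0.9]
q_lifted = lift(q, 0.05)
print(q_lifted, sum(q_lifted))

uniform = [1.0 / 3.0] * 3
print(lift(uniform, 0.05))  # the uniform function should be (numerically) unchanged
```

With this choice the map simplifies to q′_i = (1−|X|δ)·q_i + δ, so q′_i − q_i = δ·(1−|X|·q_i), which is positive exactly when q_i < 1/|X|.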
Armed with our way of translating functions so that they do indeed fulfill our desired δ bound, let's look at the difference between h(q^{ϵ/(1+ϵ)}) and h(q′^{ϵ/(1+ϵ)}).
To begin with, by monotonicity,
Then, since infradistributions have a finite Lipschitz constant L on functions bounded above 0 (let ϵ be small enough that q′^{ϵ/(1+ϵ)} ≥ 0.5, which makes the Lipschitz constant not depend on δ):
Now, let's try to evaluate the distance between those two functions, and how it scales with ϵ. The distance between functions is the maximum difference in values they assign to input points. Remember, if q_i ≤ 1/|X|, q′_i has an equal or higher value, so the supremum is gonna mimic q′ there. But for everything else, q_i is actually going to be higher, and the supremum will mimic q instead. So, we have:
Now, what we really want to study is how that above quantity varies with ϵ. Fortunately, a^{x/(1−x)} with a∈(0,1] is an analytic function at x=0, so we can take the Taylor expansion of both terms to see what's happening around 0 in terms of ϵ.
And, graphing the function in the sup with Desmos, with y as the variable, and δ and |X| as sliders, it's always maximized at 1. So then we get:
I should note something about the big-O notation. From trying it out on some test values, the coefficient gets larger as y decreases. So we don't have to worry about the coefficient in the big-O blowing up unboundedly, the lower bound on our possible y tames it. Putting it all together, we get (for ϵ sufficiently close to 0),
And our big-O doesn't depend on what q happens to be. Given any q at all, when we rescale it to get our q′, our analysis always works until the big-O terms start manifesting, but they're biggest for the smallest y's, and the lowest our y gets is |X|^{−1}, so we can pick a sufficiently large constant to make everything work regardless of the q. Also, our big-O absorbed the Lipschitz constant. Now, because our big-O doesn't care about what q is, we have:
Pretty much, since the last two terms of our bound are uniform in q, and all our q′≥δ since we designed it that way, we get the above bound. Now, we can take the ln to get
and multiply to get:
Thus, our upper bound on the sequence as a whole is:
and our lower bound on the sequence as a whole for small ϵ is
Let's show convergence for that top one, as it'll eventually let us crack the bottom one as well. Our task now is to show upper and lower bounds for the sequence (as ϵ→0)
−((1+ϵ)/ϵ) · ln( sup_{q≥δ} h(q^{ϵ/(1+ϵ)}) )
Armed with our nice bounds on how low q can be, let's see if we can find some other function that's close to q^{ϵ/(1+ϵ)} for ϵ near 0, so we can swap out for that.
Well... if we take the Taylor expansion of y^{ϵ/(1+ϵ)} w.r.t. ϵ around 0, then we get 1+ϵln(y)+O(ϵ²). So what if we swapped out q^{ϵ/(1+ϵ)} for 1+ϵln(q)? The second part of that is negative, so it's below 1. And, since q_i≥δ for all i, once we get to ϵ < 1/(−ln(δ)), it's guaranteed to never undershoot 0, so it's a legit function. Let's assess the difference between the two functions:
Again, the reason this works is that we have a uniform lower bound on our q_i: they're all in [δ,1], which bounds how bad the big-O constants can be, so this difference is uniform across our q. The difference between q^{ϵ/(1+ϵ)} and 1+ϵln(q) is O(ϵ²). By Lipschitzness of h on functions bounded above 0.5 (which happens for low enough ϵ), we can transfer this difference outside the h, and the Lipschitz constant absorbs into the big-O, so:
and then, transferring this to all q (the big-O is uniform), we have:
and so, we get a lower bound where
Now, for the upper bound. This is easy, because by graphing, we can see that 1+ϵln(q_i) ≤ q_i^{ϵ/(1+ϵ)} is always true if ϵ≥0 and q_i∈[δ,1]. Thus, by monotonicity,
h(1+ϵln(q)) ≤ h(q^{ϵ/(1+ϵ)})
and, again, transferring this to all q, and then taking the ln and multiplying by our negative constant, we have:
Alright, we've got some bounds. We can keep poking more at reexpressing the bounds, but remember that since we still haven't shown that anything in particular converges, we're gonna keep getting bounds on our bounds on our bounds. At some point we need to ground out and get that some actual limit exists, and use that as a tool to solve all the rest of the bounds.
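The two facts about 1+ϵln(q) used above (it never exceeds q^{ϵ/(1+ϵ)}, and the gap is O(ϵ²) uniformly on [δ,1]) can be spot-checked numerically instead of by graphing. The sketch below is only an illustration; `gap` and `worst_ratio` are hypothetical helper names, and the constant |ln δ| + ln(δ)²/2 is one bound that follows from e^u ≥ 1+u and e^u ≤ 1+u+u²/2 for u≤0 together with ϵ/(1+ϵ) ≤ ϵ, not a constant taken from the text.

```python
import math

def gap(y, eps):
    """y^(eps/(1+eps)) - (1 + eps*ln(y)): what is lost by swapping in the linearisation."""
    return y ** (eps / (1.0 + eps)) - (1.0 + eps * math.log(y))

def grid_points(delta, grid=1000):
    """Evenly spaced points of [delta, 1]."""
    return [delta + (1.0 - delta) * i / grid for i in range(grid + 1)]

def worst_ratio(delta, eps):
    """Largest value of gap / eps^2 over the grid on [delta, 1]."""
    return max(gap(y, eps) for y in grid_points(delta)) / eps ** 2

delta = 0.01
constant = abs(math.log(delta)) + math.log(delta) ** 2 / 2.0
for eps in (1e-1, 1e-2, 1e-3):
    smallest = min(gap(y, eps) for y in grid_points(delta))
    print(eps, smallest, worst_ratio(delta, eps), constant)
```

The ratio staying below a constant that depends only on δ is what "the big-O is uniform in q" amounts to here.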
Let's look at h(1+ϵln(q)) for inspiration. We can notice something interesting. For a particular q, we can perfectly reexpress this as h(1−ϵf), where f_i=−ln(q_i). Our analogue of the q_i summing to 1 is ∑_i e^{−f_i}=1, and our analogue of q being in [δ,1]^{|X|} is that f∈[0,−ln(δ)]^{|X|}. In fact, all f of this form correspond to a q, and vice-versa, just have q=e^{−f}. So, using sup_{f≤−ln(δ)} as a shorthand for "we're selecting among f that fulfill these properties, as that's isomorphic to selecting a q", we would be able to go:
sup_{q≥δ} h(1+ϵln(q)) = sup_{f≤−ln(δ)} h(1−ϵf)
Now, with this reexpression, fixing a particular f∗, are we able to evaluate the following limit?
lim_{ϵ→0} −((1+ϵ)/ϵ) · ln( h(1−ϵf∗) )
We're just slicing off the smallest nontrivial bit of the problem we can to hopefully get one of these damn bounds to converge to something. And it turns out, we actually can solve this one! The teensy little problem is that L'hopital's rule only applies to differentiable functions, and this... isn't exactly differentiable. So we have to dig deep into the guts of the proof behind L'hopital's rule in order to show that applying the L'hopital procedure pretty much solves this, even though our function isn't differentiable (remember, h is Lipschitz, not differentiable). Once we've solved this, we'll be able to start chaining backwards and solve everything.
Our first order of business is to show that (1+ϵ)ln(h(1−ϵf∗)), as a function of ϵ, is differentiable almost everywhere, concave, and monotonically decreasing. First, for concavity of h(1−ϵf∗), observe that:
h(1−(pϵ+(1−p)ϵ′)f∗) = h(p(1−ϵf∗)+(1−p)(1−ϵ′f∗)) ≥ p·h(1−ϵf∗)+(1−p)·h(1−ϵ′f∗)
And, as ϵ gets bigger, by monotonicity of h, h(1−ϵf∗) gets smaller. So, h(1−ϵf∗) is monotonically decreasing, and concave. Now, from math stackexchange, the composition of a concave function with a monotonically increasing concave function (ln in this case) is concave. Thus, ln(h(1−ϵf∗)) is monotonically decreasing in ϵ, and concave. It's 0 when ϵ=0, and slopes down into the negative.
Now, any line from a point on the graph of ln(h(1−ϵf∗)) to another point on it must have a very important property. It slopes down, since this function is monotonically decreasing in ϵ. So, given some ϵ and ϵ′, there's a linear function a−bx with a,b≥0 where, regardless of p,
(this is just the line between two spots on our resulting concave composite function). And then we can go:
The first inequality is by concavity for ln(h(1−ϵf∗)) in ϵ. Multiplying by (1+pϵ+(1−p)ϵ′) produces:
and, then, I'm gonna explain this following inequality in a bit, where it came from.
So, the equality makes sense, we just use the link between our line and ln of h, see above just under "net takeaway". The inequality... what the heck is that about? Well, (1+x)(a−bx) is a parabola opening down, it's concave. So, plugging pϵ+(1−p)ϵ′ in for the x produces a bigger value than p(1+ϵ)(a−bϵ)+(1−p)(1+ϵ′)(a−bϵ′).
Ok, so our equation as a whole is concave. It's also monotonically decreasing because (ln of h of...) starts off at 0 and just heads down from there, and we're multiplying by bigger and bigger positive numbers as ϵ increases.
And also, according to math stackexchange, concave functions can only be non-differentiable at countably many points! So,
the above function is concave in ϵ, is monotonically decreasing in ϵ, and only has countably many non-differentiable points. So as not to rewrite this each time, we'll switch our variable from ϵ to x, and abbreviate this as θ(x), unpacking as needed.
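To make θ(x) concrete, here is a small numerical sketch on a toy functional. Everything in it is an assumption made for illustration: `MEASURES`, `F_STAR` and the pointwise-minimum form of `h` are hypothetical choices that give a concave, monotone functional with h(1)=1 and h(0)=0, picked so that θ has a visible kink; they are not taken from the text.

```python
import math

# Hypothetical toy: h is the pointwise minimum of two nonnegative weighted sums.
MEASURES = [(0.5, 0.5), (0.2, 0.9)]
F_STAR = (0.0, 3.0)

def h(f):
    """Concave, monotone, with h(1) = 1 and h(0) = 0 for the weights above."""
    return min(sum(m_i * f_i for m_i, f_i in zip(m, f)) for m in MEASURES)

def theta(x):
    """theta(x) = (1 + x) * ln(h(1 - x * f_star))."""
    return (1.0 + x) * math.log(h([1.0 - x * f_i for f_i in F_STAR]))

xs = [0.3 * i / 60 for i in range(61)]  # stay where 1 - x * f_star > 0 and h > 0

decreasing = all(theta(a) >= theta(b) - 1e-12 for a, b in zip(xs, xs[1:]))
midpoint_concave = all(theta((a + b) / 2.0) >= (theta(a) + theta(b)) / 2.0 - 1e-12
                       for a in xs for b in xs)

# The two pieces of h cross at x = 1/12, so theta has different one-sided slopes there.
kink, step = 1.0 / 12.0, 1e-5
left_slope = (theta(kink) - theta(kink - step)) / step
right_slope = (theta(kink + step) - theta(kink)) / step
print(decreasing, midpoint_concave, left_slope, right_slope)
```

The differing one-sided slopes at the kink are exactly the situation the set-valued "derivative" is meant to handle.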
Now that we know our function is nice, we can turn to verifying the important core needed to apply L'hopital's rule to solve the limit. And once that's done, we can start chaining backwards to solve our whole problem.
The key part that works for our function in particular, though it may fail in the general case, is an analogue of Cauchy's mean value theorem. We're about to make a bit of a mathematical leap here. We're going to let the derivative be set-valued at nondifferentiable points. As a toy example, consider the function |x|. if x≠0, the derivative is well-defined. If x=0, then, even though we don't have a derivative here, there's a left-derivative and a right-derivative. So, we can consider the "derivative" at 0 for |x| to be any number between -1 and 1.
Now, the particular thing we need to go through the L'hopital proof for our specific case is: For ϵ and τ∈(0,ϵ), there's a χ∈[τ,ϵ], and a possible choice of derivative (remember, we've got the set-valued derivatives now) where:
We can make the argument for this more intuitive by drawing the following picture.
Remember, our function θ is concave, and note that the left-hand-side of our equation is just going "what's the slope of the line" (with positive sign). From concavity, we can just translate said line up until it becomes a tangent line of our function, and that gets us our value of χ∈[τ,ϵ] where there's a possible derivative that induces the above equality (because d/dx(−x)(χ)=−1 no matter what, and this also enforces the appropriate sign). So, we do indeed have our analogue of Cauchy's mean value theorem working here. Drawing a line segment between two points on the graph of θ(x), there's a point in between them where the possible derivative matches up with the slope of the line segment.
Now, define the following two functions:
The inf and sup is also taken over possible choices of derivative, by the way. And so, regardless of our choice of ϵ and τ, as long as τ<ϵ, from our analogue of Cauchy's Mean Value Theorem,
Now that we've established this, lock ϵ in place and let τ limit to 0.
That last equality was because 1−τf∗ approaches 1, so by normalization for h, h(1−τf∗) approaches 1, so ln(h(1−τf∗)) approaches 0, and multiplying by (1+τ) doesn't change that, and the bottom doesn't limit to 0.
So, we have our result that, for all our relevant ϵ,
And now we'll use the squeeze theorem. So, what's the limit as ϵ heads to 0 for m(ϵ) and M(ϵ)?
Well, the former is:
and as ϵ gets incredibly close to 0 (forcing χ to do so as well), the ln(h(1−χf∗)) turns into 0, and the (1+χ)/h(1−χf∗) turns into 1. So, this produces:
And now, since h(1−xf∗) is concave in x and monotonically decreasing... Well, no matter which slope we choose for the nondifferentiable points, the shallowest possible slope is at 0. The slope is gonna be negative. Multiplying it by a negative means that we're trying to minimize a positive number. So, we want the shallowest slope we can possibly get, which would mean plugging χ=0 in. Bam, no dependence on ϵ anymore.
And, consulting the Wikipedia page for the Gateaux derivative, this is the Gateaux derivative of h at 1 in the direction of −f∗!
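The identification with the Gateaux derivative can be illustrated numerically. The sketch below assumes a hypothetical toy h (worst-case expectation over the two distributions in `DISTS`), approximates dh(1;−f∗) by a one-sided finite difference, and compares it with −((1+ϵ)/ϵ)·ln(h(1−ϵf∗)) at a small ϵ. The names `gateaux_at_one` and `scaled_log` and the step sizes are illustrative assumptions.

```python
import math

DISTS = [(0.2, 0.3, 0.5), (0.6, 0.3, 0.1)]  # hypothetical toy set of distributions
F_STAR = (1.0, 2.0, 0.5)

def h(g):
    """Toy stand-in for h: worst-case expectation over DISTS."""
    return min(sum(p_i * g_i for p_i, g_i in zip(p, g)) for p in DISTS)

def gateaux_at_one(direction, t=1e-6):
    """One-sided finite-difference estimate of dh(1; direction)."""
    ones = [1.0] * len(direction)
    shifted = [1.0 + t * d for d in direction]
    return (h(shifted) - h(ones)) / t

def scaled_log(f, eps):
    """-((1+eps)/eps) * ln(h(1 - eps*f)), the quantity whose eps -> 0 limit is wanted."""
    return -(1.0 + eps) / eps * math.log(h([1.0 - eps * f_i for f_i in f]))

derivative = gateaux_at_one([-f_i for f_i in F_STAR])
print(derivative, scaled_log(F_STAR, 1e-5))
```

For this toy h the derivative is minus the largest expectation of f∗ over `DISTS`, so the two printed numbers should agree up to sign and a small error.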
So, we have finally solved one of our limits: it's −dh(1;−f∗). Now, what about M(ϵ)? Well, a similar analysis applies.
And now, since h(1−xf∗) is concave in x and monotonically decreasing... Well, the slope is shallowest at 0, and steepest at ϵ itself. The slope is gonna be negative. Multiplying it by a negative means that we're trying to maximize a positive number. So, we want the steepest slope we can possibly get, which would mean plugging χ=ϵ in. So, we have:
Now, the sup in this case is just sup over derivatives, we know which χ to put in. In particular, since h(1−xf∗) is concave in x and monotonically decreasing, the steepest possible derivative we could have is the derivative where the nearby point approaches from higher ϵ.
Now, remember, h(1−xf∗) is a concave function and monotonically decreasing (in x). If we graphed the derivative (in x), it'd be monotonically decreasing and always at 0 or less. There's discontinuities in the graph of the derivative, the graph of the derivative would look like a staircase going down. But remember, there are only countably many discontinuities. We're taking the lowest value of the derivative (read: furthest away from 0), so we can turn it into a (discontinuous) function where, at the "stairstep" discontinuity points, we stay at the bottom of the stairstep. And so, the question is, "if we stick to the bottom of the step at points where there's a step jump, and travel towards 0 ie ϵ goes to 0, do we have a limit?" Well, the value of that "staircase" derivative function is monotonically increasing (going from well below 0 to less below 0) as ϵ goes to 0, and it's got an upper bound of 0, so yeah, there's a limit.
But, in order to identify the limit as ϵ goes to 0 of the lowest possible derivative with just "the derivative at 0", we've got a bit of a problem. There's some free choice in which derivative value to assign at discontinuity points. Given free choice in the matter, do all those choices lead to hitting the same value at 0? Well, let's say we took the upper value of the staircase. Given any ϵ where there's a discontinuity, there's a smaller τ∈[ϵ/2,ϵ] where the derivative is well-defined (because there's only countably many points where there's a discontinuity), which attains a value closer to 0 than the closest-to-0 value on the stairstep, because our derivative function stairstep is monotonically increasing as ϵ ticks down. So, even the "take the upper value for each stairstep discontinuity" function can have any particular value exceeded by a sufficiently low value for ϵ when we take the lower value for the stairstep discontinuities. So, it has the same limiting value. Which, again, is the derivative at 0. So, we get:
Alright! We got the same value for an upper bound as a lower bound! Awesome! Now we can get started solving some limits, finally! From here on out, we'll do a lot of backing up to stuff we've already shown. First, we've already shown that:
and, in the ϵ→0 limit, both the left-hand side and right-hand-side have the same limit. Namely, −dh(1;−f∗). The Gateaux derivative of h at 1 in the direction of −f∗. So,
Alright, we've got our first toehold where we were actually able to solve a damn limit. Let's back up a bit and try something more ambitious. Let's attempt to solve
Again, we're going to impose upper and lower bounds. The tricky part in this case is that the previous limit was just for a particular function. Once we start taking the supremum over a bunch of different functions, we can't necessarily guarantee that they all limit at the same rate. But, let's try to reduce it to the case of one particular function.
To begin with, regardless of ϵ and f, h(1−ϵf) ≤ 1+ϵ·dh(1;−f). Why is this the case? Well, the former is a concave function, monotonically decreasing, that starts at 1 where ϵ=0. And the latter is the tangent line to the former function at ϵ=0. So, the tangent line lies above the function. Thus, we can go:
and then, since all the Gateaux derivatives are 0 or less, we can move the sup inside (changes nothing), and get:
Ok, cool, we've got a lower bound. Now, what about an upper bound? Well, pick an f∗ where dh(1;−f∗) is extremely close to sup_{f≤−ln(δ)} dh(1;−f). We can get:
And bam, we have an upper bound.
Now, let's take the ϵ→0 limits of our lower bound and upper bound. Our upper bound limit is:
Our lower bound limit is:
But fortunately, that supremum doesn't change with ϵ! It's just a constant. So the function inside the ln is continuous and differentiable, and so is everything, so we can solve this with vanilla L'hopital's rule.
Sadly, these two bounds are different. We only have that
Or, heck, the limit might not even exist! But all limit points must be in that interval. But... this argument worked independently of our choice of f∗. So, we can select it so the derivative is as close as we want to the supremum derivative! No matter how tight the interval is, we can always back up, pick a different f∗, and make the interval smaller! Thus, the limit must exist, and we have:
Bam, that's another limit out of the way. Going back in the proof, we showed earlier that
(it was just a reexpression), so that nets us another limit,
Now... let's back up to showing what
is. From earlier work, our upper bound (for small enough ϵ) was:
and our lower bound was:
Well, we know what the upper bound limits to already. We do need to check that the O(ϵ²) thing doesn't mess with anything. We're going to Taylor-expand ln about 1. The interval of convergence is (0,2], so this Taylor expansion works for arguments a bit away from 1 (that is, for ϵ a bit away from 0), not just in the limit. Taylor-expanding the ln for the lower bound produces:
Now, (sup_{q≥δ} h(1+ϵln(q)) − 1) converges to 0 as O(ϵ), so we can neglect all terms but the first (because any later term would converge to 0 as O(ϵ²) or faster, so even after getting blown up by dividing by ϵ, they'd still shrink to 0). And the O(ϵ²) error term in the lower bound, again, even after getting blown up, doesn't affect the ϵ→0 limit. So, since our error term is too small to affect the limit, both the upper bound and lower bound limit to the same thing, −sup_{f≤−ln(δ)} dh(1;−f). And so, by the squeeze theorem,
Ok, cool. That's another limit we were able to show to exist. But what about our original thing, back from the start of the proof, that we wanted to show? We wanted to solve
Well, our upper bound was:
Again, our upper bound limits to −sup_{f≤−ln(δ)} dh(1;−f), and our lower bound... well, hang on, there's an additional O(ϵ) term in there, which is large enough to affect the limit. Again, doing the Taylor-expansion of ln around 1, and dropping all terms but the first since they decline as O(ϵ²) or faster and so don't affect the limit, our upper bound turns into:
Ok, so we know what that limit is. But for our lower bound, when we do the Taylor expansion (neglecting higher-order terms), we get: