This is a linkpost for https://www.oliversourbut.net/p/conditions-for-mathematical-equivalence-of-stochastic

Mentioned in

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

18nostalgebraist

New Comment

1 comment, sorted by Click to highlight new comments since: Today at 11:00 PM

This post introduces a model, and shows that it behaves sort of like a noisy version of gradient descent.

However, the term "stochastic gradient descent" does not just mean "gradient descent with noise." It refers more specifically to *mini-batch gradient descent*. * *(See e.g. Wikipedia.)

In mini-batch gradient descent, the "true" fitness^{[1]} function is the expectation of some function over a data distribution . But you never have access to this function or its gradient. Instead, you draw a finite sample from , compute the mean of over the sample, and take a step in this direction. The noise comes from the variance of the finite-sample mean as an estimator of the expectation.

The model here is quite different. There is no "data distribution," and the true fitness function is not an expectation value which we could noisily estimate with sampling. The noise here comes not from a noisy estimate of the gradient, but from a prescribed stochastic relationship () between the true gradient and the next step.

I don't think the model in this post behaves like mini-batch gradient descent. Consider a case where we're doing SGD on a vector , and two of its components have the following properties:

- The "true gradient" (the expected gradient over the data distribution) is 0 in the and directions.
- The and components of the
*per-example*gradient are perfectly (positively) correlated with one another.

If you like, you can think of the per-example gradient as sampling a single number from a distribution with mean 0, and setting the and components to and respectively, for some positive constants .

When we sample a mini-batch and average over it, these components are simply and , where is the average of over the mini-batch. So the perfect correlation carries over to the mini-batch gradient, and thus to the SGD step. If SGD increases , it will always increase alongside it (etc.)

However, applying the model from this post to the same case:

- Candidate steps are sampled according to , which is radially symmetric. So (e.g.) a candidate step with positive and negative is just as likely as one with both positive, all else being equal.
- The probability of accepting a candidate step depends only on the true gradient
^{[2]}, which is 0 in the directions of interest. So, the and components of a candidate step have no effect on its probability of selection.

Thus, the the and components of the step will be uncorrelated, rather than perfectly correlated as in SGD.

Some other comments:

- The descendant-generation process in this post seems very different from the familiar biological cases it's trying to draw an analogy to.
- In biology, "selection" generally involves having more or fewer descendants relative to the population average.
- Here, there is always exactly one descendant. "Selection" occurs because we generate (real) descendants by first generating a ghostly "candidate descendant," comparing it to
*its parent*(or a clone of its parent), possibly rejecting it against the parent and drawing another candidate, etc. - This could be physically implemented in principle, I guess. (Maybe it has been, somewhere?) But I'm not convinced it's equivalent to any familiar case of biological selection. Nor it is clear to me how close the relationship is, if it's not equivalence.

- The connection drawn here to gradient descent is not exact, even setting aside the stochastic part.
- You note that we get a "gradient-dependent learning rate," essentially because can have all sorts of shapes -- we only know that it's monotonic, which gives us a monotonic relation between step size and gradient norm, but nothing more.
- But notably, (S)GD does
*not*have a gradient-dependent learning rate. To call this an equivalence, I'd want to know the conditions under which the learning rate is constant (if this is possible). - It is also is possible this model always corresponds to vanilla GD (i.e. with a constant learning rate), except instead of ascending , we are ascending some function related to both and .

- This post calls the "fitness function," which is not (AFAIK) how the term "fitness" is used evolutionary biology.
- Fitness in biology typically means "expected number of descendants" (absolute fitness) or "expected change in population fraction" (relative fitness).
- Neither of these have direct analogues here, but they are more conceptually analogous to than . The fitness should directly tell you how much more or less of something you should expect in the next generation.
- That is, biology-fitness is about what
*actually happens*when we "run the whole model" forward by a timestep, rather than being an isolated component of the model. - (In cases like the replicator equation, there is model component called a "fitness function," but the name is justified by its relationship to biology-fitness given the full model dynamics.)
- Arguably this is just semantics? But if we stop calling by a suggestive name, it's no longer clear what importance we should attach to it, if any. We might care about the quantity whose gradient we're ascending, or about the biology-fitness, but is not either of those.

^{^}I'm using this term here for consistency with the post, though I call it into question later on. "Loss function" or "cost function" would be more standard in SGD.

^{^}There is no such thing as a per-example gradient in the model. I'm assuming the "true gradient" from SGD corresponds to in the model, since the intended analogy seems to be "the model steps look like ascending plus noise, just like SGD steps look like descending the true loss function plus noise."

Many thanks to Peter Barnett, my alpha interlocutor for the first version of the proof presented, and draft reader.Analogies are often drawn between natural selection and gradient descent (and other training procedures for parametric algorithms). It is important to understand to what extent these are useful and applicable analogies.

Here, under some modest (but ultimately approximating) simplifying assumptions, natural selection is found to be mathematically equivalent to an implementation of stochastic gradient descent.

The simplifying assumptions are presented first, followed by a proof of equivalence. Finally, a first attempt is made to consider the effect of relaxing each of these assumptions and what departure it causes from the equivalence, and an alternative set of assumptions which retrieves a similar equivalence is presented.

## Summary of simplifying assumptions

It is essential to understand that the equivalence rests on some simplifying assumptions, none of which is wholly true in real natural selection.

(NB assumptions 5 and 6 will be addressed in detail and amended later)

## Proof

## Setup and assumptions

Let us make some substantial modelling simplifications, while retaining the spirit of natural selection, to yield an 'annealing style' selection process.

We have continuous genome xt∈Rn with fixed fitness/objective function f:Rn→R

Mutations are distributed according to probability density Pm:Rn→[0,1] which is

radially symmetricWe also consider the case of mutations

very smallrelative to the genome, tending to infinitesimal.Selection is stochastic, but monotonically determined by the fitness differential, according to selection function Ps:R→[0,1]

^{[1]}, i.e. a normalised ratio of exponentials, but this is not essential for the proofThe population alternates between a single parent and its two offspring, one of which is selected to become the next generation's parent.

## Theorem

Now consider a binary fission from xt, yielding one perfect clone and one mutant clone x′t where mutation (xt+1−xt)∼Pm.

Define the next 'step', xt+1 as whatever mutant offspring x′t eventually happens to be successfully selected vs a perfect clone, according to the selection function on their fitness differential. (If the perfect clone gets selected, it may take many intermediate generations to constitute a 'step' here.) Denote by αs the resulting normalisation constant over mutations.

So the distribution over xt+1 given xt is

P(xt+1|xt)=αsPm(xt+1−xt)Ps(f(xt+1)−f(x))

Call the mutation ϵ=xt+1−xt. Then we find

P(ϵ|xt)=αsPm(ϵ)Ps(f(xt+ϵ)−f(xt))≃αsPm(ϵ)Ps(ϵ⋅∇f(xt))

by considering the directional derivative of f along ϵ at xt and the limit as ϵ→0

^{[2]}. (Prior to this infinitesimal limit, we have instead theempirical approximation tothe directional derivative.)Now characterising ϵ by length rϵ=|ϵ| and angle-from-gradient θϵ=∠(∇f(xt),ϵ)

P(ϵ|xt)≃αsg(rϵ)Ps(rϵ|∇f(xt)|cosθϵ)

At this point it is clear that our step procedure depends, stochastically, on how closely the direction of the mutations match the fitness function's gradient.

By inspecting the expected value of the step direction, θϵ, we can make a more precise claim

E[θϵ|xt]=∫P(ϵ|xt)θdϵ≃∫αsg(rϵ)Ps(rϵ|∇f(xt)|cosθϵ)θϵdϵ

and finally, by noticing the integral of an odd function

^{[3]}in θE[θϵ|xt]≃0

Thus the update between steps, ϵ, is a stochastic realisation of a variable whose orientation is, in expectation, exactly the same as that of the gradient ∇f(xt) of the fitness function.

By similar inspection of E[rϵ|xt] we can see that it is a monotonic function of |∇f(xt)|, depending on the particulars of Pm and Ps, which together provide a gradient-dependent 'learning rate'.

So natural selection in this form really is nothing but an implementation of unbiased stochastic gradient descent!

## Discussion of simplifying assumptions

To what extent are the simplifying assumptions realistic? What happens to the equivalence when we relax any of the assumptions?

## Fixed 'fitness function'

In real natural selection, the interaction between a changing environment and a dynamic distribution of organisms collaborating and competing leads to a time- and location-dependent fitness function.

Variable fitness functions can lead to interesting situations like evolutionarily stable equilibria with

mixturesof genotypes, or time- or space-cyclic fitness functions, or (locally) divergent fitness functions, among other phenomena.Such a nonstationary fitness function is comparable to the use of techniques like self-play in RL, especially in conjunction with Population-Based Training, but is less comparable to vanilla SGD.

As such it may be appropriate to think of real natural selection as performing something

locallyequivalent to SGD butgloballymore like self-play PBT.## Continuous fixed-dimensional genome and radially-symmetric mutation probability density

Moving from a continuous to a discrete genome means that the notion of a gradient is no longer defined in the same way, but we can still talk about empirical approximate gradients and differences.

The mechanisms which introduce mutations in real natural selection are certainly symmetrical in certain ways, but probably not in any way which straightforwardly maps to radial symmetry in a fixed-dimensional vector space.

Without radial symmetry, much of the mathematics goes through similarly, but instead of an unbiased estimate of the gradient direction, it is biased by the mutation sampling. As such, we might think of real natural selection as performing a

biasedstochastic gradient descent.A comparison may be made to regularisation techniques (depending on whether they are construed as part of the training procedure or part of the objective function), or to the many techniques exploiting bias-variance tradeoffs in sampling-based gradient-estimation for RL, though these tend to be deliberately chosen with variance-reduction in mind, while natural selection may not exhibit such preferences.

## Limit case to infinitesimal mutation

In reality, mutations are not infinitesimal, but in practice very small relative to the genome. If we do not take the limit, instead of an exact directional derivative, we find an empirical-approximate directional derivative, yielding

empirical-approximatestochastic gradient descent.This means that in real natural selection, the implied 'step size' or 'learning rate' is coupled with the particulars of the selection strength, the variance of the stochastic gradient, and the degree of empirical approximation applied. In contrast, stochastic gradient descent per se need not couple these factors together.

## A degenerate population of 1 or 2

If we expand the population to arbitrary size, it is possible to retrieve the equivalence with additional assumptions.

Instead of a parent

individualand cloned and mutated offspringindividuals, considering parent and offspringpopulations, the same reasoning and proof is immediately applicable if we assume thatmutations arise sufficiently rarelyto be either fixed or lost before the next mutation arises. In this case Ps, the probability of selection, becomes the probability of fixation.Of course this is not true for real natural selection.

If instead we allow for multiple contemporary mutant populations, an identical treatment

can notbe applied.## No recombination or horizontal transfer

One of the most fascinating and

mathematically complicatingaspects of real natural selection ismultiple heredityof genome elements, whether via horizontal transfer or sexual recombination.The preceding proof of equivalence for natural selection and stochastic gradient descent rests on a model which does not include any notion of multiple heredity.

## Recovering the equivalence allowing arbitrary population size and recombination

Interestingly, the 'complicating' factor of multiple heredity provides a way to retrieve the equivalence in the presence of multiple contemporary mutations, as long as we continue to consider the limit of

infinitesimal mutations.For a single-heredity population, with multiple contemporary mutant subpopulations, we must either model 'only one winner', or model an ongoing mixture of subpopulations of varying sizes, either of which Ps is unable to model without modification.

On the other hand, in a multiple-heredity population, assuming

eventually-universal mixing, and crucially continuing to assume afixed fitness function(independent of the population mixture), a particular mutation must either fix or go extinct^{[4]}.## Proof sketch

So let us consider (instead of xt and xt+1) xt0 and xt1, representing the

fixed partof the genotype at times t0 and t1>t0 respectively, that is the initial genome x0 plus all so-far-fixed mutations.In the time between t0 and t1 the population will experience some integer number of mutation events (perhaps roughly Poisson-distributed but this is inessential for the proof), each of which is distributed according to Pm. Furthermore, at time t0 some mutations from earlier times may be 'in flight' and not yet fixed or extinct.

Assuming fixed fitness, and infinitesimal mutations, we can represent the

probability of fixation by timet1, namely Pf with exactly the same properties as formerly assumed for Ps^{[5]}. Thus each mutationfixedbetween t0 and t1 satisfies exactly the same unbiased-gradient-sampling property derived earlier, and so, therefore, does their sum xt1−xt0.This relies on all in-flight mutations

not affectingthe fitness differential, and thus Pf, of their contemporaries, which is certainly the case in the limit of infinitesimal mutations, but not the case for real natural selection.## Summary of additional assumptions

In particular, this means no speciation.

NB we also rely on 4. the limit to infinitesimal mutations, in an additional capacity. We also exclude all 'self-play-like' interactions arising from the larger population by relying further on 1. fixed 'fitness function'.

It may be feasible to retrieve a similar equivalence without excluding population-dependent fitness interactions with a different framing, for example considering gradients over 'mixed strategies' implied by population distributions.

## Conclusion

Natural selection, under certain conditions, carries out an implementation of stochastic gradient descent. As such, analogies drawn from one to the other are not baseless; we should, however, examine the necessary assumptions and be mindful of the impact of departures from those assumptions.

In particular, two sets of assumptions are presented here which together are sufficient to retrieve an equivalence:

or, keeping assumptions 1 to 4 and relaxing assumptions 5 and 6

This is not enough to cover all instances of real natural selection, but provides an approximate mapping from many instances.

Assumptions 2 and 3 together yield 'unbiased' SGD, and in their absence varying degrees of bias arise.

Assumption 1 rules out, most importantly, 'self play' and 'population-based' aspects of natural selection, which have other analogies in machine learning but which are firmly absent from vanilla SGD.

Further work could uncover other parameters of the emergent SGD, such as the variance of the implied gradient, the size of the implicit learning rate, the bias caused by relaxing assumption 3, or quantify the coupling between those factors.

Further scrutiny, especially of the assumptions related to population, 1, 5, 6, and 7, could better quantify the effect of making weaker or different assumptions.

This can be justified in a few ways

The directional derivative in question is, for ϵ=|ϵ|uϵ,

lim|ϵ|→0f(xt+ϵ)−f(xt)=|ϵ|lim|ϵ|→0f(xt+|ϵ|uϵ)−f(xt)|ϵ|=|ϵ|(uϵ⋅∇f(xt))=ϵ⋅∇f(xt) ↩︎

Cautious readers may note that the integral as presented is not posed in the right coordinate system for its integrand.

By a coordinate transformation from Euclidean to hyperspherical coordinates, centred on 0, with ∇f(xt) providing the principal axis, r the radial length, θ the principal angular coordinate, and ϕ the other n−2 angular coordinates with axes chosen arbitrarily orthogonally,

E[θ|xt]≃∫αsg(r)Ps(r|∇f(xt)|cosθ)θdϵ=∫αsg(r)Ps(r|∇f(xt)|cosθ)θ∣∣∣∂(r,θ,ϕ)∂(ϵ)∣∣∣drdθdϕ=∫∞0∫π−παsg(r)Ps(r|∇f(xt)|cosθ)θ[∫[0,π]n−2∣∣∣∂(r,θ,ϕ)∂(ϵ)∣∣∣dϕ]dθdr=∫∞0∫π−παsg(r)Ps(r|∇f(xt)|cosθ)θ[∫[0,π]n−2rn−1h(ϕ)dϕ]dθdr=∫∞0∫π−παsg(r)Ps(r|∇f(xt)|cosθ)θkrn−1dθdr=∫∞0αskg(r)rn−1[∫π−πPs(r|∇f(xt)|cosθ)θdθ]dr=0

where we use the fact that the hyperspherical Jacobian ∂(r,θ,ϕ)∂(ϵ) is independent of its principal angular coordinate θ and denote by krn−1 the result of integrating out the Jacobian over the other angular coordinates, and again noting that the symmetrical integral over an odd function is zero. ↩︎

If we do not have a fixed fitness function, and in particular, if it is allowed to vary dependent on the distribution of the population, there are many evolutionarily stable equilibria which can arise where some trait is stably

neverfixed nor extinguished, but rather persists indefinitely in some proportion of the population. (A classic example is sex ratios.) ↩︎We can be more precise if we have Pf:R+→R→[0,1] where the additional first parameter represents time-elapsed, so that Pf(δt,δf) is the probability of a mutation with fitness delta δf being fixed after elapsed time δt.

Here we impose on Pf(δt) (for fixed δt time-elapsed) the same monotonicity requirement over fitness differential as imposed on Ps before.

The various 'in-flight' and intervening mutations in the proof also therefore implicitly carry with them tm, the time they emerged, and the additional argument to Pf is thus δt=t1−tm.

In practice we should expect Pf to vary time-wise as a monotonically nondecreasing asymptote, but this property is not required for the proof. ↩︎