Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

This post introduces a model, and shows that it behaves sort of like a noisy version of gradient descent.

However, the term "stochastic gradient descent" does not just mean "gradient descent with noise." It refers more specifically to mini-batch gradient descent. (See e.g. Wikipedia.)

In mini-batch gradient descent, the "true" fitness^[1] function is the expectation of some function over a data distribution $ρ$. But you never have access to this function or its gradient. Instead, you draw a finite sample from $ρ$, compute the mean of $\nabla L$ over the sample, and take a step in this direction. The noise comes from the variance of the finite-sample mean as an estimator of the expectation.
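
To spell that last sentence out under one common set of assumptions: suppose the mini-batch consists of $B$ examples $d_{1}, \dots, d_{B}$ drawn i.i.d. from $ρ$, and write $\Sigma$ for the covariance of the per-example gradient. (The symbols $B$, $d_{k}$, $\hat{g}_{B}$ and $\Sigma$ are introduced here for illustration and do not appear in the post.) Then the mini-batch gradient is unbiased, with noise that shrinks with batch size:

$$\hat{g}_{B} = \frac{1}{B} \sum_{k=1}^{B} \nabla L (x; d_{k}), \qquad E [\hat{g}_{B}] = E_{d \sim ρ} [\nabla L (x; d)], \qquad \mathrm{Cov} [\hat{g}_{B}] = \frac{\Sigma}{B}$$

So the shape of the SGD noise is inherited from $\Sigma$, i.e. from the data, which is what the example below relies on.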

The model here is quite different. There is no "data distribution," and the true fitness function is not an expectation value which we could noisily estimate with sampling. The noise here comes not from a noisy estimate of the gradient, but from a prescribed stochastic relationship ($P_{s}$) between the true gradient and the next step.

I don't think the model in this post behaves like mini-batch gradient descent. Consider a case where we're doing SGD on a vector $x$, and two of its components $x_{i}, x_{j}$ have the following properties:

- The "true gradient" (the expected gradient over the data distribution) is 0 in the $x_{i}$ and $x_{j}$ directions.
- The $x_{i}$ and $x_{j}$ components of the per-example gradient are perfectly (positively) correlated with one another.

If you like, you can think of the per-example gradient as sampling a single number $z$ from a distribution with mean 0, and setting the $x_{i}$ and $x_{j}$ components to $a_{i} z$ and $a_{j} z$ respectively, for some positive constants $a_{i}, a_{j}$.

When we sample a mini-batch and average over it, these components are simply $a_{i} \bar{z}$ and $a_{j} \bar{z}$, where $\bar{z}$ is the average of $z$ over the mini-batch. So the perfect correlation carries over to the mini-batch gradient, and thus to the SGD step. If SGD increases $x_{i}$, it will always increase $x_{j}$ alongside it (etc.)
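
A minimal numerical sketch of the SGD side of this argument. The constants `a_i` and `a_j`, the batch size, and the choice of a standard normal for $z$ are illustrative assumptions, not anything specified in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

a_i, a_j = 2.0, 0.5   # hypothetical positive constants
batch_size = 32       # hypothetical mini-batch size
n_steps = 10_000      # number of simulated SGD steps

# Each example contributes one scalar z with mean 0 (assumed standard normal).
# Its gradient components along x_i and x_j are a_i * z and a_j * z.
z = rng.normal(loc=0.0, scale=1.0, size=(n_steps, batch_size))
z_bar = z.mean(axis=1)  # mini-batch average of z, one value per step

grad_i = a_i * z_bar    # x_i component of each mini-batch gradient
grad_j = a_j * z_bar    # x_j component of each mini-batch gradient

# Correlation between the two components across steps.
corr_sgd = np.corrcoef(grad_i, grad_j)[0, 1]
print(f"correlation of mini-batch gradient components: {corr_sgd:.6f}")
```

Both components are the same random number $\bar{z}$ up to a positive scale, so the correlation stays at 1 (up to floating-point error) regardless of batch size.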

However, applying the model from this post to the same case:

- Candidate steps are sampled according to $P_{m}$, which is radially symmetric. So (e.g.) a candidate step with positive $x_{i}$ and negative $x_{j}$ is just as likely as one with both positive, all else being equal.
- The probability of accepting a candidate step depends only on the true gradient^[2], which is 0 in the directions of interest. So, the $x_{i}$ and $x_{j}$ components of a candidate step have no effect on its probability of selection.

Thus, the $x_{i}$ and $x_{j}$ components of the step will be uncorrelated, rather than perfectly correlated as in SGD.
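
The model side of the comparison can be sketched the same way. This is only a sketch under stated assumptions: $P_{m}$ is taken to be an isotropic Gaussian (one radially symmetric choice), $P_{s}$ is taken to be a logistic function of the first-order fitness change $ϵ \cdot \nabla f$, and rejected candidates are simply redrawn. The names `sigma`, `temperature`, and the three-coordinate setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: three coordinates (x_i, x_j, x_k). The true gradient is
# zero along x_i and x_j and nonzero along x_k.
grad_f = np.array([0.0, 0.0, 1.0])
sigma = 0.1          # scale of the candidate-step distribution (assumed)
temperature = 0.05   # sharpness of the acceptance curve (assumed)
n_candidates = 200_000

def p_s(delta_f):
    """Assumed monotonic acceptance probability (logistic in delta_f)."""
    return 1.0 / (1.0 + np.exp(-delta_f / temperature))

# P_m: isotropic Gaussian, which is radially symmetric.
candidates = rng.normal(0.0, sigma, size=(n_candidates, 3))

# Acceptance depends only on the first-order fitness change eps . grad f.
accept_prob = p_s(candidates @ grad_f)
accepted = candidates[rng.random(n_candidates) < accept_prob]

# Redrawing until acceptance means realized steps are distributed like the
# accepted candidates, so we can read the step statistics off them directly.
corr_model = np.corrcoef(accepted[:, 0], accepted[:, 1])[0, 1]
drift_k = accepted[:, 2].mean()
print(f"correlation of x_i and x_j step components: {corr_model:.4f}")
print(f"mean step along the gradient direction:     {drift_k:.4f}")
```

Under these assumptions the accepted steps drift along the gradient direction, while the $x_{i}$ and $x_{j}$ components show only sampling-noise-level correlation, in contrast to the perfectly correlated SGD case.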

Some other comments:

The descendant-generation process in this post seems very different from the familiar biological cases it's trying to draw an analogy to.
- In biology, "selection" generally involves having more or fewer descendants relative to the population average.
- Here, there is always exactly one descendant. "Selection" occurs because we generate (real) descendants by first generating a ghostly "candidate descendant," comparing it to its parent (or a clone of its parent), possibly rejecting it against the parent and drawing another candidate, etc.
- This could be physically implemented in principle, I guess. (Maybe it has been, somewhere?) But I'm not convinced it's equivalent to any familiar case of biological selection. Nor is it clear to me how close the relationship is, if it's not equivalence.
The connection drawn here to gradient descent is not exact, even setting aside the stochastic part.
- You note that we get a "gradient-dependent learning rate," essentially because $P_{s}$ can have all sorts of shapes -- we only know that it's monotonic, which gives us a monotonic relation between step size and gradient norm, but nothing more.
- But notably, (S)GD does not have a gradient-dependent learning rate. To call this an equivalence, I'd want to know the conditions under which the learning rate is constant (if this is possible).
- It is also possible that this model always corresponds to vanilla GD (i.e. with a constant learning rate), except that instead of ascending $f$, we are ascending some function related to both $P_{s}$ and $f$.
This post calls $f$ the "fitness function," which is not (AFAIK) how the term "fitness" is used in evolutionary biology.
- Fitness in biology typically means "expected number of descendants" (absolute fitness) or "expected change in population fraction" (relative fitness).
- Neither of these has a direct analogue here, but they are conceptually closer to $P (x_{t + 1} | x_{t})$ than to $f$. The fitness should directly tell you how much more or less of something you should expect in the next generation.
- That is, biology-fitness is about what actually happens when we "run the whole model" forward by a timestep, rather than being an isolated component of the model.
- (In cases like the replicator equation, there is a model component called a "fitness function," but the name is justified by its relationship to biology-fitness given the full model dynamics.)
- Arguably this is just semantics? But if we stop calling $f$ by a suggestive name, it's no longer clear what importance we should attach to it, if any. We might care about the quantity whose gradient we're ascending, or about the biology-fitness, but $f$ is not either of those.
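
To make the earlier point about the gradient-dependent learning rate concrete, here is a rough sketch that computes the expected step along the gradient for several gradient norms. It assumes a Gaussian $P_{m}$ with scale `sigma` and a logistic $P_{s}$ with sharpness `temperature`; both are hypothetical choices, and the post's $P_{s}$ is only known to be monotonic. Since acceptance depends only on the component of the candidate step along $\nabla f$, a one-dimensional integral over that component is enough.

```python
import numpy as np

sigma = 0.1          # scale of candidate steps along the gradient (assumed)
temperature = 0.05   # sharpness of the acceptance curve (assumed)

def p_s(delta_f):
    """Assumed monotonic acceptance probability (logistic in delta_f)."""
    return 1.0 / (1.0 + np.exp(-delta_f / temperature))

def mean_step(grad_norm, n_grid=20_001):
    """Expected accepted step along the gradient, by a simple Riemann sum."""
    eps = np.linspace(-8 * sigma, 8 * sigma, n_grid)
    d_eps = eps[1] - eps[0]
    density = np.exp(-0.5 * (eps / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    weight = density * p_s(grad_norm * eps)   # unnormalized accepted density
    return np.sum(eps * weight) * d_eps / (np.sum(weight) * d_eps)

grad_norms = np.array([0.01, 0.1, 1.0, 10.0])
steps = np.array([mean_step(g) for g in grad_norms])
rates = steps / grad_norms   # effective learning rate: step = rate * |grad f|

for g, s, r in zip(grad_norms, steps, rates):
    print(f"|grad f| = {g:6.2f}   mean step = {s:.5f}   effective rate = {r:.5f}")
```

With these particular choices the mean step grows monotonically with the gradient norm, but the effective learning rate does not stay constant: it is close to `sigma**2 / (2 * temperature)` for small gradients and falls off as the acceptance curve saturates. A different $P_{s}$ would give a different curve, which is the sense in which the correspondence to (S)GD is not exact.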

^[1] I'm using this term here for consistency with the post, though I call it into question later on. "Loss function" or "cost function" would be more standard in SGD.

^[2] There is no such thing as a per-example gradient in the model. I'm assuming the "true gradient" from SGD corresponds to $\nabla f$ in the model, since the intended analogy seems to be "the model steps look like ascending $f$ plus noise, just like SGD steps look like descending the true loss function plus noise."

This can be justified in a few ways:
- If fitness is something like an Elo rating then a Boltzmann distribution is implied
- If we want to extend the two-individual case to the n-individual case but remain invariant to the arbitrary choice of 'baseline' fitness score, then a normalised ratio of exponentials is implied
- We may further appeal to the maximum entropy property of Boltzmann distributions as a natural choice
↩︎
The directional derivative in question is, for $ϵ = | ϵ | u_{ϵ}$ ,

$\begin{aligned}
\lim_{| ϵ | \to 0} f (x_{t} + ϵ) - f (x_{t}) &= | ϵ | \lim_{| ϵ | \to 0} \frac{f (x_{t} + | ϵ | u_{ϵ}) - f (x_{t})}{| ϵ |} \\
&= | ϵ | (u_{ϵ} \cdot \nabla f (x_{t})) \\
&= ϵ \cdot \nabla f (x_{t})
\end{aligned}$ ↩︎
Cautious readers may note that the integral as presented is not posed in the right coordinate system for its integrand.

By a coordinate transformation from Euclidean to hyperspherical coordinates, centred on $0$ , with $\nabla f (x_{t})$ providing the principal axis, $r$ the radial length, $θ$ the principal angular coordinate, and $ϕ$ the other $n - 2$ angular coordinates with axes chosen arbitrarily orthogonally,

$\begin{aligned}
E [θ | x_{t}] &\simeq \int α_{s} g (r) P_{s} (r | \nabla f (x_{t}) | \cos θ) \, θ \, d ϵ \\
&= \int α_{s} g (r) P_{s} (r | \nabla f (x_{t}) | \cos θ) \, θ \left| \frac{\partial (r, θ, ϕ)}{\partial (ϵ)} \right| d r \, d θ \, d ϕ \\
&= \int_{0}^{\infty} \int_{- π}^{π} α_{s} g (r) P_{s} (r | \nabla f (x_{t}) | \cos θ) \, θ \left[ \int_{[0, π]^{n - 2}} \left| \frac{\partial (r, θ, ϕ)}{\partial (ϵ)} \right| d ϕ \right] d θ \, d r \\
&= \int_{0}^{\infty} \int_{- π}^{π} α_{s} g (r) P_{s} (r | \nabla f (x_{t}) | \cos θ) \, θ \left[ \int_{[0, π]^{n - 2}} r^{n - 1} h (ϕ) \, d ϕ \right] d θ \, d r \\
&= \int_{0}^{\infty} \int_{- π}^{π} α_{s} g (r) P_{s} (r | \nabla f (x_{t}) | \cos θ) \, θ \, k r^{n - 1} \, d θ \, d r \\
&= \int_{0}^{\infty} α_{s} k g (r) r^{n - 1} \left[ \int_{- π}^{π} P_{s} (r | \nabla f (x_{t}) | \cos θ) \, θ \, d θ \right] d r \\
&= 0
\end{aligned}$

where we use the fact that the hyperspherical Jacobian $\frac{\partial (r, θ, ϕ)}{\partial (ϵ)}$ is independent of its principal angular coordinate $θ$ and denote by $k r^{n - 1}$ the result of integrating out the Jacobian over the other angular coordinates, and again noting that the symmetrical integral over an odd function is zero. ↩︎
If we do not have a fixed fitness function, and in particular, if it is allowed to vary dependent on the distribution of the population, there are many evolutionarily stable equilibria which can arise where some trait is stably never fixed nor extinguished, but rather persists indefinitely in some proportion of the population. (A classic example is sex ratios.) ↩︎
We can be more precise if we have $P_{f} : \mathbb{R}^{+} \to \mathbb{R} \to [0, 1]$, where the additional first parameter represents time elapsed, so that $P_{f} (δ t, δ f)$ is the probability of a mutation with fitness delta $δ f$ being fixed after elapsed time $δ t$.

Here we impose on $P_{f} (δ t)$ (for fixed $δ t$ time-elapsed) the same monotonicity requirement over fitness differential as imposed on $P_{s}$ before.

The various 'in-flight' and intervening mutations in the proof also therefore implicitly carry with them $t_{m}$, the time they emerged, and the additional argument to $P_{f}$ is thus $δ t = t_{1} - t_{m}$.

In practice we should expect $P_{f}$ to vary time-wise as a monotonically nondecreasing asymptote, but this property is not required for the proof. ↩︎
