Good explanation, conceptually.
Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.
I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:
One criticism: at a few points I think this post doesn't cleanly distinguish between reward-maximization and utility-maximization. For instance, optimizing for "the abstract concept of ‘I want to be able to sing well’" definitely sounds like utility-maximization.
If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...
Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.
A couple notes:
In terms of other uses... you could e.g. put some "sensors" and "actuators" in the simulation, then train some controller to control the simulated system, and see whether the data structures learned by the controller correspond to singular vectors of the jacobian. That could make for an interesting set of experiments, looking at different sensor/actuator setups and different controller architectures/training schemes to see which ones do/don't end up using the singular-value structure of the system.
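As a toy sketch of what that experiment might look like (the `step` dynamics below are a made-up stand-in for the actual simulation), one could estimate the Jacobian numerically and look at its singular vectors:

```python
import numpy as np

# Hypothetical stand-in for one step of the simulated system's dynamics;
# the matrix A and the nonlinearity are made up for illustration.
def step(x):
    A = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.0, 0.8, 0.2, 0.0],
                  [0.0, 0.0, 0.7, 0.3],
                  [0.1, 0.0, 0.0, 0.6]])
    return A @ x + 0.01 * np.tanh(x)

# Finite-difference Jacobian of f at x.
def numerical_jacobian(f, x, eps=1e-6):
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for i in range(len(x)):
        dx = np.zeros(len(x))
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

x0 = np.ones(4)
J = numerical_jacobian(step, x0)
U, S, Vt = np.linalg.svd(J)
print("singular values:", S)
# Columns of U (rows of Vt) are the output (input) directions, ordered by
# how strongly the dynamics amplify them - the structure one would compare
# against whatever representations the trained controller learns.
```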
My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow avoids that problem, but step sizes small enough to approximate continuous flow would make GD prohibitively slow in practice. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, then an optimization algorithm which converges very poorly to sharp minima could actually be a good thing, so long as you combine it with some way to escape the basin of a sharp minimum (e.g. noise in SGD).
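A minimal illustration of the condition-number point (just plain quadratic losses, nothing specific to neural nets): with the step size pinned near the stability limit set by the largest curvature, the ill-conditioned minimum takes orders of magnitude longer to converge to.

```python
import numpy as np

# Gradient descent on f(x) = 0.5 * x @ H @ x for two quadratics with the
# same largest curvature but very different condition numbers. The stable
# step size is capped by the largest eigenvalue of H, so the direction
# with tiny curvature in the ill-conditioned ("sharp one way, flat the
# other") problem shrinks the error extremely slowly.

def gd_steps_to_converge(H, x0, lr, tol=1e-6, max_steps=100_000):
    x = x0.copy()
    for t in range(max_steps):
        if np.linalg.norm(x) < tol:  # minimum is at x = 0
            return t
        x = x - lr * (H @ x)
    return max_steps

x0 = np.array([1.0, 1.0])
well_conditioned = np.diag([1.0, 1.0])   # condition number 1
ill_conditioned = np.diag([1.0, 1e-3])   # condition number 1000

lr = 0.9  # safely below the stability limit 2 / lambda_max = 2.0
steps_well = gd_steps_to_converge(well_conditioned, x0, lr)
steps_ill = gd_steps_to_converge(ill_conditioned, x0, lr)
print("well-conditioned:", steps_well, "steps")
print("ill-conditioned: ", steps_ill, "steps")
```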
That said, I haven't read the various papers on this, so I'm at high risk of misunderstanding.
Also worth noting that there are reasons to expect convergence to flat minima besides bias in SGD itself. A flatter basin fills more of the parameter space than a sharper basin, so we're more likely to initialize in a flat basin (relevant to the NTK/GP/Mingard et al picture) or accidentally stumble into one.
I'm still wrapping my head around this myself, so this comment is quite useful.
Here's a different way to set up the model, where the phenomenon is more obvious.
Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state k, it's equally likely to transition to k+1 or k−1). The rate λk at which the system leaves state k serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state k and state k−1 satisfy

λk pk = λk−1 pk−1

... i.e. the flux from values-k-and-above to values-below-k is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if λk>λk−1, then pk<pk−1: the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).
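This is easy to check numerically (the specific leave-rates λk below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Birth-death chain on states 0..N: the system leaves state k at rate
# lam[k] and jumps to k+1 or k-1 with equal probability (jumps off the
# ends are rejected, which keeps the balance condition intact). Detailed
# balance, lam[k]*p[k] = lam[k-1]*p[k-1], gives p[k] proportional to
# 1/lam[k]: more time spent in lower-"diffusion" states.

N = 10
lam = np.linspace(1.0, 5.0, N + 1)  # arbitrary increasing leave-rates

# Analytic steady state from detailed balance:
p = (1.0 / lam) / np.sum(1.0 / lam)

# Monte Carlo check: simulate the chain, measuring the *time* spent in
# each state (holding times are Exp(lam[k]) = Exp(1)/lam[k]).
n_steps = 200_000
jumps = rng.choice((-1, 1), size=n_steps)
holdings = rng.exponential(size=n_steps)
time_in_state = np.zeros(N + 1)
k = N // 2
for i in range(n_steps):
    time_in_state[k] += holdings[i] / lam[k]
    k = min(max(k + jumps[i], 0), N)
empirical = time_in_state / time_in_state.sum()

print("analytic: ", np.round(p, 3))
print("empirical:", np.round(empirical, 3))
```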
Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go into the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.
I haven't looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)
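The betting analogy is easy to check numerically (the 1.8×/0.3× multipliers are arbitrary choices with expected multiplier 1.05 > 1 but negative expected log):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each round, wealth is multiplied by 1.8 or 0.3 with equal probability.
# The expected multiplier per round is 0.5*1.8 + 0.5*0.3 = 1.05 > 1, so
# expected wealth grows like 1.05**n - but the typical multiplier is
# sqrt(1.8 * 0.3) ~ 0.735 < 1, so almost every individual bettor goes broke.

n_bettors, n_rounds = 10_000, 100
multipliers = rng.choice([1.8, 0.3], size=(n_bettors, n_rounds))
wealth = multipliers.prod(axis=1)  # each bettor starts with wealth 1

print("theoretical mean wealth:", 1.05 ** n_rounds)
print("median simulated wealth:", np.median(wealth))
print("fraction of bettors below starting wealth:", (wealth < 1.0).mean())
```

The expectation points one way while the overwhelming majority of trajectories go the other - the same distinction as "drifts one way on average" vs "drifts one way with greater-than-even probability".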
Also, one simple way to see that the "drift" interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave - i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.
does it represent a bias towards less variance over the different gradients one can sample at a given point?
This is a good summary.
I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
... so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it's capturing something different. (Though that's based on just a handful of examples, so the idea in your head is probably quite different from what I've interpolated from those examples.)
On a side note, it feels weird to be the one saying "we can't separate uncertainty-handling from goals" and you saying "ok but it seems like goals and uncertainty could somehow be factored". Usually I expect you to be the one saying uncertainty can't be separated from goals, and me to say the opposite.
Picture a linear approximation, like this:
The tangent space at point a is that whole line labelled "tangent".
The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)
The reason we want to talk about "the tangent space" is that it lets us precisely state things like e.g. Newton's method in terms of search: Newton's method finds a point at which f(x) is approximately 0 by finding a point where the tangent space hits zero (i.e. where the line in the picture above hits the x-axis). So, the tangent space effectively specifies the "search objective" or "optimization objective" for one step of Newton's method.
In the NTK/GP model, neural net training is functionally-identical to one step of Newton's method (though it's Newton's method in many dimensions, rather than one dimension).
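In one dimension, that picture of Newton's method is just: replace f with its tangent line and jump to the tangent line's zero. A quick sketch (f(x) = x² − 2 is an arbitrary example, with root √2):

```python
# One-dimensional Newton's method: at each step, replace f with its
# tangent line at the current point and jump to where that line crosses
# zero (i.e. where the tangent space hits the x-axis).

def f(x):
    return x ** 2 - 2

def df(x):
    return 2 * x

x = 1.0
for _ in range(5):
    # Tangent line: f(x) + df(x) * (x_new - x) = 0  =>  x_new = x - f(x)/df(x)
    x = x - f(x) / df(x)

print(x)  # ~ 1.4142135623730951
```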