The construction is correct.
Note that for M2, conceptually we don't need to modify it, we just need to use the original M2 but apply it only to the subcomponents of the new X-variable which correspond to the original X-variable. Alternatively, we can take the approach you do: construct M′2 which has a distribution over the new X, but "doesn't say anything" about the new components, i.e. the it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.
It seems to me maybe the interesting thing is whether you can talk about a search algorithm in terms of particular kinds of abstractions rather than anything else, which if you go far enough around comes back to your position, but with more explained.
Good explanation, conceptually.
Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.
I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:
One criticism: at a few points I think this post doesn't cleanly distinguish between reward-maximization and utility-maximization. For instance, the optimizing for "the abstract concept of ‘I want to be able to sing well’" definitely sounds like utility-maximization.
If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...
Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.
A couple notes:
In terms of other uses... you could e.g. put some "sensors" and "actuators" in the simulation, then train some controller to control the simulated system, and see whether the data structures learned by the controller correspond to singular vectors of the jacobian. That could make for an interesting set of experiments, looking at different sensor/actuator setups and different controller architectures/training schemes to see which ones do/don't end up using the singular-value structure of the system.
My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, then an optimization algorithm which converges very poorly to sharp minima could actually be a good thing, so long as you combine it with some way to escape the basin of the sharp minimum (e.g. noise in SGD).
That said, I haven't read the various papers on this, so I'm at high risk of misunderstanding.
Also worth noting that there are reasons to expect convergence to flat minima besides bias in SGD itself. A flatter basin fills more of the parameter space than a sharper basin, so we're more likely to initialize in a flat basin (relevant to the NTK/GP/Mingard et al picture) or accidentally stumble into one.
I'm still wrapping my head around this myself, so this comment is quite useful.
Here's a different way to set up the model, where the phenomenon is more obvious.
Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state k, it's equally likely to transition to k+1 or k−1). The rate λk at which the system leaves state k serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state k and state k−1 satisfy
... i.e. the flux from values-k-and-above to values-below-k is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if λk>λk−1, then pk<pk−1: the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).
Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go in to the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.
I haven't looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)
Also, one simple way to see that the "drift" interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave - i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.
does it represent a bias towards less variance over the different gradients one can sample at a given point?