johnswentworth

Search-in-Territory vs Search-in-Map

It seems to me maybe the interesting thing is whether you can talk about a search algorithm in terms of particular kinds of abstractions rather than anything else, which if you go far enough around comes back to your position, but with more explained.

+1

Reward Is Not Enough

Good explanation, conceptually.

Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.

Reward Is Not Enough

Nice post!

I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:

- The argument from Why Subagents?. This is about utility maximizers rather than reward maximizers, but it points in a similar qualitative direction. Summary: once we allow internal state, utility-maximizers are not the only inexploitable systems; markets/committees of utility-maximizers also work.
- The argument from Fixing The Good Regulator Theorem. That post uses some incoming information to "choose" between many different objectives, but that's essentially emulating multiple objectives. If we have multiple objectives
*explicitly*, then the argument should simplify. Summary: if we need to keep around information relevant to many different objectives, but have limited space, that forces the use of a map/model in a certain sense.

One criticism: at a few points I think this post doesn't cleanly distinguish between reward-maximization and utility-maximization. For instance, the optimizing for "the abstract concept of ‘I want to be able to sing well’" definitely sounds like utility-maximization.

Testing The Natural Abstraction Hypothesis: Project Intro

If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...

Abstraction Talk

Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.

Testing The Natural Abstraction Hypothesis: Project Intro

Nice!

A couple notes:

- Make sure to check that the values in the jacobian aren't exploding - i.e. there's not values like 1e30 or 1e200 or anything like that. Exponentially large values in the jacobian probably mean the system is chaotic.
- If you want to avoid explicitly computing the jacobian, write a method which takes in a (constant) vector and uses backpropagation to return . This is the same as the time-0-to-time-t jacobian dotted with , but it operates on size-n vectors rather than n-by-n jacobian matrices, so should be a lot faster. Then just wrap that method in a LinearOperator (or the equivalent in your favorite numerical library), and you'll be able to pass it directly to an SVD method.

In terms of other uses... you could e.g. put some "sensors" and "actuators" in the simulation, then train some controller to control the simulated system, and see whether the data structures learned by the controller correspond to singular vectors of the jacobian. That could make for an interesting set of experiments, looking at different sensor/actuator setups and different controller architectures/training schemes to see which ones do/don't end up using the singular-value structure of the system.

SGD's Bias

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, then an optimization algorithm which converges very poorly to sharp minima could actually be a good thing, so long as you combine it with some way to escape the basin of the sharp minimum (e.g. noise in SGD).

That said, I haven't read the various papers on this, so I'm at high risk of misunderstanding.

Also worth noting that there are reasons to expect convergence to flat minima besides bias in SGD itself. A flatter basin fills more of the parameter space than a sharper basin, so we're more likely to initialize in a flat basin (relevant to the NTK/GP/Mingard et al picture) or accidentally stumble into one.

SGD's Bias

I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious.

Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state , it's equally likely to transition to or ). The rate at which the system leaves state serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state and state satisfy

... i.e. the flux from values--and-above to values-below- is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if , then : the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).

Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go in to the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.

I haven't looked at the math for the diode-resistor system, but if the voltage *averages* to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)

Also, one simple way to see that the "drift" interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave - i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.

SGD's Bias

does it represent a bias towards less variance over the different gradients one can sample at a given point?

Yup, exactly.

The construction is correct.

Note that for M2, conceptually we don't need to modify it, we just need to use the original M2 but apply it only to the subcomponents of the new X-variable which correspond to the original X-variable. Alternatively, we can take the approach you do: construct M′2 which has a distribution over the new X, but "doesn't say anything" about the new components, i.e. the it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.