Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won't behave as desired out-of-distribution. This is probably a serious concern for the system maintainers, even ignoring (largely externalized) X-risks.

Note that their improvement over Strassen on 4x4 matrices is for *finite fields only*, i.e. modular arithmetic, not what most neural networks use.


That's a very important correction. For real arithmetic they only improve on rectangular matrices (e.g. a 3x4 multiplied by a 4x5), which is less important and less well studied.


In fact, the 47-multiplication result is on Z/2Z, so it's not even general modular arithmetic.

That said, there are still speedups on standard floating-point arithmetic, both in the number of multiplications and in wall-clock time.
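To make the "number of multiplications" metric concrete, here is a small sketch (not from the paper, just the classic baseline) of Strassen's 2x2 scheme, which uses 7 scalar multiplications instead of the naive 8; additions are free in this cost model. The same algebraic identities hold over any ring, so the scheme works unchanged over the reals and over Z/2Z; what the paper improves is the count for larger blocks, and only over specific fields.

```python
import numpy as np

def strassen_2x2(A, B):
    """Strassen's scheme: multiply two 2x2 matrices with 7 scalar
    multiplications instead of the naive 8 (additions are not counted)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

# Works over the reals...
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
assert np.allclose(strassen_2x2(A, B), A @ B)

# ...and over Z/2Z: the identities are exact in any ring, so just reduce mod 2.
A2, B2 = A.astype(int) % 2, B.astype(int) % 2
assert np.array_equal(strassen_2x2(A2, B2) % 2, (A2 @ B2) % 2)
```

Recursing on this 2x2 block scheme is what gives the O(n^2.807) asymptotics; an improved count for a fixed block size over one field tightens the exponent only for computations over that field.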

**[Edit Jan 19, 2023: I no longer think the below is accurate. My argument rests on an unstated assumption: that when weight decay kicks in, the counter-pressure against it is stronger for the 101st weight (the "bias/generalizer") than for the other weights (the "memorizers"), since the gradient is stronger in that direction. In fact, this mostly isn't true, for the same reason Adam(W) moved towards the **** solution to begin with, **before** weight decay strongly kicked in: each dimension of the gradient is normalized relative to its typical magnitudes i...

For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach?