All of dsj's Comments + Replies

For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach?

There was a recent Twitter thread about this. See here and here.

Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.

Note that their improvement over Strassen on 4x4 matrices is for finite fields only, i.e. modular arithmetic, not what most neural networks use.

Paul Christiano, 4mo:
That's a very important correction. For real arithmetic they only improve for rectangular matrices (e.g. 3x4 multiplied by 4x5), which is less important and less well studied.
Lawrence Chan, 4mo:
In fact, the 47-multiplication result is on Z/2Z, so it's not even general modular arithmetic. That being said, there are still speedups on standard floating point arithmetic, both in terms of number of multiplications and in wall clock time.
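For a concrete reference point on the multiplication counts being discussed, here is a minimal sketch (not from the paper) of one level of Strassen's 2x2 scheme, which uses 7 block multiplications instead of 8. Applied recursively at both levels it multiplies 4x4 matrices with 7 × 7 = 49 scalar multiplications (vs. 64 naively); that 49 is the baseline which AlphaTensor's 47-multiplication scheme, valid only over Z/2Z, improves on. Variable names are illustrative.

```python
import numpy as np

def strassen_2x2(A, B):
    """One level of Strassen's scheme on 2x2 block matrices.

    A and B are 2x2 nested lists whose entries are blocks (matrices);
    only 7 block multiplications are used instead of 8.
    """
    a, b, c, d = A[0][0], A[0][1], A[1][0], A[1][1]
    e, f, g, h = B[0][0], B[0][1], B[1][0], B[1][1]

    m1 = (a + d) @ (e + h)
    m2 = (c + d) @ e
    m3 = a @ (f - h)
    m4 = d @ (g - e)
    m5 = (a + b) @ h
    m6 = (c - a) @ (e + f)
    m7 = (b - d) @ (g + h)

    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4,           m1 - m2 + m3 + m6]]

# One level on 4x4 matrices: 7 block multiplications of 2x2 blocks.
# Applying the same scheme again inside each block gives 49 scalar
# multiplications in total, the Strassen baseline mentioned above.
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
blocks_A = [[A[:2, :2], A[:2, 2:]], [A[2:, :2], A[2:, 2:]]]
blocks_B = [[B[:2, :2], B[:2, 2:]], [B[2:, :2], B[2:, 2:]]]
C = np.block(strassen_2x2(blocks_A, blocks_B))
assert np.allclose(C, A @ B)
```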

[Edit Jan 19, 2023: I no longer think the below is accurate. My argument rests on an unstated assumption: that when weight decay kicks in, the counter-pressure against it is stronger for the 101st weight (the "bias/generalizer") than for the other weights (the "memorizers"), since the gradient is stronger in that direction. In fact, this mostly isn't true, for the same reason Adam(W) moved towards the  solution to begin with before weight decay strongly kicked in: each dimension of the gradient is normalized relative to its typical magnitude i...
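For reference, a minimal sketch of a generic decoupled-weight-decay (AdamW) update, illustrating the normalization point above: each coordinate's step is divided by a running estimate of that coordinate's own gradient magnitude, so coordinates with very different raw gradients take comparably sized steps. This is a generic illustration, not the toy 101-weight setup from the retracted argument; the hyperparameter values are arbitrary.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step (decoupled weight decay)."""
    m = beta1 * m + (1 - beta1) * g         # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2      # second-moment EMA
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    # Per-coordinate normalization: the step is roughly lr * sign(g)
    # for any coordinate whose gradient has a steady sign, regardless
    # of that gradient's raw magnitude.
    adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)
    # Decoupled weight decay pulls every coordinate toward zero at a
    # rate proportional to its current value, not to its gradient.
    w = w - adam_step - lr * weight_decay * w
    return w, m, v

# Two coordinates whose raw gradients differ by 1000x nonetheless
# move by roughly the same amount after many steps.
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 201):
    g = np.array([1e-3, 1.0])
    w, m, v = adamw_step(w, g, m, v, t)
print(w)  # both entries are close to -0.2
```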