For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach?
There was a recent Twitter thread about this. See here and here.
Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.
Note that their improvement over Strassen on 4x4 matrices is for finite fields only, i.e. modular arithmetic, not what most neural networks use.
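For reference, here is a minimal sketch of the Strassen baseline under modular arithmetic. It assumes the standard seven-product 2x2 scheme applied recursively; the names `strassen`, `naive`, and the `counter` argument are illustrative and not taken from the paper. On 4x4 inputs the recursion uses 7 × 7 = 49 scalar multiplications, against 64 for the schoolbook method, so 49 is the count a finite-field result has to beat.

```python
def naive(A, B, p):
    """Schoolbook matrix product modulo p (n^3 scalar multiplications)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % p for j in range(n)]
            for i in range(n)]


def strassen(A, B, p, counter):
    """Strassen product modulo p for square matrices whose size is a power of 2.

    counter is a one-element list; counter[0] accumulates the number of
    scalar multiplications performed (a bookkeeping device for this sketch).
    """
    n = len(A)
    if n == 1:
        counter[0] += 1
        return [[(A[0][0] * B[0][0]) % p]]
    h = n // 2

    def block(M, r, c):
        # Extract the (r, c) quadrant of M.
        return [row[c * h:(c + 1) * h] for row in M[r * h:(r + 1) * h]]

    def add(X, Y):
        return [[(x + y) % p for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    def sub(X, Y):
        return [[(x - y) % p for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    A11, A12, A21, A22 = block(A, 0, 0), block(A, 0, 1), block(A, 1, 0), block(A, 1, 1)
    B11, B12, B21, B22 = block(B, 0, 0), block(B, 0, 1), block(B, 1, 0), block(B, 1, 1)

    # The seven Strassen products.
    M1 = strassen(add(A11, A22), add(B11, B22), p, counter)
    M2 = strassen(add(A21, A22), B11, p, counter)
    M3 = strassen(A11, sub(B12, B22), p, counter)
    M4 = strassen(A22, sub(B21, B11), p, counter)
    M5 = strassen(add(A11, A12), B22, p, counter)
    M6 = strassen(sub(A21, A11), add(B11, B12), p, counter)
    M7 = strassen(sub(A12, A22), add(B21, B22), p, counter)

    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    # Stitch the four quadrants back together.
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

Calling `strassen(A, B, 2, counter)` with `counter = [0]` on two 4x4 matrices over the integers mod 2 should agree with `naive(A, B, 2)` and leave `counter[0]` at 49.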
[Edit Jan 19, 2023: I no longer think the below is accurate. My argument rests on an unstated assumption: that when weight decay kicks in, the counter-pressure against it is stronger for the 101st weight (the "bias/generalizer") than for the other weights (the "memorizers"), since the gradient is stronger in that direction. In fact, this mostly isn't true, for the same reason Adam(W) moved towards the 12M+12G solution to begin with before weight decay strongly kicked in: each dimension of the gradient is normalized relative to its typical magnitudes i...