I am not sure the calculation in Appendix B is accurate; if I am misreading it, I would appreciate a better explanation.

In the first line (the calculation of 'm'), we can clearly see there are 4 operations. If we assume (1 - beta1) is pre-calculated, that drops to 3 operations.
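For reference, here is a minimal sketch of how I am counting, in plain Python with a scalar beta1 (matching PyTorch's defaults); the variable names are my own:

```python
# Adam's first-moment update: m = beta1 * m + (1 - beta1) * g
beta1 = 0.9
m, g = 0.0, 2.0

# As written: multiply, subtract, multiply, add -> 4 operations.
m = beta1 * m + (1 - beta1) * g

# With (1 - beta1) hoisted out of the loop, each step is
# multiply, multiply, add -> 3 operations.
one_minus_beta1 = 1 - beta1
m2 = beta1 * 0.0 + one_minus_beta1 * g
```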

If we accept that argument, then the calculations of 'm_hat' and 'v_hat' should be counted as only 1 operation each. I also see what looks like a transpose there (the superscript 't'), which seems odd to me too; although PyTorch's documentation gives the same set of mathematical equations, its default parameters use scalar values for beta1 and beta2.
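And here is how I am currently reading the bias correction, under my assumption that the superscript 't' is the timestep power of the scalar beta rather than a transpose (the variable names are again my own):

```python
# Bias correction as I read it: m_hat = m / (1 - beta1 ** t),
# where t is the current timestep and beta1 is a scalar.
beta1, t = 0.9, 3
m = 0.2

# Naively: power, subtract, divide -> 3 operations.
# If (1 - beta1 ** t) is pre-computed once per step, only the
# divide remains -> 1 operation, matching the counting above.
bias_correction = 1 - beta1 ** t
m_hat = m / bias_correction
```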

I am trying to make sense of the operation count here but cannot. Could you please provide more information on this?
