by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, and Anson Ho
You can find the complete article here. We provide a short summary below.
In short: to estimate the compute used to train a Deep Learning model, we can either 1) directly count the number of operations needed, or 2) estimate it from GPU time.
Method 1: Counting operations in the model
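As a rough sketch of Method 1, one common approximation is that a forward pass costs about 2 FLOP per parameter per training example (one multiply and one add per connection), and the backward pass costs about twice the forward pass. The model and dataset sizes below are purely illustrative:

```python
# Sketch of Method 1: count operations directly.
# Approximation: forward pass ~2 FLOP per parameter per example,
# backward pass ~2x the forward pass, so ~6 FLOP per parameter per example.
def training_compute_method1(n_params: float, n_examples: float, n_epochs: float = 1) -> float:
    """Total training FLOP ~= 6 * parameters * examples * epochs."""
    return 6 * n_params * n_examples * n_epochs

# Illustrative numbers: a 175B-parameter model trained on 300B tokens
print(f"{training_compute_method1(175e9, 300e9):.2e}")  # ~3.15e+23 FLOP
```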
Method 2: GPU time
We are uncertain about what utilization rate is best, but our recommendation is 30% for Large Language Models and 40% for other models.
You can read more about method 1 here and about method 2 here.
Other parts of this article that may be of interest include:
- We argue that the ratio of operations in the backward and forward passes of neural networks is often close to 2:1. More.
- We discuss how the formula of method 1 changes for recurrent models. More.
- We argue that dropout does not affect the number of operations per forward and backward pass. More.
- We have compiled a table of parameter and operation counts for common neural network layers. More.
- We give a detailed example of method 1. More.
- We discuss commonly used number representation formats in ML. More.
- We share an estimate of the average performance of GPU cards each year. More.
- We share some reported GPU usages in real experiments. More.
- We give a detailed example of method 2. More.
- We compare both methods and conclude they result in similar estimates. More.
- We discuss the use of profilers to measure compute. More.
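To give a flavor of the kind of counting involved, here is a sketch (our own illustration, not the article's table) of parameter and forward-pass operation counts for a single fully connected layer, applying the roughly 2:1 backward-to-forward ratio mentioned above:

```python
# Illustrative counts for a fully connected (dense) layer.
def dense_layer_counts(n_in: int, n_out: int, bias: bool = True):
    params = n_in * n_out + (n_out if bias else 0)
    # One multiply and one add per weight, plus one add per bias term:
    forward_flop = 2 * n_in * n_out + (n_out if bias else 0)
    backward_flop = 2 * forward_flop  # backward pass ~2x the forward pass
    return params, forward_flop, backward_flop

print(dense_layer_counts(1000, 100))  # (100100, 200100, 400200)
```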
You can find the article here.
This is fantastic. Really appreciate both the detailed deep-dive in the document, and the summary here. This is also timely, given that teams working on superscale models with concerning capabilities haven't generally been too forthcoming with compute estimates. (There are exceptions.)
As you and Alex point out in the sibling thread, the biggest remaining fudge factors seem to be:
Nonetheless, my rough guess would be that your method is pretty much guaranteed to be right within an OOM, and probably within a factor of 2 or less. That seems pretty good! It's certainly an improvement over anything I've seen previously along these lines. Congrats!
This seems pretty well done! Some thoughts on future research in this direction:
Thank you Alex! You make some great points.
We thought so too, but in practice it has been surprisingly hard; profilers turn out to be quite buggy. Our colleague Marius looked into this in more depth here.
Maybe we are just going about it the wrong way. If someone here figures out how to directly measure compute in e.g. a PyTorch or TF model, it would be a huge boon to us.
Great suggestions! I think those would be great caveats to look into in future work.
My naive impression is that our conclusions do not change much. You would just need to plug the effective performance (peak performance × utilization) into the second formula.
The trickiest part would probably be figuring out the utilization rate for the custom hardware, though this is a general problem with the second method.
I think that would be nice! We started a public spreadsheet with some info on different hardware. This might be of help to someone who wants to dig deeper into the topic!
I'd be pretty excited to see more work on this. Jaime already shared our hardware sheet where we collect information on GPUs, but as you outline, those figures are peak performance and can be misleading.
Indeed, the MLPerf benchmarks are useful. I've already gathered their data in this sheet and would love to see someone play around with it. Besides MLPerf, Lambda Labs also shares some standardized benchmarks.