A simple stylized example: imagine you have some algorithm for processing each cluster of inputs from the retina.
You might think that because the algorithm is symmetric - you want to run the same algorithm regardless of which cluster it is - you only need one copy of the bytecode for the compiled algorithm.
This is not the case. Information-wise, sure: there is only one program, taking n bytes, so you save disk space when storing your model.
RAM/cache consumption is another story: each of the parallel processing units you have to use (you will not get real-time results on images if you process them serially) needs its own copy of the algorithm.
And this rule applies throughout the human body: every nerve cluster, audio processing, and so on.
This also greatly expands the memory required over your 24 GB 4090 example. For one thing, the human brain is very sparse, and while Nvidia has managed to improve sparse-network performance, it still takes memory to represent all the sparse values.
I might note that you could have tried to fill in the "cartoon switch" for human synapses. Each is likely a MAC per incoming axon at no better than 8 bits of precision, feeding an accumulator for the cell membrane with no better than 16 bits of precision. (It's probably less, but a digital version would have to use 16 bits.)
So add up the number of synapses in the human brain, assume a 1 kHz update rate, and that's how many TOPS you need.
Let me do the math for you real quick:
68 billion neurons, about 1,000 connections each, at 1 kHz (it's very sparse). So we need 68 billion x 1,000 x 1,000 = 6.8e16 ops/s = 68,000 TOPS.
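The arithmetic is small enough to check in a few lines (these are the estimates above, not measured figures):

```python
# Back-of-envelope ops/sec for one brain, using the figures above.
neurons = 68e9               # this comment's estimate
synapses_per_neuron = 1e3    # ~1,000 connections each (very sparse)
rate_hz = 1e3                # assumed 1 kHz update rate

ops_per_sec = neurons * synapses_per_neuron * rate_hz
tops = ops_per_sec / 1e12    # 1 TOPS = 1e12 ops/sec
print(f"{ops_per_sec:.1e} ops/s = {tops:,.0f} TOPS")
```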
Current gen data center GPU: https://www.nvidia.com/en-us/data-center/h100/
So we would need about 17 of them to match one human brain. You will never get peak performance in practice (especially at this level of sparsity), so call it 2-3 nodes with 16 cards each.
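A quick check of the card count, assuming roughly 4,000 INT8 TOPS per H100-class GPU (a peak vendor figure with sparsity enabled; real workloads won't sustain it):

```python
required_tops = 68_000     # from the brain estimate above
tops_per_gpu = 4_000       # assumed peak; sustained throughput is lower
gpus_at_peak = required_tops / tops_per_gpu
print(gpus_at_peak)
```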
Note they would have a total of 48 x 80 GB = 3,840 gigabytes of memory.
Since a brain holds 68 billion x 1,000 x (1 byte) = 68 terabytes of weights, that's the problem: we only have about 5% as much memory as we need.
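The memory gap in a few lines (same figures as above; the cluster total is 48 cards at 80 GB each):

```python
neurons = 68e9
synapses_per_neuron = 1e3
bytes_per_weight = 1         # assumed 8-bit weights

weights_bytes = neurons * synapses_per_neuron * bytes_per_weight
weights_tb = weights_bytes / 1e12          # 68 TB of weights
cluster_gb = 48 * 80                       # 3,840 GB across 48 cards
coverage = cluster_gb / (weights_tb * 1e3)
print(f"{weights_tb:.0f} TB of weights; cluster covers {coverage:.1%}")
```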
This is the reason for neuromorphic compute based on SSDs: compute's not the bottleneck, memory is.
We can get brain-scale performance with 20 times as much hardware (closing that roughly 20x memory gap), or 20 x 3 x 16 = 960 H100s. At $25k each that's $24 million for the GPUs, plus all the motherboards and processors and rack space. Maybe $50 million?
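And the cost arithmetic, using the per-card price assumed above:

```python
gpus = 20 * 3 * 16               # 20x the 48-card cluster
gpu_cost_usd = gpus * 25_000     # assumed ~$25k per card
print(gpus, f"${gpu_cost_usd:,}")
```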
That's a drop in the bucket and easily affordable by current AI companies.
Epistemic note: I'm a computer engineer (CS master's, ML) and I work on inference accelerators.
Please read a neuroscience book, even an introductory one, on how a synapse works. Just 1 chapter, even.
There's a MAC in there. The incoming action potential hits the synapse and releases a certain quantity of neurotransmitter across the gap. The sending cell can vary how much neurotransmitter it releases, and the receiving cell can vary how many active receptors it has. The type of neurotransmitter determines the gain and sign (this is like the exponent and sign bits of an 8-bit float).
These two variables can be combined into a single coefficient; you can think of it as a "voltage delta" (it can be + or -).
So it's (1) * (voltage gain) = change in target cell voltage.
For ANN, it's <activation output> * <weight> = change in target node activation input.
The brain also uses timing to carry more information than just "1": the exact time the pulse arrived matters, to a certain resolution. It is NOT infinite, for reasons I can explain if you want.
So the final equation is (1) * (synapse state) * (voltage gain) = change in target cell voltage.
Aka you have to multiply 2 numbers together and add, which is what "multiply-accumulate" units do.
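The synapse-as-MAC claim can be sketched as a toy model. This is illustrative, not a biophysical simulation; the function name and the quantization scheme are mine:

```python
def synapse_mac(spikes, gains, membrane):
    """Toy sketch of the synapse-as-MAC idea.

    spikes:   1 if an action potential arrived on that axon, else 0
    gains:    signed per-synapse "voltage gain" (+ excitatory, - inhibitory)
    membrane: running accumulator for the cell membrane voltage
    """
    for spike, gain in zip(spikes, gains):
        # Crude signed 8-bit quantization: gains resolved to ~1 part in 256.
        q = round(gain * 127) / 127
        membrane += spike * q      # multiply-accumulate
    return membrane

# Example: three spikes arrive; the second synapse stays silent.
delta = synapse_mac([1, 0, 1, 1], [0.5, -0.2, -0.1, 0.3], membrane=0.0)
print(delta)
```

Each arriving spike contributes (1) * (voltage gain) to the accumulator, which is exactly what a multiply-accumulate unit computes.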
All the horrible electrical noise in the brain, plus biological noise and contaminants and other factors, is why I put it at only 8 bits - 1 part in 256 - of precision. That's realistically probably generous; it's probably not even that good.
There is an immense amount of additional complexity in the brain, but almost none of it matters for determining inference outputs. Action potentials propagate along axons at up to roughly 100 meters per second, so many slower biological processes just don't matter at all. Same as how a transistor's internal physics are irrelevant: it's a cartoon switch.
For training, sure, if we wanted a system to work like a brain we'd have to model some of this, but we don't. We can train using whatever algorithm measurably is optimal.
Similarly, we never have to bother with a "minicolumn". We only care about what works best. Notice how aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.
We probably will find something way better than a minicolumn. Some argue that's what a transformer is.