I haven’t read the linked post/comment yet, and perhaps I am missing something very obvious, but: we have exaflop computing (that’s 10^18) right now. Is Tim Dettmers really saying that we’re not going to see a 1000x speed-up, in a century or possibly ever? That seems like a shocking claim, and I struggle to imagine what could justify it.
EDIT: I have now read the linked comment; it speaks of fundamental physical limitations such as speed of light, heat dissipation, etc., and says:
These are all hard physical boundaries that we cannot alter. Yet, all these physical boundaries will be hit within a couple of years and we will fall very, very far short of human processing capabilities and our models will not improve much further. Two orders of magnitude of additional capability are realistic, but anything beyond that is just wishful thinking.
I do not find this convincing. Taking the outside view, we can see all sorts of similar predictions of limitations having been made over the course of computing history, and yet Moore’s Law is still going strong despite quite a few years of predictions of imminent trend-crashing. (Take a look at the “Recent trends” and “Alternative materials research” sections of the Wikipedia page; do you really see any indication that we’re about to hit a hard barrier? I don’t…)
Also, these physical limits – insofar as they are hard limits – are limits on various aspects of the impressiveness of the technology, but not on the cost of producing the technology. Learning-by-doing, economies of scale, process-engineering R&D, and spillover effects should still allow for costs to come down, even if the technology itself can hardly be improved.
Tim Dettmers's whole approach seems to assume that there are no computational shortcuts: no tricks that programmers can use for speed where evolution brute-forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute-force algorithm, and programmers can use fast-Fourier-transform-based convolutions instead. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system and find that some are strongly attractive, and so just work in that subspace.
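As a toy illustration of that kind of shortcut (not a claim about what any part of the brain actually computes), here is a sketch comparing a brute-force convolution with an FFT-based one. The function names and the small test signal are made up for the example; the FFT is a plain textbook radix-2 version written out so the block needs nothing beyond the standard library.

```python
import cmath

def direct_convolve(a, b):
    """Brute-force linear convolution: O(len(a) * len(b)) multiply-adds."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result

def fft_convolve(a, b):
    """Same linear convolution via FFT: O(n log n) after zero-padding."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:
        n *= 2
    fa = fft([complex(x) for x in a] + [0j] * (n - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (n - len(b)))
    product = [x * y for x, y in zip(fa, fb)]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v.real / n for v in fft(product, invert=True)[:size]]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical input
kernel = [0.5, -1.0, 0.25]           # hypothetical kernel
slow = direct_convolve(signal, kernel)
fast = fft_convolve(signal, kernel)
max_diff = max(abs(s - f) for s, f in zip(slow, fast))
```

Both routines return the same numbers up to floating-point rounding; only the operation count differs, and the gap grows with input size.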
Of course, all this is providing an upper bound on the amount of compute needed to make a human level AI. Tim Dettmers is trying to prove it can't be done. This needs a lower bound. To get a lower bound, don't look at how long it takes a computer to simulate a human. Look at how long it takes a human to simulate a computer. This bound is really rather useless, compared to modern levels of compute. However, it might give us some rough idea how bad overhead can be. Suppose we thought "Compute needed to be at least as smart as a human" was uniformly distributed somewhere between "compute needed to simulate a human" and "compute a human can simulate".
Well actually, it depends on what intelligence test we give. Human brains have been optimised towards (human stuff) so it probably takes more compute to socialize to a human level than it takes to solve integrals to a human level.
Interesting but probably irrelevant note.
There are subtleties in even the very loose lower bound of a human simulating a CPU. Suppose there was some currently unknown magic algorithm. This algorithm can hypothetically solve all sorts of really tricky problems in a handful of CPU cycles. It is so fast that a human mentally simulating a CPU running this algorithm will still beat current humans on a lot of important problems. (Not problems humans can solve too quickly, because no algorithm can do much in <1 clock cycle.) If such a magic algorithm exists, then it's possible that even an AI running on a 1-operation-per-day computer could arguably be superhuman. Of course, I am somewhat doubtful that an algorithm that magic exists (I have no strong evidence of its nonexistence, only some weak evidence, namely that evolution didn't find it and we haven't found it yet). Either way, we are far into the realm of instant takeoff on any computer.
I followed a link on Twitter to a fun and informative 2015 blog post by Tim Dettmers:
The Brain vs Deep Learning Part I: Computational Complexity — Or Why the Singularity Is Nowhere Near
The headline conclusion is that it takes at least 10^21 FLOP/s to run the algorithms of a human brain, and therefore "it is unlikely that there will be a technological singularity in this century." I disagree with that, and this post explores why.
(Specifically, I disagree with "at least 10^21 FLOP/s". There's a separate step to go from "at least 10^21 FLOP/s" to "it is unlikely that there will be a technological singularity in this century"; this step is related to Moore's law, bandwidth requirements for parallelization, etc. Tim's blog post has extensive discussion of this second step, and I won't say anything about that here; I'd have to think about it more.)
(I'm writing this in 2021, six years later, but Tim has a comment on this very site that says he still stands by that post; in fact he now goes even further and says "I believe that AGI will be physically impossible with classical computers.")
I highly recommend the original post. Indeed, if I didn't like the post so much, I would not have bothered writing a response. :-)
Are brain algorithms computationally expensive to simulate?
Yes! Definitely! I think it's especially telling that nobody has applied the Dileep George brain-inspired image-processing model to ImageNet, sticking to much smaller images with far fewer categories of objects (MNIST, CAPTCHAs etc.).
Likewise, this Randall O'Reilly paper has a fascinating computational exploration of (in my opinion) different and complementary aspects of the human visual system. That paper tests its theories on a set of ≈1000 256×256-pixel, 8-frame movies from 100 categories—compare to ImageNet's 14 million images from 20,000 categories ... or compare it to the number of visual categories that you can recognize. Training the model still took 512 InfiniBand-connected processor nodes running for ≈24 hours on their campus supercomputer (source: personal communication). The real human vision system is dramatically larger and more complicated than this model, and the whole brain is larger and more complicated still!
But, when I say "computationally expensive to simulate" above, I mean it in, like, normal-person-in-2021 standards of what's computationally expensive to simulate. A very different question is whether the brain is "computationally expensive to simulate" by the standards of GPT-3, the standards of big tech data centers, the standards of "what will be feasible in 2030 or 2040 or 2050", and things like that. There, I don't have a strong opinion. I consider it an open question.
Note also that the two brain-inspired image-recognition examples just above are pushing innovative algorithms, and therefore are presumably handicapped by things like
So anyway, the fact that a couple of today's "most brain-like algorithms" (as judged by me) seem to be computationally expensive to scale up is not much evidence one way or the other for whether brain-like AGI algorithms would be "computationally expensive" with industrial-scale investment in the long-term or even short-term. Again, I consider it an open question.
Tim's blog post argues that it is not an open question: his estimate is 10^21 FLOP/s to run the algorithms of a human brain, which (he says) puts it out of reach for the century, and maybe (as in his recent comment) simply beyond what you can do with a classical computer. And he says that's an underestimate!
This is quite a bit more skeptical than Joseph Carlsmith's recent OpenPhil report "How Much Computational Power Does It Take to Match the Human Brain?". That offers many estimation methods which come in at 10^12 to 10^18 FLOP/s, with 10^21 being an extreme upper end.
What accounts for the discrepancy?
Where does Tim's estimate of 10^21 FLOP/s come from?
(Be warned that it's very possible I'm misunderstanding something, and that I have zero experience simulating neurons. I've simulated lots of other things, and I've read about simulating neurons, but that's different from actually making a neuron simulation with my own hands.)
Let's just jump to the headline calculation:
10^21 ≈ 8.6e10 × 200 × (10,000 × 5) × (5 × 50 × 5).
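As a quick sanity check on the arithmetic, the product can be multiplied out directly. The variable names below are placeholders chosen for this sketch, not labels from Tim's post; only the numbers come from the headline calculation.

```python
import math

# Factors copied from the headline calculation; names are illustrative only.
first_factor = 8.6e10
second_factor = 200
third_group = 10_000 * 5        # the (10,000 x 5) group
fourth_group = 5 * 50 * 5       # the (5 x 50 x 5) group

total_flops = first_factor * second_factor * third_group * fourth_group
exponent = math.log10(total_flops)
```

The product is about 1.075e21, so "10^21" is the result rounded to the nearest power of ten.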
Let's go through the terms one by one.
So all in all, the implicit story behind multiplying these numbers together is:
Take each neuron A in each timestep B. Then take each synapse C on that neuron, and take each dendritic branch D on that neuron. Take one of the five most recent timesteps E for the synapse, and another one of the five most recent timesteps F for the dendritic branch. Now do at least one floating-point operation involving these particular ingredients, and repeat for all possible combinations.
I say "no way". That just can't be right, can it?
Let's start with the idea of multiplying the number of synapses by the number of branches. So take a random synapse (synapse #49) and independently take a random branch of a random dendrite (branch #12). Most of the time the synapse is not on that branch, and indeed not even on that dendrite! Why would we need to do a calculation specifically involving those two things?
If any influence can spread from a synapse way over here to a branch way over there, I think it would be the kind of thing that can be dealt with in a hierarchical calculation. Like, from the perspective of dendrite #6, you don't need to know the fine-grained details of what's happening in each individual synapse on dendrite #2; all you need to know is some aggregated measure of what's going on at dendrite #2, e.g. whether it's spiking, what mix of chemicals it's dumping out into the soma, or whatever.
So I want to say that the calculation is not O(number of synapses × number of branches), but rather O(number of synapses) + O(number of branches). You do calculations for each synapse, then you do calculations for each branch (or each segment of each branch) that gradually aggregate the effects of those synapses over larger scales. Or something like that.
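A minimal sketch of what such a two-level aggregation could look like, with operation counts. Everything here is an assumption made for illustration: the 250 branches come from reading 5 × 50 in the headline calculation as a branch count, the synapses are spread evenly across branches, and plain summation stands in for whatever a real dendrite computes.

```python
import random

def hierarchical_step(synapse_inputs_by_branch):
    """Aggregate synapses per branch, then branches at the soma; count operations."""
    ops = 0
    branch_summaries = []
    for inputs in synapse_inputs_by_branch:
        total = 0.0
        for value in inputs:            # one operation per synapse
            total += value
            ops += 1
        branch_summaries.append(total)
    soma = 0.0
    for summary in branch_summaries:    # one operation per branch
        soma += summary
        ops += 1
    return soma, ops

random.seed(0)
n_branches = 250             # assumed: 5 x 50 from the headline calculation
synapses_per_branch = 40     # assumed: 10,000 synapses spread evenly
neuron = [[random.random() for _ in range(synapses_per_branch)]
          for _ in range(n_branches)]

soma_value, hierarchical_ops = hierarchical_step(neuron)
n_synapses = n_branches * synapses_per_branch
pairwise_ops = n_synapses * n_branches   # one op per (synapse, branch) pair
```

Under these toy assumptions the hierarchical pass costs 10,250 operations per neuron per timestep, against 2,500,000 for the all-pairs version, a gap of roughly 240×.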
Next, the time model. I disagree with this too. Again, Tim is budgeting 5×5=25 operations per timestep to deal with time-history. The idea is that at timestep N, you're doing a calculation involving "the state of synapse #18 in timestep (N-3) and of branch #59 in timestep (N-1)", and a different calculation for (N-1) and (N-4), and yet another for (N-2) and (N), etc. etc. I don't think that's how it would work. Instead I imagine that you would track a bunch of state variables for the neuron, and update the state each timestep. Then your timestep calculation would input the previous state and what's happening now, and would output the new state. So I think it should be a factor of order 1 to account for effects that are prolonged in time. Admittedly, you could say that the number "25" is arguably "a factor of order 1", but whatever. :-P
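To make the state-variable idea concrete, here is a sketch using an exponentially decaying trace, which is an assumption chosen for simplicity rather than a model of any real synapse. The same quantity can be computed by revisiting the whole input history or by carrying one number forward and updating it once per timestep.

```python
def trace_from_history(inputs, decay):
    """Recompute the trace at the last timestep by revisiting every past input."""
    last = len(inputs) - 1
    return sum(decay ** (last - t) * x for t, x in enumerate(inputs))

def trace_from_state(inputs, decay):
    """Carry one state variable forward: one multiply-add per timestep."""
    state = 0.0
    for x in inputs:
        state = decay * state + x
    return state

spikes = [0, 1, 0, 0, 1, 1, 0, 1]   # hypothetical spike train
decay = 0.8                          # hypothetical decay per timestep

history_value = trace_from_history(spikes, decay)
state_value = trace_from_state(spikes, decay)
```

The two values agree up to rounding, but the second version never looks back at earlier timesteps, which is the sense in which prolonged-in-time effects can cost a factor of order 1.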
Oh, also, in a typical timestep, most synapses haven't fired for the previous hundreds of milliseconds, so you get another order of magnitude or so reduction in computational cost from sparsity.
So put all that together, and now my back-of-the-envelope is like 50,000× lower than Tim's.
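The post does not spell out the arithmetic behind that figure, so the following is only one rough way the three reductions could combine. It assumes 10,000 synapses and 250 branches per neuron (reading 5 × 50 in the headline calculation as a branch count), treats the time-history factor of 25 as dropping all the way to 1, and takes "an order of magnitude" of sparsity as exactly 10.

```python
n_synapses = 10_000
n_branches = 250          # assumed reading of 5 x 50

# Product replaced by a sum: O(synapses x branches) -> O(synapses) + O(branches).
structure_saving = (n_synapses * n_branches) / (n_synapses + n_branches)

time_saving = 5 * 5       # 25 history combinations replaced by one state update
sparsity_saving = 10      # assumed value for "an order of magnitude or so"

combined_saving = structure_saving * time_saving * sparsity_saving
```

Under these assumptions the combined factor is about 61,000, which is the same ballpark as 50,000×; different choices for the order-1 and sparsity factors would shift it in either direction.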
(By the way, please don't divide 10^21 FLOP/s by 50,000 and call it "Steve's estimate of the computational cost of brain simulations". This is a negative case against the 10^21 number, not a positive case for any model in particular. If you want my opinion, I don't have one right now, as I said above. In the meantime I defer to the OpenPhil report.)
(Parts of this section are copying points made in the comment section of Tim's blog.)
(Also, my favorite paper proposing an algorithmic purpose of dendritic spikes in cortical pyramidal neurons basically proposes that it functions as an awfully simple set of ANDs and ORs, more or less. I don't read too much into that—I think the dendritic spikes are doing other computations too, which might or might not be more complicated. But I find that example suggestive.)
What about dynamic gene expression, axonal computations, subthreshold learning, etc.?
To be clear, Tim posited that the 10^21 FLOP/s was an underestimate, because there were lots of other complications neglected by this model. Here's a quote from his post:
My main response is a post I wrote earlier: Building brain-inspired AGI is infinitely easier than understanding the brain. To elaborate and summarize a bit:
I don't pretend that this is a rigorous argument; it's intuitions knocking against each other. I'm open to discussion. :-)