Thoughts on hardware / compute requirements for AGI

[-]jacob_cannell3y1316

He writes that the human brain has “1e13-1e15 spikes through synapses per second (1e14-1e15 synapses × 0.1-1 spikes per second)”. I think Joe was being overly conservative, and I feel comfortable editing this to “1e13-1e14 spikes through synapses per second”, for reasons in this footnote [9].

I agree that 1e14 synaptic spikes/second is the better median estimate, but those are highly sparse ops.

So when you say:

So I feel like 1e14 FLOP/s is a very conservative upper bound on compute requirements for AGI. And conveniently for my narrative, that number is about the same as the 8.3e13 FLOP/s that one can perform on the RTX 4090 retail gaming GPU that I mentioned in the intro.

You are missing some foundational differences in how von Neumann architecture machines (GPUs) run neural circuits vs. how neuromorphic hardware (like the brain) runs them.

The 4090 can hit around 1e14 FLOP/s, even up to 1e15, but only for dense matrix multiplication. The compute required to run a brain model on that dense matrix hardware is more like 1e17 FLOP/s, not 1e14 FLOP/s. The 1e14 synapses are at least 10x locally sparse in the cortex, so dense emulation requires 1e15 synapses (mostly zeroes) running at 100 Hz. The cerebellum is actually even more expensive to simulate, because of the more extreme connection sparsity there.

But that isn't the only performance issue. The GPU only hits that dense throughput on matrix-matrix multiplication, not on the more general vector-matrix multiplication. So in that sense the dense FLOP performance is useless: performance would instead be limited by RAM bandwidth, and running a single 1e14-synapse model would require 100 4090s, since it needs about 1 byte of bandwidth per FLOP, so 1e14 bytes/s vs. the 4090's 1e12 bytes/s.
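To make the two estimates above easy to check, here is a minimal arithmetic sketch. Every constant is one of the comment's own rough figures (the constant and function names are mine, chosen for illustration), so this only verifies the multiplication, not the underlying assumptions.

```python
# Back-of-envelope check of the estimates in the comment above.
# Every constant is the comment's own rough figure, not a measurement.

SYNAPSES = 1e14               # synapse count used in the comment
LOCAL_SPARSITY = 10           # "at least 10x locally sparse"
DENSE_RATE_HZ = 100           # update rate assumed for dense emulation
SPARSE_OPS_PER_SECOND = 1e14  # synaptic ops/s for the sparse model
BYTES_PER_OP = 1              # about one byte of memory traffic per op
GPU_BANDWIDTH = 1e12          # rough RTX 4090 memory bandwidth, bytes/s


def dense_flops_per_second(synapses, sparsity, rate_hz):
    """FLOP/s when every (mostly zero) dense weight is touched each step."""
    return synapses * sparsity * rate_hz


def gpus_needed_for_bandwidth(ops_per_second, bytes_per_op, gpu_bandwidth):
    """GPU count when memory bandwidth, not arithmetic, is the bottleneck."""
    return ops_per_second * bytes_per_op / gpu_bandwidth


dense = dense_flops_per_second(SYNAPSES, LOCAL_SPARSITY, DENSE_RATE_HZ)
gpus = gpus_needed_for_bandwidth(SPARSE_OPS_PER_SECOND, BYTES_PER_OP, GPU_BANDWIDTH)
print(f"dense emulation: {dense:.1e} FLOP/s")
print(f"bandwidth-limited GPU count: {gpus:.0f}")
```

Changing `LOCAL_SPARSITY` or `BYTES_PER_OP` shows how sensitive both conclusions are to those two guesses.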

Your reply seems to be "but the brain isn't storing 1e14 bytes of information", but as other comments point out that has little to do with the neural circuit size.

The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression.

This is a general computational principle, with many specific examples: compressed neural frequency encodings of 3D scenes (NeRFs), which access all network parameters to decode a single point at O(N) cost, are enormously less computationally efficient (runtime throughput, latency, etc.) than maximally sparse representations (using trees, hashtables, etc.), which approach O(log(N)) or O(C), but the sparse representations are enormously less compressed/compact. These tradeoffs are foundational and unavoidable.
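A toy sketch of that tradeoff, under loud assumptions: this is not a NeRF, just a 1D function stored either as a handful of frequency coefficients (compact, but every coefficient is read per query) or as a precomputed grid (many more stored numbers, one read per query). All names here are hypothetical.

```python
import math


def make_coeffs(n):
    """A compact representation: n sine-series coefficients."""
    return [1.0 / (k + 1) for k in range(n)]


def eval_compact(coeffs, x):
    """O(N) per query: every stored coefficient is read to decode one point."""
    return sum(c * math.sin((k + 1) * x) for k, c in enumerate(coeffs))


def grid_step(num_samples):
    """Spacing of a uniform grid over one period [0, 2*pi)."""
    return 2 * math.pi / num_samples


def build_table(coeffs, num_samples):
    """A sparse-access representation: precomputed values on a grid."""
    step = grid_step(num_samples)
    return [eval_compact(coeffs, i * step) for i in range(num_samples)]


def eval_table(table, x):
    """O(1) per query: a single read at the nearest grid point."""
    i = round(x / grid_step(len(table))) % len(table)
    return table[i]


coeffs = make_coeffs(8)
table = build_table(coeffs, 1024)
x = 100 * grid_step(1024)  # a query that lands on a grid point
print(eval_compact(coeffs, x), eval_table(table, x))
print(len(coeffs), "stored numbers vs", len(table))
```

The table answers the same query with one memory read but stores 128 times as many numbers, which is the direction of the tradeoff described above.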

We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it's now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).

[-]Steven Byrnes3y3-2

Thanks!

We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it's now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).

I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include Randy O’Reilly, Josh Tenenbaum, Jeff Hawkins, Dileep George, these people, maybe some of the Friston / FEP people, probably most of the “evolved modularity” people like Steven Pinker, and I think Kurzweil (he thought the cortex was built around hierarchical hidden Markov models, last I heard, which I don’t think are equivalent to ANNs?). And me! You’re welcome to argue that you’re right and we’re wrong (and most of that list are certainly wrong, insofar as they’re also disagreeing with each other!), but it’s not “uncontroversial”, right?

The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression.

In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.

Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it. And I would disagree about the scale being the secret sauce. But we might not be able to resolve that—guess we’ll see what happens! See also footnote 16 and surrounding discussion.

[-]jacob_cannell3y41

I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include

Uncontroversial was perhaps a bit tongue-in-cheek, but that claim is specifically about a narrow correspondence between LLMs and linguistic cortex, not about LLMs and the entire brain or the entire cortex.

And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless - because the task itself determines the solution.

Examples from recent neurosci literature:

From "Brains and algorithms partially converge in natural language processing":

Deep learning algorithms trained to predict masked words from large amount of text have recently been shown to generate activations similar to those of the human brain. However, what drives this similarity remains currently unknown. Here, we systematically compare a variety of deep language models to identify the computational principles that lead them to generate brain-like representations of sentences

From "The neural architecture of language: Integrative modeling converges on predictive processing":

Here, we report a first step toward addressing this gap by connecting recent artificial neural networks from machine learning to human recordings during language processing. We find that the most powerful models predict neural and behavioral responses across different datasets up to noise levels.

From "Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain"

We found a striking correspondence between the layer-by-layer sequence of embeddings from GPT2-XL and the temporal sequence of neural activity in language areas. In addition, we found evidence for the gradual accumulation of recurrent information along the linguistic processing hierarchy. However, we also noticed additional neural processes that took place in the brain, but not in DLMs, during the processing of surprising (unpredictable) words. These findings point to a connection between language processing in humans and DLMs where the layer-by-layer accumulation of contextual information in DLM embeddings matches the temporal dynamics of neural activity in high-order language areas.

Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it.

Scaling up GPT-3 by itself is like scaling up linguistic cortex by itself, and doesn't lead to AGI any more/less than that would (pretty straightforward consequence of the LLM <-> linguistic_cortex (mostly) functional equivalence).

In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.

The comparison should be between GPT-3 and linguistic cortex, not the whole brain. For inference, the linguistic cortex uses many orders of magnitude less energy to perform the same task. For training, it uses many orders of magnitude less energy to reach the same capability, and several OOM less data. In terms of FLOP-equivalent, it's perhaps 1e22 sparse FLOP for training linguistic cortex (1e13 FLOP/s * 1e9 seconds) vs. 3e23 FLOP for training GPT-3. So fairly close, but the brain is probably trading some compute efficiency for data efficiency.
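The training-compute comparison above is a single multiplication and a ratio. A minimal sketch, using only the figures quoted in the comment (the variable names are mine):

```python
# Figures are the comment's own estimates, not measurements.
CORTEX_FLOP_PER_SECOND = 1e13  # sparse FLOP/s attributed to linguistic cortex
TRAINING_SECONDS = 1e9         # roughly 30 years of wall-clock time
GPT3_TRAINING_FLOP = 3e23      # training compute quoted for GPT-3

cortex_training_flop = CORTEX_FLOP_PER_SECOND * TRAINING_SECONDS
ratio = GPT3_TRAINING_FLOP / cortex_training_flop

print(f"linguistic cortex: {cortex_training_flop:.0e} FLOP")
print(f"GPT-3 / cortex ratio: {ratio:.0f}x")
```

On these numbers the gap is about 1.5 orders of magnitude, which is what "fairly close" refers to here.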

[-]Steven Byrnes3y31

The comparison should be between GPT-3 and linguistic-cortex

For the record, I did account for language-related cortical areas being ≈10× smaller than the whole cortex, in my Section 3.3.1 comparison. I was guessing that a double-pass through GPT-3 involves 10× fewer FLOP than running language-related cortical areas for 0.3 seconds, and those two things strike me as accomplishing a vaguely comparable amount of useful thinking stuff, I figure.

So if the hypothesis is “those cortical areas require a massive number of synapses because that’s how the brain reduces the number of FLOP involved in querying the model”, then I find that hypothesis hard to believe. You would have to say that the brain’s model is inherently much much more complicated than GPT-3, such that even after putting it in this heavy-on-synapses-lite-on-FLOP format, it still takes much more FLOP to query the brain’s language model than to query GPT-3. And I don’t think that. (Although I suppose this is an area where reasonable people can disagree.)

For inference the linguistic cortex uses many orders of magnitude less energy to perform the same task.

I don’t think energy use is important. For example, if a silicon chip takes 1000× more energy to do the same calculations as a brain, nobody would care. Indeed, I think they’d barely even notice: the electricity costs would still be much less than my local minimum wage. (20 W × 1000 × 10¢/kWh = $2/hr. Maybe a bit more after HVAC and so on.)
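The parenthetical arithmetic can be spelled out as follows. This is a minimal sketch with a hypothetical helper name; the inputs are the 20 W, 1000×, and 10¢/kWh figures from the paragraph above.

```python
def electricity_cost_per_hour(brain_watts, energy_multiplier, usd_per_kwh):
    """Hourly electricity cost of a chip drawing `energy_multiplier` times brain power."""
    kilowatts = brain_watts * energy_multiplier / 1000.0
    # Running for one hour at `kilowatts` consumes that many kWh.
    return kilowatts * usd_per_kwh


cost = electricity_cost_per_hour(brain_watts=20, energy_multiplier=1000, usd_per_kwh=0.10)
print(f"${cost:.2f} per hour")
```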

I’ve noticed that you bring up energy consumption with some regularity, so I guess you must think that energy efficiency is very important, but I don’t understand why you think that.

For training it uses many orders of magnitude less energy to reach the same capability, and several OOM less data…So fairly close, but the brain is probably trading some compute efficiency for data efficiency.

Other than Section 4, this post was about using an AGI, not training it from scratch. If your argument is “data efficiency is an important part of the secret sauce of human intelligence, not just in training-from-scratch but also in online learning, and the brain is much better at that than GPT-3, and we can’t directly see that because GPT-3 doesn’t have online learning in the first place, and the reason that the brain is much better at that is because it has this super-duper-over-parametrized model”, then OK that’s a coherent argument, even if I happen to think it’s mostly wrong. (Is that your argument?)

And this claim should now be uncontroversial. The neuroscience experiments have been done, and linguistic cortex computes something similar to what LLMs compute, and almost certainly uses a similar predictive training objective. It obviously implements those computations in a completely different way on very different hardware, but they are mostly the same computations nonetheless - because the task itself determines the solution.

Suppose (for the sake of argument—I don’t actually believe this) that human visual cortex is literally a ConvNet. And suppose that human scientists have never had the idea of ConvNets, but they have invented fully-connected feedforward neural nets. So they set about testing the hypothesis that “visual cortex is a fully-connected feedforward neural net”. I suspect that they would find a lot of evidence that apparently confirms this hypothesis, of the same sorts that you describe. For example, similar features would be learned in similar layers. There would be some puzzling discrepancies—especially sample efficiency, and probably also the handling of weird out-of-distribution inputs—but lots of experiments would miss those. So then (in this hypothetical universe) many people would be trumpeting the conclusion: “visual cortex is a fully-connected feedforward DNN”! But they would be wrong! And the careful neuroscientists—the ones who are scrutinizing brain structures, and/or doing experiments more sophisticated than correlating activities in unmanipulated naturalistic data, etc.—would be well aware of that.

There’s a skeptical discussion specifically about LLMs vs. brains, with some references, in the first part of Section 5 of Bowers et al.

[-]TekhneMakre3y32

I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like "we found brain areas that light up when the person reads 'cat', just like how this part of the neural net lights up when given input 'cat'" and less like "the LLM is useful for other tasks in the same way as the neural version is useful for other tasks". Am I confused about what the paper says, and if so, how? What sort of claim are you making?

[-]hold_my_fish3y54

I figure, at least 10%ish of the cortex is probably mainly storing information which one could also find in a 2022-era large language model (LLM).

This seems to me to be essentially assuming the conclusion. The assumption here is that a 2022 LLM already stores all the information necessary for human-level language ability and that no capacity is needed beyond that. But "how much capacity is required to match human-level ability" is the hardest part of the question.

(The "no capacity is needed beyond that" part is tricky too. I take AI_WAIFU's core point to be that having excess capacity is helpful for algorithmic reasons, even though it's beyond what's strictly necessary to store the information if you were to compress it. But those algorithmic reasons, or similar ones, might apply to AI as well.)

I might as well link my own attempt at this estimate. It's not estimating the same thing (since I'm estimating capacity and you're estimating stored information), so the numbers aren't necessarily in disagreement. My intuition though is that capacity is quite important algorithmically, so it's the more relevant number.

(Edit: Among the sources of that intuition is Neural Tangent Kernel theory, which studies a particular infinite-capacity limit.)

[-]Steven Byrnes3y40

If you think human brains are storing hundreds or thousands of GB or more of information about (themselves / the world / something), do you have any thoughts on what that information is? Like, can you give (stylized) examples? (See also my footnote 13.)

Also, see my footnote 16 and surrounding discussion; maybe that’s a crux?

[-]hold_my_fish3y40

That's an interesting question. I don't have an opinion about how much information is stored. Having a lot of capacity appears to be important, but whether that's because it's necessary to store information or for some other reason, I don't know.

It got me thinking, though: the purpose of our brain is to guide our behavior, not to remember our training data. (Whether we can remember our training data seems unclear. Apparently the existence of photographic memory is disputed, but there are people with extraordinarily good memories, even if not photographic.)

It could be that the preprocessing necessary to guide our future behavior unavoidably increases the amount of stored data by a large factor. (There are all sorts of examples of this sort of design pattern in classic computer science algorithms, so it wouldn't be particularly surprising.) If that's the case, I have no idea how to measure how much of it there is.

[-]Gunnar_Zarncke3y52

Just a data point that supports hold_my_fish's argument: the savant Kim Peek likely memorized gigabytes of information and could access it quite reliably:

https://personal.utdallas.edu/~otoole/CGS_CV/R13_savant.pdf

[-]Steven Byrnes3y20

Ooh interesting! Can you say how you're figuring that it's "gigabytes of information?"

[-]Steven Byrnes3y51

You say “Having a lot of capacity appears to be important” but that’s “essentially assuming the conclusion”, right? hehe :)

You claim that there’s a lot of capacity, but I say we don’t really know that. As a stupid example, if my computer’s SRAM has N cells, but it uses an error-correcting code by redundantly storing each bit in three different cells, then its “capacity” is ⅓ the number of cells. (And in 6T SRAM, the number of cells is in turn ⅙ the number of transistors, etc.)
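The SRAM example above, as a minimal sketch. The function name and the sample transistor count are made up for illustration; the factors of 6 and 3 are the ones in the paragraph.

```python
def usable_bits(transistors, transistors_per_cell=6, copies_per_bit=3):
    """Information capacity when each bit is stored redundantly in several cells."""
    cells = transistors // transistors_per_cell
    return cells // copies_per_bit


transistors = 18_000  # arbitrary illustrative count
bits = usable_bits(transistors)
print(f"{transistors} transistors -> {bits} usable bits")
print(f"counting transistors overestimates capacity by {transistors // bits}x")
```

The analogous worry for the brain is that counting synapses may play the role of counting transistors here.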

Anyway, all things considered right now, the most plausible-to-me theory is that counting synapses gives a 2-3 OOM overestimate of capacity. I don’t see this as particularly implausible. For one thing, as I wrote in the OP, the synapse is not just an information-storage-unit, it’s also a thing-that-does-calculations. If one bit of stored information (e.g. information about how the world works) needs to be involved in 1000 different calculations, it seems plausible that it would need to be associated with 1000 synapses. For another thing, here’s a model where one functional “connection” requires a group of 10 nearby synapses onto the same dendrite. That’s 1 OOM right there! I think there’s another OOM or two lurking in the fact that each cortical minicolumn is 100 neurons and each cortical column is 100 minicolumns, but there’s some sense in which minicolumns (and to a lesser degree, columns) are a single functional unit. So, without getting into details, which I’m hazy on anyway, I wouldn’t be surprised to learn that “one connection” involved not only 10 nearby synapses on one dendrite, but a similar group of 10 synapses onto a neuron within each of 10 neighboring minicolumns, and those 10 minicolumns are working together to implement a certain kind of computation, which by the way you could trivially do on a GPU in a few clock cycles. Or whatever, I dunno.

Or maybe you’re saying “Having a lot of capacity appears to be important” because humans can do things that modern ML can’t, and we need to explain that somehow, and capacity seems like an obvious candidate? If so, I disagree, I think there are other more-plausible candidates, again see footnote 16 and surrounding discussion.

It could be that the preprocessing necessary to guide our future behavior unavoidably increases the amount of stored data by a large factor.

You mean, cached computations or something? I’m not sure what you have in mind. Everything I can think of has some analogy in things-that-LLMs-can-do, or other types of sub-GB ML systems. LLMs do in fact have “behavior” of a sort, in the sense that they output text, and (implicitly) plan out multiple tokens in advance.

[-]hold_my_fish3y30

I haven't fully digested this comment, but:

You mean, cached computations or something?

In some sense there's probably no option other than that, since creating a synapse should count as a computational operation. But there'd be different options for what the computations would be.

The simplest might just be storing pairwise relationships. That's going to add size, even if sparse.

I agree that LLMs do that too, but I'm skeptical about claims that LLMs are near human ability. It's not that I'm confident that they aren't--it just seems hard to say. (I do think they now have surface-level language ability similar to humans, but they still struggle at deeper understanding, and I don't know how much improvement is needed to fix that weakness.)

[-][anonymous]3y41

Hi Steven.

A simple stylized example: imagine you have some algorithm for processing each cluster of inputs from the retina.

You might think that because that algorithm is symmetric* - you want to run the same algorithm regardless of which cluster it is - you only need one copy of the bytecode that represents the compiled copy of the algorithm.

This is not the case. Information wise, sure. There is only one program that takes n bytes of information. You can save disk space for holding your model.

RAM/cache consumption : each of the parallel processing units you have to use (you will not get realtime results for images if you try to do it serially) must have another copy of the algorithm.

And this rule applies throughout the human body : every nerve cluster, audio processing, etc.

This also greatly expands the memory required over your 24 GB 4090 example. For one thing, the human brain is very sparse, and while NVIDIA has managed to improve sparse network performance, it still requires memory to represent all the sparse values.

I might note that you could have tried to fill in the "cartoon switch" for human synapses. They are likely a MAC for each incoming axon at no better than 8 bits of precision added to an accumulator for the cell membrane at the synapse that has no better than 16 bits of precision. (it's probably less but the digital version has to use 16 bits)

So add up the number of synapses in the human brain, assume 1 kHz, and that's how many TOPS you need.

Let me do the math for you real quick:

68 billion neurons, about 1000 connections each, at 1 kHz (it's very sparse). So we need 68 billion x 1000 x 1000 = 6.8e16 ops/s = 68,000 TOPS.

Current gen data center GPU: https://www.nvidia.com/en-us/data-center/h100/

So we would need 17 of them to hit 1 human brain. We can assume you will never get maximum performance (especially due to the very high sparsity), so maybe 2-3 nodes with 16 cards each?

Note they would have a total of 3840 gigabytes of memory.

Since we have 68 billion x 1000 x (1 byte) = 68 terabytes of weights in a brain, that's the problem. We only have 5% as much memory as we need.

This is the reason for neuromorphic compute based on SSDs: compute's not the bottleneck, memory is.

We can get brain scale perf with 20 times as much hardware, or 20*3*16 = 960 A100s. They are 25k each so 24 million for the GPUs, plus all the motherboards and processors and rack space. Maybe 50 million?

That's a drop in the bucket and easily affordable by current AI companies.
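The chain of estimates above can be reproduced in a few lines. This is a sketch under stated assumptions: the per-card throughput (4000 TOPS) and memory (80 GB) are not stated in the comment, and are back-solved here from its figures of 17 cards and 3840 GB across 3 × 16 cards. Everything else is the comment's own numbers.

```python
NEURONS = 68_000_000_000    # neuron count used in the comment
SYNAPSES_PER_NEURON = 1000
RATE_HZ = 1000              # 1 kHz
BYTES_PER_WEIGHT = 1        # 8-bit weights
GPU_TOPS = 4000             # assumption: implied by "17 of them"
GPU_MEMORY_GB = 80          # assumption: implied by 3840 GB over 48 cards
GPU_PRICE_USD = 25_000      # price quoted in the comment

ops_per_second = NEURONS * SYNAPSES_PER_NEURON * RATE_HZ
tops_needed = ops_per_second // 10**12
gpus_for_compute = tops_needed / GPU_TOPS

weight_terabytes = NEURONS * SYNAPSES_PER_NEURON * BYTES_PER_WEIGHT / 1e12
cluster_gpus = 3 * 16       # "2-3 nodes with 16 cards each", upper end
cluster_memory_gb = cluster_gpus * GPU_MEMORY_GB
memory_fraction = cluster_memory_gb / (weight_terabytes * 1000)

scaled_gpus = 20 * cluster_gpus  # 20x more hardware to hold the weights
gpu_cost_usd = scaled_gpus * GPU_PRICE_USD

print(f"{tops_needed} TOPS, {gpus_for_compute:.0f} cards for compute alone")
print(f"{weight_terabytes:.0f} TB of weights, cluster holds {memory_fraction:.1%}")
print(f"{scaled_gpus} cards, ${gpu_cost_usd:,} for the GPUs")
```

Note that the memory requirement, not the operation count, drives the final card count, which is the point of the comment.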

Epistemic notes: I'm a computer engineer (CS masters/ml) and I work on inference accelerators.

* It's not symmetric: the retinal density varies by position in the eye.

[-]Steven Byrnes3y30

Thanks for your comment! I am not a GPU expert, if you didn’t notice. :)

I might note that you could have tried to fill in the "cartoon switch" for human synapses. They are likely a MAC for each incoming axon…

This is the part I disagree with. For example, in the OP I cited this paper which has no MAC operations, just AND & OR. More importantly, you’re implicitly assuming that whatever neocortical neurons are doing, the best way to do that same thing on a chip is to have a superficial 1-to-1 mapping between neurons-in-the-brain and virtual-neurons-on-the-chip. I find that unlikely. Back to that paper just above, things happening in the brain are (supposedly) encoded as random sparse subsets of active neurons drawn from a giant pool of neurons. We could do that on the chip, if we wanted to, but we don’t have to! We could assign them serial numbers instead! We can do whatever we want! Also, cortical neurons are arranged into six layers vertically, and in the other direction, 100 neurons are tied into a closely-interconnected cortical minicolumn, and 100 minicolumns in turn form a cortical column. There’s a lot of structure there! Nobody really knows, but my best guess from what I’ve seen is that a future programmer might have one functional unit in the learning algorithm called a “minicolumn” and it’s doing, umm, whatever it is that minicolumns do, but we don’t need to implement that minicolumn in our code by building it out of 100 different interconnected virtual neurons. Yes the brain builds it that way, but the brain has lots of constraints that we won’t have when we’re writing our own code—for example, a GPU instruction set can do way more things than biological neurons can (partly because biological neurons are so insanely slow that any operation that requires more than a couple serial steps is a nonstarter).

[-][anonymous]3y33

Please read a neuroscience book, even an introductory one, on how a synapse works. Just 1 chapter, even.

There's a MAC in there. It's because the incoming action potential hits the synapse, and sends a certain quantity of neurotransmitters across a gap. The sender cell can vary how much neurotransmitter it sends, and the receiving cell can vary how many active receptors it has. The type of neurotransmitter determines the gain and sign. (this is like the exponent and sign bit for 8 bit BFloat)

These 2 variables can be combined to a single coefficient, you can think of it as "voltage delta" (it can be + or -)

So it's (1) * (voltage gain) = change in target cell voltage.

For ANN, it's <activation output> * <weight> = change in target node activation input.

The brain also uses timing to get more information than just "1", the exact time the pulse arrived matters to a certain amount of resolution. It is NOT infinite, for reasons I can explain if you want.

So the final equation is (1) * (synapse state) * (voltage gain) = change in target cell voltage.

Aka you have to multiply 2 numbers together and add, which is what "multiply-accumulate" units do.
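As a minimal sketch of the cartoon described above (and only of that cartoon, not of any established neuron model): each incoming spike contributes spike × synapse state × voltage gain to a running membrane value. The function names and example numbers are hypothetical.

```python
def synaptic_mac(membrane, spike, synapse_state, voltage_gain):
    """One multiply-accumulate: add this synapse's contribution to the membrane."""
    return membrane + spike * synapse_state * voltage_gain


def integrate(spikes, synapse_states, voltage_gains, membrane=0.0):
    """Accumulate contributions from all incoming synapses for one timestep."""
    for spike, state, gain in zip(spikes, synapse_states, voltage_gains):
        membrane = synaptic_mac(membrane, spike, state, gain)
    return membrane


# Three synapses: two receive a spike (1), one is silent (0).
# A negative gain stands in for an inhibitory synapse.
delta_v = integrate(
    spikes=[1, 1, 0],
    synapse_states=[0.5, 1.0, 1.0],
    voltage_gains=[2.0, -0.5, 3.0],
)
print(delta_v)
```

This has the same shape as the ANN case: activation times weight, summed into the target node's input.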

Because of all the horrible electrical noise in the brain, the biological forms of noise and contaminants, and other factors, I assumed only 8 bits of precision, i.e. 1 part in 256. Realistically that's probably generous; it's probably not even that good.

There are immense amounts of additional complexity in the brain, but almost none of it matters for determining inference outputs. The action potentials rush out of the synapse at kilometers per second - many biological processes just don't matter at all because of this. It's the same as how a transistor's internal behavior is irrelevant: it's a cartoon switch.

For training, sure, if we wanted a system to work like a brain we'd have to model some of this, but we don't. We can train using whatever algorithm measurably is optimal.

Similarly we never have to bother with a "minicolumn". We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.

We probably will find something way better than a minicolumn. Some argue that's what a transformer is.

[-]Steven Byrnes3y30

I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P

Similarly we never have to bother with a "minicolumn". We only care about what works best. Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.
We probably will find something way better than a minicolumn. Some argue that's what a transformer is.

I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI algorithms to do the same things that a brain can do, then it needs to do MAC operations in the same way that (you claim) brain synapses do, and it needs to have 68 TB of weight storage just as (you claim) the brain does. …But then here at the end you seem to do a 180° flip and talk about flapping wings and transformers and “We probably will find something way better”. OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that? (Maybe you have a “scale-is-all-you-need” perspective, and you note that we don’t have AGI yet, and therefore the explanation must be “insufficient scale”? Or something else?)

There's a MAC in there.

OK, imagine for the sake of argument that we live in the following world (a caricatured version of this model):

- Dendrites have lots of clusters of 10 nearby synapses.
- Iff all 10 synapses within one cluster get triggered simultaneously, then it triggers a dendritic spike on the downstream neuron.
- Different clusters on the same dendritic tree can each be treated independently.
  - As background, the whole dendrite doesn’t have a single voltage (let alone the whole dendritic tree). Dendrites have different voltages in different places. If there are multiple synaptic firings that are very close in both time and space, then the voltages can add up and get past the spike threshold; but if multiple synapses that are very far apart from each other fire simultaneously, they don’t add up, they each affect the voltage in their own little area, and it doesn’t create a dendritic spike.
- The upstream neurons are all firing on a regular clock cycle, such that the synapse firing is either “simultaneous” or “so far apart in time that we can treat each timestep independently”.

In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?

Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when in fact spatially-distant synapses can’t collaboratively create a spike.

If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
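The imaginary world above can be written down directly as an OR over clusters of an AND within each cluster. This is a sketch of the caricature only; the function name and the neuron numbering are made up for illustration.

```python
def dendritic_spike(clusters, active):
    """True iff at least one cluster has every one of its synapses active.

    clusters: iterable of sets of upstream neuron ids (one set per cluster)
    active:   set of upstream neuron ids firing on this clock cycle
    """
    return any(all(n in active for n in cluster) for cluster in clusters)


# Two clusters of 10 synapses each, fed by upstream neurons 0-9 and 10-19.
clusters = [frozenset(range(0, 10)), frozenset(range(10, 20))]

# One complete cluster plus a stray input elsewhere: spike.
fires = dendritic_spike(clusters, set(range(10, 20)) | {3})

# 18 active inputs, but 9 in each cluster: no spike.
# A single weighted sum with threshold 10 over all 20 inputs would fire here.
silent = dendritic_spike(clusters, set(range(0, 9)) | set(range(10, 19)))

print(fires, silent)
```

The second case is the "too simple" half of the argument: summing over spatially distant synapses gives the wrong answer in this model.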

And by the way, I think we could reformulate this same algorithm to have a very different low-level implementation (but the same input and output), by replacing “groups of neurons that form clusters together” with “serial numbers”. Then there would be no MACs and there would be no multi-synapse ANDs, but rather there would be various hash tables or something, I dunno. And the memory requirements would be different, as would the number of required operations, presumably.

At this point maybe you’re going to reply “OK but that’s an imaginary world, whereas I want to talk about the real world.” Certainly the bullet points above are erasing real-world complexities. But it’s very difficult to judge which real-world complexities are actually playing an important role in brain algorithms and which aren’t. For example, should we treat (certain classes of) cortical synapses as having binary strength rather than smoothly-varying strength? That’s a longstanding controversy! Do neurons really form discrete and completely-noninteracting clusters on dendrites? I doubt it…but maybe the brain would work better if they did!! What about all the other things going on in the cortex? That’s a hard question. There are definitely other things going on unrelated to this particular model, but it’s controversial exactly what they are.

[-]Steven Byrnes3y32

One of AI_WAIFU’s points was that the brain has some redundancy because synapses randomly fail to fire and neurons randomly die etc. That part wouldn’t be relevant to running the same algorithms on chips, presumably. Then the other thing they said was that over-parameterization helps with data efficiency. I presume that there’s some background theory behind that claim which I’m not immediately familiar with. But I mean, is it really plausible that the brain is overparameterizing by 3+ orders of magnitude? Seems pretty implausible to me, although I’m open to being convinced.

Also, the Neural Tangent Kernel corresponds to an infinite-capacity model, but people can do those calculations without using an infinitely large memory, right? People have found a way to reformulate the algorithm such that it involves doing different operations on a different representation which does not require ∞ memory. By the same token, if we’re talking about some network which is so overparametrized that it can be compressed by 99.9%, then I’m strongly inclined to guess that there’s some way to do the same calculations and updates directly on the compressed representation.

[-]hold_my_fish3y30

NTK training requires training time that scales quadratically with the number of training examples, so it's not usable for large training datasets (nor with data augmentation, since that simulates a larger dataset). (I'm not an NTK expert, but, from what I understand, this quadratic growth is not easy to get rid of.)
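A rough sketch of where the quadratic growth comes from, assuming the standard kernel-regression formulation in which every pair of training examples needs one kernel evaluation (so the kernel matrix has n × n entries). The float32 entry size and the example dataset sizes are assumptions chosen for illustration.

```python
def gram_matrix_bytes(n_examples: int, bytes_per_entry: int = 4) -> int:
    """Memory for an n x n kernel (Gram) matrix with one scalar per example pair."""
    return n_examples * n_examples * bytes_per_entry

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} examples -> {gram_matrix_bytes(n) / 1e9:,.1f} GB")
```

Doubling the number of examples quadruples the matrix, and data augmentation counts as more examples.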

[-]Matthew Barnett3y30

Some people seem to be hoping that nobody will ever make a misaligned human-level AGI thanks to some combination of regulation, monitoring, and enlightened self-interest. That story looks more plausible if we’re talking about an algorithm that can only run on a giant compute cluster containing thousands of high-end GPUs, and less plausible if we’re talking about an algorithm that can run on one 2023 gaming PC.

Isn't the relevant fact whether we could train an AGI with modest computational resources, not whether we could run one? If training runs are curtailed by regulation, then presumably the main effect is that AGI will be delayed until software and hardware progress permits the covert training of an AGI with modest computational resources, which could be a long time depending on how hard it is to evade the regulation.

[-]Steven Byrnes3y20

Hmm, maybe. I talk about training compute in Section 4 of this post (upshot: I’m confused…). See also Section 3.1 of this other post. If training is super-expensive, then run-compute would nevertheless be important if (1) we assume that the code / weights / whatever will get leaked in short order, (2) the motivations are changeable from "safe" to "unsafe" via fine-tuning or decompiling or online-learning or whatever. (I happen to strongly expect powerful AGI to necessarily use online learning, including online updating the RL value function which is related to motivations / goals. Hope I’m wrong! Not many people seem to agree with me on that.)

[-]Joshua Cogliati6mo20

This is an interesting post and I agree with it. Also, modeling a calculator at the transistor level is a useful analogy.

As for the discussion on "286s or 1995 home computers", these computers seem capable of containing in memory both the DNA description of a simple bacterium like Pelagibacter ubique (1,308,759 base pairs) as well as a simple model of all the atoms in the largest protein that Pelagibacter ubique contains. So this level of computer might be sufficient to bootstrap to a more powerful computer by designing the computational and energy gathering technology and then building new components. Conversely, it would be very hard to prove that a 1995 home computer is incapable of becoming a seed AGI. I have expanded my thoughts on this in a draft paper at https://www.researchgate.net/publication/388398902_Memory_and_FLOPS_Hardware_Limits_to_Prevent_AGI
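As a quick check on the genome part of that claim, here is the arithmetic under the assumption of a plain 2-bits-per-base encoding (four possible bases, no compression, no metadata). The base-pair count is the one quoted above; the encoding choice is an assumption.

```python
import math

BASE_PAIRS = 1_308_759   # Pelagibacter ubique, as quoted above
BITS_PER_BASE = 2        # assumption: A/C/G/T packed into 2 bits each

genome_bytes = math.ceil(BASE_PAIRS * BITS_PER_BASE / 8)
print(genome_bytes)                    # 327190
print(round(genome_bytes / 1024, 1))   # size in KiB
```

That is roughly a third of a megabyte. The protein-atom model is not sized here, since no atom count is given above.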

[-]Steven Byrnes8mo*20

Davidad responds with a brief argument for 1000 FLOP-equivalent per synapse-second (3 OOM more than my guess) on X as follows:

Ok, so assuming we agree on 1e14 synapses and 3e8 seconds, then where we disagree is on average FLOP(-equivalent) per synapse-second: you think it’s about 1, I think it’s about 1000. This is similar to the disagreement you flagged with Joe Carlsmith.
Note: at some point Joe interviewed me about this so there might be some double-counting of “independent” estimates here, but iirc he also interviewed many other neuroscientists.
My estimate would be a lot lower if we were just talking about “inference” rather than learning and memory. STDP seems to have complex temporal dynamics at the 10ms scale.
There also seem to be complex intracellular dynamics at play, possibly including regulatory networks, obviously regarding synaptic weight but also other tunable properties of individual compartments.
The standard arguments for the causal irrelevance of these to cognition (they’re too slow to affect the “forward pass”) don’t apply to learning. I’m estimating there’s like a 10-dimensional dynamical system in each compartment evolving at ~100Hz in importantly nonlinear ways.
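To make the size of that disagreement concrete, here is the multiplication implied by the quote, using the agreed 1e14 synapses and 3e8 seconds and the two per-synapse-second guesses (1 versus 1000). The function name is made up for this sketch, and integers are used to avoid floating-point rounding.

```python
SYNAPSES = 10**14        # agreed in the quote
SECONDS = 3 * 10**8      # agreed in the quote

def total_flop(flop_per_synapse_second: int) -> int:
    """Total FLOP-equivalent over the whole span for a given per-synapse rate."""
    return SYNAPSES * SECONDS * flop_per_synapse_second

low = total_flop(1)       # the "about 1" guess
high = total_flop(1000)   # the "about 1000" guess

print(f"{low:.0e}")   # 3e+22
print(f"{high:.0e}")  # 3e+25
print(high // low)    # 1000
```

Everything else being agreed, the whole 3 OOM gap comes from the per-synapse-second factor.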

[-]Vladimir_Nesov3y20

a human-level (more specifically, John von Neumann level) AGI

I think it's plausible that LLM simulacrum AGIs are initially below von Neumann level, and that there are no straightforward ways of quickly improving on that without risking additional misalignment. If so, the initial AGIs might coordinate to keep it this way for a significant amount of time through the singularity (like, nanotech industry-rebuilding comes earlier than this) for AI safety reasons, because making the less straightforward improvements leads to unnecessary unpredictability, and it takes a lot of subjective time at a level below von Neumann to ensure that this becomes a safe thing to do.

The concept of AGI should track whatever is sufficient to trigger/sustain a singularity by autonomously converting compute to research progress, and shouldn't require even modest and plausible superpowers such as matching John von Neumann that are not strictly necessary for that purpose.

[-]harfe3y31

Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that. A lot of existing people are close to von Neumann level.

Maybe your argument is that there will be so many AGIs, that they can do Nanotech industry rebuilding while individually being very dumb. But I would then argue that the collective already exceeds von Neumann or large groups of humans in intelligence.

[-]Vladimir_Nesov3y*10

The argument is that once there is an AGI at IQ 130-150 level (not "very dumb", but hardly von Neumann), that's sufficient to autonomously accelerate research using the fact that AGIs have much higher serial speed than humans. This can continue for a long enough time to access research from the very distant future, including nanotech for building much better AGI hardware at scale. There is no need for stronger intelligence in order to get there. The motivation for this to happen is the AI safety concern with allowing cognition that's more dangerous than necessary, and any non-straightforward improvements to how AGI thinks create such danger. For LLM-based AGIs, anchoring to the human level that's available in the training corpus seems more plausible than for other kinds of AGIs (so that improvement in capability would become less than absolutely straightforward specifically at human level). If AGIs have an opportunity to prevent this AI safety risk, they might be motivated to take that opportunity, which would result in an intentional, significant delay in further improvement of AGI capabilities.

Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that.

I'm not saying that this is an intuitively self-evident claim; there is a specific reason I'm giving for why I see it as plausible. Even when there is a technical capability to build giant AGIs the size of cities, there is still the necessary intermediate of motive in bridging the gap from capability to actuality.

^[1]

To be specific, let's say I'm interested in AGI hardware / compute requirements 3 years after it's widely known how to make an AGI that can do science as well as John von Neumann could. That’s assuming that the AGIs themselves are not appreciably helping (for some reason), or proportionally less than 3 years if the AGIs are helping, or proportionally more than 3 years if the knowledge is secret and thus only a few people are working on it.

^[2]

I’m assuming business-as-usual continues in the computing hardware R&D and production ecosystem between now and then, e.g. assuming that nobody bombs the Taiwan fabs.

^[3]

It’s entirely possible that one AGI could teleoperate more than one robot simultaneously, e.g. if it also builds a sub-AGI autopilot system that can handle basic movements and only “escalates” to the AGI occasionally when it faces a tricky decision. But if so, that just changes the number of required GPUs by a constant factor.

^[4]

Even if hardware progress stops forever right now, and even if I am also underestimating the computational cost of AGI by an order of magnitude, and even if practical intelligence maxes out at the level of John von Neumann for some reason, well you can already rent eight interconnected A100 GPUs for <$10/hr, and it’s pretty clear that an ability to create arbitrary numbers of John von Neumanns for $10/hour each would be radically transformative.

^[5]

E.g. Sam Altman 2023: “shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang”.

^[6]

I expect that most people reading this are already familiar with the case for banning gain-of-function research; see for example here.

^[7]

I’m guessing AlphaZero-chess is in the same ballpark as AlphaGo Zero, whose parameter count is guessed by random people on the internet here (“23 million”) and here (“46 million”). (OTOH, this says “original AlphaGo” has “about 5M” parameters, which seems like the odd one out? So I’m figuring that’s a mistake, unless AlphaGo Zero had 5-10× more parameters than the original AlphaGo?) If I cared more I would carefully read the description of the “dual-res” network in the AlphaGo Zero paper and calculate how many parameters it has, or maybe read the LeelaZero documentation more carefully. Please comment if you know, so I can update this post.

^[8]

I basically think that Anthony’s claim "I'd find it pretty surprising if biology did not find a computation solution with neurons that is within a couple of orders of magnitude of the [Landauer] limit" is not really based on anything, and so I put very little stock in it, compared to the more specific arguments discussed below. For example, when an unmyelinated axon carries a spike from the neuron's soma, the axon is expending energy (proportional to its length) on running ion pumps, but it's doing no irreversible computation whatsoever, because it’s just moving a bit from one place to another. So that particular functionality is ∞ orders of magnitude less efficient than the Landauer Limit. See also a much more careful discussion of what we can and can’t learn about brain computation from the Landauer limit in Joe Carlsmith’s report Section 4. For example, the report notes that “A typical cortical spike dissipates around 1e10-1e11 kT.” Do we really think the brain is doing millions of FLOP of unavoidable computational work per cortical spike? I sure don’t. Again, see more detailed discussion below.

^[9]

For one thing, it seems that 1e14 synapses is a much better guess than 1e15. Joe’s 1e15 number seems to come from an unexplained number in a Henry Markram talk, along with a calculation that involved rounding up several numbers simultaneously and forgetting that, while cortical pyramidal cells have thousands of synapses each, not all brain neurons are cortical pyramidal neurons. For another thing, my read on the evidence is that 0.1 spike/second is a better guess than 1 spike/second, when we weight over synapses (and hence ignore fast-spiking interneurons, which are small); see Joe’s footnote 134 and surrounding discussion.

^[10]

For example, directly multiplying the numbers might take a few clock cycles of your laptop’s CPU, whereas simulating a pocket calculator microcontroller chip entails tracking the states of (at least) thousands of transistors and wires and capacitors, as they interact with each other and evolve in time.

^[11]

Why do I say “at least 10%ish”? Well, from fMRI and lesion studies, we have some knowledge about what parts of the cortex are storing what kind of information. And for some of those parts of the cortex, the information they’re storing is either language-related or high-level-semantic information of the type that can also be found in LLMs. So basically, certain parts of superior & middle temporal gyrus, temporal pole, angular gyrus, hippocampus, and prefrontal cortex (in my opinion). I think those areas add up to maybe 10% of the cortex? The “at least” is because probably just about every part of the cortex has at least some information that can be found in an LLM, even if it also has information that can’t be found in an LLM, e.g. hard-to-articulate details about the visual appearances of objects.

^[12]

GPT-3 has 175B parameters, which for float32 parameters would correspond to 700GB. But I’m guessing (e.g. see here) that GPT-3 could be compressed to 100GB by quantization (and/or other methods) with minor performance loss, and I’m guessing that this compressed GPT-3 would still have 100× more information content regarding grammar and vocabulary and celebrity gossip and entomology and whatnot than any human. I am open to arguments that this estimate is way off.
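The numbers in this footnote can be checked with a few lines of arithmetic. The 175B parameter count, float32 storage, and the 100GB compressed guess are from the text; treating 1 GB as 1e9 bytes is an assumption.

```python
PARAMS = 175 * 10**9          # GPT-3 parameter count
BYTES_PER_FLOAT32 = 4

float32_bytes = PARAMS * BYTES_PER_FLOAT32
print(float32_bytes / 1e9)    # 700.0 (GB)

# If the model were squeezed into 100 GB, how many bits per parameter is that?
compressed_bytes = 100 * 10**9
bits_per_param = compressed_bytes * 8 / PARAMS
print(round(bits_per_param, 2))  # 4.57
```

So the 100GB guess amounts to roughly 4 to 5 bits per parameter.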

^[13]

Maybe you don’t buy my LLM-based argument, and you think that the cortex actually has >>10GB of information content, and it’s all coming from non-LLM things like audio analysis and production, image analysis and production, motor control, etc. In that case, I would disagree on the grounds that the deep learning community has already produced models that do at least decently on any of those tasks (e.g. speech recognition, speech synthesis, image segmentation, robot motor control, etc.), and I believe that all of those models can be compressed to sub-GB size. So what exactly is all this information that these alleged hundreds or thousands of GB are storing?

^[14]

There’s some controversy in the neuroscience community over whether the human brain stores learned information in synapses, or in gene expression, or something else, or all of the above. As it happens, I’m mostly on Team Synapse, but it doesn’t matter for this post. My “10 GB” claim is total information storage, without taking a stand on exactly how that information is physically encoded.
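For scale, here is how the “10 GB” figure compares with the 1e14 synapse count used elsewhere in this post, assuming 1 GB = 1e9 bytes. This is just the division, not a claim about where the information physically lives.

```python
STORED_BYTES = 10 * 10**9   # the "10 GB" claim
SYNAPSES = 10**14           # synapse count used in this post

bits_per_synapse = STORED_BYTES * 8 / SYNAPSES
print(bits_per_synapse)     # 0.0008
```

That is on the order of 0.001 bits per synapse, whatever the physical encoding turns out to be.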

^[15]

Note: When running the AGI, the information that needs to be in RAM is not only long-term-stored information / knowledge / habits / etc., but also short-term information like the current contents of working memory, and “memory traces” for learning, etc. I didn’t mention the latter, but I figure it’s probably comparatively small—or at least, not so large as to significantly threaten my “≲10GB” claim. For example, there are only ≈20 billion neocortical neurons total (ref), so for each, we can store some information about how long since it last fired etc. without the total data getting too high.

^[16]

Some readers are thinking: No Steve, the human ability to “figure things out” is not about RL or inductive biases or whatever, it is about more and better stored information—it’s just that the stored information is related to meta-learning, and includes better generalizations that stem from diverse real-world training data, etc. My response is: that’s a reasonable hypothesis to entertain, and it is undoubtedly true to some extent, but I still think it’s mostly wrong, and I stand by what I wrote. However, I’m not going to try to convince you of that, because my opinion is coming from “inside view” considerations that I don’t want to get into here.

^[17]

The 10 GB versus 100 GB change doesn’t have strong downstream consequences within my mental models—I already had a bunch of wiggle room on that. See where that came up in the text.

The distinction between memory capacity and useful memory content, i.e. “keeping track of (infinitesimal) credences on a zillion zany hypotheses, almost all of which will turn out to be wrong”, doesn’t really change my headline beliefs much either. The reason is not because I think sample efficiency stops mattering when we’re “running an AGI” rather than training. Quite the contrary, I’m very big into continuous learning (weight updates during deployment) as being absolutely critical for real AGI. Instead, some of my reasons are: (1) When I finally understood this point, it mostly converted my outside-my-model causes for doubt (“I’m confused about where Jacob Cannell is coming from, maybe I’m missing something??”) into within-my-model causes for doubt (“maybe memory bandwidth is a huge constraint after all?”), so I had already kinda accounted for it in the headline belief. And (2) my guess is that “incrementing the credences on a zillion zany little hypotheses” is something that’s relevant <1% of the time (i.e. <1% of spikes-through-synapses), and for the other >99% it’s sufficient to ignore those in favor of the main active beliefs about the world, which can be queried in a much much smaller compressed data structure optimized for inference. (This includes things like (i) I don’t think the cortex updates itself unless there’s local evidence of surprise / confusion; (ii) I think that the vast majority of spikes-through-synapses are related to the nuts-and-bolts machinery of the inference algorithm, and unrelated to the learning algorithm.) Plus other stuff I won’t get into. Basically, I still mostly don’t think memory bottleneck will be an issue in the first place, and if it is, then I expect clever future AI researchers to have lots of options to mitigate it. So anyway, after accounting for those things, plus wiggle-room in my other guesses, I don’t wind up with much of an update at all. 
I’m still mainly concerned about “unknown unknowns” and the other items in 3.3.1—that’s the main reason my 75%/85% headline is not higher.

Table of Contents / TL;DR

1. Prologue: Why should we care about the answer to this question?

1.1 Prologue bonus: Counterarguments (i.e. reasons that I shouldn't care), and my responses

2. Some prior discussion

2.1 Eliezer Yudkowsky’s comment about 286s and 1995 home computers

2.2 Paul Christiano’s comment about superhuman chess versus insect brains

2.3 Metaculus

3. The brain bound

3.1 Compute: 1e14 FLOP/s seems like more than enough

3.2 Memory and the von Neumann bottleneck

3.2.1 Background on the von Neumann bottleneck

3.2.2 I think an adult brain is storing only maybe ≲10GB of information (0.001 bits per synapse!!)

3.2.3 Upshot on memory

3.3 Conclusion

3.3.1 My biggest lingering doubts

4. Side note: Ratio of training compute to deployed compute

Updates / Corrigenda after initially publishing this post (added June 2024)