[Epistemic status: Strong opinions lightly held, this time with a cool graph.]

I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable.

In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes.

In slogan form: If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does.

The case of birds & planes illustrates this point nicely. Moreover, it is also a precedent for several other short-timelines talking points, such as the human-brain-human-lifetime (HBHL) anchor.

Plan:

1. Illustrative Analogy
2. Exciting Graph
3. Analysis
    1. Extra brute force can make the problem a lot easier
    2. Evolution produces complex mysterious efficient designs by default, even when simple inefficient designs work just fine for human purposes.
    3. What’s bogus and what’s not
    4. Example: Data-efficiency
4. Conclusion
5. Appendix

1909 French military plane, the Antoinette VII.

By Deep silence (Mikaël Restoux) - Own work (Bourget museum, in France), CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1615429

Exciting Graph

This data shows that Shorty was entirely correct about forecasting heavier-than-air flight. (For details about the data, see appendix.) Whether Shorty will also be correct about forecasting TAI remains to be seen.

In some sense, Shorty has already made two successful predictions: I started writing this argument before having any of this data; I just had an intuition that power-to-weight is the key variable for flight and that therefore we probably got flying machines shortly after having comparable power-to-weight as bird muscle. Halfway through the first draft, I googled and confirmed that yes, the Wright Flyer’s motor was close to bird muscle in power-to-weight. Then, while writing the second draft, I hired an RA, Amogh Nanjajjar, to collect more data and build this graph. As expected, there was a trend of power-to-weight improving over time, with flight happening right around the time bird-muscle parity was reached.

I had previously heard from a friend, who read a book about the invention of flight, that the Wright brothers were the first because they (a) studied birds and learned some insights from them, and (b) did a bunch of trial and error, rapid iteration, etc. (e.g. in wind tunnels). The story I heard was all about the importance of insight and experimentation--but this graph seems to show that the key constraint was engine power-to-weight. Insight and experimentation were important for determining who invented flight, but not for determining which decade flight was invented in.

Analysis

Part 1: Extra brute force can make the problem a lot easier

One way in which compute can substitute for insights/algorithms/architectures/ideas is that you can use compute to search for them. But there is a different and arguably more important way in which compute can substitute for insights/etc.: Scaling up the key variables, so that the problem becomes easier, so that fewer insights/etc. are needed.

For example, with flight, the problem becomes easier the more power/weight ratio your motors have. Even if the Wright brothers didn’t exist and nobody else had their insights, eventually we would have achieved powered flight anyway, because when our engines are 100x more powerful for the same weight, we can use extremely simple, inefficient designs. (For example, imagine a u-shaped craft with a low center of gravity and helicopter-style rotors on each tip. Add a third, smaller propeller on a turret somewhere for steering. EDIT: Oops, lol, I'm actually wrong about this. Keeping center of gravity low doesn't help. Welp, this is embarrassing.)

With neural nets, we have plenty of evidence now that bigger = better, with theory to back it up. Suppose the problem of making human-level AGI with HBHL levels of compute is really difficult. OK, 10x the parameter count and 10x the training time and try again. Still too hard? Repeat.

Note that I’m not saying that if you take a particular design that doesn’t work, and make it bigger, it’ll start working. (If you took Da Vinci’s flying machine and made the engine 100x more powerful, it would not work). Rather, I’m saying that the problem of finding a design that works gets qualitatively easier the more parameters and training time you have to work with.
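To make the "10x and try again" loop concrete, here is a minimal sketch of how fast the resources grow with each retry. It rests on one assumption that is not in the argument above: that training compute scales as size times training time. The function name `scale_up` and the baseline of 1 (standing in for HBHL levels) are hypothetical, chosen only for illustration.

```python
def scale_up(rounds, size_factor=10, time_factor=10):
    """Return relative (size, training_time, compute) for each attempt.

    Assumption (illustrative, not from the post): compute is proportional
    to size * training_time. Attempt 0 is the HBHL baseline, (1, 1, 1).
    """
    schedule = []
    size, duration = 1, 1
    for _ in range(rounds + 1):
        schedule.append((size, duration, size * duration))
        size *= size_factor
        duration *= time_factor
    return schedule


for attempt, (size, duration, compute) in enumerate(scale_up(3)):
    print(f"attempt {attempt}: size x{size}, time x{duration}, compute x{compute}")
```

Under that assumption each retry costs 100x more compute than the last, so "repeat" is cheap to say but only a handful of repeats are affordable in practice.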

Finally, remember that human-level AGI is not the only kind of TAI. Sufficiently powerful R&D tools would work, as would sufficiently powerful persuasion tools, as might something that is agenty and inferior to humans in some ways but vastly superior in others.

Part 2: Evolution produces complex mysterious efficient designs by default, even when simple inefficient designs work just fine for human purposes.

Suppose that actually all we have to do to get TAI is something fairly simple and obvious, but with a neural net 10x the size of my (actual) brain and trained for 10x longer. In this world, does the human brain look any different than it does in the actual world?

No. Here is a nonexhaustive list of reasons why evolution would evolve human brains to look like they do, with all their complexity and mysteriousness and efficiency, even if the same capability levels could be reached with 10x more neurons and a very simple architecture. Feel free to skip ahead if you think this is obvious.

1. In general, evolved creatures are complex and mysterious to us, even when simple and human-comprehensible architectures work fine. Take birds, for example: As mentioned before, all the way up to the Wright brothers there were a lot of very basic things about birds that were still not understood. From this article: “They watched buzzards glide from horizon to horizon without moving their wings, and guessed they must be sucking some mysterious essence of upness from the air. Few seemed to realize that air moves up and down as well as horizontally.” I don’t know much about ornithology but I’d be willing to bet that there were lots of important things discovered about birds after airplanes already existed, and that there are still at least a few remaining mysteries about how birds fly. (Spot check: Yep, the history of ornithopters page says “...the development of comprehensive aerodynamic theory for flapping remains an outstanding problem...”). And of course evolved creatures are often more efficient in various ways than their still-useful engineered counterparts.
2. Making the brain 10x bigger would be enormously costly to fitness, because it would cost 10x more energy and restrict mobility (not to mention the difficulties of getting through the birth canal!) Much better to come up with clever modules, instincts, optimizations, etc. that achieve the same capabilities in a smaller brain.
3. Evolution is heavily constrained on training data, perhaps even more than on brain size. It can’t just evolve the organism to have 10x more training data, because longer-lived organisms have more opportunities to be eaten or suffer accidents, especially in their 10x-longer childhoods. Far better to hard-code some behaviors as instincts.
4. Evolution gets clever optimizations and modules and such “for free” in some sense. Since it is evolving millions of individuals for millions of generations anyway, it’s not a big deal for it to perform massive search and gradient descent through architecture-space.
5. Completely blank slate brains (i.e. extremely simple architecture, no instincts or finely tuned priors) would be unfit even if they were highly capable because they wouldn’t be aligned to evolution’s values (i.e. reproduction.) Perhaps most of the complexity in the human brain--the instincts, inbuilt priors, and even most of the modules--isn’t for capabilities at all, but rather for alignment.

Part 3: What’s bogus and what’s not

The general pattern of argument I think is bogus is:

The brain has property X, which seems to be important to how it functions. We don’t know how to make AI’s with property X. It took evolution a long time to make brains have property X. This is reason to think TAI is not near.

As argued above, if TAI is near, there should still be many X which are important to how the brain functions, which we don’t know how to reproduce in AI, and which it took evolution a long time to produce. So rattling off a bunch of X’s is basically zero evidence against TAI being near.

Put differently, here are two objections any particular argument of this type needs to overcome:

1. TAI does not actually require X (analogous to how airplanes didn’t require anywhere near the energy-efficiency of birds, nor the ability to soar, nor the ability to flap their wings, nor the ability to take off from unimproved surfaces… the list goes on)
2. We’ll figure out how to get property X in AIs soon after we have the other key properties (size and training time), because (a) we can do search, like evolution did but much more efficiently, (b) we can increase the other key variables to make our design/search problem easier, and (c) we can use human ingenuity & biological inspiration. Historically there is plenty of precedent for the previous three factors being strong enough; see e.g. the case of powered flight.

This reveals how the arguments could be reformulated to become non-bogus! They need to argue (a) that X is probably necessary for TAI, and (b) that X isn’t something that we’ll figure out fairly quickly once the key variables of size and training time are surpassed.

In some cases there are decent arguments to be made for both (a) and (b). I think data-efficiency is one of them, so I’ll use that as my example below.

Part 4: Example: Data-efficiency

Let’s work through the example of data-efficiency. A bad version of this argument would be:

Humans are much more data-efficient learners than current AI systems. Data-efficiency is very important; any human who learned as inefficiently as current AI would basically be mentally disabled. This is reason to think TAI is not near.

The rebuttal to this bad argument is:

If birds were as energy-inefficient as planes, they’d be disabled too, and would probably die quickly. Yet planes work fine. (See Table 1 from this AI Impacts page) Even if TAI is near, there are going to be lots of X’s that are important for the brain, that we don’t know how to make in AI yet, but that are either unnecessary for TAI or not too difficult to get once we have the other key variables. So even if TAI is near, I should expect to hear people going around pointing out various X’s and claiming that this is reason to think TAI is far away. You haven’t done anything to convince me that this isn’t what’s happening with X = data-efficiency.

However, I do think the argument can be reformulated and expanded to become good. Here’s a sketch, inspired by Ajeya Cotra’s argument here.

We probably can’t get TAI without figuring out how to make AIs that are as data-efficient as humans. It’s true that there are some useful tasks for which there is plenty of data--like call center work, or driving trucks--but AIs that can do these tasks won’t be transformative. Transformative AI will be doing things like managing corporations, leading armies, designing new chips, and writing AI theory publications. Insofar as AI learns more slowly than humans, by the time it accumulates enough experience doing one of these tasks, (a) the world would have changed enough that its skills would be obsolete, and/or (b) it would have made a lot of expensive mistakes in the meantime.

Moreover, we probably won’t figure out how to make AIs that are as data-efficient as humans for a long time--decades at least. This is because 1. We’ve been trying to figure this out for decades and haven’t succeeded, and 2. Having a few orders of magnitude more compute won’t help much. Now, to justify point #2: Neural nets actually do get more data-efficient as they get bigger, but we can plot the trend and see that they will still be less data-efficient than humans when they are a few orders of magnitude bigger. So making them bigger won’t be enough, we’ll need new architectures/algorithms/etc. As for using compute to search for architectures/etc., that might work, but given how long evolution took, we should think it’s unlikely that we could do this with only a few orders of magnitude of searching—probably we’d need to do many generations of large population size. (We could also think of this search process as analogous to typical deep learning training runs, in which case we should expect it’ll take many gradient updates with large batch size.) Anyhow, there’s no reason to think that data-efficient learning is something you need to be human-brain-sized to do. If we can’t make our tiny AIs learn efficiently after several decades of trying, we shouldn’t be able to make big AIs learn efficiently after just one more decade of trying.

I think this is a good argument. Do I buy it? Not yet. For one thing, I haven’t verified whether the claims it makes are true, I just made them up as plausible claims which would be persuasive to me if true. For another, some of the claims actually seem false to me. Finally, I suspect that in 1895 someone could have made a similarly plausible argument about energy efficiency, and another similarly plausible argument about flight control, and both arguments would have been wrong: Energy efficiency turned out to be insufficiently necessary, and flight control turned out to be insufficiently difficult!

Conclusion

What I am not saying: I am not saying that the case of birds and planes is strong evidence that TAI will happen once we hit the HBHL milestone. I do think it is evidence, but it is weak evidence. (For my all-things-considered view of how many orders of magnitude of compute it’ll take to get TAI, see future posts, or ask me.) I would like to see a more thorough investigation of cases in which humans attempt to design something that has an obvious biological analogue. It would be interesting to see if the case of flight was typical. Flight being typical would be strong evidence for short timelines, I think.

What I am saying: I am saying that many common anti-short-timelines arguments are bogus. They need to do much more than just appeal to the complexity/mysteriousness/efficiency of the brain; they need to argue that some property X is both necessary for TAI and not about to be figured out for AI anytime soon, not even after the HBHL milestone is passed by several orders of magnitude.

Why this matters: In my opinion the biggest source of uncertainty about AI timelines has to do with how much “special sauce” is necessary for making transformative AI. As jylin04 puts it,

A first and frequently debated crux is whether we can get to TAI from end-to-end training of models specified by relatively few bits of information at initialization, such as neural networks initialized with random weights. OpenAI in particular seems to take the affirmative view[^3], while people in academia, especially those with more of a neuroscience / cognitive science background, seem to think instead that we'll have to hard-code in lots of inductive biases from neuroscience to get to AGI [^4].

In my words: Evolution clearly put lots of special sauce into humans, and took millions of generations of millions of individuals to do so. How much special sauce will we need to get TAI?

Shorty is one end of a spectrum of disagreement on this question. Shorty thinks the amount of special sauce required is small enough that we’ll “work out the details” within a few years of having the key variables (size and training time). At the other end of the spectrum would be someone who thought that the amount of special sauce required is similar to the amount found in the brain. Longs is in the middle. Longs thinks the amount of special sauce required is large enough that the HBHL milestone isn’t particularly relevant to timelines; we’ll either have to brute-force search for the special sauce like evolution did, or have some brilliant new insights, or mimic the brain, etc.

This post rebutted common arguments against Shorty’s position. It also presented weak evidence in favor of Shorty’s position: the precedent of birds and planes. In future posts I’ll say more about what I think the probability distribution over amount-of-special-sauce-needed should be and why.

Acknowledgements: Thanks to my RA, Amogh Nanjajjar, for compiling the data and building the graph. Thanks to Kaj Sotala, Max Daniel, Lukas Gloor, and Carl Shulman for comments on drafts.

Appendix

Some footnotes:

1. I didn’t say anything about why we might think size and training time are the key variables, or even what “key variables” means. Hopefully I’ll get a chance in the comments or in subsequent posts.
2. I deliberately left vague what “training time” means and what “size” means. Thus, I’m not committing myself to any particular way of calculating the HBHL milestone yet. I’m open to being convinced that the HBHL milestone is farther in the future than it might seem.
3. Persuasion tools, even very powerful ones, wouldn’t be TAI by the standard definition. However they would constitute a potential-AI-induced-point-of-no-return, so they still count for timelines purposes.
4. This "How much special sauce is needed?" variable is very similar to Ajeya Cotra's variable "how much compute would lead to TAI given 2020's algorithms."

Some bookkeeping details about the data:

1. This dataset is not complete. Amogh did a reasonably thorough search for engines throughout the period (with a focus on stuff before 1910) but was unable to find power or weight stats for many of the engines we heard about. Nevertheless I am reasonably confident that this dataset is representative; if an engine was significantly better than the others of its time, probably this would have been mentioned and Amogh would have flagged it as a potential outlier.
2. Many of the points for steam engine power/weight should really be bumped up slightly. This is because most of the data we had was for the weight of the entire locomotive of a steam-powered train, rather than just the steam engine part. I don’t know what fraction of a locomotive is non-steam-engine but 50% seems like a reasonable guess. I don’t think this changes the overall picture much; in particular, the two highest red dots do not need to be bumped up at all (I checked).
3. The birds bar is the power/weight ratio for the muscles of a particular species of bird, as reported by this source. Amogh has done a bit of searching and doesn’t think muscle power/weight is significantly different for other species of bird. Seems plausible to me; even if the average bird has muscles that are twice (or half) as powerful-per-kilogram, the overall graph would look basically the same.
4. I attempted to find estimates of human muscle power-to-weight ratio; it gets smaller the more tired the muscles get, but at peak performance for fit individuals it seems to be about an order of magnitude less than bird muscle. (This chart lists power-to-weight ratio for human cyclists, which according to this are probably about half muscle, so look at the left-hand column and double it.) Interestingly, this means that the engines of the first flying machines were possibly the first engines to be substantially better than human flapping/pedaling as a source of flying-machine power.
5. EDIT Gaaah I forgot to include a link to the data! Here's the spreadsheet.


Moreover, we probably won’t figure out how to make AIs that are as data-efficient as humans for a long time--decades at least.

I know you weren't endorsing this claim as definitely true, but FYI my take is that other families of learning algorithms besides deep neural networks are in fact as data-efficient as humans, particularly those related to probabilistic programming and analysis-by-synthesis, see examples here.

Planned summary for the Alignment Newsletter:

This post argues against a particular class of arguments about AI timelines. These arguments have the form: “The brain has property X, but we don’t know how to make AIs with property X. Since it took evolution a long time to make brains with property X, we should expect it will take us a long time as well”. These arguments are not compelling because humans often use different approaches to solve problems than evolution did, and so humans might solve the overall problem without ever needing to have property X. To make these arguments more convincing, you need to argue 1) why property X really is _necessary_ and 2) why property X won’t follow quickly once everything else is in place.

This is illustrated with a hypothetical example of someone trying to predict when humans would achieve heavier-than-air flight: in practice, you could have made decent predictions just by looking at the power to weight ratios of engines vs. birds. Someone who argued that we were far away because “we don’t know how to make wings that flap” would have made incorrect predictions.

Planned opinion:

This all seems generally right to me, and is part of the reason I like the <@biological anchors approach@>(@Draft report on AI timelines@) to forecasting transformative AI.

Sounds good to me! I suggest you replace "we don't know how to make wings that flap" with "we don't even know how birds stay up for so long without flapping their wings," because IMO it's a more compelling example. But it's not a big deal either way.

As an aside, I'd be interested to hear your views given this shared framing. Since your timelines are much longer than mine, and similar to Ajeya's, my guess is that you'd say TAI requires data-efficiency and that said data-efficiency will be really hard to get, even once we are routinely training AIs the size of the human brain for longer than a human lifetime. In other words, I'd guess that you would make some argument like the one I sketched in Part 3. Am I right? If so, I'd love to hear a more fleshed-out version of that argument from someone who endorses it -- I suppose there's what Ajeya has in her report...

Sorry, what in this post contradicts anything in Ajeya's report? I agree with your headline conclusion of

If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does.

This also seems to be the assumption that Ajeya uses. I actually suspect we could get away with a smaller neural net, one that is similar in size to or somewhat smaller than the brain.

I guess the report then uses existing ML scaling laws to predict how much compute we need to train a neural net the size of a brain, whereas you prefer to use the human lifetime to predict it instead? From my perspective, the former just seems way more principled / well-motivated / likely to give you the right answer, given that the scaling laws seem to be quite precise and reasonably robust.

I would predict that we won't get human-level data efficiency for neural net training, but that's a consequence of my trust in scaling laws (+ a simple model for why that would be the case, namely that evolution can bake in some prior knowledge that it will be harder for humans to do, and you need more data to compensate).

I suggest you replace "we don't know how to make wings that flap" with "we don't even know how birds stay up for so long without flapping their wings,"

Done.

OK, so here is a fuller response:

First of all, yeah, as far as I can tell you and I agree on everything in the OP. Like I said, this disagreement is an aside.

Now that you mention it / I think about it more, there's another strong point to add to the argument I sketched in part 3: Insofar as our NN's aren't data-efficient, it'll take more compute to train them, and so even if TAI need not be data-efficient, short-timelines-TAI must be. (Because in the short term, we don't have much more compute. I'm embarrassed I didn't notice this earlier and include it in the argument.) That helps the argument a lot; it means that all the argument has to do is establish that we aren't going to get more data-efficient NN's anytime soon.

And yeah, I agree the scaling laws are a great source of evidence about this. I had them in mind when I wrote the argument in part 3. I guess I'm just not as convinced as you (?) that (a) when we are routinely training NN's with 10e15 params, it'll take roughly 10e15 data points to get to a useful level of performance, and (b) average horizon length for the data points will need to be more than short.

Some reasons I currently doubt (a):

--A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT's had using various already-developed tricks and techniques.

--The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally. It could be that at 10e15 params and 10e15 data points, performance is actually much higher than merely useful; maybe only 10e13 params and 10e13 data points would be the first to cross the usefulness threshold. (Counterpoint: Extrapolating GPT performance trends on text prediction suggests it wouldn't be human-level at text prediction until about 10e15 params and 10e15 data points, according to data I got from Lanrian. Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10e15/10e15 would be the far-right edge of the graph).

Some reasons I currently doubt (b):

--I've been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.

--I think that humans have a tiny horizon length -- our brains are constantly updating, right? I guess it's hard to make the comparison, given how it's an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that's all you need.

--Having a small average horizon length doesn't preclude also training lots on long-horizon tasks. It just means that on average your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.
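A rough sketch of the averaging point in the last item: a training mix can contain a large absolute number of long-horizon data points and still have a short average horizon, as long as short-horizon points vastly outnumber them. Every number below is a made-up placeholder for illustration (the counts, the 1-second horizon, the 2-day horizon), and `average_horizon` is a hypothetical helper, not anything from the bio anchors report.

```python
def average_horizon(mixture):
    """Mean horizon length in seconds over all data points.

    `mixture` is a list of (number_of_data_points, horizon_seconds) pairs.
    """
    total_points = sum(count for count, _ in mixture)
    if total_points == 0:
        raise ValueError("mixture must contain at least one data point")
    total_seconds = sum(count * horizon for count, horizon in mixture)
    return total_seconds / total_points


# Hypothetical mix: many 1-second prediction steps plus a million 2-day plans.
TWO_DAYS_IN_SECONDS = 2 * 24 * 60 * 60
training_mix = [
    (10**15, 1),                    # short-horizon data points (placeholder count)
    (10**6, TWO_DAYS_IN_SECONDS),   # long-horizon data points (placeholder count)
]
print(average_horizon(training_mix))  # stays very close to 1 second
```

With these placeholder numbers, a million multi-day episodes move the average by well under a thousandth of a second. The conclusion is sensitive to the ratio of counts, so it should be rechecked with whatever mix one actually has in mind.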

I'm very uncertain about all of this and would love to hear your thoughts, which is why I asked. :)

Now that you mention it / I think about it more, there's another strong point to add to the argument I sketched in part 3: Insofar as our NN's aren't data-efficient, it'll take more compute to train them, and so even if TAI need not be data-efficient, short-timelines-TAI must be.

Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.

--A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT's had using various already-developed tricks and techniques.

Note that Ajeya's report does have a term for "algorithmic efficiency", which has a doubling time of 2-3 years.

Certainly "several OOMs using tricks and techniques we could implement in a year" would be way faster than that trend, but you've really got to wonder why these people haven't done it yet -- if I interpret "several OOMs" as "at least 3 OOMs", that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I'll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.

Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years -- if so, this seems plausibly consistent with the 2-3 year doubling time.
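For a sense of scale, the comparison between "3 OOMs in a year" and the 2-3 year doubling time can be worked out directly. This sketch uses only the figures quoted in this comment; the helper name `years_for_ooms` is hypothetical, and the roughly $1,000,000 starting cost is merely what "3 OOMs down to around $1000" implies, not a sourced number.

```python
import math


def years_for_ooms(ooms, doubling_time_years):
    """Years needed to gain `ooms` orders of magnitude at a fixed doubling time."""
    doublings = ooms * math.log2(10)  # one OOM is about 3.32 doublings
    return doublings * doubling_time_years


# 3 OOMs at the trend's 2-year and 3-year doubling times.
fast = years_for_ooms(3, 2)
slow = years_for_ooms(3, 3)
print(f"3 OOMs takes roughly {fast:.0f} to {slow:.0f} years on trend")

# Starting cost implied by "3 OOMs brings it down to around $1000".
implied_starting_cost = 1000 * 10**3
print(f"implied starting cost: ${implied_starting_cost:,}")
```

So on the trend alone, 3 OOMs corresponds to roughly two to three decades of algorithmic progress, which is why getting it within a year would be such a sharp departure.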

--The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally.

Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make. I agree there's uncertainty here, but I don't see why the uncertainty should bias us towards shorter timelines rather than longer timelines.

I could see it if we thought we were better than evolution, since then we could say "we'd figure something out that evolution missed and that would bias towards short timelines"; but this is also something that Ajeya considered and iirc she then estimated that evolution tended to be ~10x better than us (lots of caveats here though).

Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10e15/10e15 would be the far-right edge of the graph

Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of "transformative AI". The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.

(b) average horizon length for the data points will need to be more than short.

I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has changed some other parameters, though not hugely iirc) and the median I get is 2041 (about 10 years lower than what it was previously). So I don't think this is making a huge difference (though certainly 10 years is substantial).

--I've been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.

I see horizon length (as used in the report) as a function of a task, so "horizon length of GPT-3" feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw around 100 million "effective examples" of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.
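The "effective examples" figure can be checked from the two numbers quoted above (300 billion training tokens, a 2048-token context window). The variable names are illustrative only.

```python
tokens_seen = 300_000_000_000   # training tokens, as quoted above
context_window = 2048           # maximum horizon length in tokens, as quoted above

# Number of non-overlapping context-window-sized chunks in the training data.
effective_examples = tokens_seen // context_window
print(f"{effective_examples:,} effective examples (about {effective_examples:.1e})")
```

This comes out a little under 150 million, the same order of magnitude as the "around 100 million" quoted.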

--I think that humans have a tiny horizon length -- our brains are constantly updating, right? I guess it's hard to make the comparison, given how it's an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that's all you need.

Again, this feels like a type error to me. Horizon length isn't about the optimization algorithm, it's about the task.

(You can of course define your own version of "horizon length" that's about the optimization algorithm, but then I think you need to have some way of incorporating the "difficulty" of a transformative task into your timelines estimate, given that the scaling laws are all calculated on "easy" tasks.)

--Having a small average horizon length doesn't preclude also training lots on long-horizon tasks. It just means that on average your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.

Agree with this. I remember mentioning this to Ajeya but I don't actually remember what the conclusion was.

EDIT: Oh, I remember now. The argument I was making is that you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason in me putting as much weight on short horizons as I did; I think this was also true for Ajeya.

Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.

I didn't quite follow this part. Do you think I'm not reasoning from the thing I believe is the bottleneck?

Certainly "several OOMs using tricks and techniques we could implement in a year" would be way faster than that trend, but you've really got to wonder why these people haven't done it yet -- if I interpret "several OOMs" as "at least 3 OOMs", that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I'll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.
Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years -- if so, this seems plausibly consistent with the 2-3 year doubling time.

Some of the people I talked to said about 2 OOMs, others expressed it differently, saying that the faster scaling law can be continued past the kink point predicted by Kaplan et al. Still others simply said that GPT-3 was done in a deliberately simple, non-cutting-edge way to prove a point and that it could have used its compute much more compute-efficiently if they threw the latest bags of tricks at it. I am skeptical of all this, of course, but perhaps less skeptical than you? 2 OOMs is 7 doublings, which will happen around 2037 according to Ajeya. Would you be willing to take a 10:1 bet that there won't be something as good as GPT-3 trained on 2 OOMs less compute by 2030? I think I'd take the other side of that bet.
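The "2 OOMs is 7 doublings" step can be sketched as follows. The doubling times tried (2 to 3 years) are the range mentioned earlier in the thread; the 2020 start year is an assumption (GPT-3's release year), and the function names are hypothetical.

```python
import math


def doublings_for_ooms(ooms: float) -> float:
    """Number of doublings needed to gain `ooms` orders of magnitude."""
    return ooms * math.log2(10)


def year_reached(ooms: float, doubling_time_years: float,
                 start_year: float = 2020.0) -> float:
    """Year by which `ooms` of efficiency gain accumulates.

    start_year=2020 is an ASSUMPTION, not a figure from the thread.
    """
    return start_year + doublings_for_ooms(ooms) * doubling_time_years


print(f"2 OOMs = {doublings_for_ooms(2):.2f} doublings")
for doubling_time in (2.0, 2.5, 3.0):
    print(f"doubling time {doubling_time} yr -> year {year_reached(2, doubling_time):.1f}")
```

Under these assumptions a doubling time in the middle of the range lands in the late 2030s, which is why a 2030 deadline is a bet on progress faster than that trend.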

Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make.

I don't think evolution was going for compute-optimal performance in the relevant sense. With AI, we can easily trade off between training models longer and making models bigger, and according to the scaling laws it seems like we should increase training time by 0.75 OOMs for every OOM of parameter count increase. With biological systems, sure maybe it is true that if you faced a trade-off where you were trying to minimize total number of neuron firings over the course of the organism's childhood, the right ratio would be 0.75 OOMs of extra childhood duration for every 1 OOM of extra synapses... maybe. But even if this were true, it's pretty non-obvious that that's the trade-off regime evolution faces. There are all sorts of other pros and cons associated with more synapses and longer childhoods. For example, maybe evolution finds it easier to increase synapse count than to increase childhood, because increased childhood reduces fitness significantly (more chances to die before you reproduce, longer doubling time of population).

Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of "transformative AI". The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.

Yeah, sorry, by useful I meant useful for transformative tasks.

Yes, obviously the tasks in the graph are not transformative. But it seems to me to be... like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it's because they've developed general intelligence in the relevant sense. Or maybe they haven't but it's a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture. Like, yeah those tasks are "particularly easy" compared to taking over the world, but they are also incredibly hard in some sense; IIRC GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25% (random chance), and I bet most English-speaking literate humans in the world today would have done worse than 50%.

I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has changed some other parameters, though not hugely iirc) and the median I get is 2041 (about 10 years lower than what it was previously). So I don't think this is making a huge difference (though certainly 10 years is substantial).

Huh. When I put 100% mass on short horizon in my version of Ajeya's model, it says median 2031. Admittedly, I had made some changes to some other parameters too, also not hugely iirc. I wonder if this means those other-parameter changes matter more than I'd thought.

I see horizon length (as used in the report) as a function of a task, so "horizon length of GPT-3" feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw around 100 million "effective examples" of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.

Huh, that's totally not how I saw it. From Ajeya's report:

I’ll define the “effective horizon length” of an ML problem as the amount of data it takes (on average) to tell whether a perturbation to the model improves performance or worsens performance. If we believe that the number of “samples” required to train a model of size P is given by KP, then the number of subjective seconds that would be required should be given by HKP, where H is the effective horizon length expressed in units of “subjective seconds per sample.”

To me this really sounds like it's saying the horizon length = the number of subjective seconds per sample during training. So, maybe it makes sense to talk about "horizon length of task X" (i.e. number of subjective seconds per sample during training of a typical ML model on that task) but it seems to make even more sense to talk about "horizon length of model X" since model X actually had a training run and actually had an average number of subjective seconds per sample.

But I'm happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.

At any rate, deferring to you on this doesn't undermine the point I was making at all, as far as I can tell.

you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason in me putting as much weight on short horizons as I did; I think this was also true for Ajeya.

Exactly. I think this is what humans do too, to a large extent. I'd be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.

I didn't quite follow this part. Do you think I'm not reasoning from the thing I believe is the bottleneck?

I actually don't remember what I meant to convey with that :/

Would you be willing to take a 10:1 bet that there won't be something as good as GPT-3 trained on 2 OOMs less compute by 2030?

No, I'd also take the other side of the bet. A few reasons:

• Estimated algorithmic efficiency in the report is low because researchers are not currently optimizing for "efficiency on a transformative task", whereas researchers probably are optimizing for "efficiency of GPT-3 style systems", suggesting faster improvements in algorithmic efficiency for GPT-3 than estimated in the report.
• 90% confidence is quite a lot; I do not have high certainty in the algorithmic efficiency part of the report.

(Note that 2 OOMs in 10 years seems significantly different from "we can get several OOMs more data-efficient training than the GPT's had using various already-developed tricks and techniques". I also assume that you have more than 10% credence in this, since 10% seems too low to make a difference to timelines.)

I don't think evolution was going for compute-optimal performance in the relevant sense.

I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?

I think you'd need to argue that there is a specific other property that evolution was optimizing for, that clearly trades off against compute-efficiency, to argue that we should expect that in this case evolution was worse than in other cases.

But it seems to me to be... like, 25% likely or so that once we have pre-trained, unsupervised models that build up high skill level at all those tasks on the graph, it's because they've developed general intelligence in the relevant sense.

This seems like it is realist about rationality, which I mostly don't buy. Still, 25% doesn't seem crazy, I'd probably put 10 or 20% on it myself. But even at 25% that seems pretty consistent with my timelines; 25% does not make the median.

Or maybe they haven't but it's a sign that general intelligence is near, perhaps with a more sophisticated training regime and architecture.

Why aren't we already using the most sophisticated training regime and architecture? I agree it will continue to improve, but that's already what the model does.

GPT-3 was also tested on a big dataset of exam questions used for high school, college, and graduate-level admissions, and got 50% or so whereas every other AI system got 25% (random chance), and I bet most English-speaking literate humans in the world today would have done worse than 50%.

1. I don't particularly care about comparisons of memory / knowledge between GPT-3 and humans. Humans weren't optimized for that.
2. I expect that Google search beats GPT-3 on that dataset.

I don't really know what you mean when you say that this task is "hard". Sure, humans don't do it very well. We also don't do arithmetic very well, while calculators do.

But I'm happy to 70% defer to your judgment on this since you probably have talked to Ajeya etc. and know more about this than me.

Er, note that I've talked to Ajeya for like an hour or two on the entire report. I'm not that confident that Ajeya also believes the things I'm saying (maybe I'm 80% confident).

To me this really sounds like it's saying the horizon length = the number of subjective seconds per sample during training. [...]

I agree that the definition used in the report does seem consistent with that. I think that's mostly because the report assumes that you are training a model to perform a single (transformative) task, and so a definition in terms of the model is equivalent to a definition in terms of the task. The report doesn't really talk about the unsupervised pretraining approach so its definitions didn't have to handle that case.

But like, irrespective of what Ajeya meant, I think the important concept would be task-based. You would want to have different timelines for "when a neural net can do human-level summarization" and "when a neural net can be a human-level personal assistant", even if you expect to use unsupervised pretraining for both. The only parameter in the model that can plausibly do that is the horizon length. If you don't use the horizon length for that purpose, I think you should have some other way of incorporating "difficulty of the task" into your timelines.

Exactly. I think this is what humans do too, to a large extent. I'd be curious to hear why you put so much weight on medium and long horizons. I put 50% on short, 20% on medium, and 10% on long.

I mean, I'm at 30 / 40 / 10, so that isn't that much of a difference. Half of the difference could be explained by your 25% on general reasoning, vs my (let's say) 15% on it.

Thanks again. My general impression is that we disagree less than it first appeared, and that our disagreements are currently bottoming out in different intuitions rather than obvious cruxes we can drill down on. Plus I'm getting tired. ;) So, I say we call it a day. To be continued later, perhaps in person, perhaps in future comment chains on future posts!

I don't really know what you mean when you say that this task is "hard". Sure, humans don't do it very well. We also don't do arithmetic very well, while calculators do.

By "hard" I mean something like "Difficult to get AIs to do well." If we imagine all the tasks we can get AIs to do lined up by difficulty, there is some transformative task A which is least difficult. As the tasks we succeed at getting AIs to do get harder and harder, we must be getting closer to A. I think that getting an AI to do well on all the benchmarks we throw at it despite not being trained for any of them (but rather just trained to predict random internet text) seems like a sign that we are getting close to A. You say this is because I believe in realism about rationality; I hope not, since I don't believe in realism about rationality. Maybe there's a contradiction in my views then which you have pointed to, but I don't see it yet.

I feel like this is already taken into account by the methodology by which we estimated the ratio of evolution to human design? Like, taking your example of flight, presumably evolution was not optimizing just for power-to-weight ratio, it was optimizing for a bunch of other things; nonetheless we ignore those other things when making the comparison. Similarly, in the report the estimate is that evolution is ~10x better than humans on the chosen metrics, even though evolution was not literally optimizing just for the chosen metric. Why not expect the same here?

At this point I feel the need to break things down into premise-conclusion form because I am feeling confused about how the various bits of your argument are connecting to each other. I realize this is a big ask, so don't feel any particular pressure to do it.

I totally agree that evolution wasn't optimizing just for power-to-weight ratio. But I never claimed that it was. I don't think that my comparison relied on the assumption that evolution was optimizing for power-to-weight ratio. By contrast, you explicitly said "presumably evolution was also going for compute-optimal performance." Once we reject that claim, my original point stands that it's not clear how we should apply the scaling laws to the human brain, since the scaling laws are about compute-optimal performance, i.e. how you should trade off size and training time if all you care about is minimizing compute. Since evolution obviously cares about a lot more than that (and indeed doesn't care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren't directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or... etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data.

The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance.

The scaling laws from the Kaplan et al papers do tell you this.

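The relevant law is $L(N, D)$, for the early-stopped test loss given parameter count $N$ and data size $D$.  It has the functional form

$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$

with $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.103$.

The result that you should scale $D \propto N^{\alpha_N / \alpha_D} \approx N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.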

This is not exactly a heuristic for managing compute (since $D$ is not dependent on compute, it's dependent on how much data you can source).  It's more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.

You always can train models that are "too large" on datasets that are "too small" according to the heuristic, and they won't diverge or do poorly or anything.  They just won't improve much upon the results of smaller models.
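To make the "too large on too small" point concrete, here is a minimal sketch of the $L(N, D)$ form from Kaplan et al. The constants are assumptions (fitted values as recalled from that paper, used only for illustration), and the dataset size and model sizes are hypothetical. It holds the data fixed, grows the model, and prints how the loss flattens towards the data-limited floor.

```python
# Sketch of the L(N, D) form from Kaplan et al. The constants below are
# ASSUMED illustrative values (recalled from the paper's joint fit), not
# numbers taken from this thread.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13


def loss(n_params: float, n_tokens: float) -> float:
    """Early-stopped test loss for n_params parameters and n_tokens of data."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def data_limited_floor(n_tokens: float) -> float:
    """Loss in the infinite-parameter limit: only the data term remains."""
    return (D_C / n_tokens) ** ALPHA_D


tokens = 1e11  # hypothetical fixed dataset size
sizes = [1e9, 1e11, 1e13, 1e15]  # hypothetical model sizes
losses = [loss(n, tokens) for n in sizes]

for n, value in zip(sizes, losses):
    print(f"N={n:.0e}  loss={value:.4f}")
print(f"floor as N grows without bound: {data_limited_floor(tokens):.4f}")
```

Nothing diverges for the oversized models; each extra factor of 100 in parameters simply buys less than the one before it.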

In terms of the above, you are setting $N \sim 10^{15}$ and then asking what $D$ ought to be.  If the heuristic gives you an answer that seems very high, that doesn't mean the model is "not as data efficient as you expected."  Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to $N \sim 10^{15}$ rather than using a smaller model to get almost identical performance.

I find it more intuitive to think about the following, both discussed in the papers:

• $L(D)$, the $N \to \infty$ limit of $L(N, D)$
• meaning: the peak data efficiency possible with this model class
• $L(N)$, the $D \to \infty$ limit of $L(N, D)$
• meaning: the scaling of loss with parameters when not data-constrained but still using early stopping

If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between $N$ and $D$ to ensure we are not in either limit.

Ultimately, we expect AGI to require some specific-if-unknown level of performance (i.e. crossing some loss threshold).  Ajeya's approach essentially assumes that we'll cross this threshold at a particular value of $N$, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.

I'm not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the $L(N)$ or the $L(D)$ curve until it hits that threshold.

Huh, thanks, now I'm more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:

1. In my discussion with Rohin I said:

Since evolution obviously cares about a lot more than that (and indeed doesn't care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren't directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or... etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data.

Do you agree or disagree? My guess is that you'd disagree, since you say:

If the heuristic gives you an answer that seems very high, that doesn't mean the model is "not as data efficient as you expected."  Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.

which I take to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don't think that... OK, yeah, I'm just very confused here, please help!)

2. You say "This is not exactly a heuristic for managing compute (since D is not dependent on compute, it's dependent on how much data you can source)." Well, isn't it both? You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when we hit AGI we won't be on the 45-degree line but rather will be constrained by model size or by data and so will be hugging one of the other two lines)

You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

This is a subtle and confusing thing about the Kaplan et al papers.  (It's also the subject of my post that I linked earlier, so I recommend you check that out.)

There are two things in the papers that could be called "optimal compute budgeting" laws:

• A law that assumes a sufficiently large dataset (i.e. an effectively infinite dataset), and tells you how to manage the tradeoff between steps $S$ and params $N$.
• The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size $D$ vs params $N$.

I said the $D$ vs $N$ law was "not a heuristic for managing compute" because the $S$ vs $N$ law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.

However, the $D$ vs $N$ law does tell you about how to spend compute in an indirect way, for the exact reason you say, that $D$ is related to how long you train.  Comparing the two laws yields the "breakdown" or "kink point."

Do you agree or disagree? ... I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?

Sorry, why do you expect I disagree?  I think I agree.  But also, I'm not really claiming the scaling laws say or don't say anything about the brain, I'm just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems).  We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.

Perhaps it would help me if I could visualize it in two dimensions

This part is 100% qualitatively accurate, I think.  The one exception is that there are two "optimal compute" lines on the plot with different slopes, for the two laws referred to above.  But yeah, I'm saying we won't be on either of those lines, but on the L(N) or the L(D) line.

I didn't confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with, "The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance," was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you'd disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin.

I'm glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.

Actually, I think I spoke too soon about the visualization... I don't think your image of L(D) and L(N) is quite right.

Here is what the actual visualization looks like.  More blue = lower loss, and I made it a contour plot so it's easy to see indifference curves of the loss.

https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png

In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:

• If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis.  That is, in this regime, N doesn't matter and loss is effectively a function of D alone.
• This is L(D).
• It looks like the color changes you see if you move horizontally through the upper left region.
• Likewise, in the lower right region, D doesn't matter and loss depends on N alone.
• This is L(N).
• It looks like the color changes you see if you move vertically through the lower right region.

To restate my earlier claims...

If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower).  So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).

This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.

On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive.  For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.

When I said that it's intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach.  And that's going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
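The "whichever is more restrictive" reading can be sketched with the two one-variable limits, L(N) and L(D). The functional form follows Kaplan et al.; the constants are assumptions (fitted values as recalled from that paper), and `bottleneck` is a hypothetical helper. Because the joint loss adds two positive terms before taking a power, it can never fall below either limit, so the larger of the two limits acts as a floor.

```python
# ASSUMED illustrative constants (recalled from Kaplan et al.'s joint fit).
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13


def loss(n_params: float, n_tokens: float) -> float:
    """Joint loss L(N, D)."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    return (model_term + D_C / n_tokens) ** ALPHA_D


def loss_n_limit(n_params: float) -> float:
    """L(N): the D -> infinity limit of L(N, D)."""
    return (N_C / n_params) ** ALPHA_N


def loss_d_limit(n_tokens: float) -> float:
    """L(D): the N -> infinity limit of L(N, D)."""
    return (D_C / n_tokens) ** ALPHA_D


def bottleneck(n_params: float, n_tokens: float) -> str:
    """Name the resource whose limit sets the higher loss floor."""
    return "data" if loss_d_limit(n_tokens) > loss_n_limit(n_params) else "params"


for n, d in [(1e15, 1e12), (1e12, 1e12), (1e12, 1e15)]:
    floor = max(loss_n_limit(n), loss_d_limit(d))
    print(f"N={n:.0e} D={d:.0e}  loss={loss(n, d):.4f}  "
          f"floor={floor:.4f}  bottleneck={bottleneck(n, d)}")
```

When N and D differ by orders of magnitude, the printed loss sits almost on top of the floor set by the scarcer resource, which is the sense in which the target losses we can reach are set by the highest N or D available.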

Asking "what could we do with a N=1e15 model?" (or any other number) is kind of a weird question from the perspective of this plot.  It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region ... or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.

In Ajeya's work, this question means "let's assume we're using an N=1e15 model, and then let's assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let's figure out how big D has to be to get there."

So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as "the performance which you could only reach with N=1e15 params".

What feels weird to me -- which you touched on above -- is the way this lets the scaling relations "backseat drive" the definition of sufficient quality for AGI.  Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it... we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.

OK, wow, I didn't realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya's methodology was great after all -- my worries have been largely dispelled!

Given that the indifference curves are so close to being L-shaped, it seems there's a pretty strong argument to be made that since the human brain has 10^15 params or so, it must be doing some fairly important tasks which can't be done (at least not as well) for much less than 10^15 params. Like, maybe a 10^13 param brain could do the task if it didn't have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren't that big a deal, such that we can be fairly confident that these tasks require a NN of 10^13 or more params.

The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to "within a few OOMs of 10^15."

Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10^15 params is a task which requires 10^15 data points also. Because otherwise we could reduce parameter count below 10^15 and keep the same performance.

So I no longer feel weird about this; I feel like this part of Ajeya's analysis makes sense.

But I am now intensely curious as to how many "data points" the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10^15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.

Is the second possibility plausible? I guess so. There are 10^9 seconds in a human lifetime, so if you are processing a million data points a second... Huh, that seems a bit much.

What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that's why you need more data to do better -- but that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that's what humans are doing -- "only" a billion data points in a lifetime, but really high-quality ones and good mechanisms for focusing on the right stuff to update on of all your sensory data coming in?

And then there's the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that's how humans can get away with only 10^9 data points or so. (Well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)

What do you think of these three possibilities?

I don't think this step makes sense:

Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also.

In the picture, it looks like there's something special about having a 1:1 ratio of data to params.  But this is a coincidence due to the authors' choice of units.

They define "one data point" as "one token," which is fine.  But it seems equally defensible to define "one data point" as "what the model can process in one forward pass," which is ~1e3 tokens.  If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!

To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps.  This depends on your choice of units.  And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems "have the same scaling law."  Scaling is about relationships between differences, not relationships between absolute magnitudes.
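A toy illustration of the units point, as a sketch only. It assumes a simple additive loss of the form (n_c/N)^alpha_n + (d_c/D)^alpha_d with made-up constants; this is not the fitted law or the constants from the scaling-laws paper. The only number taken from the discussion is the ~1e3 tokens per forward pass.

```python
TOKENS_PER_WINDOW = 1e3  # "~1e3 tokens" per forward pass, from the comment

def balanced_data(n_params, n_c, d_c, alpha_n, alpha_d):
    """Data amount D at which the data term (d_c/D)**alpha_d equals the
    parameter term (n_c/n_params)**alpha_n in the toy additive loss."""
    return d_c * (n_params / n_c) ** (alpha_n / alpha_d)

# Hypothetical constants, chosen so the cusp sits at D/N = 1 in token units.
N_C, D_C_TOKENS, ALPHA_N, ALPHA_D = 1e13, 1e13, 0.1, 0.1

n_params = 1e15
d_tokens = balanced_data(n_params, N_C, D_C_TOKENS, ALPHA_N, ALPHA_D)

# Same law and same physical data, but D is now counted in forward passes,
# so the data constant is rescaled by the unit conversion.
d_windows = balanced_data(
    n_params, N_C, D_C_TOKENS / TOKENS_PER_WINDOW, ALPHA_N, ALPHA_D
)

print(f"cusp at D/N = {d_tokens / n_params:g} when D is in tokens")
print(f"cusp at D/N = {d_windows / n_params:g} when D is in forward passes")
```

The exponents, and hence the shape of the curves, are identical in both cases; only the constant attached to the data axis moves, which is why the apparent 1:1 ratio carries no information on its own.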

On the larger topic, I'm pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for "a data point" is.  This is mostly for "Could a Neuroscientist Understand a Microprocessor?"-type reasons.  I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.

They define "one data point" as "one token," which is fine.  But it seems equally defensible to define "one data point" as "what the model can process in one forward pass," which is ~1e3 tokens.  If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!

Holy shit, mind blown! Then... how are the scaling laws useful at all then? I thought the whole point was to tell you how to divide your compute between... Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as "how long you run the model during training" (which in turn is maybe "how many times the average parameter of the model is activated during training?") Because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * "horizon length."

I'm very interested to hear your thoughts on Ajeya's methodology. Is my sketch of it above accurate? Do you agree it's a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn't have got more easily with a smaller model--regardless of what the horizon length is, or what your training environment is, or what the task is?

...

As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya's report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance... that would probably make her timelines shorter, funnily enough!

Update: According to this the human brain actually is getting ~10^7 bits of data every second, although the highest level conscious awareness is only processing ~50. So insofar as we go with the "tokens" definition, it does seem that the human brain is processing plenty of tokens for its parameter count -- 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the "single pass through the network" definition, which would mean we are looking for about 10^12... then we get a small discrepancy; the maximum firing rate of neurons is 250 - 1000 times per second, which means 10^11.5 or so... actually this more or less checks out I'd say. Assuming it's the max rate that matters and not the average rate (the average rate is about once per second).
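The lifetime totals in the update can be checked with a short sketch. All input figures (10^7 bits per second, 250 to 1000 Hz maximum firing rate, about one spike per second on average, 10^9 seconds per lifetime) are taken from the comments above; treating one bit as one "token" and one firing as one "forward pass" are that comment's working assumptions, not established facts.

```python
import math

LIFETIME_SECONDS = 1e9               # rough lifetime figure from earlier in the thread
BITS_PER_SECOND = 1e7                # sensory input rate cited in the update
MAX_FIRING_RATE_HZ = (250.0, 1000.0) # range of maximum firing rates cited in the update
AVG_FIRING_RATE_HZ = 1.0             # "about once per second"

def lifetime_total(rate_per_second: float) -> float:
    """Total count over a lifetime for an event happening at the given rate."""
    return rate_per_second * LIFETIME_SECONDS

tokens_total = lifetime_total(BITS_PER_SECOND)
passes_low, passes_high = (lifetime_total(r) for r in MAX_FIRING_RATE_HZ)
passes_avg = lifetime_total(AVG_FIRING_RATE_HZ)

print(f"'tokens' definition:        10^{math.log10(tokens_total):.1f}")
print(f"'passes' at max firing:     10^{math.log10(passes_low):.1f} "
      f"to 10^{math.log10(passes_high):.1f}")
print(f"'passes' at average firing: 10^{math.log10(passes_avg):.1f}")
```

The gap between the max-rate and average-rate rows is about three orders of magnitude, so the "more or less checks out" conclusion does hinge on the max rate being the relevant one.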

Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience, just as GPT-3 can "few-shot learn" totally new tasks, and also "fine-tune" on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!) but really what's going on is just transfer learning from its vast pre-training experience.

humans have a tiny horizon length

What do you mean by horizon length here?

I intended to mean something similar to what Ajeya meant in her report:

I’ll define the “effective horizon length” of an ML problem as the amount of data it takes (on average) to tell whether a perturbation to the model improves performance or worsens performance. If we believe that the number of “samples” required to train a model of size P is given by KP, then the number of subjective seconds that would be required should be given by HKP, where H is the effective horizon length expressed in units of “subjective seconds per sample.”

To be clear, I'm still a bit confused about the concept of horizon length. I'm not sure it's a good idea to think about things this way. But it seems reasonable enough for now.
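For concreteness, the formula in the quoted definition can be written out as below. This is only a sketch of the arithmetic: the symbols K, P and H are from the quote, while the numeric values are hypothetical placeholders chosen for illustration and are not estimates from the report.

```python
def samples_required(k: float, p: float) -> float:
    """Number of training samples for a model of size p, assuming samples = K * P."""
    return k * p

def subjective_seconds_required(h: float, k: float, p: float) -> float:
    """Total subjective seconds of training data: H * K * P, where h is the
    effective horizon length in subjective seconds per sample."""
    return h * samples_required(k, p)

# Hypothetical values for illustration only.
K = 1.0     # samples per parameter
P = 1e15    # parameter count
for H in (1.0, 1e3, 1e6):  # short, medium, and long horizon, in seconds per sample
    total = subjective_seconds_required(H, K, P)
    print(f"H = {H:g} s/sample -> {total:g} subjective seconds")
```

Since H enters as a plain multiplier, each order of magnitude of horizon length adds an order of magnitude to the total data requirement, which is why the choice of horizon matters so much in this framing.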

I've been working on a draft blog post kinda related to that. If you're interested, I can DM you a link; it could use a second pair of eyes.

Sure!

Nothing in this post directly contradicts anything in Ajeya's report. The conflict, insofar as there is any, is in that Part 3 I mentioned, where I sketch an argument for long timelines based on data-efficiency. That argument sketch was inspired by what Ajeya said; it's what my model of her (and of you) would say. Indeed it's what you are saying now (e.g. you are saying the scaling laws tell us how data-efficient our models will be once they are bigger, and it's still not data-efficient enough to be transformative, according to you.) I think. So, the only conflict is external to this post I guess: I think this is a decent argument but I'm not yet fully convinced, whereas (I think) you and Ajeya think it or something like it is a more convincing argument. I intend to sleep on it and get back to you tomorrow with a more considered response.

Quick self-review:

Yep, I still endorse this post. I remember it fondly because it was really fun to write and read. I still marvel at how nicely the prediction worked out for me (predicting correctly before seeing the data that power/weight ratio was the key metric for forecasting when planes would be invented). My main regret is that I fell for the pendulum rocket fallacy and so picked an example that inadvertently contradicted, rather than illustrated, the point I wanted to make! I still think the point overall is solid but I do actually think this embarrassment made me take somewhat more seriously the "we are missing important insights" hypothesis. Sometimes you don't know what you don't know.

I still see lots of people making claims about the efficiency and mysteriousness of the brain to justify longer timelines. Frustratingly I usually can't tell from their offhand remarks whether they are using the bogus arguments I criticize in this post, or whether they have something more sophisticated and legit in mind. I'd have to interrogate them further, and probably get them to read this post, to find out, and in conversation there usually isn't time or energy to do that.

Curated.

This post laid out some important arguments pretty clearly.

Great post!

we’ll either have to brute-force search for the special sauce like evolution did

I would drop the "brute-force" here (evolution is not a random/naive search).

Re the footnote:

This "How much special sauce is needed?" variable is very similar to Ajeya Cotra's variable "how much compute would lead to TAI given 2020's algorithms."

I don't see how they are similar.

Thanks! Fair enough re: brute force; I guess my problem is that I don't have a good catchy term for the level of search evolution does. It's better than pure random search, but a lot worse than human-intelligent search.

I think "how much compute would lead to TAI given 2020's algorithms" is sort of an operationalization of "how much special sauce is needed." There are three ways to get special sauce: Brute-force search, awesome new insights, or copy it from the brain. "given 2020's algorithms" rules out two of the three. It's like operationalizing "distance to Edinburgh" as "time it would take to get to Edinburgh by helicopter."

My understanding is that the 2020 algorithms in Ajeya Cotra's draft report refer to algorithms that train a neural network on a given architecture (rather than algorithms that search for a good neural architecture etc.). So the only "special sauce" that can be found by such algorithms is one that corresponds to special weights of a network (rather than special architectures etc.).

Huh, that's not how I interpreted it. I should reread the report. Thanks for raising this issue.

"automated search"?