Evolution provides no evidence for the sharp left turn

Quintin Pope

Does human evolution imply a sharp left turn from AIs?

Arguments for the sharp left turn in AI capabilities often appeal to an “evolution -> human capabilities” analogy and say that evolution's outer optimization process built a much faster human inner optimization process whose capability gains vastly outstripped those which evolution built into humans. Such arguments claim we will see a similar transition while training AIs, with SGD creating some 'inner thing' which is not SGD and which gains capabilities much faster than SGD can insert them into the AI. Then, just like human civilization exploded in capabilities over a tiny evolutionary time frame, so too will AIs explode in capabilities over a tiny "SGD time frame".

Evolution’s sharp left turn happened for evolution-specific reasons

I think that "evolution -> human capabilities" is a bad analogy for "AI training -> AI capabilities". Let’s compare evolution to within lifetime learning for a single generation of an animal species:

A generation is born.
The animals of the generation learn throughout their lifetimes, collectively performing many billions of steps of learning.
The generation dies, and all of the accumulated products of within lifetime learning are lost.
Differential reproductive success slightly changes the balance of traits across the species.

The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.

Now let’s look at the same comparison for humans:

A generation is born.
The humans of the generation learn throughout their lifetimes, collectively performing many billions of steps of learning.
The current generation transmits some fraction of their learned knowledge to the next generation through culture.
The generation dies, but only some of the accumulated products of within lifetime learning are lost.
Differential reproductive success slightly changes the balance of genomic traits across humanity.

Human culture allows some fraction of the current generation’s within lifetime learning to transmit directly to the next generation. In the language of machine learning, the next generation benefits from a kind of knowledge distillation, thanks to the prior generation providing higher quality 'training data' for the next generation's within-lifetime learning.

This is extremely important because within-lifetime learning happens much, much faster than evolution. Even if we conservatively say that brains do two updates per second, and that a generation is just 20 years long, that means a single person’s brain will perform ~1.2 billion updates per generation. Additionally, the human brain probably uses a stronger base optimizer than evolution, so each within-lifetime brain update is also probably better at accumulating information than a single cross-generational evolutionary update. Even if we assume that only 1 / 10,000 of the information learned by each generation makes its way into humanity's cross-generational, persistent endowment of cultural information, that still means culture advances ~100,000 times faster than biological evolution.
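The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the constants are the ones stated in the paragraph, the variable names are mine, and the final ratio rests on the simplifying assumption that biological evolution contributes one update per generation.

```python
# Back-of-the-envelope check of the estimate above.
# All constants come from the text; the names are illustrative.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

updates_per_second = 2          # conservative brain update rate
generation_years = 20           # short generation length
retained_fraction = 1 / 10_000  # share of learning that persists in culture

# Within-lifetime updates performed by one brain over one generation.
updates_per_generation = updates_per_second * generation_years * SECONDS_PER_YEAR

# Assumption: evolution makes one update per generation, and a brain update
# is worth at least as much as an evolutionary update.
culture_speedup = updates_per_generation * retained_fraction

print(f"updates per generation: {updates_per_generation:.2e}")
print(f"culture vs. evolution speedup: {culture_speedup:.2e}")
```

Under these assumptions the script gives roughly 1.26 billion updates per generation and a ratio of roughly 1.26 × 10^5, in line with the rounded ~1.2 billion and ~100,000 figures.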

I think that "evolution -> human capabilities" is a very bad reference class to make predictions about "AI training -> AI capabilities". We don't train AIs via an outer optimizer over possible inner learning processes, where each inner learning process is initialized from scratch, then takes billions of inner learning steps before the outer optimization process takes one step, and then is deleted after the outer optimizer's single step. Such a bi-level training process would necessarily experience a sharp left turn once each inner learner became capable of building off the progress made by the previous inner learner (which happened in humans via culture / technological progress from one generation to another).

However, this sharp left turn does not occur because the inner learning processes suddenly become much better / more foomy / more general in a handful of outer optimization steps. It happens because you devoted billions of times more optimization power to the inner learning processes, but then deleted each inner learner shortly thereafter. Once the inner learning processes become able to pass non-trivial amounts of knowledge along to their successors, you get what looks like a sharp left turn. But that sharp left turn only happens because the inner learners have found a kludgy workaround past the crippling flaw where they all get deleted shortly after initialization.
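A toy model may make this concrete. The sketch below is my own illustration, not a model of real evolution: the function name, the linear gains, and the parameter values are all assumptions. Each inner learner is exactly as capable in every generation; the only thing that changes is whether a fraction of its within-lifetime learning survives its deletion.

```python
def persistent_knowledge(generations, inner_steps, transmit_from,
                         transmit_fraction, outer_gain=1.0, inner_gain=1.0):
    """Toy model: knowledge that survives across generations.

    Every generation, the outer optimizer adds `outer_gain`. Each inner
    learner accumulates `inner_steps * inner_gain` during its lifetime, but
    that is deleted at death unless transmission has started, in which case
    `transmit_fraction` of it is passed on.
    """
    persistent = 0.0
    history = []
    for generation in range(generations):
        learned_in_lifetime = inner_steps * inner_gain
        persistent += outer_gain  # one slow outer step
        if generation >= transmit_from:
            persistent += transmit_fraction * learned_in_lifetime
        history.append(persistent)
    return history


history = persistent_knowledge(
    generations=10, inner_steps=1_000_000_000,
    transmit_from=5, transmit_fraction=1 / 10_000,
)
# Gain in persistent knowledge from each generation to the next.
per_generation_gain = [b - a for a, b in zip(history, history[1:])]
print(per_generation_gain)
```

The per-generation gain jumps by about five orders of magnitude at the generation where transmission starts, even though `inner_steps` and `inner_gain` never change. The kink comes entirely from no longer throwing the inner learning away.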

In my frame, we've already figured out and applied the sharp left turn to our AI systems, in that we don't waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization. For a given compute budget, the best (known) way to buy capabilities is to train a single big model in accordance with empirical scaling laws such as those discovered in the Chinchilla paper, not to split the compute budget across millions of different training runs for vastly tinier models with slightly different architectures and training processes. In fact, we can be even more clever and use small models to tune the training process, before scaling up to a single large run, as OpenAI did with GPT-4.

(See also: Gwern on the blessings of scale.)

It’s true that we train each new AI from scratch, rather than reusing any of the compute that went into previous models. However, the situation is very different from human evolution because each new state of the art model uses geometrically more compute than the prior state of the art model. Even if we could perfectly reuse the compute from previous models, it wouldn't be nearly so sharp an improvement to the rate of progress as occurred in the transition from biological evolution to human cultural accumulation. I don’t think it’s plausible for AI capabilities research to have the same sort of hidden factor of ~billion resource overhang that can be suddenly unleashed in a short-to-humans timescale.

The capabilities of ancestral humans increased smoothly as their brains increased in scale and/or algorithmic efficiency. Until culture allowed for the brain’s within-lifetime learning to accumulate information across generations, this steady improvement in brain capabilities didn’t matter much. Once culture allowed such accumulation, the brain’s vastly superior within-lifetime learning capacity allowed cultural accumulation of information to vastly exceed the rate at which evolution had been accumulating information. This caused the human sharp left turn.

However, the impact of scaling or algorithmic improvements on the capabilities of individual brains is still continuous, which is what matters for predicting how suddenly AI capabilities will increase as a result of scaling or algorithmic improvements. Humans just had this one particular bottleneck in cross-generational accumulation of capabilities-related information over time, leading to vastly faster progress once culture bypassed this bottleneck.

Don't misgeneralize from evolution to AI

Evolution's sharp left turn happened because evolution spent compute in a shockingly inefficient manner for increasing capabilities, leaving vast amounts of free energy on the table for any self-improving process that could work around the evolutionary bottleneck. Once you condition on this specific failure mode of evolution, you can easily predict that humans would undergo a sharp left turn at the point where we could pass significant knowledge across generations. I don't think there's anything else to explain here, and no reason to suppose some general tendency towards extreme sharpness in inner capability gains.

History need not repeat itself. Human evolution is not an allegory or a warning. It was a series of events that happened for specific, mechanistic reasons. If those mechanistic reasons do not extend to AI research, then we ought not (mis)apply the lessons from evolution to our predictions for AI.

This last paragraph makes an extremely important claim that I want to ensure I convey fully:

- IF we understand the mechanism behind humanity's sharp left turn with respect to evolution

- AND that mechanism is inapplicable to AI development

- THEN, there's no reason to reference evolution at all when forecasting AI development rates, not as evidence for a sharp left turn, not as an "illustrative example" of some mechanism / intuition which might supposedly lead to a sharp left turn in AI development, not for anything.

Here's an analogy to further illustrate the point:

Imagine that we were trying to figure out how to build very reliable cars. We've so far built a number of car prototypes, but none have reached the full load-bearing capacity of even a single strong human, never mind the vastly superhuman transport capacity that the laws of physics seem to permit.
Someone raises the concern that, once we try to scale current prototypes to the superhuman limit, they'll tend to spontaneously combust, despite the fact that none of the prototypes have ever done so. As evidence for such an event, the person points to the fact that a previous car building effort, led by EVO-Inc., actually had built cars that did sometimes explode randomly.
Concerned, we investigate EVO-Inc.'s car building effort, hoping to avoid whatever failure plagues their cars. Only, upon investigating EVO-Inc., it turns out that they're actually run by insane space clowns, and the reason their cars occasionally explode is because they used armed landmines in place of hubcaps.

My point is that other car builders can learn ~zero lessons from EVO-Inc.^[1] The mechanism behind their cars' spontaneous detonation is easily avoided by not using landmines as hubcaps. The organizational-level failures that led to this design choice on EVO-Inc.'s part are also easily avoided by not being insane space clowns. We should not act like there might be some general factor of "explodeyness" which will infect other car building efforts, simply by virtue of those efforts tackling a similar problem to the one EVO-Inc. failed at.

EVO-Inc.'s failures arose from mechanisms which do not apply to human organizations tackling similar problems. EVO-Inc. didn't use landmines as hubcaps because they were run by greedy, myopic executives who cut corners on safety to increase profits. They didn't do so because they were naive optimists who failed to understand why building non-exploding cars is hard like computer security or rocket science, and who failed to apply proper security mindset to their endeavors. EVO-Inc. used landmines as hubcaps because they were run by insane space clowns who did insane space clown things.

Human car builders may have to tackle problems superficially similar to the spontaneous combustion of the EVO-Inc. cars. E.g., they may have to design the fuel tanks of their cars to avoid combustion during a crash. However, those efforts still should not take lessons from EVO-Inc. E.g., if other car builders were to look at crash data from EVO-Inc.'s cars, and naively generalize from the surface-level outcomes of an EVO-Inc. car crash to their own mechanistically different circumstances, they might assume that supersonic fragments posed a significant risk during a crash, and then add ballistic armor between the driver and the wheels, despite this doing nothing to prevent a car's fuel tank from igniting during a crash.

I think our epistemic relationship with evolution's example should be about the same as the human car builders' epistemic relationship with EVO-Inc. Evolution's combined sharp left turn and alignment failures happened because evolution is a very different process compared to human-led AI development, leading to evolution-specific mechanisms, which no sane AI developer would replicate.

In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:

Deliberately create a (very obvious^[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.^[3]
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.^[4]
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.^[5]

This is the AI development equivalent of using landmines as hubcaps. It's not just that this is an insane idea from an alignment perspective. It's also an insane idea from just about any other perspective. Even if you're only trying to maximize AI capabilities, it's a terrible idea to have such an extreme disparity in resources between the inner and outer loops.

AI researchers have actually experimented with bi-level optimization processes such as neural architecture search and second-order meta learning. Based on current results, I don't think anything approaching multiple orders of magnitude difference in resource use between the inner and outer optimizers is plausible. It's just not efficient, and we have better approaches. From the GPT-4 paper:

A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000× – 10,000× less compute.

Even if we could magically repurpose all of the compute used throughout OpenAI's tuning of the GPT-4 architecture / training process, I doubt it would even amount to as much compute as they used in the final GPT-4 training run, much less exceed that quantity by orders of magnitude. Modern training practices simply lack that sort of free energy.
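The general shape of the "predict the large run from small runs" workflow in the quoted passage can be sketched as follows. This is a simplified illustration under my own assumptions, not OpenAI's actual method: it fits a pure power law with no irreducible-loss term, and the data points are synthetic, generated from an assumed curve rather than taken from any real training run.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss ~ a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic small-run data from an assumed power law (not real results).
TRUE_A, TRUE_B = 5.0, 0.05
small_compute = [1e18, 1e19, 1e20, 1e21]
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
large_compute = 1e24  # 1,000x more than the largest small run
predicted = a * large_compute ** (-b)
print(f"fitted exponent: {b:.3f}, predicted loss at 1e24: {predicted:.3f}")
```

The point of the sketch is the cost structure: the small runs used for tuning and extrapolation together consume a tiny fraction of the compute of the final run, which is the opposite of the evolutionary resource split.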

See also: Model Agnostic Meta Learning proposed a bi-level optimization process that used between 10 and 40 times more compute in the inner loop, only for Rapid Learning or Feature Reuse? to show they could get about the same performance while removing almost all the compute from the inner loop, or even by getting rid of the inner loop entirely.

Fast takeoff is still possible

The prior sections argue that we should not use an evolutionary analogy as evidence that an inner learner will sufficiently outperform the outer optimizer that constructed it so as to cause a massive spike in capabilities as a result of the same mechanisms that drove the sharp left turn in human evolution.

Introducing new types of positive feedback loops across multiple training runs may still lead to fast takeoff. However, this would be a mechanistically different process from the evolutionary sharp left turn, meaning there's no reason to assume takeoff dynamics mirroring those of human evolution. There are two specific mechanisms that I think could produce a fast takeoff:

AIs contributing to AI capabilities research, producing a positive feedback loop with a sharp upwards kink around the time that AI contributions exceed human contributions.
AIs deliberately seeking out new training data that grant them useful capabilities. E.g., an AI trying to improve its bioengineering capabilities may set up a very fast cycle of gathering and analyzing new biological data, which significantly outpaces the rate of human scientific innovation.

If fast takeoff is still plausible, why does the specific type of positive feedback loop matter? What changes, as a result of considering various AI-specific fast takeoff mechanisms, as opposed to the general expectation of sudden transitions, as implied by the evolution analogy? Here are four alignment-relevant implications:

Takeoff is less abrupt. Both of the above mechanisms are vaguely similar to how human cultural development allowed us to jump forwards in capabilities by feeding the outputs of one generation into the “training data” of the next generation. However, I expect that neither mechanism will produce as much of a relative jump in AI capabilities, as cultural development produced in humans. Neither mechanism would suddenly unleash an optimizer multiple orders of magnitude faster than anything that came before, as was the case when humans transitioned from biological evolution to cultural development.
Takeoff becomes easier to navigate. These specific mechanisms of capabilities advance probably both allow for iteration and experimentation. We currently have examples of both AI capabilities advances and of online learning / exploration processes. We can run experiments on current systems to assess the alignment risks posed by both these sources of capabilities improvement.
Capabilities gains are less general. "capabilities generalize further than alignment" is a common refrain in discussions about the sharp left turn. Usually, this claim is justified by making an analogy to how human capabilities started to quickly generalize across many domains simultaneously.

However, the process responsible for human breadth of generality was not some small architectural modification evolution made to the human brain. It was humanity's cross-generational process of expanding and improving our available "training data" to cover a broader and broader range of capabilities across many domains (a process we sometimes call "science"). The evolutionary analogy thus offers no reason to expect sudden jumps in generality without corresponding extensions of the training data.

Without this evolutionary analogy, why should we even elevate the very specific claim that 'AIs will experience a sudden burst of generality at the same time as all our alignment techniques fail' to consideration at all, much less put significant weight on it?
Alignment probably generalizes pretty well. Speaking of alignment techniques failing, I expect alignment techniques to mostly generalize across capabilities jumps caused by either of the above mechanisms for sudden capabilities gain.

Will alignment generalize across sudden capabilities jumps?

The previous section argued that the mechanisms driving the sharp left turn in human evolution are not present in AI development, and so we shouldn't generalize from the results of human evolution to those of AI development, even when considering positive feedback loops whose surface-level features are reminiscent of the sharp left turn in human evolution.

This section will first reference and briefly summarize some past writing of mine arguing that our "misalignment" with inclusive genetic fitness isn't evidence for AI misalignment with our values. Then, I'll examine both mechanisms for a possible fast takeoff that I described above from an "inside view" machine learning perspective, rather than assuming outcomes mirroring those of human evolutionary history.

Human "misalignment" with inclusive genetic fitness provides no evidence for AI misalignment

I previously wrote a post, Evolution is a bad analogy for AGI: inner alignment, arguing that evolutionary analogies between human values and inclusive genetic fitness have little to tell us about the degree of values misgeneralization we should expect from AI training runs, and that analogies to human within-lifetime learning are actually much more informative^[6].

I also wrote this subsection in a much longer post, which explains why I think evolution is mechanistically very different from AI training, such that we cannot easily infer lessons about AI misgeneralization by looking at how human behaviors differ between the modern and ancestral environments.

Very briefly: "human behavior in the ancestral environment" versus "human behavior in the modern environment" isn't a valid example of behavioral differences between training and deployment environments. Humans weren't "trained" in the ancestral environment, then "deployed" in the modern environment. Instead, humans are continuously "trained" throughout our lifetimes (via reward signals and sensory predictive error signals). Humans in the ancestral and modern environments are different "training runs".

As a result, human evolution is not an example of:

We trained the system in environment A. Then, the trained system processed a different distribution of inputs from environment B, and now the system behaves differently.

It's an example of:

We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two different systems behave differently.
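The difference between the two framings can be shown with a deliberately tiny learner. Everything here is a hypothetical illustration: the class, the "environments", and the choice of a running mean as the learned parameter are my own stand-ins, not a model of brains or language models.

```python
class RunningMeanLearner:
    """Minimal online learner: its single parameter tracks the mean input."""

    def __init__(self):
        self.estimate = 0.0
        self.steps = 0

    def train(self, inputs):
        for x in inputs:
            self.steps += 1
            self.estimate += (x - self.estimate) / self.steps

    def act(self, observation):
        # Behavior: how far the observation is from what training led it to expect.
        return observation - self.estimate


env_a = [1.0, 2.0, 3.0]     # stand-in for the "ancestral" input distribution
env_b = [10.0, 20.0, 30.0]  # stand-in for the "modern" input distribution

# Framing 1: train once in A, then process inputs from B with frozen parameters.
deployed = RunningMeanLearner()
deployed.train(env_a)
framing_1 = [deployed.act(x) for x in env_b]

# Framing 2: two fresh learners, each trained in its own environment.
learner_a, learner_b = RunningMeanLearner(), RunningMeanLearner()
learner_a.train(env_a)
learner_b.train(env_b)
framing_2 = [learner_b.act(x) for x in env_b]

print(deployed.estimate, learner_a.estimate, learner_b.estimate)
print(framing_1, framing_2)
```

In the first framing, the deployed system has the same parameters as `learner_a` and only its inputs change. In the second, `learner_b` ends up with different parameters because it was trained on different data. Human behavior across ancestral and modern environments corresponds to the second framing.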

The near-total misalignment between inclusive genetic fitness and human values is an easily predicted consequence of this (evolution-specific) bi-level optimization paradigm, just like the human sharp left turn is an easily predicted consequence of the (evolution-specific) extreme resource disparity between the two optimization levels. And just like evolution provides no reason to assume our own AI development efforts will experience a sharp left turn, so too does evolution not provide any reason to assume our AI development efforts will show extreme misgeneralization between training and deployment.

Capabilities jumps due to AI driving AI capabilities research

For the first mechanism of AIs contributing to AI capability research, I first note that this is an entirely different sort of process than the one responsible for the human sharp left turn. Evolution made very few modifications to the human brain's architecture during the timeframe in which our cultural advancement catapulted us far beyond the limits of our ancestral capabilities. Additionally, humans have so far been completely incapable of changing our own architectures, so there was never a positive feedback loop of the sort that we might see with AIs researching AI capabilities.

Because of this large difference in underlying process between this possible fast takeoff mechanism and the evolutionary sharp left turn, I think we should mostly rely on the current evidence available from AI development for our predictions of future AI development, rather than analogies to our evolutionary history. Additionally, I claim that alignment techniques already generalize across human contributions to AI capability research. Let’s consider eight specific alignment techniques:

and eleven recent capabilities advances:

Optimally training language models using the Chinchilla scaling laws
Transcending Scaling Laws with 0.1% Extra Compute
Better tuning of training and architectural hyperparameters (example)
Retrieval mechanisms for language models, such as RETRO
1 bit Adam for efficiently sharing gradient info across GPUs
Doing more than one epoch on high quality text
(Possibly) an improvement on the Adam optimizer
Distributed training across many low-memory GPUs
Stable, 8-bit transformer implementations
Applying layer norms to query and key outputs of attention layers to stabilize training.
The Hyena operator as a replacement for attention, to (maybe?) enable scalable sub-quadratic sequence processing architectures

I don’t expect catastrophic interference between any pair of these alignment techniques and capabilities advances. E.g., if you first develop your RLHF techniques for models trained using the original OpenAI scaling laws, I expect those techniques to transfer pretty well to models trained with the Chinchilla scaling laws.

I expect there is some interference. I expect that switching your architecture from a vanilla transformer to a RETRO architecture will cause issues like throwing off whatever RLHF hyperparameters you’d found worked best for the vanilla architecture, or complicate analysis of the system because there’s now an additional moving part (the retrieval mechanism), which you also need to track in your analysis.

However, I expect we can overcome such issues with “ordinary” engineering efforts, rather than, say, RLHF techniques as a whole becoming entirely useless for the new architecture. Similarly, whatever behavioral analysis pipeline you’d developed to track models based on the vanilla architecture can probably be retrofitted for models based on the RETRO architecture without having to start from scratch.

Importantly, the researchers behind the capabilities advances were not explicitly optimizing to maintain backward compatibility with prior alignment approaches. I expect that we can decrease interference further by just, like, bothering to even try and avoid it.

I’d like to note that, despite my optimistic predictions above, I do think we should carefully measure the degree of interference between capabilities and alignment techniques. In fact, doing so seems very very important. And we can even start right now! We have multiple techniques for both alignment and capabilities. You can just choose a random alignment technique from the alignment list, a random capabilities technique from the capabilities list, then see if applying the capabilities technique makes the alignment technique less effective.
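The experiment described above could be organized roughly like this. It is a sketch of the bookkeeping only: the technique names are ones mentioned in this post, while `evaluate` is a hypothetical user-supplied function that would train a model and return some alignment score, and the scoring convention is my own assumption.

```python
import random

# Names taken from the post; extend both lists as needed.
ALIGNMENT_TECHNIQUES = ["RLHF", "ConstitutionalAI"]
CAPABILITIES_ADVANCES = [
    "Chinchilla scaling laws",
    "RETRO retrieval",
    "1 bit Adam",
    "8-bit transformer implementation",
    "Hyena operator",
]


def sample_pair(rng):
    """Pick one alignment technique and one capabilities advance at random."""
    return rng.choice(ALIGNMENT_TECHNIQUES), rng.choice(CAPABILITIES_ADVANCES)


def interference(score_baseline, score_with_advance):
    """Drop in an alignment score after adding the capabilities advance.

    Positive values mean the alignment technique became less effective.
    """
    return score_baseline - score_with_advance


def run_trial(rng, evaluate):
    """Run one trial with `evaluate(alignment, advance_or_None)`.

    `evaluate` is a hypothetical callable supplied by the experimenter; it
    should train a model with the given setup and return an alignment score.
    """
    alignment, advance = sample_pair(rng)
    baseline = evaluate(alignment, None)
    modified = evaluate(alignment, advance)
    return alignment, advance, interference(baseline, modified)


def stub_evaluate(alignment, advance):
    # Placeholder scores so the sketch runs; replace with real measurements.
    return 0.90 if advance is None else 0.85


print(run_trial(random.Random(0), stub_evaluate))
```

Running each pair with and without the capabilities advance, while holding the alignment technique fixed, is what lets the difference in scores be attributed to interference rather than to the alignment technique itself.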

The major exception to my non-interference claim is for alignment techniques that rely on details of trained models’ internal structures, such as mechanistic interpretability. CNNs and transformers require different sorts of interpretability techniques, and likely have different flavors of internal circuitry. This is one reason why I’m more skeptical of mechanistic interpretability as an alignment approach^[7].

Capabilities jumps due to AI iteratively refining its training data

I think the second potential fast takeoff mechanism, of AIs continuously refining their training data, is riskier, since it allows strange feedback loops that could take an AI away from human-compatible values. Additionally, most current models derive values and goal-orientated behaviors much more from their training data, as opposed to their architecture, hyperparameters, and the like.

E.g., I expect that choosing to use the LION optimizer in place of the Adam optimizer would have very little impact on, say, the niceness of a language model you were training, except insofar as your choice of optimizer influences the convergence of the training process. Architecture choices seem 'values neutral' in a way that data choices are not.

I still think the risks are manageable, since the first-order effect of training a model to perform an action X in circumstance Y is to make the model more likely to perform actions similar to X in circumstances similar to Y. Additionally, current practice is to train language models on an enormous variety of content from the internet. The odds of any given subset of model data catastrophically interfering with our current alignment techniques cannot be that high, otherwise our current alignment techniques wouldn't work on our current models.

However, second order effects may be less predictable, especially longer term second-order effects of, e.g., training future models on the outputs of current models. Such iterative approaches appear to be gaining popularity, now that current LMs are good enough to do basic data curation tasks. In fact, one of the linked alignment approaches, ConstitutionalAI, is based on using LMs to rewrite texts that they themselves will then train on. Similar recent approaches include:

Although this potential fast takeoff mechanism more closely resembles the mechanisms of cultural development responsible for the human sharp left turn, I think there are still important differences that make a direct extrapolation from human evolutionary history inappropriate. Most prominently, a data refinement fast takeoff wouldn't coincide with exploiting the same sort of massive resource overhang that came into play during the human sharp left turn.

Additionally, I expect there are limits to how far AIs can improve their training data without having to run novel experiments and gather data different from their initial training data. I expect it will be difficult to extend their competency to a new domain without actually gathering new data from that domain, similar to how human scientific theory only progresses so far in the absence of experimental data from a new domain.

Conclusion

I think that evolution is a bad analogy for AI development. I previously argued as much in the context of inner alignment concerns, and I've also argued that evolution is actually very mechanistically different from the process of training an AI.

Our evolutionary history has all sorts of difficult-to-track details that really change how we should derive lessons from that history. In this post, the detail in question was the enormous disparity between the optimization strength of biological evolution versus brain-based within lifetime learning, leading to a giant leap in humanity's rate of progress, once within lifetime learning could compound over time via cultural transmission.

I've started to notice a common pattern in evolutionary analogies, where they initially suggest concerning alignment implications, which then seem to dissolve once I track the mechanistic details of what actually happened in the evolutionary context, and how that would apply to AI development. At this point, my default reaction to any evolutionary analogy about AI alignment is skepticism.

^[1] Other than "don't take automotive advice from insane space clowns", of course.
^[2] If you suspect that you've maybe accidentally developed an evolution-style inner optimizer, look for a part of your system that's updating its parameters ~a billion times more frequently than your explicit outer optimizer.
^[3]
- "inner optimizer" = the brain.
- "inner loss function" = the combination of predictive processing and reward circuitry that collectively make up the brain's actual training objective.
- "inner loss function includes no mention of human values / objectives" because the brain's training objective includes no mention of inclusive genetic fitness.
^[4] Reflects the enormous disparity in optimization strength between biological evolution and human within-lifetime learning, which I've been harping on about this whole post.
^[5] Evolution doesn't intervene in our within-lifetime learning processes if it looks like we're not learning the appropriate fitness-promoting behavior.
^[6] It's not even that I think human within-lifetime learning is that informative. It's just that I think "being more informative than evolution" is such a stupidly low bar that human within-lifetime learning clears it by a mile.
^[7] I do think there’s a lot of value in mechanistic interpretability as a source of evidence about the mechanics and inductive biases of SGD. For example, this paper discovered “name mover heads”, attention heads that copy a speaker’s name to the current token in specific contexts, and also discovered “backup name mover heads”, which are attention heads that don’t normally appear to act as name mover heads, but when researchers ablated the primary name mover heads, the backup name mover heads changed their behavior to act as name mover heads.

AlphaZero had autonomous learning—the longer you train, the better the model weights. Humans (and collaborative groups of humans) also have that—hence scientific progress. Like, you can lock a group of mathematicians in a building for a month with some paper and pens, and they will come out with more and better permanent knowledge of mathematics than when they entered. They didn’t need any new training data; we just “ran them” for longer, and they improved, discovering new things arbitrarily far beyond the training data, with no end in sight.

Today’s SOTA LLMs basically don't have an autonomous learning capability analogous to the above. Sure, people do all sorts of cool tricks with the context window, but people don’t know how to iteratively make the weights better and better without limit, in a way that’s analogous to AlphaZero doing self-play or a human mathematician doing math. Like, you can run more epochs on the same training data, but it rapidly plateaus. You can do the Huang et al. thing in an infinite loop, but I think it would rapidly go off the rails.

I don’t want to publicly speculate on what it would take for autonomous learning to take off in LLMs—maybe it’s “just more scale” + the Huang et al. thing, maybe it’s system-level changes, maybe LLMs are just not fit for purpose and we need to wait for the next paradigm. Whatever it is, IMO it’s a thing we’ll have eventually, and don’t have right now.

So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.

In the context of the OP fast-takeoff scenarios, you wrote “Takeoff is less abrupt; Takeoff becomes easier to navigate; Capabilities gains are less general”. I’m not sure I buy any of those for my autonomous-learning fast-takeoff scenario. For example, AlphaZero was one of the first systems of that type that anyone got to work at all, and it rocketed to superhuman; that learning process happened over days, not years or decades; and presumably “getting autonomous learning to work stably” would be a cross-cutting advance not tied to any particular domain.

So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.

Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations. For instance, suppose that data doesn't run out despite scaling and autonomous learning is moderately to considerably less efficient than supervised learning. Then, you'd just do supervised learning. Now, we can imagine fast takeoff scenarios where:

Scaling runs into data limitations
no one can think of any autonomous learning techniques for years
finally someone finds an algorithm which works really well (prior to anyone finding an algorithm which only works ok)
this results in a huge effective compute overhang
people are able to effectively scale up by 100x in a short period and this is sufficient to achieve takeover-capable AIs.

But this was just a standard fast takeoff argument. Here's a different version which doesn't refer to autonomous learning but is isomorphic:

People scale up inefficient algos (like transformers)
no one can think of any better techniques for years
finally someone finds an algorithm which works really well (prior to anyone finding an algorithm which only works somewhat better than the current techniques)
this results in a huge effective compute overhang
people are able to effectively scale up by 100x in a short period and this is sufficient to achieve takeover-capable AIs.

The reason you got fast takeoff in both cases is just sudden large algorithmic improvement. I don't see a particular reason to expect this in the autonomous learning case and I think the current evidence points to this being unlikely for capabilities in general. (This is of course a quantitative question: how big will leaps be exactly?)

Sure, people do all sorts of cool tricks with the context window, but people don’t know how to iteratively make the weights better and better without limit, in a way that’s analogous to AlphaZero doing self-play or a human mathematician doing math.

I don't think this is a key bottleneck. For instance, it wouldn't be too hard to set up LLMs such that they would improve at some types of mathematics without clear limits (just set them up in a theorem proving self play type setting much like the mathematicians). This improvement rate would be slower than the corresponding rate in humans (by a lot) and would probably be considerably slower than the improvement rate for high quality supervised data. Another minimal baseline is just doing some sort of noisy student setup on entirely model generated data (like here https://arxiv.org/abs/1911.04252).
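To make the "noisy student" baseline concrete, here is a minimal toy sketch of the teacher/pseudo-label/noised-student loop. It is not the linked paper's implementation: the 1-D threshold classifier, the Gaussian input noise, the function names, and the toy data are all assumptions chosen for illustration, and the student here is the same size as the teacher.

```python
import random

def fit_threshold(xs, ys):
    """Fit a toy 1-D classifier: the midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict(threshold, xs):
    """Label each point 1 if it lies above the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]

def noisy_student(labeled_x, labeled_y, unlabeled_x, rounds=3, noise=0.1, seed=0):
    """Run a few teacher -> pseudo-label -> noised-student rounds (hypothetical helper)."""
    rng = random.Random(seed)
    teacher = fit_threshold(labeled_x, labeled_y)
    for _ in range(rounds):
        # The teacher labels the clean unlabeled inputs.
        pseudo_y = predict(teacher, unlabeled_x)
        # The student trains on noised copies of those inputs plus the labeled data.
        noisy_x = [x + rng.gauss(0, noise) for x in unlabeled_x]
        student = fit_threshold(labeled_x + noisy_x, labeled_y + pseudo_y)
        # The student becomes the next round's teacher.
        teacher = student
    return teacher

labeled_x, labeled_y = [0.0, 1.0], [0, 1]
unlabeled_x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
threshold = noisy_student(labeled_x, labeled_y, unlabeled_x)
print(predict(threshold, unlabeled_x))
```

In this toy the loop only reproduces the teacher's decision boundary. Whether such a loop improves a model "without clear limits", and at what rate, is exactly the open question in this thread.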

Capabilities people have tons of ideas here, so if data is an actual limitation, I think they'll figure this out (as far as I know, there are already versions in use at scaling labs). No one has (publicly) bothered to work hard on autonomous learning because getting a lot of tokens is way easier and autonomous learning is probably just worse than working on data curation if you don't run out of data.

My guess is that achieving reasonably efficient things which have good scaling laws is 'just' a moderately large capabilities research project at OpenAI - nothing that special.

You probably take some sort of hit from autonomous learning instead of supervised, but it seems not too bad to make the hit <100x compute efficiency (I'm very unsure here). Naively I would have thought that getting within a factor of 5 or so should be pretty viable.

Perhaps you think there are autonomous learning style approaches which are considerably better than the efficiency on next token prediction?

Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations.

Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.

Option A: You need to spend the entire week reading AI literature. I choose what you read, and in what order, using a random number generator and selecting out of every AI paper / textbook ever written. While reading, you are forced to dwell for exactly one second—no more, no less—on each word of the text, before moving on to the next word.

Option B: You can spend your week however you want. Follow the threads that seem promising, sit and think for a while, go back and re-read passages that are confusing, etc.

It seems extremely obvious to me that you’d make more progress under Option B than Option A—like, massively, qualitatively more progress. Do you not share that intuition? (See also Section 1.1 here.)

(Note: this comment is rambly and repetitive, but I decided not to spend time cleaning it up)

It sounds like you believe something like: "There are autonomous learning style approaches which are considerably better than the efficiency on next token prediction."

And more broadly, you're making a claim like 'current learning efficiency is very low'.

I agree - brains imply that it's possible to learn vastly more efficiently than deep nets, and my guess would be that performance can be far, far better than brains.

Suppose we instantly went from 'current status quo' to 'AI systems learn like humans learn and with the same efficiency, but with vastly larger memories than humans (current LLMs seem to have vastly better memory at least for facts and technical minutiae), and vastly longer lifespans than humans (if you think a token corresponds to 1 second, then 10 trillion tokens is 317,098 years!)'. Then, we certainly get an extremely hard FOOM if anyone runs this training!
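The lifespan figure follows directly from the one-token-per-second assumption. A quick check, assuming a 365-day year (the year length and the helper name are assumptions of this sketch, not stated in the comment):

```python
# Convert a token count to years, assuming one token corresponds to one second
# (the comment's assumption) and a 365-day year (an assumption of this sketch).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def tokens_to_years(tokens: int, seconds_per_token: float = 1.0) -> float:
    """Return how many years it takes to process `tokens` at the given pace."""
    return tokens * seconds_per_token / SECONDS_PER_YEAR

years = tokens_to_years(10 * 10**12)  # 10 trillion tokens
print(round(years))  # 317098
```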

But this hypothetical just isn't what I expect.

Currently, SOTA deep learning is deeply inefficient in a bunch of different ways. Failing to do open ended autonomous learning to advance a field and then distilling these insights down to allow for future progress is probably one such failure, but I don't think it seems particularly special. Nor do I see a particular reason to expect that advances in open ended flexible autonomous learning will be considerably more jumpy than advances in other domains.

Right now, both supervised next token prediction and fully flexible autonomous learning are far less efficient than theoretical limits and worse than brains. But currently next token prediction is more efficient than fully flexible autonomous learning (as the main way to train your AI, next token prediction + some other stuff is often used).

Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.

In this hypothetical, I obviously would pick option B.

But suppose instead that we asked "How would you try to get current AIs (without technical advances) to most efficiently come up with new good experiments to try?"

Then, my guess is that most of the flops go toward next token prediction or a similar objective on a huge corpus of data.

You'd then do some RL(HF) and/or amplification to try and improve further, but this would be a small fraction of overall training.

As AIs get smarter, clever techniques to improve their capabilities further via 'self improvement' will continue to work better and better, but I don't think this clearly will end up being where you spend most of the flops (it's certainly possible, but I don't see a particular reason to expect this - it could go either way).

I agree that 'RL on thoughts' might prove important, but we already have shitty versions today. Current SOTA is probably like 'process based feedback' + 'some outcomes' + 'amplification' + 'etc'. Notably this is how humans do things: we reflect on which cognitive strategies and thoughts were good and then try to do more of that. 'thoughts' isn't really doing that much work here - this is just standard stuff. I expect continued progress on these techniques and that techniques will work better and better for smarter models. But I don't expect massive sharp left turn advancements for the reasons given above.

Here’s my takeaway:

There are mechanistic reasons for humanity’s “Sharp Left Turn” with respect to evolution. Humans were bottlenecked by knowledge transfer between new generations, and the cultural revolution allowed us to share our lifetime learnings with the next generation instead of waiting on the slow process of natural selection.

Current AI development is not bottlenecked in the same way and, therefore, is highly unlikely to get a sharp left turn for the same reason. Ultimately, evolution analogies can lead to bad unconscious assumptions with no rigorous mechanistic understanding. Instead of using evolution to argue for a Sharp Left Turn, we should instead look for arguments that are mechanistically specific to current AI development because we are much less likely to make confused mistakes that unconsciously rely on human evolution assumptions.

AI may still suffer from a fast takeoff (through AI driving capabilities research or iteratively refining its training data), but for AI-specific reasons, so we should be paying attention to how that kind of fast takeoff might happen and how to deal with it.

Edited after Quintin's response.

In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
Deliberately create a (very obvious) inner optimizer, whose inner loss function includes no mention of human values / objectives.
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.

I’ll bite! I think this happens if we jump up a level from “an AI developer” to “the world”:

Lots of different people and companies deliberately create a (very obvious) inner optimizer (i.e. a fresh new ML training run), whose inner loss function includes no mention of human values / objectives (at least sometimes—e.g. self-supervised learning, or safety-unconcerned people might try capabilities-oriented RL reward functions to try to beat a benchmark or just to see what happens etc.).
An outer optimizer exists here—the people doing the best on benchmarks will get their approaches copied, get more funding, etc. But the outer optimizer has billions of times less optimization power than the inner optimizer.
At least some of these people and companies (especially the safety-unconcerned ones) let the inner optimizer run freely without any supervision, limits, or interventions (OK sure, probably somebody is watching the loss function go down during training, but presumably it’s not uncommon to wait until a training run is “complete” before doing a rigorous battery of safety tests).

Some possible cruxes here are: (1) do these safety-unconcerned people (or safety-concerned-in-principle-but-failing-to-take-necessary-actions people) exist & hold influence, and if so will that continue to be true when AI x-risk is on the table? (I say yes—e.g. Yann LeCun thinks AI x-risk is dumb.) (2) Is it plausible that one group’s training run will have importantly new and different capabilities from the best relevant previous one? (I say yes—consider grokking, or algorithmic improvements, or autonomous learning per my other comment.)

This whole argument just does not hold.

(in animals)

The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.

This is clearly false. GPT4, can you explain? :

While genes play a significant role in transmitting information from one generation to the next, there are other ways in which animals can pass on information to their offspring. Some of these ways include:

Epigenetics: Epigenetic modifications involve changes in gene expression that do not alter the underlying DNA sequence. These changes can be influenced by environmental factors and can sometimes be passed on to the next generation.
Parental behavior: Parental care, such as feeding, grooming, and teaching, can transmit information to offspring. For example, some bird species teach their young how to find food and avoid predators, while mammals may pass on social behaviors or migration patterns.
Cultural transmission: Social learning and imitation can allow for the transfer of learned behaviors and knowledge from one generation to the next. This is particularly common in species with complex social structures, such as primates, cetaceans, and some bird species.
Vertical transmission of symbionts: Some animals maintain symbiotic relationships with microorganisms that help them adapt to their environment. These microorganisms can be passed from parent to offspring, providing the next generation with information about the environment.
Prenatal environment: The conditions experienced by a pregnant female can influence the development of her offspring, providing them with information about the environment. For example, if a mother experiences stress or nutritional deficiencies during pregnancy, her offspring may be born with adaptations that help them cope with similar conditions.
Hormonal and chemical signaling: Hormones or chemical signals released by parents can influence offspring development and behavior. For example, maternal stress hormones can be transmitted to offspring during development, which may affect their behavior and ability to cope with stress in their environment.
Ecological inheritance: This refers to the transmission of environmental resources or modifications created by previous generations, which can shape the conditions experienced by future generations. Examples include beaver dams, bird nests, or termite mounds, which provide shelter and resources for offspring.

(/GPT)

Actually, transmitting some of the data gathered during the lifetime of the animal to the next generation by some other means is so obviously useful that it is highly convergent. Given the fact it is highly convergent, the unprecedented thing which happened with humans can't be the thing proposed (evolution suddenly discovered how not to sacrifice all that's learned during the lifetime).

Evolution's sharp left turn happened because evolution spent compute in a shockingly inefficient manner for increasing capabilities, leaving vast amounts of free energy on the table for any self-improving process that could work around the evolutionary bottleneck. Once you condition on this specific failure mode of evolution, you can easily predict that humans would undergo a sharp left turn at the point where we could pass significant knowledge across generations. I don't think there's anything else to explain here, and no reason to suppose some general tendency towards extreme sharpness in inner capability gains.

If the above is not enough to see why this is false... This hypothesis would also predict civilizations built by every other species which transmits a lot of data e.g. by learning from parental behaviour - once evolution discovers the vast amounts of free energy on the table this positive feedback loop would just explode.

This isn't the case => the whole argument does not hold.

Also, this argument not working does not imply evolution provides strong evidence for a sharp left turn.

What's going on?

In fact in my view we do not actually understand what exactly happened with humans. Yes, it likely has something to do with culture, and brains, and there being more humans around. But what's the causality?

Some of the candidates for "what's the actually fundamental differentiating factor and not a correlate":

- One notable thing about humans is, it's likely the second time in history a new type of replicator with R>1 emerged: memes. From a replicator-centric perspective on the history of the universe, this is the fundamental event, starting a different general evolutionary computation operating at a much shorter timescale.

- Machiavellian intelligence hypothesis suggests that what happened was humans entered a basin of attraction where selection pressure on "modelling and manipulation of other humans" leads to explosion in brain sizes. The fundamental thing suggested here is you soon hit diminishing returns for scaling up energy-hungry predictive processing engines modelling a fixed-complexity environment - soon you would do better by e.g. growing bigger claws. Unless... you hit the Machiavellian basin, where sexual selection forces you to model other minds modelling your mind ... and this creates a race, in an environment of unbounded complexity.

- Social brain hypothesis is similar, but the runaway complexity of the environment is just because of the large social groups.

- Self-domestication hypothesis: this is particularly interesting and intriguing. The idea is humans self-induced something like domestication selection, selecting for pro-social behaviours and reduction in aggression. From an abstract perspective, I would say this allows emergence of super-agents composed of individual humans, more powerful than individual humans. (once such entities exist, they can create further selection pressure for pro-sociality)

or, a combination of the above, or something even weirder

The main reason why it's hard to draw insights from evolution of humans to AI isn't because there is nothing to learn, but because we don't know why what happened happened.

It seems to me that the key threshold has to do with the net impact of meme replication:

Transmitting a meme imposes some constraint on the behaviour of the transmitting system.
Transmitting a meme sometimes benefits the system (or descendants).

Where the constraint is very limiting, all but a small proportion of memes will be selected against. The [hunting technique of lions] meme is transferred between lions, because being constrained to hunt is not costly, while having offspring observe hunting technique is beneficial.
This is still memetic transfer - just a rather uninteresting version.

Humans get to transmit a much wider variety of memes more broadly because the behavioural constraint isn't so limiting (speaking, acting, writing...), so the upside needn't meet a high bar.

The mechanism that led to hitting this threshold in the first place isn't clear to me. The runaway behaviour after the threshold is hit seems unsurprising.

Still, I think [transmission became much cheaper] is more significant than [transmission became more beneficial].

Without this evolutionary analogy, why should we even elevate the very specific claim that 'AIs will experience a sudden burst of generality at the same time as all our alignment techniques fail.' to consideration at all, much less put significant weight on it?

I see this as a very natural hypothesis because alignment requires a hopeful answer to the question:

“Given that an agent has ontology / beliefs X and action space Y, will it do things that we want it to do?”

If the agent experiences a sudden burst of generality, then it winds up with a different X, and a different (and presumably much broader) Y. So IMO it’s obviously worth considering whether misalignment might start at such a point.

(Oh wait, or are you saying that “AIs will experience a sudden burst of generality” is the implausible part? I was assuming from the boldface that you see “…at the same time…” as the implausible part.)

However, the process responsible for human breadth of generality was not some small architectural modification evolution made to the human brain. It was humanity's cross-generational process of expanding and improving our available "training data" to cover a broader and broader range of capabilities across many domains (a process we sometimes call "science"). The evolutionary analogy thus offers no reason to expect sudden jumps in generality without corresponding extensions of the training data.

I think this is true but wish it were hedged as "I think it was not", since we have not in fact observed that the small architectural modification didn't cause it, nor do I think we have a huge pile of evidence towards that. I think we have a reasonably strong amount of evidence for that conclusion.

I've started to notice a common pattern in evolutionary analogies, where they initially suggest concerning alignment implications, which then seem to dissolve once I track the mechanistic details of what actually happened in the evolutionary context, and how that would apply to AI development. At this point, my default reaction to any evolutionary analogy about AI alignment is skepticism.

I agree. Perhaps the alignment field would be better off if we'd never thought about evolution at all, and instead had modelled the learning dynamics directly. Don't think about AIXI, don't think about evolution, think about what your gradient update equations might imply, and then run experiments to test that.

The generation dies, and all of the accumulated products of within lifetime learning are lost.

This seems obviously false. For animals that do significant learning throughout their lifetimes, it's standard for exactly the same "providing higher quality 'training data' for the next generation..." mechanism to be used.

The distinction isn't in whether learned knowledge is passed on from generation to generation. It's in whether the n+1th generation tends to pass on more useful information than the nth generation. When we see this not happening, it's because the channel is already saturated in that context.

I guess that this is mostly a consequence of our increased ability to represent useful behaviour in ways other than performing the behaviour: I can tell you a story about fighting lions without having to fight lions. (that and generality of intelligence and living in groups).

(presumably unimportant for your overall argument - though I've not thought about this :))

