As far as we can tell, bacteria were the first lifeforms on Earth. Which means they’ve had a full four billion years to make something of themselves. And yet, despite their long evolutionary history, they mostly still look like this:

Bacteria belong to one major class of cells—prokaryotes.^[1] The other major class of cells, eukaryotes, arrived about one billion years after bacteria. But despite their late start, they are vastly more complex.

Prokaryotes mostly contain only DNA and the machinery for transcribing and translating it. Eukaryotes, on the other hand, contain a huge variety of internal organelles that run all kinds of specialized processes: lysosomes digest, vesicles transport, cytoskeletons offer structural support, etc.

Not only that, but all multicellular life is eukaryotic.^[2] Every complex organism evolution has produced—eukaryotic. Trees, humans, worms, giant squid, dogs, insects—eukaryotic. Somehow, eukaryotes managed to blossom into all of these complex forms, while bacteria steadfastly remained single-celled, simple, and small. Why?

The short answer is that prokaryotes have vastly less DNA than eukaryotes—four to five orders of magnitude less, on average—and hence can’t do nearly as much stuff.^[3] The long answer is the rest of this post, which investigates two related questions: first, why are eukaryotic genomes so long? And second, how exactly does more DNA allow for more complexity?

Why Are Eukaryotic Genomes So Long?

Scalable Energy Production

Using DNA—replicating, transcribing, and translating it into proteins—isn’t free. Cells need energy (such as ATP) to power these reactions and, all else equal, longer genomes will require more of it.^[4]

Both prokaryotes and eukaryotes pay similar energetic costs to maintain genes. The difference is that eukaryotes have way more energy and hence can afford to have longer genomes. But why this disparity?

Prokaryotes generate ATP along their cell membrane. This means that as they increase in size, their surface area (and hence their energy production) scales sublinearly with their volume: for a roughly spherical cell, surface area grows with the square of the diameter while volume grows with the cube. So a prokaryote that doubles in diameter, for example, will end up producing only half as much ATP per unit volume. Because prokaryotes become less metabolically efficient as they get bigger, most are quite small: six orders of magnitude smaller than eukaryotes, on average.^[5]

There are some exceptions. For instance, individual bacteria in the genus Thiomargarita can reach up to one centimeter in size, visible to the naked eye! But its cell structure suggests this is the exception that proves the rule: roughly 73% of its volume is a vacuole,^[6] essentially empty space. So in effect, evolution expanded its surface area without concomitantly expanding its functional volume. A neat trick!

But how do eukaryotes avoid this surface area constraint? Well, eukaryotes generate energy using mitochondria, which are inside the cells. As a result, their number of mitochondria—and hence their energy production—scales with their volume. This allows them to afford both larger cell sizes than prokaryotes, and also longer genomes.^[7]
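
The contrast between the two energy strategies can be put into numbers. The following is a minimal sketch, assuming an idealized spherical cell, ATP output strictly proportional to membrane area for prokaryotes, and mitochondria packed at a constant density for eukaryotes. The function names and the unit constants are illustrative choices, not measured values.

```python
import math

def sphere_area(diameter):
    """Surface area of a sphere."""
    return math.pi * diameter ** 2

def sphere_volume(diameter):
    """Volume of a sphere."""
    return math.pi * diameter ** 3 / 6

def membrane_atp_per_volume(diameter, atp_per_area=1.0):
    """Prokaryote-style: ATP is made on the membrane, so output tracks area.

    atp_per_area is an arbitrary illustrative constant.
    """
    return atp_per_area * sphere_area(diameter) / sphere_volume(diameter)

def mitochondrial_atp_per_volume(diameter, atp_per_volume=1.0):
    """Eukaryote-style: mitochondria fill the interior, so output tracks volume.

    atp_per_volume is an arbitrary illustrative constant.
    """
    return atp_per_volume * sphere_volume(diameter) / sphere_volume(diameter)

# Compare each strategy to a cell of diameter 1 as the cell grows.
for d in (1.0, 2.0, 4.0, 8.0):
    membrane = membrane_atp_per_volume(d) / membrane_atp_per_volume(1.0)
    mito = mitochondrial_atp_per_volume(d) / mitochondrial_atp_per_volume(1.0)
    print(f"diameter x{d:g}: membrane {membrane:.3f}, mitochondria {mito:.3f}")
```

Under these assumptions the membrane-based figure falls as one over the diameter, while the mitochondrial figure stays flat no matter how large the cell gets.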

Tolerance for Junk

But bioenergetic constraints aren't the whole story. Even leaving aside the direct energy costs, prokaryotes face way more selection pressure toward having short genomes.

Empirically, bacteria are very quick to rid themselves of genes once they're no longer useful. For example, if you insert DNA that confers antibiotic resistance into a bacterium, it will keep those genes as long as antibiotics are around. But once you remove the antibiotics, it will jettison that DNA within a few hours.^[8]

Eukaryotic DNA, on the other hand, is much more weakly selected against. While bacteria are sensitive to additions of DNA fewer than ten base pairs in length, eukaryotes will keep additions of over ten thousand around indefinitely, even if they’re useless.^[9]

But why is selection so much weaker among eukaryotes? The main reason is that they have very small population sizes relative to bacteria, and the smaller the population size, the more the species’ genome will be determined by chance. This sentence requires a bit of unpacking.

What does it mean for a genome to be determined “by chance”? There are, generally speaking, two ways by which new genes can spread throughout the population. They can be actively selected for, or they can propagate purely by random events (also referred to as genetic drift).

The likelihood that a gene spreads through the population by chance alone is inversely related to the size of the population. After all, the gene faces the same probability of propagation at each reproduction event, so the more individuals, the more unlikely it is to reach all of them. Conversely, the smaller the population size, the more likely genes are to spread by chance.^[10]
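
One way to check this claim is to simulate it. The sketch below assumes a haploid Wright-Fisher model (fixed population size, non-overlapping generations, every individual in the next generation copying a random parent), which is a standard textbook idealization rather than anything specific to this post. The function names, trial count, and seed are arbitrary.

```python
import random

def fixes_by_drift(pop_size, rng):
    """Follow one new neutral gene until it is lost or reaches everyone.

    Returns True if it spreads to the whole population, False if it dies out.
    """
    copies = 1  # the new gene starts out in a single individual
    while 0 < copies < pop_size:
        freq = copies / pop_size
        # Each member of the next generation inherits from a random parent.
        copies = sum(rng.random() < freq for _ in range(pop_size))
    return copies == pop_size

def estimate_fixation_probability(pop_size, trials=4000, seed=0):
    """Fraction of independent trials in which the gene takes over by chance."""
    rng = random.Random(seed)
    fixed = sum(fixes_by_drift(pop_size, rng) for _ in range(trials))
    return fixed / trials

for n in (10, 20, 40):
    print(f"N={n}: simulated {estimate_fixation_probability(n):.3f}, 1/N = {1 / n:.3f}")
```

The simulated fraction should land close to 1/N for each population size, so doubling the population roughly halves the odds that a neutral gene spreads by luck alone.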

Of course, selection will still promote genes with high fitness and cull those of low fitness. But in the vicinity of neutral fitness, the effects of chance begin to dominate a gene’s fate. And as population sizes decrease, this vicinity grows, and the fate of more genes will be determined by chance. Put differently, as population sizes decrease, selection becomes a weaker force.

Prokaryotes have vastly larger population sizes than eukaryotes,^[11] which means that their genomes are under stronger selection—any non-immediately useful gene is quickly discarded. Eukaryotic organisms, on the other hand, often have such small population sizes that large portions of their genomes evolve almost as if natural selection were entirely absent. This means that even mildly deleterious stretches of "junk" DNA will tend to stick around and accumulate.
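
To get a feel for how much population size matters, here is a rough sketch using Kimura's diffusion approximation for the fixation probability of a single new mutant in a haploid population. That formula is a standard population-genetics result, not something derived in this post, and the selection cost of one part in ten million is an arbitrary illustrative value. The population sizes are the effective sizes quoted in the footnote.

```python
import math

def fixation_probability(s, pop_size):
    """Approximate chance that one new mutant copy spreads to everyone.

    s is the selection coefficient (s < 0 harmful, s = 0 neutral).
    Uses Kimura's diffusion approximation for a haploid population:
    (1 - exp(-2s)) / (1 - exp(-2*N*s)).
    """
    if s == 0:
        return 1 / pop_size
    return math.expm1(-2 * s) / math.expm1(-2 * pop_size * s)

def relative_to_neutral(s, pop_size):
    """Fixation chance as a multiple of the neutral expectation 1/N."""
    return fixation_probability(s, pop_size) * pop_size

cost = -1e-7  # illustrative cost of carrying a stretch of junk DNA
for label, n in (("vertebrate", 1e4), ("yeast", 1e7), ("bacterium", 1e9)):
    print(f"{label:10s} N={n:.0e}: {relative_to_neutral(cost, n):.3g} x neutral")
```

With these assumed numbers, the same slightly harmful insertion behaves almost exactly like a neutral one in the smallest population, is noticeably disfavored in the yeast-sized one, and has essentially no chance of spreading in the bacterial one.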

But why, one might wonder, do eukaryotes have such small population sizes? The main reason seems to be that eukaryotic organisms are much bigger, and bigger organisms tend to have smaller population sizes. In general, if an organism becomes twice as large, the overall population of such organisms will halve.^[12] I don’t know why this is true, although I suspect it has to do with food supply: as organisms get bigger, they each need more energy from their environment and there is only so much to go around, so the total number of individuals shrinks.

How Do Longer Genomes Enable More Complexity?

So, eukaryotic genomes will tend to keep superfluous DNA around. But how does that DNA actually get there in the first place? After all, if mutations are just as likely to be deletions as they are insertions, then the net effect on genome length should be zero.

Unfortunately for everyone, one of the main ways DNA lengthens is that genetic parasites called transposons^[13] copy and paste themselves throughout the genome. Indeed, at least 45% of the human genome is the result of these fuckers.^[14] The other major source of genome expansion is thankfully less depressing—just some accidental duplication, typically of a single gene, but occasionally of an entire chromosome!

As we saw above, prokaryotes don’t stand for this kind of fuckery—any bit of DNA which isn’t immediately useful is typically discarded. Thus, parasitic DNA and accidental duplications are very unlikely to accumulate.

Eukaryotes, on the other hand, will keep loads of this random, completely unhelpful DNA around. So while nearly all DNA in prokaryotes is protein-coding, in eukaryotes—and especially in highly complex organisms like humans—sometimes as little as 1% of the genome codes for any protein.

Eukaryotic Monopolies

At this point you might be wondering, wait, wasn’t more DNA supposed to entail more complexity? It seems like eukaryotes just got the shit end of the stick—a ton of parasitic junk DNA that is at best useless and at worst mildly deleterious, but just barely not deleterious enough for selection to notice. This is supposed to be… good?

Yes and no. While it’s true that a huge fraction of eukaryotic DNA is almost certainly just “junk,” the slack on their genome size also creates more opportunity for innovation. It is precisely because eukaryotic DNA is long and under little selection pressure that it has the chance to evolve useful secondary adaptive changes later on.

Peter Thiel argues that monopolies often actually spur innovation. While companies caught in cut-throat markets need immediate returns, monopolies have the financial freedom to pursue basic research that might pay off in the future. Eukaryotes, in this borrowed metaphor, are more monopolistic, innovating over longer time horizons, while bacteria are limited to lives of myopia, only creating products which can pay off quickly.

But how is it that these initially-useless mutations come to pay off later? And what kinds of innovations did eukaryotic genomes “invent”?

Duplication and Divergence

At a high level, the main difference between prokaryotic and eukaryotic DNA is that the latter has way more “software.” What is “software” in a genome? As a very crude analogy: genes which code directly for proteins can be thought of as hardware, and genes which regulate what those protein-coding genes do can be thought of as software.^[15]

One way that eukaryotic genomes can acquire this software is through a process called “duplication and divergence.” The basic idea is that sometimes a regulatory gene is accidentally duplicated. In bacterial genomes, these duplications would almost certainly be quickly deleted—it’s just extra clutter that costs precious energy to maintain for no additional benefit. But eukaryotic DNA is fine with all of these mildly deleterious additions! So these duplications can stick around, being redundant for a while.

But slowly, this might start to change. Let’s say, for example, that there is a protein-coding gene which codes for a stress hormone, and a regulatory gene which can turn that gene “on or off,” i.e., control when that hormone is created. Originally, this regulatory gene only activates the hormone in liver cells in the presence of toxins. But once the regulator has been duplicated, the second copy may begin to mutate away from its original function. Perhaps the mutation now causes it to activate the same hormone in a different cell type—in skin cells when they’ve been bruised, for example. In this way, duplicated regulators may occasionally “diverge,” i.e., take on new functional roles.^[16]
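
The liver-and-skin story can be written out as a toy data structure. This is only a sketch: the `Regulator` class, its fields, and the gene name are hypothetical stand-ins invented for illustration, and real regulation is far messier than a lookup by cell type and trigger.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Regulator:
    cell_type: str  # where the regulator is active
    trigger: str    # the condition it responds to
    target: str     # the protein-coding gene it switches on

def expressed_genes(genome, cell_type, signals):
    """Genes switched on in a given cell type, given the signals present."""
    return {
        r.target
        for r in genome
        if r.cell_type == cell_type and r.trigger in signals
    }

# Original state: one regulator, active only in liver cells exposed to toxins.
genome = [Regulator("liver", "toxin", "stress_hormone")]

# Duplication: an exact, redundant copy appears.
genome.append(genome[0])

# Divergence: mutations shift the copy to a new context.
genome[1] = replace(genome[1], cell_type="skin", trigger="bruise")

print(expressed_genes(genome, "liver", {"toxin"}))
print(expressed_genes(genome, "skin", {"bruise"}))
print(expressed_genes(genome, "skin", {"toxin"}))
```

The protein-coding gene itself never changes here. All of the novelty comes from the copied regulator drifting into a new context while the original keeps doing its old job.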

Duplication and divergence is one mechanism whereby software proliferates in eukaryotic genomes. But eukaryotes also have better software. In particular, eukaryotic genomes have more modularity and higher-level abstractions. Prokaryotes, on the other hand, cannot get past the duplication step, and so these routes to complexity are unavailable to them.

Modularity

To the extent that prokaryotic genomes do have regulatory elements, these often control “operons,” sets of genes which are all turned on or off at the same time. In other words, none of the genes in the set can act independently of the others.

Operons typically achieve a single function. For example, E. coli prefers glucose as its energy source, but when glucose is in short supply it will switch to using lactose. This is controlled by the lactose operon: when regulatory elements sense that glucose is absent and lactose is present,^[17] they will trigger the expression of a set of genes which jointly work to create the proteins necessary to digest lactose.^[18]

Eukaryotic genomes are almost entirely devoid of operons. Instead, their genes are modular, in the sense that any single gene is typically used in many different operations, rather than being part of one functional unit. This sort of modularity enables the underlying “hardware” to be used much more flexibly.

And this flexibility is a large part of what allows for multicellularity. Different cell types (e.g. neurons, liver cells) within the same organism are defined not by different genomes, which are the same in all cells, but by different patterns of gene expression. These patterns can be staggeringly complex, but at a very basic level they are just regulatory genes processing information (e.g., sensing that lactose is present) and controlling the timing and level of expression of protein-coding genes (or other regulatory genes!) in response.

Because eukaryotic genes are so modular, the space of possible genetic patterns is massive, and hence the space of possible phenotypic structures is massive, too. When you combine this with the fact that prokaryotic genomes contain far fewer protein-coding genes (and even fewer regulatory ones), the difference in the number of phenotypic possibilities is truly staggering.

At its core, multicellularity results from different patterns of gene expression. Technically, this is possible with operons: just mix and match them in different ways! But this tool is far more crude. Instead of being able to adjust patterns at the gene level (and hence the individual protein level), patterns which utilize operons are restricted to operating at the circuit level (e.g., digest lactose), and hence the space of possible prokaryotic phenotypes is dramatically reduced.
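
A back-of-the-envelope count shows how quickly the gap opens up. The sketch below assumes the crudest possible model, in which every independently controlled unit is simply on or off, and uses a made-up genome of 12 genes grouped into operons of 3. Neither number comes from a real organism.

```python
def on_off_patterns(num_independent_switches):
    """Number of distinct on/off patterns for independently controlled units."""
    return 2 ** num_independent_switches

NUM_GENES = 12        # illustrative, not a real genome
GENES_PER_OPERON = 3  # illustrative

num_operons = NUM_GENES // GENES_PER_OPERON

operon_level = on_off_patterns(num_operons)  # switches act on whole operons
gene_level = on_off_patterns(NUM_GENES)      # switches act on single genes

print(f"operon-level control: {operon_level} patterns")
print(f"gene-level control:   {gene_level} patterns")
```

Even in this tiny example the same twelve genes give 16 possible patterns when controlled as four operons and 4,096 when controlled one by one, and the ratio grows exponentially with the number of genes.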

How might modularity emerge from the duplicate and diverge mechanism? It’s the same idea as the liver and skin example explained above. A regulator controls a protein-coding gene in one context (the liver) and a duplicated regulator eventually mutates to the point that it controls it in a different context (the skin). Now the gene is being used independently in two different patterns: it’s modular!

Abstractions

In addition to modularity, eukaryotic genomes also operate at higher levels of abstraction. For instance, there are the so-called “master genes” which can induce macroscopic body structure. One example is the Pax6 regulatory gene. Forcing this single gene to be expressed where it normally wouldn’t be—e.g., in the legs or abdomens of flies—causes an entire eye to form there.^[19]

This means evolution does not have to work from scratch every time to create new body plans. Instead, small tweaks can be made to high-level variables (like master genes) to create novel macroscopic structure, such as adding additional legs, or moving the position of eyes. Indeed, the diversity of body plans among animals seems to stem primarily from using the same underlying regulatory genes in different patterns.^[20]

How might these higher-level abstractions arise in practice? I'm not sure, but here's a (speculative) sketch of my guess:

Say you have two genes, each of which makes a particular protein. The first gene creates a hormone that causes food-seeking behavior; the second an enzyme that breaks down glucose. And suppose these two functions happen to be synergistic, in the sense that whenever the cell is stressed it should both stop seeking food and stop using glucose.

At first these two genes were regulated separately. But suppose one of the regulators gets duplicated, and then eventually mutates, such that it gains the ability to control both genes at once. Now, many sub-functions are regulated by a single gene, enabling it to represent information at a higher level of abstraction, e.g., about the overall stress level of the organism.

At this point the process can repeat, at increasingly higher levels of abstraction. Eventually long chains of these regulators regulating other regulators might form, enabling hierarchies of subroutines that master genes can control—affording single genes the ability to create entire macroscopic structures, like eyes.
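
As an equally speculative illustration, a hierarchy of regulators can be modeled as a small directed graph in which switching on one gene switches on everything beneath it. The gene names below are hypothetical placeholders, not real genes, and the sketch ignores timing, dosage, and repression entirely.

```python
def downstream_genes(regulator, network):
    """Every gene switched on, directly or indirectly, by `regulator`."""
    activated = set()
    stack = [regulator]
    while stack:
        gene = stack.pop()
        for target in network.get(gene, []):
            if target not in activated:  # also guards against feedback loops
                activated.add(target)
                stack.append(target)
    return activated

# Hypothetical regulatory hierarchy: regulators regulating other regulators.
network = {
    "master_eye_gene": ["lens_regulator", "retina_regulator"],
    "lens_regulator": ["lens_protein"],
    "retina_regulator": ["photoreceptor_protein", "pigment_protein"],
}

print(sorted(downstream_genes("master_eye_gene", network)))
print(sorted(downstream_genes("lens_regulator", network)))
```

Flipping the single gene at the top reaches every protein at the bottom, which is the sense in which one gene can stand in for a whole structure. A mid-level regulator reaches only its own subroutine.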

Prokaryotes don’t have these high-level genes for the obvious reason—there is nothing high-level to control! But this answer passes the buck. There is nothing high-level for them to control because they don’t have the tools to build complex structures in the first place. Bacteria didn’t really get modularity or high-level abstractions—bacteria got hardly anything—because they can’t keep useless genes around long enough for secondary adaptive changes to emerge later on.

So, Why Are Bacteria So Simple?

We can now finally say why bacteria are so embarrassingly simple. There are two pressures keeping their genomes short—lack of energy and strong selection—and this shortness does indeed limit their ability to build complex structures. Without the slack to explore a wider range of regulatory possibilities, their software is stunted—indeed, almost non-existent. Couple this with the fact that they have far fewer coding genes to begin with, and we have our result: four billion years of potential with only some tiny boring blobs to show for it.

And while eukaryotic genomes might, at first glance, seem undesirable—bloated with junk which is at best redundant and at worst parasitic—with that bloat comes length, and length can do all kinds of wonders over evolutionary time. Not only do longer genomes give eukaryotes the chance to accumulate more protein-coding genes, but they also enable software upgrades like modularity and higher-level abstractions. And from these, complexity follows: as software becomes more hierarchical, with more flexibility over the underlying hardware, innovations such as multicellularity and the staggering diversity of body plans become possible.

It’s kind of insane how complex eukaryotic organisms got, considering that they started out as tiny little blobs stuck in the muck, just like bacteria. To be sure, they’ve been around a long time, but as we’ve seen time isn’t all that matters—bacteria had a one billion year head start and yet stuck in the muck they remain. Biological complexity, it seems, is about more than just time and chance: it is about exploration, the ability to try, but more importantly the ability to fail, so that when something useful does finally come around, it can be seized upon. So that eukaryotes can slowly claw their way out of the muck, can complexify…

… while their bacterial cousins continue to press “exploit” for eternity.

Thank you to Adam Scholl for invaluable writing feedback and for countless fun and thoughtful conversations (I promise I’ll stop talking about bacteria now :p), to Alexander Gietelink Oldenziel for suggesting really useful books and papers on the topic, for helping think through some of the trickier population genetics claims, and for feedback on earlier drafts, and thank you to Siddharth Hiregowdara for feedback on a previous draft.

^[1]
The technical distinction is that eukaryotes have a nucleus, a small compartment which holds the DNA. Prokaryotes don’t; their DNA floats freely around inside the cell.
^[2]
There actually are some cases of prokaryotic multicellularity. For instance, bacteria will sometimes aggregate together in what are called biofilms. Biofilms can exhibit cooperative behavior: e.g., cells on the inside of the film will send out signals of starvation to the outside, causing the exterior cells to halt activity and wait until the interior ones are fed. There are also cases of cell specialization among prokaryotes. For instance, in one bacterial genus (Nostoc) there is a cell type which has specialized to fix nitrogen. But aside from the occasional multicellular blob, bacteria mostly remain single-celled and simple creatures. They never saw the explosion of intelligence, complexity, and diversity of body plans that eukaryotes did.
^[3]
There are many estimates of this, but e.g.: mitochondria “enabled a roughly 200,000-fold rise in genome size compared with bacteria” (Lane and Martin).
^[4]
These costs are not trivial. As a rough estimate: an average E. coli gene costs around one ten-thousandth of the cell’s total energy budget to maintain. Given that there are around 4,000 genes in E. coli, this rough arithmetic (4,000 × 1/10,000 = 0.4) puts the entire genome at a few tenths of its total energy budget (Lynch and Marinov).
^[5]
This is only the disparity between prokaryotes and single-celled eukaryotes; a similar disparity exists between single-celled eukaryotes (such as yeast) and multi-celled ones. (See for example The Origins of Genome Architecture, page 83, under the section “The Three Genomic Perils of Evolving Large Body Sizes”).
^[6]
“Cells showed a large central vacuole which accounted for 73.2 ± 7.5 % of total volume” (Volland et al.).
^[7]
The full argument is made in Nick Lane’s book Power, Sex, and Suicide: Mitochondria and the Meaning of Life. See, in particular, the chapter “The Foundations of Complexity.”
^[8]
See Nick Lane’s book Power, Sex, and Suicide, page 118, section “Balancing gene loss and gain in bacteria.”
^[9]
See the section on “Gene Structural Costs” in Lynch and Marinov.
^[10]
In particular, the probability that a gene spreads throughout the population by chance alone is equal to 1/N, where N is the effective population size (explained in the next footnote). To see why, consider that each new allele introduced to the population has the same chance of going to fixation (ignoring selective forces). Since they all have the same chance, and since there are N copies of different alleles currently in the population, the chance that this new, unique allele spreads to everyone by chance alone is 1/N. Thanks to Alexander Gietelink Oldenziel for this argument.
^[11]
Prokaryotes typically have effective population sizes of around 10^9. The effective population size takes into account factors like sex ratio, geographic distribution, etc., and so is often much smaller than the total population size. Eukaryotic organisms have effective population sizes ranging from 10^4 for vertebrates like us to 10^7 for single cells like yeast (Lynch and Marinov). To be clear, these variations in population size have real, substantial effects on the force of selection; refer to the paper for a great in-depth analysis of this.
^[12]
See Michael Lynch’s book The Origins of Genome Architecture, page 84, under the section “Smaller population size.”
^[13]
These go by many other names, e.g.: jumping genes, selfish genes, and mobile genetic elements.
^[14]
See Sean Eddy.
^[15]
This gets quite complicated, but you are presumably in the footnote section for more complicated answers, so I will venture to tell you. For one, the hardware/software distinction is not totally apt. After all, the protein-coding genes are “codes” for the protein, not the protein itself. Ah well, for a very crude analogy it probably holds up.
Second, what does it mean for a regulatory gene to “control” what a protein-coding gene does? Well, every protein-coding gene has what is called a promoter sequence that sits directly upstream of the gene. This sequence does not code for proteins; it is just a stretch of DNA which is particularly attractive to the molecules which start the “coding for protein” process. So, whenever the promoter is exposed, the protein-coding gene is more likely to produce proteins. And conversely, whenever it is blocked, the gene can’t produce proteins. So genes can be turned “on or off” by blocking or unblocking the promoter.
Regulatory genes do just this: they can, e.g., create proteins which serve the functional purpose of blocking the promoter. Or they can be non-coding, e.g., stretches of DNA like the promoter that bind well to molecules which affect genetic expression. This is somewhat unfortunate, since it means there is not a clean distinction between “protein-coding” and “regulatory,” as some regulatory genes do make proteins! Gah. In general, though, it’s my impression that regulatory elements tend to be of the non-coding variety.
^[16]
See Michael Lynch’s book The Origins of Genome Architecture, page 294, under the section “The passive emergence of modularity.”
^[17]
Here, what it means for a regulatory element to “sense” is that a protein created by a regulatory gene can bind to lactose, and when it does, that protein changes shape such that it detaches from the promoter sequence (see footnote 15 for more details) and the operon can be expressed.
^[18]
See Essential Cell Biology (fifth edition), page 294, under “How Transcription is Regulated” for more details on this example and operons in general.
^[19]
See From DNA to Diversity, page 29, section on “Field-specific selector genes” for more information and pictures of this process.
^[20]
See From DNA to Diversity for many such examples, in particular the section “Sharing of the genetic toolkit among animals.”