Lifetime energy costs are already significant, but I don't think the problem will get that skew this decade. IRDS' predicted transistor scaling until ~2028 should prevent power density increasing by too much.
Longer-term this does become a greater concern. I can't say I have particularly wise predictions here. There are ways to get more energy efficiency by spending more on lower-clocked hardware, or by using a larger memory:compute ratio, and there are also hardware architectures with plausible significant power advantages. There are even potential ways for energy to fall in price, like with solar PV or fusion, though I haven't a good idea how far PV prices could fall, and for fusion it seems like a roll of the dice what the price will be.
It's entirely possible energy does just become the dominant cost and none of those previous points matter, but it's also an input we know we can scale up pretty much arbitrarily if we're willing to spend the money. It's also something that only starts to become a fundamental economic roadblock after a lot more scaling. For instance, the 100,000 wafer scale processor example requires a lot of power, but only about as much as largest PV installations that currently exist. You could then upgrade it to 2028 technology and stack memory on top of the wafers without changing power density by all that much.
This is likely a topic worth periodically revisiting as the issue gets closer.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
Moore's Law is not dead. I could rant about the market dynamics that made people think otherwise, but it's easier just to point to the data.
Moore's Law might die in the short future, but I've yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero's NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it's probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it's much harder to say they will all fail.