Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale out to clusters that are too large, or rely primarily on high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
Moore's Law is not dead. I could rant about the market dynamics that made people think otherwise, but it's easier just to point to the data.
Moore's Law might die in the near future, but I've yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero's NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion-parameter models, using only technology already on our roadmap.
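To sanity-check the ">10x from 4 shrinks" arithmetic, here is a rough sketch. The per-shrink gains are illustrative assumptions, not foundry data, and node names are marketing labels rather than physical dimensions, so treat this as a back-of-envelope compounding check only.

```python
def cumulative_density_gain(per_shrink_gain: float, shrinks: int) -> float:
    """Total density multiplier after `shrinks` shrinks of equal gain."""
    return per_shrink_gain ** shrinks


# Node sequence quoted in the text; each adjacent pair is one shrink.
nodes = ["16nm", "10nm", "7nm", "5nm", "3nm"]
shrinks = len(nodes) - 1

# Hypothetical per-shrink density gains (assumptions, not measurements).
for gain in (1.5, 1.8, 2.0):
    total = cumulative_density_gain(gain, shrinks)
    print(f"{gain}x per shrink over {shrinks} shrinks -> {total:.1f}x total")
```

Under these assumptions, anything around 1.8x per shrink or better compounds to more than 10x over four shrinks, while 1.5x per shrink falls short.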
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it's probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it's much harder to say they will all fail.
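The "hard to say they will all fail" intuition can be put in toy numbers. This sketch assumes each technology succeeds independently with the same probability, which is a strong simplification (real technologies share fabs, funding, and physics), and the probabilities are made up for illustration.

```python
def p_at_least_one(p_each: float, n: int) -> float:
    """Chance that at least one of n independent bets pays off."""
    return 1.0 - (1.0 - p_each) ** n


# Hypothetical numbers: each technology is a long shot on its own.
for n in (1, 5, 10, 20):
    print(f"{n:2d} technologies at 10% each -> {p_at_least_one(0.10, n):.0%}")
```

Even when every individual bet is unlikely, the chance that all of them fail shrinks quickly as the number of candidates grows, at least to the extent that their failures are not correlated.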