UMI to OCP as an extension to the BoW standard.
While improvements in processor performance to meet the enormous compute requirements of applications like ChatGPT get all the headlines, a not-so-new phenomenon known as the memory wall risks negating those advancements. Indeed, it has been demonstrated repeatedly that as CPU/GPU performance increases, the time those processors spend waiting for memory also increases, preventing their full utilization.
With the number of parameters in the generative AI model ChatGPT-4 reportedly close to 1.4 trillion, artificial intelligence has powered head-on into the memory wall. Other high-performance applications are not far behind. The rate at which GPUs and AI accelerators can consume parameters now exceeds the rate at which hierarchical memory structures, even on multi-die assemblies, can supply them. The result is an increasing number of idle cycles while some of the world’s most expensive silicon waits for memory.
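To see why, consider a rough back-of-the-envelope sketch. The figures below are illustrative assumptions, not data from this article: FP16 weights, an aggregate HBM bandwidth of about 3.35 TB/s (roughly an H100-class device), and we ignore that a model this size would in practice be sharded across many devices. At batch size 1, generating one token requires streaming every weight from memory at least once, so bandwidth, not compute, sets the floor on latency.

```python
# Back-of-the-envelope memory-wall arithmetic.
# All figures below are illustrative assumptions, not data from the article.

params = 1.4e12           # model parameters, as cited above
bytes_per_param = 2       # assume FP16 weights
hbm_bandwidth = 3.35e12   # assume ~3.35 TB/s aggregate HBM bandwidth (H100-class)

# At batch size 1, generating one token touches every weight once,
# so the memory system must stream the entire model per token.
bytes_per_token = params * bytes_per_param
seconds_per_token = bytes_per_token / hbm_bandwidth

print(f"{bytes_per_token / 1e12:.1f} TB streamed per token")       # ~2.8 TB
print(f"{seconds_per_token * 1e3:.0f} ms per token, memory-bound")  # ~840 ms
```

However fast the arithmetic units are, that per-token figure does not improve until memory bandwidth does.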
Traditionally there have been three ways to pry open this bottleneck. The easiest—in the days when Moore’s Law was young—was to make faster DRAM chips with faster interfaces. Today that well is dry. The second approach has been to create a wider pathway between the memory array—which can produce thousands of bits per cycle in parallel—and the processor die. Arguably this has been taken near its practical limit with the 1 kbit-wide high-bandwidth memory (HBM) interface.
The third alternative is to use parallelism above the chip level: instead of one stack of HBM dies, use four or eight, each on its own memory bus. In this way the system architect can expand not just the amount of memory directly connected to a processing die, but also the bandwidth between memory and the compute die, as the sketch below illustrates.
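The arithmetic behind this scaling is straightforward. As a minimal sketch, assuming HBM3-generation numbers (the 1 kbit bus width noted above, plus a per-pin data rate of 6.4 Gb/s, which is an assumption here, not a figure from the article):

```python
# Aggregate bandwidth from stack-level parallelism.
# Assumes HBM3-generation figures; both numbers are assumptions.

bus_width_bits = 1024   # the 1 kbit-wide HBM interface noted above
pin_rate_gbps = 6.4     # assumed HBM3 per-pin data rate, in Gb/s

per_stack_gbs = bus_width_bits * pin_rate_gbps / 8  # GB/s per stack

for stacks in (1, 4, 8):
    print(f"{stacks} stack(s): {stacks * per_stack_gbs:,.0f} GB/s aggregate")
# 1 stack(s): 819 GB/s
# 4 stack(s): 3,277 GB/s
# 8 stack(s): 6,554 GB/s
```

Each added stack buys both capacity and bandwidth, which is why architects keep reaching for more memory channels, and why the real-estate limits described next bite so hard.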
The trouble is, this approach is running into two hard limits, both involving real estate. At the system-in-package (SiP) level, there is no room left for more memory: we are already filling the largest available silicon interposers, and making room for more memory would mean leaving out computing dies. At the die level there is a different issue. Computing dies, whether CPU, GPU, or accelerator, are prime real estate, usually built in the most advanced and expensive process technology available. Designers want all of that area for computing, not for interfaces, and they are reluctant to give up any of it, or any of their power budget, for additional memory channels.
So, it is a dilemma. Architects need the added memory bandwidth and capacity that more memory channels can bring. But they are out of area on silicon interposers. And compute designers don’t want to surrender more die area for interfaces. Fortunately, there is a solution.
Read more here.