HBM's higher implementation and manufacturing costs are made up for by its increased bandwidth, board space savings, and power efficiency.
Generative AI (Gen AI), built on the exponential growth of Large Language Models (LLMs) and their kin, is one of today’s biggest drivers of computing technology. Leading-edge LLMs now exceed a trillion parameters and offer multimodal capabilities, taking a broad range of inputs, whether text, speech, images, video, or code, and generating an equally broad set of outputs. LLM training takes an enormous amount of compute capacity coupled with high-bandwidth memory. This blog will discuss High Bandwidth Memory (HBM), including the next-generation HBM4, and how it is the leading solution for demanding LLM training workloads.
HBM is based on a high-performance 3D-stacked SDRAM architecture. HBM3, introduced in 2022, delivers tremendous memory bandwidth: four HBM3 stacks connected to a processor via interfaces running at 6.4 Gb/s per pin deliver over 3.2 TB/s of bandwidth. And with 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small, power-efficient footprint. The extension of HBM3, HBM3E, raised data rates to 9-plus Gb/s and throughput to over 1 TB/s per HBM3E device.
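To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of how per-pin data rate and interface width translate into the figures above. The 1,024-bit interface width is described in the next paragraph; the 9.6 Gb/s HBM3E pin rate is just one assumed example of a "9-plus Gb/s" device.

```python
# Back-of-the-envelope HBM bandwidth arithmetic (illustrative only).
# Peak bandwidth (GB/s) = interface width (bits) x per-pin data rate (Gb/s) / 8

def hbm_bandwidth_gbps(width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth of a single HBM stack in GB/s."""
    return width_bits * pin_rate_gbps / 8

# HBM3: 1,024-bit interface at 6.4 Gb/s per pin
hbm3_stack = hbm_bandwidth_gbps(1024, 6.4)
print(f"HBM3, one stack:   {hbm3_stack:.0f} GB/s")                # ~819 GB/s
print(f"HBM3, four stacks: {4 * hbm3_stack / 1000:.2f} TB/s")     # > 3.2 TB/s

# HBM3E: same 1,024-bit interface at an assumed 9.6 Gb/s pin rate
hbm3e_device = hbm_bandwidth_gbps(1024, 9.6)
print(f"HBM3E, one device: {hbm3e_device / 1000:.2f} TB/s")       # > 1 TB/s
```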
Through the HBM3E generation, each HBM device connects to its associated processor through an interface of 1,024 data “wires.” With command and address, the number of wires grows to about 1,700. This is far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect memory device(s) and processor. As with an SoC, finely spaced traces can be etched in the silicon interposer to achieve the number of wires needed for the HBM interface.
JEDEC has announced that the next-generation HBM4 specification is nearing finalization. HBM4 will double the number of data wires to 2,048, with announced data rates of up to 6.4 Gb/s. This raises the bandwidth per HBM4 device to 1.6 TB/s, so a GPU with eight HBM4 devices will achieve an aggregate memory bandwidth of over 13 TB/s.
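Applying the same per-pin arithmetic to HBM4's wider interface reproduces those figures:

```python
# Same per-pin arithmetic applied to HBM4's 2,048-bit interface.

def hbm_bandwidth_gbps(width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth of a single HBM device in GB/s."""
    return width_bits * pin_rate_gbps / 8

hbm4_device = hbm_bandwidth_gbps(2048, 6.4)      # ~1,638 GB/s, i.e. ~1.6 TB/s
gpu_aggregate = 8 * hbm4_device                  # GPU with 8 HBM4 devices
print(f"HBM4, one device: {hbm4_device / 1000:.2f} TB/s")
print(f"8-device GPU:     {gpu_aggregate / 1000:.1f} TB/s")       # > 13 TB/s
```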
HBM comes with increased complexity and costs. The interposer is an additional element that must be designed, characterized and manufactured. 3D-stacked memory is a more complex structure than DDR memory in a single-die package. The net result is that implementation and manufacturing costs are higher for HBM3 than for a traditional 2D memory architecture like GDDR.
However, for AI training applications, the benefits of HBM make it the superior choice. The bandwidth performance is outstanding, and higher implementation and manufacturing costs can be traded off against savings of board space and power.
HBM4 will offer system designers extremely high-bandwidth capabilities at optimal power efficiency. While implementation of HBM4 systems will present challenges due to greater design complexity and manufacturing costs, the savings in board space and cooling can be compelling. For AI training, HBM4 is shaping up to be an ideal solution. It builds on the strong track record of HBM3 and HBM3E, which are implemented in today’s state-of-the-art AI accelerators.
For training, bandwidth and capacity are critical requirements. This is particularly so given that the training datasets for some AI models, such as Meta’s Llama, have been growing by 10X each year. Training workloads now run over massively parallel architectures. Given the value created through training, there is a powerful “time-to-market” incentive to complete training runs as quickly as possible. Furthermore, training applications run in data centers that are increasingly constrained for power and space, so there is a premium on solutions that offer power efficiency and smaller size, resulting in a lower Total Cost of Ownership (TCO) for data centers. Given all these requirements, HBM4, like preceding generations of HBM, will be an ideal memory solution for AI training hardware.
The newly announced Rambus HBM4 memory controller IP delivers maximum throughput performance and flexibility for AI training. The controller can be paired with an internally designed or third-party HBM4 PHY to deliver a complete HBM4 memory subsystem, and it supports data rates of up to 10 Gb/s for future scalability and design headroom in your next Gen AI SoC design.
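As a rough illustration of what that headroom could mean, and assuming the HBM4 interface stays 2,048 bits wide (an assumption carried over from the JEDEC figure above, not a stated product specification), a 10 Gb/s per-pin rate works out to roughly 2.5 TB/s per device:

```python
# Hypothetical headroom estimate, assuming a 2,048-bit HBM4 interface.
width_bits = 2048        # assumed HBM4 interface width (per the JEDEC figure above)
pin_rate_gbps = 10.0     # maximum data rate supported by the controller
peak_tbps = width_bits * pin_rate_gbps / 8 / 1000
print(f"Peak per-device bandwidth at 10 Gb/s: {peak_tbps:.2f} TB/s")   # ~2.56 TB/s
```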