Generative AI Training With HBM3 Memory

Meeting the memory bandwidth and capacity demands of large language models.


One of the biggest, most talked-about application drivers of hardware requirements today is the rise of large language models (LLMs) and the generative AI they make possible. The most well-known example of generative AI right now is, of course, ChatGPT. The GPT-3 model behind the original ChatGPT uses 175 billion parameters, and its fourth-generation successor, GPT-4, reportedly boosts the parameter count to more than 1.5 trillion. LLM training takes an enormous amount of compute capacity coupled with high-bandwidth memory. This blog will look at High Bandwidth Memory (HBM) and why it is well suited to training demanding LLM workloads.

HBM is based on a high-performance 3D-stacked SDRAM architecture. HBM3, the latest version of the standard, introduced in 2022, can achieve tremendous memory bandwidth: four HBM3 stacks connected to a processor via an interface running at 6.4 Gb/s deliver over 3.2 TB/s. And because the memory is 3D-stacked, high bandwidth and high capacity fit in an exceptionally small footprint. What’s more, by keeping data rates relatively low and the memory close to the processor, overall system power is kept low.
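As a sanity check on those figures, the aggregate bandwidth follows directly from the per-pin data rate and the 1,024-bit width of each stack's interface. A quick illustrative sketch (not vendor code):

```python
def hbm3_bandwidth_tb_s(stacks, data_rate_gbps=6.4, bus_width_bits=1024):
    """Aggregate HBM3 bandwidth in TB/s for a given number of stacks.

    Per stack: data rate (Gb/s per pin) x 1,024-bit bus width,
    divided by 8 to convert bits to bytes.
    """
    gb_per_s = stacks * data_rate_gbps * bus_width_bits / 8
    return gb_per_s / 1000  # GB/s -> TB/s

# Four stacks at 6.4 Gb/s: 4 x 6.4 Gb/s x 1024 bits / 8 = 3276.8 GB/s
print(hbm3_bandwidth_tb_s(4))  # just over 3.2 TB/s
```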

All versions of HBM run at a “relatively low” data rate but achieve very high bandwidth through an extremely wide interface. More specifically, each HBM3 stack running at up to 6.4 Gb/s connects to its associated processor through an interface of 1,024 data “wires.” With command and address signals, the count grows to about 1,700 wires, far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect the memory stack(s) and processor. As with an SoC, finely spaced traces can be etched in the silicon interposer to achieve the number of wires needed for the HBM interface.

The design trade-off with HBM is increased complexity and cost. The interposer is an additional element that must be designed, characterized and manufactured. And 3D-stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories (including GDDR). The net is that implementation and manufacturing costs are higher for HBM3 than for a traditional 2D memory architecture such as GDDR6.

However, for AI training applications, the benefits of HBM3 make it the superior choice. The bandwidth performance is outstanding, and higher implementation and manufacturing costs can be traded off against savings of board space and power. In data center environments, where physical space is increasingly constrained, HBM3’s compact architecture offers tangible benefits. Its lower power translates to lower heat loads for an environment where cooling is often one of the top operating costs.

HBM3 offers system designers extremely high bandwidth and excellent power efficiency. While implementing HBM3 can be challenging due to greater design complexity and manufacturing costs, the savings in board space and cooling can be compelling. For AI training, HBM3 is an ideal solution. It builds on a strong track record of success with HBM2 and HBM2E, which were implemented in AI accelerators such as NVIDIA’s A100 and generations of the Google Tensor Processing Unit (TPU). HBM3 is the memory used by NVIDIA’s new Hopper H100 AI accelerator.

For training, bandwidth and capacity are critical requirements, particularly given that AI training sets have been growing by roughly 10X each year. Training workloads now run over massively parallel architectures, and given the value created through training, there is a powerful time-to-market incentive to complete training runs as quickly as possible. Furthermore, training applications run in data centers increasingly constrained for power and space, so there is a premium on solutions that offer power efficiency and a smaller footprint. Given all these requirements, HBM3, with its excellent bandwidth and capacity, is an ideal memory solution for AI training hardware.

The Rambus HBM3 memory interface subsystem delivers maximum performance and flexibility for AI training in a compact, power-efficient footprint. The interface consists of a co-verified PHY and digital controller that together form a complete HBM3 memory subsystem. It delivers a market-leading 8.4 Gb/s per data pin, well above the standard’s 6.4 Gb/s. The interface features 16 independent channels of 64 bits each, for a total data width of 1,024 bits. At the maximum data rate, this provides a total interface bandwidth of 1,075.2 GB/s, or over 1 terabyte per second (TB/s), to an HBM3 memory device.
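The quoted subsystem numbers can be reproduced with the same arithmetic. An illustrative sketch, using only the figures stated above:

```python
# Reproducing the quoted Rambus HBM3 subsystem figures.
channels = 16              # independent channels
bits_per_channel = 64      # data bits per channel
data_rate_gbps = 8.4       # Gb/s per data pin

total_width_bits = channels * bits_per_channel          # total data width in bits
bandwidth_gb_s = total_width_bits * data_rate_gbps / 8  # Gb/s across the bus -> GB/s

print(total_width_bits)  # 1024 bits
print(bandwidth_gb_s)    # ~1075.2 GB/s, i.e. over 1 TB/s
```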
