Scaling AI/ML Training Performance With HBM2E Memory

Continued increases in memory capacity and bandwidth are needed to keep AI accelerators and processors from being bottlenecked.


In my April SemiEngineering Low Power-High Performance blog, I wrote: “Today, AI/ML neural network training models can exceed 10 billion parameters, soon it will be over 100 billion.” “Soon” didn’t take long to arrive. At the end of May, OpenAI unveiled a new 175-billion parameter GPT-3 language model. This represented a more than 100X jump over the size of GPT-2’s 1.5 billion parameters, a model OpenAI had introduced a little over a year earlier, in February 2019. The trend line since 2012 reflects a 10X annual increase in the size of training models, suggesting we could witness training models in 2021 with over a trillion parameters.

Even in its prime, Moore’s Law couldn’t deliver the improvements necessary to keep pace with a 10X annual increase in demand. Advancements in every aspect of computer hardware and software are needed to stay on this pace. Foremost among the areas of focus must be the continued increase in memory capacity and bandwidth to keep AI accelerators and processors from being bottlenecked.

GDDR6 and High Bandwidth Memory (HBM) have emerged as the key high-performance memory solutions for AI/ML. GDDR6 offers excellent bandwidth and a ruggedness achieved through over two decades of high-volume manufacturing. These characteristics make it an outstanding choice for high-reliability AI/ML inference applications such as advanced driver-assistance systems (ADAS).

But for applications like AI/ML training, with its insatiable need for bandwidth, the performance of HBM is without rival. The latest iteration of HBM, HBM2E, employs a 1024-bit (128-byte-wide) interface running at 3.2 gigabits per second (Gbps) per pin to deliver 410 gigabytes per second (GB/s) per HBM2E DRAM “stack.”
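The 410 GB/s figure follows directly from the interface width and per-pin data rate quoted above. A minimal sketch of that arithmetic in Python (the function and variable names are illustrative, not taken from any vendor tool):

```python
def peak_bandwidth_gb_per_s(interface_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = interface width (bits) x per-pin rate (Gbps) / 8 bits per byte."""
    return interface_bits * data_rate_gbps / 8


# Figures quoted in the article for a single HBM2E stack.
hbm2e_stack = peak_bandwidth_gb_per_s(interface_bits=1024, data_rate_gbps=3.2)
print(f"HBM2E per-stack peak bandwidth: {hbm2e_stack:.1f} GB/s")
```

The result is 409.6 GB/s, which rounds to the 410 GB/s quoted above. This is a theoretical peak; sustained bandwidth in a real system would be lower.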

The stack refers to the 3D structure of HBM memory devices. HBM combines “scaling down” via Moore’s Law with “scaling up” through 3D packaging of DRAM. By scaling in the Z-dimension, HBM can deliver a significant increase in capacity. The latest-generation HBM2E supports 12-high stacks of DRAM, providing memory capacities of up to 24 GB per stack.

The “wide and slow” interface of HBM2E (3.2 Gbps being “slow” at least relative to the 16+ Gbps speed of GDDR6) enables an architecture that delivers tremendous bandwidth and capacity in a very power-efficient manner. This is another great benefit for AI/ML training given its extensive deployment in hyperscale data centers. With the largest hyperscale data centers consuming on the order of 100 megawatts of power, managing heat and power is mission critical.
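To put “wide and slow” in numbers, the sketch below compares the peak bandwidth of one HBM2E stack with that of a single GDDR6 device. The 32-bit GDDR6 device width is an assumption based on the GDDR6 standard rather than on anything stated in this article, and 16 Gbps is taken as the lower bound of the “16+ Gbps” mentioned above.

```python
def peak_bandwidth_gb_per_s(interface_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = interface width (bits) x per-pin rate (Gbps) / 8 bits per byte."""
    return interface_bits * data_rate_gbps / 8


# HBM2E: wide interface, modest per-pin rate (figures from the article).
hbm2e = peak_bandwidth_gb_per_s(interface_bits=1024, data_rate_gbps=3.2)

# GDDR6: narrow interface, fast per-pin rate.
# ASSUMPTION: 32-bit device width (not stated in the article).
gddr6 = peak_bandwidth_gb_per_s(interface_bits=32, data_rate_gbps=16.0)

bandwidth_ratio = hbm2e / gddr6   # how much more bandwidth one stack provides
pin_speed_ratio = 3.2 / 16.0      # how much slower each HBM2E pin runs

print(f"HBM2E stack:  {hbm2e:.1f} GB/s")
print(f"GDDR6 device: {gddr6:.1f} GB/s")
print(f"Bandwidth ratio: {bandwidth_ratio:.1f}x at {pin_speed_ratio:.0%} of the per-pin speed")
```

Under these assumptions, one HBM2E stack provides about 6.4 times the peak bandwidth of one GDDR6 device while each pin runs at one-fifth the data rate.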

High bandwidth, high capacity, compact and power efficient, HBM2E memory delivers what AI/ML training demands. An architecture using six HBM2E device stacks can achieve nearly 2.5 terabytes per second (TB/s) of bandwidth and 144 GB of memory capacity (six 24 GB stacks). Apart from future generations of HBM, the HBM2E architecture offers additional scalability beyond 3.2 Gbps. In July, SK hynix announced it had reached mass production of a 3.6 Gbps HBM2E memory. With the wide interface, the 0.4 Gbps increase in data rate raises bandwidth by about 50 GB/s to 460 GB/s per stack.
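The system-level figures can be checked with the same arithmetic. The sketch below assumes six stacks, the 1024-bit interface and 24 GB per 12-high stack quoted earlier, and decimal units (1 TB = 1000 GB); the constant and function names are illustrative.

```python
STACKS = 6                   # six HBM2E stacks, as in the example architecture
INTERFACE_BITS = 1024        # interface width per stack
CAPACITY_PER_STACK_GB = 24   # 12-high stack


def stack_bandwidth_gb_per_s(data_rate_gbps: float) -> float:
    """Peak per-stack bandwidth in GB/s for a given per-pin data rate."""
    return INTERFACE_BITS * data_rate_gbps / 8


for rate in (3.2, 3.6):
    per_stack = stack_bandwidth_gb_per_s(rate)
    total_tb_per_s = STACKS * per_stack / 1000  # decimal TB
    print(f"{rate} Gbps: {per_stack:.1f} GB/s per stack, "
          f"{total_tb_per_s:.2f} TB/s across {STACKS} stacks")

total_capacity_gb = STACKS * CAPACITY_PER_STACK_GB
uplift_gb_per_s = stack_bandwidth_gb_per_s(3.6) - stack_bandwidth_gb_per_s(3.2)
print(f"Total capacity: {total_capacity_gb} GB")
print(f"Per-stack uplift from 3.2 to 3.6 Gbps: {uplift_gb_per_s:.1f} GB/s")
```

At 3.2 Gbps, six stacks work out to about 2.46 TB/s and 144 GB. The per-stack uplift at 3.6 Gbps is 51.2 GB/s, and the same six-stack arrangement would reach roughly 2.76 TB/s of peak bandwidth.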

The trade-off to achieving the outstanding bandwidth of HBM is the complexity of implementing the 2.5D design. To the 1024 data “wires” between an HBM DRAM stack and the interface on the accelerator, add clock, power management and command/address, and the number of traces in the HBM interface grows to about 1,700. This is far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect memory stack(s) and accelerator. The use of the silicon interposer is what makes this a 2.5D architecture. As with an IC, finely spaced traces can be etched in the silicon interposer to achieve the number needed for the HBM interface.

Designers can greatly mitigate the challenges of higher complexity with their choice of IP supplier. Integrated solutions such as the HBM2E memory interface from Rambus ease implementation and provide a complete memory interface sub-system consisting of verified PHY and digital controller. Further, Rambus has extensive experience in interposer design with silicon-proven HBM2/HBM2E implementations benefiting from Rambus’ mixed-signal circuit design history, deep signal integrity/power integrity technology expertise, and system engineering capabilities. For every customer engagement, Rambus provides a reference design of the interposer and package for the HBM2E implementation as part of the IP license.

It’s still early days for the AI/ML revolution, and there’s no slowing in the demand for more computing performance. Improvements to every aspect of computing hardware and software will be needed to keep on this scorching pace. For memory, AI/ML training demands maximum bandwidth, capacity and power efficiency. HBM2E memory represents the state-of-the-art solution for advancing AI/ML training performance.

Additional resources:
Webinar: GDDR6 and HBM2E Memory Solutions for AI
White Paper: GDDR6 and HBM2E: Memory Solutions for AI
Web: Rambus GDDR6 PHY and Rambus GDDR6 Controller
Web: Rambus HBM2E PHY and Rambus HBM2E Controller
Product Briefs: GDDR6 PHY and GDDR6 Controller
Product Briefs: HBM2E PHY and HBM2E Controller
Solution Briefs: GDDR6 Interface Solution and HBM2E Interface Solution
