HBM2E Memory: A Perfect Fit For AI/ML Training

The ability of HBM to achieve tremendous memory bandwidth in a small footprint outweighs the added cost and complexity for training hardware.


Artificial Intelligence/Machine Learning (AI/ML) growth is proceeding at a lightning pace. In the past eight years, AI training capabilities have jumped by a factor of 300,000 (roughly 10X annually), driving rapid improvements in every aspect of computing hardware and software. Memory bandwidth is one critical area of focus enabling the continued growth of AI.

Introduced in 2013, High Bandwidth Memory (HBM) is a high-performance 3D-stacked SDRAM architecture. Like its predecessor, the second generation HBM2 specifies up to 8 memory die per stack, while doubling pin transfer rates to 2 Gbps. HBM2 achieves 256 GB/s of memory bandwidth per package (DRAM stack), with the HBM2 specification supporting up to 8 GB of capacity per package.

In late 2018, JEDEC announced the HBM2E specification to support increased bandwidth and capacity. With transfer rates rising to 3.2 Gbps per pin, HBM2E can achieve 410 GB/s of memory bandwidth per stack. In addition, HBM2E supports 12‑high stacks with memory capacities of up to 24 GB per stack.

All versions of HBM run at a relatively low data rate compared to a high-speed memory such as GDDR6. High bandwidth is achieved through the use of an extremely wide interface. Specifically, each HBM2E stack running at 3.2 Gbps connects to its associated processor through an interface of 1,024 data “wires.” With command and address, the number of wires grows to about 1,700. This is far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect memory stack(s) and processor. As with an SoC, finely spaced data traces can be etched in the silicon interposer to achieve the desired number of wires needed for the HBM interface.
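The per-stack bandwidth figures quoted above follow directly from interface width and per-pin data rate. As a rough sketch (the function name is illustrative, the 1,024-pin width is assumed to apply to HBM2 as well as HBM2E, and the result is a theoretical peak rather than a measured number):

```python
# Peak per-stack bandwidth = data pins x per-pin rate / 8 bits per byte.
# Assumption: the 1,024-wire data interface applies to both HBM2 and HBM2E.
DATA_PINS = 1024


def stack_bandwidth_gbytes(pin_rate_gbps: float, data_pins: int = DATA_PINS) -> float:
    """Return the theoretical peak bandwidth of one stack in GB/s."""
    return data_pins * pin_rate_gbps / 8


hbm2 = stack_bandwidth_gbytes(2.0)   # HBM2 at 2 Gbps per pin
hbm2e = stack_bandwidth_gbytes(3.2)  # HBM2E at 3.2 Gbps per pin

print(f"HBM2:  {hbm2:.1f} GB/s")   # 256.0 GB/s
print(f"HBM2E: {hbm2e:.1f} GB/s")  # 409.6 GB/s, quoted as 410 GB/s
```

The command and address wires that bring the total to about 1,700 do not carry data, so they are left out of the calculation.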

HBM2E offers the capability to achieve tremendous memory bandwidth. Four HBM2E stacks connected to a processor will deliver over 1.6 TB/s of bandwidth. And with 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small footprint. Further, by keeping data rates relatively low, and the memory close to the processor, overall system power is kept low.
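To see where the four-stack figure comes from, the per-stack numbers can simply be multiplied out. This is a back-of-the-envelope sketch: the constant and function names are illustrative, decimal units are assumed (1 TB/s = 1,000 GB/s), and the 96 GB total is derived here from the 24 GB per-stack capacity rather than stated in the article.

```python
# Assumed per-stack figures for a 12-high HBM2E stack at 3.2 Gbps per pin.
STACK_BANDWIDTH_GBYTES = 409.6  # 1,024 pins x 3.2 Gbps / 8, quoted as 410 GB/s
STACK_CAPACITY_GB = 24


def system_totals(num_stacks: int) -> tuple:
    """Return (aggregate peak bandwidth in TB/s, total capacity in GB)."""
    bandwidth_tbytes = num_stacks * STACK_BANDWIDTH_GBYTES / 1000
    capacity_gb = num_stacks * STACK_CAPACITY_GB
    return bandwidth_tbytes, capacity_gb


bandwidth, capacity = system_totals(4)
print(f"4 stacks: {bandwidth:.2f} TB/s, {capacity} GB")  # 1.64 TB/s, 96 GB
```

These are theoretical peaks; delivered bandwidth in a real system depends on the controller and the access pattern.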

The design tradeoff with HBM is increased complexity and cost. The interposer is an additional element that must be designed, characterized and manufactured. 3D stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories (including GDDR). The net result is that implementation and manufacturing costs are higher for HBM2E than for memory using traditional manufacturing methods, as in GDDR6 or DDR4.

However, for AI training applications, the benefits of HBM2E make it the superior choice. The performance is outstanding, and higher implementation and manufacturing costs can be traded off against savings of board space and power. In data center environments, where physical space is increasingly constrained, HBM2E’s compact architecture offers tangible benefits. Its lower power translates to lower heat loads for an environment where cooling is often one of the top operating costs.

For training, bandwidth and capacity are critical requirements. This is particularly so given that training capabilities are on a pace to double in size every 3.43 months (the 10X annual increase discussed earlier). Training workloads now run over multiple servers to provide the needed processing power, flipping virtualization on its head. Given the value created through training, there is a powerful “time-to-market” incentive to complete training runs as quickly as possible. Furthermore, training applications run in data centers increasingly constrained for power and space, so there’s a premium for solutions that offer power efficiency and smaller size.
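The relationship between a doubling period and an annual growth factor can be checked with a few lines of arithmetic. The sketch below assumes smooth exponential growth, and the function names are illustrative. It suggests the two figures agree only approximately: a 3.43-month doubling works out to a little over 11X per year, so "10X" is best read as a rounded number.

```python
import math


def doubling_months(growth_per_year: float) -> float:
    """Months needed to double at a constant annual growth factor."""
    return 12 * math.log(2) / math.log(growth_per_year)


def annual_factor(doubling_period_months: float) -> float:
    """Annual growth factor implied by a constant doubling period."""
    return 2 ** (12 / doubling_period_months)


print(f"10X per year -> doubling every {doubling_months(10):.2f} months")  # 3.61
print(f"3.43-month doubling -> {annual_factor(3.43):.1f}X per year")       # 11.3
```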

Given all these requirements, HBM2E is an ideal memory solution for AI training hardware. It provides excellent bandwidth and capacity: 410 GB/s of memory bandwidth with 24 GB of capacity for a single 12‑high HBM2E stack. Its 3D structure provides these features in a very compact form factor and at lower power, thanks to a low interface speed and proximity between memory and processor.

Designers can both realize the benefits of HBM2E memory and mitigate the implementation challenges through their choice of IP supplier. Rambus offers a complete HBM2E memory interface sub-system consisting of a co-verified PHY and controller. An integrated interface solution greatly reduces implementation complexity. Further, Rambus’ extensive mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities help ensure first-time-right design execution.

The growth of AI/ML training capabilities requires sustained, across-the-board improvements in hardware and software to stay on the current pace. As part of this mix, memory is a critical enabler. HBM2E memory is an ideal solution, offering bandwidth and capacity at low power in a compact footprint and hitting all of AI/ML training’s key performance requirements. With a partner like Rambus, designers can harness the capabilities of HBM2E memory to supercharge their next generation of AI accelerators.
