On the Road to Higher Memory Bandwidth

We’re at the cusp of another generational change in HBM.


In the decade since HBM was first announced, we’ve seen two-and-a-half generations of the standard come to market. HBM’s “wide and slow” architecture debuted at a data rate of 1 gigabit per second (Gbps) running over a 1024-bit wide interface. The product of that data rate and that interface width provided a bandwidth of 128 gigabytes per second (GB/s). In 2016, HBM2 doubled the signaling rate to 2 Gbps and the bandwidth to 256 GB/s. Two years later, HBM2E came on the scene and ultimately achieved a data rate of 3.6 Gbps and a bandwidth of 460 GB/s.

The impetus for this ramping of performance is the insatiable bandwidth demand of advanced workloads. First and foremost among these is AI/ML training. Over the same ten years since the introduction of HBM, AI/ML training has been on a tear. Leading-edge training models have scaled from a few million parameters a decade ago to over 170 billion in 2020. Model sizes are doubling every three to four months, a growth rate of roughly 10X annually. Serving that growth requires scaling of everything in the compute architecture, across both hardware and software.

Higher memory bandwidth is, and will continue to be, a critical enabler of computing performance. As a result, we find ourselves at the cusp of another generational change in HBM. While the final specification is yet to be published, we anticipate HBM3 data rates will exceed 5.2 Gbps, for a resulting bandwidth of over 665 GB/s. And as bandwidth has scaled, so too has attached HBM memory capacity – in three ways.
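As a sanity check, each of those bandwidth figures is simply the per-pin data rate multiplied by the 1024-bit interface width. Here is a minimal sketch of the arithmetic in Python (the HBM3 rate is the anticipated figure above, not a published spec value):

```python
# Peak bandwidth of one HBM device:
# per-pin data rate (Gbps) x 1024 pins / 8 bits per byte
def hbm_bandwidth_gbs(data_rate_gbps: float, width_bits: int = 1024) -> float:
    return data_rate_gbps * width_bits / 8  # GB/s

generations = {
    "HBM":   1.0,  # 128 GB/s
    "HBM2":  2.0,  # 256 GB/s
    "HBM2E": 3.6,  # ~460 GB/s
    "HBM3":  5.2,  # anticipated: over 665 GB/s
}

for name, rate in generations.items():
    print(f"{name:<6} {rate:.1f} Gbps -> {hbm_bandwidth_gbs(rate):6.1f} GB/s")
```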

An HBM system is a 2.5D/3D architecture. The “3D” part refers to the HBM memory devices, which are 3D stacks of DRAM chips. Over the HBM generations, the density of the DRAM chips has increased, as has the height of the stack, which can now be up to 16 DRAM chips. A third dimension of capacity expansion, and one that also increases aggregate bandwidth, is the scaling in the number of HBM memory channels in accelerators: from one or two in early implementations, to six in today’s state-of-the-art architectures, to eight or more in the future.
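To make those three dimensions of capacity scaling concrete, here is an illustrative calculation. The 16 Gb (2 GB) die density is an assumed example value, not a figure from the text; the stack heights and device counts follow the ranges described above:

```python
# Illustrative HBM capacity math: die capacity x stack height x number of devices.
DIE_CAPACITY_GB = 16 / 8  # assumed 16 Gb DRAM die = 2 GB

for stack_height in (4, 8, 16):    # DRAM chips per 3D stack
    for num_devices in (2, 6, 8):  # HBM stacks attached to the accelerator
        total_gb = DIE_CAPACITY_GB * stack_height * num_devices
        print(f"{stack_height}-high stack x {num_devices} devices -> {total_gb:.0f} GB")
```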

An HBM memory channel requires 1024 traces to carry the data. In addition, command and address lines are needed, bringing the total to over 1,700. An accelerator architecture with 8 HBM3 devices therefore requires over 13,000 traces. That is an order of magnitude more than can be implemented on a conventional PCB. So instead, a silicon interposer is used, in which it is possible to finely etch the thousands of traces needed. The interposer is mounted in the system package, and the accelerator and HBM memory devices are mounted on the interposer – that is the “2.5D” part of the architecture.
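The trace-count arithmetic can be sketched the same way. The 700-trace figure for command, address, and other signals is an approximation chosen to match the “over 1,700” per-interface total cited above:

```python
# Rough interposer trace count for an 8-device HBM3 accelerator.
DATA_TRACES = 1024         # data lines per HBM channel interface
CMD_ADDR_AND_OTHER = 700   # approximate; brings one interface to over 1,700 traces

traces_per_device = DATA_TRACES + CMD_ADDR_AND_OTHER
num_devices = 8
total = traces_per_device * num_devices
print(f"{num_devices} devices -> {total:,} traces")  # 13,792: "over 13,000"
```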

When it comes to memory performance, Rambus is a demonstrated leader. We introduced an 18 Gbps GDDR6 memory subsystem and an HBM2E memory subsystem running at 4 Gbps. Those are both record performance levels that remain unmatched to this day. Now, in anticipation of the next generation of HBM, Rambus has announced our 8.4 Gbps HBM3-Ready Memory Subsystem, capable of delivering over a terabyte per second (TB/s) of bandwidth. At this data rate, an accelerator with 8 attached HBM3 memory devices could achieve 8.6 TB/s of memory bandwidth.
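The aggregate figure follows from the same per-device formula applied across all eight devices:

```python
# 8.4 Gbps x 1024 bits / 8 = 1075.2 GB/s per device; 8 devices -> ~8.6 TB/s
per_device_gbs = 8.4 * 1024 / 8
aggregate_tbs = 8 * per_device_gbs / 1000
print(f"{per_device_gbs:.1f} GB/s per device, {aggregate_tbs:.2f} TB/s aggregate")
```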

To efficiently manage all that data, the Rambus memory subsystem includes a modular and highly configurable memory controller. The controller is optimized to maximize throughput and minimize latency, and its memory parameters are run-time programmable. With a pedigree of over 50 HBM2 and HBM2E customer implementations, it has demonstrated efficiency over a wide variety of configurations and data traffic scenarios.

HBM may have been “wide and slow” at 1 Gbps at its inception, but operating at 8.4 Gbps it is really quite fast. Routing thousands of signal traces at that speed while maintaining signal integrity is an enormous challenge. To help designers successfully harness the potential of HBM3 memory, Rambus provides both interposer and package reference designs for 2.5D architecture implementations.

The road to higher memory bandwidth is a never-ending one. The upcoming introduction of HBM3 opens a new phase in the journey, taking system performance to a new level.
