With 3D stacking, high bandwidth and capacity can be achieved in a small, power-efficient footprint.
Generative and Agentic AI are driving an extremely rapid evolution of computing technology. With leading-edge LLMs now exceeding a trillion parameters, training demands an enormous amount of computing capacity, and state-of-the-art training clusters can employ more than 100,000 GPUs. High Bandwidth Memory (HBM) provides the vast memory bandwidth and capacity needed for these demanding AI training workloads.
HBM is based on a high-performance 3D-stacked SDRAM architecture. HBM3, introduced in 2022, delivers tremendous memory bandwidth: four HBM3 stacks connected to a processor via an interface running at 6.4 Gb/s provide over 3.2 TB/s of bandwidth. And with 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small, power-efficient footprint. The extension of HBM3 (HBM3E) raised data rates to 9-plus Gb/s and throughput to over 1 TB/s per HBM3E device.
Through the HBM3E generation, each HBM device has connected to its associated processor through an interface of 1,024 data “wires.” With command and address signals, the number of wires grows to about 1,700. This is far more than can be supported on a standard PCB, so a silicon interposer is used as an intermediary to connect the memory device(s) and the processor. As with an SoC, finely spaced data traces can be etched into the silicon interposer to achieve the number of wires needed for the HBM interface.
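To make the bandwidth figures above concrete, the short sketch below derives them from the interface width and per-pin data rate. The 1,024-bit data interface and the 6.4 Gb/s HBM3 pin rate come from the text; the 9.6 Gb/s HBM3E rate is an assumed value consistent with the “9-plus Gb/s” figure, and the four-stack configuration is illustrative.

```python
# Sketch: deriving per-stack and multi-stack HBM bandwidth from
# interface width and pin data rate (assumptions noted above).

DATA_WIDTH_BITS = 1024  # data "wires" per HBM stack through the HBM3E generation

def stack_bandwidth_tbps(pin_rate_gbps: float, width_bits: int = DATA_WIDTH_BITS) -> float:
    """Peak bandwidth of one HBM stack in TB/s (terabytes per second)."""
    return pin_rate_gbps * width_bits / 8 / 1000  # Gb/s -> GB/s -> TB/s

hbm3 = stack_bandwidth_tbps(6.4)    # ~0.82 TB/s per stack
hbm3e = stack_bandwidth_tbps(9.6)   # ~1.23 TB/s per stack ("over 1 TB/s")

print(f"HBM3  per stack: {hbm3:.2f} TB/s, 4 stacks: {4 * hbm3:.2f} TB/s")    # ~3.28 TB/s
print(f"HBM3E per stack: {hbm3e:.2f} TB/s, 4 stacks: {4 * hbm3e:.2f} TB/s")  # ~4.92 TB/s
```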
On April 16th, JEDEC announced the finalization of the specification for the latest generation of the HBM standard: HBM4. HBM4 incorporates new architectural changes and advancements that deliver greater memory bandwidth and capacity while improving power efficiency and reliability.
As in all previous generations, HBM4 requires an interposer, an additional element that must be designed, characterized and manufactured. 3D-stacked memory is also a more complex structure than DDR memory in a single-die package. The net result is that implementation and manufacturing costs are higher for HBM4 than for a traditional 2D memory architecture such as GDDR7.
However, for AI training applications, the benefits of HBM4, in particular its unrivaled bandwidth, make it the superior choice. The higher implementation and manufacturing costs can be at least partly offset by savings in board space and power. Further, no other memory architecture can practically deliver over 10 TB/s of bandwidth to a single GPU or accelerator. For AI training, HBM4 is poised to extend the long track record of success enjoyed by previous generations of HBM.
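For a sense of where a figure above 10 TB/s comes from, the sketch below applies the same width-times-rate arithmetic to HBM4. The 10 Gb/s pin rate matches the maximum controller data rate cited below; the 2,048-bit per-stack data interface and the four-stack configuration are illustrative assumptions, not figures stated in this article.

```python
# Sketch: aggregate HBM4 bandwidth to a single GPU or accelerator.
# The 2,048-bit per-stack data width and 4-stack count are assumptions
# for illustration; the 10 Gb/s rate is the controller maximum cited below.

def system_bandwidth_tbps(stacks: int, pin_rate_gbps: float, width_bits: int) -> float:
    """Aggregate peak bandwidth across all HBM stacks, in TB/s."""
    return stacks * pin_rate_gbps * width_bits / 8 / 1000

print(system_bandwidth_tbps(stacks=4, pin_rate_gbps=10.0, width_bits=2048))  # ~10.24 TB/s
```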
Rambus is the recognized leader in HBM memory controller IP, and our HBM4 memory controller, in addition to complying with the JEDEC specification, provides valuable features that enhance the reliability and serviceability of an HBM4 system. Examples include robust multi-bit error-detecting codes, diagnostic instrumentation that provides system architects with real-time debugging insights, and link error detection/correction schemes. The HBM4 memory controller can be paired with a customer-designed or third-party HBM4 PHY to deliver a complete HBM4 memory subsystem. The Rambus HBM4 memory controller supports data rates of up to 10 Gb/s, providing future scalability and design headroom for your next AI SoC design.
Link:
Rambus HBM4 Memory Controller IP