2.5D Architecture Answers AI Training’s Call for “All of the Above”

It will take improvements in every aspect of computer hardware and software to maintain the pace of computing advancement.


The impact of AI/ML grows daily, touching every industry and the lives of everyone. In marketing, healthcare, retail, transportation, manufacturing and more, AI/ML is a catalyst for great change. This rapid advance is powerfully illustrated by AI/ML training capabilities, which since 2012 have grown by a factor of 10X every year.

Today, AI/ML neural network training models can exceed 10 billion parameters, and soon they will exceed 100 billion. Enormous gains in computing power, thanks to Moore’s law and Dennard scaling, have made this possible. At some point, however, a processing-power trend line that doubles every two years was bound to be overtaken by training demand that doubles every three-and-a-half months. That point is now. To make matters worse, Moore’s law is slowing, and Dennard scaling has stopped, at a time when arguably we need them most.
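To make the gap concrete, here is a minimal sketch (Python, for illustration only) comparing the two doubling periods cited above over a two-year window:

```python
# Compare the two trend lines cited above: processing power doubling
# roughly every two years vs. AI/ML training demand doubling every
# three-and-a-half months.
months = 24  # a two-year window

compute_growth = 2 ** (months / 24)   # Moore's-law-style doubling: 2x
demand_growth = 2 ** (months / 3.5)   # training demand: ~116x

print(f"Over {months} months: compute x{compute_growth:.0f}, demand x{demand_growth:.0f}")
```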

With no slackening in demand, it will take improvements in every aspect of computer hardware and software to stay on pace. Among these, memory capacity and bandwidth will be critical areas of focus to enable the continued growth of AI. If we can’t continue to scale down (via Moore’s Law), then we’ll have to scale up. The industry has responded with 3D-packaging of DRAM in JEDEC’s High Bandwidth Memory (HBM) standard. By scaling in the Z-dimension, we can realize a significant increase in capacity.

In fact, the latest iteration of HBM, HBM2E, supports 12-high stacks of DRAM with memory capacities of up to 24 GB per stack. That greater capacity would be useless to AI/ML training, however, without rapid access. As such, the HBM2E interface provides bandwidth of up to 410 GB/s per stack. An implementation with four stacks of HBM2E memory can deliver nearly 100 GB of capacity at an aggregate bandwidth of 1.6 TB/s.
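As a quick back-of-the-envelope check of those system-level figures, using only the per-stack numbers quoted above and an assumed four-stack configuration:

```python
# Four-stack HBM2E configuration, using the per-stack figures quoted above.
STACKS = 4
CAPACITY_PER_STACK_GB = 24       # 12-high stack, 24 GB
BANDWIDTH_PER_STACK_GBS = 410    # GB/s per stack

total_capacity_gb = STACKS * CAPACITY_PER_STACK_GB             # 96 GB ("nearly 100 GB")
total_bandwidth_tbs = STACKS * BANDWIDTH_PER_STACK_GBS / 1000  # ~1.64 TB/s

print(f"Capacity:  {total_capacity_gb} GB")
print(f"Bandwidth: {total_bandwidth_tbs:.2f} TB/s")
```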

With AI/ML accelerators deployed in hyperscale data centers, heat and power constraints are critical. HBM2E provides very power-efficient bandwidth by running a “wide and slow” interface. Slow, at least in relative terms: HBM2E operates at up to 3.2 Gbps per pin. Across a wide interface of 1,024 data pins, that 3.2 Gbps data rate yields a bandwidth of 410 GB/s per stack.
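A minimal sketch of how that per-stack figure falls out of the interface width and per-pin data rate:

```python
# "Wide and slow": 1,024 data pins, each running at up to 3.2 Gbps.
DATA_PINS = 1024
DATA_RATE_GBPS = 3.2   # gigabits per second, per pin
BITS_PER_BYTE = 8

stack_bandwidth_gbs = DATA_PINS * DATA_RATE_GBPS / BITS_PER_BYTE
print(f"Per-stack bandwidth: {stack_bandwidth_gbs:.1f} GB/s")  # 409.6, quoted as 410 GB/s
```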

Add clock, power management and command/address signals to the data, and the number of “wires” in the HBM interface grows to about 1,700. This is far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect the memory stack(s) and processor. The use of the silicon interposer is what makes this a 2.5D architecture. As with an IC, finely spaced traces can be etched into the silicon interposer to achieve the wire count needed for the HBM interface.
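A rough tally, derived only from the counts quoted above; the exact split of the non-data signals (clock, power management, command/address) is not detailed here and varies by implementation:

```python
# Approximate wire count for one HBM interface, from the figures above.
DATA_PINS = 1024            # data signals
TOTAL_WIRES_APPROX = 1700   # data + clock + power management + command/address

non_data_signals = TOTAL_WIRES_APPROX - DATA_PINS
print(f"Non-data signals (approx.): {non_data_signals}")  # roughly 680
```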

With 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small footprint. In data center environments, where physical space is increasingly constrained, HBM2E’s compact architecture offers tangible benefits. Further, by keeping data rates relatively low, and the memory close to the processor, overall system power is kept low.

High bandwidth, high capacity, compact and power efficient, HBM2E memory delivers what AI/ML training needs, but of course there’s a catch. The design trade-off with HBM is increased complexity and costs. The silicon interposer is an additional element that must be designed, characterized and manufactured. 3D stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories. The net is that implementation and manufacturing costs are higher for HBM2E than for a high-performance memory built using traditional manufacturing methods such as GDDR6 DRAM.

Yet, overcoming complexity through innovation is what our industry has done time and again to push computing performance to new heights. With AI/ML, the economic benefits of accelerating training runs are enormous, not only for better utilization of training hardware, but because of the value created when trained models are deployed in inference engines across millions of AI-powered devices.

In addition, designers can greatly mitigate the challenges of higher complexity with their choice of IP supplier. Integrated solutions such as the HBM2E memory interface from Rambus ease implementation and provide a complete memory interface sub-system consisting of a co-verified PHY and digital controller. Further, Rambus has extensive experience in interposer design, with silicon-proven HBM/HBM2 implementations that benefit from its mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities.

The progress of AI/ML has been breathtaking, and there’s no slowing down. Improvements to every aspect of computing hardware and software will be needed to keep up this scorching pace. For memory, AI/ML training demands bandwidth, capacity and power efficiency, all in a compact footprint. HBM2E memory, built on a 2.5D architecture, answers AI/ML training’s call for “all of the above” performance.


