SPONSOR BLOG

AI’s Rapid Growth: The Crucial Role Of High Bandwidth Memory

Every fractional increase in HBM subsystem performance has a multiplier effect on overall AI hardware performance.


System efficiency is dictated by the performance of its most critical components, and for AI hardware systems, that component is the memory subsystem. In this blog post, we provide an overview of the AI model landscape and the impact of HBM memory subsystems on effective system performance.

AI models have grown from a few hundred million parameters in the early ’90s to today’s Trillion Parameter Models (TPMs). The Statistical Language Models (SLMs) of the ’90s had a few hundred million parameters and were primarily targeted at speech, text processing, and next-word prediction. In those early days, the processing power and memory capacity of hardware systems were abundant relative to the size of the available AI models.

The staggering growth of the internet over the past few decades has produced internet-scale data sets. As larger and larger image data sets became available, neural networks became the algorithms of choice for training. Subsequently, Large Language Models (LLMs) with billions of parameters were created. The latest generation of AI models is multi-modal, i.e., Large Multi-modal Models (LMMs). These models are trained on multiple types of data, such as text, images, audio, and video, along with their inter-dependencies, resulting in Trillion Parameter Models (TPMs), with 100-trillion-parameter models on the horizon.

On the demand side, AI applications are also multiplying; specialized models for stock trading and medical imaging, for example, are in development. All of this points to enormous opportunities in model development, which in turn drives enormous demand for AI processing power. LLMs are growing at a rate of 410X every 2 years, and the compute power required for training is growing at 750X every 2 years. On the hardware side, compute power measured in floating-point operations per second (FLOPS) is growing at 3X every 2 years, and DRAM bandwidth measured in gigabits per second (Gbps) is growing at 2X every 2 years. The growth of LLMs is clearly far outpacing Moore’s Law, and the memory bandwidth growth rate is the limiting factor in the AI ecosystem, creating bottlenecks for AI hardware system performance. This limitation was forecast as early as 1990 by John Ousterhout, and in 1994 William Wulf and Sally McKee published a well-researched paper, “Hitting the memory wall: implications of the obvious.” Today, the “memory wall” is a commonplace term that highlights the criticality of memory bandwidth for AI hardware systems.
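To make the widening gap concrete, here is a minimal Python sketch that simply compounds the growth rates quoted above. The rate figures come from the text; the six-year horizon and the script itself are illustrative assumptions, not figures from the article.

```python
# Illustrative sketch: compound the 2-year growth factors quoted in the text.
# Assumption: simple exponential compounding over a 6-year planning horizon.

GROWTH_PER_2_YEARS = {
    "LLM parameter count": 410.0,
    "Training compute demand": 750.0,
    "Hardware compute (FLOPS)": 3.0,
    "DRAM bandwidth (Gbps)": 2.0,
}

def growth_over(years: float, factor_per_2_years: float) -> float:
    """Compound a 2-year growth factor over an arbitrary number of years."""
    return factor_per_2_years ** (years / 2.0)

HORIZON_YEARS = 6
for name, factor in GROWTH_PER_2_YEARS.items():
    print(f"{name:26s} grows ~{growth_over(HORIZON_YEARS, factor):,.0f}x "
          f"over {HORIZON_YEARS} years")

# The widening gap between training demand and DRAM bandwidth is the "memory wall":
gap = growth_over(HORIZON_YEARS, 750.0) / growth_over(HORIZON_YEARS, 2.0)
print(f"Training demand outgrows DRAM bandwidth by ~{gap:,.0f}x over {HORIZON_YEARS} years")
```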

High Bandwidth Memory (HBM), with its 1024-bit data bus, is the best of the available options for AI hardware memory subsystems. JEDEC released the HBM standard in 2013. HBM was originally targeted at Graphics Processing Units (GPUs). As GPUs became popular for AI training, HBM became the obvious choice for the memory subsystems used to train state-of-the-art (SOTA) transformer models. The major advantages of HBM over DDR or GDDR are higher bandwidth, substantially lower power, and a more compact DRAM form factor.

Over the last decade, the HBM2 and HBM3 standards have been released with improvements in operating frequency and DRAM stack height/capacity. The original HBM standard released in 2013 specified a 1 Gbps (gigabit per second) per-pin data rate; HBM2 raised this to 2.4 Gbps, and HBM3 is at 6.4 Gbps. JEDEC standards specify only the minimum required bandwidth, with no restrictions on higher bandwidths. Because of the explosive growth in the size of LLMs, AI hardware systems always demand higher performance, so HBM DRAM suppliers are continually marching towards higher-performance products. To differentiate these higher-speed devices from the base speed grades specified by JEDEC, the term HBM3E is used: HBM DRAM products compliant with the HBM3 standard but performing at higher speeds are labelled HBM3E products.
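The per-pin data rate translates directly into per-stack bandwidth through the 1024-bit interface. The short sketch below does that arithmetic for the speed grades named above; it is a back-of-the-envelope illustration only, and the 9.6 Gbps HBM3E entry reflects a vendor speed bin rather than a JEDEC minimum.

```python
# Back-of-the-envelope per-stack bandwidth: per-pin data rate (Gbps) times the
# 1024-bit interface width, divided by 8 bits per byte.

BUS_WIDTH_BITS = 1024

PER_PIN_RATES_GBPS = {
    "HBM (2013)": 1.0,
    "HBM2": 2.4,
    "HBM3": 6.4,
    "HBM3E (9.6 Gbps bin)": 9.6,  # vendor speed bin, not a JEDEC minimum
}

for name, pin_rate in PER_PIN_RATES_GBPS.items():
    stack_gb_per_s = pin_rate * BUS_WIDTH_BITS / 8  # GB/s per stack
    print(f"{name:22s}: {pin_rate:4.1f} Gbps/pin -> ~{stack_gb_per_s:7.1f} GB/s per stack")
```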

Memory subsystems for AI have two components: a) an HBM DRAM stack and b) an HBM IP on the SoC that provides the interface to the HBM DRAM stack.

It should be noted that the HBM IP on the SoC must perform at or above the rated speed of the HBM DRAM. For an SoC design start, the plan should always be to have the highest-performing HBM IP on the die for the following reasons:

  • Performance: Since memory bandwidth is the limiting factor for AI hardware system performance, every fractional increase in HBM subsystem performance has a multiplier effect on overall AI hardware system performance. For example, an AI hardware system with the recently available 9.6 Gbps HBM3E memory subsystem will significantly outperform a system built on the 8.0 Gbps HBM3E speed grade currently in production (a rough illustration follows this list).
  • Futureproofing: Typical SoC design cycles are 12 to 18 months, and the SoC product life cycle can be anywhere from four to ten years, depending on the target market segment. Product planning should therefore look at least six years into the future. Memory system design should consider the highest-speed HBM DRAM that will be available six years from the start of the SoC design and select HBM IP to match that speed grade.
  • Manufacturing: A higher-performing HBM IP provides additional margin to accommodate manufacturing process variations. For example, if the plan is to design an HBM memory system at 9.6 Gbps, an HBM IP on the SoC that performs at 12.8 Gbps (the anticipated speed of next-generation devices) will always provide more margin than an HBM IP rated at 9.6 Gbps.
  • Reliability: For planet-scale AI cloud operators, HBM memory system failures are among the top two reported causes of AI accelerator card failures. Data center workloads degrade HBM memory system performance over time. An HBM IP on the SoC designed for and performing at 12.8 Gbps will provide far better reliability than an HBM IP performing at 9.6 Gbps.
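As a rough illustration of the Performance point above, the sketch below uses a simple roofline-style estimate: for a memory-bound workload, attainable throughput is capped by bandwidth times arithmetic intensity, so it tracks the HBM data rate almost one-for-one before any system-level multiplier effects. The stack count, peak compute, and arithmetic intensity values are hypothetical placeholders, not figures from this article.

```python
# Roofline-style estimate (all parameter values are hypothetical placeholders).
# For a memory-bound workload: attainable = min(peak compute, bandwidth x intensity).

BUS_WIDTH_BITS = 1024
NUM_STACKS = 6                 # hypothetical accelerator with 6 HBM stacks
PEAK_COMPUTE_TFLOPS = 1000.0   # hypothetical peak compute, TFLOP/s
ARITHMETIC_INTENSITY = 100.0   # hypothetical FLOPs per byte read from HBM

def attainable_tflops(pin_rate_gbps: float) -> float:
    """Roofline estimate of attainable TFLOP/s for a given HBM per-pin rate."""
    bandwidth_tb_per_s = pin_rate_gbps * BUS_WIDTH_BITS * NUM_STACKS / 8 / 1000
    return min(PEAK_COMPUTE_TFLOPS, bandwidth_tb_per_s * ARITHMETIC_INTENSITY)

for rate in (8.0, 9.6):
    print(f"HBM3E @ {rate} Gbps/pin: ~{attainable_tflops(rate):6.1f} attainable TFLOP/s")
```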

Scaling the memory wall is a tough task. HBM standards and products have evolved from 1 Gbps HBM to 10.4 Gbps HBM3E, the speed grade currently supported by Cadence. The growth in memory bandwidth for AI hardware systems has been primarily linear, driven by the higher clock rates achieved over the last decade through advances in foundry, manufacturing, and design processes and techniques. The HBM4 standard was pre-announced in 2024, and the final version is anticipated to be released in 2025. HBM4 is expected to deliver significantly higher performance than the current HBM3E.

It is evident that High Bandwidth Memory plays a crucial role in AI hardware systems targeted at training. The challenge for the SoC designer is to plan for and provide the highest-performing memory subsystem, covering the data rates needed now as well as those needed by AI products to be released in the coming years.


