Breaking Down The AI Memory Wall

Memory is no longer able to keep pace with raw compute capability, creating a bottleneck that grows larger each year.


Over the past few decades, the semiconductor industry has witnessed the rapid evolution of memory technology, with new memories helping to usher in the usage models that characterized each decade. For example, synchronous memory helped drive the personal computer (PC) revolution in the 1990s, and was quickly followed by specialized graphics memory (GDDR) for graphics cards and game consoles in the 2000s. When smartphones took off earlier this decade, low-power memory was introduced for battery-operated mobile devices such as phones and tablets, providing both improved energy efficiency and better performance. Memory will continue to be a critical enabler as computing evolves, and we foresee that in the 2020s AI will be a key driver of ultra-high bandwidth and power efficiency for cloud, edge and endpoint applications.

The AI Memory Trinity: On-Chip, HBM & GDDR
There are currently three primary memory types that power AI systems and applications: on-chip memory, high-bandwidth memory (HBM) and graphics DDR SDRAM (GDDR). On-chip memory offers the highest bandwidth and power efficiency, although capacity is limited. Typical AI processors using on-chip memory can achieve tens of terabytes per second of memory bandwidth and a capacity of a few hundred megabytes. Examples of AI silicon implementing on-chip memory include Microsoft's Project Brainwave and Graphcore's IPU. As well, the newly announced Cerebras Wafer Scale Engine uses new manufacturing techniques to break reticle limits and combine multiple reticle-sized processing engines into one larger wafer-scale engine. With this approach, Cerebras achieves 9 petabytes per second of memory bandwidth and 18GB of on-chip memory capacity, which help feed 400,000 AI-optimized cores.

Solutions that require more memory capacity use external DRAM. As its name implies, HBM offers very high bandwidth (through the use of many parallel data wires that run at a relatively slow speed), high density and a high level of power efficiency through a new architecture for combining components. While this architecture improves performance and power efficiency, HBM designs incur added cost due to the need for advanced interposers (for the many wires needed to connect the processor to the DRAM) and substrates, as well as implementation complexity (from stacking inside the DRAM, and from stacking the SoC and DRAMs together) and new manufacturing methods. System thermals and reliability must also be approached differently, and require engineering expertise to manage effectively. Since the need for short connections places physical limits on the number of HBM DRAMs that can connect to the processor, it is important for the industry to focus holistically on improving HBM DRAM performance moving forward.

Current examples of HBM-powered silicon include AMD's Radeon RX Vega 56, NVIDIA's Tesla V100, the Fujitsu A64FX processor with 4 HBM2 DRAMs (powering the compute engine for the Post-K supercomputer) and the NEC Vector Engine Processor with 6 HBM2 DRAMs (powering the compute engine for the NEC SX-Aurora TSUBASA supercomputer platforms).

In contrast, GDDR offers AI system designers high bandwidth (high data rates) and well-understood manufacturing techniques similar to those used in conventional DDR memory systems. GDDR memory has been around for two decades and is assembled using traditional chip-on-PCB manufacturing. Challenges associated with GDDR system design include managing the signal integrity issues that come with high I/O data rates, as well as higher power consumption.

The graphic below compares the benefits of the current generation HBM2 and GDDR6 DRAMs when they are used to construct a 256GB/s memory system. HBM2 has a power and area advantage, with GDDR6 consuming 3.5X to 4.5X the power of HBM2 in the SoC PHY, and occupying 1.5X to 1.75X more PHY area on the SoC. But GDDR6 offers better system cost, since HBM2 requires stacking and additional components not needed in a GDDR6 memory system. Overall, GDDR offers a good tradeoff among bandwidth, power efficiency, cost and reliability. Current examples of GDDR-powered silicon include NVIDIA's GeForce RTX 2080Ti and AMD's Radeon RX580.

Source: Rambus
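The bandwidth parity in the comparison above follows from simple arithmetic: peak bandwidth is the per-pin data rate multiplied by the interface width. The sketch below works through one plausible way each technology reaches 256GB/s; the specific device configurations (a single 1024-bit HBM2 stack at 2 Gb/s per pin, versus 32-bit GDDR6 devices at 16 Gb/s per pin) are illustrative assumptions, not the exact parts behind the graphic.

```python
def peak_bandwidth_gbytes(data_rate_gbps_per_pin, width_bits):
    """Peak memory bandwidth in GB/s: per-pin rate x width / 8 bits per byte."""
    return data_rate_gbps_per_pin * width_bits / 8

# HBM2's approach: many parallel wires at a relatively slow per-pin speed.
# One 1024-bit stack at an assumed 2 Gb/s per pin hits the target alone.
hbm2_stack = peak_bandwidth_gbytes(2.0, 1024)        # 256.0 GB/s

# GDDR6's approach: narrow devices at very fast per-pin speeds.
# A 32-bit device at an assumed 16 Gb/s per pin yields 64 GB/s,
# so four such devices reach the same 256 GB/s system target.
gddr6_device = peak_bandwidth_gbytes(16.0, 32)       # 64.0 GB/s
devices_needed = 256 / gddr6_device                  # 4.0

print(hbm2_stack, gddr6_device, devices_needed)
```

The contrast makes the engineering tradeoff concrete: HBM2 concentrates the wires (hence the interposer), while GDDR6 concentrates the per-pin speed (hence the signal integrity challenges).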

Another Brick in the Memory Wall
From 2012 to 2019, AI training capability increased by a staggering 300,000X, doubling approximately every 3.5 months. This great leap in AI capabilities is 25,000X faster than Moore's Law over the same time period. However, despite impressive gains in AI development facilitated by on-chip memory, HBM and GDDR, the industry continues to demand even more performance. A key question for the semiconductor industry is how to continue providing these kinds of performance gains when two of the most important tools we have relied on for decades, Moore's Law and Dennard Scaling, are slowing or no longer available. This has led the semiconductor industry to an AI Memory Wall, a stark reminder of the processor-memory gap that has been growing for years.
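To see what a 3.5-month doubling period implies, it helps to relate the growth factor to the number of doublings. A rough sketch, using only the figures quoted above: 300,000X corresponds to log2(300,000) ≈ 18.2 doublings, and at 3.5 months each that is on the order of five and a half years of growth.

```python
import math

def doublings(growth_factor):
    """Number of doublings needed to reach a given overall growth factor."""
    return math.log2(growth_factor)

def months_elapsed(growth_factor, doubling_period_months):
    """Months implied by that many doublings at a fixed doubling period."""
    return doublings(growth_factor) * doubling_period_months

d = doublings(300_000)              # ~18.2 doublings
m = months_elapsed(300_000, 3.5)    # ~63.7 months, roughly 5.3 years
print(round(d, 1), round(m, 1))
```

Compare that to transistor density under Moore's Law, which doubles roughly every 24 months: the same interval of time yields only a couple of doublings, which is the gap the article's 25,000X comparison is pointing at.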

Indeed, memory latency and bandwidth are limiting system performance, with sustained (streaming) memory bandwidth continuing to fall behind peak FLOP rates. Put simply, memory is no longer able to keep pace with raw compute capability. In addition, network latency and bandwidth also continue to fall behind processor performance at an alarming rate. This imbalance has created a significant bottleneck that continues to grow larger each year. Although multiple techniques have been developed to mitigate this imbalance, the industry has been forced to turn to new system architectures and domain-specific silicon so that modern AI applications can continue to evolve at a steady cadence. The formidable memory wall that has formed is further heightened by the end of Dennard Scaling, a slowing Moore's Law and the signal integrity challenges incurred by faster data rates.

Reducing data movement
The development of AI systems is challenged by the high energy costs associated with on-chip and off-chip data movement and memory access. To improve AI system efficiency, off-chip data movement should be eschewed whenever and wherever possible. As well, data must be reused to amortize memory access and data movement energy. New generations of AI architecture must continue to emphasize data locality to minimize data movement and stay within a reasonable power envelope. Domain-specific (dedicated) AI hardware has already managed to increase power efficiency by 100X-1000X compared to general purpose processors, and new architecture innovations are needed to further improve power-efficiency.
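The value of data reuse can be made concrete with a simple traffic model for a matrix multiply, the core operation in neural network training and inference. The sketch below is illustrative only, not a measurement of any real accelerator: it counts off-chip element reads with no reuse versus a blocked (tiled) schedule that holds tiles in on-chip memory, where the tile size is an assumed parameter.

```python
def naive_offchip_reads(n):
    """No reuse: every multiply-accumulate in an n x n matrix multiply
    fetches one A element and one B element from external DRAM."""
    return 2 * n ** 3

def tiled_offchip_reads(n, t):
    """Blocked into t x t tiles kept in on-chip memory: each fetched
    element is reused for roughly t multiply-accumulates, cutting
    DRAM reads by about a factor of t."""
    return 2 * n ** 3 // t

n, t = 1024, 32
reduction = naive_offchip_reads(n) // tiled_offchip_reads(n, t)
print(reduction)  # tiling cuts off-chip reads by the tile factor, 32x here
```

Since off-chip DRAM access costs orders of magnitude more energy than an on-chip access, this kind of reuse factor translates almost directly into the power efficiency gains that domain-specific AI hardware exploits.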

Forward-looking proposals for the tighter integration of compute and data envision compute engines placed closer to the DRAMs and storage, essentially integrating processing more closely with memory, and in some cases directly into the DRAM and storage silicon. One example of this paradigm is the N3XT NanoSystems research work, which envisions an aspirational monolithic integrated 3D system with computation immersed in memory. Another example is the recently announced solution from UPMEM, which integrates processor cores directly onto a DRAM.

The rapid evolution of AI applications is significantly changing how information is consumed, moved and processed. Increasing memory bandwidth while emphasizing power efficiency is critical to further enabling the analysis of broad behaviors and training of neural networks for the cloud, edge and endpoints. As we discussed in this article, AI system designers are leveraging on-chip memory, HBM and GDDR to address latency, bandwidth, power, cost/bit and reliability requirements in order to meet a diverse range of system needs. New architectures and domain-specific silicon will continue to be needed to help AI system designers more effectively address these requirements, and new innovations will be required to continue the historic growth rates of AI processing.
