How AI Impacts Memory Systems

The ways different architectures get around the memory bottleneck.


Throughout the 1980s and early 1990s, computer systems were bottlenecked by relatively slow CPU performance, limiting what applications could do. Driven by Moore’s Law, transistor counts increased steadily over the years, improving system performance and enabling exciting new computing possibilities.

Although computing capabilities have advanced significantly in recent years, bottlenecks have shifted to other parts of the computing system. Put simply, while Moore’s Law has addressed processing needs and enabled new computing paradigms, a new set of challenges now confronts the industry.

Evolving devices and computing models
The period from 1990 to 2000 was characterized by centralized computing that revolved around desktops and workstations. Improved connectivity and process improvements from 2000 to 2010 enabled a shift toward mobile computing, smartphones and the cloud, while 2010 and beyond has seen a proliferation of connected IoT devices and sensors, along with a shift toward fog/edge computing. The latter moves processing closer to the data, improving latency, bandwidth and energy use.

Today there are a number of notable applications driving computing evolution, such as machine learning/neural networks, advanced driver assistance systems (ADAS)/autonomous vehicles, high performance computing (HPC) and blockchain/cryptocurrency mining.

Specific examples of AI driving silicon and system architecture development include Intel’s Nervana Neural Network Processor, the Microsoft BrainWave platform leveraging FPGAs, and Google’s Tensor Processing Unit (TPU). Additional examples include the Wave Computing Dataflow Processing Unit (DPU), Graphcore’s IPU, Cambricon’s Deep Learning AI Processor, AMD’s Radeon Vega 10 and Nvidia’s Tesla V100.

Importance of memory bandwidth: Roofline Model
How are AI applications performing as hardware evolves? A well-known analysis tool, the Roofline Model, can be used to show how well applications are able to make use of the full potential of the underlying hardware’s memory bandwidth and processing power.

Rooflines vary for different system architectures. In the image above, the Y-axis represents the performance in terms of operations per second, while the X-axis represents the operational intensity, or the number of operations that are performed on each byte. Two architectural limits are illustrated by the green lines. The first is the sloping line, which shows the limits imposed by memory bandwidth. The second is the horizontal line, which shows limits imposed by the computational performance of the hardware. Together, these lines form a roofline shape, hence the name of the model.

Applications that are starved for memory bandwidth, or that perform few operations per byte of data, will typically fall at or below the sloping portion of the roofline. Applications with ample memory bandwidth, or with high operational intensity, will typically fall at or below the horizontal part of the roofline. In this example, applications with an operational intensity of 10 are memory-bound, while applications with an operational intensity of 10,000 are compute-bound.
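The shape of the roofline falls out of a single formula: attainable throughput is the minimum of the compute ceiling and the bandwidth slope. A minimal sketch in Python, using hypothetical hardware figures (100 TOPS peak, 50 GB/s memory bandwidth) rather than any specific chip’s numbers:

```python
def attainable_ops(intensity, peak_ops, mem_bw):
    """Roofline model: attainable throughput (ops/s) is capped by either
    the compute ceiling (horizontal line) or the memory-bandwidth slope."""
    return min(peak_ops, mem_bw * intensity)

# Hypothetical accelerator, for illustration only.
PEAK = 100e12   # peak compute, ops/s (100 TOPS)
BW = 50e9       # memory bandwidth, bytes/s (50 GB/s)

# Ridge point: the operational intensity at which the machine shifts
# from memory-bound to compute-bound.
ridge = PEAK / BW  # 2000 ops/byte

low = attainable_ops(10, PEAK, BW)        # memory-bound: 0.5 TOPS
high = attainable_ops(10_000, PEAK, BW)   # compute-bound: 100 TOPS
```

With these assumed numbers, an intensity of 10 sits well below the ridge point (memory-bound), while an intensity of 10,000 sits above it (compute-bound), matching the example above.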

This chart, taken from Google’s paper on the first-generation TPU, compares the TPU’s performance on inference tasks for various types of neural networks against older, more general-purpose hardware (Haswell, K80). While the applications typically perform well on these architectures, newer purpose-built silicon like the Google TPU is often limited by memory bandwidth, with several applications falling at or near the sloped part of the roofline. Newer chips and platforms are therefore embracing high bandwidth memory system solutions to address the bandwidth needs of AI silicon and systems.

Common memory systems for AI applications
There are a number of memory options that are suitable for AI applications, including on-chip memory (highest bandwidth and power efficiency), HBM (very high bandwidth and density) and GDDR (a good tradeoff between bandwidth, power efficiency, cost and reliability).

First, let’s take a closer look at on-chip memory, which is implemented in Microsoft’s BrainWave and Graphcore’s IPU. Benefits include extremely high bandwidth and efficiency, along with low latency and high utilization without job batching. On the flip side, on-chip memory offers lower storage capacity than DRAM, although data can be re-computed to save space. In addition, scaling capacity requires connecting multiple cards and chips.

Meanwhile, HBM can be found in Intel’s Nervana, Nvidia’s Tesla V100 and Google’s TPU v2. Benefits include very high bandwidth – 256 GB/s per HBM2 DRAM stack – and high power efficiency, enabled by short interconnects and a wide, slow interface (1024 bits at 2 Gb/s per pin). However, HBM presents a number of engineering challenges, such as a high IO count, cost and design complexity, an additional interposer component and more difficult system integration.
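The HBM2 figures above are self-consistent: a 1024-bit interface running at 2 Gb/s per pin works out to exactly 256 GB/s per stack. A quick sanity check:

```python
# HBM2 per-stack bandwidth: a wide (1024-bit) but relatively slow
# (2 Gb/s per pin) interface.
width_bits = 1024
pin_rate_gbps = 2  # Gb/s per pin

bandwidth_gbs = width_bits * pin_rate_gbps / 8  # bits -> bytes
# -> 256 GB/s per stack
```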

Lastly, GDDR offers high bandwidth paired with high capacity, along with easier integration and system engineering than HBM. However, maintaining good signal integrity is more difficult than with other external memories due to its high IO data rates. In short, GDDR strikes a balanced tradeoff between bandwidth, capacity, power efficiency, cost, reliability and design complexity.
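GDDR makes the opposite tradeoff to HBM: a narrow, fast interface per device instead of a wide, slow one. As an illustration, using figures not taken from the article (a single 32-bit GDDR6 device at 16 Gb/s per pin):

```python
# GDDR trades a narrower, faster interface for easier board-level
# integration. Illustrative figures, not from the article:
# a single 32-bit GDDR6 device running at 16 Gb/s per pin.
width_bits = 32
pin_rate_gbps = 16  # Gb/s per pin

per_device_gbs = width_bits * pin_rate_gbps / 8  # -> 64 GB/s per device

# Matching one HBM2 stack (256 GB/s) would take several such devices,
# each needing careful signal-integrity work at these pin speeds.
devices_needed = 256 / per_device_gbs  # -> 4
```

The per-pin rates involved (an order of magnitude above HBM2’s) are why the article notes that signal integrity is harder to maintain with GDDR than with other external memories.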

To summarize, artificial intelligence applications are driving the development of new silicon and system architectures. Memory bandwidth is a critical resource for AI applications, with multiple memory options available to suit different AI application needs. These include on-chip memory, HBM and GDDR. The range of tradeoffs they provide will enable a variety of AI chips and systems to be built in the future.
