Memory Wall Problem Grows With LLMs

Hardware advances have addressed some of the problems. The next step may hinge on the algorithms.


The growing imbalance between the amount of data that needs to be processed to train large language models (LLMs) and the inability to move that data back and forth fast enough between memories and processors has set off a massive global search for a better and more energy- and cost-efficient solution.

Much of this is evident in the numbers. The GPU market is forecast to reach $190 billion in 2029, twice the size of the general-purpose CPU market.[1] GPUs, in conjunction with parallel compute platforms such as Nvidia’s CUDA, are highly scalable matrix multiplication engines. That maps well onto AI training, because the performance of LLMs is directly correlated with the number of parameters they contain, and thus with the computing resources they require. Commercial LLMs typically involve billions or even trillions of parameters.

But where that processing is best performed, what kind of processing elements are used, and how much data needs to be moved isn’t clear today. GPUs are not the most energy-efficient processors, and computing everything in an array of GPUs requires more data movement than memories and interconnects can efficiently handle. As a result, governments, corporations, and private investors are collectively pouring hundreds of billions of dollars into research on architectures, new algorithms, and new computational approaches.

Trillions of parameters, GPU-centuries of training
Training an LLM is a compute-bound task that involves many matrix multiplications. Typically, training data (B) is multiplied by a weight matrix (A) and incremented by the current network layer (C’) to produce the next layer (C).
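
In rough terms, the per-layer operation is a general matrix-matrix multiply (GEMM) with accumulation, C = A·B + C’. A minimal NumPy sketch of that step, with dimensions invented purely for illustration:

import numpy as np

# Illustrative sizes only: a d_model x d_model weight matrix and a small batch of training examples.
d_model, batch = 4096, 16

A = np.random.randn(d_model, d_model).astype(np.float16)  # weight matrix
B = np.random.randn(d_model, batch).astype(np.float16)    # block of training data
C_prev = np.zeros((d_model, batch), dtype=np.float16)     # current accumulated layer output

# One GEMM step, C = A @ B + C_prev. Every weight fetched from memory is reused
# across all columns of B, which keeps the arithmetic units busy (compute-bound).
C = A @ B + C_prev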

In a short course at December’s IEEE Electron Devices Meeting (IEDM), Emanuele Confalonieri, senior member of the technical staff at Micron, observed that training a large model can require hundreds of GPU-years and more than a dozen terabytes of memory. As a result, it places extreme demands on device reliability. Because the whole model is trained in parallel, even a single GPU failure might require a restart of the whole job. A single uncorrectable DRAM error can impact thousands of GPUs.[2] (The recently announced DeepSeek model claims to significantly reduce training time, but the claim has not been independently verified.)

Actually using an LLM — the “inference” phase — requires a two-step process for each user query. The user request consists of a specific prompt, plus any context the user provides. This can include, for example, “Summarize this article,” plus the text of the article. The system first converts this input to a series of tokens. Then it multiplies the weight matrix (A) by the token vector (x) and adds the current layer’s response vector (y’) to produce a new response vector (y).
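
Each layer of that pass reduces to a matrix-vector multiply (GEMV) with accumulation, y = A·x + y’. A minimal NumPy sketch, again with illustrative sizes, with the vector x standing in for the embedding of one token from the prompt:

import numpy as np

d_model = 4096
A = np.random.randn(d_model, d_model).astype(np.float16)  # weight matrix, reused for every token
x = np.random.randn(d_model).astype(np.float16)           # embedding of one input token
y_prev = np.zeros(d_model, dtype=np.float16)              # current layer's response vector

# One GEMV step, y = A @ x + y_prev. Each weight is fetched from memory
# but used for only a single multiply-add, so memory traffic dominates.
y = A @ x + y_prev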

A feed-forward network repeats this step many times to develop the actual response to the user query. According to Haerang Choi and colleagues at SK hynix, in a presentation at IEDM, matrix-vector multiplication accounts for 90% of the response-phase workload.[5] Because it requires less than one arithmetic operation per byte of data moved, Confalonieri explained, matrix-vector multiplication is a memory-bound task.
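
A back-of-the-envelope calculation shows why. Counting only the weight traffic, and assuming 16-bit weights and an illustrative model width, a matrix-vector multiply performs roughly one operation per byte fetched (less than one once activation reads and result writes are included), while a batched matrix-matrix multiply reuses each weight many times:

n, bytes_per_weight, batch = 4096, 2, 16        # illustrative model width, FP16 weights, training batch size

ops_gemv = 2 * n * n                            # one multiply and one add per weight
bytes_gemv = n * n * bytes_per_weight           # every weight must be fetched (activation traffic ignored)

ops_gemm = 2 * n * n * batch                    # the same weights serve every column in the batch
bytes_gemm = n * n * bytes_per_weight           # weights are fetched once and reused

print(f"GEMV: {ops_gemv / bytes_gemv:.1f} ops per byte")  # ~1, so memory bandwidth is the bottleneck
print(f"GEMM: {ops_gemm / bytes_gemm:.1f} ops per byte")  # ~16, so arithmetic is the bottleneck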

Unfortunately, such large computational requirements lead to equally large energy and cooling requirements. Each ChatGPT query consumes more than five times as much energy as a simple web search. As a result, utilities are facing conflicts between the needs of their data center customers and the needs of their other commercial and residential ratepayers.

In conventional system architectures, much of the energy consumed by a memory-bound process goes to the overhead of moving data into the processor’s working memory and then writing back the result. Reducing the reliance of LLMs on large data matrices is a task for researchers developing machine learning algorithms, not integrated circuit manufacturers. Still, Nvidia’s Brucek Khailany said better silicon can reduce the power consumed by memory read/write operations, increase the bandwidth between memory and compute elements, and support alternative architectures that place computation closer to the memory elements.[3]

Task-specific memory architectures
Better silicon also can incorporate alternative memory technologies. While emerging memory technologies have yet to establish themselves for general use, they offer unique tradeoffs between speed and data persistence. The frequency of read/write operations and the data lifetime depend on the task. In inference workloads, the same weight matrix might be used over and over again.

Hang-Ting Lue, director of the Emerging Memory Technology R&D Division at Macronix, and his colleagues noted that technologies like NOR flash — which are relatively slow to write, but fast to read — are a good match for this requirement. Moreover, non-volatile storage allows the data to be re-used many times without additional write steps. They estimate their design can reduce chip I/O by as much as 1,000X, significantly changing the argument for high bandwidth memory.⁠[4]

In contrast, tasks like streaming I/O and data analysis draw in large amounts of data, compute a result, and move to the next data block. Fast read and write speeds are essential in these applications, but data persistence is relatively unimportant.

Emerging machine learning tools like retrieval-augmented generation (RAG) add further complexity, Lue said. RAG algorithms seek to improve accuracy by using a near-real-time topic-specific search to provide further context to the LLM. The need to retrieve then tokenize this additional data introduces another memory bottleneck.
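
In outline, a RAG pipeline inserts a retrieval and tokenization step ahead of ordinary inference. The sketch below is a generic illustration rather than any particular vendor’s implementation; vector_store, retrieve(), tokenize(), and generate() are hypothetical placeholders:

def answer_with_rag(query, vector_store, llm):
    # 1. Retrieve topic-specific documents semantically close to the query.
    #    This lookup is itself another memory-intensive step.
    context_docs = vector_store.retrieve(query, top_k=5)

    # 2. Tokenize the retrieved context together with the original prompt.
    #    The added context inflates the number of tokens the model must process.
    prompt = "\n".join(context_docs) + "\n\nQuestion: " + query
    tokens = llm.tokenize(prompt)

    # 3. Run ordinary LLM inference over the enlarged token sequence.
    return llm.generate(tokens)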

Widely differing requirements explain why many different architectures fall under the compute-in-memory umbrella. Onur Mutlu, computer science professor at ETH Zurich, and his colleagues differentiated between computation using memory and computation near memory.[6] A near-memory architecture might have computational elements near non-volatile memory banks containing a weight matrix, with shared access to a block of dynamic memory containing the prompt vector. Choi and colleagues demonstrated this approach with their “Accelerator in Memory” processor (see Figure 1). The approach has the advantage of flexibility, because the specific mapping of the model weights to the memory banks is under programmatic control.[5]

Fig. 1: Accelerator in memory module. Processing units adjacent to each DRAM bank have access to a shared buffer. Source: IEEE and Ref. 5.
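
The division of labor in such a design can be modeled in software. The sketch below is only a behavioral illustration, assuming the weight matrix is split row-wise across banks, with each bank’s processing unit computing a partial matrix-vector product against a shared prompt vector:

import numpy as np

def near_memory_gemv(A, x, num_banks=8):
    # Behavioral model only: each "bank" holds a row-slice of the weight matrix,
    # and a processing unit next to it multiplies that slice by the shared vector x.
    partials = []
    for bank_rows in np.array_split(np.arange(A.shape[0]), num_banks):
        bank_weights = A[bank_rows, :]          # weights resident in this bank
        partials.append(bank_weights @ x)       # computed next to the data
    return np.concatenate(partials)             # only the short result vector leaves the module

# Example: 4096 x 4096 weights split across 8 banks. The host sees a 4096-element
# result instead of streaming 16 million weights across the memory interface.
A = np.random.randn(4096, 4096).astype(np.float16)
x = np.random.randn(4096).astype(np.float16)
y = near_memory_gemv(A, x)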

Computation using memory, in contrast, uses the characteristics of the memory array itself to operate directly on the stored bits. For instance, an array of analog RRAM or PCM elements might use Ohm’s Law to multiply stored conductances by an input voltage vector and Kirchhoff’s Current Law to sum the resulting currents. This approach dramatically reduces data movement, but it is relatively inflexible. As an alternative, the ETH Zurich group demonstrated a multiple-instruction, multiple-data (MIMD) model that maps different instructions to different DRAM sub-arrays. Rather than having to manipulate an entire DRAM bank at once, their MIMDRAM system works at the same granularity commonly targeted by modern compilers.
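
Numerically, such a crossbar behaves like a conductance matrix: input voltages applied along the rows produce column currents whose sums are the dot products. An idealized model of that behavior, ignoring noise, device variation, and ADC quantization:

import numpy as np

# Idealized crossbar: weights stored as conductances G (siemens), inputs applied as voltages V (volts).
G = np.abs(np.random.randn(256, 256)) * 1e-6    # 256 x 256 array of RRAM/PCM cell conductances
V = np.random.rand(256)                         # input voltage vector

# Ohm's Law gives each cell's current (I = G * V); Kirchhoff's Current Law sums
# the currents flowing into each column, so the column currents are the dot products.
I_columns = G.T @ V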

Inference on the edge
While standards for high-bandwidth memory in high-performance computing are starting to emerge, similar standards for low-power applications have yet to appear. Portable and wearable devices might use machine learning models for voice recognition, translation, navigation, and more. Performing inference tasks locally on the device is preferable for both performance and privacy reasons. Using low-precision weights can reduce memory requirements, but edge devices still face a challenging balance among performance, bandwidth, and power consumption.
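
Low-precision weights trade a small accuracy loss for a large cut in memory footprint and traffic. A minimal sketch of symmetric 8-bit weight quantization, for illustration only; production toolchains typically use per-channel scales and calibration data:

import numpy as np

def quantize_int8(weights):
    # Symmetric linear quantization: map the largest-magnitude weight to +/-127.
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)   # 1 byte per weight instead of 2 (FP16) or 4 (FP32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale             # approximate weights recovered at compute time

W = np.random.randn(4096, 4096).astype(np.float32)
W_q, scale = quantize_int8(W)
# The stored model shrinks 4x versus FP32, easing both the memory footprint and the bandwidth budget.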

But even with all these advances, the fundamental challenge remains. The memory wall doesn’t go away. Compute-in-memory arrays still need to source their data from somewhere, and the amount of memory on a single chip always will be finite. Intelligent mapping between the model and the underlying hardware can make the most of the available silicon. In the end, though, machine learning models are outpacing improvements in silicon. The next breakthroughs likely will require better algorithms, not better silicon.

References

  1. Yole Group report on Status of the Processor Industry.
  2. Emanuele Confalonieri, “Memory Needs and Solutions for AI and HPC,” 2024 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, Short Course on AI Systems and the Next Leap Forward.
  3. Brucek Khailany, “AI Accelerator Hardware Trends and Research Directions,” 2024 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, Short Course on AI Systems and the Next Leap Forward.
  4. Hang-Ting Lue, et al., “Prospects of Computing In or Near Flash Memories,” 2024 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, Paper 3.3.
  5. Haerang Choi, et al., “AiMX: Accelerator-in-Memory Based Accelerator for Cost-effective Large Language Model Inference,” 2024 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, Paper 3.1.
  6. Onur Mutlu, et al., “Memory-Centric Computing: Recent Advances in Processing-in-DRAM,” 2024 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, Paper 3.4.

Related Reading
Normalization Keeps AI Numbers In Check
It’s mostly for data scientists, but not always.
Is In-Memory Compute Still Alive?
It hasn’t achieved commercial success, but there is still plenty of development happening; analog IMC is getting a second chance.


