Increasing AI Energy Efficiency With Compute In Memory

How to process zettascale workloads and stay within a fixed power budget.


Skyrocketing AI compute workloads and fixed power budgets are forcing chip and system architects to take a much harder look at compute in memory (CIM), which until recently was considered little more than a science project.

CIM solves two problems. First, it takes more energy to move data back and forth between memory and processor than to actually process it. And second, so much data is being collected through sensors and other sources and parked in memory that it's faster to pre-process at least some of that data where it is being stored. Or looked at differently, the majority of data is worthless, but compute resources are valuable, so anything that can be done to reduce the volume of data is a good thing.

In a keynote address at the recent Hot Chips 2023 conference, Google Chief Scientist Jeff Dean observed that model sizes and the associated computing requirements are increasing by as much as a factor of 10 each year.[1] And while zettascale computing (at least 10²¹ operations per second) is within reach, it carries a high price tag.

Case in point: Lisa Su, chair and CEO of AMD, observed that if current trends continue, the first zettascale computer will require about 0.5 gigawatts of power for a single system, roughly half the output of a typical nuclear power plant.[2] In a world increasingly concerned about energy demand and energy-related carbon emissions, the assumption that data centers can grow indefinitely is no longer valid. And even if it were possible, the physics of interconnect speed and thermal dissipation place hard limits on data bandwidth.

Simple multiplication and addition … times a billion parameters
Machine learning models have massive data transfer needs relative to their modest computing requirements. In neural networks, both the inference and training stages typically involve multiplying a large matrix (A) by an input vector (x) and adding a bias term to the result, as in the standard scaled matrix-vector product:

y ← αAx + βy

Some models use millions or even billions of parameters. With such large matrices, reading and writing the data to be operated on may take much longer than the calculation itself. ChatGPT, the large language model, is an example: the memory-bound portion of its workload accounts for as much as 80% of total execution time.[3] At last year's IEEE International Electron Devices Meeting (IEDM), Dayane Reis, assistant professor at the University of South Florida, and her colleagues noted that table lookup operations for recommendation engines can account for 70% of execution time.[4] For this reason, CIM architectures can offer an attractive alternative.

Others agree. “A number of modern workloads suffer from low arithmetic intensity, which means they fetch data from main memory to perform one or two operations and then send it back,” said James Myers, STCO program manager at imec. “In-memory compute targets these workloads specifically, where lightweight compute closer to the memory has the potential to improve overall performance and energy efficiency.” By targeting data transfer requirements, engineers can drastically reduce both execution time and power consumption.
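
To see why, consider the scaled matrix-vector product above. The back-of-the-envelope sketch below estimates its arithmetic intensity; the layer size, fp16 precision, and hardware figures are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-the-envelope arithmetic intensity for y = alpha*A@x + beta*y (GEMV).
# All sizes, precisions, and hardware figures are illustrative assumptions.
n_rows, n_cols = 4096, 4096            # one modest fully connected layer
bytes_per_elem = 2                     # fp16 weights and activations (assumed)

flops = 2 * n_rows * n_cols            # one multiply + one add per weight
bytes_moved = bytes_per_elem * (n_rows * n_cols      # stream the matrix A once
                                + n_cols             # read x
                                + 2 * n_rows)        # read and write y

intensity = flops / bytes_moved        # FLOPs performed per byte fetched
print(f"arithmetic intensity ~ {intensity:.2f} FLOP/byte")

# Compare time spent computing vs. time spent waiting on DRAM.
peak_flops = 10e12                     # 10 TFLOP/s of compute (assumed)
dram_bw = 100e9                        # 100 GB/s of memory bandwidth (assumed)
print(f"compute time: {flops / peak_flops * 1e6:6.1f} us")
print(f"memory time:  {bytes_moved / dram_bw * 1e6:6.1f} us")
```

At roughly one FLOP per byte, even this modest hypothetical accelerator spends about two orders of magnitude more time streaming the weight matrix than multiplying it.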

Designing efficient CIM architectures is non-trivial, though. In work presented at this year’s VLSI Symposium, researcher Yuhao Ju and colleagues at Northwestern University considered AI-related tasks for robotics applications.⁠[5] Here, general-purpose computing accounts for more than 75% of the total workload, including such tasks as trajectory tracking and camera localization. In recommendation engines, the large pool of possibilities identified through table lookups still needs to be filtered against the user query. Even when neural network calculations are clearly identifiable as a limiting factor, the exact algorithms involved differ. Neural network research is advancing faster than integrated circuit design cycles. A hardware accelerator designed for a particular algorithm might be obsolete by the time it’s actually realized in silicon.

One possible solution, seen in designs like Samsung's LPDDR-PIM accelerator module, relies on a simple but general-purpose calculation module optimized for matrix multiplication or some other arithmetic operation. Software tools designed to manage memory-coupled computing assume the job of effectively partitioning the workload.
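
In its simplest form, that partitioning can be a heuristic that compares each operation's arithmetic intensity against the point at which the host stops being memory-bound. The sketch below is a hypothetical illustration of the idea, not any vendor's actual software stack; the threshold and operation sizes are assumed.

```python
from dataclasses import dataclass

# Hypothetical partitioning heuristic: route memory-bound operations to a
# near-memory compute unit and keep compute-dense operations on the host.
# The threshold and operation descriptions are illustrative assumptions.

@dataclass
class Op:
    name: str
    flops: int         # arithmetic work
    bytes_moved: int   # estimated DRAM traffic (reads + writes)

HOST_RIDGE_POINT = 50.0  # FLOP/byte at which the host becomes compute-bound (assumed)

def place(op: Op) -> str:
    intensity = op.flops / op.bytes_moved
    # Low-intensity ops benefit from running next to the DRAM banks;
    # high-intensity ops still belong on the host or a conventional accelerator.
    return "PIM" if intensity < HOST_RIDGE_POINT else "host"

ops = [
    # name, FLOPs, bytes of DRAM traffic (2-byte elements, rough estimates)
    Op("fc_gemv_4096x4096", flops=2 * 4096 * 4096,
       bytes_moved=2 * (4096 * 4096 + 2 * 4096)),
    Op("conv3x3_256ch_56x56", flops=2 * 9 * 256 * 256 * 56 * 56,
       bytes_moved=2 * (2 * 256 * 56 * 56 + 9 * 256 * 256)),
]
for op in ops:
    print(f"{op.name}: {op.flops / op.bytes_moved:7.1f} FLOP/byte -> {place(op)}")
```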

Fig. 1: One possible PIM architecture places a floating-point module between memory banks. Source: K. Derbyshire/Semiconductor Engineering

However, using software to assign tasks adds overhead. If the same data is sent to multiple accelerators, one for each stage in an algorithm, the advantage of CIM may be lost. Another approach, proposed by the Northwestern University group, integrates a CNN accelerator with conventional general-purpose logic. The CPU writes to a memory array, which the accelerator treats as one layer of a CNN. The results are written to an output array, which the CPU treats as an input cache. This approach reduced end-to-end latency by as much as 56%.
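
Conceptually, that handoff looks like two engines aliasing the same buffers. The toy model below captures only that data flow; the function names and sizes are invented for illustration and do not reflect the published design's interfaces.

```python
import numpy as np

# Conceptual model of the shared-array handoff: the CPU fills a buffer, the
# in-memory CNN accelerator treats that buffer as a layer's input, and the
# accelerator's output buffer doubles as the CPU's input cache for the next
# stage. All names here are illustrative, not the published design's API.

def cpu_preprocess(frame: np.ndarray) -> np.ndarray:
    # Stand-in for general-purpose work (e.g., normalization before inference).
    return (frame - frame.mean()) / (frame.std() + 1e-6)

def cim_conv_layer(buffer_in: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # The accelerator reads the shared buffer in place -- no copy back to the host.
    return np.maximum(0.0, buffer_in @ weights)    # toy "layer": matmul + ReLU

def cpu_postprocess(buffer_out: np.ndarray) -> np.ndarray:
    # Stand-in for downstream general-purpose work (e.g., picking a waypoint).
    return buffer_out.argmax(axis=-1)

frame   = np.random.rand(1, 1024).astype(np.float32)
weights = np.random.rand(1024, 10).astype(np.float32)

shared_in  = cpu_preprocess(frame)                 # CPU writes the array the accelerator reads
shared_out = cim_conv_layer(shared_in, weights)    # accelerator output becomes the CPU's input cache
print("predicted index:", cpu_postprocess(shared_out))
```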

Fig. 2: A unified architecture can boost both neural network and vector calculations. Source: K. Derbyshire/Semiconductor Engineering

These solutions are feasible with current device technology. They depend on conventional CMOS logic and DRAM memory circuits, where the data path is fixed once the circuit is fabricated. In the future, fast non-volatile memories could enable reconfigurable logic arrays, potentially blurring the line between “software” and “hardware.”

How emerging memories help
Reis and colleagues designed a configurable memory array based on FeFETs to accelerate a recommendation system. Each array can operate in RAM mode to read and write lookup tables, in GPCiM (general-purpose compute-in-memory) mode to perform Boolean logic and arithmetic operations, or in content-addressable memory (CAM) mode to search the entire array in parallel. In simulations, this architecture achieved a 17X reduction in end-to-end latency and a 713X energy improvement for queries on the MovieLens dataset.
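
Functionally, the three modes amount to three access patterns over the same storage. The toy software model below mimics that behavior to make the distinction concrete; it says nothing about the underlying FeFET circuits, and all names and sizes are illustrative.

```python
import numpy as np

# Toy functional model of a mode-configurable array: the same storage can be
# addressed as RAM (read/write rows), as a compute array (bulk Boolean or
# arithmetic ops across rows), or as a CAM (search all rows in parallel).
# Purely illustrative -- not the FeFET implementation described in [4].

class ConfigurableArray:
    def __init__(self, rows: int, cols: int):
        self.data = np.zeros((rows, cols), dtype=np.uint8)

    # RAM mode: ordinary addressed reads and writes of lookup-table entries.
    def write(self, row: int, value: np.ndarray): self.data[row] = value
    def read(self, row: int) -> np.ndarray:       return self.data[row]

    # GPCiM mode: Boolean/arithmetic ops without moving rows off the array.
    def bulk_and(self, row_a: int, row_b: int) -> np.ndarray:
        return self.data[row_a] & self.data[row_b]

    # CAM mode: compare a key against every row at once, return matching rows.
    def search(self, key: np.ndarray) -> np.ndarray:
        return np.where((self.data == key).all(axis=1))[0]

arr = ConfigurableArray(rows=8, cols=4)
arr.write(3, np.array([1, 2, 3, 4], dtype=np.uint8))
print(arr.search(np.array([1, 2, 3, 4], dtype=np.uint8)))   # -> [3]
```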

Part of the appeal of 3D integration is the potential to improve performance by increasing bandwidth and reducing the data path length. Yiwei Du and colleagues at Tsinghua University built an HfO2/TaOx ReRAM array on top of conventional CMOS logic, then added a third layer of InGaZnOx FeFETs. The CMOS layer served as control logic, while the FeFET layer provided a reconfigurable data path. In this design, a standard processing element uses the CMOS layer and its associated ReRAM array to implement matrix-vector multiplication. A full network layer requires more than one processing element, so the FeFET layer coordinates data transfer among them. Overall, the chip consumed 6.9X less energy than its two-dimensional counterpart. Networks with more complex connections between nodes achieved even more dramatic reductions.[6]

For several years, researchers also have been investigating the use of ReRAM arrays themselves as arithmetic elements. Under Ohm's law, applying a voltage across a cell's programmed conductance performs a multiplication (I = V·G), while Kirchhoff's current law sums the resulting currents along each column of the array. Operating directly on the memory array is, in theory, one of the most efficient possible architectures. Unfortunately, resistive losses limit the practical array size. Rather than using a single large array, ReRAM-based computations will need to break problems into “tiles,” then combine the results.[7]
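
In software terms, each tile performs an analog matrix-vector product and the digital periphery accumulates the partial results. The idealized sketch below shows the tiling arithmetic only, ignoring ADC quantization, IR drop, and device variation.

```python
import numpy as np

# Idealized model of tiled crossbar matrix-vector multiplication.
# Within a tile, each cell multiplies its conductance by the applied voltage
# (Ohm's law), and the column currents sum per Kirchhoff's current law.
# Digital logic then accumulates the partial sums across tiles.

def crossbar_mvm(G_tile: np.ndarray, v_tile: np.ndarray) -> np.ndarray:
    # Column currents: I_j = sum_i G_ij * V_i
    return G_tile.T @ v_tile

def tiled_mvm(G: np.ndarray, v: np.ndarray, tile: int = 256) -> np.ndarray:
    rows, cols = G.shape
    out = np.zeros(cols)
    for r in range(0, rows, tile):            # split the large matrix into
        for c in range(0, cols, tile):        # crossbar-sized tiles
            out[c:c+tile] += crossbar_mvm(G[r:r+tile, c:c+tile], v[r:r+tile])
    return out

G = np.random.rand(1024, 1024)    # conductance matrix (weights)
v = np.random.rand(1024)          # input voltages (activations)
assert np.allclose(tiled_mvm(G, v), G.T @ v)   # tiling reproduces the dense result
print("tiled crossbar MVM matches the dense result")
```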

Memory vendors like Samsung and SK hynix have been showing compute-in-memory concepts at conferences like Hot Chips for several years. As Dean pointed out, though, traditional data center metrics have devalued energy efficiency in favor of absolute performance. Such performance-first metrics are no longer sufficient in an increasingly power-constrained environment. If AI applications are to continue to grow at current rates, designers must prioritize new power-efficient architectures.

References

  1. J. Dean and A. Vahdat, “Exciting Directions for ML Models and the Implications for Computing Hardware,” 2023 IEEE Hot Chips 35 Symposium (HCS), Palo Alto, CA, USA, 2023, pp. 1-87, doi: 10.1109/HCS59251.2023.10254704.
  2. L. Su and S. Naffziger, “1.1 Innovation For the Next Decade of Compute Efficiency,” 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2023, pp. 8-12, doi: 10.1109/ISSCC42615.2023.10067810.
  3. J. H. Kim et al., “Samsung PIM/PNM for Transformer Based AI: Energy Efficiency on PIM/PNM Cluster,” 2023 IEEE Hot Chips 35 Symposium (HCS), Palo Alto, CA, USA, 2023, pp. 1-31, doi: 10.1109/HCS59251.2023.10254711.
  4. D. Reis, et al., “Ferroelectric FET Configurable Memory Arrays and Their Applications,” 2022 International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 2022, pp. 21.5.1-21.5.4, doi: 10.1109/IEDM45625.2022.10019490.
  5. Y. Ju, et al., “A General-Purpose Compute-in-Memory Processor Combining CPU and Deep Learning with Elevated CPU Efficiency and Enhanced Data Locality,” 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Kyoto, Japan, 2023, pp. 1-2, doi: 10.23919/VLSITechnologyandCir57934.2023.10185311.
  6. Y. Du et al., “Monolithic 3D Integration of FeFET, Hybrid CMOS Logic and Analog RRAM Array for Energy-Efficient Reconfigurable Computing-In-Memory Architecture,” 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Kyoto, Japan, 2023, pp. 1-2, doi: 10.23919/VLSITechnologyandCir57934.2023.10185221.
  7. G. W. Burr et al., “Phase Change Memory-based Hardware Accelerators for Deep Neural Networks (invited),” 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Kyoto, Japan, 2023, pp. 1-2, doi: 10.23919/VLSITechnologyandCir57934.2023.10185411.

Further Reading
3D In-Memory Compute Making Progress
Researchers at the VLSI Symposium look to indium oxides for BEOL device integration.
Goals Of Going Green
Net zero goals target energy, emissions, water, and factory efficiencies.


