Compute-In-Memory Accelerators Up-End Network Design Tradeoffs

The compute paradigm is shifting as more data needs to be processed more quickly.


An explosion in the amount of data, coupled with the negative impact on performance and power of moving that data, is rekindling interest in in-memory processing as an alternative to moving data back and forth between the memory and the processor.

Compute-in-memory (CIM) arrays based on conventional memory elements like DRAM and NAND flash, as well as emerging non-volatile memories like ReRAM and phase-change memory, are being considered as possible accelerators for neural network computations.

While there has been plenty of discussion about this over the years, implementations largely have been limited to device-oriented research projects, which typically use small, proof-of-concept networks. Designers are just beginning to incorporate CIM accelerator concepts into commercial-scale networks designed for complex problems. These larger networks highlight the differences between conventional CMOS accelerators and compute-in-memory approaches, and show how the tradeoffs among energy consumption, speed, and accuracy might change.

Both large and small neural networks are based on the same fundamental concepts. The data to be analyzed is broken into elements that can be distributed across an array of nodes — pixels for an image-recognition task, or parameters for a forecasting problem. The network consists of two or more layers of nodes, which can be connected to each other in a variety of different ways.

In a fully connected layer, every node in layer A connects to every node in layer B. In a convolutional layer, in contrast, a “filter” is defined that assigns a small portion of layer A to each node in layer B. In either case, each node in layer A multiplies its data element by a pre-determined weight. Then, each node in layer B receives the sum of these products across all the layer A nodes connected to it. (So, for instance, node B1 = w1·A1 + w2·A2 + … + wn·An, where wi is the weight on each connection.) Nodes in layer B, in turn, multiply these values by their assigned weights and pass the results to corresponding nodes in layer C, and so on for as many layers as are in the network.
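The weighted-sum step described above reduces to a matrix-vector product. A minimal sketch, with arbitrary layer sizes and randomly chosen weights standing in for trained values:

```python
import numpy as np

# Hypothetical sizes: layer A has 4 nodes, layer B has 3.
a = np.array([0.5, -1.0, 2.0, 0.25])              # data elements held by layer A
w = np.random.default_rng(0).normal(size=(3, 4))  # pre-determined weights

# Each layer B node receives the sum of weighted layer A values:
# B_j = w_j1*A_1 + w_j2*A_2 + ... + w_jn*A_n
b = w @ a
print(b.shape)  # → (3,)
```

Each entry of `b` is one node's accumulated input, ready to be weighted again and passed on to the next layer.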

The final result might assign the input data to one of several categories, or describe how a system would respond to a given set of input parameters. In supervised learning, the “correct” final result is known, and errors are used to “back-propagate” adjustments to the individual weights. In inference tasks, the weights do not change. The network simply reports a result.

A large network may have five or more layers, each with potentially hundreds or thousands of nodes. Thus, while the individual multiply-accumulate (MAC) steps — “multiply nodes in layer A times data, sum, and send to layer B” — are quite simple, they are repeated an enormous number of times. In a conventional architecture, each MAC step involves reading the relevant data and weights from memory to the processor, performing the calculation, and writing the result back to memory. This “von Neumann bottleneck” dominates the overall performance and energy consumption of neural network tasks.
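A back-of-the-envelope calculation shows why the bottleneck matters. The layer sizes and the 4-byte (FP32) operand size below are illustrative assumptions, not measurements:

```python
# Illustrative fully connected layer: 1,000 inputs feeding 1,000 outputs.
n_in, n_out = 1000, 1000
macs = n_in * n_out  # one multiply-accumulate per weight

# In a naive von Neumann execution, each MAC reads a weight and a data
# element from memory; assume 4 bytes each (FP32).
bytes_moved = macs * 2 * 4
print(macs)         # → 1000000
print(bytes_moved)  # → 8000000
```

A million MACs and roughly 8 MB of memory traffic for a single layer, repeated for every layer and every input, is what makes moving the computation into the memory array attractive.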

CIM accelerators seek to address this bottleneck. The details vary depending on the accelerator design, but the basic idea is that the weights for an entire layer are stored in the memory array. The data vectors are applied to all of the nodes at once, and the results are read from the output lines of the array. Many MAC steps take place simultaneously. The weights are stored semi-permanently, whether in conventional DRAMs or in non-volatile memory elements. Proponents of memristor-based arrays often cite their ability to achieve very high element density with a “crossbar” approach, placing memristor elements at the intersections of a mesh of word and bit lines.

Do CIM accelerators scale?
This approach is conceptually easy to understand. The memristor elements store analog weights as resistances. Data takes the form of voltages activating the input lines, and Kirchhoff’s Current Law gives the results as currents. Scaling this approach to commercially interesting networks is more complex, though. Qiwen Wang and colleagues at the University of Michigan noted that in practice the size of the array may be limited by the need to handle relatively large currents. At December’s IEEE International Electron Devices Meeting, they proposed a solution using tiled ReRAM arrays to reach the total number of nodes desired. The size of the individual tiles depends on device characteristics and the amount of output current the design can accommodate. The system sums partial products from each tile to produce the final result.
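The tiling idea can be sketched numerically. In the sketch below, the weight matrix is split into column tiles, each tile computes a partial matrix-vector product (standing in for the per-tile output currents), and the partial products are summed digitally, in the spirit of the Michigan group's scheme. The matrix dimensions and tile width are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.uniform(0, 1, size=(8, 16))  # conductances of the full array
x = rng.uniform(0, 1, size=16)             # input voltages

tile_width = 4                             # inputs handled per tile
result = np.zeros(8)
for start in range(0, 16, tile_width):
    # Each tile's output current stays bounded because it only sees
    # tile_width inputs; its partial product is accumulated digitally.
    tile = weights[:, start:start + tile_width]
    result += tile @ x[start:start + tile_width]

assert np.allclose(result, weights @ x)    # matches the untiled computation
```

Keeping each tile small bounds the worst-case current any single output line must carry, at the cost of extra digital accumulation.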

Researchers at Imec, on the other hand, proposed an array based on a modified DRAM cell with two IGZO (indium-gallium zinc oxide) transistors acting as current sources to represent the weights. A series of pulses, defined by the input data, activates the current sources to charge a capacitor associated with each node. The result, current x time = charge, can be captured by a standard analog-to-digital converter. A third group at the University of Minnesota argued that the process maturity of 3D NAND flash makes it a leading candidate. They offered what they claim is the first demonstration of an embedded NAND flash-based neuromorphic chip produced by a standard logic process.
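The charge-domain readout in the Imec scheme reduces to simple arithmetic: each weight is a programmed current, each input is a total pulse time, and the capacitor accumulates their product summed over cells. A toy version, with made-up current and timing values:

```python
# Hypothetical per-cell current sources (weights), in microamps.
currents_uA = [2.0, 0.5, 1.5]
# Input data encoded as total pulse durations, in microseconds.
pulses_us = [3, 1, 4]

# Q = I * t, summed across the cells sharing one capacitor (uA*us = pC).
charge_pC = sum(i * t for i, t in zip(currents_uA, pulses_us))
print(charge_pC)  # → 12.5
```

The accumulated 12.5 pC is what the analog-to-digital converter would digitize as that node's weighted sum.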

Whichever approach is used, a design using a CIM accelerator will balance energy consumption, critical area, and performance differently from a conventional CMOS design. While the MAC steps dominate overall performance in a conventional architecture, they are extremely efficient and fast in a CIM design. Instead, activation of the input lines and collection of the results are the limiting steps. Much of this computational work falls to peripheral analog/digital and digital/analog converters, which are much less area- and energy-efficient than the memory array.

The relative efficiency of MAC steps in CIM architectures influences other aspects of the network design, as well. For example, digital networks typically try to minimize the number of nodes in each network layer, in order to reduce the number of MAC steps. They then use multiple layers to achieve acceptable accuracy. In CIM accelerators, in contrast, the opposite approach appears to give better results. A larger layer allows more parallel operations and builds in more redundancy against the non-ideal behavior of devices like ReRAM. Shallower networks with fewer layers appear to accumulate fewer errors, giving more robust results. Similarly, increasing array utilization in CIM architectures reduces the number of activation steps and facilitates data reuse.

Though large and small networks, whether digital or analog, rely on the same fundamental algorithms, it’s clear that performance, latency, and energy consumption all depend on the underlying hardware. Rather than using CIM accelerators as drop-in replacements for conventional approaches, designers will need to evaluate the advantages and disadvantages of both through the lens of the specific problems they are trying to solve. Part two of this article looks at precision and noise-tolerance in light of the limitations of ReRAM devices.
