Solving compute-in-memory’s limitations requires new approaches and dimensions.
Compute-in-memory (CIM) is gaining attention due to its efficiency in limiting the movement of massive volumes of data, but it’s not perfect.
CIM modules can help reduce the cost of computation for AI workloads, and they can learn from the highly efficient approaches taken by biological brains. When it comes to versatility, scalability, and accuracy, however, significant tradeoffs are required.
Most current machine learning models are designed around specific tasks. Image recognition models are different from large language models or game playing models. Moreover, artificial intelligence software is evolving rapidly, with many competing algorithms and frequent improvements.
The more flexible the underlying hardware is, the more it is able to accommodate different algorithms and tasks. However, as models get larger, scalability becomes a critical consideration, as evidenced by a recurring question asked at the recent IEEE Electron Device Meeting: “How does this scale?”
While both programmable logic and biological brains are flexible — though in very different ways — CIM modules tend to sacrifice versatility for efficiency. Conceptually, the simplest CIM architecture is a ReRAM crossbar array. Programmable resistors located at the intersections of a grid store a matrix of weights. The signal to be analyzed, a vector, is represented by the voltages applied to the array. Ohm's Law yields a current at each cell, and Kirchhoff's Current Law sums those currents along each line of the grid.
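In other words, the analog array computes a matrix-vector product for free: each cell contributes I = G·V (Ohm's Law), and currents sharing a bit line add (Kirchhoff's Current Law). A minimal numerical sketch of the idea, with illustrative conductance and voltage values:

```python
import numpy as np

# Weights stored as conductances G (siemens), one per crosspoint.
G = np.array([[1e-6, 2e-6, 3e-6],
              [4e-6, 5e-6, 6e-6]])   # 2 bit lines x 3 word lines

# Input vector encoded as word-line voltages (volts).
V = np.array([0.2, 0.1, 0.3])

# Ohm's Law per cell, Kirchhoff's Current Law per bit line:
# the accumulated bit-line currents are exactly the product G @ V.
I = G @ V
print(I)  # one summed current per bit line
```

The entire multiply-accumulate happens in the analog domain in a single step, which is the source of CIM's efficiency, and also of its accuracy problems, as discussed below.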
Scalable and configurable ReRAM designs
Within the accuracy limits of ReRAM devices this structure works, but it's not very efficient and it doesn't scale well. Resistive losses limit the array size, while the total accumulated current can be large. Peripheral circuits and analog-to-digital converters need to be sized accordingly. In addition, this type of design is not very versatile. It's ideal for a fully connected layer of a neural network, but sparse computations will leave large sections sitting idle.
Successful scaling of crossbar arrays typically involves modular tiles, with additional logic to combine the results from individual modules. In a presentation at IEDM, Yibei Zhang of Tsinghua University described the integration of an HfO2 analog ReRAM crossbar array with a storage buffer based on carbon nanotube FETs.[1]
Conventional silicon logic sends signals to the ReRAM array, which stores the results in a digital 1T1R array combining CNTFETs with Ta2O5-based ReRAMs. In this design, the underlying silicon logic layer facilitates reconfiguration of the network for specific problems. The integrated memory buffer keeps the current requirements for each component tile moderate.
Fig. 1: 3-layer stack combining Si CMOS logic, ReRAM-based CIM, and a CNT-ReRAM storage buffer. (a) schematic showing the three functional layers; (b) cross-sectional TEM images of the device. Source: IEDM and the author
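The tiling idea generalizes beyond any one device stack: a large weight matrix is split into crossbar-sized blocks, each tile computes a partial product, and peripheral logic accumulates the partials. A schematic sketch of that accumulation (tile size and values are arbitrary, not taken from the paper):

```python
import numpy as np

def tiled_matvec(W, x, tile=2):
    """Split W into tile x tile crossbar-sized blocks, let each block
    compute its partial matrix-vector product, and accumulate the
    partials, as the peripheral logic of a tiled CIM design would."""
    rows, cols = W.shape
    y = np.zeros(rows)
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            y[r:r+tile] += W[r:r+tile, c:c+tile] @ x[c:c+tile]
    return y

W = np.arange(16.0).reshape(4, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(tiled_matvec(W, x))  # matches W @ x
```

Because each tile only ever sums currents across its own small block, the per-tile current, and therefore the ADC and driver sizing, stays moderate regardless of how large the overall matrix grows.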
Jangsaeng Kim and colleagues at Samsung specifically considered the problem of filter striding.[2] Filter striding “walks” a computation window across a larger matrix, such as an image, a few nodes at a time. Cells outside the window are left unused. In grid arrays, unused cells represent wasted silicon area.
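To see why striding wastes crosspoints, consider unrolling each window position onto a row of a crossbar: only the weights inside the current window are populated, and every other cell in that row idles. A toy illustration (sizes and values are arbitrary):

```python
import numpy as np

image = np.arange(16.0).reshape(4, 4)   # toy 4x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])        # toy 2x2 filter
stride = 2

# "Walk" the window across the image, one output per position.
# Valid window origins run from 0 to image_size - kernel_size (= 3 here).
out = np.array([[(image[r:r+2, c:c+2] * kernel).sum()
                 for c in range(0, 3, stride)]
                for r in range(0, 3, stride)])
print(out)

# If each window is unrolled onto a row of a 16-wide crossbar,
# only 4 of the 16 cells in that row hold weights; the rest sit idle.
active_fraction = kernel.size / image.size
print(active_fraction)
```

With realistic image and filter sizes the active fraction is far smaller still, which is the area-efficiency problem the Samsung design attacks.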
The Samsung group addressed the problem with an array composed of 3D AND flash cells. In these cells, two channels are stacked vertically, with word lines and bit lines on either side of a central pillar. To calculate the convolution of two input channels, this network uses both the upper and lower cells of the array. Relative to conventional crossbar arrays, this design reduces the total number of cells, the proportion of off cells, and the line resistance.
Fig. 2: Two-channel 3D AND flash cell. (a) schematic; (b) top; and (c) cross-section. Source: IEDM and the author
Though ReRAM is frequently used as a general term, it encompasses a variety of potential devices. Wenxuan Sun and colleagues at the Chinese Academy of Sciences used three different ReRAM types to achieve more flexible processing.[3] The first two layers of their structure incorporate dynamic memristors, based on hafnium oxide with, respectively, TiN and ruthenium word lines. These layers respond dynamically on different time scales, and are optimized for processing time-dependent data such as sound. The third layer, using tungsten word lines, exhibits analog switching behavior and facilitates image processing. According to the authors, the full structure can support multimodal video recognition with high accuracy and low energy consumption.
Accurate results from inaccurate devices
Still, the challenge looming over all of these efforts is accuracy. Floating-point digital logic establishes a benchmark against which all other approaches are judged. But even in many digital designs, less-precise weights can give acceptable performance at lower cost. When the number of categories is limited, low-precision weights work well. With more open-ended tasks, though, low-precision weights may limit the model’s ability to differentiate between similar inputs.[4] Put simply, the accuracy must be appropriate to the application, and floating-point precision is not an absolute requirement. That’s good news for emerging memory technologies. ReRAM, phase change memories, and FeRAM all have non-idealities that make them less precise than the familiar SRAMs and DRAMs.
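The precision tradeoff is straightforward to quantify: quantize a weight matrix to a given number of levels and compare the outputs against the full-precision result. A sketch with random weights and uniform quantization (a real evaluation would measure task accuracy rather than raw output error):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))   # toy full-precision weights
x = rng.standard_normal(8)

def quantize(W, levels):
    """Uniformly quantize W to a fixed number of levels over its range."""
    lo, hi = W.min(), W.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((W - lo) / step) * step

errors = {}
for levels in (2, 4, 16, 256):
    errors[levels] = np.abs(quantize(W, levels) @ x - W @ x).max()
    print(levels, errors[levels])  # error shrinks as precision grows
```

Whether the error at, say, 16 levels is acceptable depends entirely on the task, which is the paper's point: the hardware only needs to be as precise as the application demands.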
ReRAM devices are inherently stochastic. Depending on the specific material used, the switching mechanism involves either the reduction and subsequent oxidation of a metal layer, or the movement of oxygen vacancies to and from a conductive filament. As Elisa Vianello, CEA-Leti Embedded AI program director, explained in an interview, the results of a single programming pulse lie within a probability distribution. The actual stored value can drift over time. The resistance read from the device is actually the sum of the device resistance and the resistances of the wires leading to it. In a large array, this second contribution varies, depending on position. The ADC module used to evaluate the result will have limited resolution. But the susceptibility of analog computations to noise and other errors is part of the reason why digital logic became the powerhouse that it is today.
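The standard workaround for stochastic writes is a program-and-verify loop: pulse, read back, and repeat until the cell lands within tolerance. A schematic model (the pulse response and noise magnitude are invented for illustration, not measured device behavior):

```python
import numpy as np

rng = np.random.default_rng(1)

def program_pulse(g_target, g_now):
    """Model one programming pulse: the device moves toward the target,
    but the result lands somewhere inside a probability distribution,
    here a simple Gaussian around the intended move."""
    move = 0.5 * (g_target - g_now)
    return g_now + move + rng.normal(0.0, 0.02)

def write_verify(g_target, tol=0.05, max_pulses=50):
    """Pulse and read back until the stored value is within tolerance.
    Each verify step costs a read, so precision is paid for in time
    and energy."""
    g = 0.0
    for pulses in range(1, max_pulses + 1):
        g = program_pulse(g_target, g)
        if abs(g - g_target) <= tol:
            return g, pulses
    return g, max_pulses

g, pulses = write_verify(1.0)
print(g, pulses)
```

Tightening the tolerance drives up the pulse count, which is one reason analog precision is so expensive to buy back from a stochastic device.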
Some designs, such as Bayesian networks, seek to exploit the inherent randomness of ReRAMs.[5] In more conventional networks, designers seeking to take advantage of the high density and low static power requirements of ReRAMs need ways to work around their limitations.
First, designers need to consider exactly how they want to store their weights. Soumil Jain and colleagues at UC San Diego analyzed several potential voltage-sensing architectures for ReRAM crossbar arrays.[6] Binary on/off values are much less susceptible to conductance shifts than multilevel analog values. Monitoring the difference between pairs of devices can be more accurate than the absolute value of a single device.
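One common differential encoding (a generic sketch, not the specific UCSD architecture) stores each signed weight as the difference of two conductances, so a conductance shift common to both devices cancels at readout:

```python
import numpy as np

w = np.array([0.3, -0.7, 0.5])            # signed weights
g_pos = np.clip(w, 0, None) + 1.0         # conductance pair per weight
g_neg = np.clip(-w, 0, None) + 1.0        # (offset keeps both positive)

drift = 0.2                               # common-mode conductance shift
w_single = (g_pos + drift) - 1.0          # single-device readout drifts
w_diff = (g_pos + drift) - (g_neg + drift)  # differential readout cancels it

print(w_diff)   # recovers the signed weights exactly
```

The price is two devices per weight, a density-for-accuracy trade that recurs throughout CIM design.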
Second, Regina Dittmann of the Peter Grünberg Institut explained at IEDM that understanding the switching kinetics of ReRAMs can lead to designs and programming schemes that are less prone to conductance drift. For example, as the resistance through a device changes, so does the amount of current and therefore the temperature of the device. While researchers at UCSB exploited this thermal sensitivity as an additional degree of freedom,[7] it makes establishing precise resistance values difficult. Large programming pulses can lead to thermal runaway and uncontrolled resistance changes. If analog values are needed, designers might choose devices with more stable switching chemistries, or might use a conventional resistor in series to stabilize the current flow. Dittmann and colleagues compared the kinetics of HfO2 and SrTiO3-based ReRAMs, finding that the SrTiO3-based devices switched more gradually and were less prone to thermal runaway.[8]
Though much of the industry’s focus has been on ReRAM devices, phase change and ferroelectric memories are drawing attention, as well. They, too, present designers with a variety of non-ideal behaviors. Phase change memories (PCM), for example, manipulate the ratio of an amorphous, high-resistance state to a crystalline, low-resistance state. Typically, the crystalline material occupies most of the space in the physical device, with an amorphous bubble near one of the electrodes. Controlling the phase change process gives access to a range of resistance states, rather than a binary on/off value. Over time, though, the amorphous region relaxes into a lower-energy glass state, causing the resistance to drift.
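Resistance drift in the amorphous phase is commonly described by an empirical power law, R(t) = R0·(t/t0)^ν, where the drift exponent ν is roughly 0.1 for amorphous material and near zero for crystalline paths. A sketch of the model (parameter values are typical textbook figures, not from the IBM paper):

```python
def pcm_resistance(t, r0=1e5, t0=1.0, nu=0.1):
    """Empirical power-law drift model for a PCM cell's amorphous phase:
    R(t) = R0 * (t / t0)**nu. The stored resistance creeps upward as
    the amorphous glass relaxes toward lower-energy states."""
    return r0 * (t / t0) ** nu

for t in (1.0, 1e3, 1e6):        # seconds after programming
    print(t, pcm_resistance(t))  # resistance rises steadily over time
```

A multilevel cell whose read margins were set at programming time will therefore see its levels smear together over days to months, which is exactly the failure mode the projected-memory designs below are meant to suppress.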
Ghazi Syed and colleagues at IBM Research-Zurich proposed the insertion of a conductive carbon layer between the bottom electrode and the PCM material.[9] This layer effectively short-circuits the amorphous layer so that current passes only through the crystalline material. Similar designs using metal layers have been demonstrated before, but the carbon layer is more stable and less prone to react with the phase change material.
Fig. 3: Phase change memory element with conductive layer bypassing amorphous material. Source: IEDM and the author
While ReRAM devices have good read endurance, their write endurance is relatively low. They are good candidates for inference applications, where pre-existing weights are used to analyze a data signal. They are less useful for training applications, which require many updates as the model converges to a stable solution. Unfortunately, weights calculated by a digital network can give inaccurate results if simply transferred to a ReRAM array. Both the training algorithm and the programming scheme need to account for the behavior of the actual hardware. In fact, this is one reason why simulation results can be overly optimistic: the virtual array used for the simulation may not capture the behavior of the real devices.[10]
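The transfer gap is easy to reproduce: perturb ideally trained weights with device-like write noise and the array no longer computes what was trained. Hardware-aware flows inject such noise during training, or use write-verify during transfer to tighten it. A toy illustration (Gaussian write noise of arbitrary magnitude; verify modeled crudely as clipping the residual error):

```python
import numpy as np

rng = np.random.default_rng(2)
W_ideal = rng.standard_normal((4, 4))   # weights from digital training
x = rng.standard_normal(4)

# Naive transfer: each programmed conductance lands off-target.
write_noise = rng.normal(0.0, 0.1, W_ideal.shape)
W_device = W_ideal + write_noise
error_naive = np.abs(W_device @ x - W_ideal @ x).max()
print(error_naive)   # nonzero: outputs have shifted

# Write-verify transfer: reprogram each cell until it is within a
# tight tolerance, modeled here by clipping the residual write error.
W_verified = W_ideal + np.clip(write_noise, -0.02, 0.02)
error_verified = np.abs(W_verified @ x - W_ideal @ x).max()
print(error_verified)
```

A simulation that evaluates `W_ideal` directly, without any `write_noise` term, sees neither error, which is how virtual arrays end up overstating real hardware accuracy.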
Meanwhile, FeRAM devices have very good write endurance, well beyond what training applications require. Unfortunately, reading data from them is destructive — the data needs to be rewritten after each read step, making them poorly suited to read-intensive inference applications. Researchers at CEA-Leti sought to use the two types of devices to complement each other, with the strengths of one offsetting the weaknesses of the other. Michele Martemucci explained that they fabricated a TiN/Ti/Si:HfO2/TiN stack in the BEOL of a 130nm CMOS process, positioned between the fourth and fifth metal layers. Depending on subsequent processing, individual elements could become either ferroelectric capacitors (FeRAMs) or oxygen vacancy-based ReRAMs. The end result, a hybrid FeRAM/ReRAM synaptic circuit, used the high-write-endurance FeRAM elements for training, then transferred the resulting weights to the ReRAM elements.[11]
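The division of labor can be sketched as a two-phase flow: the many weight updates of training land on the high-endurance elements, then a single transfer programs the low-endurance elements for read-heavy inference. A schematic model of that flow (not CEA-Leti's circuit; the endurance figure and update counts are illustrative):

```python
import numpy as np

class HybridSynapse:
    """Toy model of the training/inference split: FeRAM-like storage
    absorbs the many writes of training; ReRAM-like storage takes one
    transfer write, then serves unlimited non-destructive reads."""
    def __init__(self, n, reram_endurance=1_000):
        self.feram = np.zeros(n)   # high write endurance, destructive read
        self.reram = np.zeros(n)   # low write endurance, cheap reads
        self.reram_writes = 0
        self.reram_endurance = reram_endurance

    def train_step(self, grad, lr=0.1):
        self.feram -= lr * grad            # every update hits FeRAM only

    def transfer(self):
        assert self.reram_writes < self.reram_endurance
        self.reram = self.feram.copy()     # one bulk write per trained model
        self.reram_writes += 1

    def infer(self, x):
        return self.reram @ x              # reads never touch the FeRAM

syn = HybridSynapse(3)
for _ in range(10_000):                    # far beyond ReRAM write endurance
    syn.train_step(np.array([0.01, -0.02, 0.0]))
syn.transfer()
print(syn.infer(np.array([1.0, 1.0, 1.0])), syn.reram_writes)
```

The ReRAM side sees exactly one write per trained model, so its limited write endurance is never the bottleneck, while the FeRAM side is never read during inference, so its destructive reads never come into play.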
Conclusion
Taken as a group, the results presented at IEDM make it clear that the industry sees 3D integration of CIM arrays as a potential answer to many CIM challenges. Integrating these arrays with conventional CMOS logic maximizes their data bandwidth for computational efficiency, while at the same time giving them the versatility that programmable logic can provide.
Meanwhile, a variety of novel architectures aim to maximize the accuracy that these emerging devices can achieve. CIM architectures are certainly not the only reason for the industry’s interest in 3D integration, but they may be one of the problems that only 3D integration can solve.
References
[1] Yibei Zhang, et al., “3D Stackable CNTFET/RRAM 1T1R Array with CNT CMOS Peripheral Circuits as BEOL Buffer Macro for Monolithic 3D Integration with Analog RRAM-based Computing-In-Memory,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 23.2.
[2] Jangsaeng Kim, et al., “First Demonstration of Innovative 3D AND-Type Fully-Parallel Convolution Block with Ultra-High Area-and Energy-Efficiency,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 23.4.
[3] Wenxuan Sun, et al., “High Area Efficiency (6 TOPS/mm2) Multimodal Neuromorphic Computing System Implemented by 3D Multifunctional RRAM Array,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 5.5.
[4] C. Frenkel, D. Bol and G. Indiveri, “Bottom-Up and Top-Down Approaches for the Design of Neuromorphic Processing Systems: Tradeoffs and Synergies Between Natural and Artificial Intelligence,” in Proceedings of the IEEE, vol. 111, no. 6, pp. 623-652, June 2023, doi: 10.1109/JPROC.2023.3273520.
[5] D. Bonnet, et al., “Bringing uncertainty quantification to the extreme-edge with memristor-based Bayesian neural networks.” Nat Commun 14, 7530 (2023). https://doi.org/10.1038/s41467-023-43317-9
[6] S. Jain et al., “A Versatile and Efficient Neuromorphic Platform for Compute-in-Memory with Selector-less Memristive Crossbars,” 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 2023, pp. 1-4, doi: 10.1109/ISCAS46773.2023.10181867.
[7] T. Bhattacharya, et al., “ReRAM-Based NeoHebbian Synapses for Faster Training-Time-to-Accuracy Neuromorphic Hardware,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 5.2.
[8] Regina Dittmann, et al., “Engineering the kinetics of redox-based memristive devices for neuromorphic computing,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 5.3.
[9] Ghazi Syed, et al., “In-Memory Compute Chips with Carbon-based Projected Phase-Change Memory Devices,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 23.1.
[10] E. P. -B. Quesada et al., “Experimental Assessment of Multilevel RRAM-Based Vector-Matrix Multiplication Operations for In-Memory Computing,” in IEEE Transactions on Electron Devices, vol. 70, no. 4, pp. 2009-2014, April 2023, doi: 10.1109/TED.2023.3244509.
[11] Michele Martemucci et al., “Hybrid FeRAM/RRAM Synaptic Circuit Enabling On-Chip Inference and Learning at the Edge,” IEEE Electron Device Meeting, San Francisco, December 2023, paper 23.3.
Related Reading
3D In-Memory Compute Making Progress
Researchers at the VLSI Symposium look to indium oxides for BEOL device integration.
Increasing AI Energy Efficiency With Compute In Memory
How to process zettascale workloads and stay within a fixed power budget.
Modeling Compute In Memory With Biological Efficiency
Generative AI forces chipmakers to use compute resources more intelligently.