Scaling Up Compute-In-Memory Accelerators

New research points to progress and problems in a post-von Neumann world.


Researchers are zeroing in on new architectures to boost performance by limiting the movement of data in a device, but this is proving to be much harder than it appears.

The argument for memory-based computation is familiar by now. Many important computational workloads involve repetitive operations on large datasets. Moving data from memory to the processing unit and back — the so-called von Neumann bottleneck — is a limiting factor for both performance and power consumption.

Compute-in-memory architectures seek to remove this bottleneck by integrating memory and computation into a single circuit block. But exactly what form that block might take is not yet clear. The most immediate need — and perhaps the simplest compute-in-memory concept — is for an accelerator for the many multiply-and-accumulate matrix operations used in neural network calculations.

A wide variety of other concepts have been proposed, according to Abu Sebastian, a principal research staff member at IBM Research-Zurich. In a tutorial at December’s IEEE Electron Device Meeting, he said that on one hand, “near memory” computation using the peripheral circuits around a memory array can potentially offer significant benefits without the radical shift that computation directly involving memory elements represents. Because the actual computation uses conventional logic circuits, the near memory approach is more forgiving of endurance and other limitations of novel non-volatile memory devices. On the other hand, matrix multiplications are not the only computations that might take place in a memory array. Both charge-based and resistance-based memories potentially can be used for general-purpose logic, too.

Charge-based memories like DRAM, SRAM, and flash store information in the form of electronic charge captured in the structure of the device. These are three-terminal devices incorporating at least one transistor.

Resistance-based memories like PCRAM and ReRAM, in contrast, use signal pulses to switch between a high and low resistance state. In PCRAM, the switch is due to a change in crystalline structure, while ReRAM depends on the growth of a conductive filament. While including a transistor in the cell may improve performance or accuracy, it is not necessary. These are two-terminal devices and lend themselves to very high density “crossbar” architectures.

Crossbar architectures: easy to visualize, harder to build
Conceptually, implementing a matrix multiplication accelerator in a crossbar array is simple. First, store a weight matrix in the memory array; non-volatile memory elements are especially attractive for inference applications, where this step only needs to happen once. Then, use the input lines to apply a voltage representing the data vector, and read the result as a current summed according to Kirchhoff's Current Law.
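In its idealized form, that is just Ohm's law plus current summation: each weight becomes a conductance, the input vector becomes a set of line voltages, and each output line carries the sum of the resulting currents. The NumPy sketch below shows only this idealized behavior, with arbitrary placeholder values for the conductances and voltages; a real array must also contend with wire resistance, device variability, and read noise.

import numpy as np

# Idealized crossbar: weights stored as conductances G (siemens),
# the input vector applied as line voltages V (volts).
G = np.array([[1e-6, 5e-6, 2e-6],
              [4e-6, 1e-6, 3e-6]])   # 2 output lines x 3 input lines
V = np.array([0.2, 0.1, 0.3])        # data vector encoded as voltages

# Ohm's law gives each cell's current (I = G * V); Kirchhoff's Current Law
# sums those currents along each output line, so one analog read yields
# the entire matrix-vector product.
I_out = G @ V
print(I_out)                         # output currents, proportional to W @ x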

Unfortunately, several obstacles stand in the way of implementing such a design at a commercially interesting scale. First, there are the limitations of the devices themselves. Both the underlying physics of the devices and the immaturity of manufacturing processes mean that ReRAMs, PCRAMs, and other emerging non-volatile memories generally have highly variable SET and RESET characteristics. If the SET and RESET values are too close together, the programmable window may be too small. If the values are not consistent across the array, or drift over time, a given voltage pulse may write some devices but not others. These limitations are especially important in training applications, where weights may need to be adjusted frequently. Charge-based memories remain attractive for these applications.
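The consequence of that device-level spread for an accelerator can be estimated by perturbing the stored conductances and watching the matrix-vector product degrade. A rough sketch, in which the log-normal spread is an arbitrary placeholder rather than a measured device characteristic:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))        # ideal weight matrix
x = rng.normal(size=128)              # input vector

# Model programming variability as multiplicative log-normal noise per cell.
sigma = 0.1                           # placeholder spread, not a measured value
W_prog = W * rng.lognormal(mean=0.0, sigma=sigma, size=W.shape)

ideal = W @ x
actual = W_prog @ x
rel_err = np.linalg.norm(actual - ideal) / np.linalg.norm(ideal)
print(f"relative MAC error from variability: {rel_err:.3f}")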

For example, Shifan Gao and colleagues at Grace Semiconductor Manufacturing introduced a device they called a programmable linear random-access memory (PLRAM). They modified a standard floating gate flash memory cell by decoupling the select gate from the floating gate with a tunneling oxide optimized for precise program/erase behavior. Before programming, they measured the conductance of each cell to identify the required control voltage. In this way, they were able to store up to seven analog values in a single cell.
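That measure-then-program approach amounts to a closed program-verify loop: read the cell, compare its conductance to the target level, and pulse until it lands within tolerance. The sketch below illustrates the idea with a purely hypothetical cell model and pulse response; it is not the PLRAM programming scheme itself.

import numpy as np

rng = np.random.default_rng(1)

class ToyCell:
    """Hypothetical analog cell: each pulse nudges conductance, with some randomness."""
    def __init__(self):
        self.g = 0.5
    def read(self):
        return self.g
    def pulse(self, v):
        self.g = np.clip(self.g + v * rng.uniform(0.8, 1.2), 0.0, 1.0)

def program_verify(cell, target, tol=0.02, v_step=0.05, max_pulses=100):
    # Measure, compare to the target level, and pulse until within tolerance.
    for _ in range(max_pulses):
        err = target - cell.read()
        if abs(err) <= tol:
            break
        cell.pulse(np.sign(err) * min(v_step, abs(err)))
    return cell.read()

levels = np.linspace(0.0, 1.0, 7)     # seven analog levels per cell
cell = ToyCell()
print(program_verify(cell, target=levels[5]))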

Several different proposed designs try to compensate for the limitations of ReRAM devices. Researchers at Taiwan’s National Tsing Hua University and TSMC combined a conventional finFET with a pair of complementary ReRAM cells. In this “twin-bit” pair, one ReRAM device is always in the high resistance state and the other in the low resistance state. The value of the cell depends on the difference between the two. This approach reduces the impact of ReRAM variability by comparing the cells to each other rather than to some arbitrary reference value.
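The benefit of the differential arrangement is easy to see in a toy model: a disturbance that shifts both devices in the same direction leaves their ordering, and therefore the stored bit, unchanged. The resistance values below are placeholders, not the published device parameters.

R_HIGH, R_LOW = 1e6, 1e4              # placeholder HRS/LRS resistances (ohms)

def twin_bit_read(stored_bit, common_shift):
    # One device of the pair is always HRS and the other LRS;
    # the stored bit determines which is which.
    r_a = (R_LOW if stored_bit else R_HIGH) * common_shift
    r_b = (R_HIGH if stored_bit else R_LOW) * common_shift
    # Differential read: compare the two cells to each other,
    # not to an arbitrary external reference.
    return 1 if (1.0 / r_a) > (1.0 / r_b) else 0

# A 2x common-mode shift in either direction does not flip the readout.
for shift in (0.5, 1.0, 2.0):
    print(shift, twin_bit_read(1, shift), twin_bit_read(0, shift))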

While resistance-based memory devices continue to improve, the data arrays used for commercial deep learning applications are extremely large, with weight matrices containing thousands of elements. Tien-Ju Yang and Vivienne Sze of MIT noted that operations within the array itself consume relatively little power, but activating and reading from a large crossbar array can require prohibitively high output currents. Partitioning the data into smaller pieces to keep currents reasonable adds computational overhead and reduces the advantage the accelerator provides.
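In practice, that partitioning is ordinary tiling: the weight matrix is split across sub-arrays small enough to read at acceptable current, and the partial sums must then be digitized and accumulated, which is where the extra overhead comes from. A schematic sketch, with an arbitrary tile width:

import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(256, 1024))      # large weight matrix
x = rng.normal(size=1024)
TILE = 128                            # columns per physical sub-array (arbitrary)

# Each tile is "read" as one analog matrix-vector product; the partial sums
# must then be converted and accumulated digitally, adding overhead.
y = np.zeros(W.shape[0])
for start in range(0, W.shape[1], TILE):
    y += W[:, start:start + TILE] @ x[start:start + TILE]

assert np.allclose(y, W @ x)          # tiled result matches the full product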

Ferroelectrics offer a third way
As an alternative, several groups are investigating ferroelectric tunnel junctions (FTJs). In an FTJ, the desired value is not stored in a capacitor or a resistive element, but in the polarization of the ferroelectric layer. Carriers can tunnel through the layer more or less easily depending on the polarization. A positive or negative voltage sweep switches the device, as opposed to the series of pulses required to "write" either PCRAM or ReRAM. With non-destructive read currents in the nanoamp range, FTJs may offer a low-current alternative for compute-in-memory arrays.

Researchers at Kioxia and Toshiba, for example, sandwiched a hafnium zirconium oxide (HZO) layer between electrodes, systematically investigating how the layer's thickness and composition affect its polarization behavior. They found that thinner ferroelectric layers reduced device variability, while decreasing the zirconium concentration increased the operating voltage margin. This work also demonstrated that compute-in-memory paradigms are useful for applications beyond matrix calculations. They used FTJ devices to implement a reinforcement learning algorithm: combinations of cells that give good results for the specified task (balancing a pole on a cart, in this case) are "rewarded" with increased conductance.
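Stripped of the device physics, the learning rule is simply "increase the conductance of whatever produced a good outcome." The sketch below applies that rule to a trivial bandit-style task; the task, learning rate, and selection scheme are illustrative stand-ins, not the cart-pole system reported by the Kioxia/Toshiba team.

import numpy as np

rng = np.random.default_rng(4)
true_payout = np.array([0.2, 0.5, 0.8])   # hidden reward probability per action
g = np.ones(3)                            # cell conductances, one per action

for _ in range(2000):
    # Choose an action with probability proportional to conductance.
    p = g / g.sum()
    a = rng.choice(3, p=p)
    reward = rng.random() < true_payout[a]
    if reward:
        g[a] += 0.05                      # "reward" the cell with more conductance

print(g / g.sum())                        # probability mass shifts to the best action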

As M. Si and colleagues at Purdue University and the Georgia Institute of Technology pointed out, though, FTJs are not ideal, either. Very thin tunneling layers can depolarize easily, while thicker ones may not carry enough current. Instead, they proposed ferroelectric semiconductor junctions, based on α-In2Se3, a recently discovered ferroelectric semiconductor. Switching in these devices is based on changes in the Schottky barrier height, and conductance scales with device area. While the material has been integrated into FeFETs, where it offers a high on-off ratio, this group also demonstrated two-terminal devices.

While two-terminal devices can maximize density, more complex structures can use transistors to help detect and amplify weak, polarization-dependent currents. Qing Luo and colleagues at the Chinese Academy of Sciences combined a single transistor with two HZO-based field programmable diodes to create a complementary cell. One of the diodes is connected with the negative terminal forward, the other with the positive terminal forward; a voltage sweep can reverse either or both.

Looking at the future: from accelerators to neurons
The applications discussed here are relatively simple — matrix multiplications, simple reinforcement learning algorithms. But as CEA-Leti research engineer Alexandre Valentian pointed out, biological brains exhibit much more complex behavior involving not simply storage of values in synapses, but interactions between neurons. In biological brains, and in many real-world applications, signals are dynamic. They describe not just an instant, but the evolution of a situation over time. To model this behavior, researchers are turning to exotic devices like Mott-FETs and graphene-ferroelectric transistors, using networks that can strengthen and weaken their own connections.
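A leaky integrate-and-fire neuron is the simplest example of the kind of time-dependent behavior these devices are meant to capture: its internal state integrates an input signal that evolves over time, and it fires only when a threshold is crossed. The parameters below are arbitrary, chosen only to make the dynamics visible.

import numpy as np

dt, tau, v_th = 1e-3, 20e-3, 1.0          # time step, membrane time constant, threshold
v = 0.0
spikes = []

# Drive the neuron with a time-varying input rather than a single static value.
t = np.arange(0, 0.5, dt)
i_in = 1.2 + 0.5 * np.sin(2 * np.pi * 4 * t)

for i in i_in:
    v += dt / tau * (-v + i)              # leaky integration of the input
    if v >= v_th:                         # threshold crossing emits a spike
        spikes.append(True)
        v = 0.0                           # reset after spiking
    else:
        spikes.append(False)

print(sum(spikes), "spikes in", len(t), "time steps")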

3 comments

Daniel Payne says:

In your article there's one sentence that I take issue with: "Charge-based memories like DRAM, SRAM, and flash store a charge on a capacitor, generally under the control of at least one transistor." Yes, a DRAM does store charge on a capacitor and requires a refresh cycle; however, an SRAM uses cross-coupled inverters to store state and has no refresh cycle, while both DRAM and SRAM lose state when power turns off. Flash stores state in its second gate, and retains state even when the power is turned off. I started out my IC design career at Intel doing circuit-level simulation of DRAM, and I also did SRAM design, but I never did a FLASH design (although Google makes me appear smarter).

Katherine Derbyshire says:

You’re right. Thank you for calling it to my attention. My point was to differentiate between charge-based memories like DRAM and SRAM, and resistance-based memories like ReRAM, but I overgeneralized a bit in doing so. I’ll revise the article.

Ernest Demaray says:

Katherine,
Active neural network matrix and vector machine learning will be optical. Have you done a story on Shaunhui Fan at Stanford. He has a fairly recent article where he reports FDFT simulation of optical photonic ANN computation via a provisioned MachZander array. Its pico seconds, much less than milliwats depending on the insertion loss of the waveguide fabric (not SOI- LOL) which, with efficient fiber to die coupling, might be like -20 dBm for the equivalent of 1 face / all Google faces not days and megawatts with Si ever. But photonic ANN does require provioning the MZ array from a silicon layer below to load matrix values. Ive always thought reliance on Si based memory was the foot of clay for photonic ANN. But now, more recently engineering of Rb atom based optical memory could make the whole computation streem problem memory/process/answer memory optical. So there is another story there. The Rb atom stores the bit the same way it provides the atomic clock with the a benefit, i.e. the ntangled hyperfine state of the ground state when it emitts the hearlding photon after being written and relaxed. Two such cavities provide teleportation secure telemetry, multiple such can provide annealing. Right now the Rb cell is formed by 4 fiber facets but is labile to monolythic integration in planar waveguides on Si or silica. Ive always been sceptical about doping single atoms in anything except MIT Draper now has a working machine as I write so there goes that problem. The other thing is the memory lifetime is like 30 microseconds. But that is plenty of time with no decoherence, the whole reason actually. Think of the band width of an optical device full of cavity coupled coherent MZs! Si has a long way to go. No substitute to drive resistice loads, store energy… not a hope in ANN or a chip/wafer wide actual real time clock.
Cheers
Ernest
