Generative AI forces chipmakers to use compute resources more intelligently.
The growing popularity of generative AI, which uses natural language to help users make sense of unstructured data, is forcing sweeping changes in how compute resources are designed and deployed.
In a panel discussion on artificial intelligence at last week’s IEEE International Electron Devices Meeting (IEDM), IBM’s Nicole Saulnier described it as a major breakthrough that should allow AI tools to assist human experts with a wide range of tasks. The big challenge is satisfying the computational requirements of these large language models (LLMs).
Stanford researcher Hugo Chen said in an IEDM presentation that GPT-4, one of the best-known LLMs, uses a staggering 1.76 trillion parameters and 120 network layers.[1] Halving the error rate can require 500X the computational resources, according to UCSB researcher T. Bhattacharya.[2] Such enormous calculations are challenging for even dedicated data centers, and are simply impossible for edge devices.
“Edge computing” includes a large category of devices, from electric vehicles to remote sensors. They are characterized by limited access to “cloud” resources and a need to minimize power consumption. Often, edge applications involve normally off devices waiting semi-quiescently for a voice command, a motion trigger, or some other sensor input.
The resource requirements of machine learning models are large due to the sheer size of the datasets involved, but their actual computations tend to be simple. Matrix-vector multiplication (MVM) accounts for as much as 90% of the computational load. In conventional von Neumann architectures, the movement of data between memory and computational elements is a major bottleneck. “Compute-in-memory” (CIM) architectures seek to operate on data directly in memory instead. A previous article looked at the macro scale view: how data centers, systems, and circuit modules can partition problems to use computing resources efficiently. This article and the next focus on the use of emerging non-volatile memory technologies to support CIM designs.
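To make the bottleneck concrete, the sketch below shows the operation a CIM array is built around. In an analog crossbar, weights stored as cell conductances multiply applied input voltages, and each row’s output current sums the products, so the matrix-vector multiply happens where the data already resides. This is a minimal NumPy illustration of the principle only; the conductance and voltage ranges are arbitrary, not taken from any particular device.

```python
# Conceptual sketch of why CIM targets MVM (not any specific product).
# Weights are stored as cell conductances G[i][j]; applying input
# voltages V[j] makes each row's output current the dot product
# sum_j G[i][j] * V[j] (Ohm's law plus Kirchhoff's current law).
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(128, 256))   # cell conductances (siemens), illustrative range
V = rng.uniform(0.0, 0.2, size=256)            # input voltages (volts), illustrative range

I = G @ V   # row output currents (amps): the MVM the analog array performs in place
print(I.shape, I[:3])
```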
While CIM SRAM arrays are available now, emerging technologies like RRAM promise significant power and area savings. SRAMs not only require power to maintain their data, but also have a relatively large silicon footprint, requiring six transistors for each cell. RRAMs are non-volatile and use a single programmable resistor, possibly supported by a single access transistor, per cell.
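A rough back-of-envelope count illustrates why the cell structure matters. The numbers below assume the six-transistor SRAM and 1T1R RRAM cells described above and ignore sense amplifiers, drivers, and other peripherals, which dominate real layouts, so they should be read as a sketch rather than an area estimate.

```python
# Device count for an N x M array of 8-bit weights, assuming six
# transistors per SRAM bit cell versus one transistor plus one resistor
# (1T1R) per RRAM bit cell, as described above. Actual area and standby
# power depend heavily on the process node and the array peripherals.
N, M, bits = 1024, 1024, 8

sram_transistors = N * M * bits * 6   # 6T cell per stored bit
rram_transistors = N * M * bits * 1   # 1T1R cell per stored bit (resistor not counted)

print(f"SRAM: {sram_transistors:,} transistors, plus static power to retain data")
print(f"RRAM: {rram_transistors:,} access transistors, non-volatile storage")
```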
Biological brains and efficiency
Still, practical circuits require complicated tradeoffs between efficiency, versatility, and accuracy. Digital logic, however inefficient it might be for this sort of application, is extremely configurable and extremely accurate. At the other extreme, biological brains are extremely versatile and have very low power requirements, but cannot match the accuracy of digital floating point calculations. Depending on the specific hardware involved, CIM designs fall at various points between the two.
While biological systems are extremely efficient, a number of differences from computing machines are immediately obvious. As Charlotte Frenkel and colleagues observed in a comprehensive review, one of the most important is that biological systems don’t have clocks.[3] Neurons fire in response to patterns of spikes arriving from adjacent neurons, but are largely quiescent otherwise. They don’t poll for updates at regular intervals, and synaptic connections are reinforced (or not) through repeated use. Biological brains accept input from individual sensory organs, but also respond to global signals — both positive and negative — in the form of dopamine, serotonin, and so on. Finally, biological systems learn and respond in real time. Machine learning techniques like backpropagation that “freeze” the system while an update occurs are inherently alien to biological systems.
So what can engineers working in silicon learn from biological brains? The simplest artificial neuron designs rest on the leaky integrate-and-fire (LIF) model, in which signals accumulate until they reach a certain threshold, causing the device to fire. The accumulated signal leaks away over time, allowing the device to distinguish between correlated and uncorrelated signals. However, this model does not preserve sequences of events, and so is not a good solution for applications like sound recognition, where signals evolve over time.
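For readers unfamiliar with the model, a minimal discrete-time LIF neuron can be written in a few lines. The leak, weight, and threshold values below are illustrative rather than taken from any of the cited work; the point is that a burst of correlated spikes crosses the threshold while sparse, uncorrelated spikes leak away.

```python
# Minimal discrete-time leaky integrate-and-fire (LIF) neuron.
# The membrane potential accumulates weighted input spikes, leaks toward
# zero each step, and emits a spike (then resets) when it crosses a
# threshold. Parameter values are purely illustrative.
def lif_neuron(input_spikes, weight=0.3, leak=0.9, threshold=1.0):
    v = 0.0
    out = []
    for s in input_spikes:            # s is 0 or 1 at each time step
        v = leak * v + weight * s     # leak, then integrate
        if v >= threshold:
            out.append(1)             # fire
            v = 0.0                   # reset
        else:
            out.append(0)
    return out

# A correlated burst crosses the threshold; sparse input leaks away.
print(lif_neuron([1, 1, 1, 1, 0, 0, 0, 0]))   # fires on the fourth spike
print(lif_neuron([1, 0, 0, 1, 0, 0, 1, 0]))   # never fires
```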
In biological brains, Hugo Chen explained, dendrites respond to sequences, not individual electrical spikes. Only a certain spike sequence will fire a given dendrite. To construct a sequence-sensitive electronic device, the Stanford University group placed three ferroelectric gates above a single channel. If the first gate — next to the source — fires first, it creates an inversion layer and allows current to flow. The second gate then extends the inversion layer, as does the third. If the second gate fires first, though, there are no minority carriers available. It creates a deep depletion layer, blocking the channel even if gates 1 and 3 fire later.
Fig. 1: 3-gate FeFET “dendrite.” Source: IEDM
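The sequence selectivity Chen described can be captured, very roughly, in a toy behavioral model: inversion can only spread gate by gate from the source side, and firing a gate ahead of the inverted region drives the channel into deep depletion. This is a simplification for intuition only, not a device model.

```python
# Toy behavioral model of the 3-gate FeFET "dendrite" sequence detector
# described above (a simplification; the real device physics is richer).
# Inversion extends outward from the source-side gate one gate at a time;
# firing a gate ahead of the inverted region leaves no minority carriers,
# creating deep depletion that blocks the channel for the rest of the
# sequence.
def channel_conducts(firing_order, n_gates=3):
    inverted_up_to = 0          # how far the inversion layer extends (gate index)
    blocked = False
    for gate in firing_order:   # gates numbered 1 (source side) .. n_gates
        if blocked:
            break
        if gate == inverted_up_to + 1:
            inverted_up_to = gate          # inversion layer extends by one gate
        elif gate > inverted_up_to + 1:
            blocked = True                 # deep depletion blocks the channel
    return (not blocked) and inverted_up_to == n_gates

print(channel_conducts([1, 2, 3]))   # True  -- the correct sequence conducts
print(channel_conducts([2, 1, 3]))   # False -- gate 2 firing first blocks the channel
```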
Other results presented at IEDM looked for devices that might be more compatible with spike-dependent signals. In LIF architectures based on RRAM devices, the device resistance typically encodes the synaptic weight. As T. Bhattacharya explained in an IEDM presentation, UCSB researchers used an additional signal, the device temperature, to encode an “eligibility trace” value. Temperature changes the RRAM resistance, making devices more or less responsive to programming pulses. Once the device heater is turned off, the device gradually returns to its resting state. This “e-Prop” learning approach allows the network to capture short-term dynamics as well as long-term plasticity. In speech recognition, for instance, the first token might cause some devices to “wait” for a relevant second token.
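In the abstract, an eligibility trace is a decaying per-synapse variable that gates how strongly a later learning signal updates the weight; in the UCSB devices, temperature plays that role. The single-synapse loop and constants below are illustrative assumptions, not the e-prop formulation used in the paper.

```python
# Toy sketch of an eligibility trace in the spirit of the e-prop approach
# described above. Coincident pre/post activity bumps a per-synapse trace
# that then decays (temperature cooling, in the UCSB devices); a later
# learning signal only changes weights whose trace is still "warm".
# All constants are illustrative.
def run_synapse(events, decay=0.8, lr=0.1):
    weight, trace = 0.5, 0.0
    for pre, post, learning_signal in events:
        trace = decay * trace + (pre * post)      # trace rises with coincident activity, then decays
        weight += lr * learning_signal * trace    # update is gated by the remaining trace
    return weight

# A reward arriving two steps after coincident activity still credits the
# synapse, because the trace has not fully decayed.
print(run_synapse([(1, 1, 0.0), (0, 0, 0.0), (0, 0, 1.0)]))
```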
Quantifying uncertainty
At the IEDM panel on AI, Anantha Sethuraman of Applied Materials emphasized that overfitting is a real weakness of current machine learning methods. Most models have no innate ability to decide that a solution is “good enough,” but will continue trying to improve the fit between the data and their model until they reach some artificial stopping condition — an amount of time or a number of cycles, for instance. In LLMs, overfitting manifests as “hallucinations.” If asked to elaborate on a previous response, the model might supply plausible nonsense, such as citations to non-existent publications. When confronted with outlier data, the results can be nonsensical or disastrous.
Elisa Vianello, CEA-Leti’s embedded AI program director, explained in an interview that Bayesian networks, in contrast, are able to quantify the uncertainty in their results, identifying data that lies outside their training. Unfortunately, digital implementations of Bayesian networks are extremely inefficient. They work by using probabilistic weights, but generating probability distributions in digital logic greatly increases the circuit complexity. In RRAMs, randomness is a characteristic of the device. While many designs seek to use programming schemes and careful training to compensate for the non-ideal behavior of RRAMs, Bayesian networks try to exploit it. As Vianello explained, each RRAM offers a predictable and repeatable Gaussian distribution of resistance values in response to a programming pulse. With appropriate training, Bayesian networks can use these weights without the usual costs of randomness in digital logic.[4]
Fig. 2: Probability distribution of RRAM conductance values.
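A minimal Monte Carlo sketch conveys the principle: if each device contributes a weight drawn from a known Gaussian, then repeated passes through the network yield a distribution of outputs whose spread quantifies uncertainty. The tiny network, means, and sigmas below are placeholders, and the way randomness is actually generated and harvested in the memristor arrays of [4] differs in detail.

```python
# Illustrative Monte Carlo sketch of a Bayesian layer built on stochastic
# weights. Each pass samples one realization of the array from per-device
# Gaussian conductance distributions; the spread of the resulting outputs
# is an uncertainty estimate. Values below are assumptions, not from [4].
import numpy as np

rng = np.random.default_rng(1)
w_mean = np.array([[0.8, -0.4], [0.2, 0.6]])     # programmed target conductances (arbitrary units)
w_sigma = 0.05 * np.ones_like(w_mean)            # per-device Gaussian spread (assumed)

def predict(x, n_samples=200):
    outs = []
    for _ in range(n_samples):
        w = rng.normal(w_mean, w_sigma)          # one stochastic realization of the array
        outs.append(np.tanh(w @ x))
    outs = np.array(outs)
    return outs.mean(axis=0), outs.std(axis=0)   # mean prediction and its uncertainty

mean, std = predict(np.array([1.0, -1.0]))
print("prediction:", mean, "uncertainty:", std)
```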
Next: Accuracy and Versatility
After efficiency, important challenges for CIM designs are versatility, including scalability, and accuracy. Can laboratory demonstrations scale to applications requiring billions or trillions of parameters? Are CIM architectures flexible enough to accommodate tomorrow’s algorithms as well as today’s? Can inherently stochastic devices like RRAMs deliver accuracy comparable to digital floating point systems? The next article in this series covers more recent results with a focus on these issues.
References
[1] Hugo J.-Y. Chen, et al., “Multi-gate FeFET Discriminates Spatiotemporal Pulse Sequences for Dendrocentric Learning,” IEEE International Electron Devices Meeting (IEDM), San Francisco, December 2023, paper 5.1.
[2] T. Bhattacharya, et al., “ReRAM-Based NeoHebbian Synapses for Faster Training-Time-to-Accuracy Neuromorphic Hardware,” IEEE International Electron Devices Meeting (IEDM), San Francisco, December 2023, paper 5.2.
[3] C. Frenkel, D. Bol and G. Indiveri, “Bottom-Up and Top-Down Approaches for the Design of Neuromorphic Processing Systems: Tradeoffs and Synergies Between Natural and Artificial Intelligence,” in Proceedings of the IEEE, vol. 111, no. 6, pp. 623-652, June 2023, doi: 10.1109/JPROC.2023.3273520.
[4] D. Bonnet, et al., “Bringing uncertainty quantification to the extreme-edge with memristor-based Bayesian neural networks.” Nat Commun 14, 7530 (2023). https://doi.org/10.1038/s41467-023-43317-9
Related Reading
Increasing AI Energy Efficiency With Compute In Memory
How to process zettascale workloads and stay within a fixed power budget.
3D In-Memory Compute Making Progress
Researchers at the VLSI Symposium look to indium oxides for BEOL device integration.