Why commercialization will require improvements in devices and architectures.
To integrate devices into functioning systems, it’s necessary to consider what those systems are actually supposed to do.
Regardless of the application, machine learning tasks involve a training phase and an inference phase. In the training phase, the system is presented with a large dataset and learns how to “correctly” analyze it. In supervised learning, the dataset is already labeled by human experts, and the task is to produce the “right” answer when presented with new but related examples.
An image search algorithm might learn to place images in one of several predefined categories, for example. In unsupervised learning, the training data is not labeled, and it is up to the algorithm to determine what categories apply. In the inference phase, the system actually does whatever task it has been trained to do. It classifies unknown images, it extracts information from a video stream, it navigates an autonomous vehicle through a neighborhood.
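As a rough illustration of these two phases, the minimal NumPy sketch below trains on a toy, made-up 2-D dataset in both the supervised setting (labels provided) and the unsupervised setting (a simple k-means loop), then runs a single “inference” step on a new example. All data, categories, and parameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D "feature" data drawn from two clusters (illustrative only).
cats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
turtles = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X = np.vstack([cats, turtles])

# --- Supervised training: labels are provided, learn per-class centroids.
y = np.array([0] * 50 + [1] * 50)
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

# --- Unsupervised training: no labels, let a k-means loop discover two groups.
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(10):
    assign = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X[assign == c].mean(axis=0) if np.any(assign == c)
                        else centers[c] for c in (0, 1)])

# --- Inference: classify a new, unlabeled example with the learned centroids.
new_image_features = np.array([[2.8, 3.1]])
pred = np.argmin(((new_image_features - centroids) ** 2).sum(-1))
print("predicted class:", pred)   # 1 -> "turtle" in this toy labeling
```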
The most significant advances in machine learning in recent years have come from hardware improvements. Richard Windsor, an analyst for the Edison Group, pointed out that the leaders in commercial machine learning applications are simply the companies with access to the largest training datasets and the biggest data centers. For example, Jeff Wesler, director of IBM Research at the Almaden Research Center in San Jose, noted that the best performers on the ImageNet classification benchmark use an eight-layer neural network with 60 million parameters. One pass through the algorithm for one image requires 20 billion operations.
Arizona State University researcher Pai-Yu Chen (paper 6.1 at the 2017 IEDM conference) estimated that more sophisticated applications, like feature extraction from high-definition video, will require a speedup of three to five orders of magnitude to be commercially viable. It is not clear how much longer hardware improvements alone will continue to provide the gains the industry needs.
The specific details depend on the algorithm and the network design, but training a neural network involves determining the appropriate weight for each of the network’s parameters. To differentiate between a cat and a turtle, what are the key features? The hard edge of the shell, the curve of the tail, or the sharpness of the ears? How important is each of these to the overall image?
Over many iterations and many training examples, the network adjusts parameter weights to achieve an acceptable level of accuracy. In the current state-of-the-art, the need for labeled training data is a significant bottleneck for machine learning. The MNIST handwritten digit database, commonly used as a starting point for machine learning experiments, contains 60,000 training images. The ImageNet database seeks to obtain 1,000 labeled images for each of 100,000 concepts.
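The sketch below is a generic, hypothetical illustration of this iterative weight adjustment: a small logistic-regression loop on made-up feature data, not the specific networks or training procedures discussed above. Each pass nudges every weight up or down to reduce the classification error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification: 4 hand-made "features" per example
# (e.g. shell-edge hardness, tail curvature, ear sharpness, ...).
X = rng.normal(size=(200, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])          # hidden "ground truth"
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(4)                                   # parameter weights to learn
lr = 0.1
for epoch in range(100):                          # many iterations...
    p = 1.0 / (1.0 + np.exp(-(X @ w)))            # predicted probabilities
    grad = X.T @ (p - y) / len(y)                 # logistic-loss gradient
    w -= lr * grad                                # adjust each weight up or down

acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y).mean()
print("learned weights:", np.round(w, 2), "accuracy:", acc)
```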
From learning to inference
The inference phase then simply uses those weights to analyze new data. Inference tasks can use lower-precision weights and much less powerful hardware. For example, Chen said, an image classification algorithm might use six-bit weights for training, but only one or two bits for inference.
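As a hedged illustration of that precision gap, the snippet below applies a simple uniform quantizer to a set of random “trained” weights at 6-bit, 2-bit, and 1-bit precision. The quantization scheme here is an assumption for illustration; it is not the specific method Chen described.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of weights to the given bit width."""
    levels = 2 ** bits - 1
    scale = np.abs(w).max() / (levels / 2) if np.abs(w).max() > 0 else 1.0
    return np.round(w / scale) * scale

rng = np.random.default_rng(2)
w_train = rng.normal(scale=0.2, size=1000)        # full-precision trained weights

w6 = quantize(w_train, 6)                         # ~6-bit weights (training-grade)
w2 = quantize(w_train, 2)                         # 2-bit weights (inference-grade)
w1 = np.sign(w_train) * np.abs(w_train).mean()    # binary weights, one scale factor

for name, wq in [("6-bit", w6), ("2-bit", w2), ("1-bit", w1)]:
    err = np.abs(wq - w_train).mean()
    print(f"{name} mean absolute quantization error: {err:.4f}")
```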
For many tasks, designers plan to perform the weight calculations on a supercomputer, then write them to an autonomous or semi-autonomous device. A device might be pre-configured with appropriate weights and parameters at the factory, for example.
Performing inference tasks at the local device level offers substantial power savings. Transferring an image to a data center takes more energy than the actual classification task. In a presentation at Imec’s 2017 SemiconWest Technology Forum, Praveen Raghavan estimated that transmitting a single bit to a data center consumes 10⁻⁶ joules, while a single-bit local operation requires only 10⁻¹⁵ joules. Local tasks also can be faster and do not require ubiquitous connectivity. Speed is especially important for applications that require real-time response. Wesler estimated that an autonomous vehicle needs 200 msec response time or better.
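A back-of-envelope calculation using Raghavan’s per-bit energy estimates and the 20-billion-operation figure quoted earlier suggests why local inference is attractive. The image size, and the simplification that every operation costs one single-bit operation’s worth of energy, are assumptions for illustration.

```python
# Back-of-envelope comparison using the figures quoted above.
# The image size is an assumption; per-bit energies come from Raghavan's
# estimate, and the operation count from the ImageNet example earlier.

E_TX_PER_BIT = 1e-6            # joules to transmit one bit to a data center
E_OP_PER_BIT = 1e-15           # joules for one single-bit local operation

image_bits = 224 * 224 * 3 * 8     # assumed ~150 KB RGB image
ops_per_image = 20e9               # operations per classification

energy_transmit = image_bits * E_TX_PER_BIT
energy_local = ops_per_image * E_OP_PER_BIT

print(f"transmit image:   {energy_transmit:.3e} J")
print(f"classify locally: {energy_local:.3e} J")
print(f"ratio:            {energy_transmit / energy_local:.1e}x")
```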
If local inference tasks use weights calculated elsewhere, they place less onerous requirements on potential memristor devices. The SET and RESET operations for memristor devices can have different physical mechanisms and respond to electrical pulses differently. One is not necessarily the inverse of the other. This asymmetry is very challenging for learning applications, where dynamic resistance changes are used to calculate and store new weights, but it is less of a concern for inference applications, where weights are set once and then applied repeatedly.
Put another way, learning is a calculation task, while inference is a more conventional memory task. As such, inference tasks may be the first to incorporate memristors. Their area and power advantages over SRAMs are significant, while their non-linear response is less relevant in inference situations.
Not all crossbars are created equal
Still, the limitations of memristor devices pose significant challenges for even “simple” inference applications. For example, consider the crossbar array, one of the simplest and most area-efficient proposed designs. A crossbar array has rows of “word line” electrodes intersecting with “bit line” columns. A memristor placed at each intersection is addressed by a unique word line / bit line pair. Most crossbar designs add a selector transistor or other element for each device to control the effects of leakage and crosstalk within the array.
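In an idealized crossbar (ignoring leakage, selectors, and wire resistance), the weighted sums at the heart of inference come almost for free: applying input voltages to the word lines produces bit-line currents that are, by Ohm’s and Kirchhoff’s laws, a vector-matrix product of the inputs and the stored conductances. The sketch below shows that equivalence with made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(3)

# Idealized crossbar: one memristor at each word-line / bit-line crossing.
# Conductances G encode the weights; input voltages on the word lines
# produce, by Ohm's law, currents that sum down each bit line (Kirchhoff).
n_wordlines, n_bitlines = 4, 3
G = rng.uniform(1e-6, 1e-4, size=(n_wordlines, n_bitlines))   # siemens
V = rng.uniform(0.0, 0.2, size=n_wordlines)                   # volts

I_bitline = V @ G          # one "analog" vector-matrix multiply
print("bit-line currents (A):", I_bitline)

# The same result computed element by element, for comparison.
I_check = np.array([sum(V[i] * G[i, j] for i in range(n_wordlines))
                    for j in range(n_bitlines)])
assert np.allclose(I_bitline, I_check)
```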
What happens next depends on the specific device being used. As Damien Querlioz and colleagues explained, there are enormous differences in behavior among memristor devices. On one hand, conductive bridge RAM and MRAM are stochastic binary devices. The behavior of the device depends on the formation of a conductive path or the movement of magnetic domains, and either it has switched or it has not. Especially in low voltage applications, there can be stochastic variation in the response of devices to a particular programming pulse, but the final conductance is fixed.
In contrast, phase-change memory and filamentary devices like RRAM are cumulative. Increasing the number of programming pulses increases the conductance up to some maximum value. Because the effect of each individual programming pulse will vary, the number and strength of pulses needed to achieve a given conductance value also will vary.
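A simple write-verify loop illustrates this cumulative, variable behavior. The device model below (step size, saturation, noise) is invented for illustration; the point is only that programming the same target conductance takes a different number of pulses each time.

```python
import numpy as np

rng = np.random.default_rng(4)

def apply_pulse(g, g_max=1.0, mean_step=0.05, sigma=0.02):
    """One programming pulse: conductance rises by a variable increment,
    saturating at g_max. This device model is purely illustrative."""
    return min(g_max, g + max(0.0, rng.normal(mean_step, sigma)))

def program_to_target(target, tol=0.02, max_pulses=100):
    """Write-verify loop: pulse, read back, stop when close enough."""
    g, pulses = 0.0, 0
    while g < target - tol and pulses < max_pulses:
        g = apply_pulse(g)
        pulses += 1
    return g, pulses

# Programming the same target on several "devices" takes a different
# number of pulses each time, because each pulse's effect varies.
for trial in range(5):
    g, n = program_to_target(0.6)
    print(f"trial {trial}: reached g={g:.3f} after {n} pulses")
```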
In both types of devices, process defects and process variability can affect the ultimate device behavior. The physics of the conductance change will depend on the composition, structure, and uniformity of the device layers.
Fig. 1: Different nanodevices and their behavior. (a) Cumulative memristive device. (b) Phase-change memory. (c) Conductive bridge memory. (d) Spin transfer torque magnetic tunnel junction (basic cell of STT–MRAM). Image from D. Querlioz, O. Bichler, A. F. Vincent and C. Gamrat, “Bioinspired Programming of Memory Devices for Implementing an Inference Engine,” in Proceedings of the IEEE, vol. 103, no. 8, pp. 1398-1416, Aug. 2015. doi: 10.1109/JPROC.2015.2437616
In filamentary devices, fabrication errors and process defects can lead to “stuck OFF” devices that fail to switch, or can cause a systematic bias in switching characteristics or ultimate conductance values. According to Wen Ma and colleagues at the University of Michigan, in work presented at IEDM 2016, stuck OFF devices can be accommodated by building redundancy into the array. Systematic bias can be much more serious, as the weights actually programmed into the array may not match the weights calculated during the training phase. Programming errors can lead to “stuck ON” devices, where a single device in a row or column is stuck at the maximum conductance value, overwhelming the signal from other devices in that row. Device-to-device and cycle-to-cycle variations also can cause the weights actually stored in the crossbar array to differ from those calculated ex-situ.
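The toy calculation below shows why a single stuck-ON cell matters: its contribution to the bit-line current can rival the combined signal of many correctly programmed neighbors. All conductance values, voltages, and array sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

n_rows = 64
G_on, G_off = 1e-4, 1e-6                      # assumed ON/OFF conductances (S)
V = rng.uniform(0.0, 0.2, size=n_rows)        # read voltages on the word lines

# A healthy column: conductances spread across the usable programming range.
G_col = rng.uniform(G_off, G_on / 10, size=n_rows)
healthy_current = V @ G_col

# The same column with one device stuck at maximum conductance.
G_stuck = G_col.copy()
G_stuck[17] = G_on
faulty_current = V @ G_stuck

print(f"healthy column current: {healthy_current:.3e} A")
print(f"with one stuck-ON cell: {faulty_current:.3e} A")
print(f"single defect contributes "
      f"{(faulty_current - healthy_current) / faulty_current:.0%} of the total")
```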
Because ex-situ learning cannot account for non-ideal inference hardware, calculating parameter weights separately from the device performing the inference task is not as simple as it might appear. One alternative is a hybrid approach, where initial weights are calculated ex-situ, but additional training steps involving the actual hardware are used to adjust the final values. While these “tuning” steps need not place the same demands on the hardware that the original calculations do, any local training will necessarily require the ability to adjust weights both upward and downward, and therefore must confront the non-linear behavior of the memristors.
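A loose sketch of such a hybrid flow appears below: weights are computed ex-situ, written into a simulated non-ideal array (with a systematic bias and noise invented for illustration), and then refined with a few tuning steps run against the “hardware.” The sketch deliberately ignores the asymmetric, non-linear update behavior that real memristor tuning would have to confront.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy linear model trained ex-situ (e.g. on a server with ideal arithmetic).
X = rng.normal(size=(300, 8))
w_true = rng.normal(size=8)
y = X @ w_true
w_ideal = np.linalg.lstsq(X, y, rcond=None)[0]          # ex-situ weights

# Writing the weights into a non-ideal array: systematic bias plus noise.
def program_array(w):
    return 0.9 * w + rng.normal(scale=0.05, size=w.shape)

w_chip = program_array(w_ideal)
print("error before tuning:", np.mean((X @ w_chip - y) ** 2))

# A few "hybrid" tuning steps run against the actual (simulated) hardware,
# nudging the stored weights up or down based on measured output error.
lr = 0.05
for _ in range(50):
    grad = X.T @ (X @ w_chip - y) / len(y)
    w_chip = w_chip - lr * grad        # each update must be writable on-chip
print("error after tuning: ", np.mean((X @ w_chip - y) ** 2))
```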
Mahendra Pakala, senior director for memory and materials at Applied Materials, noted that phase-change memory is somewhat more mature than RRAM at this point. The crystallographic states responsible for device behavior in phase change memory are more stable, and the conductance difference between the ON and OFF states is larger. As a result, the programming window – the gap between the lowest ON state conductance and the highest OFF state conductance – tends to be larger. If that gap is too small, the ON and OFF states can become indistinguishable. A programming sequence that SETs one device might RESET another.
In current RRAMs, the distribution of device resistance is broader. For this reason, it’s not possible to write ex-situ weights to RRAM devices using a fixed programming scheme. However, Abinash Mohanty and other researchers at Arizona State University observed (IEDM 2017 paper 6.3) that the ultimate goal is accurate inference results, not accurate storage of arbitrary weight values. Once the array contains approximate values, their proposed random sparse adaptation algorithm accommodates the non-ideal behavior of RRAM devices by replicating and tuning a small portion of the RRAM array in on-chip memory.
Weights stored in RRAM devices are also less stable over the long term. RRAM conductance depends on the number and distribution of oxygen vacancies in the conducting layer. Over time, atoms making up the conducting filament can drift under the influence of both the applied electric field and the ambient temperature. While the aggregate behavior of the filament as a whole is predictable, the movement of each individual atom is probabilistic. Thus, Stanford University’s Tuo-Hung Hou explained (IEDM 2017 paper 11.6), the conductance of the device may be binary, but the number of defects is an analog function of the stress time. The conductance will follow a Weibull-like distribution. The impact of this time-sensitive variation is highly dependent on the programming algorithm being used.
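The snippet below illustrates the flavor of this behavior with an invented retention model: each cell’s conductance loss over a stress period is drawn from a Weibull distribution whose scale grows with time. The shape and scale parameters are placeholders, not measured values.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative retention model: each programmed cell loses conductance by an
# amount drawn from a Weibull distribution whose scale grows with stress time.
# Shape and scale parameters below are invented for illustration only.
def conductance_after(g0, stress_hours, shape=1.2, scale_per_hour=0.002):
    scale = scale_per_hour * stress_hours
    drift = scale * rng.weibull(shape, size=g0.shape)
    return np.clip(g0 - drift, 0.0, None)

g_programmed = np.full(10000, 0.5)                 # nominally identical cells
for hours in (1, 10, 100):
    g = conductance_after(g_programmed, hours)
    print(f"after {hours:4d} h: mean={g.mean():.3f}, "
          f"5th-95th pct spread={np.percentile(g, 95) - np.percentile(g, 5):.3f}")
```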
What’s next?
There’s much more to neuromorphic computing than non-volatile memory arrays. Pakala expects to see much more development of both devices and architectures before recognizably “neuromorphic” commercial devices emerge. Though very interesting designs have been simulated, a simulation can’t simply be ported from software to a crossbar array; it still needs interface circuits and device read-and-write strategies.
Related Stories
Neuromorphic Computing: Modeling The Brain
Competing models vie to show how the brain works, but none is perfect.
3D Neuromorphic Architectures
Why stacking die is getting so much attention in computer science.
Toward Neuromorphic Designs
From synapses to circuits with memristors.
Terminology Beyond Von Neumann
Neuromorphic computing and neural networks are not the same thing.
Materials For Future Electronics
Flexible electronics, new memory types, and neuromorphic computing dominate research.