Different approaches for improving performance and lowering power in ML systems.
As more designers employ machine learning (ML) in their systems, they’re moving from simply getting the application to work to optimizing the power and performance of their implementations. Some techniques are available today. Others will take time to percolate through the design flow and tools before they become readily available to mainstream designers.
Any new technology follows a basic pattern — early implementations focus on the most basic embodiment of an idea. The goal is to “get it to work.” Only after this phase gains momentum does the focus shift to “getting it to work better,” where “better” can mean doing things with higher speed, lower power, greater efficiency, or lower cost, or some combination of those.
For ML inference neural networks, the first phase has meant using canonical networks that have been proven effective and training them with data sets using supervised training techniques. The competitive edge here comes from implementing machine learning before the competition. While many companies are still embracing ML at this level, others are taking the next step, competing not just on the existence of an ML algorithm but on the quality of its implementation. This includes doing inference at the edge instead of in the cloud.
“AI accelerators are trying to get closer to the edge, if not at the edge,” said Ron Lowman, strategic marketing manager for IP at Synopsys. “ADAS [advanced driver-assistance systems] is driving a ton of development.”
Outside of research labs – and even within them – designers depend on a set of tools to help implement new networks and new techniques. In many cases, researchers will be working on ideas that aren’t available yet in those tools. There is a general flow of such ideas from the halls of academia into the major design frameworks and, from there, to actual production of real designs that leverage those new ideas.
Some techniques reflect approaches that can be implemented in software in the various network kernels. In other cases, they may benefit from training and inference chips that support the new techniques in hardware. “In AI, hardware and software have to go hand-in-hand. In some cases, the compiler itself may need to be augmented,” said Meghan Daga, Cadence’s director of product management for AI Inference at the Edge.
Broad uptake of anything new that’s reasonably complex will happen only when there’s robust hardware, tool, and framework support. “Customers train their models on Nvidia hardware,” said Flex Logix CEO Geoff Tate. “If Nvidia doesn’t support it, then customers don’t do it.”
Given such a scenario, inference can benefit from increased competition in the training realm.
Fig. 1: New techniques flow from research to mainstream availability. Source: Bryon Moyer/Semiconductor Engineering
Optimizing an implementation, however, means different things to different applications — and, in particular, where the application will run. If it will run in the cloud, then speed is the main metric. While power is a concern in data centers, it takes a back seat to performance. By contrast, at the edge — or the farther one gets from the cloud — the critical metric becomes performance for a given level of energy consumption. “Server chips are going into [5G] base stations,” said Lowman. “Netflix is doing local caching with ISPs [internet service providers]. They may be able to do some of the AI stuff in the way that 5G [base stations] are. The key metric is TOPS in the data center, TOPS/W elsewhere.”
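As a rough illustration of the two metrics, the back-of-the-envelope arithmetic looks like this. The numbers below are hypothetical and not tied to any particular chip; they simply show how raw throughput (TOPS) and efficiency (TOPS/W) relate.

```python
# Rough illustration of TOPS vs. TOPS/W (hypothetical accelerator numbers).
macs_per_cycle = 4096          # parallel multiply-accumulate units (assumed)
clock_hz = 1.0e9               # 1 GHz clock (assumed)
ops_per_mac = 2                # one multiply plus one add

tops = macs_per_cycle * clock_hz * ops_per_mac / 1e12
print(f"Peak throughput: {tops:.1f} TOPS")

power_watts = 4.0              # assumed power envelope for an edge device
print(f"Efficiency: {tops / power_watts:.1f} TOPS/W")
```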
Many of the early applications of ML, such as predictive analytics and speech recognition, reside in the cloud and are implemented by behemoths like Google and Facebook. Edge-oriented applications overwhelmingly focus on vision applications at the moment. This reflects an overall system optimization goal of balancing the work done at the edge and in the cloud. While the cloud has more computing power, data must be communicated from the edge to the cloud, and that comes with a cost. Doing more at the edge reduces the need for transmitting data, freeing up bandwidth and reducing decision latency. Vision applications — and video in particular — generate large volumes of data, so doing more at the edge reduces that burden.
Optimization today
As designers become more comfortable with neural networks, there has been a move to custom networks. “Customers are starting to think architecturally. They’re not just thinking about the network,” said Daga. “Optimization used to be all about PPA (performance, power, area). Now memory is the main topic.”
Development often starts with a well-known network, but then evolves as a team tries to make improvements for a specific application. For vision applications, a typical starting point is the ResNet50 network. But that serves mostly as a first-step benchmarking tool for inference hardware.
“ResNet50 numbers start the conversation,” said Lowman. “In reality, [customers] are well beyond ResNet50. That’s getting kind of old.”
Many designers still find that more modern off-the-shelf networks are sufficient for their needs. “Lots of customers are happy with YOLOv3,” said Tate. That said, vendors of platforms and tools increasingly are working with designers who want to customize a network. Having chosen a hardware platform (or at least narrowed down their options), they’ll take a standard network and start changing it.
“Customers may change the input width or redistribute feature depth between layers,” said Vinay Mehta, Flex Logix’s inference technical marketing manager.
Numerical precision is a basic consideration for edge inference. While training typically operates with floating-point numbers, edge applications can benefit from using integer math instead, reducing the amount of inference hardware and lowering power. This involves quantizing weights and activations, which in itself is not new. But older implementations picked a uniform quantization precision for the whole network. Today, some networks are being quantized with higher precision on some layers than on others, concentrating precision where it is needed rather than wasting it where it isn’t.
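As a minimal sketch of what per-layer quantization of weights looks like in code, assuming NumPy, int8 targets, and simple post-training quantization rather than any particular framework’s API:

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 8):
    """Symmetric per-layer quantization: one scale per layer, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for int8
    scale = np.max(np.abs(weights)) / qmax     # map the largest magnitude to qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Each layer gets its own scale, so precision is spent where that layer needs it.
layers = {"conv1": np.random.randn(3, 3, 3, 16),
          "conv2": 0.1 * np.random.randn(3, 3, 16, 32)}
quantized = {name: quantize_symmetric(w) for name, w in layers.items()}
```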
One newer approach to balancing precision against hardware simplicity is “asymmetric quantization.” When quantizing real numbers into integers after training, a symmetric approach is typical, giving a range of numbers (positive and negative) that spans the needed dynamic range with 0 centered in it. But some applications may have a greater positive range than negative range (or vice versa). With symmetric quantization, much of the range is then wasted for no other reason than to keep 0 in the middle of the range and to simplify the math.
It’s possible instead to quantize onto a range that reflects the true dynamic range on the positive and negative sides, allowing greater precision within the useful area. “Layer-by-layer or channel-by-channel quantization may be a mix of symmetric or asymmetric quantization,” said Daga. “TensorFlow can retrain a quantized network, and it allows layer-by-layer (including asymmetric) quantization.” The cost of asymmetric quantization is that calculations become more complex. Some experimental platforms presented at this year’s ISSCC conference included hardware that handles this math, removing that complexity as an issue for a software algorithm.
Fig. 2: Symmetric vs. asymmetric quantization. Source: Bryon Moyer/Semiconductor Engineering
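A companion sketch of the asymmetric variant shown in Fig. 2, again assuming NumPy and post-training quantization: the range is fitted to the observed minimum and maximum rather than centered on zero, at the cost of carrying a zero-point through the math.

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    """Asymmetric quantization: fit [min, max] onto the full unsigned integer range."""
    qmax = 2 ** bits - 1                      # e.g. 255 for uint8
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax
    zero_point = int(round(-x_min / scale))   # the integer that represents real value 0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Activations after ReLU are mostly positive, so a symmetric range would waste
# nearly half of its codes on negative values that never occur.
acts = np.maximum(0, np.random.randn(1000) * 0.5)
q, scale, zp = quantize_asymmetric(acts)
restored = dequantize(q, scale, zp)
```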
Another way of reducing the overall amount of data is to use a greater “stride” in convolutional neural networks (CNNs). CNNs work by sliding a window across an image; a 3-pixel-x-3-pixel window might be typical. The first window is calculated, and then the window is moved to the right by one pixel and a new calculation is performed. This continues until the right edge of the image is reached, and then the window moves back to the left, but 1 pixel down. This describes a setup using a stride of 1, since the window moves by 1 pixel at a time.
Some projects are experimenting with larger strides. A stride of 2 means the window moves to the right by two pixels at a time, and after the right edge of the image is reached, it moves down by 2 pixels. This halves the number of window positions in each direction, reducing the output volume — and the computation — by a factor of 4 (2 horizontally, 2 vertically). It comes at the expense of some precision, even though the “skipped” pixels aren’t omitted from the calculations outright, since they still appear within the windows of the remaining calculations. A designer would need to decide whether that loss of precision is acceptable for the application.
“One of the benefits of using a stride of 2 is that you get an extra cycle to do the computations, since you ignore the intermediate data being fed from memory,” said Michael Fingeroff, high-level synthesis technologist at Mentor, a Siemens Business. Many hardware platforms already support strides greater than 1, but the technique does not appear to be in common use at present.
Fig. 3: Strides of 1 and 2, both horizontally and vertically. Source: Bryon Moyer/Semiconductor Engineering
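A sketch of how stride affects the amount of work, assuming NumPy, a single-channel input, and a deliberately naive loop-based convolution; a stride of 2 roughly quarters the number of window positions.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Naive single-channel convolution; each output element is one window's sum of products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride : i * stride + kh,
                           j * stride : j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.random.rand(224, 224)
kernel = np.random.rand(3, 3)
print(conv2d(image, kernel, stride=1).shape)   # (222, 222) window positions
print(conv2d(image, kernel, stride=2).shape)   # (111, 111): roughly a quarter of the work
```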
All of this experimentation, as designers try different ways to optimize a network, has spurred the use of high-level synthesis (HLS) to take C code and generate equivalent hardware. The holy grail of HLS has been to turn untimed C code into hardware, but it is still not possible to extract parallelism out of such code.
“The pure algorithmic description of a convolution does not contain sufficient detail to extract the optimal memory architecture,” said Fingeroff. Instead, a timed C model can be turned into hardware for simulation within a SystemC testbench so that new ideas can be rapidly simulated to find the most effective approach.
Keeping it sparse
Meanwhile, sparsity is a model characteristic that is increasingly being leveraged. Models are represented as matrices, and matrix products involve billions or even trillions of multiply-accumulate (or sum-of-products) operations. The more matrix entries that are zero, the fewer operations are needed, since an accelerator that recognizes zero-value entries can skip the math on them. They contribute nothing to a sum of products. In addition, “Sparsity helps with run-length compression,” said Fingeroff. And rather than feeding in an entire sparse matrix, you can send a vector that says which entries are non-zero.
“With weights, you know in advance and can pre-calculate the vectors,” he said. “With activations, you can generate those vectors in real-time.” The use of activation functions can contribute to greater sparsity, with activations becoming sparser as they move through the network. “Deeper in the network, many activations will go to zero due to things like ReLU,” said Fingeroff. Because this sparsity evolves at run-time, sparsity-aware hardware cannot rely on pre-processing. It has to react to it in real time.
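A minimal sketch of the idea in NumPy: store only the non-zero entries plus an index vector saying where they are, and skip the multiplies for everything else. (This models the bookkeeping in software; a sparsity-aware accelerator would do the equivalent in hardware.)

```python
import numpy as np

def to_sparse(v: np.ndarray):
    """Compress a mostly-zero vector into (indices, values): the 'which entries are non-zero' vector."""
    idx = np.nonzero(v)[0]
    return idx, v[idx]

def sparse_dot(idx, values, weights):
    """Sum of products over non-zero activations only; zero entries are skipped entirely."""
    return np.dot(values, weights[idx])

# ReLU drives many activations to exactly zero, so deeper layers tend to get sparser.
acts = np.maximum(0, np.random.randn(1024))      # roughly half the entries are zero
weights = np.random.randn(1024)

idx, vals = to_sparse(acts)
assert np.isclose(sparse_dot(idx, vals, weights), np.dot(acts, weights))
print(f"{len(idx)} of {len(acts)} multiplies actually needed")
```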
“We dynamically look at activation sparsity already, deciding at load/store whether a calculation is required,” said Cadence’s Daga. “The next step is enhancing sparsity.”
An unusual contributor to the cause of sparsity is a new image sensor from Prophesee. While most image sensors generate an entire frame’s worth of data in synchrony with a frame-rate clock, the Prophesee sensor doesn’t. Successive frames of a video stream have large portions that remain unchanged from frame to frame, so much of that data might be considered wasted.
Prophesee uses a different approach. Each pixel measures not the absolute amount of light impinging on it, but rather a change in that light beyond some adjustable threshold. So rather than churning out frame after frame, each pixel reports “asynchronously” when its threshold is reached or exceeded. While it’s not strictly asynchronous, the timing precision is around 1 μs, for an equivalent “frame rate” of 1 MHz. This means fast events that might have been missed in a more traditional architecture, because they occurred between frames, now can be captured. And the data stream is limited to those pixels that have changed, along with a timestamp. “At the system level, with reduced data, this can enable lower cost of ownership,” said Luca Verre, founder and CEO of Prophesee.
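A software sketch of the event-generation idea follows. It is not Prophesee’s actual pipeline, just an illustration of the principle: each pixel emits a timestamped event only when its intensity moves more than a threshold away from the last value it reported.

```python
import numpy as np

def frames_to_events(frames, timestamps_us, threshold=0.1):
    """Convert a frame sequence into (t, x, y, polarity) events, event-camera style.

    An event fires only when a pixel's intensity moves more than `threshold`
    away from the last value that pixel reported.
    """
    last_reported = frames[0].astype(np.float32).copy()
    events = []
    for frame, t in zip(frames[1:], timestamps_us[1:]):
        diff = frame.astype(np.float32) - last_reported
        changed = np.abs(diff) > threshold
        for y, x in zip(*np.nonzero(changed)):
            events.append((t, x, y, 1 if diff[y, x] > 0 else -1))
            last_reported[y, x] = frame[y, x]
    return events

# A mostly static scene with a small bright object appearing briefly in frame 5.
frames = np.tile(np.random.rand(32, 32), (10, 1, 1))
frames[5, 10:12, 10:12] += 0.5
timestamps = np.arange(10) * 33_333        # ~30 fps expressed in microseconds
print(len(frames_to_events(frames, timestamps)), "events")   # only the changed pixels report
```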
Alexandre Valentian, head of Advanced Technologies and System-on-Chip Laboratory for Leti, noted that event-based sensors like this can work well with temporal-coded spiking neural networks, although those aren’t yet ready for commercial use.
The cost question isn’t a trivial one, since standard optical sensors are produced in enormous volumes. With any new approach there is something of a chicken-and-egg challenge: volumes can’t be achieved until costs come down, but costs can’t come down until high volumes are achieved.
“Our initial generation implementations are based on standard CMOS manufacturing processes and really don’t introduce any significant new cost issues,” Verre said. “The volume dictated by the application is a factor, but that is not unique to any chip supplier. With our partnership with Sony, we are working with a company that knows the dynamics of volume pricing and consumer market better than anyone. While the new 3D stacking approach we are using is more complex, Sony has used it already in other high-volume applications. This ensures the solution’s competitiveness from a price perspective, compared to other CIS offerings on the market.”
Granted, this chip isn’t intended for professional photography or applications requiring millions of pixels. “This is optimized to do real-time high-speed vision under any lighting conditions. It’s not optimized to create an image, although it’s possible to do,” said Verre. “We are talking about perception applications where resolution isn’t that high. In industrial, the sweet spot for us is HD.” He noted that it’s particularly effective at tracking items as they move through the field of view.
Because this is a new and different technique, standard frame-based algorithms must be revised (Prophesee is working with partners on that). In addition, existing labeled training data sets must be modified to create sets that work with the new approach. Verre noted, “[Our] intent is to make adoption of event-based processing as intuitive and seamless as possible. A growing part of our solution is the software tools, programming capabilities and data sets that will allow designers to experiment with various options and be guided to optimal design implementation based on their specific requirements.”
Future optimization opportunities
There are a number of additional optimization opportunities under exploration, including several that were presented at ISSCC. While the ISSCC ideas – both the optimization techniques themselves and the hardware to support them – will need to be proven out and worked into the frameworks before they become readily available, it may be possible to leverage them on more generic hardware today.
A paper from a collaboration between Tsinghua University, Pi2Star Technology, and Anhui University generated greater sparsity by using standard frame-by-frame processing, but evaluating the entire frame only once, for the first frame. After that, the hardware calculated the difference between the prior frame and the current frame. Just as in the Prophesee case, this generates far sparser matrices, but the processing has to change, since the network is no longer looking at an image outright.
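A minimal sketch of the frame-differencing idea, not the paper’s actual implementation: process the first frame in full, then feed only the deltas, which are mostly exact zeros when the scene barely changes.

```python
import numpy as np

def frame_deltas(frames):
    """Yield the full first frame, then only frame-to-frame differences.

    Static regions become exact zeros, so downstream sparsity-aware hardware
    can skip the corresponding multiply-accumulates.
    """
    prev = frames[0].astype(np.float32)
    yield prev                       # first frame is evaluated in full
    for frame in frames[1:]:
        current = frame.astype(np.float32)
        yield current - prev         # mostly zeros when the scene barely changes
        prev = current

frames = np.repeat(np.random.rand(1, 64, 64), 8, axis=0)
frames[4, 20:24, 20:24] += 0.3       # a small change appears in one frame
for i, d in enumerate(frame_deltas(frames)):
    print(f"frame {i}: {np.mean(d == 0):.1%} zero entries")
```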
Another optimization noted in a number of papers is depthwise/pointwise convolution (sometimes referred to as depthwise-separable convolution, or DSC). This approach splits what would be a single standard convolution, involving height, width, depth, and the number of desired output channels, into two separate ones. The first applies a spatial filter to each depth channel independently, giving it the name “depthwise.” The second mixes the channels with a kernel of size 1, giving it the name “pointwise.”
DSC can drastically reduce the number of calculations and the number of parameters. There is some loss of information – especially if the original frame is small, yielding too few parameters to train effectively. But some of the experimental results suggest that suitable accuracy is possible.
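A rough comparison of the arithmetic cost, using assumed layer dimensions for illustration only; the ratio works out to the familiar 1/C_out + 1/k² reduction.

```python
# Multiply-accumulate counts for a standard vs. a depthwise-separable convolution.
# Assumed layer shape -- illustrative only.
H, W = 112, 112          # output height and width
k = 3                    # spatial kernel size
C_in, C_out = 64, 128    # input and output channels

standard_macs  = H * W * k * k * C_in * C_out
depthwise_macs = H * W * k * k * C_in            # one k x k filter per input channel
pointwise_macs = H * W * C_in * C_out            # 1 x 1 convolution to mix channels

separable_macs = depthwise_macs + pointwise_macs
print(f"standard:  {standard_macs:,} MACs")
print(f"separable: {separable_macs:,} MACs "
      f"({separable_macs / standard_macs:.1%} of standard)")
```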
“[DSC] is more area and power-efficient. It may compromise accuracy, but it may still be good enough,” said Fingeroff.
But here again, hardware plays a role. “From a developer’s standpoint, DSC provides a lean architecture,” Daga said. “But current hardware architectures aren’t optimized for this,” which can result in poor utilization of traditional processing engines.
As these and other techniques prove themselves out, with the winners making their way into hardware and frameworks, production designs will be able to leverage them much more easily.