Bridging Machine Learning’s Divide

Why new approaches are needed to tie together training and inferencing.


There is a growing divide between those researching Machine Learning (ML) in the cloud and those trying to perform inferencing using limited resources and power budgets.

Researchers are using the most cost-effective hardware available to them, which happens to be GPUs filled with floating point arithmetic units. But this is an untenable solution for embedded inferencing, where power consumption matters far more. The semiconductor industry is bridging this divide with more tailored hardware structures and with mapping technology that converts cloud-based learning structures into ones that can be deployed in autonomous vehicles, IoT devices and consumer products.

While the industry is quite successfully using mapping technology, there is a need to bring more inferencing implications into algorithm development to ensure that the gap does not widen. “Deeper networks, with more, fatter layers will take more cycles of computation, more memory bandwidth and more memory capacity to train and to run inference,” says Chris Rowen, CEO for Babblabs. “The space of possible network architecture for any given problem is enormous, so people need to be working to develop smaller, less compute-intensive, lower-data-resolution networks that perform as well, or almost as well, as earlier networks.”

Insights into the end application are critical to understanding how much accuracy is enough for that situation, and how much throughput is required. The goal is to build a network that is just powerful enough to solve the problem at hand. “The tradeoff, from a mapping point of view, is related to accuracy and not throughput,” says Gordon Cooper, product marketing manager for embedded vision processors at Synopsys. “All of the research until 9 or 12 months ago seemed to be focused on improving accuracy, and now it is about how to get the same accuracy with less computation.”


Fig. 1: Intel’s Deep learning SDK workflow. Source: Intel/Codemotion

Mapping
Learning in the cloud is not completely blind to the need for embedded inferencing. “Standards like Caffe, TensorFlow, and Khronos Neural Network Exchange Format go a long way to enable highly optimized implementations of inference for systems trained on a range of systems,” says Rowen. “This is extremely valuable to opening up the path to exploit optimized high-volume inference engines in phones, cars, cameras and other IoT devices. This higher-level robust set of interfaces breaks the tyranny of instruction set compatibility as a standard for exchange and allows for greater levels of re-optimization as the inference execution hardware evolves over time.”

The result of a training exercise is a floating point model. “You can do a lot of optimization on that,” says Raik Brinkmann, president and CEO of OneSpin Solutions. “There are companies looking at doing Deep Neural Network optimization, where you first try to reduce the precision of the network. You go to fixed point, or even binary weights and activation, and while this doesn’t work for learning, it works fine for inferencing. The result of this optimization step is a reduction in the size of the network, maybe a reduction in some edges that have unimportant weights to make it sparser, and a reduction of precision. Then you run it again in simulation against the training data or test data to check if the accuracy is still acceptable.”
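The optimization step Brinkmann describes can be illustrated with a minimal sketch. It assumes symmetric, single-scale quantization and a simple magnitude threshold for pruning. The function names, the example weights and the threshold are all hypothetical and are not taken from any vendor's tool. The sketch only measures the weight reconstruction error. A real flow would rerun the test data through the quantized network to check accuracy.

```python
def prune(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold (makes the net sparser)."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical floating point weights from a trained layer.
weights = [0.82, -0.003, 0.41, 0.0007, -0.65, 0.12]

pruned = prune(weights, threshold=0.01)
codes, scale = quantize_symmetric(pruned, bits=8)
restored = dequantize(codes, scale)

# Worst-case error introduced by quantization; it cannot exceed half a step.
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print(codes, max_error)
```

With round-to-nearest, the error per weight is bounded by half of one quantization step, which is why the accuracy check afterward so often passes at 8 bits.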

Many coefficients within a network end up being zero or very close to zero. “Neural network designers are looking closely at how best to exploit the frequently sparse sets of coefficients in many neural networks,” adds Rowen. “Network developers and architects are working very hard to exploit this trend, either with computing architectures that operate particularly well on these irregular sparse sets, effectively skipping over multiplies-by-zero entirely, or with those that leverage smarter training to reduce the sparsity and the total compute by squeezing down to smaller sets, or sets with more structured sparsity, where the zeroes occur primarily in rows, columns and planes of the data set.”

A lot can be done with zeros. “With a zero weight, it means you can remove edges from the network altogether because nothing can ever propagate, so there are less inputs on the MAC nodes,” adds Brinkmann. “Then there is zero activation, which is not statically determined and depends on the data. Sometimes nodes will not activate and you can try to dynamically optimize that when running on an architecture.”
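The two kinds of zero Brinkmann distinguishes can be shown in a small sketch of a single MAC node. Zero weights are removed once, before any data arrives. Zero activations are skipped at run time because they depend on the input. The function name and the numbers are hypothetical, chosen only for illustration.

```python
def sparse_dot(weights, activations):
    """Dot product that performs only the multiply-accumulates that can contribute."""
    # Static step: edges with a zero weight are dropped up front.
    live_edges = [(i, w) for i, w in enumerate(weights) if w != 0]

    total = 0
    macs = 0
    for i, w in live_edges:
        a = activations[i]
        if a == 0:
            # Dynamic step: this input did not activate, so skip the multiply.
            continue
        total += w * a
        macs += 1
    return total, macs

weights = [3, 0, -2, 0, 5, 1]
activations = [4, 7, 0, 9, 2, 0]

total, macs = sparse_dot(weights, activations)
dense = sum(w * a for w, a in zip(weights, activations))
print(total, macs, dense)   # same result as the dense sum, with fewer multiplies
```

Here the dense calculation needs six multiplies, while the sparse version needs two and produces the same sum.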

One problem in the flow is verification. “Quantization is difficult from a verification point of view,” continues Brinkmann. “There are new techniques called quantization-aware training, where you try to factor that into the training phase. This way you can provide some numerical answer about how susceptible the system will be to this type of problem. It is a statistical way of working. ML is statistical, and for verification we may have to use some statistical methods. Formal verification may play a role, and there are things that could be done with static analysis in combination with statistics. It is not deterministic.”

Custom, FPGA or processor
The range of possible implementation architectures is the same as for any hardware and extends from general purpose processors to full custom hardware. There are tradeoffs associated with each, and the end product requirements may define which of these are acceptable.

When targeting a processor, tools can do most of the work. “The output from the training environment can be transformed from the 32-bit floating point coefficients and mapped into our hardware using 12- or 8-bit resolution,” says Synopsys’ Cooper. “In theory that should be it, push-button. In reality, it is a constant software maintenance issue because it is a moving target. We support every graph in Caffe, until someone comes up with a new graph. Increasingly, a customer may have a custom layer and then some mapping has to be done manually. It is virtually impossible for it to be completely push-button, but the goal is to limit the amount of touch required.”

Libraries also help to produce optimal code. “If your network contains a 3X3 convolution, a 1X1 convolution or a normalization process, we have developed a software library for most of the common functions,” adds Pulin Desai, product marketing director for Tensilica Vision DSP product line in Cadence. “The compiler will call those functions and we generate a workspace in C code that is highly optimized for the processor.”

Sometimes you need a more efficient solution. “It is image processing on steroids and that makes it a great fit for High Level Synthesis (HLS),” says Ellie Burns, director of marketing for the Calypto Systems Division of Mentor, a Siemens Business. “Inferencing needs to be low power, low latency, very high performance and throughput. That is hardware. Because HLS is in C you can keep all of the models, you can retest, you can make sure that the floating point and integer conversion is good. It is a perfect fit and the development environment stays the same. The memory requirements are huge, and that creates some interesting challenges.”

There are some non-trivial tradeoffs here, as well. “If you take a given network and start playing with the topology, the hardware has to change unless you have a general-purpose accelerator,” explains Mike Fingeroff, high-level synthesis technologist for Mentor’s Calypto Systems Division. “One approach is for an FPGA to be used as a data acceleration engine, and this can be reused for any network layer topology to a certain maximum. You could experiment in TensorFlow or Caffe and change the topology, remove layers, or prune coefficients. The problem is that it is not a high performance system – especially for automotive applications, where you have high frame rates and lots of image data to process. If you settle on the network and architecture, then you can retrain. And it just means that your weights have changed and you can load those back into memory. But you cannot make a drastic change to a network and then apply it to a fixed architecture. With FPGA you could re-architect to match the new implementation.”

So are FPGAs the natural choice? “You can get a huge speedup using FPGAs, but the key problem with matrix multiply has always been that it is memory access time constrained,” points out Randy Allen, director of advanced research for the Embedded Systems Division of Mentor. “You are running at memory speed rather than CPU speed. In addition, there are lots of ways that you can map matrix multiplies onto an FPGA, and figuring out the right one is not something that a lot of people know.”
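Allen's point about running at memory speed can be made concrete with a roofline-style estimate. The sketch below takes the simplest case, a matrix-vector multiply in which every 8-bit weight is fetched once and used once. The peak compute rate and memory bandwidth are hypothetical round numbers, not figures for any real device.

```python
def attainable_ops_per_second(peak_ops, bandwidth_bytes, ops_per_byte):
    """Roofline-style bound: throughput is capped by peak math or by memory traffic."""
    return min(peak_ops, bandwidth_bytes * ops_per_byte)

n = 1024                          # hypothetical square weight matrix, n x n
ops = 2 * n * n                   # one multiply and one add per weight
bytes_moved = n * n + n + n       # 1-byte weights, plus input and output vectors

intensity = ops / bytes_moved     # operations available per byte fetched

peak_ops = 2e12                   # assumed peak: 2 tera-ops per second
bandwidth = 10e9                  # assumed external memory bandwidth: 10 GB/s

attainable = attainable_ops_per_second(peak_ops, bandwidth, intensity)
utilization = attainable / peak_ops
print(intensity, attainable, utilization)
```

Under these assumptions the multipliers sit idle about 99% of the time. How the multiply is mapped, and how much data is reused from on-chip memory, determines how far that figure can be improved.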

Architectural tradeoffs
One of the big architectural decisions is the data width. “For inferencing, 8-bit is good enough most of the time, and in some cases 16-bits,” says Desai. “We have built an architecture that is optimized for convolution. Other types of networks can run on it, but this is the focus. This requires large numbers of MACs in a single cycle. The block can do 1024 8-bit MACs in a single cycle. Then there are other blocks, such as DMA, to bring in data from external memory to help with traffic management.”
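To give a sense of scale for 1,024 MACs per cycle, the sketch below counts the multiply-accumulates in one convolution layer and divides by that rate. The layer dimensions are hypothetical, and the result is an ideal lower bound that assumes every MAC unit is busy on every cycle and that memory never stalls the block.

```python
def conv_macs(out_h, out_w, in_ch, out_ch, kernel):
    """Multiply-accumulates needed for one convolution layer."""
    return out_h * out_w * out_ch * kernel * kernel * in_ch

def ideal_cycles(macs, macs_per_cycle=1024):
    """Best-case cycle count, rounded up, assuming full utilization."""
    return (macs + macs_per_cycle - 1) // macs_per_cycle

# Hypothetical layer: 56x56 output, 64 input channels, 64 output channels, 3x3 kernel.
macs = conv_macs(out_h=56, out_w=56, in_ch=64, out_ch=64, kernel=3)
cycles = ideal_cycles(macs)
print(macs, cycles)   # 115605504 MACs, 112896 cycles
```

Real utilization will be lower, which is where the DMA and traffic management blocks Desai mentions come in.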

That’s one approach. There are others. “Our CNNs use 12-bit multipliers,” says Synopsys’ Cooper. “That is because a 12-bit multiplier is half the size of a 16-bit multiplier as a hardware block. We believe that we can get any 32-bit FP to work in 12-bit fixed point and achieve the same level of accuracy. There also have been new quantization techniques developed, and Google is actively promoting 8-bit. We can support that, and it saves bandwidth and power because you don’t have to go to the external DDR as much. 8-bit fixed point appears to be where things are right now, but there are papers talking about going to 4, 2 or even 1 bit perhaps in the future.”

The EDA industry has plenty of experience in this area, having dealt with very similar problems when mapping DSP systems. “The high-level synthesis flow often starts with a floating point algorithm,” points out Mentor’s Fingeroff. “Then you quantize that algorithm and you look at things like signal to noise ratio (SNR). It is the same thing when you go from a trained FP model and you quantize the coefficients. When you run the images through the inferencing engine you are looking at the percentage match rate. Research has shown that 8 bits and below provides very reliable results.”
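The DSP-style check Fingeroff describes can be sketched as follows: quantize a floating point signal at several bit widths and compare the signal to quantization noise ratio. The test signal, the quantizer and the function names are assumptions made for illustration. A neural network flow would measure the match rate on test images rather than the SNR of a sine wave.

```python
import math

def quantize(samples, bits):
    """Round each sample to the nearest level of a symmetric fixed-point grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(s) for s in samples) / qmax
    return [round(s / scale) * scale for s in samples]

def snr_db(reference, approx):
    """Ratio of signal power to quantization noise power, in decibels."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return 10 * math.log10(signal / noise)

# Hypothetical floating point test signal.
samples = [math.sin(0.1234 * k) for k in range(1000)]

results = {bits: snr_db(samples, quantize(samples, bits)) for bits in (4, 8, 12)}
for bits, snr in results.items():
    print(bits, "bits:", round(snr, 1), "dB")
```

The textbook expectation for uniform quantization is roughly 6 dB of SNR per bit, so each step down in width has a measurable cost that has to be weighed against the accuracy actually needed.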

Ubiquitous training
There are three ways to do machine learning. “Classification and regression are the two used today,” points out Brinkmann. “One that will have a big impact on architecture will be the third technique – reinforcement learning. With classification and regression you can make the split between the learning and inferencing phases. You can do training in the cloud and inference on the edge. With reinforcement learning you want to apply it in the field and you will need to learn on the fly. That means you may need to do some training on the edge.”

Today, in an automobile, the image recognition system either gets the answer right or wrong. “The CNN just tells you where an object is but anything beyond that is not built into the CNN,” explains Cooper. “That is the role of the system that uses the information. How would it know that it did not recognize something correctly unless it had a redundant system? The deployed engine is only as good as the training. The vision block will always provide a yes or no, so maybe there has to be something at the higher level that tracks errors.”

Networks that are self-modifying create a new set of problems. “The data defines your function, and that is different from the ways in which engineering has been done for critical applications,” adds Brinkmann. “If the data is not representative or does not contain the right cases, then your network will fail in the field. There may be issues with data spoofing. Think about autonomous driving. They do training in the cloud today and the results are deployed. The car gathers data, and that may be transferred back to the cloud to retrain the network and then reiterate. Verification becomes continuous. It is no longer a one-off activity when you design your system. You have to take care of data integrity to ensure that no one is messing with the data, or biasing it.”

This problem would be made worse if retraining were performed on the fly, but that appears to be the direction that will ultimately be required. “Machine learning has reached the tipping point where technological advances have made it possible for the machine to become smarter,” says Deepak Boppana, Senior Director of Marketing, Lattice Semiconductor. “We have only sampled what the machines of the future can do. Going forward, we see continued focus on low-power inferencing at the edge to enable smarter systems that can perform machine learning at ultra-low latency. We also see increasing use of unsupervised learning techniques that do not require labeled training data.”

If learning is distributed, who owns the data? “Security, encryption and protection of the network models is an interesting question and something we may see in the near future,” adds Francisco Socal, product manager for Vision & AI at Imagination Technologies. “Network models are emerging as intellectual property and one can expect to see developers being sensitive about deploying them to multiple edge devices without strict protection mechanisms in place.”

Forecasting where the industry will head is an impossible task. “We can anticipate this field will evolve significantly,” says Rowen. “There is nothing theoretically optimal about today’s popular neural network structure. They just work better than known alternatives for so many different tasks. Therefore, we can reasonably expect continuous invention of new algorithmic principles, new network structures, new training methods, and new ways to adapt networks into the heart of applications.”

As for the divide? “There is definitely an algorithm/implementer divide,” concludes Mentor’s Allen. “Given that you have to go to a GPU-type of architecture, people are just getting them out and not focusing on optimization. It really is a software problem, and people have not been focused on optimization so far.”
