Is the industry heading toward another hardware/software divide in machine learning? Both sides have different objectives.
Machine learning is one of the hottest areas of development, but most of the attention so far has focused on the cloud, algorithms and GPUs. For the semiconductor industry, the real opportunity lies in optimizing and packaging solutions into usable forms, such as for automotive applications or battery-operated consumer and IoT products.
Inefficiencies often arise because of what is readily available, and that is most certainly the case with machine learning. For example, GPUs have been shown to be the highest-performance solution for training. Because these devices are based on floating point, machine learning algorithms have been developed that rely on floating point.
Inferencing in edge devices cannot afford floating point, so the trained coefficients must be transformed into fixed point. But could training itself be done in fixed point? Quite possibly, although special-purpose hardware is only starting to be considered. It's not clear if algorithm development and tools are too entrenched to move, forcing the industry to continue along an inefficient path similar to general-purpose software.
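To make that float-to-fixed transformation concrete, here is a minimal sketch of post-training quantization, assuming a single symmetric 8-bit scale per weight tensor. The function names and the random stand-in weights are illustrative only; production flows add per-channel scales, calibration data and often retraining.

```python
# Minimal sketch of post-training quantization: mapping floating-point
# weights onto 8-bit fixed-point values with one symmetric scale.
# Real toolchains are considerably more involved.
import numpy as np

def quantize_weights(weights, num_bits=8):
    """Map float weights onto signed fixed-point integers with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.max(np.abs(weights)) / qmax  # one scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values to check the quantization error."""
    return q.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)   # stand-in for trained weights
qw, s = quantize_weights(w)
err = np.max(np.abs(w - dequantize(qw, s)))
print(f"scale={s:.5f}, max quantization error={err:.5f}")
```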
The big question is whether the industry is heading toward another hardware/software divide. It may manifest itself as a cloud/embedded divide this time, unless something can be done to bring the two sides together.
Chris Rowen, CEO of Babblabs, puts this into perspective. “At any given moment, there may be tens or hundreds of thousands of networks being trained, but we will soon live in a world with tens of billions of networks doing inference.”
Where we are today
Most algorithm development is focused on the cloud, and much of it assumes that inferencing also will happen there. The computational power needed for inferencing is a fraction of what is required for training, and thus it gets little attention.
Fig. 1: The training/inferencing divide. Source: NVIDIA
Is the divide real? “You are dealing with a community that suffers from the divide between the algorithm designer and the implementer,” says Randy Allen, director of advanced research for the Embedded Systems Division of Mentor, a Siemens Business. “Those doing the design use Matlab and Python and have no clue about the implementation underneath. They do not understand fixed point and do not want to go there. They just want to get it running and not worry about optimizing it.”
Others agree. “If you are doing AI or machine learning in the space of Google level, you will use a server bank of GPUs and not worry about it,” says Gordon Cooper, product marketing manager for embedded vision processors at Synopsys. “But if you have a Google phone or something in your car, now you care about power and cost, so area of the die directly impacts that. Now you want all of the performance you can get with the caveat of low power, low area.”
At the embedded level, all available technologies will be utilized. “Machine learning is a broad term that encompasses a number of non-neural network methods (e.g., support vector machines, random forests) as well as deep learning methods, especially neural networks,” says Rowen. “These ‘less-deep’ learning methods are not nearly as compute-intensive as deep neural networks, so the theoretical efficiency advantages of GPUs and FPGAs may be fairly inconsequential.”
Overall, real applications—even ones that heavily leverage machine learning methods—consist of many other parts. Rowen’s list includes image preprocessing, image scaling, MPEG decoding, image and video clipping, region of interest selection, motion detection, background deletion, histograms, transcoding, segmentation, database access, data formatting, smoothing, noise removal, data augmentation, face detection, FFT, frame-to-frame tracking, I/O processing, and sundry stitching together of complete processing pipelines.
Hardware solutions leverage history. “In the embedded computer vision space there has been years of work done using traditional computer vision, where you write a program to determine if this is a pedestrian or not,” says Cooper. “Histogram of gradients is a program that looks for edges around people and then tries to determine the pattern and what it corresponds to. That is where vector DSPs were doing pixel processing using large SIMD multiplications across a row of pixels.”
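To see why that workload suits wide SIMD datapaths, the sketch below computes a single orientation histogram over an image patch, which is the per-pixel gradient arithmetic at the heart of a histogram-of-gradients detector. It is a simplified NumPy illustration under assumed conditions; a real HOG pipeline adds cell binning, block normalization and a trained classifier, and the 64x128 patch is just a stand-in for a pedestrian detection window.

```python
# Rough sketch of the per-pixel work inside a histogram-of-gradients
# detector. Real HOG pipelines add cell binning, block normalization
# and a trained classifier on top of this.
import numpy as np

def gradient_histogram(image, bins=9):
    """Compute one orientation histogram, weighted by gradient magnitude."""
    gx = np.diff(image, axis=1, prepend=image[:, :1])   # horizontal gradient
    gy = np.diff(image, axis=0, prepend=image[:1, :])   # vertical gradient
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    hist, _ = np.histogram(orientation, bins=bins, range=(0, 180),
                           weights=magnitude)
    return hist

patch = np.random.rand(64, 128)   # stand-in for a grayscale detection window
print(gradient_histogram(patch))
```

The row-wise difference, magnitude and orientation operations are exactly the kind of regular, data-parallel pixel arithmetic that a vector DSP can execute across a whole row of pixels at once.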
This means that many of the hardware solutions today are a combination of technologies, deploying vector processing and neural networks tightly coupled into a combined accelerator engine. Also, like many other hardware solutions, there are plenty of tradeoffs to be made. You could do something in an ASIC, and this would provide the lowest power and smallest area. But what if you want to future-proof your product? Now you want programmability. The industry is looking for ideal solutions in a rapidly changing environment.
Metrics
To optimize anything, there have to be metrics against which solutions can be graded. “Currently for machine learning performance, we have no industry benchmarks to measure against,” points out Francisco Socal, product manager for vision and AI at Imagination Technologies. “With graphics, for example, we have the Manhattan and the T-Rex score. There are some neural network models emerging, but they are by no means representative. I would expect to see industry agreed benchmarks within the next year.”
In the hardware space, power, performance and area traditionally have been the key metrics. “Latency, throughput and power are very relevant to machine learning,” says Rowen. “Throughput often can be improved by simply applying more parallel hardware, but latency is a bit trickier. Parallel hardware helps a lot in some circumstances, but there may be fundamental constraints on latency due, for example, to processing windows.”
Socal takes this a little further. “At a very high-level, throughput (performance) and power are the correct metrics to measure machine learning performance against, just like any other hardware design, because they are what really matters. Low latency isn’t really a metric, but rather a requirement.”
Latency depends on the application domain. Some real-time applications can tolerate long latency, on the order of seconds, so long as the throughput is adequate. Other applications, especially those with direct human interactivity or in safety-critical systems like automotive and robotics, may need latency measured in milliseconds or tens of milliseconds.
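A back-of-envelope calculation shows why parallelism separates the two metrics. The batch sizes and per-batch times below are invented numbers purely for illustration: batching raises inferences per second, but every result still waits for its whole batch to finish.

```python
# Back-of-envelope illustration (invented numbers) of why parallel hardware
# helps throughput more than latency.
def throughput_and_latency(batch_size, per_batch_ms):
    throughput = batch_size / (per_batch_ms / 1000.0)  # inferences per second
    latency = per_batch_ms                             # time to first result, ms
    return throughput, latency

for batch, ms in [(1, 10.0), (8, 16.0), (32, 40.0)]:   # assumed timings
    tput, lat = throughput_and_latency(batch, ms)
    print(f"batch={batch:2d}  throughput={tput:6.1f} inf/s  latency={lat:5.1f} ms")
```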
For machine learning there may be additional metrics that have to be considered. These include accuracy and future proofing.
Power
Power is one of the biggest concerns, given the amount of computational muscle required. “Power is usually strongly correlated to cost, and cost does matter,” says Rowen. “It also matters more as the technology matures and becomes widely deployed. Power is also strongly correlated to mobility, and there’s no doubt that we want to carry processing power to the places we go. State-of-the-art neural network engines (processors, reconfigurable arrays, co-processors) have roughly a factor of 100 energy-per-op advantage over general-purpose CPUs due to specialization around the computation pattern and data types needed for neural network inference.”
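As a purely illustrative piece of arithmetic (the per-op energies and the workload below are assumptions, not figures from Rowen), a factor-of-100 energy-per-op gap translates directly into the power budget:

```python
# Illustrative arithmetic only: how a ~100x energy-per-op gap shows up
# as power. Per-op energies and workload are assumed, not measured.
def power_watts(ops_per_second, joules_per_op):
    return ops_per_second * joules_per_op

workload = 2e12                       # assumed 2 TOPS of sustained inference work
cpu_energy = 1e-9                     # assumed ~1 nJ per op on a general-purpose CPU
nn_engine_energy = cpu_energy / 100   # the ~100x specialization advantage cited

print(f"CPU:       {power_watts(workload, cpu_energy):7.1f} W")
print(f"NN engine: {power_watts(workload, nn_engine_energy):7.1f} W")
```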
But more is required. “Many self-driving cars have the equivalent of 100 laptops running in the trunk full-time,” says Ellie Burns, director of marketing for the Calypto Systems Division of Mentor. “That is how much power it is taking. You can’t have a Chevy Volt with 100 laptops. Power has to come down a lot. The industry is struggling with this at the moment, and today the GPU is the only thing that can keep up most of the time. This is why many people are looking at generating custom hardware using high-level synthesis.”
Lower power consumption also could enable new applications. “There are 1 million cameras in Beijing, and each produces 2.5Mb of data per second,” says Raik Brinkmann, president and CEO of OneSpin Solutions. “Imagine the power and the data transfers that are required if processing were done in the cloud. They need smaller devices, cheaper devices, things that do not consume as much power. That is where new architectures come in. The standard architectures have a bottleneck on power because of data movement. They are trying to come up with architectures that change this or optimize the network so that it is less hungry.”
Power is one of the biggest tradeoffs for machine learning, and bandwidth requirements create limitations.
“Neural networks are very bandwidth-hungry due to vast amounts of data needing to be loaded into the accelerator to run just a single network inference,” adds Socal. “The memory bandwidth requirements also increase with the increase in size of neural network models, in terms of coefficients (weights), input and output data size, driven by the use of end-to-end and all-in-one network models as seen today. This introduces significant challenges for SoC designers and OEMs. Higher external memory bandwidth requires faster memory modules, which are significantly more expensive and power hungry. Many solutions will be limited in performance, not by the compute power of the inference engine, but by the ability of the system to provide the required bandwidth to the neural network accelerator.”
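A rough, hypothetical estimate of that bandwidth pressure: assume every weight and activation is fetched from external memory once per inference. The network size, data width and frame rate below are illustrative only, not taken from any of the quoted sources.

```python
# Hedged back-of-envelope estimate of DRAM traffic for one inference,
# assuming every weight and activation crosses the external memory
# interface exactly once. All numbers are illustrative.
def bandwidth_gb_per_s(num_weights, activation_elems, bytes_per_elem, fps):
    bytes_per_inference = (num_weights + activation_elems) * bytes_per_elem
    return bytes_per_inference * fps / 1e9

# ~60M weights, ~10M activation elements, 8-bit data, 30 frames per second
print(f"{bandwidth_gb_per_s(60e6, 10e6, 1, 30):.2f} GB/s")
```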
Lucio Lanza, managing director for Lanza TechVentures, adds that “the main metric to optimize is the ability to keep the processing unit fed, i.e. occupancy. This involves both latency and throughput. With Moore’s Law breaking down, we should expect to see more single-task designed chips and generally a move to single instruction, multiple data (SIMD) and away from multiple instruction, multiple data (MIMD).”
Power also has a habit of affecting many aspects of design. “Due to the limitations of battery power and low-cost cooling systems, energy consumption is strictly limited,” says Deepak Boppana, senior director of marketing for Lattice Semiconductor. “The low cost and small size requirement makes it hard to use packages with large numbers of pins, and that limits the bandwidth to external DRAM. Even with these limitations, most applications require real-time operations.”
Accuracy
What does accuracy mean with a statistical process? “For vision there is no clear algorithm,” says Mentor’s Allen. “You can look at an optical illusion and it is clear that it allows something to be interpreted in different ways. Humans can look at it and see something different. There is no 100% right answer. It is more inductive than deductive. If you are getting 97% accuracy on the training data, you are doing really good.”
Like other metrics, absolutes may not apply. “Automotive and consumer have very different end goals when it comes to accuracy,” points out Pulin Desai, product marketing director for the Tensilica Vision DSP product line at Cadence. “We all want to make sure it is a safe device, so high accuracy may be demanded for automobiles. But in consumer devices, you may want a low-power wakeup mode and a usage mode. For always-on, you want to be saving maximum energy, but when in operation you can use more energy for better results.”
Future-proofing
Machine learning is rapidly evolving. “Things are moving at a fast pace and we have to consider ease of use and futureproofing,” says Cooper. “If you make something great now, you may be behind the target, so we have to make sure that whatever we do has legs for the future. There is a pipeline between what we do and when a device gets released in a product, so we are early in the process. And anything we do has to last for some time from a software point of view.”
Desai shares a similar view. “Devices being created today will go into production in 2019. For automotive, that may be 2021. So it depends on the market segment. Programmability helps and provides more flexibility. You can develop a hardware accelerator, and it may provide the best solution, but it cannot evolve with the technology. A dedicated accelerator core will provide a better performance/power option than a CPU/GPU combination.”
Some applications cannot afford that luxury. “You build hardware when you need more performance or lower power,” says Burns. “Over time we will see combinations of these solutions.”
Closing the divide
It is the responsibility of the hardware community to help close the divide. Software has taken the lead, and that industry can justify its investments without any consideration of processing at the edge. The hardware industry may not be used to being in this position, but if it wants to see changes made, it will have to start making the investment.
Related Stories
EDA Challenges Machine Learning
Many tasks in EDA could be perfect targets for machine learning, except for the lack of training data. What might change to fix that?
Using Machine Learning In EDA
This approach can make designs better and less expensive, but it will require a huge amount of work and more sharing of data.
Machine Learning Meets IC Design
There are multiple layers in which machine learning can help with the creation of semiconductors, but getting there is not as simple as for other application areas.
CCIX Enables Machine Learning
The mundane aspects of a system can make or break a solution, and interfaces often define what is possible.
Machine Learning Popularity Grows
After two decades of experimentation, the semiconductor industry is scrambling to embrace this approach.
Personally, I’m seeing more of a convergence of things I’m familiar with – a lot of neural network processing is a similar code pattern to circuit simulation (behavioral analog, fast-SPICE).
A language like Verilog-AMS can easily be used to describe neural networks. You can even use “flow” for reinforcement feedback.
There’s a cognitive dissonance for the guys above because they are all digital, probably hate analog, and don’t see the connection. Can’t see them crossing any divides myself.
I had to chuckle – yes, the analog/digital divide has existed ever since the transistor was invented. I would love to see some work on analog machine learning. I am sure it would be a tiny fraction of the power.
You are in luck, these guys are doing analog multipliers wrapped in RISC-V –
https://www.mythic-ai.com/technology/
CNN/DNN research is evolving so quickly that future-proofing the hardware is fundamental. A fully hardcoded NN will quickly become obsolete and is not the way to go.
Programmable HW accelerators provide a lot more flexibility, but even they can’t anticipate the structure and operations of future networks. This is why HW accelerators need to be extended through SW to handle operations not supported in HW.
Lastly, and perhaps most importantly, I am surprised that FPGAs are not mentioned in this discussion. They are gaining momentum in the data center (Microsoft, Amazon, Huawei…) and they are present in many embedded/IoT products. They provide the ultimate flexibility, including the ability to reprogram the NN accelerator to take advantage of the latest advances in NN research. Inference precision has quickly moved from 32bits to 16bits and now 8bits. 4bits and below are in sight. CPUs/GPUs/HW Accelerators have fixed datapath and cannot easily take advantage of this evolution, unlike FPGAs.
*Note: all the above applies to inferencing, not training.
You are right about FPGAs. For inferencing, eFPGAs are more likely to be the direction people want to go and I will be talking about that in an article about mapping in February. I do expect to see eFPGA chips being released in quantity this year.