Power Is Limiting Machine Learning Deployments

Rollouts are constrained by the amount of power consumed, and that may get worse before it gets better.


The total amount of power consumed for machine learning tasks is staggering. Until a few years ago we did not have computers powerful enough to run many of the algorithms, but the repurposing of the GPU gave the industry the horsepower that it needed.

The problem is that the GPU is not well suited to the task, and most of the power consumed is waste. While machine learning has provided many benefits, much bigger gains will come from pushing machine learning to the edge. To get there, power must be addressed.

“You read about how datacenters may consume 5% of the energy today,” says Ron Lowman, product marketing manager for Artificial Intelligence at Synopsys. “This may move to over 20% or even as high as 40%. There is a dramatic reason to reduce chipset power consumption for the datacenter or to move it to the edge.”

Learning is compute-intensive. “There are two parts to learning,” says Mike Fingeroff, high-level synthesis technologist at Mentor, a Siemens Business. “First, the training includes running the feed-forward (inference engine) part of the network. Then, the back-propagation of the error to adjust the weights uses gradient descent algorithms that require massive amounts of matrix manipulations.”
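The two phases Fingeroff describes can be sketched in a few lines of NumPy. This toy example, a single neuron learning an AND gate with made-up hyperparameters, runs the feed-forward pass and then back-propagates the error with gradient descent; real networks do the same matrix math at vastly larger scale:

```python
import numpy as np

# Toy dataset: the AND gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # weights
b = 0.0                  # bias
lr = 0.5                 # illustrative learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    # Feed-forward pass: the same math an inference engine runs.
    p = sigmoid(X @ w + b)
    # Back-propagation: gradient of the cross-entropy loss w.r.t. the weights.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Gradient-descent update.
    w -= lr * grad_w
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # expected: [0 0 0 1]
```

Note that every training step contains a full forward pass plus the gradient computation, which is why training costs a multiple of inference.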

Datacenters represent a new and evolving business model for companies. Today, they are packed with CPUs and GPUs, and some have a few FPGAs available. But power and footprint are starting to make them reconsider. With much of the compute power going into machine learning, custom silicon is starting to become more important.

“The ROI is in terms of two axes,” says Patrick Soheili, vice president of business and corporate development for eSilicon. “One is the quality of service that enables them to sell more and do more and charge more, depending upon if you are selling advertising or actually getting paid for services. The second is power consumption. These things take a lot more room because they are not efficiently designed for that application. You may need to build an extra datacenter once you have scaled enough just because you used CPUs or GPUs as opposed to dedicated ASICs. We did a back-of-the-envelope calculation and we believe that by optimizing memories, $50 million to $100 million per year could be saved in power consumption alone.”

The good news is that there are many startups entering the market. “When I look at what AI companies are doing today, specifically the startups, their biggest concern is just getting some silicon into the world that works,” says Johannes Stahl, senior director of product marketing at Synopsys. “So, primarily, they are trying to get the architecture correct and the software stack running and running some algorithms that achieve what they are supposed to do. The next step is to work on power. It is a key concern because these designs have such massive amounts of parallelism that they eat a lot of power.”

The chip certainly has to perform well enough to compete in this performance-driven market. “If it does not perform well, the first adopters will abandon the technology and will not continue to use it,” says Jim Steele, vice president of technology strategy for Knowles. “Once you achieve that, features and low power go hand in hand. Low power, in the end, will become the most important thing, but it takes a back seat today while the technology is being developed. This will be true for all devices, even for those that get plugged in.”

On the edge, power is even more critical. “The amount of power consumed by inferencing limits its potential application,” says Dave Pursley, product management director at Cadence. “For these applications, when we talk PPA, it is all about power. You can always look for tradeoffs with performance, but it is power that is the most limiting factor. Even accuracy is traded off in order to meet the power envelope.”

Taming the algorithms
Like any new technology, the rate of change in machine learning algorithms is high.

“The algorithms are improving,” says Synopsys’ Lowman. “As they improve, they need more processing power. There is a constant need to improve the amount of processing. A good example is the research that Google has done, and they already said that CNNs have some flaws and there are ways, using more advanced algorithms, to fix that. But the necessary compute power isn’t there yet. There is a constant need to improve the math, but we have to improve the compute at the same time. How do we do that while still providing lower-power, lower-cost solutions? This is why there is a renaissance in investment and development around AI. This is a unique time in history, and it is really exciting.”

A lot of research also is going into the data sets for training. “Another way to look at it is the huge amounts of data that are needed,” says Knowles’ Steele. “What is causing the huge amount of energy is that each training set has a huge amount of data. I am a big proponent of smart data. It is not only big data but smart data. Is there a way that you can fill the state space and cover the areas you need with less data or more targeted data?”

We know that it is not possible to move cloud levels of compute onto the edge. “If you model in TensorFlow or Caffe using full floating point, and if you were to implement a very deep neural network (NN) and just implemented it like that in hardware, it would probably burn up,” says Cadence’s Pursley. “The trick is to figure out what is an acceptable accuracy for the given application that meets the required power envelope. It is all about the architecture. If you were to look directly at the NN and the way it was implemented, you would have all sorts of specialized hardware all running in parallel. It also would be running at some high sample rate, but part of the tradeoff is adjusting the sample rate.”

Edge computing needs different architectures. “People talk about edge computing, which is starting to become a reality,” says Lowman. “But because the end node is so difficult to design for, they are taking datacenter chipsets and trying to reduce the power consumption of them to meet that need. For example, doing inference in a security camera is not easy because you have to compress the graph, and you have to optimize the compute for that graph. That gets to be a convoluted process, and it requires a lot of design coordination from software to hardware to tools.”

The edge does have some advantages. “On the edge you are not trying to solve all of the algorithms and everything that a graph can do,” says Synopsys’ Stahl. “Instead, you pick a certain subset that you are optimizing for. You still need to do exploration to find what optimizations are possible given the limited number of graphs that you could be asked to process, and you can prune the graph to make it more efficient. On the cloud you need to be completely generic.”

Floating point arithmetic is highly wasteful, both in the cloud and on the edge. “Studies have shown that discrete inference works well,” says Steele. “This is what is allowing us to take inference to the edge by using 8-bit, 4-bit and in some cases even 1-bit for certain features. What we are finding is that 8-bit is the sweet spot. Floating point is overkill and 1-bit is a nice ideal, but it’s hard to get there.”
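The 8-bit "sweet spot" Steele mentions can be illustrated with a minimal symmetric-quantization sketch in NumPy. The per-tensor scaling scheme here is one common choice, not a description of any particular vendor's hardware:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of float32 values to signed 8-bit."""
    scale = np.max(np.abs(x)) / 127.0        # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=1000).astype(np.float32)  # fake weight tensor

q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# 8 bits instead of 32 is a 4x memory (and bandwidth) saving; the rounding
# error is bounded by half a quantization step.
max_err = np.max(np.abs(w - w_hat))
print(max_err <= s / 2 + 1e-7)  # True
```

The same idea extends to 4-bit and 1-bit, but with coarser steps the error bound grows, which is why accuracy starts to suffer below 8 bits.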

Vertical integration
Power often is traded off against programmability. “This is in large part due to the layer-by-layer behavior of the network,” says Mentor’s Fingeroff. “Specifically, in most convolutional neural networks, the weight storage requirements dramatically increase for later layers, while the feature map storage requirements are largest for the early layers and decrease substantially for the later layers. Additionally, the required precision for accurately implementing the network tends to decrease for the later layers.”

Figure 1. Requirements at each layer of a network. Source: Mentor, a Siemens Business
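The trend Fingeroff describes can be reproduced with a rough, hypothetical VGG-style layer list. The channel counts and feature-map sizes below are illustrative, not taken from any specific network:

```python
# Hypothetical VGG-style stack: (in_channels, out_channels, out_H, out_W).
layers = [
    (3,   64,  224, 224),
    (64,  128, 112, 112),
    (128, 256, 56,  56),
    (256, 512, 28,  28),
    (512, 512, 14,  14),
]

K = 3  # 3x3 convolution kernels
for i, (cin, cout, h, w) in enumerate(layers, 1):
    weights  = cin * cout * K * K   # parameter count for this layer
    features = cout * h * w         # output feature-map elements
    print(f"layer {i}: weights={weights:>10,}  feature map={features:>10,}")
```

Running this shows weight storage growing from ~1.7K to ~2.4M parameters while the feature map shrinks from ~3.2M to ~100K elements, which is exactly why one fixed memory partition cannot be efficient for every layer.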

These competing storage and precision requirements for CNNs make a ‘one-size-fits-all’ hardware implementation inefficient for power. “General-purpose solutions can provide relatively high performance and small area, but do so by ’tiling’ the algorithms and shuffling feature map data back and forth to system memory, which drastically increases power consumption,” adds Fingeroff. “These general-purpose solutions also sacrifice full utilization of on-chip computational resources for programmability.”
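A first-order sketch of why the tiling Fingeroff describes costs power: when a feature map does not fit in on-chip SRAM, intermediate tiles spill to DRAM and must be read back for the next layer, and off-chip accesses cost far more energy than on-chip compute. The buffer size and traffic model below are illustrative assumptions, not measurements:

```python
# One int8 feature map from an early CNN layer (channels x H x W bytes).
fmap_bytes = 64 * 112 * 112

# Hypothetical accelerator with a 256 KB on-chip buffer.
on_chip_buffer = 256 * 1024

# If the feature map fits on chip, it never leaves the die between layers.
# If not, each tile is written out to DRAM and read back: 2x the traffic.
tiles = -(-fmap_bytes // on_chip_buffer)   # ceiling division
dram_traffic = fmap_bytes * (1 if tiles == 1 else 2)

print(f"{tiles} tiles, ~{dram_traffic / 1e6:.1f} MB of DRAM traffic per layer")
```

Even this crude model shows the feature map splitting into 4 tiles and roughly doubling the off-chip traffic; a layer-matched memory hierarchy avoids that round trip entirely.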

This creates a dichotomy for edge providers. They can either go broad to expand the potential market but sacrifice efficiency, or focus on a more specific task with a higher-performance solution but address a narrower market.

“There are two sides to every coin,” Steele says. “One side is it ends up being able to do a lot more use cases, and that is good. But the other side is that you have to be able to cater to each of them with your framework. It is not a one size fits all. For example, we made sure that our DSPs were TensorFlow-compliant, and that casts a large enough umbrella to take into account all of the use cases that machine learning people do. So if you are looking to open your hardware platform to different use cases, you will get things that you never dreamed of — some really interesting use cases — but you need to make sure that you cater to a general framework that ML people are used to.”

The task requires IP and tools. “Our mapping tools are of extreme value to customers and to their customers when building these optimized systems,” says Lowman. “People know how to do that, but it does require coordination on the software side and the hardware side. When a chip company makes a chip, they send it to a software team, and they send it to their end customers that put more software on top of it – there are many teams and it makes the optimization of those solutions a little more difficult. This is why you see vertically integrated companies that may have an advantage. They can coordinate the teams a little more easily.”

“If you want to create a $200 million ASIC, you should be a Super 7 and not a startup,” says eSilicon’s Soheili. “They have the data scientists, they have the big data, they have the algorithms, they have the architects, they have the business logic, and at the end of the day, they are the only ones that can come up with the most efficient solution for the problem that they have. The startups are on the outside and they just keep guessing. Often, there are one or two degrees of separation between what they come up with and what is required. The reason for it is the tightly coupled algorithm to the architecture, which saves them power and gives them the most tightly coupled solution.”

Flexibility adds some conflicting requirements. “Within automotive, we see a lot of examples of customized but flexible processors because they have to address a number of different applications,” says Pursley. “The length of time in service mandates a certain degree of flexibility. It is likely that you will add applications or upgrade them over time when deployed in the field.”

How significant is the price that you will pay? “There is one place that you have to make this decision—in the marketing requirement document,” says Soheili. “Do we put programmability into the device and provide a little bit of future-proofing in case a new version of the algorithm comes along that is more efficient? For that, I cannot perhaps use 2-bit processing, but instead have to put in 4-bit processing. Maybe I will put in a few floating-point engines, just in case. Or, am I happy with the improvement I will get from this particular chip and make it as optimized as possible and live with the fact that in 6 months I will do a new chip? Then I can incorporate additions, changes and edits that have come along since then. You have to look at the ROI models. Some companies are thinking a lot more about OpEx and the savings they will get, and they have faith in the fact that there are hundreds of thousands of people —data scientists — working and coming up with new models and compression.”

Rethinking learning
It is widely accepted that floating point is overkill for inference, but what about learning? Can training be moved to fixed point, with the huge power savings that would bring?

“When you go to learning, we have found that you need extra precision,” says Steele. “People do use full floating-point precision in order to get good results. We have so much data, and some of it is noise, some of it is not targeted for the exact case, but we need the higher precision today in order that we do not optimize to the wrong results. The way to move that to the edge is going to be finding a way to optimize it so that we could discretize the learning side, and I don’t think we are there yet.”

Some use cases may not require full floating point. “We are embracing flexibility for training at the edge when use-cases demand low duty-cycle training execution mixed with other workloads needing to run on the CPU,” says Rhonda Dirvin, senior director of marketing for the Embedded, IoT and Automotive line of business at Arm. “We also are adding features to enhance both performance and efficiency for training. For example, bfloat16 reduces the amount of energy required for ML training by matching data-type to applications that can use it effectively. It also avoids the need for ‘quantizing’ the trained models if inference is also performed using bfloat16 data-type.”
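The bfloat16 format Dirvin mentions keeps float32's 8 exponent bits but only 7 mantissa bits, so it preserves dynamic range while shrinking the multiplier. It can be emulated by truncating a float32 to its top 16 bits, as in this minimal NumPy sketch:

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by truncating float32 values to their top 16 bits.

    bfloat16 keeps float32's sign bit and 8 exponent bits but only the top
    7 mantissa bits, so the full float32 range survives while precision
    drops to about 2-3 decimal digits.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
print(to_bfloat16(x)[0])  # 3.140625 -- same magnitude, coarser mantissa
```

Because training already tolerates gradient noise, this mantissa loss is usually acceptable, and keeping the float32 exponent avoids the overflow and underflow problems that plague pure fixed-point training.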

For other applications, that may not be necessary. “For certain applications, 12-bit may be enough,” says Lowman. “If you are just trying to read license plates, or facial recognition, or identification, we can do that with 8-bit. It can be done with fairly high accuracy and that may be good enough for some applications. It is a matter of what is good enough. We are not sure what may ultimately be required. It is in its infancy and we are not sure what ‘good enough’ means. Adversarial attacks can be life threatening in this application. For the development of a neural network, it goes back to the training data. How good is that?”

Strides are likely to be made in both algorithms and architectures, but the edge has very different requirements compared to the datacenter. Today, most of the money is in the datacenter and thus it is the focus of the research. Hopefully, the datacenter power problem will become a significant enough problem that it will draw more investment, allowing edge designs to leverage that investment.

But all of this is still early in the development cycle, and it is too early to tell how it will play out.

Related Stories
Machine Learning Drives High-Level Synthesis Boom
As endpoint architectures get more complicated, EDA tool becomes key tool for experimenting with different options.
Machine Learning Inferencing Moves To Mobile Devices
TinyML movement pushes high-performance compute into much smaller devices.
Finding Defects In Chips With Machine Learning
Better algorithms and more data could bolster adoption, particularly at advanced nodes.
Pushing AI Into The Mainstream
Why data scrubbing and social issues could limit the speed of adoption and the usefulness of this technology.
Machine Learning Knowledge Center
More top stories, white papers, videos, blogs and technical papers on ML


Kevin Cameron says:

Time to move off RTL to asynchronous logic.

Mike Frank says:

Time to develop reversible computing!

peter j connell says:

I don't pretend expertise in this exciting matter, so I hope this is not dumb or already covered.

Not all this compute power needs be real time, or on tap at the edge.

Could we see cars intermittently connected to external compute muscle (like the home network while garaged, a server at the dealer's workshop, gas pump, charging station…)?

The car could then train intensively on the vast data logged from recent journeys, coordinated with the cloud, while allowing simpler, reduced compute at the edge.
