Power Models For Machine Learning

Predicting the power or energy required to run an AI/ML algorithm is a complex task that requires accurate power models, none of which exist today.


AI and machine learning are being designed into just about everything, but the chip industry lacks sufficient tools to gauge how much power and energy an algorithm is using when it runs on a particular hardware platform.

The missing information is a serious limiter for energy-sensitive devices. As the old maxim goes, you can’t optimize what you can’t measure. Today, the focus is on functionality and performance, but those are increasingly constrained by power and thermal considerations. Performance in this case not only means how fast you can do an inference or other measurable unit of work, but also the accuracy of the operation. The sophistication of the algorithm and data sets used for training play a part in this equation.

This equation has impacts on every stage in the development and deployment of ML systems, from algorithm development to the mapping of an algorithm into a hardware architecture, and the design and implementation of that into a chip. Today, the coupling between these stages is almost non-existent.

While hardware teams live within the constraints of what silicon can do and the use cases they hope to support, they have little to help them in the greater scheme of things. “AI workloads keep changing,” says Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “When you do power budgeting, how do you forecast, because your workloads can be fundamentally changing? If I built an SoC or a chip, how do you look far enough ahead and say, ‘This is the power that I need, the thermal capacity.’ And how do you budget for it?”

Others agree. “The prediction of power consumption of chips under a given workload is one of the most complex tasks our industry must tackle today,” says Guillaume Boillet, director of product management for Arteris IP. “It requires a very detailed representation of the hardware and the underlying traffic. Today, prediction may not be that accurate, and simulation is usually the tool used. In order for them to be actionable, the power numbers are expected to be within 20% of actual silicon numbers.”

That is a tough metric to achieve. “It is relatively simple to design a ML accelerator,” says Khaled Maalej, CEO for VSORA. “It is more difficult to design an efficient one. Estimating pre-silicon power consumption is of the utmost importance, but the difficulty is getting accurate results. Power consumption spreads over a wide range, with worst-case scenarios sitting far away from typical consumption. Nominal specs for computational power and power consumption don’t tell the whole story.”

Power is not the priority today. “The priority today is getting something working,” says Derya Eker, ARC processors engineering manager at Synopsys. “Then it is functional performance, and then comes power efficiency. When bringing functionality to battery-operated devices, or edge devices, power becomes a key criterion. Getting estimations where the margin of error is 5% or 20% makes a big difference for the end product.”

Increasingly, more devices are becoming power-constrained. “Power consumption is indeed a key metric for designers evaluating machine learning,” says Steve Roddy, vice president of product marketing for the Machine Learning Group at Arm. “Inference workloads, whether in a data center or an endpoint device, are often power-consumption constrained. This is especially true with always-on devices, where the inference workload runs continuously, or in high-compute bursts.”

There is an ethical aspect to this. “Good power models aren’t only necessary for system design, they are also very important for the ‘Green Future’ discussion in Europe and North America,” says Andy Heinig, group leader for advanced system integration and department head for efficient electronics in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “The introduction of new AI systems will increase the overall power consumption of electronic components worldwide if there is no compensation by reducing some energy in the application. The comparison between the estimated energy consumption of the AI system and the energy reduction in the whole system needs good power models. They should consider the software and also the hardware aspects, because AI systems show the best performance-power values if they are co-optimized together.”

Software adds several new variables to the power problem. “If I have an algorithm, I need to find a way to run it on my hardware, and this is where software and compilers come into play,” says Cadence’s Mitra. “It is not just hardware, and that’s why it’s such a tricky problem. It requires a hardware/software co-design philosophy. How you split the workload depends a lot upon how the compiler, or the workload mapper, actually partitions the workload and puts this on your hardware.”

Another variable can have a huge impact, as well. “When you’re looking at ML, there’s a new degree of freedom that many other designs don’t have,” says Rob Knoth, product management director at Cadence. “You can change the accuracy of what you’re trying to compute. You can start looking at the accuracy data, and the PPA for different implementations to fulfill those as you sweep accuracy, and suddenly you start looking at radically different design architectures and how those get implemented.”

Today, the best we can do is look at the various pieces of the flow to understand the impact that they can have on the final outcome.

Architecting the hardware
More than 100 AI accelerators are being designed today, each aiming to be optimal for a particular type of algorithm or application.

“An AI accelerator might bring a large number of multipliers to the party, but if the data cannot be moved to and from those multipliers efficiently, any performance predictions go out the window,” says Russell Klein, HLS platform program director for Mentor, a Siemens Business. “It turns out that for most neural networks, the movement of the data — features, weights, biases, intermediate results — is more significant to the final performance and power than the operations themselves, the multiplies and accumulates. As these networks grow larger, it is not uncommon to see hundreds of megabytes, even gigabytes of weight data, that need to be processed for a single inference, with large intermediate results that need to be stored somewhere.”

There is little disagreement about the importance of getting the memory architecture correct. “Shuffling a byte of data on-and-off chip burns an order of magnitude (or more) greater power than performing a MAC operation on that same byte,” says Arm’s Roddy. “By leveraging the knowledge that data movement and memory accesses dominate power, analysis of the power profile of a given network becomes much easier than attempting to simulate or measure the detailed power.”
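
As a rough illustration of that style of analysis, the sketch below estimates per-layer energy simply by counting MAC operations and bytes moved, weighting off-chip traffic far more heavily than compute. The per-operation energy values and the example layer are purely illustrative assumptions, not characterized numbers for any particular process or device.

```python
# First-order energy estimate built on the observation that data movement
# dominates. All per-operation energies are illustrative assumptions.

MAC_ENERGY_PJ = 1.0    # assumed energy per 8-bit MAC
SRAM_BYTE_PJ = 5.0     # assumed energy per byte of on-chip SRAM traffic
DRAM_BYTE_PJ = 100.0   # assumed energy per byte moved to/from DRAM

def layer_energy_pj(macs, sram_bytes, dram_bytes):
    """Rough per-layer energy: compute plus on-chip and off-chip traffic."""
    return (macs * MAC_ENERGY_PJ
            + sram_bytes * SRAM_BYTE_PJ
            + dram_bytes * DRAM_BYTE_PJ)

# Hypothetical convolution layer: 50M MACs, 2 MB of on-chip traffic,
# 0.5 MB of weights fetched from DRAM.
e = layer_energy_pj(macs=50e6, sram_bytes=2e6, dram_bytes=0.5e6)
print(f"~{e / 1e6:.1f} uJ for this layer")
```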

That still requires knowing about every byte of data that will be moved. “Exact power modeling requires running very time-consuming EDA tools on a cycle-by-cycle basis,” says Geoff Tate, CEO of Flex Logix. “This is not going to be practical for neural network models that can take 300 billion MAC operations to process a single megapixel image. And for ML accelerators that are non-deterministic (cache contention, bus contention, etc.), getting accurate power estimates without running on the hardware will be very difficult.”

Few accelerators target a single algorithm. “You need to have a flexible architecture to be able to deal with changing applications,” says Synopsys’ Eker. “Customers may want to use a custom graph, and will want to know the performance and power efficiency of that on a particular piece of hardware. This is influenced by choices such as accuracy, because that dictates the amount of computation and the data path requirements. If you’re okay going to lower accuracy, you may have ways to optimize your hardware.”

This is just one of the ways in which the hardware architecture and the compiler are tightly coupled. “You will have the biggest impact on performance and power by keeping intermediate results local to the accelerator and not writing them back to main memory,” says Mentor’s Klein. “This often involves multiple levels of caching and strategic ordering of operations. Anything that minimizes the data movement during the calculation is a big win for performance and power.”
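
A back-of-the-envelope comparison makes the point. The sketch below, using hypothetical layer sizes, tallies the DRAM traffic for a naive schedule that spills the intermediate feature map off-chip versus a fused schedule that keeps it in local SRAM.

```python
# Off-chip traffic with and without keeping an intermediate feature map
# local to the accelerator. Sizes are hypothetical; the point is the delta.

feature_map_bytes = 4 * 1024 * 1024   # intermediate result between two layers
weights_l1_bytes  = 1 * 1024 * 1024
weights_l2_bytes  = 2 * 1024 * 1024

# Naive schedule: layer 1 writes its output to DRAM, layer 2 reads it back.
naive_dram_traffic = weights_l1_bytes + weights_l2_bytes + 2 * feature_map_bytes

# Fused/tiled schedule: the intermediate stays in on-chip SRAM.
fused_dram_traffic = weights_l1_bytes + weights_l2_bytes

saved = naive_dram_traffic - fused_dram_traffic
print(f"DRAM traffic saved by keeping the intermediate on-chip: {saved / 2**20:.0f} MB")
```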

And sometimes that may call for special capabilities in the hardware. “One of the key aspects, when it comes to architecting an AI accelerator, is how distributed the algorithm will end up being on the chip and how often synchronization needs to happen,” says Arteris IP’s Boillet. “An example that directly impacts the network-on-chip and the overall power envelope, depending on those architectural choices, might be a need for regular broadcasting of data across the entire chip — or more targeted multicasting to the nodes that will take care of the next steps of the algorithm.”

Estimation
Performance and power optimization for a given algorithm on a defined platform is still going to be hard to pin down because of the myriad different things a compiler could do, assuming the hardware architecture has a degree of flexibility in it.

Three main factors have an effect on power consumption, assuming a deterministic network. “Most of the power will be consumed in the MAC operations and the local memory operations for them,” says Flex Logix’s Tate. “A detailed power estimate can be run for a high-MAC-utilization layer and a lower-MAC-utilization layer. Then the MAC utilization of any given layer can be interpolated. The next largest power source will be DRAM traffic. Power can be modeled in detail for high-bandwidth and low-bandwidth traffic. Then, for each layer, the power can be interpolated given the predicted DRAM traffic for the layer. Last is the PCIe link. Again, power can be modeled for the periods when the incoming image is being received, and then modeled for the time when the link is idle.”
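
The sketch below shows one way such an interpolation scheme could be structured: each contributor (MACs, DRAM, PCIe) is characterized at a high and a low operating point, and a layer's power is interpolated from its predicted utilization and traffic. The anchor values are made-up placeholders, not characterization data.

```python
# Per-layer power interpolation between two characterized operating points
# for each contributor. All anchor numbers below are illustrative.

def interp(x, x_lo, x_hi, p_lo, p_hi):
    """Linear interpolation of power between two characterized points."""
    frac = (x - x_lo) / (x_hi - x_lo)
    return p_lo + frac * (p_hi - p_lo)

def layer_power_w(mac_util, dram_gbps, pcie_active):
    mac_p  = interp(mac_util, 0.2, 0.9, 1.5, 6.0)    # W at 20% vs. 90% MAC utilization
    dram_p = interp(dram_gbps, 1.0, 20.0, 0.3, 3.0)  # W at 1 vs. 20 GB/s DRAM traffic
    pcie_p = 0.8 if pcie_active else 0.1             # W while receiving vs. idle link
    return mac_p + dram_p + pcie_p

print(f"{layer_power_w(mac_util=0.75, dram_gbps=8.0, pcie_active=False):.2f} W")
```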

None of this is simple. “If a team can afford the time and effort, then technically they can build a power performance profile tool,” says Mitra. “They can build up enough heuristics such that if I throw some new workload at it, it can compute how much power will be consumed between the DDR, the processing elements, and the internal buffers, and actually do the MAC computations.”

But this is only true for deterministic workloads. “Some of the newer ML algorithms will not be as deterministic,” says Mentor’s Klein. “Rather than running a fixed set of multiply accumulates, sometimes only a portion of the network will be evaluated. Imagine an object-recognition algorithm processing a video stream. One doesn’t need to process all the pixels for every frame. If only a few pixels change, they can be incrementally processed. Now the algorithm’s performance and power will be data-dependent. Other optimizations involve pruning of the network, sometimes significantly. There are approaches that do this pruning dynamically, i.e., if a certain perceptron produces a negative result, don’t bother evaluating these other perceptrons. This makes the network’s performance and power dependent on the input data, therefore harder to predict.”
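
A toy example shows why this matters for estimation. In the hypothetical gating scheme below, an expensive branch of the network is evaluated only when a gating neuron's output is non-negative, so the MAC count, and therefore the energy, varies from input to input.

```python
# Why data-dependent execution breaks static power estimates: a hypothetical
# gating rule skips part of the network, so the MAC count varies per input.
import numpy as np

rng = np.random.default_rng(0)
W_gate = rng.standard_normal((16,))       # gating perceptron, always evaluated
W_branch = rng.standard_normal((16, 64))  # expensive branch, sometimes skipped

def inference_mac_count(x):
    macs = W_gate.size
    if float(W_gate @ x) >= 0:            # assumed gating rule
        macs += W_branch.size             # branch evaluated only on this path
        _ = x @ W_branch
    return macs

counts = [inference_mac_count(rng.standard_normal(16)) for _ in range(1000)]
print(f"MACs per inference: min={min(counts)}, max={max(counts)}, "
      f"mean={sum(counts) / len(counts):.0f}")
```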

The need to evaluate the power performance tradeoffs for various platforms means there has to be some way to tell which platform is the most suitable. But until we have real power models, the industry must rely on benchmarks. “Training set data is fundamental to how many of these algorithms work,” says Cadence’s Knoth. “We’ll start to see more of an agreement on certain benchmarks or certain standard workloads. If you don’t do this, it will be one person’s word versus another person’s. You’re not going to see these standards show up overnight, but that’s a logical end point in order to make this more productive.”

Optimizing accuracy
One of the new degrees of freedom, both in algorithm development and in the optimizations of the ML compiler, is accuracy.

“There is a dichotomy in terms of research, where people are coming up with newer networks aimed at improving accuracy,” says Mitra. “Edge devices are power-constrained, or resource-constrained, and you have to ask the question, ‘How much accuracy do I need?’ If I can detect a cat with 90% probability, versus 90.1% or 91%, it may not make a lot of difference to you. The dichotomy is between people who are making new networks, versus people who are trying to map those networks and workloads on real hardware, on real platforms, on real silicon, on a real IP. I made this network, I made this better, but did I really make it better? Did I actually harm it from a power perspective?”

Then there is a big gap between the algorithm and the underlying hardware platform. “If I change the algorithm, that will be reflected in a number of things,” says Eker. “The biggest factors are whether my accuracy or the graph architecture changed. Do I have deeper, or more layers coming in? These will affect power. There is no straight path to get that exact estimation. It depends on the big changes that you’re making. Designers often measure the energy efficiency of a single convolution layer of a graph, such as the multi-layered SegNet graph (see Fig. 1). A common pitfall is to then extrapolate the result to a full graph. You would need to know the hardware, and how the application will be mapped. You need to bring multiple disciplines together.”

Fig. 1: SegNet architecture implements multiple layers. Depending on position or graph architecture, the same layer may require a different amount of energy, so no single layer can be extrapolated to represent the entire graph. Source: Synopsys

While the algorithmic level may be too separated from the actual hardware, many of the ML compilers are being developed by the hardware developers, and thus they should know how to optimally target the hardware features that are available.

“Quantization and numeric representation are huge in terms of their impact on power and performance,” says Klein. “This happens in two ways. First, the operators are smaller. The size of a multiplier is roughly proportional to the square of its input operand size. A 32×32 bit multiplier is about 4 times larger in area than a 16×16 multiplier, and for most ASIC libraries, power will scale even better than area. The second benefit from quantization is data movement. If your numbers are smaller, there is less data to move around. Going from a 32-bit representation to 8 bits means moving 1/4 the data, and having 1/4 the memory for weights and intermediate results. And 1/4 the number of bus cycles to access that data.”
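
Those scaling rules are easy to sanity-check. The snippet below applies the rough "area grows with the square of the operand width" model and computes how much weight data has to move at different bit widths, using a hypothetical parameter count.

```python
# Back-of-the-envelope check of the quantization scaling described above.
# The quadratic area model and traffic ratios follow from the text; the
# absolute parameter count is hypothetical.

def relative_multiplier_area(bits_a, bits_b):
    """Area ratio under the rough 'area ~ operand width squared' model."""
    return (bits_a / bits_b) ** 2

print(relative_multiplier_area(32, 16))   # ~4x, as noted above
print(relative_multiplier_area(32, 8))    # ~16x

weights = 25_000_000                      # hypothetical parameter count
for bits in (32, 16, 8):
    print(f"{bits}-bit weights: {weights * bits / 8 / 2**20:.0f} MB to move per inference")
```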

And these optimizations are often taken to the extreme. “Consider a design that has an always-on portion of the chip that listens for wake-up words,” says Knoth. “That’s a very real tradeoff between how accurate is your interpretation of the word, versus how much power you’re going to draw, because you’re always sitting there in that sort of standby mode.”

There continues to be a lot of research that can help optimize hardware. “You also do not want to limit your thinking that the number representations need to be linear,” says Klein. “ML applications may need lots of precision around 0, but when the numbers get larger than 1 or smaller than -1, being less precise is fine. Storing numbers as indices into a lookup table is one way to achieve this.”
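
A minimal sketch of that lookup-table idea is shown below, with an illustrative 16-entry codebook packed densely around zero and sparsely at the extremes; weights would be stored as 4-bit indices into the table rather than as linear fixed-point values.

```python
# Non-uniform quantization via a lookup table: fine resolution near 0,
# coarse steps beyond +/-1. The table contents are an illustrative choice,
# not a recommended codebook.
import numpy as np

table = np.array(sorted(
    [0.0, 0.05, -0.05, 0.1, -0.1, 0.25, -0.25, 0.5,
     -0.5, 0.75, -0.75, 1.0, -1.0, 2.0, -2.0, 4.0]))

def encode(x):
    """Map each value to the index of the nearest table entry (4 bits each)."""
    return np.abs(x[:, None] - table[None, :]).argmin(axis=1).astype(np.uint8)

def decode(idx):
    return table[idx]

x = np.array([0.03, -0.4, 1.7, 0.0])
print(decode(encode(x)))   # values snapped to the non-uniform grid
```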

Implications for implementation
Exact power numbers cannot be known when the chip is designed, and they will change over time as the algorithms are refined, or the compilers are improved. “Peak power is definitely important to make sure that as your circuit is hitting a certain operation, you’re not going to have any power integrity issues,” says Knoth. “You also need to be looking at things like max average power to make sure that you’re not going to have a thermal problem with the device or the package. You have to look at things like standby power, especially if it is being powered by battery. You have to be looking at all of them. But they start with certain fundamental assumptions about the traffic on the device.”
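
Given a power-versus-time trace from power analysis of a representative workload, those three numbers fall out directly, as in the sketch below. The trace here is synthetic, and the thermal averaging window is an assumption.

```python
# Peak, max-average (thermal), and standby power extracted from one
# power-vs-time trace. The trace is synthetic; in practice it would come
# from power analysis of representative workloads.
import numpy as np

dt_ms = 1.0
trace_w = np.concatenate([
    np.full(200, 0.05),                                 # standby
    3.0 + 0.5 * np.random.default_rng(1).random(300),   # burst of inferences
    np.full(200, 0.05)])                                # back to standby

peak_w = trace_w.max()                    # sizes the power grid / integrity margin
window = int(100 / dt_ms)                 # assumed 100 ms thermal averaging window
rolling = np.convolve(trace_w, np.ones(window) / window, mode="valid")
max_avg_w = rolling.max()                 # drives thermal / package limits
standby_w = trace_w.min()                 # battery-life floor

print(f"peak={peak_w:.2f} W, max 100 ms average={max_avg_w:.2f} W, standby={standby_w:.2f} W")
```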

There are other factors, too. “You should look at multiple applications, and the power consumption for a range of applications so that there are no surprises,” says Eker. “By knowing your design, and from profiling applications, you can see the peak power you could get. And that’s something that I should be using for the dimensioning of the power grid. Do not take one data point, but find a good upper range based on profiling.”

Systems probably need to have protection built into them for the unexpected cases where power or temperature build beyond expectations, as well. “In the mobile landscape there are lots of voltage/frequency combinations,” says Mitra. “The hardware can run at different voltage frequency profiles, meaning that I can select ‘this’ voltage, then ‘this’ is my frequency, and ‘this’ is the resulting power and performance.”
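
The power side of those voltage/frequency choices follows the familiar dynamic-power relation P ≈ C·V²·f, as in the illustration below. The effective capacitance and the operating points are hypothetical.

```python
# Trading power against performance across voltage/frequency operating
# points using P ~ C * V^2 * f. All values are hypothetical.

C_EFF_NF = 2.0   # assumed effective switched capacitance, in nF

operating_points = [   # (voltage V, frequency MHz)
    (0.6, 400),
    (0.8, 800),
    (1.0, 1200),
]

for v, f_mhz in operating_points:
    p_mw = C_EFF_NF * 1e-9 * v**2 * f_mhz * 1e6 * 1e3   # dynamic power in mW
    print(f"{v:.1f} V @ {f_mhz} MHz -> ~{p_mw:.0f} mW dynamic")
```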

Conclusion
Today, no methods reliably hit the magic 20% accuracy number needed for dependable decision-making. Decisions are made based on selected samples, but there is no guarantee that those samples remain representative over the lifetime of the product.

Ultimately what may be needed is a digital twin on which algorithmic changes can be assessed in terms of the performance and power consumption for a given device in the field. “People are only going to be able to be aware of things that are measured,” says Knoth. “The more we see the discipline of functional verification merging with some of the design and implementation, so that you can start giving almost real time feedback of the PPA impact of the algorithms that are running on these accelerators, that’s how you build an overall better product.”

At some point, this will change. “In the future, companies will be providing reliable power models for the network-on-chips used on AI accelerators,” says Boillet. “And now we are coming full circle as we plan to rely on AI algorithms to generalize those models.”

Comments

Kevin Cameron says:

Calculating power requires analog-capable simulators, e.g. Xyce, and hardware models that include V & I for the power calculation.

Interestingly, behavioral models for mixed-signal circuits look a lot like neurons in SW neural networks, so whatever HW you use for evaluating your NNs can be used to calculate the power too.

You can also use AI techniques to generate the power-aware behavioral models from SPICE level descriptions, so it’s a bootstrappable process.
