Power Challenges In ML Processors

Machine learning engines present some new power challenges that could trip up the unwary. Some of the issues were understood in the past, but have since been forgotten.


The design of artificial intelligence (AI) chips or machine learning (ML) systems requires that designers and architects use every trick in the book and then learn some new ones if they are to be successful. Call it style, call it architecture, there are some designs that are just better than others. When it comes to power, there are plenty of ways that small changes can make large differences.

ML processors tend to have very regular architectures. “They typically include very large arrays with hundreds, even thousands of processors, often arranged in clusters repeated across the chip and consuming high power on the order of tens or even hundreds of watts,” says Richard McPartland, technical marketing manager for Moortec. “The chips are typically huge, on the order of several hundred mm², and designed on very advanced processes — especially finFET nodes. With potentially zillions of gates all switching synchronously around the same time, tactics are often employed to spread the switching energy.”

When AI chips are judged by their processing throughput, the goal is to keep those processors busy. “You do want to keep the MACs busy all the time,” says Rob Knoth, director of product management for digital signoff at Cadence. “When you do that, the problem looks very similar to ATPG activity on the chip, where everything is rattling simultaneously. That can create a very nasty, persistent power problem and thermal problem because it is always happening. It is not bursty activity. It keeps going, and so the thermal is likely to keep ramping.”

ML chips are one part of a complete application design problem. “At a high level, an AI chip would be equipped with three main elements — 1) a large amount of data, 2) one or more algorithms to process the data, and 3) the physical architecture where data processing/calculation is carried out,” says CT Kao, product management director at Cadence. “All three will consume higher power and generate more heat compared to conventional electronic or semiconductor devices, not only due to performing an enormous amount of computation but also due to clustering multiple functional units on the AI chip.”

This requires a systematic approach to the design problem. “For machine learning inference networks, one approach is to optimize at each stage of the tool flow,” says Steve Roddy, vice president of product marketing for the machine learning group at Arm. “This would start from network optimization (reducing network complexity), moving to network conditioning (techniques include quantization, clustering, and pruning a model to optimally fit the target hardware). Another approach is to compress (lossless) the model and store both weights and activations in compressed format all the way up to the point just before consumption in the MAC array. There are additional approaches to data management and minimization. Once this issue has been dealt with, it makes the physical implementation challenges (such as power, thermal, IR) much more tractable because data traffic on SoC buses is dramatically reduced.”
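
To make the conditioning and compression steps concrete, here is a minimal Python sketch (the 50% sparsity target, 8-bit format, and matrix size are arbitrary illustrations, not Arm recommendations) showing how pruning and quantization shrink the data that has to move across the SoC.

```python
# Toy sketch: magnitude pruning plus int8 quantization of a random weight
# matrix. Threshold, bit width, and matrix size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(256, 256)).astype(np.float32)

# Pruning: zero out the 50% of weights with the smallest magnitude (arbitrary target).
threshold = np.quantile(np.abs(weights), 0.5)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights).astype(np.float32)

# Quantization: map the surviving float32 weights onto signed 8-bit integers.
scale = np.max(np.abs(pruned)) / 127.0
quantized = np.round(pruned / scale).astype(np.int8)

# Storage drops 4x before any lossless compression, and the pruned zeros
# compress very well, so far less data crosses the SoC buses per inference.
print("float32 bytes:", weights.nbytes)
print("int8 bytes:   ", quantized.nbytes)
print("sparsity: %.0f%%" % (100 * np.mean(quantized == 0)))
```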

Good decisions
As with any system, good design starts at the earliest stages possible. “Architecture has the biggest impact on power and energy consumption, whereas at RTL and below, only minor improvements can be achieved,” says Tim Kogel, principal application engineer at Synopsys. “Big-ticket decisions need to be taken at the architecture level, such as hardware/software partitioning of the workload, mapping of data structures into on-chip and off-chip memory, grouping of components into power domains, power management policy, such as Run Fast Then Stop for lower energy, or DVFS for lower power.”
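
The difference between those two policies can be seen with a first-order model. In the sketch below, the effective capacitance, voltage/frequency points, and leakage figure are all invented numbers, and leakage is assumed constant while the chip is powered on. With those assumptions, running fast and then power-gating finishes with less total energy, while DVFS draws less power while active, which mirrors the distinction Kogel makes.

```python
# Toy sketch: first-order energy/power comparison of "run fast then stop"
# versus DVFS. All constants below are assumptions for illustration only.
CEFF = 1.0e-9        # effective switched capacitance per cycle (F), assumed
P_LEAK_ON = 2.0      # leakage power while powered on (W), assumed constant
WORK_CYCLES = 2e9    # cycles the workload needs
DEADLINE_S = 2.0     # time budget for the workload

def run_fast_then_stop(v_max=0.9, f_max=2.0e9):
    busy_s = WORK_CYCLES / f_max                        # finish early, then power-gate
    p_busy = CEFF * v_max * v_max * f_max + P_LEAK_ON   # C*V^2*f dynamic + leakage
    return p_busy * busy_s, p_busy                      # (energy in J, active power in W)

def dvfs(v_min=0.6):
    f = WORK_CYCLES / DEADLINE_S                        # just fast enough for the deadline
    p_busy = CEFF * v_min * v_min * f + P_LEAK_ON
    return p_busy * DEADLINE_S, p_busy

for name, (energy_j, power_w) in (("run fast then stop", run_fast_then_stop()),
                                  ("DVFS", dvfs())):
    print("%-18s: %.2f J total, %.2f W while active" % (name, energy_j, power_w))
```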

Adds Roddy: “ML performance and power is dominated by data movement, not compute power or MAC count.”

This means that memory architecture has to be carefully considered. “You have to consider memory distribution versus aggregation,” says Cadence’s Knoth. “This is true both in time and space. If you are able to distribute the memory physically around the die into smaller memories, that helps even out a big bottleneck and power source. Also, if you can distribute them over time, as opposed to having a giant memory hit, that allows the overall thermal aspects of the memory access to settle out. If you can distribute any high-bandwidth activity, it has a good impact. So memory access is a great area where paying attention to the ability to distribute can provide an overall benefit.”
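
A toy calculation illustrates the temporal side of this. In the sketch below, the access count, per-access energy, and cycle budget are made-up numbers. Total energy is identical either way, but the per-cycle peak, which is what stresses the power grid and heats the die, differs by more than an order of magnitude.

```python
# Toy sketch: the same number of memory accesses issued as one tight burst
# versus spread evenly over time. Energy per access and cycle counts are assumed.
import numpy as np

ACCESSES = 4096
ENERGY_PER_ACCESS_PJ = 10.0     # assumed energy per on-chip memory access
CYCLES = 1024

burst = np.zeros(CYCLES)
burst[:64] = (ACCESSES / 64) * ENERGY_PER_ACCESS_PJ          # all traffic in 64 cycles
spread = np.full(CYCLES, (ACCESSES / CYCLES) * ENERGY_PER_ACCESS_PJ)

print("total energy identical:", np.isclose(burst.sum(), spread.sum()))
print("peak per-cycle energy, burst : %.0f pJ" % burst.max())
print("peak per-cycle energy, spread: %.0f pJ" % spread.max())
```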

Based on analysis
As systems have become more complicated, techniques used in the past become increasingly inaccurate. “In the past, architectural level power analysis has been mostly done with Excel,” says Synopsys’ Kogel. “This static analysis becomes increasingly difficult for heterogeneous multi-processing architectures with dynamic power management, where the actual power consumption depends on the dynamic workload, actual utilization of heterogeneous processing elements, and the state of the power manager. While power analysis based on emulation is becoming mainstream and allows SoC power estimation for application use-cases, it requires full RTL and software to be available. That makes it too late for architecture-level tradeoff analysis.”

Analysis requires the right model abstractions. “We advocate the use of virtual prototyping for early power analysis in the context of application workloads running on resource models of the SoC platform,” adds Kogel. “This system-level power analysis flow has been supported by IEEE 1801 UPF since version 3.0. We see increasing adoption of this method for early power analysis as static power analysis with Excel is running out of steam. It allows first estimation of peak power and quantitative investigation of architecture alternatives to spread power consumption spatially over more resources and/or temporally over longer time periods to avoid power and thermal issues down the line in later implementation and verification phases.”
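
A minimal sketch of the idea, with invented resources, power states, and task schedule standing in for a real virtual prototype and UPF 3.0 power models, shows why a workload-driven profile says more than a spreadsheet average: the peak is nowhere near the mean.

```python
# Toy sketch: workload-driven power profile from a hypothetical task schedule.
# Resource names, power states, and the schedule itself are invented.
ACTIVE_POWER = {"npu_cluster": 12.0, "dram_ctrl": 4.0, "cpu": 2.5}   # watts, assumed
IDLE_POWER   = {"npu_cluster": 1.0,  "dram_ctrl": 0.5, "cpu": 0.3}   # watts, assumed

# (resource, start_ms, end_ms) intervals from a hypothetical workload trace
schedule = [
    ("cpu", 0, 2), ("dram_ctrl", 1, 5), ("npu_cluster", 2, 6),
    ("npu_cluster", 6, 10), ("dram_ctrl", 8, 10), ("cpu", 9, 10),
]

profile = []
for t in range(10):                                      # 10 ms horizon
    active = {r for (r, start, end) in schedule if start <= t < end}
    profile.append(sum(ACTIVE_POWER[r] if r in active else IDLE_POWER[r]
                       for r in ACTIVE_POWER))

print("average power: %.1f W" % (sum(profile) / len(profile)))
print("peak power:    %.1f W at t=%d ms" % (max(profile), profile.index(max(profile))))
```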

Don’t forget the details
While abstraction is good, it has to contain all necessary details. “When you are targeting something designed to operate at high frequency, you have few gates between registers,” explains Knoth. “When the industry focused on that type of architecture, glitch tended to drop off people’s radar. Glitch can significantly increase total power as signals propagate through multiple layers of combinational logic, where it can take many changes before the signals settle out. As process technologies have improved, especially for AI types of applications, you are getting deeper datapaths that incorporate more combinational logic. So the percentage of power switching between the edges is going to go up. That is causing people to look at more glitch-tolerant logic design.”
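
The trend Knoth describes can be illustrated with a toy model. The sketch below ripples random input changes through a chain of 2-input XOR gates with unit delays and counts every intermediate toggle against the single functional transition each node needs. The structure and depths are arbitrary, but the wasted switching per useful toggle clearly grows with combinational depth.

```python
# Toy sketch: glitch toggles versus functional toggles in a unit-delay XOR chain.
# The gate structure, depths, and random stimulus are illustrative assumptions.
import random

def glitch_ratio(depth, trials=500, seed=7):
    rng = random.Random(seed)
    wasted = useful = 0

    def settled(bits):
        acc, out = 0, []
        for b in bits:                             # node i settles to XOR(bits[0..i])
            acc ^= b
            out.append(acc)
        return out

    for _ in range(trials):
        old = [rng.randint(0, 1) for _ in range(depth)]
        new = [rng.randint(0, 1) for _ in range(depth)]
        v0 = settled(old)                          # node values before the input edge
        final = settled(new)                       # node values once everything settles
        v, toggles = list(v0), 0
        for _ in range(depth + 1):                 # propagate with unit gate delays
            nxt = [new[0]] + [v[i - 1] ^ new[i] for i in range(1, depth)]
            toggles += sum(a != b for a, b in zip(v, nxt))
            v = nxt
        functional = sum(a != b for a, b in zip(v0, final))
        useful += functional
        wasted += toggles - functional
    return wasted / max(useful, 1)

for d in (4, 16, 64):
    print("logic depth %3d: glitch toggles per functional toggle = %.2f"
          % (d, glitch_ratio(d)))
```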

A small error multiplied by a lot can be significant. “If you are undercounting what can be a source of power dissipation at the MAC level, and tile that across a reticle limited design, that small undercounting of power at the sub-block level ramifies to the whole design level,” adds Knoth. “You could have all kinds of problems just because the early estimation of power was faulty.”

Those ramifications can affect other aspects of the design. “The large number of MAC operations per cycle makes it important to analyze peak power and its impact on IR drop,” says Arti Dwivedi, principal technical product manager at ANSYS. “Peak power analysis and identification of critical vectors for IR drop analysis require accurate power profiling for billions of cycles of emulation activity data for multiple realistic scenarios. There are multiple facets to consider here. First, whether the vectors being used for this analysis are actually stressing the design for worst IR drop scenarios, making the vector quality and coverage very important. Second, the power profiling of large emulation activity data requires architectures with large capacity and parallel processing ability. Third, power analysis for billions of cycles generates huge amounts of data. Designers need big data analytics to derive useful insights from this data, such as the minimum vector set required to sign off power integrity and how to fix IR drop issues.”
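
One piece of that flow can be sketched simply: scanning a long per-cycle power trace for its worst sustained window, the kind of screening used to pick candidate vectors for IR-drop analysis. In the sketch below, the trace is synthetic random data standing in for emulation output, and the averaging window is arbitrary.

```python
# Toy sketch: locating the worst sustained power window in a long per-cycle
# trace. The trace is synthetic random data; the window length is assumed.
import numpy as np

rng = np.random.default_rng(3)
cycles = 1_000_000                                  # stand-in for emulation-scale data
trace_w = 5.0 + rng.gamma(2.0, 1.5, cycles)         # per-cycle power samples (W)

WINDOW = 50                                         # cycles the PDN effectively averages over
windowed = np.convolve(trace_w, np.ones(WINDOW) / WINDOW, mode="valid")
worst = int(np.argmax(windowed))

print("average power over trace: %.2f W" % trace_w.mean())
print("worst %d-cycle window: %.2f W, starting at cycle %d"
      % (WINDOW, windowed[worst], worst))
```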

Tiled designs create some problems that do not exist with more heterogeneous circuit types. “Spread spectrum clocking is a technique where you can skew the clock a little bit to spread out peaks in power,” says Knoth. “On the EDA side we spend time looking at not just clock skew for timing closure, but clock skew for power. The goal is that designers should not have to worry about that. We should be able to generate a more power-evened architecture. Spread spectrum clocking is a valuable technique, but it is only one of the arrows in the quiver. The real value is getting further ahead in the analysis, which in turn enables going back to the architecture and providing accurate power feedback to the folks writing the C algorithms.”
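
The effect of deliberately skewing cluster clocks can be shown with a toy current model. In the sketch below, each of 16 identical clusters draws the same current pulse after its clock edge; the pulse shape, cluster count, and clock period are invented. Aligning the edges multiplies the peak, while staggering them spreads it out, even though the charge delivered per cycle is unchanged.

```python
# Toy sketch: peak supply current with aligned versus staggered cluster clocks.
# Pulse shape, cluster count, and clock period are illustrative assumptions.
import numpy as np

CLUSTERS = 16
PERIOD_PS = 1000
pulse = np.concatenate([np.linspace(0, 10, 50), np.linspace(10, 0, 50)])  # amps

def cycle_current(skew_step_ps):
    total = np.zeros(PERIOD_PS)
    for c in range(CLUSTERS):
        start = (c * skew_step_ps) % PERIOD_PS
        idx = (start + np.arange(pulse.size)) % PERIOD_PS    # wrap around the cycle
        np.add.at(total, idx, pulse)
    return total

aligned = cycle_current(skew_step_ps=0)
skewed = cycle_current(skew_step_ps=PERIOD_PS // CLUSTERS)

print("total charge identical:", np.isclose(aligned.sum(), skewed.sum()))
print("peak current, aligned clocks: %.1f A" % aligned.max())
print("peak current, skewed clocks : %.1f A" % skewed.max())
```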

Runtime protection
Some designs are targeted to run close to the edge, but never go over it. That requires on-chip feedback loops. “On-chip thermal sensors can play a crucial role in detecting real-time temperature responses, so a built-in feedback mechanism can protect the AI chip,” says Cadence’s Kao.

“To manage and keep an AI chip within its thermal envelope, designs typically include not just one but tens of temperature sensors across the die — for example, one per processing cluster,” explains Moortec’s McPartland. “This is essential not just for avoiding thermal runaway, but to enable data throughput to be maximized by minimizing processor throttling. Thermal management is also important for applications such as automotive and in the data center, where high temperatures impact electromigration. Automotive applications demand an exceptionally long life of 15 or more years, and while data centers operate their chips for a shorter life, they choose to run them at higher data throughput, resulting in higher temperatures.”
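
A minimal sketch of such a feedback loop, assuming a single-node RC thermal model and made-up power levels and temperature thresholds, looks like this:

```python
# Toy sketch: a bang-bang throttling loop driven by a temperature reading.
# The one-node RC thermal model, thresholds, and power levels are all assumed.
AMBIENT_C = 45.0
R_THERMAL = 0.4                            # K/W junction-to-ambient, assumed
C_THERMAL = 20.0                           # J/K thermal mass, assumed
P_FULL, P_THROTTLED = 150.0, 90.0          # chip power in each mode (W), assumed
T_HOT, T_COOL = 95.0, 88.0                 # throttle above T_HOT, release below T_COOL

temp_c, power_w, dt = AMBIENT_C, P_FULL, 0.1
throttled_s = 0.0
for _ in range(3000):                      # simulate 300 seconds
    # dT/dt = (P - (T - Tamb)/R) / C for a single thermal node
    temp_c += dt * (power_w - (temp_c - AMBIENT_C) / R_THERMAL) / C_THERMAL
    if temp_c > T_HOT:
        power_w = P_THROTTLED              # sensor trips: throttle the clusters
    elif temp_c < T_COOL:
        power_w = P_FULL                   # cooled off: restore full throughput
    if power_w == P_THROTTLED:
        throttled_s += dt

print("final junction temperature: %.1f C" % temp_c)
print("time spent throttled: %.0f s of 300 s" % throttled_s)
```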

And while the feedback loop can be active during operation, it still requires a lot of early analysis to ensure it functions correctly. “Equally important are the software tools that can perform physical analysis and simulation while designing the AI chips,” adds Kao. “Different application scenarios can be analyzed concurrently with the chip design before actual manufacturing, shortening time-to-market significantly and ensuring device reliability.”

Those application scenarios are vital to maximizing capabilities. “While the behavior of the power delivery network (PDN) can be, and is, simulated, these devices are all software-driven, and the PDN may show different behavior simply because different software is used in the end application,” says McPartland. “Despite all the simulations and design tactics used, it is extremely valuable to also embed a fabric of voltage monitors so one can check whether the PDN was adequate. Further, with visibility of actual on-chip conditions, you are then armed with the information needed to optimize performance, power and/or reliability.”
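
A rough sketch of how that monitor telemetry might be screened, with a hypothetical nominal rail and droop limit and synthetic samples in place of real readings:

```python
# Toy sketch: screening on-die voltage-monitor telemetry for excess droop.
# Nominal rail, droop limit, monitor count, and samples are all hypothetical.
import numpy as np

NOMINAL_V, DROOP_LIMIT_V = 0.75, 0.05
rng = np.random.default_rng(11)
# one row per on-die monitor, one column per telemetry sample
samples = NOMINAL_V - np.abs(rng.normal(0.0, 0.012, size=(32, 10_000)))

worst_per_monitor = samples.min(axis=1)
flagged = np.where(worst_per_monitor < NOMINAL_V - DROOP_LIMIT_V)[0]

print("worst droop seen: %.0f mV" % (1000 * (NOMINAL_V - samples.min())))
print("monitors flagging excess droop:", flagged.tolist())
```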

Conclusion
Massive, replicated designs present challenges that may be new to some development teams. Some are old problems, others were previously seen only by test engineers, but now they all have been rolled into high-level architectural decisions. Many of these chips are being fabricated in the most advanced nodes, where resolving issues such as IR drop is more difficult than in the past. But few of these problems are completely new, and the techniques necessary to handle them are available.

“Whenever you give anyone an additional degree of freedom, that is where good architects and designers, coupled with tools that help them get better answers, faster, can make truly differentiated products,” says Knoth.


