Power Issues Rising For New Applications

Why managing power is becoming more difficult, more critical, and much more expensive.


Managing power in chips is becoming more difficult across a wide range of applications and process nodes, forcing chipmakers and systems companies to rethink their power strategies and address problems much earlier than in the past.

While power has long been a major focus in the mobile space, power-related issues now are spreading well beyond phones and laptop computers. There are several reasons for this:

  • Power dissipation is becoming increasingly difficult in the finFET world, a problem that is made worse by the fact that at each new node after 16/14nm, leakage current and dynamic power density are both increasing.
  • New applications such as AI and deep learning require massive compute power, and new architectures depend on rapid throughput and raw performance. But they also rely on keeping all of the processing elements in a chip busy at all times, which creates power dissipation problems.
  • More customization is required to tackle new markets. As a result, there are fewer derivative chips and more one-off designs, so problems detected and solved for one chip may be significantly different than problems detected in other chips and much more expensive to fix.

These challenges extend from data centers, where AI, networking, and telecommunications require massive amounts of energy, all the way to the edge. At 7nm, it’s not uncommon for chips to be large, sometimes at reticle size, with hundreds or thousands of processor cores. But unlike in the past, where those processors were mostly dark except for required bursts of activity, some of the new application areas require more of these processing elements to be on more often, if not all the time.

And this is where problems such as heat, electromigration, power-related noise and reliability become particularly difficult to manage.

“CPU utilization, power management and device reliability must be tightly and accurately thermally managed on die,” said Stephen Crosher, CEO of Moortec. “Otherwise, data center electricity bills can be millions of dollars higher than necessary each year. Data center operators are now seeing the direct correlation between site-running costs and the thermal monitoring and management adopted way down deep within the system at chip level.”

This is driving new techniques such as real-time, in-chip thermal guard-banding to enhance the implementation of health monitoring, failure prediction and the design of higher rack density configurations. But in many cases the solutions are just barely keeping pace with the problems. Everyone wants to utilize AI/ML/DL in a chip, whether those chips are used inside data centers or at the edge, but the multiply/accumulate processing consumes a lot of energy.

“Whether you’re doing that to a specialized CNN block, as in the case of embedded vision processes, or whether you’re doing it in a graphics chip with GPUs, it’s all about multiply accumulates,” said Yudhan Rajoo, technical marketing manager for foundation IP at Synopsys. “The way we deal with this problem is primarily after the RTL has been written, by instantiating certain complex cells in the RTL by the designer in a hand-placed fashion. For example, there are large Booth multiplexers, large compressors, and 16-bit muxes and multipliers that we are starting to add in to reduce the overall size of the design. That reduces the number of routes that you need to make, and as you go down in nodes this reduction in number of routes saves a lot of switching power. These things are continuously running and transmitting signals, so as little connection as you can make is what really helps save power.”

These decisions start up front during the planning phase of the design. But as with any other type of design, engineering teams are very worried about design timelines and tapeout timelines, and power can have a big impact on schedules.

“There’s a big race to come up with the best neural network processing architecture, and these RTLs keep on changing until pretty much the last month of tapeout,” Rajoo said. “As a result, design teams are very worried about finding [library] solutions that give enough flexibility to modify things down the line. This has risen as a prime consideration for both SoC designers and their design managers who want to have this flexibility. These teams need a breadth of options, especially on advanced nodes, because the number of foundries that are doing the most advanced nodes is down to two, maybe three if you’re being generous.”

Within these new architectures, optimization around power is becoming a critical design element. “Low-power design is not limited to platforms like mobile or IoT,” said Dave Pursley, senior principal product manager for the Digital & Signoff Group at Cadence. “Computationally intensive algorithms are an interesting problem because the computations themselves will require a significant amount of energy to perform. In other words, there is a fairly high ‘floor’ when it comes to the amount of energy that will be consumed.”

All of this has pushed the design space well beyond just the hardware to the movement of data through a system, including what gets processed where, how precise the computation needs to be, and how it is stored and read in memory.

“Theoretically, from a dynamic switching perspective, the lowest energy solution to compute an algorithm would be to compute it in as few clock cycles as possible and then shut off via clock gating or, better yet, via power shutoff,” Pursley said. “That minimizes the amount of ‘unproductive’ switching, such as muxing, flip-flops, and clock switching. But that often is not the best tradeoff, because the required silicon area would be larger. That, in turn, increases costs, leakage, and even dynamic energy due to the higher capacitance of longer interconnects. Moreover, it may not even be feasible, especially for computationally intensive algorithms. Power is energy over time, so computing an energy-hungry algorithm in a short time may be infeasible or too costly from a power perspective.”
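
The relationship Pursley describes, power as energy over time, can be illustrated with a small Python sketch. The task energy and the two durations are made-up numbers chosen only to show the effect, and the function name is hypothetical. The sketch also deliberately ignores leakage and area, which the quote notes would shift the tradeoff in a real design.

```python
# Illustrative only: every number below is invented for the example.

def average_power_watts(energy_joules: float, duration_seconds: float) -> float:
    """Average power is the energy spent divided by the time it is spent over."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return energy_joules / duration_seconds


# Hypothetical fixed energy "floor" for one run of a compute-heavy algorithm.
TASK_ENERGY_J = 2.0

# The same energy spent on two schedules: a short burst and a longer run.
burst_power = average_power_watts(TASK_ENERGY_J, 0.010)   # finished in 10 ms
spread_power = average_power_watts(TASK_ENERGY_J, 0.100)  # finished in 100 ms

print(f"burst:  {burst_power:.1f} W")
print(f"spread: {spread_power:.1f} W")
```

Under these assumptions the energy is identical in both cases, but the short schedule has to deliver it at ten times the power, which is the part that may not fit a power or thermal budget.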

In these cases, it is the task of the designer and the EDA tools they use to amortize that energy over time. The goal is an acceptable power profile with minimal energy overhead, while still meeting the performance requirements of the application. So while RTL and physical optimizations can reduce power by 20% or more, the most important optimization begins with a power-efficient RTL architecture. That includes an understanding of the clock speeds of the various blocks, how they communicate with each other, what is the memory architecture and the throughput, and what is the overall power impact of the architecture. Modeling all of this remains difficult, however, largely because so many of the applications and architectures are new.

“With finFETs, with self-heating behavior, we have some history from the earliest finFETs,” said João Geada, chief technologist for the semiconductor business unit at ANSYS. “This is the part that concerns me the most. We are making parts for which we don’t really have history on the modeling side on the foundry. We have the simulation technology if we have the models, both on the highly detailed stuff as well as on the large-scale chip-wide stuff on our side. We do need both, but we depend critically on models, and that’s still a very challenging area.”

Still, the power problem is so large and diffuse that some higher level of abstraction is required.

“In many cases, the best way to figure this out is to use high-level synthesis (HLS) to actually create multiple RTLs with different architectures and actually measure the power with realistic stimulus,” Pursley said, noting that state-of-the-art RTL power estimation tools today can produce power estimates within 15% of sign-off. “The real trick is to ensure you have realistic stimulus for measuring power. For example, for a processor the ‘boot Linux’ test is great for functional testing and peak power analysis, but it is likely a terrible metric for optimizing average power to maximize battery life. A better stimulus would be the processor running its typical applications. It is important to use the correct stimuli, or windows of stimuli, for the correct optimization tasks. Otherwise, you or your tools will be making optimization decisions based on bad data.”
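
The point about choosing the right window of stimulus can be sketched in Python. The power trace below is invented for illustration, and the function and variable names are hypothetical. It is not output from any estimation tool.

```python
from statistics import mean


def window_metrics(trace_mw, start, end):
    """Return average and peak power for samples trace_mw[start:end]."""
    window = trace_mw[start:end]
    if not window:
        raise ValueError("window is empty")
    return {"average_mw": mean(window), "peak_mw": max(window)}


# Hypothetical per-interval power samples in milliwatts:
# a boot-like burst (first 4 samples) followed by a typical workload.
trace = [900, 950, 920, 880, 300, 280, 310, 290, 305, 295]

boot = window_metrics(trace, 0, 4)
typical = window_metrics(trace, 4, 10)

print("boot window:   ", boot)
print("typical window:", typical)
```

With these made-up samples, the boot window is the one to use for peak power analysis, while its average is roughly three times that of the typical window. Optimizing battery life against the boot window would therefore be driven by the wrong data.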

If the stimulus is known to be representative, it can also feed into the implementation tools to ensure that the same power goals and tradeoffs are being made throughout the flow. Introducing or changing stimuli late in the flow increases the chance of a non-convergent optimization flow, or at least one that takes longer to converge.

Then, as early as RTL synthesis, multi-mode, multi-corner (MMMC) optimization should be used, he said. That allows RTL physical synthesis tools to create power-optimized netlists, which include well-balanced logic to avoid glitching, optimal leakage optimization, advanced clock gating, multi-bit cell inferencing, and power-aware design-for-test.

“Like the architectural optimizations, these types of implementation optimizations have the most impact on power when introduced early in the flow,” Pursley said. “Introducing MMMC in layout or signoff changes the optimization goals partway through the flow. At best, this means that optimizations done by RTL synthesis were wasted and may be undone. At worst, you now have a flow that will take many iterations to converge through signoff, with an increased chance of a costly re-spin due to error-prone manual iterations.”

Methods for reducing power at RTL and below — power gating, clock gating, multi-Vdd, multi-threshold, DVFS — are well understood. The problem is that by the time RTL is available, the project is already well advanced and it’s too late to make any bigger changes, said Tim Kogel, principal applications engineer at Synopsys.

The biggest impact on power, energy, heat, and cost is achieved at the system level, and it works best when the design team has detailed knowledge of the end application and use cases. That allows engineers to group components into power domains that can be powered down as much as possible, as well as to define power management policy and operating points for DVFS. It also helps to figure out the best way to distribute workloads to processing and memory resources to stay within power and thermal budgets.

“The power needs to be considered and optimized well before RTL availability, at the architecture specification phase,” said Kogel. “The problem is that accurate data about the power consumption is typically not available during architecture specification phase. At best you have some data-sheet numbers and data from previous projects. It becomes worse when you try to roll up that premature data in spreadsheets because you are missing the dynamic effect of the application utilizing different components at different points in time. Even if the hardware implementation has been designed for low power, the effective power consumption is often much higher than expected because the software does not leverage the low-power mechanism provided by the hardware. Thus, a small oversight from the software developer can prevent a power domain from being shut down.”

To enable early power estimation, IEEE 1801 UPF has defined a standard format for system level power models. “This way UPF power monitors can be added to architecture models and virtual platforms for software development,” Kogel said. “Architects can analyze and optimize power based on the actual activity, and software developers become aware of the impact of their software on power consumption. Even if the power data is not accurate, trend-based analysis based on the simulated activity provides valuable insight. Later the initial power data can be refined as more accurate measurements become available.”
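
The difference between a static spreadsheet roll-up and activity-based analysis can be sketched as follows. This is plain Python, not UPF syntax, and the component names, per-state power numbers, and activity trace are all hypothetical placeholders of the kind that might come from data sheets or earlier projects.

```python
# Hypothetical per-state power numbers in milliwatts.
POWER_MW = {
    "cpu": {"off": 0.0, "idle": 20.0, "active": 400.0},
    "dsp": {"off": 0.0, "idle": 10.0, "active": 250.0},
}

# Hypothetical activity trace: (duration in ms, state of each component).
TRACE = [
    (10, {"cpu": "active", "dsp": "off"}),
    (30, {"cpu": "idle", "dsp": "active"}),
    (60, {"cpu": "idle", "dsp": "idle"}),
]


def average_power_mw(trace, power_mw):
    """Time-weighted average power over the simulated activity."""
    total_time = sum(duration for duration, _ in trace)
    energy = sum(
        duration * sum(power_mw[comp][state] for comp, state in states.items())
        for duration, states in trace
    )
    return energy / total_time


def static_rollup_mw(power_mw):
    """Spreadsheet-style roll-up that assumes every component is active."""
    return sum(states["active"] for states in power_mw.values())


print("activity-based average:", average_power_mw(TRACE, POWER_MW), "mW")
print("static roll-up:        ", static_rollup_mw(POWER_MW), "mW")
```

Even with rough numbers, changing the trace (for example, leaving the DSP idle instead of off in the first phase) shows the trend that software behavior has on power, which is the kind of insight Kogel describes.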

While characterization of system-level power models remains a challenge, it’s possible power characterization tools could be enhanced to generate system-level power models.

Power matters more
The need to care about power in chips has never been more pressing than it is today, because of the rapid rise in data generated by the proliferation of connected sensors and devices.

“In the days of PCs where the source of power supply used to be 220V AC, it was all fine,” said Mohammed Fahad, product specialist at Mentor, a Siemens Business. “But with the advent of handheld devices like smart phones and tablets, it’s not just that the geometries of the computing devices have shrunk. The devices are getting loaded with more and more apps and services. Possibilities to fabricate chips at smaller nodes have enabled the chipmakers to pack billions of transistors on even smaller silicon real estate. With enormously complex logic going into even tinier chips, the power consumption is getting on the critical path and often causing chips to burn out. Industry research has found that power is the second most frequent reason for chip re-spins. Billions of dollars in investment are going down the drain. This is why design companies today have very robust low-power methodologies in place, built around sophisticated power estimation and optimization tools.”

Performing power estimation is about knowing the power scenario of the chip. Designers would like to understand the overall power consumption of their blocks, where the hotspots are, and which areas are overshooting the budget. In other words, where is power being wasted? If power consumption of the chip stays within the budget, it’s all good news. But what if it doesn’t?

Fahad noted that RTL power estimation tools define a problem statement for RTL power optimization tools to address. These tools identify computational redundancies in the RTL and inform the user how these redundancies in the code could be eliminated. Tools also provide ways to automatically fix these redundancies and write out the power-optimized RTL. “Optimizing the RTL for power early in the design stages pays higher dividends than engaging later in the cycle. Therefore, low-power methodology demands that power optimization should be run well before the code freeze so that it is easy to make any power-saving code changes at the RTL or architectural level, if necessary.”

There are various ways in which a chip’s power consumption can be reduced or controlled, including gating the non-observable operations on flops and memories, stopping the design toggles for stable inputs and outputs and bypassing the stable memory accesses. At the architecture level, changing the shift register operations to circular buffer, and finding a common gating condition for blocks rather than just flops also can help.
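
One of the architectural changes mentioned above, replacing a shift register with a circular buffer, can be modeled behaviorally in Python. This is a rough sketch, not RTL. The class names are hypothetical, and the count of storage-element writes is used only as a crude proxy for switching activity, since real toggle counts depend on the data values and on the cost of the pointer logic, which is ignored here.

```python
class ShiftRegisterFifo:
    """Every push moves all entries by one position."""

    def __init__(self, depth):
        self.data = [0] * depth
        self.writes = 0  # storage-element writes, a proxy for activity

    def push(self, value):
        # Each stage loads from its neighbor, so all `depth` entries are written.
        self.data = [value] + self.data[:-1]
        self.writes += len(self.data)

    def window(self):
        """Stored values, newest first."""
        return list(self.data)


class CircularBuffer:
    """Every push overwrites only the oldest entry and advances a pointer."""

    def __init__(self, depth):
        self.data = [0] * depth
        self.head = 0  # index of the slot to overwrite next
        self.writes = 0

    def push(self, value):
        self.data[self.head] = value
        self.head = (self.head + 1) % len(self.data)
        self.writes += 1

    def window(self):
        """Stored values, newest first."""
        n = len(self.data)
        return [self.data[(self.head - 1 - i) % n] for i in range(n)]


shift, circ = ShiftRegisterFifo(4), CircularBuffer(4)
for sample in range(1, 7):
    shift.push(sample)
    circ.push(sample)

print(shift.window(), shift.writes)
print(circ.window(), circ.writes)
```

Both structures expose the same window of recent samples, but in this model the shift register performs one write per entry on every push, while the circular buffer performs a single write.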

Fundamentally, the key to effective power management, from the smallest battery-operated IoT devices to the hungriest GPU and SoC designs, is drawing only the power that is really needed. Different functions on a chip should run at the lowest voltage and clock speed that can deliver the required performance, while functions not currently in use should be on standby or turned off entirely. To accomplish this, complex chips have dozens or even hundreds of power domains, each of which controls the operating state for a portion of the design.
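
Picking the lowest voltage and clock speed that still delivers the required performance can be sketched as a small selection routine. The operating points below are hypothetical, and the power figure uses only the well-known approximation that dynamic power scales with voltage squared times frequency. Leakage and switching overhead between points are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    voltage_v: float
    freq_mhz: float

    def relative_dynamic_power(self) -> float:
        # Dynamic power scales roughly with V^2 * f (capacitance held constant).
        return self.voltage_v ** 2 * self.freq_mhz


# Hypothetical DVFS table for one power domain.
POINTS = [
    OperatingPoint(0.6, 400),
    OperatingPoint(0.8, 800),
    OperatingPoint(1.0, 1200),
]


def pick_operating_point(points, required_mhz):
    """Return the cheapest point meeting the demand, or None to power the domain off."""
    if required_mhz <= 0:
        return None  # nothing to do, so the domain can be turned off
    feasible = [p for p in points if p.freq_mhz >= required_mhz]
    if not feasible:
        raise ValueError("no operating point meets the required performance")
    return min(feasible, key=OperatingPoint.relative_dynamic_power)


print(pick_operating_point(POINTS, 500))
print(pick_operating_point(POINTS, 0))
```

In a real chip this policy would be applied per power domain and would have to respect the rules governing which domain combinations are legal.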

The rules for how these domains can be manipulated are usually quite complex, and iterating through all possible legal power combinations in simulation is impractical. One solution to finding and fixing potential problems may require applying existing tools in new ways.

“Rules for which power domains should be on or off depending on what the chip is doing can be captured in the form of assertions,” said Tom Anderson, technical marketing consultant at OneSpin Solutions, noting that formal can prove that only legal combinations of power domain settings are possible, or generate tests showing violations if there are bugs in the design. “Formal verification can prove that these rules are satisfied under all conditions or report bugs. Finding and fixing power-related issues pre-silicon is critical to avoid a chip that doesn’t work because key functions are powered down, or one that suffers thermal breakdown when too much of the chip is turned on at the same time.”
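
The idea of capturing power-domain rules as checkable conditions can be illustrated with a toy Python model. The domain names and rules are hypothetical, and the exhaustive enumeration is a stand-in used here only because the example has four domains. It is not how a formal tool works internally, and the number of combinations doubles with every added domain.

```python
from itertools import product

# Hypothetical power domains; True means the domain is powered on.
DOMAINS = ("mem_ctrl", "cpu", "gpu", "npu")

# Hypothetical rules, each a predicate that must hold for a legal state.
RULES = {
    "cpu_needs_mem_ctrl": lambda s: not s["cpu"] or s["mem_ctrl"],
    "gpu_needs_mem_ctrl": lambda s: not s["gpu"] or s["mem_ctrl"],
    "npu_needs_mem_ctrl": lambda s: not s["npu"] or s["mem_ctrl"],
    # Thermal budget: the three compute domains may not all be on at once.
    "thermal_limit": lambda s: not (s["cpu"] and s["gpu"] and s["npu"]),
}


def violated_rules(state):
    """Names of the rules that the given on/off state breaks."""
    return [name for name, rule in RULES.items() if not rule(state)]


def legal_states():
    """Enumerate every on/off combination and keep those breaking no rule."""
    all_states = (
        dict(zip(DOMAINS, bits))
        for bits in product((False, True), repeat=len(DOMAINS))
    )
    return [s for s in all_states if not violated_rules(s)]


everything_on = dict.fromkeys(DOMAINS, True)
print(len(legal_states()), "legal states out of", 2 ** len(DOMAINS))
print("all on violates:", violated_rules(everything_on))
```

With dozens or hundreds of domains this brute-force loop stops being practical, which is consistent with the article's point that simulating every combination is impractical and that proofs over assertions are used instead.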

For this, power estimation isn’t enough. “It’s not that we shouldn’t do anything at the RTL,” said Magdy Abadir, vice president of marketing at Helic. “You can run some RTL power estimation and do things like that, but it is not sufficient. Anything you can do to improve on your design at the RTL is always a plus, but it’s not going to be the full answer. Especially at the physical level, there are phenomena such as thermal and electromagnetic effects, and these effects can only be seen once the layout is complete. Once you have the actual physics, such as the IR drop, analysis needs to happen on the real physical layout that you are planning to implement. Only when you see the effects can you decide if it is acceptable or not. This is not something that can be characterized early on and just put in a library.”

Especially for high-power-consuming chips like GPUs, a lot depends on the application that is running.

“When people develop GPUs, it’s like developing a microprocessor in the old days,” Abadir said. “They don’t know exactly what applications people will be running, and it is general-purpose. There might be many, many customers and applications that would change over time. It may take a couple of years for that to get developed from the time it’s in RTL to the time it is on the shelf. During that time a lot of software gets written, a lot of apps will be developed. The algorithms are where the optimization needs to happen, and some of it depends on what type of algorithm you need to be running. If you’re doing pattern matching or if you’re doing sorting or searching, there are many different ways of executing these types of tasks. Every one of them has a different power, a different performance kind of characteristic. Depending on what you’re trying to do and how good your software developers are, at the end of the day, this is what determines the actual power consumption of the task.”

This is where knowledge of the end application really helps. “If I am developing a GPU and have knowledge of the type of application that would run on my chip eventually, which in a lot of cases people do, they try to do performance modeling and power modeling in the early stages to figure out the architecture — which type and what to do,” he said. “When it comes to power, it’s a very difficult problem. The problem is how to estimate power at the high level. Some approach this from a characterization point of view, which means you characterize gates and cells, the worst case for timing and for power. But in many cases, such as with GPUs, we’re doing things that have not been done before. Where do you get the models? We’re estimating at the high level how much power is required, and this can be a guessing game because it’s not accurate and can be way off from what happens with the real chips. This is because the actual power consumption has to do with the actual physical attributes of the chip.”

Another significant factor is the choice of algorithm. There may be several different sorting algorithms, for example, each of which may run at a different speed or have different memory requirements, and the tradeoffs among them can have a big impact on how much power is used.

“As a developer of the chip, at the RTL what do I do? I developed the GPU that can add and multiply, go through memory and get things to operate in parallel, have multiple threads,” Abadir said. “Most of the techniques for lowering the power come in the implementation stage for these types of chips, so doing it early requires control of the application. I need to control the algorithm. ‘Early’ means the keys are in the hands of the software people. ‘Later’ means the keys are in the hands of the hardware people. It might be both of them operating in hardware-software co-design, but later on somebody will pick the algorithm, and now the hardware guy needs to tweak all the possible things that are clock gating.”

The semiconductor industry is coming to grips with the fact that general-purpose chips are no longer the path forward. The new currency is data, and processing that data quickly with blazing fast throughput and access to memory are key design elements.

But making this happen without burning up a chip is a massive and growing challenge, and it’s only getting harder as the volume of data increases and the benefits of device scaling decrease. Power is the main gating factor, and it’s becoming much more difficult to fix as compute architectures grow more complex and demand for processing continues to rise.

