Taking Power Much More Seriously

Without the benefits of scaling to help reduce power consumption, design teams must take responsibility themselves. It all starts with the architecture.


An increasing number of electronic systems are becoming limited by thermal issues, and the only way to solve them is to elevate energy consumption to a primary design concern rather than treating it as a last-minute optimization.

The optimization of any system involves a complex balance of static and dynamic techniques. The goal is to achieve maximum functionality and performance in the smallest area possible, while using the least amount of energy. But until recently, power optimization was the last criterion to be considered, and it was done only if there was sufficient time after performance goals had been met. That is no longer a viable strategy for an increasing array of devices, because power is the primary limiter to what can be achieved. Unless power and energy are considered part of architectural analysis, including hardware/software partitioning, late-stage power optimization will not be sufficient to remain competitive.

Most of the optimization techniques being used today are deployed after the RTL has been completed, with some during detailed implementation. This is evident in gate sizing, for example, where positive timing slack is traded for smaller, slower gates that consume less power. Other techniques are applied during RTL implementation, such as clock gating, which uses hardware triggers to turn off clocks whenever it can be shown they are unnecessary.
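
Both techniques ultimately attack the same first-order relationship, P_dyn ≈ α·C·V²·f. As a rough illustration of why clock gating pays off, the sketch below (all numbers are assumptions, not measured values) shows how stopping idle-cycle clock toggles lowers the effective activity factor:

```python
# First-order CMOS dynamic power: P_dyn ~= activity * C_switched * Vdd^2 * f_clk.
# All numbers below are assumptions for illustration, not measured values.

def dynamic_power(activity, c_switched_farads, vdd_volts, f_hz):
    """Rough dynamic power estimate in watts."""
    return activity * c_switched_farads * vdd_volts**2 * f_hz

C_BLOCK = 2e-9   # assumed switched capacitance of a block, 2 nF
VDD = 0.8        # assumed supply voltage, volts
F_CLK = 1.0e9    # assumed clock frequency, 1 GHz

# Without gating, the clock tree and registers toggle every cycle even when idle.
ungated = dynamic_power(0.15, C_BLOCK, VDD, F_CLK)
# Clock gating stops those toggles during idle cycles, cutting effective activity.
gated = dynamic_power(0.04, C_BLOCK, VDD, F_CLK)

print(f"ungated: {ungated * 1e3:.0f} mW, gated: {gated * 1e3:.0f} mW")
```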

According to the Wilson Research Group and Siemens EDA 2022 Functional Verification Study, 72% of ASIC designs now use some form of active power management, as shown in Fig. 1.

Fig. 1: ASIC power management features that require verification. Source: Siemens EDA

“If you look back 10 years, that would have been 62%, but realistically in the past couple of studies it has leveled off around 72%,” said Harry Foster, chief verification scientist at Siemens EDA and the person who leads the study. “However, if you dig a little bit deeper and look at designs over 2 million gates, you find that it’s at 85%.”

Power in the development flow
Power issues are ubiquitous in that every design decision, from the largest to smallest, can impact power. “To understand anything about power and energy, you have to look at so many different factors,” says Rob Knoth, product management director in Cadence’s Digital & Signoff Group. “You need to understand functional activity, system intent, the physics, interconnect, and the gates — everything. To make meaningful decisions, you need to be very multidisciplinary, and to the customers who really care about this, the people who are doing incredibly energy dense computations, this matters.”

Attitudes are changing, although slowly. “The focus has been on performance and time to get to results,” says Amir Attarha, product manager for Veloce Power at Siemens EDA. “It is becoming time to get the results within a power budget, where the power budget may impact the time to get the results. It starts from a very high level, when you’re doing software/hardware partitioning and deciding which part needs to be in software and which part needs to be in hardware, and goes down to microarchitecture decisions like adaptive voltage scaling, or dynamic voltage and frequency scaling. Every one of these techniques involves a tradeoff. For example, you can’t immediately jump from one frequency to another. Does it provide enough benefit for every algorithm that they have?”
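
A minimal sketch of that tradeoff, with purely illustrative numbers: dropping to a lower voltage/frequency operating point only pays off if the upcoming phase of work is long enough to amortize the transition overhead, and only if the longer run time at the lower frequency is acceptable.

```python
# Is a DVFS transition worth it for the next phase of work? (Assumed numbers.)
# Energy per operation scales roughly with C_eff * Vdd^2; the switch itself
# costs energy and time, and the lower frequency stretches the run time.

def run_energy(ops, c_eff, vdd):
    """Approximate energy (J) to execute 'ops' operations at supply 'vdd'."""
    return ops * c_eff * vdd**2

C_EFF = 1e-9              # assumed effective capacitance per operation, J/V^2
OPS = 5e6                 # operations in the upcoming workload phase
TRANSITION_ENERGY = 5e-4  # assumed cost of changing the operating point, J

stay_fast   = run_energy(OPS, C_EFF, vdd=0.9)
switch_down = run_energy(OPS, C_EFF, vdd=0.7) + TRANSITION_ENERGY

print(f"stay at 0.9 V: {stay_fast * 1e3:.2f} mJ")
print(f"drop to 0.7 V: {switch_down * 1e3:.2f} mJ (plus longer run time)")
print("worth switching" if switch_down < stay_fast else "not worth the transition")
```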

Schedule can indirectly influence it, as well. “Invariably, a project has power goals up front, but power features are last on the docket due to the need to deliver functionality that enables verification work to get started,” says Steve Hoover, founder and CEO for Redwood EDA. “By the time power becomes a focus, the implementation team has made progress with layouts and the project has fallen behind schedule. Adding clock gating would create new bugs, put pressure on timing and area, and necessitate rework for the implementation team. None of this is desirable when the schedule is the top priority. So management makes the difficult decision to accept a bit more platform cost and tape-out silicon that runs a bit hot.”

Power issues shifting left
Software is beginning to play a larger role. “Companies are thinking of power as a system-level problem that encompasses both software and hardware,” says Piyush Sancheti, vice president of system architects at Synopsys. “It is a constant tradeoff in how a system is architected, deciding how much of the power management intelligence to build in hardware versus software, and how complex it is from a software design standpoint and hardware design — and ultimately the verification and validation of such a system.”

This creates new demands on tools, as well. “It often requires levels of analysis not considered in the past, and automation only becomes possible once the techniques have started to become democratized,” says Cadence’s Knoth. “We have started to see a broader footprint of customers ask us about this, and now we have to work out how to make it more available, how we package it, how we automate it, how we make it more useful. One of the first areas, as you move up the food chain, is to start looking at partitioning. What do we need to provide in design space exploration? How do we more nimbly mock up the partitioning and still get enough accuracy with the estimates?”

High-level goals sometimes can be more important than local optimization. “For cloud workloads, latency and response times are critical,” says Madhu Rangarajan, vice president of products at Ampere Computing. “Any power management technique has to avoid latency penalties, which may optimize for a local power minimum in one server but compromise the system as a whole. This can result in higher power being consumed overall. It also will reduce the total throughput of the data center, thereby reducing the revenue generated by the cloud service. All power management techniques must work without compromising on the fundamental tenet of not increasing latency.”

This is why power needs to be dealt with at the highest possible level, with a well-defined methodology that progresses the power goals through the design flow. “Where does power fit into your importance overall?” asks Knoth. “That guides the type of design techniques you are going to use. It’s crucial that people don’t just automatically jump to, ‘I’m going to try to squeeze out every ounce of inefficient power immediately.’ That can add extra latency and an overly complicated power grid, because you have all these small domains scattered out everywhere. All of those cost something, even if it’s just schedule to design and verify it. And if the return isn’t going to justify the cost you are spending on it, you’re probably making a bad decision.”

Power complexity
While it may seem fairly simple to just add another power domain, there can be many hidden costs and potential problems. “You have to consider the entire power grid and the impact of any change on that,” says Marc Swinnen, director of product marketing at Ansys. “You need to do a full transient analysis, and one of the hardest things about power switching is managing the power surges. Peak power happens when you turn something on. It is not just the block that is being switched on, but all the surrounding logic that feels that current draw and experiences a voltage drop. It is not free. Switching a block on costs you a certain amount of power, and you have to include that tradeoff. If I am not using a block for a brief period, is it worth switching it off, given that it will cost me power and time to switch it back on?”
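
Swinnen’s point can be reduced to a break-even calculation: the block must stay off long enough for the leakage it saves to exceed the energy burned switching the domain off and back on. A sketch with assumed numbers:

```python
# Break-even idle time for power gating a block (assumed numbers only).
LEAKAGE_SAVED_W = 0.050  # leakage power saved while the block is off, watts
E_SWITCH_J = 0.002       # energy to power the domain down and back up
                         # (rush current, state save/restore, re-init)

break_even_s = E_SWITCH_J / LEAKAGE_SAVED_W
print(f"gating only pays off for idle periods longer than ~{break_even_s * 1e3:.0f} ms")

def worth_gating(idle_s):
    """True if shutting the block down over this idle window saves net energy."""
    return idle_s * LEAKAGE_SAVED_W > E_SWITCH_J

print(worth_gating(0.010), worth_gating(0.250))  # 10 ms: no, 250 ms: yes
```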

And that can also influence functional verification. “When you turn something off and then back on, you have to verify that works correctly,” says Siemens’ Foster. “Should the block have retained state, and did that happen correctly? You have to verify the power transitions because basically you have a conceptual state machine, and you have to verify the transitions of these power states.”
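
That conceptual state machine can be made concrete as a table of legal power-state transitions that a monitor checks observed behavior against. The states and rules below are illustrative only, not taken from any particular low-power specification:

```python
# Simplified power-state transition checker. States and rules are illustrative,
# not taken from any particular low-power specification.
LEGAL = {
    "ON":        {"RETENTION", "OFF"},
    "RETENTION": {"ON"},   # state is held; must wake through ON
    "OFF":       {"ON"},   # state is lost; wake-up requires re-initialization
}

def check_trace(trace):
    """Flag any observed transition that is not in the legal set."""
    for cur, nxt in zip(trace, trace[1:]):
        if nxt not in LEGAL.get(cur, set()):
            print(f"illegal power-state transition: {cur} -> {nxt}")

# The RETENTION -> OFF hop below would be flagged for the verification team.
check_trace(["ON", "RETENTION", "OFF", "ON"])
```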

Consideration for thermal adds another level of complexity because of time constants. “While power has a very short time constant, thermal can have a very long time constant,” says Knoth. “There is a place for both hardware and software control in terms of thermal management. Some things are best controlled instantaneously with hardware, and these can help prevent a thermal problem from occurring. Thermal effects aren’t instantaneous, they build up over time and dissipate over time. Software control plays a very important part in making sure the overall system isn’t violating a thermal budget. It’s not a problem that is only solved by one or the other. It requires a handshake between the two.”

Time constants can be exploited. “The peak power requirements of most systems are larger than what they can dissipate as heat in the long term, although short peaks can be exploited to deliver higher performance when usage patterns include recovery periods where the system can cool down,” says Chris Redpath, technology lead of power software within Arm’s Central Engineering. “This requires a complex system of dynamic power controls to operate in both hardware and software.”
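
A first-order thermal RC model shows why such bursts are survivable: die temperature approaches its steady-state value with a time constant of seconds, not nanoseconds, so a short peak that would eventually violate the limit is harmless if recovery periods follow. The parameters here are assumptions for illustration:

```python
# First-order thermal model: die temperature follows an RC-like response toward
# T_ambient + P * R_th, with time constant tau = R_th * C_th. Parameters are
# assumptions for illustration only.
import math

R_TH = 0.5    # assumed junction-to-ambient thermal resistance, K/W
C_TH = 20.0   # assumed thermal capacitance, J/K
TAU = R_TH * C_TH  # ~10 s, versus nanosecond-scale power events

def temp_after(t_s, power_w, t_start=25.0, t_ambient=25.0):
    """Die temperature (C) after running at 'power_w' for t_s seconds."""
    t_final = t_ambient + power_w * R_TH
    return t_final + (t_start - t_final) * math.exp(-t_s / TAU)

# A 150 W burst would settle at 100 C, well past an assumed 85 C limit,
# yet a 2-second burst from cold barely moves the needle if cooldown follows.
print(f"after 2 s burst: {temp_after(2.0, 150.0):.1f} C")
print(f"steady state:    {25.0 + 150.0 * R_TH:.1f} C")
```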

This is one of the issues that is driving the notion of shift left. “We are being asked to put together solutions that shift left the earliest dimension of placement and routing,” says Siemens’ Attarha. “You need this to start doing thermal analysis. Switching data can identify high-activity areas in your workload, but you need to be able to map that to early physical placement, and then using physics you can calculate the possible temperature.”
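
A minimal sketch of that kind of early screen, with purely illustrative block names and numbers: per-block power derived from switching activity is mapped onto rough floorplan areas, and the resulting power densities flag where thermal analysis is warranted.

```python
# Early power-density screen: map per-block power (derived from switching
# activity) onto rough floorplan areas and flag likely hotspots.
# Block names, powers, and areas are purely illustrative.
blocks = {
    # name:        (estimated power in W, early floorplan area in mm^2)
    "cpu_cluster": (3.0, 4.0),
    "gpu":         (5.0, 10.0),
    "ddr_phy":     (0.8, 2.0),
    "always_on":   (0.05, 0.5),
}

DENSITY_LIMIT_W_MM2 = 0.6  # assumed threshold that triggers detailed thermal analysis

for name, (power_w, area_mm2) in blocks.items():
    density = power_w / area_mm2
    flag = "  <-- needs early thermal analysis" if density > DENSITY_LIMIT_W_MM2 else ""
    print(f"{name:12s} {density:.2f} W/mm^2{flag}")
```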

Accuracy and abstraction
Assumptions used in the past are not accurate enough. “You need to know the current flowing through all of the wires so that you can calculate voltage drop,” says Ansys’ Swinnen. “But this is temperature-dependent, so you need to know the global temperature, which depends on the heatsink and the environment, but temperature varies across the chip. In the past, a single temperature across the whole chip was accurate enough. But now we need to do thermal modeling and include Joule self-heating. As current flows through a wire, it will heat it up.”
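
The temperature dependence Swinnen describes follows the familiar linear approximation R(T) = R0·(1 + α·(T − T0)): hotter regions see more resistance, more IR drop, and more self-heating for the same current. The values below are generic copper-like assumptions, not foundry data:

```python
# IR drop with temperature-dependent wire resistance, using the linear
# approximation R(T) = R0 * (1 + alpha * (T - T0)). Generic copper-like values,
# not foundry data.
ALPHA_CU = 0.0039  # approximate temperature coefficient of copper, 1/K
R0 = 2.0           # wire resistance at the reference temperature, ohms
T0 = 25.0          # reference temperature, C

def resistance(temp_c):
    """Metal resistance rises roughly linearly with temperature."""
    return R0 * (1.0 + ALPHA_CU * (temp_c - T0))

I_WIRE = 0.010  # 10 mA through the wire
for t in (25.0, 85.0, 110.0):
    r = resistance(t)
    print(f"{t:5.1f} C: R = {r:.2f} ohm, IR drop = {I_WIRE * r * 1e3:.1f} mV")

# Joule self-heating (P = I^2 * R) feeds this back: more heat means more
# resistance, more drop, and more heat, which is why one chip-wide temperature
# is no longer an adequate assumption.
```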

This aspect of the problem will explode with chiplets. “In a heterogeneous integration context, you’re dealing with different materials and different nodes,” says Shekhar Kapoor, senior director of marketing at Synopsys. “And you have different substrates, which are probably varied, as well. With all these different thermal expansions you will see varying amounts of warpage and mechanical stress coming into the picture. So first you have this problem with power density, which could be high because of transistor density, and now you have additional thermal problems. These issues cannot be ignored, and they have to be part of system planning and can be better managed if you appropriately partition your design up front. Then, you create the necessary models and do hierarchical analysis while you’re doing architecture planning upfront to manage and address that. So partitioning and power and thermal all go hand in hand when you are looking into the multi-die system.”

Getting the right models can be tricky. “Any time you’re dealing with IP integration, there’s a natural amount of abstraction that has to happen, just because of the scope of the problem,” says Knoth. “The amount of data and the information you’re trying to juggle makes it that much harder. Also, if you can’t change anything inside of a box, knowing about the guts of that box can be extra cost that you can’t quite afford in terms of compute power, time, turnaround time, etc. You’ll start seeing more and more relevancy of higher-order models as they become more numerous. But the more you abstract, the more you lose some of that fine-grained fidelity. Depending upon how that chip was architected, you may not want a model. You actually want to be able to chew on fine-grained power gating so that you can accurately make it dance the way that it’s intended.”

More models required
It is one thing to deal with these issues when you are designing all of the chiplets, but it’s very different if a third-party chiplet market ever becomes a reality. “At the chiplet level, when all these dies or chiplets are coming from different sources, one of the big emerging needs is for electrical, thermal, and mechanical (ETM) models that will have to come along with them,” said Synopsys’ Kapoor. “Are you going to have an enclosed system, stacked systems, or are you going to have a 2.5D package? All those sorts of modeling requirements are emerging. Airflow is another thing. What airflow can you expect? That will be coming into the picture. Today, it is manually created models that are being used, but is there any standardization? Not yet, but as chiplets become more prevalent, those models will emerge.”

The need for these models is evident at the highest levels of abstraction. “There is a lot of room for standards and open interfaces to enable fine-grained power management that can be seamlessly applied, even in heterogeneous systems,” says Ampere’s Rangarajan. “For example, if half of the threads in a host-CPU are idle, would it be possible to shut down portions of a DPU to save additional power at a platform level? And can these portions be woken up at an acceptable latency when the host-CPU needs it? You already can see many examples of joint hardware and software power management in the ACPI power management mechanisms, but these are written with a client and legacy server focus. They need to evolve significantly to be useful in a cloud native world. This will involve new hardware designs and new software/hardware interfaces.”

But equally, those models have to work at the most detailed of levels. “On a single die, inductance has not been an issue because the distances and lengths are small enough,” says Swinnen. “But when you get to interposers, which have power supply interconnect and thousands of signals using fine dimensions over long distances, the power supply ripple, or noise, can be communicated electromagnetically to coupled lines. If there is a bus line running 3 or 4 cm across the interposer next to a power supply line, and the power supply has a ripple, the signal will feel that.”

Conclusion
Power and thermal are becoming pervasive issues that will, in many cases, separate successful products from the rest. They will impact the entire development flow, from concept creation through architectural analysis and hardware/software partitioning, to design, implementation, and integration of blocks, dies, and packages.

Models for many aspects of this are being cobbled together manually today. But it will take a wide range of models, with the right abstractions and performance, to handle the myriad tasks that lie ahead.


