
Searching For Power Bugs

Finding wasted power requires knowing what to expect, how to measure it, and how it correlates to real silicon. We are further from that than you might expect.

How much power is your design meant to consume while performing a particular function? For many designs, getting this right may separate success from failure, but knowing the right number is not as easy as it sounds. Significant gaps remain between what power analysis predicts and what silicon consumes.

As fast as known gaps are closed, new challenges and demands are being placed on the tools. This makes power analysis, and early attempts at power optimization, one of the most innovative areas of EDA. Various markets are concerned with different aspects of power, each of which affects specific aspects of the design or implementation process, while shrinking geometry sizes are adding new physical effects that are not yet fully incorporated.

Design and verification teams are having to reconfigure to meet these challenges while balancing the ROI that comes from reducing power against improved designs or cheaper products. “Power-aware design is critically important and gets a lot of attention but is not a straightforward process at all,” says James Myers, distinguished engineer at Arm. “It’s also totally different, depending on the kind of design.”

That is drawing a lot of attention to the issue. “Every single customer is interested in power,” says Rob Knoth, product management director at Cadence. “But what power means changes with every single conversation. There are products that live and die by how much power they consume. Others are more concerned with how many air conditioners they will require, or if they can power devices from ambient energy. They’re all going to look at power and power bugs through different lenses.”

Defining a power bug may sound easy, but what is it? “We defined a power bug as undesired power consumption,” says Preeti Gupta, head of PowerArtist product management at Ansys. “It’s not helping with the functionality. But power is a number. If my design consumes 500 milliwatts, how do I know whether that is optimal? Is that 5X from where I should be?”

Hunting for power bugs
Perhaps an even more important question gets asked when a chip comes back and consumes more power than expected. How do you find the cause of that power bug, and what caused the divergence between predicted and actual power? Where did the process let you down? Some potential disconnects are shown in Figure 1.

Fig. 1: Today’s ad-hoc power analysis. Source: Cadence

The bug could be down at the very detailed level, at the highest level of abstraction, and everywhere in between. Device flexibility can be a blessing and a curse. “Today, designers are presented with an array of devices they can use,” says Haran Thanikasalam, senior staff applications engineer in Synopsys’ Design Group. “For example, foundries provide high Vt devices, low Vt devices, and ultra-low Vt devices. If you go to the high Vt devices, those are slower devices, but they dissipate less power, whereas if we go to ultra low Vt devices, they are extremely fast, but at the same time they leak really badly. There could be a lot of power being wasted in the decisions made. In addition, when combining low Vt devices and high Vt devices, certain foundries or processes do not allow these two diffusions to merge together, so they have to be separate diffusions and that affects your area.”

At the latest nodes, new effects come into play. “Today’s devices are almost at the angstrom level,” adds Thanikasalam. “Even a slight variation can massively change the way a device is going to work. How do we even measure power, and how do we correlate these power figures provided by simulation with the actual silicon? This is a growing problem, because on silicon you cannot pinpoint exactly how much a particular block, such as a memory, is wasting because there is no way that you can measure that information.”

At the system level, different problems are found. “Simulators are inherently limited by the number of cycles they can simulate, or the number of realistic scenarios they can run,” says Ansys’ Gupta. “Users need tools and methodology that can take real chip-level traffic and model that early on. They need to consider that for a billion clock cycles, there are different modes of operations, and this is the corresponding power profile. This is when my video IP is turning on and turning off, when my CPU subsystem or my GPU subsystem is idle, and any exposed power bugs will have a very high impact. Imagine a scenario where for several seconds your GPU subsystem could have been turned off. You cannot recognize that in a simple simulation scenario, but you may be able to recognize that in realistic application scenarios.”
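The idle-GPU scenario Gupta describes can be illustrated with a small Python sketch. It scans a per-cycle activity trace for stretches where a block did nothing for long enough that it could have been shut off. The function name, the trace format (one toggle count per cycle), and the sample values are assumptions made for illustration, not the interface of any tool mentioned here.

```python
from typing import List, Tuple


def find_idle_windows(activity: List[int], min_cycles: int) -> List[Tuple[int, int]]:
    """Return (start, end) cycle ranges, end exclusive, where the block
    shows zero activity for at least min_cycles consecutive cycles."""
    windows = []
    start = None  # first cycle of the idle run currently being tracked
    for cycle, toggles in enumerate(activity):
        if toggles == 0:
            if start is None:
                start = cycle
        else:
            if start is not None and cycle - start >= min_cycles:
                windows.append((start, cycle))
            start = None
    # Close out an idle run that lasts until the end of the trace.
    if start is not None and len(activity) - start >= min_cycles:
        windows.append((start, len(activity)))
    return windows


# Hypothetical toggle counts per cycle for a GPU subsystem.
gpu_activity = [5, 3, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 0]
print(find_idle_windows(gpu_activity, min_cycles=3))  # [(2, 6), (10, 13)]
```

A short directed test rarely contains idle runs this long, which is why such windows tend to show up only in realistic, long-running traffic.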

That creates the age-old dilemma of abstraction and fidelity. “As you go to higher levels of abstraction you cannot have the same kind of accuracy as you would when the design is more well defined,” adds Gupta. “But early analysis does provide insights into higher impact power issues. Many teams concentrate on the RT-level today. For example, you are looking at a design with millions of flip flops and you want to extract one common high-level enable. But at RTL, a clock net is ideal, and that can provide less predictable results. We have to estimate loads, the kind of capacitance it will drive. We do clock gate splitting, we do buffer sizing, we create mesh networks, tree networks, with the idea that this is not about getting to a power number that is super accurate, but these high-level power debug scenarios should be recognized with fidelity.”

Widening concerns
Teams must clearly understand their power concerns. “What are you most worried about?” asks Cadence’s Knoth. Figure 2 pictorially shows just a few of the potential concerns. “Is it a thermal concern? Is it a peak power concern? Is it a standby power problem? Is it a di/dt problem? Are you worried about wake-up rush currents? Even when you know what the concerns are, you have to ask when would you have the appropriate stimulus available to do that power or thermal analysis correctly so that we’re not making an incorrect conclusion and either adding too much margin to the product or delaying the schedule. It’s a very co-dependent problem.”


Fig. 2: Five power scenarios. Source: Cadence.

Gupta agrees. “Some people started measuring power at the gate level in order to see if the power grid has been constructed properly to be able to sustain that power. Is the package enough to sustain that power? From there it has morphed into much more complex scenarios: hundreds of power gating conditions, dynamic voltage, and frequency scaling. You bring all these factors together and the complexity shoots up.”

Some designs acknowledge that feedback loops have to exist to allow for power issues to be dealt with in situ. “In servers, the focus is likely to be on maximizing throughput within a fixed thermal envelope, which comes down to active power/GHz, thermal management, and tolerance to supply noise,” says Arm’s Myers. “Some of this can be done post-silicon by characterizing voltage and temperature sensors then tuning system management software — so configurability is important. But there are also complicated hardware design feedback loops, such as when to throttle a particular block to maintain system integrity with minimal throughput impact. Current spikes are problematic for system integrity but depend upon context such as decoupling capacitance, neighboring blocks in the floorplan, present DVFS point, regulator load, package layout and more. Given all of these factors, the feedback loop is typically too long and risks instability, so new methodologies are required, because too much static margin can directly impact performance.”

Some people care about the integral of power — energy. “Some companies are changing the conversation from what are we doing to optimize power, to what are we doing to optimize energy?” says Knoth. “At the end of the day, energy is what’s actually accomplishing work. Power in many ways is an easier thing for us to measure and easier thing for us to juggle, but it really is energy that is the end goal, and the more we can directly measure it, the more we can create tools that help to understand it and work with it.”
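To make the power-versus-energy distinction concrete, here is a minimal Python sketch that integrates a sampled power trace into energy using the trapezoidal rule. The two profiles and their numbers are invented for illustration only. They show how a profile with a higher peak power can still consume less total energy.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power (watts) over time (seconds) with the trapezoidal rule."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical profiles sampled once per second.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
steady = [0.5, 0.5, 0.5, 0.5, 0.5]  # low, constant power for the whole window
burst = [1.2, 1.2, 0.0, 0.0, 0.0]   # higher peak, then shut off

print("steady:", max(steady), "W peak,", energy_joules(times, steady), "J")
print("burst: ", max(burst), "W peak,", round(energy_joules(times, burst), 3), "J")
```

With these made-up numbers, the burst profile peaks at more than twice the power of the steady one yet uses about 1.8 J against 2.0 J.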

Many aspects of power that used to be analyzed separately are now becoming linked by physical attributes. Thermal impacts both static and dynamic power and that also affects timing. Activity creates heat, so there is a feedback loop. Scenarios have to be long enough to not only create the heat, but allow for the dissipation of that heat across a die to see the impact it will have on neighboring devices.

“Part of power consumption is determined by the structure of the functionality,” says Gupta. “The second part is determined by how activity is flowing in your design. Activity has a first-order impact on power, and of course layout and variation effects also matter. There’s a lot of focus around clocks because it is the fastest signal in your design and it controls a lot of the power consumption that happens within the design. You shut off the clock and you save a lot of power. You shut off the supply and you save even more power.”
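Gupta's point about activity, clocks, and supplies lines up with the standard first-order textbook model of CMOS power. The article does not give this formula. It is added here as background, and short-circuit power is ignored:

```latex
P_{\text{total}} = P_{\text{dyn}} + P_{\text{leak}},
\qquad
P_{\text{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f
```

Here $\alpha$ is the switching activity, $C$ the switched capacitance, $V_{DD}$ the supply voltage, and $f$ the clock frequency. Gating a clock drives $\alpha$ toward zero for the gated logic, which removes most of its $P_{\text{dyn}}$ but leaves $P_{\text{leak}}$ in place. Shutting off the supply removes both terms for that domain, which is why it saves more.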

Power has to be an integral part of the process. “It’s a multi-layered approach,” says Knoth. “You have to consider what kinds of effective conclusions you can draw with the information you have today. As the design progresses, and things mature, you’re able to get more accuracy and you are able to get more insight into the product, but sometimes the amount that you can change in the product reduces. It gets less over time. You have the most flexibility early, but the least accuracy. When do you need to lock down certain decisions about packaging, about heatsinks, about power grid robustness? You have to be looking at this in terms of the overall product schedule.”

That is always a delicate balance. “With silicon development, there is always the challenge and risk that you will run into the GIGO effect (Garbage In Garbage Out),” says Dan Cermak, vice president of architecture and product planning at Ambiq. “Performing power debug too early in the design stage could give you misleading/erroneous results, but waiting until the final design is complete to start the power debug is too late to affect meaningful change.”

Result fidelity
The accuracy of simulation results is determined by how well the necessary physical effects can be modeled. Functionality is all about ones and zeros, but for power issues that can be problematic.

“Consider a memory,” says Synopsys’ Thanikasalam. “There are bit-lines that run through SRAM memory, and the primary power for that memory comes from these bit-line swings. They dissipate a lot of power. When you do simulation, you have the capability to set them to either VDD or they are grounded. In real silicon you have no way of doing that. Even if a bit line comes up at VDD, over time that bit line is going to start leaking because there’s nothing holding that bit line to that VDD point. These are differential pairs, and they can just come up right in the middle and then consume a lot of power. So there’s a big correlation gap between simulators and how real silicon works.”

Even when digital abstractions are assumed, there is plenty of room for error. “There is the issue with determining the appropriate workloads/scenarios that must be analyzed,” says Ambiq’s Cermak. “Is it a representative workload? Does it cover all critical operating modes of the design? For larger designs, this problem becomes compounded since you have to break these workloads into smaller micro-workloads to assess practically.”

“Your power analysis is only as good as your vectors,” adds Knoth. “You have to be looking at the problem one level up, where you’re looking at, ‘What’s the coverage of this vector? What does the activity look like?’ We’ve invested quite a bit in building utilities that help customers do more work with the stimulus itself, to merge different vectors together to create new scenarios, to scale activity in one vector versus another vector.”
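What merging and scaling vectors might look like can be sketched in a few lines of Python. This is not Cadence's utility. The data layout (a dictionary mapping signal names to toggles per cycle), the function names, and the scenario weights are all assumptions chosen to illustrate the idea.

```python
from typing import Dict

Vector = Dict[str, float]  # signal name -> average toggles per cycle


def scale_activity(vector: Vector, factor: float) -> Vector:
    """Scale every signal's activity by a constant factor."""
    return {sig: rate * factor for sig, rate in vector.items()}


def merge_vectors(vectors: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Blend several vectors into one scenario, weighting each by the
    share of time the design is assumed to spend in it."""
    total = sum(weights[name] for name in vectors)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    merged: Vector = {}
    for name, vec in vectors.items():
        share = weights[name] / total
        for sig, rate in vec.items():
            merged[sig] = merged.get(sig, 0.0) + share * rate
    return merged


# Hypothetical vectors: video playback and an idle period.
video = {"clk": 1.0, "vid_en": 0.5}
idle = {"clk": 1.0, "vid_en": 0.0}

# Assume the device spends one part of its time in video, three parts idle.
scenario = merge_vectors({"video": video, "idle": idle}, {"video": 1, "idle": 3})
print(scenario)  # {'clk': 1.0, 'vid_en': 0.125}
```

The merged result also shows why coverage matters: the free-running clock dominates activity in both modes, while the enable signal almost disappears once idle time is weighted in.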

The whole process is a lot more complex than functional verification. “Unless you stimulate a part of the circuit, you don’t toggle that device and there is no heat coming off it,” says Thanikasalam. “You have to make test benches more rigorous by making sure that every part of the circuit is actually toggling when doing the simulation. That has a negative effect on the performance, and it takes more time and it takes more capacity. It’s never a single problem anymore. You have to solve everything at the same time. Isolating a single effect is becoming extremely hard.”

You cannot arbitrarily create large vector sets. “I may have thousands of vectors,” says Gupta. “How do I identify which are the most active signals that are common across all these vectors? I have timing critical paths, how do I characterize the timing power sensitivity along those paths in order to make design decisions? Methodologies need the capability to store all kinds of power-related data, then a framework and an API where users can look across large designs, and long vectors and help them gain meaningful insights.”
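One simple way to approach Gupta's first question is to take the top-N most active signals in each vector and intersect the sets. The Python sketch below does that. The toggle counts and signal names are hypothetical, and a real flow would query a power database rather than in-memory dictionaries.

```python
from typing import Dict, List, Set


def top_signals(vector: Dict[str, int], n: int) -> Set[str]:
    """Return the n signals with the highest toggle counts in one vector.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(vector, key=lambda sig: (-vector[sig], sig))
    return set(ranked[:n])


def common_hot_signals(vectors: List[Dict[str, int]], n: int) -> List[str]:
    """Signals that rank in the top n of every vector."""
    if not vectors:
        return []
    common = top_signals(vectors[0], n)
    for vec in vectors[1:]:
        common &= top_signals(vec, n)
    return sorted(common)


# Hypothetical toggle counts from three vectors.
v1 = {"clk": 1000, "bus_a": 400, "bus_b": 50, "irq": 2}
v2 = {"clk": 1000, "bus_a": 10, "bus_b": 600, "irq": 1}
v3 = {"clk": 1000, "bus_a": 300, "bus_b": 700, "irq": 0}

print(common_hot_signals([v1, v2, v3], n=2))  # ['clk']
```

Here only the clock is hot in every vector, so an optimization aimed at it pays off regardless of workload, while the buses matter only in some scenarios.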

IP issues
When using IP in the design, there can be questions about the fidelity of the power models being provided. “EDA tools are good at pinpointing contributors to power, if they are in digital logic and you have appropriate simulation stimulus,” says Myers. “But they are little help for checking inside macros like memories or mixed-signal parts of the design where you are reliant on your designer or IP provider. Fortunately, there are standards initiatives in this area such as IEEE 1801, which is pursuing enhanced modelling of power-aware macros.”

It is early days. “The industry has made some really good strides toward normalizing the fact that power information is just as important as timing information when you’re packaging up and selling IP,” says Knoth. “Even if you just look at how timing models have been shipped around the industry, there’s been an incredible amount of evolution that’s happened since Liberty models were first introduced, and power has an additional dimension than timing.”

Right now, issues remain. “When design houses define their power specs to their end customers, there are a lot of assumptions,” says Thanikasalam. “These settings were used for that power case, and those specific settings may not even be possible on real silicon. So there is this discrepancy between what is quoted based on simulation and the actual number provided by silicon.”

Whose responsibility?
Design and verification teams work together and yet independently. Understanding power requires a lot more design knowledge than functional verification, so who is ultimately responsible for finding power bugs?

“The companies that are more successful with power have created a new team called the power methodology team,” says Gupta. “This team fits between the design team and the verification team. They are the folks who take the designs created by the RTL designer and run these power analyses and figure out what changes can be made. Then they manage those changes through the design community. They work with the verification engineers in that they help them recognize what a power vector should be.”

The verification team traditionally has been the maintainer of the vector set. “The industry has matured such that the functional verification of a product and the power analysis and optimization of the product are joining together,” says Knoth. “Those two really need to be one and the same, or neither one does their job as effectively as they could. Someone doing functional verification can keep an eye on power. You shouldn’t be forcing people to use a wholly separate ecosystem of tools or sets of runs. It is incumbent upon the EDA industry to make that as painless as possible to translate waveforms into watts.”

That may create a conflict of objectives. “Over time, I have seen the power methodology teams start hiring verification engineers because they were competing with functional regression resources,” says Gupta. “Without functionality, the chip is nothing. So it was hard for them to lobby and campaign for power vectors, and the paradigm shift is that power methodology engineers are now writing vectors for power.”

Conclusion
A huge amount of investment and innovation is going into power analysis tools today, and there are no easy answers. Users are forced to make tradeoffs between the extensiveness of the test and the fidelity of the results, and to assess what is necessary for each decision that must be made along the development path. But this is only the beginning of the journey.

Analysis is the first stage in developing a methodology, and it needs to be followed by insights, optimization, and automation. Some of this is arriving even as the problem space continues to evolve.

“When you switch from having a power focus to an energy focus you’ve got an additional degree of freedom that you didn’t have before,” says Knoth. “It’s pretty fascinating to look at what that could do to things like place-and-route and synthesis. There are some amazing opportunities for innovation once you start considering energy versus power.”


