Choosing Power-Saving Techniques

There are so many options that the best ones aren’t always obvious.


Engineers have come up with a long list of ways to save power in chip and system designs, but there are few rules to determine which approaches work best for any given design.

There is widespread confusion about which techniques should be used where, which IP or subsystem is best, and how everything should be packaged together. The choices include everything from the proper level of clock and power gating to frequency scaling. At advanced nodes, the options range from choosing the right finFET node, because leakage increases with each new generation, to multi-chip packaging, where thin wires are replaced by fatter and therefore lower-power interconnects (it takes less energy to drive a signal through a wider pipe), to FD-SOI, where chipmakers can use forward and reverse biasing.

“The choices depend on the end use case power goals, followed by analysis of what techniques have the greatest opportunity in terms of time and energy,” said Ashley Crawford, power architect at ARM. “In other words, do the numbers.”

An overall energy-efficient design organization is paramount, and a solid clocking and clock-gating strategy is also invariably critical, Crawford said. “Beyond that, into scaling and leakage management techniques for example, you have to understand the application and have use case goals to make choices. One pitfall is to believe that a technique will offer more simply because it’s more complicated. That is quite often not the case. EDA is important for implementation of techniques as well as for power modeling and analysis. But the important high-level choices of design and system architecture are not something EDA can reach to today.”

In other words, design teams need to decide which techniques and approaches are best based upon their experience.

“Right now there are so many different kinds of designs, so many different techniques people are going to have to try, which doesn’t necessarily mean implementing all the techniques,” said Luke Lang, director, low-power product engineering at Cadence. “When you implement something that you never use or hardly ever go into, it’s actually costing you power, so it is very important to do that early analysis. If I’m going to do this low power, does it pan out? It may be that it pans out for my design, but not for your design, for example.”

This is why there has been so much focus on power formats, particularly after 65nm.

“One of the key things in doing the power format is to be able to specify your power scenario without changing your design RTL, so at the RTL you’re able to specify all these different power scenarios, and you can run various analysis early on,” Lang said. “The key part of that is to run RT-level analysis, specifically RTL power estimation, so if this many state retention registers are added, those will take extra power, for example, during functional modes. But if I hardly ever go into power shutoff modes, overall I’m going to burn up more power. If they take up more power, but they’re only going to be on 10% of the time, 90% of the time they’re going to be off and there’s going to be an overall power savings. That’s fantastic. You need to know that, and you need to know how much more power it takes when it’s on, and you need to know the duration of the time when it’s off. All the power analysis you want to do early on because you don’t want to have to go through synthesis, place-and-route, go into the back end, and then realize you don’t really want to do this technique.”
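Lang's trade-off can be put into numbers with a minimal sketch. The model below is an assumption made for illustration, not the output of any RTL power tool: a block draws a fixed power when on, retention registers add a fractional overhead to that on-state power, and a small residual power remains during shutoff. The function names and all values are hypothetical.

```python
def average_power(p_active_mw, retention_overhead, off_fraction, p_off_mw=0.0):
    """Time-averaged power of a block that can be power gated.

    p_active_mw:        power while on, without retention registers
    retention_overhead: extra on-state power from retention registers (0.05 = 5%)
    off_fraction:       fraction of time spent in power shutoff (0.0 to 1.0)
    p_off_mw:           residual power while shut off (assumed constant)
    """
    if not 0.0 <= off_fraction <= 1.0:
        raise ValueError("off_fraction must be between 0 and 1")
    p_on = p_active_mw * (1.0 + retention_overhead)
    return (1.0 - off_fraction) * p_on + off_fraction * p_off_mw


def break_even_off_fraction(p_active_mw, retention_overhead, p_off_mw=0.0):
    """Off-time fraction at which the gated block matches the ungated block."""
    p_on = p_active_mw * (1.0 + retention_overhead)
    return (p_on - p_active_mw) / (p_on - p_off_mw)


if __name__ == "__main__":
    # Hypothetical block: 100 mW active, 5% retention overhead, 2 mW when off.
    for off in (0.01, 0.50, 0.90):
        print(off, average_power(100.0, 0.05, off, 2.0))
    print("break-even:", break_even_off_fraction(100.0, 0.05, 2.0))
```

Below the break-even off-time the retention overhead costs more than shutoff saves, which is the case Lang warns about. Above it, the technique pays for itself.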

Clock gating
One of the key advances in low-power design is the widespread recognition that power is global in a design. It can affect every aspect of a system, and therefore must be approached holistically.

“There are multiple techniques, many apply at multiple design abstraction levels, and many require both the designer and EDA tools and methodology,” said Preeti Gupta, head of PowerArtist product management at ANSYS. “Techniques, tools and methodology also continue to evolve.”

Gupta pointed to several approaches that relate to clock gating as a highly effective power reduction technique:

1. Synthesis-based register-level clock gating (automated synthesis technique). This continues to be the most popular, effective and mostly automated technique to reduce power. While high-performance designs such as processors may employ a custom clock gating insertion approach, due to the impact of clock gates on timing, most design applications have adopted clock gating for registers that share common enables.
2. Observability and stability-based clock gating (automated RTL technique). In recent years, RTL techniques that rely on data observability and stability conditions have emerged to complement synthesis-based techniques. While synthesis relies on existing RTL source code, RTL techniques identify changes to RTL code to create better and more clock gating opportunities for downstream synthesis. Looking across a representative set of vectors is also important to ensure power is reduced across the design operation and not limited to a corner case scenario that the design hardly spends any time in. EDA tools for RTL power provide the performance and capacity to look across simulation vectors and across design hierarchies to perform such complex searches. Even if designers still prefer to make the end RTL edit themselves versus a tool, the identification of the enables is largely automated.
3. Architectural clock gating (RTL and above abstraction, limited automation). This approach has the highest impact, especially for designs that are inactive in certain modes. Configuration bits and major modes of design operation can be used to shut off the clock for a large number of flops at the same time, and even shut off power entirely. Such opportunities are mostly identified by the architect or the designer today, although EDA tools are also steadily incorporating techniques to automate identification of high-level control signals.
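The effect of a shared enable, and the reason to look across a representative set of vectors, can be illustrated with a small sketch. It only counts clock events arriving at register clock pins. It ignores the power of the clock-gating cell itself and any timing impact, and the function names and traces are hypothetical.

```python
def clocked_register_events(enable_trace, num_registers, gated):
    """Count clock events seen by registers over a trace of per-cycle enables.

    enable_trace:  sequence of booleans, one per clock cycle
    num_registers: number of registers sharing the enable
    gated:         if True, the clock is blocked in cycles where enable is low
    """
    cycles_clocked = sum(1 for en in enable_trace if en or not gated)
    return cycles_clocked * num_registers


def gating_savings(traces, num_registers):
    """Fraction of register clock events removed by gating, over all traces."""
    ungated = sum(clocked_register_events(t, num_registers, False) for t in traces)
    gated = sum(clocked_register_events(t, num_registers, True) for t in traces)
    return 1.0 - gated / ungated


if __name__ == "__main__":
    mostly_idle = [True, False, False, False]   # corner-case vector
    always_busy = [True, True, True, True]      # typical vector in this example
    print(gating_savings([mostly_idle], 32))               # corner case alone
    print(gating_savings([mostly_idle, always_busy], 32))  # across both vectors
```

Judged on the mostly idle vector alone, gating looks far more effective than it is once the busy vector is included, which is why savings should be measured across the modes the design actually spends time in.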

“Power reduction techniques are obviously not all equally effective across all design applications,” she said. “A design that is mostly active will not benefit from techniques that address redundant activity. Wall-powered networking applications were the last to adopt synthesis-based clock gating. Different applications are driven by different criteria. For example, high-performance designs may select an architecture that runs faster at the cost of area and power overhead, or not insert a clock gate that saves power but makes it difficult to meet timing goals.”

Clock gating has been implemented for quite some time, but Drew Wingard, CTO of Sonics, pointed out that frequency scaling has been around even longer. “I can change the divide rate on my clock generator, the thing that catches between the phase lock loop or delay locked loop, and the rest of my circuit and say, ‘I don’t need to run fast anymore, I can run slower.’ While that doesn’t help me with leakage, I can reduce the dynamic power. And especially for parts of the circuit that are still getting a clock even when they’re idle, by slowing down the clock I reduce the number of those edges. That’s why I reduce dynamic power.”
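Wingard's point matches the standard first-order model of switching power, in which dynamic power is proportional to activity, switched capacitance, the square of the supply voltage, and clock frequency. The sketch below applies that model. The component values are hypothetical, and leakage is treated as a constant that frequency does not affect.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """First-order switching power: activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def total_power_w(activity, capacitance_f, vdd_v, freq_hz, leakage_w):
    """Dynamic power plus a leakage term that does not depend on frequency."""
    return dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz) + leakage_w


if __name__ == "__main__":
    # Hypothetical block: 20% activity, 1 nF switched capacitance, 1.0 V, 50 mW leakage.
    for freq in (1.0e9, 0.5e9):
        print(freq, total_power_w(0.2, 1.0e-9, 1.0, freq, 0.05))
```

Halving the clock halves the dynamic term while the leakage term stays put, so the total falls by less than half.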

Fourth terminal, power gating
Other techniques that designers use to try to help manage the leakage or static power dissipation include worrying about the fourth terminal of the transistor, and power gating.

The fourth terminal of the transistor is the back gate, or well. If the voltage on the well can be controlled, the characteristics of the transistor can be changed, especially the threshold voltage.

“By applying body bias, you can do some interesting things,” Wingard said. “For instance, when we are designing digital chips, we worry about the characterization corners of the process technology because you don’t get the same transistors every time. Something happens in manufacturing. There’s a little bit more of this dopant, for example, and the transistors come out with a range of behaviors that tend to be relatively constant across a given wafer, but can vary substantially between wafers. When you’re characterizing, you talk about the slow corner or fast corner. That basically boils down to the threshold voltage of the average transistor in a circuit vs. that wafer. By playing with body bias, you can either raise or lower that threshold voltage to make it closer to what you wanted (i.e., the nominal threshold voltage). This means you can take slow chips and speed them up a bit, and you can take fast chips and slow them down a bit. By doing that you come up with a more uniform transistor, and that means you can control the leakage. In these fully-depleted SOI technologies, we’re living in silicon-on-insulator world, and that means that every transistor has its own fourth terminal. Normally, all of the n-type devices have the same fourth terminal, and all the p-type devices that are in a given well tend to have a shared one, but with fully-depleted SOI, because essentially every transistor gets its own small well, you can play with them individually, which means you can take your analog circuits and you can change the bias one way or think about going into a low power mode [under certain conditions].”
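The idea of pulling slow and fast chips back toward the nominal threshold voltage can be sketched with a simplified model. Everything here is an assumption made for illustration: the threshold is taken to shift linearly with body bias, the sensitivity and slope factor are hypothetical values rather than data for any process, and leakage uses the usual exponential dependence of subthreshold current on threshold voltage. A positive bias in this sketch means forward bias.

```python
import math

THERMAL_VOLTAGE_V = 0.0259  # kT/q near room temperature


def biased_threshold_v(vth_v, body_bias_v, sensitivity_v_per_v=0.08):
    """Threshold after bias: forward (positive) lowers it, reverse raises it."""
    return vth_v - sensitivity_v_per_v * body_bias_v


def bias_to_reach(vth_actual_v, vth_target_v, sensitivity_v_per_v=0.08):
    """Body bias that moves an as-manufactured threshold to the target value."""
    return (vth_actual_v - vth_target_v) / sensitivity_v_per_v


def relative_leakage(vth_v, vth_reference_v, slope_factor=1.5):
    """Subthreshold leakage relative to a device at the reference threshold."""
    return math.exp((vth_reference_v - vth_v) / (slope_factor * THERMAL_VOLTAGE_V))


if __name__ == "__main__":
    nominal = 0.40
    for label, vth in (("slow", 0.45), ("fast", 0.35)):
        bias = bias_to_reach(vth, nominal)
        print(label, "bias:", bias,
              "leakage before:", relative_leakage(vth, nominal),
              "after:", relative_leakage(biased_threshold_v(vth, bias), nominal))
```

In this model the fast chip gets reverse bias, which raises its threshold and cuts its leakage back to the nominal level, while the slow chip gets forward bias and speeds up at the cost of some leakage.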

Then with power gating, the power is shut off, but this technique comes with the challenge that it impacts place-and-route, because all of the gates that are attached to one supply that you want to turn off probably want to be placed very close to each other so they can share a common electrical routing for the power supply, he said. “That makes it more challenging in the place-and-route because the placement system wants as few constraints to optimize around as it can, and this adds more constraints and makes the design of the power distribution network more complex. You have to deal with the fact that sometimes it can be assumed that the capacitance of Block A is there to help protect my supply against spikes, and sometimes it won’t be there because it’s been power gated. In essence, there are complexities associated with power gating.”

Two techniques try to optimize the power and energy a circuit consumes by dynamically changing the supply voltage and frequency together.

The simplest is dynamic voltage and frequency scaling (DVFS), Wingard said, where a circuit is characterized at a set of supply voltages. “Then you say, ‘The operating system tells me I want to run at 800MHz right now. What’s the lowest voltage I can run at that allows me to do that?’ You need to talk to the external power management IC, which is the normal way people provide the supplies to these chips, to say, ‘You can reduce the supply voltage from 1.0 to 0.95 at this point. Once it gets down there, it’s safe to run at 800MHz’. Now you are controlling both the output to the phase locked loop, and the output of the voltage source. That’s typically done at a pre-characterized set of points that are known before the chip is built, and are re-characterized after the wafers have been built, on the tester.”
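The lookup Wingard describes can be sketched as a table of pre-characterized operating points. The 800MHz at 0.95V entry echoes his example. The other entries, the table name and the function name are hypothetical, and a real system would also have to sequence the request to the power management IC and the clock generator, which this sketch leaves out.

```python
# Hypothetical characterized points: (maximum frequency in MHz, supply in volts).
OPERATING_POINTS = [
    (400, 0.80),
    (800, 0.95),
    (1000, 1.00),
]


def lowest_voltage_for(freq_mhz, points=OPERATING_POINTS):
    """Return the lowest characterized supply that supports freq_mhz."""
    candidates = [volts for f_max, volts in points if f_max >= freq_mhz]
    if not candidates:
        raise ValueError(f"no characterized operating point supports {freq_mhz} MHz")
    return min(candidates)


if __name__ == "__main__":
    for request in (300, 800, 900):
        print(request, "MHz ->", lowest_voltage_for(request), "V")
```

Requests that fall between characterized points are rounded up to the next point's voltage, because only the characterized points are known to be safe.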

A more dramatic technique is adaptive voltage and frequency scaling (AVFS), where the optimization responds to the environment the chip is in at a particular moment, taking into account the temperature it is operating at and the supply voltage.

“Typically, the way people do this technique is to build a replica circuit that tries to mimic the worst-case delay of the much bigger circuit that is trying to be power optimized,” he said. “It’s basically turned into a ring oscillator, and the frequency is measured. If it is measured faster than they need, that’s a signal to them that they can reduce the voltage a little bit. If it’s running slower than they want, then they’d better raise the voltage a little bit. The challenge is that circuit has to be designed conservatively because it has to model what’s going on. Pretty quickly, if you start to give it slack, it doesn’t look much different than the DVFS.

“You can easily imagine a day in which [AVFS] becomes the dominant way because the problem with DVFS is the number of corners that have to be optimized. It becomes a very complex physical verification problem to try and prove that when the voltage is changed by 50 millivolts, that my frequency is going to go down by 17MHz. That takes a lot of time to do that characterization, and the people who advocate AVFS say you get a more continuous range of choices, and you are relatively more comfortable with variation than the people who have to do it by rote in DVFS.”
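The feedback loop behind the replica-circuit approach can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any shipping controller: the replica's frequency is taken to rise linearly with supply voltage, the `mhz_per_volt` values standing in for slow, typical and fast silicon are hypothetical, and so are the step size, margin and voltage limits.

```python
def replica_frequency_mhz(vdd_v, mhz_per_volt, v_offset=0.30):
    """Toy replica ring oscillator: frequency rises linearly with supply."""
    return mhz_per_volt * (vdd_v - v_offset)


def avfs_step(measured_mhz, target_mhz, vdd_v,
              step_v=0.01, margin_mhz=20.0, v_min=0.70, v_max=1.10):
    """One control iteration: return the next supply voltage.

    margin_mhz is a dead band above the target. It should be wider than the
    frequency change caused by one voltage step, or the loop will hunt.
    """
    if measured_mhz > target_mhz + margin_mhz:
        vdd_v -= step_v   # replica is faster than needed: trim the supply
    elif measured_mhz < target_mhz:
        vdd_v += step_v   # replica is too slow: raise the supply
    return min(max(vdd_v, v_min), v_max)


def settle(mhz_per_volt, target_mhz, vdd_v=1.0, iterations=50):
    """Run the loop until it has had time to settle and return the supply."""
    for _ in range(iterations):
        measured = replica_frequency_mhz(vdd_v, mhz_per_volt)
        vdd_v = avfs_step(measured, target_mhz, vdd_v)
    return vdd_v


if __name__ == "__main__":
    for label, k in (("slow", 1100.0), ("typical", 1200.0), ("fast", 1300.0)):
        print(label, round(settle(k, 800.0), 2), "V")
```

In this model the fast part settles at a lower supply than the slow part for the same target frequency. The dead band plays the role of the conservatism Wingard mentions: the wider it is, the more the result resembles a fixed operating point.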

Baseline techniques
There is a sliding scale of which power-saving techniques get used where and when. Abhishek Ranjan, director of engineering at Mentor Graphics, asserted that power gating, DVFS and voltage partitioning are the bare minimum. “If you don’t do that, you’ll never have any hope of meeting the power budget. Those need to be done, and they are fairly well-defined techniques. You don’t need a very detailed description of the chip or the detailed layout to measure the power or know whether they are going to benefit the design’s power or not because they always benefit at a high level.”

Once the architecture is frozen, and the design flow is roughly determined, the microarchitects take over. From a very high-level specification, they start coding the design at a C level or in some other meta language, making decisions about pipelines, schedules and various stages, among other things. Here, they have a lot of flexibility in trading off timing, power, and area, and in deciding the eventual flow, which the RTL designers are going to code. Various techniques are possible at this point, including clock gating, voltage and frequency scaling, pipelining, and retiming. A lot of those techniques are deployed to trade off area, power, and timing.

“Many decisions are still being done based on the prior knowledge of the design or the expertise or experience of the microarchitects, and it’s so early on it’s difficult to estimate the real impact of power on whatever you’re doing,” Ranjan said.

Once the microarchitecture is frozen, then the job of the RTL designers is to take up that microarchitecture and translate it into an RTL description, he explained. “That is where the real action kicks in, where you have a lot of flexibility but you’re bound by the guidelines the architects and microarchitects have already laid out. This is the first state where the register boundaries are well defined. That is where the clocks are present. The pipelines, registers, and datapath are very well frozen. So now you have a very good estimate of how the design is going to behave from the timing, area, and power perspective. Designers put a lot of focus here because that’s where they really know the tricks that can be applied to save power are applicable and measurable.”

But this also is something of a moving target, which makes it much harder to implement these techniques. “Unfortunately, by the time this is happening, the functionality of the design itself is changing, bugs are being identified, they are being fixed, and the schedule is slipping, so the time that is actually available to work on power is very limited,” Ranjan said. “The tools that are there to work on power and make actionable changes to the design are also lacking, but techniques like clock gating, data-path gating, block-level clock gating, register-level transformations: a lot of these things can be done to really cut down the design power. But it really depends on the time in the schedule. There are multiple flavors of the tools, both automated and manual. If you don’t have much time and you want to get the most impact on power, you deploy the automated tools, generate a new RTL with power changes, and that becomes the new golden RTL that you give to the synthesis tool and to the implementation. If there is flexibility of time in the schedule, then you can be more daring, and explore micro-architectural level changes, explore retiming, look at the memory accesses and whether caches will be needed to cut down on memory accesses to reduce the power.”

At this point there is no single way to optimize power. But with power now the top consideration in many designs, these approaches are certainly getting a lot of attention. How much will ultimately be automated remains to be seen, but improvements are definitely underway on every front.
