DVFS On The Sidelines

Dynamic voltage and frequency scaling is moving out of favor even as companies are attempting more aggressive power reduction. Will it have a resurgence?

Power reduction is one of the most important aspects of chip design these days, but not all power reduction techniques are used equally. Some that were once important are fading, and dynamic voltage and frequency scaling (DVFS) is one of them. What’s changed, and will we see a resurgence in the future?

What is it?
DVFS has physics powerfully in its favor. As Vinod Viswanath, director of research and development at Real Intent, sums it up, “dynamic power consumption has a linear dependence on frequency and a quadratic dependence on voltage, with the combined supply voltage and frequency having a cubic impact on power dissipation. Power savings therefore can be achieved by intelligently reducing frequency while concurrently reducing the supply voltage.”
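The relationship Viswanath describes can be written as a small calculation. The sketch below assumes the textbook model in which dynamic power is proportional to C × V² × f, and the function name and the scale factors are illustrative choices, not figures from any vendor.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to nominal, assuming P_dyn is proportional to C * V^2 * f.

    v_scale and f_scale are the new voltage and frequency expressed as
    fractions of their nominal values (1.0 means unchanged).
    """
    return (v_scale ** 2) * f_scale


# Scaling voltage and frequency together gives the cubic effect.
for scale in (1.0, 0.9, 0.8, 0.7):
    power = relative_dynamic_power(scale, scale)
    print(f"V and f at {scale:.0%} of nominal -> dynamic power at {power:.1%}")
```

Under this model, running at 80% of nominal voltage and frequency leaves roughly half of the original dynamic power. Leakage is not included.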

It sounds simple enough, yet in a survey performed by Calypto last year, only 24% of the respondents said they were using DVFS.

“DVFS is indicative of the dynamic power challenge in that it is so important that this is driving people to look at it, but the complexity is making them leery,” explains Mark Milligan, vice president of marketing at Calypto. “There are so many techniques that they can apply that are much more straightforward.”

Figure: DVFS survey results. Source: Calypto survey 2014

But there is one place where you can expect to find it being used. “It is an assumed feature for several processor vendors,” says Drew Wingard, CTO at Sonics. “While this does need to be interfaced to the block providing the voltage and the frequency sources, it is well integrated into the processor domain.”

Srikanth Jadcherla, low power architect in the verification group of Synopsys, provides some history about early adoption. “In the mid-90s Intel and AMD introduced competing technologies (Intel’s SpeedStep and AMD’s PowerNow) for platform level DVFS. The most interesting adoption came from Transmeta, which brought DVFS into the mainstream. They allowed fine-grained voltage positioning on the processors.”

Why has it remained a technique used for processors only? “It requires that throughput can be varied with frequency and get a different performance,” explains Jadcherla. “This tends to be limited to CPUs, GPUs and DSPs. In a modern SoC, most other modules would have no use for DVFS because they are fixed protocol or fixed frequency. They use a Lowest Possible Voltage (LPV) technique. This is a matter of working out what frequency you need to run at to achieve the task and then finding the lowest voltage that would work for that.”
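The LPV idea Jadcherla outlines amounts to a table lookup. The sketch below is a minimal illustration: the characterization table, its voltages and frequencies, and the function name are all hypothetical, standing in for whatever silicon characterization data a real design would have.

```python
# Hypothetical characterization data: (supply voltage in volts,
# highest frequency in MHz at which the block is known to work).
VF_TABLE = [(0.70, 200), (0.80, 400), (0.90, 600), (1.00, 800), (1.10, 1000)]


def lowest_possible_voltage(required_mhz, table=VF_TABLE):
    """Return the lowest characterized voltage that supports required_mhz."""
    candidates = [volts for volts, max_mhz in table if max_mhz >= required_mhz]
    if not candidates:
        raise ValueError(f"no characterized voltage supports {required_mhz} MHz")
    return min(candidates)


# A fixed-frequency block that must run at 450 MHz settles at 0.90 V
# in this made-up table, and stays there.
print(lowest_possible_voltage(450))
```

Unlike DVFS, nothing here changes at run time: the frequency is fixed by the protocol, so the voltage is chosen once.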

Alan Aronoff, director of strategic business for Imagination Technologies, describes a somewhat simpler method based on reducing the supply voltage, commonly referred to as voltage scaling. “For example, operating a MIPS M-class processor through voltage scaling at 0.95V will result in a power reduction of 23% vs. the value when operating at 1.08V.” Aronoff points out that there could be some limitations to this if the system is not designed as a whole. “Choosing processes and memory IP that can operate over low voltages (e.g. operating in the 0.7V to 0.8V regime in a 40nm process) is highly desirable.”
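Aronoff's figure is consistent with a simple quadratic model. The check below assumes that dynamic power scales with V² at a fixed frequency and ignores leakage. How the quoted number was actually derived is not stated, so this is only a plausibility check.

```python
def power_reduction_from_voltage(v_new: float, v_old: float) -> float:
    """Fractional dynamic power saving at fixed frequency, assuming P ~ V^2."""
    return 1.0 - (v_new / v_old) ** 2


reduction = power_reduction_from_voltage(0.95, 1.08)
print(f"Estimated reduction: {reduction:.1%}")  # prints "Estimated reduction: 22.6%"
```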

Whereas most power optimization techniques can be performed at the RT level or below, DVFS requires many system-level decisions to be made. What is the scope of it — full chip, core clusters or individual cores? How many operating points should it have or should it be continuous? Should there be on-chip or off-chip voltage regulators? Can it be used during a transition? How will it impact verification? But perhaps worst of all, it involves hardware and software working co-operatively.

Each of these decisions is complex on its own, but many of them have intricate dependencies between them. “Over the past 15 to 20 years, we have noticed that DVFS has evolved quite a bit,” says Arvind Shanmugvel, director of applications engineering for Ansys. “It is seeing less usage now than in the past. Mainstream processors are not using it as aggressively as they used to. Today, architectural advantages from things such as clock gating are driving down the power consumption.”

Granularity
When chips contained a single core, the decision was much simpler, but SoCs today contain many clusters of heterogeneous cores.

“It is at the processor cluster level,” says Wingard. “Within a cluster you would probably only have one supply and one set of clocks that are synchronous to each other. You can then power gate some of the processors within the cluster. So you choose the operating point for all of them together.”

Shanmugvel provides a rationale for this: “Products tend to spend a large portion of their time in standby mode so this favors turning everything off that is not being used. When turned on, it is likely that you need to operate at full speed.”

There’s a cost involved here, too. “Per-core DVFS is more expensive because it requires more than one power/clock domain per chip executing DVFS,” explains Viswanath. “They also require additional circuitry to synchronize between the chips.”

Viswanath provides a summary of the current thinking of several academic and commercial entities that have explored the benefits of per-core versus per-chip DVFS for single-chip multi-processors (CMPs). One entity reported that a per-core approach had 2.5 times better throughput than a per-chip approach. This is because the per-chip approach must scale down the entire chip even when a single core is overheating. With per-core control, only the core with a hot spot must scale downward and the other cores can maintain high speed unless they themselves encounter thermal problems. “Managing power when running parallel or multi-threaded programs — especially with the onset of heterogeneous multi-core architectures — is a more difficult optimization problem and one that is still being tackled in the industry.”

Involving software
“The biggest bottleneck in DVFS is that the software and hardware have to work together,” says Anand Iyer, director of product marketing for Calypto. “As an industry, we tend to develop these things independently, and so this creates a problem in that someone has to put the two things together. Fine-grained DVFS presents other challenges such as the verification of the different modes of operation as well as determining if the circuit can operate reliably.”

Wingard lays out the problem. “The OS typically asks for a throughput level that maps to a frequency, and the spec for the voltage says that this is the lowest one you can operate at that delivers that throughput.”

But that is where the difficulties start. “When you talk about a communications system where you don’t know when the real communications will happen, it is hard to predict when things will happen in systems that are effectively asynchronous,” points out Iyer. “This makes the usage of DVFS difficult because you need a predictable schedule.”

Shanmugvel takes this to an extra level of detail. “To be able to use DVFS effectively you have to be able to predict your task completion time. If you know you have tasks X, Y and Z, then you could increase the frequency and finish them fast, but you have to know details about the tradeoff. Without the necessary information, it is impossible to guide the policymaking. If you are not careful, the DVFS algorithm will spend more time hunting for the right voltage and not actually saving power. This happens a lot if you do not have a good enough profile of the system operation. The more software maturity a system has, the easier it becomes.”
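The tradeoff Shanmugvel refers to can be made concrete with a toy energy model. The sketch below assumes that dynamic energy is C × V² per cycle and that leakage power is a constant regardless of voltage, which is a deliberate simplification. The operating points, the effective capacitance and the leakage figure are all invented for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    volts: float
    mhz: float


def task_energy_joules(op, cycles, c_eff_farads, leak_watts):
    """Energy to run `cycles` cycles at `op`: dynamic plus (constant) leakage."""
    seconds = cycles / (op.mhz * 1e6)
    dynamic = c_eff_farads * op.volts ** 2 * cycles
    static = leak_watts * seconds
    return dynamic + static


def pick_operating_point(
    points: Sequence[OperatingPoint],
    cycles: float,
    deadline_s: float,
    c_eff_farads: float = 1e-9,   # hypothetical effective switched capacitance
    leak_watts: float = 0.05,     # hypothetical leakage power
) -> Optional[OperatingPoint]:
    """Lowest-energy point that still finishes before the deadline, or None."""
    feasible = [op for op in points if cycles / (op.mhz * 1e6) <= deadline_s]
    if not feasible:
        return None
    return min(
        feasible,
        key=lambda op: task_energy_joules(op, cycles, c_eff_farads, leak_watts),
    )


POINTS = [OperatingPoint(0.8, 400), OperatingPoint(1.0, 800), OperatingPoint(1.1, 1000)]
print(pick_operating_point(POINTS, cycles=4e6, deadline_s=0.006))
```

The function only works if the cycle count and the deadline are known in advance, and that is exactly the information a poorly profiled system lacks.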

Viswanath provides information about three ways in which the problem is being tackled. At the operating system level, Linux uses a standard infrastructure called cpufreq to implement DVFS; cpufreq is the subsystem of the Linux kernel that allows frequency to be explicitly set on processors.
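On a Linux system, the kernel's cpufreq subsystem is normally visible through sysfs. The sketch below reads a few of its standard per-CPU files. It assumes the usual `/sys/devices/system/cpu/cpuN/cpufreq/` layout, which may be absent on machines without a cpufreq driver (virtual machines, for example). The helper name `read_cpufreq` is illustrative.

```python
from pathlib import Path


def read_cpufreq(cpu: int = 0, sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Read the governor and frequency limits for one CPU from sysfs.

    Frequencies are reported by the kernel in kHz. Any file that is missing
    or unreadable is returned as None instead of raising.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in (
        "scaling_governor",
        "scaling_cur_freq",
        "scaling_min_freq",
        "scaling_max_freq",
    ):
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            info[name] = None
    return info


print(read_cpufreq(0))
```

Changing the frequency, as opposed to reading it, generally requires root privileges and a governor that accepts explicit requests.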

AutoDVS is a dynamic voltage scaling (DVS) system for handheld computers. AutoDVS distinguishes common, coarse-grained, program behavior and couples forecasting techniques to make predictions of future behavior. AutoDVS estimates periods of user interactivity, user non-interactivity (think time), and computation, per program and system-wide to ensure quality of service while reducing energy consumption.

Another approach has been to look at application-directed DVFS with the observation that it is difficult to achieve good results using only statistics from the OS when applications show bursty (unpredictable) behavior. The approach here is that such applications must be made power-aware, and specify their Average Execution Time (AET) and a deadline to the scheduler controlling the clock speed and processor voltage. They implement an Energy Priority Scheduling (EPS) algorithm supporting power-aware applications. EPS orders tasks according to how tight their deadlines are and how often tasks overlap.
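The application-directed idea can be illustrated with a few lines of code. This is not the published EPS algorithm. It is a simplified sketch that assumes each application declares its average execution time at the maximum clock speed together with a deadline, and that execution time scales inversely with frequency. The function names, task tuples and frequency levels are hypothetical.

```python
def min_frequency_mhz(aet_s_at_fmax, deadline_s, f_max_mhz, levels_mhz):
    """Lowest available frequency that still meets the deadline, or None.

    Assumes run time scales as 1/f, so the required frequency is
    f_max * AET / deadline.
    """
    needed = f_max_mhz * aet_s_at_fmax / deadline_s
    for level in sorted(levels_mhz):
        if level >= needed:
            return level
    return None


def order_by_slack(tasks):
    """Sort (name, aet_s, deadline_s) tuples so the tightest deadline comes first."""
    return sorted(tasks, key=lambda task: task[2] - task[1])


LEVELS = [200, 400, 600, 800, 1000]
print(min_frequency_mhz(0.002, 0.004, 1000, LEVELS))
print(order_by_slack([("decode", 0.002, 0.010), ("audio", 0.003, 0.004)]))
```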

Regulators and frequency sources
One big system-level decision is where to put the voltage regulators. These traditionally have been off-chip, and Wingard believes that “for the vast majority of designs, off-chip regulators are the way to go. This has a lot to do with the way in which systems are constructed. There is normally more than one digital chip within a system, and so a dedicated power management integrated circuit that provides the voltage sources can be built and manufactured in processes that have better support for voltage regulation. You don’t need a 7nm finFET process to do that. You need power transistors. You need thicker oxides and charge pumps and analog circuitry. So for high-volume applications you have one power management IC that supplies the applications processor and other things as well.”

Shanmugvel counters that you have “the ability to control the voltage using a finer granularity by using embedded voltage regulators.” He agrees that it makes sense to use off-chip regulators when the entire chip runs off a single voltage. “Today, some parts of the chip may require faster operation, and with global control it is not possible to change things at the block level.”

And with regulators, there is a whole set of tradeoffs that have to be made. “Settling times are in the millisecond range for external voltage regulators,” says Wingard. “You can perhaps do a little better than that, but characterization is challenging. It is not easy measuring millivolts to find when things have actually settled. The accuracy necessary to measure this is a tradeoff, and the latest 10nm process technology has a worse precision for this than older technologies so designers tend to be conservative.”

Shanmugvel says that “an integrated regulator has a faster response time. It can be two to three orders of magnitude better in terms of response time. If one core requires going into a high power mode, it can do it very quickly. This is because the load is closer to the regulator. If using off-chip regulators it can take milli- to micro-seconds.”

There are tradeoffs to be made, as well, when it comes to the frequency source. “There are techniques that people use to trade off settling time against accuracy for phase-locked loops (PLL) and delay-locked loops (DLL),” says Wingard. “By accuracy we mean how much jitter is in the clock after it has been locked. By making one that settles quicker, the amount of overhead I have to pay in my static timing model is higher because the jitter from the clock is going to be higher.”

How many operating points are required? Jadcherla notes that “more than five to eight operating points becomes unmanageable, in part because the schedulers do not do a good job with it and the OS tends to spend more time hunting than executing. People have tried continuous DVFS, where the chip keeps operating while you are moving up or down in operating frequency. That turned out to be very expensive. Discrete operating points are much more common in SoCs.”

“PLLs can lock in hundreds of microseconds, and DLLs a little better,” adds Wingard. “Additionally, people can step frequencies without changing the PLL by using a divider. The divide ratio is more stable, and this can happen in tens of nanoseconds. Which approach they choose depends on the degree of precision they need.”
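The divider approach Wingard mentions limits the available frequencies to integer fractions of the PLL output. The sketch below shows that coarseness, assuming a simple integer divider. The PLL frequency, the maximum divide ratio and the function names are illustrative.

```python
def divided_frequencies(pll_mhz, max_divider=8):
    """Frequencies reachable from a fixed PLL output with integer dividers."""
    return [pll_mhz / divider for divider in range(1, max_divider + 1)]


def best_divider(pll_mhz, target_mhz, max_divider=8):
    """Largest divide ratio whose output still meets target_mhz, or None."""
    best = None
    for divider in range(1, max_divider + 1):
        if pll_mhz / divider >= target_mhz:
            best = divider
    return best


print(divided_frequencies(1200, max_divider=4))
print(best_divider(1200, 350))
```

With a 1200 MHz source, a request for 350 MHz has to be served at 400 MHz, because the next step down is 300 MHz. This is the precision that is given up in exchange for switching in nanoseconds.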

Continuous DVFS is much harder and more expensive because as the device is operating and pulling current, you are also attempting to put the voltage regulator into the next point. The regulators have to be very good to do this. “You pay a big penalty for ensuring that it tracks properly,” says Jadcherla. “It has to be such a tight closed loop.”

Transitions
So should you attempt operation during the transition? By not operating you can do the transition faster and utilize a less expensive regulator.

“The device can be used during a transition if one of the requirements put in the design of the voltage and frequency sources is that it is stable during transition,” says Wingard. “When switching between operating points, the device needs to be stable. This does not mean monotonic. If I am changing from a lower voltage and frequency pair to a higher pair, you need to sequence the voltage up first while keeping the frequency constant and only once settled do you change the frequency.”
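The ordering Wingard describes for an upward transition can be captured in a short planner. The downward ordering (lower the frequency first, then the voltage) is not stated in the quote. It is included as the commonly used mirror image and should be read as an assumption, as should the step names.

```python
def transition_steps(current, target):
    """Ordered steps to move between (volts, mhz) operating points.

    Going to a higher voltage: raise the voltage, wait for it to settle,
    then change the frequency. Otherwise (assumed mirror case): change the
    frequency first, then lower the voltage.
    """
    (v_now, f_now), (v_new, f_new) = current, target
    steps = []
    if v_new > v_now:
        steps.append(("set_voltage", v_new))
        steps.append(("wait_voltage_settled", v_new))
        if f_new != f_now:
            steps.append(("set_frequency", f_new))
    else:
        if f_new != f_now:
            steps.append(("set_frequency", f_new))
        if v_new != v_now:
            steps.append(("set_voltage", v_new))
    return steps


print(transition_steps((0.8, 400), (1.0, 800)))
print(transition_steps((1.0, 800), (0.8, 400)))
```

Either way, the circuit is never asked to run at a frequency that the present voltage cannot support.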

But is it worth it? “With any analog settling problem, there will be overshoot and droop,” he adds. “It is already challenging enough to characterize the silicon at the targeted operating point so characterizing the silicon during transition is beyond what most people are willing to do.”

Shanmugvel observes that most people opt for the easier solution. “When doing the transition from one voltage or frequency to another, they put the cores in standby, change the voltage and frequency and then restart the cores. Doing this makes the transition faster, and if the regulators are on-chip, the core does not have to be put in standby for such a long period of time. It may be less than 100 nanoseconds.”

Jadcherla explains why operation during transitions can be so difficult. “The underlying Boolean model changes when doing DVFS because voltage is a factor in the equation. The basis for the 1 and 0 is changing on the fly. The problems start when you have signals that go between blocks. Level shifters only work in a static relationship of the power states. They can account for the point in time when one circuit is at level A and another is at level B. But under transient conditions, the whole thing may flip, and what was a high-to-low shifter may become a low-to-high shifter. Also, if one has a larger capacitance than the other, they have different profiles when moving to the new voltages. If both start at low voltage and frequency and move to a higher operating point but at different voltage points to each other, things happen during the transition that static relationships cannot figure out.”
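The flip Jadcherla describes can be reproduced with a toy model. The sketch below assumes each supply rail approaches its new voltage along a first-order exponential, with a time constant standing in for its capacitance. The voltages and time constants are invented, and real regulators will not behave this cleanly.

```python
import math


def rail_voltage(t_us, v_start, v_end, tau_us):
    """Voltage of a rail at time t, assuming a first-order exponential ramp."""
    return v_end + (v_start - v_end) * math.exp(-t_us / tau_us)


def reversed_times(rail_a, rail_b, t_end_us=200.0, step_us=0.5):
    """Sample times at which rail A is below rail B.

    Each rail is a (v_start, v_end, tau_us) tuple.
    """
    times = []
    for i in range(int(t_end_us / step_us) + 1):
        t = i * step_us
        if rail_voltage(t, *rail_a) < rail_voltage(t, *rail_b):
            times.append(t)
    return times


# Rail A is the higher rail before (0.85 V vs 0.80 V) and after (1.10 V vs
# 1.00 V) the transition, but it carries more capacitance and ramps slowly.
RAIL_A = (0.85, 1.10, 20.0)
RAIL_B = (0.80, 1.00, 2.0)
flipped = reversed_times(RAIL_A, RAIL_B)
print(f"A is below B from about {flipped[0]} us to {flipped[-1]} us")
```

A static check of the two endpoints sees A above B in both states, so it would approve a high-to-low shifter. The model shows an interval in between where the relationship is reversed.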

Implementation difficulties
Even with good operating policies in place, there is plenty of complexity in the implementation and in the functional and physical verification of such systems. “It is not just about verifying each of the operating points,” points out Jadcherla. Some of this was discussed in his book, “Verification Methodology Manual for Low Power.”

“When we talk about verification, we have to consider both functionality and performance,” says Calypto’s Iyer. “There is potentially a performance impact from the circuitry added for power reduction and in the worst case, this can also have a functionality impact.”

Jerry Zhao, director of product marketing for power signoff at Cadence, talks about some of the back-end issues. “DVFS creates additional difficulties in the power signoff phase and teams have to analyze them in terms of power integrity and signal jitter, and this includes aspects of the level shifters and power gates as well.”

There are also interactions that have to be taken into account. “Power grids are fully coupled, so even blocks that do not deploy DVFS may see an effect of the change in operating point,” explains Zhao. “It is not enough to just analyze the blocks individually. Frequency change also creates clock jitter. That can be made worse by the voltage drop. There is interference between many aspects within the chip, especially through the clock network.”

Shanmugvel agrees with the complexity of some of these interactions. “We need to verify the functionality and we have to verify the power integrity supplying the regulator. How do you perform a dynamic voltage/noise analysis on the complete SoC? What is the impact of the integrated voltage regulators? Today we can generate models for this.”

To overcome these problems Zhao says that designers tend to overdesign and put in plenty of guard banding in order to minimize the problems. But if too much is put in, then the gains may be wasted and it appears as though more people have decided that the complexity is just not worth it anymore. There are easier ways.



  • Yoshiyuki Ando

    This article is very interesting.
    I think key is how to set up or select the operation points for DVFS.
    Do you have this information?

  • Kev

    It’s pretty much impossible to verify DVFS with Verilog models in VCS (or similar), and doing functional verification in SPICE isn’t tractable.

    Verilog-AMS was supposed to fix that by letting you back-annotate analog effects like variable supplies and real wiring into the post P&R sign-off simulations, but Cadence screwed up the implementation, and nobody ever turned up at Accellera asking for it to be fixed.

    The same issues apply to using body-biasing in FDSOI, and so far nobody at GF seems to be asking for that to be fixed.

    In higher variability Silicon you really want to use asynchronous design, but that isn’t supported in the standard Verilog flow either.