Feel The (Low) Power

A slew of power-saving techniques to help you design a chip that fits within an increasingly tight power budget.


By Clive (Max) Maxfield

When I designed my first ASIC way back in the mists of time (circa 1980), its power consumption was the last thing on my mind. You have to remember that we’re talking about a device containing only about 2,000 equivalent gates implemented in a 5-micron technology. Also, I was designing this little scamp as a gate- and register-level schematic using pencil and paper (I predate EDA as we know it today).

We didn’t have automated schematic capture or logic simulation or timing analysis tools. Functional verification was performed by your peers looking at your schematics. You explained what a particular portion of the design was going to do, and they thought about it and said: “That looks good to us.” Similarly, timing analysis involved deciding which paths were important, and then adding up all of the lumped-load delays specified in the cell library data book. Once again, this task was performed by hand using pencil and paper. (No one I knew could afford one of the recently introduced electronic calculators; those were of interest only to well-paid managers who needed help balancing their expense accounts. What do you mean? Of course I’m not bitter.)

So my main concern was squeezing my design into 2,000 gates. The thought of how much power my device was going to consume literally never even crossed my mind. I’m sure that someone at the system level had some sort of power budget. Actually, as opposed to a “budget,” which implies some upper, not-to-be-exceeded value, I think it was more a case of keeping track of the total estimated power consumption, multiplying this by some factor to provide a margin of safety, and then throwing a big enough power supply unit into the system.

Once again, you have to remember what we were trying to achieve. This was before the days of hand-held, battery-powered, portable electronics products like MP3 players and GPS receivers and cell phones. My ASIC was intended for use in the CPU of a mainframe computer. The circuit board (one of many forming the system) on which my device was to reside was about 1/4″ thick, around three feet long by two feet wide, with its periphery populated by power studs capable of handling the power requirements of a small town. Ah, the good old days…

How the world has changed. Over the last few years, power consumption has moved to the forefront of ASIC and SoC development concerns. And this interest in low-power design is not restricted to portable, handheld products. Instead, it spans the entire deployment domain including fixed installations such as all forms of computing, networking, and set-top boxes, to name but a few.

As a simple example, consider the fact that every time you request a search on Google, the servers in the data centers consume an average of 4.5 watt-hours of energy. Google is easily processing 400 million queries a day, so this equates to 1.8 billion (1,800,000,000) watt-hours of energy being used daily just to handle basic search queries. If you can significantly reduce the power being consumed by the CPUs and support chips in these servers, this will have a HUGE impact on Google’s bottom line and, more importantly, on the environment.
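
The daily total above only works out if the 4.5 figure is read as energy per query (watt-hours), because 400 million × 4.5 Wh gives exactly the 1.8 billion Wh quoted. Here is a quick Python check of that arithmetic. It uses only the article's own numbers, which are not independently verified here. The last step divides by 24 hours to express the same daily energy as an equivalent continuous power draw.

```python
# Figures quoted in the article (not independently verified here).
ENERGY_PER_QUERY_WH = 4.5      # watt-hours consumed per search query
QUERIES_PER_DAY = 400_000_000  # 400 million queries per day

# Daily energy is simply per-query energy times the number of queries.
daily_energy_wh = ENERGY_PER_QUERY_WH * QUERIES_PER_DAY

# Spreading that energy evenly over 24 hours gives an equivalent continuous power.
average_power_w = daily_energy_wh / 24

print(f"Daily energy: {daily_energy_wh:,.0f} Wh ({daily_energy_wh / 1e9:.1f} GWh)")
print(f"Equivalent continuous power: {average_power_w / 1e6:.0f} MW")
```

Put another way, shaving even a few percent off that figure is worth megawatts of continuous power.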

In recent years, a wide variety of low-power design techniques have evolved to address the various aspects of the power problem and to meet ever-more-aggressive power specifications. These days, power planning can no longer be treated as an afterthought; instead, system architects need to make power-aware architectural decisions and power-aware third-party IP selections at the very beginning of the development process.

There are so many aspects to the low-power story that we could write a book on it, but I’m a hardware design engineer by trade, so for the purposes of this article I’m going to briefly summarize the various low-power implementation technologies that are available to us as follows:

Clock gating
Although clock-gating may seem a little boring, doing it right can significantly reduce the design’s dynamic power consumption. This is because the clock trees in a modern ASIC/SoC can account for one-third to one-half of a chip’s dynamic power consumption.
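
The usual first-order model for CMOS switching power is P = α·C·V²·f. Here α is the activity factor, C is the switched capacitance, V is the supply voltage, and f is the clock frequency. A clock net has α = 1 because it charges and discharges on every cycle, which is why clock trees eat such a large share of dynamic power. The Python sketch below estimates clock-tree power with and without gating. Every parameter is a hypothetical value chosen for illustration: the capacitance, voltage, frequency, the fraction of the tree sitting behind clock gates, and how often those gates are enabled. The overhead of the gating cells themselves is ignored.

```python
def dynamic_power(alpha: float, cap_f: float, vdd: float, freq_hz: float) -> float:
    """First-order CMOS switching power: P = alpha * C * Vdd^2 * f."""
    return alpha * cap_f * vdd ** 2 * freq_hz

# Hypothetical clock-tree parameters, for illustration only.
CLOCK_CAP_F = 2e-9     # total clock-tree capacitance (2 nF)
VDD = 1.0              # supply voltage (V)
FREQ_HZ = 500e6        # clock frequency (500 MHz)
GATED_FRACTION = 0.7   # share of the tree's capacitance downstream of clock gates
ENABLE_DUTY = 0.25     # fraction of cycles on which the gated branches are enabled

# Ungated: the entire tree toggles on every cycle (alpha = 1 for a clock).
p_ungated = dynamic_power(1.0, CLOCK_CAP_F, VDD, FREQ_HZ)

# Gated: the trunk upstream of the gates still toggles every cycle, while the
# gated branches toggle only on enabled cycles (gate overhead ignored).
p_trunk = dynamic_power(1.0, CLOCK_CAP_F * (1 - GATED_FRACTION), VDD, FREQ_HZ)
p_branches = dynamic_power(ENABLE_DUTY, CLOCK_CAP_F * GATED_FRACTION, VDD, FREQ_HZ)
p_gated = p_trunk + p_branches

print(f"Ungated clock power: {p_ungated:.2f} W")
print(f"Gated clock power:   {p_gated:.3f} W "
      f"({100 * (1 - p_gated / p_ungated):.1f}% saved)")
```

The saving depends on both how much of the tree you can put behind gates and how rarely those gates are enabled. This is the essence of the placement decisions that follow.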

Clock-gating involves all sorts of decisions. For example, should it be performed only at the bottom of the tree (the leaf nodes), at the top of the tree, in the middle of the branches, or as a mixture of all of these cases? There are tools that can help in moving the clock-gating structures around (upstream and/or downstream) and in performing tasks like splitting and cloning.

The current state-of-the-art in clock gating is “multi-stage gating,” in which a common enable is split into multiple sub-enables that are active at different times and/or under different operating modes. Although clock gating offers big paybacks, it adds substantially to the effort of physically implementing and verifying the clock tree(s).
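
One simple way to picture multi-stage gating is as two levels of gates. Each leaf gate is driven by its own sub-enable, and an upstream gate is driven by the common enable, which is the OR of those sub-enables. The upstream gate also stops the shared clock segment on cycles when no leaf needs a clock. The Python sketch below counts the switched clock capacitance over a short trace, with and without the upstream stage. It is a cycle-level model under stated assumptions: the branch names, loads, and enable traces are all hypothetical, and real tools work on the netlist and its timing rather than on a toy trace like this.

```python
from dataclasses import dataclass

@dataclass
class GatedBranch:
    """A leaf-level clock gate driving cap_units of clock load."""
    name: str
    cap_units: float     # clock load behind this gate (arbitrary units)
    enable: list[bool]   # sub-enable value on each clock cycle

def clock_energy(branches: list[GatedBranch], shared_cap_units: float,
                 multi_stage: bool) -> float:
    """Total switched clock capacitance over the trace (a proxy for clock energy)."""
    energy = 0.0
    for t in range(len(branches[0].enable)):
        # The common enable for the upstream stage is the OR of all sub-enables.
        common_enable = any(b.enable[t] for b in branches)
        # The shared segment toggles every cycle unless an upstream gate can stop it.
        if common_enable or not multi_stage:
            energy += shared_cap_units
        # Each leaf branch toggles only on cycles when its own sub-enable is high.
        energy += sum(b.cap_units for b in branches if b.enable[t])
    return energy

# Hypothetical branches whose sub-enables are active at different times.
branches = [
    GatedBranch("alu_regs", 4.0, [True, False, False, False, True, False, False, False]),
    GatedBranch("mac_regs", 6.0, [False, False, True, False, False, False, False, False]),
]
single = clock_energy(branches, shared_cap_units=5.0, multi_stage=False)
multi = clock_energy(branches, shared_cap_units=5.0, multi_stage=True)
print(f"leaf gating only: {single}, multi-stage gating: {multi}")
```

The leaf branches cost the same in both cases. The difference comes entirely from the shared segment, which the upstream stage silences on the five cycles where neither sub-enable is active.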

[Figure LP-01-a]

Note: Terms such as “Little”, “Low”, “Medium”, and so forth as used in diagrams like the one shown above are intended only to convey relative quantities in the context of the entire chip and/or development process.

Multi-Vt optimization
Static power dissipation is associated with logic gates when they are inactive (static); that is, not currently switching from one state to another. In this case, these gates should theoretically not be consuming any power at all. In reality, however, there is always some amount of leakage current passing through the transistors, which means they do consume a certain amount of power.

Even though the static power consumption associated with an individual logic gate is extremely small, the total effect becomes significant when we’re playing with devices containing tens of millions of gates. Furthermore, as transistors shrink in size when the industry moves from one technology node to another, the level of doping has to be increased, thereby causing leakage currents to become relatively larger. The end result is that even if a large portion of the device is totally inactive it may still be consuming a significant amount of power. In fact, static power dissipation is expected to exceed dynamic power dissipation for many devices in the near future.
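
To see why "extremely small" stops being small at scale, multiply a per-gate leakage current by the gate count and the supply voltage. In the Python sketch below, the 50-million-gate count, 10 nA per gate, and 1.0 V supply are assumed values for illustration only. They are not figures from the article or from any particular process.

```python
# Hypothetical numbers chosen only to illustrate the scaling argument.
GATE_COUNT = 50_000_000       # "tens of millions of gates"
LEAKAGE_PER_GATE_A = 10e-9    # assumed average leakage of 10 nA per idle gate
VDD = 1.0                     # assumed supply voltage (V)

# Leakage currents from every idle gate add up on the shared supply rail.
total_leakage_a = GATE_COUNT * LEAKAGE_PER_GATE_A
static_power_w = total_leakage_a * VDD

print(f"Total leakage current: {total_leakage_a:.2f} A")
print(f"Static power with every gate idle: {static_power_w:.2f} W")
```

Under these assumptions the chip burns about half a watt while doing nothing at all, which is a big deal for a battery-powered device.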

Now, static power dissipation has an exponential dependence on the switching threshold of the transistors (Vt). To support low-power designs, each type of logic gate is available in two (or more) forms: with low-threshold transistors that switch quickly but have higher leakage and consume more power, or with high-threshold transistors that have lower leakage and consume less power but switch more slowly. Multi-Vt optimization uses the fast, leaky versions only on timing-critical paths and the slower, low-leakage versions everywhere else. Of course this leads to other problems, such as unwanted signal integrity effects, but that’s a topic for another day.
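
Here is a minimal sketch of a greedy multi-Vt swap in Python. It assumes subthreshold leakage falls by 10x for every S volts of added Vt, with an assumed subthreshold swing of S ≈ 90 mV. The algorithm starts with every cell in its low-Vt form. It then tries to swap cells to high-Vt, largest leakage saving first, and keeps a swap only if every path through that cell still meets the clock period. The `Cell` class, the delay and leakage numbers, and the two paths are all hypothetical. Production tools use full static timing analysis rather than summed path delays.

```python
from dataclasses import dataclass

SUBTHRESHOLD_SWING_V = 0.090  # assumed: leakage changes 10x per ~90 mV of Vt

def leakage_reduction(delta_vt_v: float) -> float:
    """Factor by which subthreshold leakage falls when Vt rises by delta_vt_v."""
    return 10 ** (delta_vt_v / SUBTHRESHOLD_SWING_V)

@dataclass(eq=False)  # compare cells by identity, not by field values
class Cell:
    name: str
    delay_lvt: float   # delay using low-Vt transistors (ns)
    delay_hvt: float   # delay using high-Vt transistors (ns)
    leak_lvt: float    # leakage using low-Vt transistors (nW)
    leak_hvt: float    # leakage using high-Vt transistors (nW)
    high_vt: bool = False

def path_delay(path: list[Cell]) -> float:
    return sum(c.delay_hvt if c.high_vt else c.delay_lvt for c in path)

def greedy_vt_swap(cells: list[Cell], paths: list[list[Cell]], clock_period: float) -> None:
    """Swap cells to high-Vt (largest saving first) wherever timing still closes."""
    for cell in sorted(cells, key=lambda c: c.leak_lvt - c.leak_hvt, reverse=True):
        cell.high_vt = True
        if any(path_delay(p) > clock_period for p in paths if cell in p):
            cell.high_vt = False  # this swap breaks timing on some path, so undo it

# Hypothetical cells: high-Vt versions are about 30% slower and leak less by the
# factor implied by an assumed 100 mV increase in Vt (about 12.9x).
r = leakage_reduction(0.100)
a = Cell("a", 0.20, 0.26, 50.0, 50.0 / r)
b = Cell("b", 0.30, 0.39, 80.0, 80.0 / r)
c = Cell("c", 0.25, 0.32, 60.0, 60.0 / r)
paths = [[a, b], [b, c]]

greedy_vt_swap([a, b, c], paths, clock_period=0.70)
print({x.name: "HVT" if x.high_vt else "LVT" for x in (a, b, c)})
# {'a': 'HVT', 'b': 'HVT', 'c': 'LVT'}
```

Cell `c` stays low-Vt because, once `b` has been swapped, making `c` slower too would push the b→c path past the clock period. A greedy order like this is simple but not guaranteed optimal.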

[Figure LP-01-b]

Multi-supply multi-voltage (MSMV)
The idea here is that blocks in the design that are powered by higher voltages run faster and consume more power than blocks that are powered by lower voltages. Thus, if we have a block that can run slower than surrounding blocks, it may make sense to implement this block as its own “voltage island” powered by a lower supply voltage.

The downside is that we now require the insertion, placement, and connection of specialized power structures, such as level shifters, power pads, and so forth.
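
Deciding where level shifters go comes down to finding every signal that crosses between voltage islands running at different supplies. The Python sketch below does that for a toy design. The domain names, voltages, block assignments, and nets are all hypothetical, and a real flow would read this information from the design's power-intent description rather than from Python dictionaries.

```python
# Hypothetical voltage islands and block-to-island assignment.
DOMAIN_VOLTAGE = {"core": 1.0, "dsp": 0.8, "io": 1.2}
BLOCK_DOMAIN = {"cpu": "core", "cache": "core", "filter": "dsp", "uart": "io"}

# Signal connections as (driver block, receiver block, net name).
NETS = [
    ("cpu", "cache", "addr"),
    ("cpu", "filter", "cfg"),
    ("filter", "cpu", "result"),
    ("cpu", "uart", "tx_data"),
]

def level_shifters_needed(nets):
    """Return (net, from_V, to_V, kind) for each net crossing between different supplies."""
    needed = []
    for driver, receiver, net in nets:
        v_from = DOMAIN_VOLTAGE[BLOCK_DOMAIN[driver]]
        v_to = DOMAIN_VOLTAGE[BLOCK_DOMAIN[receiver]]
        # Low-to-high crossings are the critical case: a low-swing signal may not
        # fully switch the receiving gate. High-to-low crossings are often simpler.
        if v_from < v_to:
            needed.append((net, v_from, v_to, "low-to-high"))
        elif v_from > v_to:
            needed.append((net, v_from, v_to, "high-to-low"))
    return needed

for net, v_from, v_to, kind in level_shifters_needed(NETS):
    print(f"{net}: {v_from} V -> {v_to} V needs a {kind} level shifter")
```

The `addr` net stays inside the core island, so it needs no shifter. Every other net crosses a voltage boundary and picks up one, which is exactly the physical overhead described above.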

[Figure LP-01-c]

Dynamic and adaptive voltage and frequency scaling (DVFS)
In this case, the idea is to optimize the tradeoff between frequency and power by varying the voltage or frequency in relatively large discrete “chunks.” For example, the nominal frequency may be doubled to satisfy short bursts of high-performance requirements or halved during times of relatively low activity.

Similarly, a nominal voltage of 1.0V may be boosted to 1.2V to improve the performance, or reduced to 0.8V to reduce the power dissipation. Of course this can quickly become a verification nightmare, because each of these scenarios has to be tested in the context of surrounding blocks, which may themselves switch from one mode to another.
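
A DVFS controller typically picks from a small table of voltage/frequency operating points. Because dynamic power scales roughly as V²·f, running slower at a lower voltage saves far more than the frequency reduction alone. The Python sketch below pairs the example voltages and frequencies from the text into three hypothetical operating points. It assumes a fixed effective switched capacitance, and for each performance demand it chooses the lowest-power point that still delivers the required clock rate. The table, the capacitance value, and the saturation policy are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPoint:
    name: str
    vdd: float       # supply voltage (V)
    freq_hz: float   # clock frequency (Hz)

# Hypothetical operating points built from the example numbers in the text.
OPERATING_POINTS = [
    OperatingPoint("low", 0.8, 250e6),        # voltage reduced, frequency halved
    OperatingPoint("nominal", 1.0, 500e6),
    OperatingPoint("boost", 1.2, 1000e6),     # voltage boosted, frequency doubled
]
SWITCHED_CAP_F = 1e-9  # assumed effective switched capacitance (alpha * C)

def dynamic_power(op: OperatingPoint) -> float:
    """First-order switching power at an operating point: P = C_eff * V^2 * f."""
    return SWITCHED_CAP_F * op.vdd ** 2 * op.freq_hz

def choose_point(required_hz: float) -> OperatingPoint:
    """Pick the lowest-power operating point that still meets the required clock rate."""
    candidates = [op for op in OPERATING_POINTS if op.freq_hz >= required_hz]
    if not candidates:
        # Demand exceeds every point: saturate at the fastest one available.
        return max(OPERATING_POINTS, key=lambda op: op.freq_hz)
    return min(candidates, key=dynamic_power)

for demand in (100e6, 400e6, 800e6):
    op = choose_point(demand)
    print(f"demand {demand / 1e6:.0f} MHz -> {op.name}: {dynamic_power(op):.2f} W")
```

Under these assumptions, doubling the frequency and raising the voltage to 1.2 V nearly triples the power (0.5 W to 1.44 W). Halving the frequency and dropping to 0.8 V cuts it to under a third (0.16 W). Each transition between points is also one more scenario the verification plan has to cover.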

[Figure LP-01-d]

Power shutoff (PSO)
As its name suggests, power shutoff refers to powering down selected portions of the design that are not currently in use. If, for example, your cell phone includes an MP3 player capability but you aren’t currently listening to any music, then powering down that function will save power.

In this case, designers have to choose between “simple power shut-off” where everything in the block is powered down, and “state retention power shut-off,” in which the bulk of the logic is powered down but key register elements remain “alive.” This latter technique can significantly reduce the subsequent boot-up time, but state-retention registers consume power and also have an impact on silicon real-estate utilization.
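
The behavioral difference between the two flavors of shutoff can be captured in a toy Python model. The `PowerDomain` class, its register names, and the choice of which registers are retained are all hypothetical. On power-down, the retained registers are copied into always-on storage and everything else loses its value. On power-up, the retained values are restored and the remaining registers come back in their reset state. A simple shutoff is just the case where nothing is retained.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PowerDomain:
    """Toy model of a switchable power domain; all names are illustrative."""
    name: str
    registers: dict[str, Optional[int]]
    retained: set[str] = field(default_factory=set)  # registers with always-on retention latches
    powered: bool = True
    _shadow: dict[str, Optional[int]] = field(default_factory=dict, init=False, repr=False)

    def power_down(self) -> None:
        # Copy retained registers into always-on storage before the supply goes away.
        self._shadow = {r: self.registers[r] for r in self.retained}
        # Everything on the switched supply loses its value.
        self.registers = {r: None for r in self.registers}
        self.powered = False

    def power_up(self) -> None:
        # Retained registers are restored; all others come back in their reset state (0).
        self.registers = {r: self._shadow.get(r, 0) for r in self.registers}
        self.powered = True

# State-retention shutoff: playback position and volume survive; the FIFO pointer does not.
mp3 = PowerDomain("mp3_player", {"track_pos": 1234, "volume": 7, "fifo_ptr": 42},
                  retained={"track_pos", "volume"})
mp3.power_down()
mp3.power_up()
print(mp3.registers)  # {'track_pos': 1234, 'volume': 7, 'fifo_ptr': 0}

# Simple shutoff: nothing is retained, so every register restarts from reset.
plain = PowerDomain("mp3_player", {"track_pos": 1234, "volume": 7, "fifo_ptr": 42})
plain.power_down()
plain.power_up()
print(plain.registers)  # {'track_pos': 0, 'volume': 0, 'fifo_ptr': 0}
```

With retention, the player can resume where it left off instead of rebooting from scratch. The cost, which the model does not capture, is that each retained bit needs its own always-on storage that draws power and takes up area.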

[Figure LP-01-e]

Substrate biasing
Substrate biasing is typically applied only to selected portions of the design. The idea here is that a functional block typically doesn’t need to run at top speed for the majority of the time. During these periods, a reverse bias can be applied to the substrate (the transistors’ body), which raises the transistors’ threshold voltage and causes that block to run at a slower speed but with significantly reduced leakage power. The benefits from substrate biasing can be large, but actually implementing it can be a pain in the rear end.
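
The textbook long-channel body-effect equation links the applied bias to the threshold shift: ΔVt = γ·(√(2φF + V_SB) − √(2φF)). Combining it with the rule of thumb that subthreshold leakage falls by 10x for every S volts of Vt gives a rough feel for the payoff. In the Python sketch below, γ, 2φF, and S are all assumed illustrative values rather than numbers for any real process. In very small geometries the body effect is weaker than this simple model suggests.

```python
import math

# Assumed long-channel parameters, for illustration only.
GAMMA = 0.4                   # body-effect coefficient (V^0.5)
TWO_PHI_F = 0.7               # surface-potential term 2*phi_F (V)
SUBTHRESHOLD_SWING_V = 0.090  # assumed ~90 mV of Vt per decade of leakage

def vt_shift(v_sb: float) -> float:
    """Threshold-voltage increase for a reverse source-to-body bias v_sb (body-effect equation)."""
    return GAMMA * (math.sqrt(TWO_PHI_F + v_sb) - math.sqrt(TWO_PHI_F))

def leakage_reduction(v_sb: float) -> float:
    """Approximate factor by which subthreshold leakage falls at this bias."""
    return 10 ** (vt_shift(v_sb) / SUBTHRESHOLD_SWING_V)

for v_sb in (0.0, 0.25, 0.5):
    print(f"V_SB = {v_sb:.2f} V: Vt up {vt_shift(v_sb) * 1000:.0f} mV, "
          f"leakage down ~{leakage_reduction(v_sb):.1f}x")
```

Because of the square root, each extra increment of bias buys a little less Vt shift than the one before. The exponential leakage dependence, however, still turns a modest shift into a large reduction.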

[Figure LP-01-f]

Summary
The problem with low-power design is that there’s so much of it. We have really only scratched the surface here. For example, at the initial system-design, architectural evaluation level, one critical task is to partition the system into its hardware and software components. Hardware implementations are fast and consume relatively little power, but they are “frozen in silicon” and cannot be easily modified to address changes in the standards or the protocols. By comparison, software implementations are slow and consume a relatively large amount of power, but they are extremely versatile and can be modified long after the chip has gone into production.

Another interesting area to consider is the interconnect mechanism used to link the various functional blocks forming the chip. For example, conventional synchronous bus architectures constantly burn power, even if they aren’t actually moving any data around. One solution is to move to a globally asynchronous locally synchronous (GALS) architecture. In this case, data flows as fast as possible through the self-timed (asynchronous) interconnect network because there is no waiting for clock edges, and the power consumed by the buses is dictated by their traffic loads. Furthermore, the clocks associated with the synchronous blocks can be stopped (or gated) when those blocks are not being used.
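
The key claim here is that a self-timed interconnect spends energy in proportion to traffic, while a free-running synchronous bus pays for its clock on every cycle whether or not data moves. The Python sketch below compares the two under an assumed energy model. The per-cycle clock energy, per-word data energy, and per-transfer handshake overhead are hypothetical values in arbitrary units. The model also ignores the possibility of gating the synchronous bus clock.

```python
# Hypothetical per-event energies (arbitrary units), for illustration only.
E_CLOCK_PER_CYCLE = 1.0  # clock distribution on a synchronous bus, paid every cycle
E_DATA_PER_WORD = 2.0    # driving the data wires for one transfer
E_HANDSHAKE = 0.5        # request/acknowledge signalling per asynchronous transfer

def synchronous_bus_energy(cycles: int, words: int) -> float:
    # The clock toggles on every cycle, even when no data is moving.
    return cycles * E_CLOCK_PER_CYCLE + words * E_DATA_PER_WORD

def asynchronous_bus_energy(words: int) -> float:
    # No free-running clock: energy is spent only when a word actually moves.
    return words * (E_DATA_PER_WORD + E_HANDSHAKE)

CYCLES = 1_000
for utilization in (0.0, 0.1, 0.5, 1.0):
    words = int(CYCLES * utilization)
    print(f"{utilization:>4.0%} busy: sync {synchronous_bus_energy(CYCLES, words):7.1f}, "
          f"async {asynchronous_bus_energy(words):7.1f}")
```

With these assumptions, the idle synchronous bus still burns 1,000 units while the asynchronous one burns nothing. Even at full load, the asynchronous bus comes out ahead only by the margin between clock energy and handshake overhead, so the real payoff is at low and bursty traffic.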

And so far we’ve only pondered various implementation aspects of low-power design. Another huge area of interest is low-power-aware verification. Take the case of power shutoff, for example. This rarely involves powering down just a single block, and when multiple blocks are being powered down (and back up again) there has to be a defined sequence. And what happens if the device is halfway through a power-down sequence when that pesky user presses a button requesting this functionality? In this case, the chip has to gracefully abort the power-down and return any already-disabled functions to their active state. All of this has to be verified. But that’s a topic for another day.
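
A graceful abort can be sketched in Python as a sequencer that records each completed power-down step and, if a wake request arrives part-way through, undoes those steps in reverse order. The step names and their ordering are assumptions: one commonly used ordering for a single block, not a universal one. In a real design, the sequence comes from the design's power intent and the power controller's specification, and `wake_request_at_step` is a hypothetical stand-in for the user's button press.

```python
from typing import Optional

# Assumed power-down ordering for one block; real sequences are design-specific.
POWER_DOWN_STEPS = [
    "stop_clocks",
    "save_retention_state",
    "enable_isolation",
    "switch_off_power",
]
# How to reverse each step if the sequence is abandoned part-way.
UNDO_ACTIONS = {
    "stop_clocks": "restart_clocks",
    "save_retention_state": None,  # supply never dropped, so registers still hold their values
    "enable_isolation": "release_isolation",
    "switch_off_power": "switch_on_power",
}

def run_power_down(wake_request_at_step: Optional[int] = None) -> list[str]:
    """Run the power-down sequence, aborting gracefully if a wake request arrives.

    wake_request_at_step is the index of the step before which the (hypothetical)
    wake request is noticed; None means no request arrives during the sequence.
    """
    log: list[str] = []
    completed: list[str] = []
    for i, step in enumerate(POWER_DOWN_STEPS):
        if wake_request_at_step is not None and i >= wake_request_at_step:
            log.append("abort: wake request received")
            # Undo completed steps in reverse order to return the block to its active state.
            for done in reversed(completed):
                action = UNDO_ACTIONS[done]
                if action is not None:
                    log.append(action)
            return log
        log.append(step)
        completed.append(step)
    return log

print(run_power_down())
print(run_power_down(wake_request_at_step=3))
```

Every possible abort point produces a different undo path. This is a small illustration of why the verification effort grows so quickly once multiple interacting blocks are involved.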