Experts At The Table: The Power Problem

Second of three parts: Power bugs, additional savings, the growing problems of optimizing power, measuring power over time instead of statically.


Low-Power Engineering sat down to discuss the issues in low-power design with Vic Kulkarni, general manager and senior vice president of the RTL business unit, Apache Design Solutions; Pete Hardee, solutions marketing manager at Cadence; Bernard Murphy, chief technology officer at Atrenta; and Bhavna Agrawal, manager of circuit design automation at IBM. What follows are excerpts of that conversation.

LPE: How important is software in lowering power?
Hardee: When you’re running software on the processor, it’s a matter of how efficiently data is moved through the memory architecture and caches. People need to validate that they’re using the right cache algorithm for the right application. There are a number of alternatives. We’re extending into transaction processing for that reason.
Murphy: As soon as you go up to an external memory interface your power goes way up.

LPE: How far can we reduce power, though?
Kulkarni: Functional verification has been addressed by many people around the world, and a functional bug is a well-known term. We’re finding there are now power bugs. Some experts have reduced power by 100x over the last five years using every known technique. What they want to know is how they can reduce it further. That’s where we’re finding what we call ‘power bugs.’ In a multicore processor, for example, one core was running but the data was not shut off in the other cores, creating unnecessary activity. Designers are not tuned to that, because functional verification will not find it. We are finding more and more of these ‘power bugs’ every day when we see hot spots. The chip may be in ‘off’ mode but the address line may be running. Or the clock may be running and the data may be off. There are a lot of savings in those.
Hardee: With all these techniques to control power there’s a different kind of power bug we’re starting to see, which is a functional or structural error in the implementation of the power intent. If I have multiple power domains I may be using power shutoff, power shutoff with retention, and multiple supply voltages that I’m adaptively or dynamically controlling. So I need a whole bunch of isolators and level shifters. A lot of verification tools are picking up errors. Those are what we call power bugs.
Kulkarni: To us those are structural bugs. What we’re finding is bugs in the relationships between signals.
Agrawal: And it’s not easy to find these bugs. If you look at fine-grain power gating it sounds very good on paper. But if you really want to do functional verification with fine-grain power gating it is very difficult.
Murphy: The whole power management architecture has become scarily complex. You have a huge amount of functionality that has to change each time you change the underlying mission-mode functionality. So you have a problem that is probably significantly more complex than ATPG (automatic test pattern generation), but you have no automation tools for it. It’s essentially being done manually.
Hardee: ATPG is its own issue. After you set up all those power domains you have to do scan insertion. We’ve seen a couple of problems cropping up. One is that you didn’t consider all the power domains, so half your scan chain is powered off. Or you may do a great job of isolating the power domains, but when you insert test, all those scan signals cross power domains and are incorrectly isolated. Power-aware DFT is becoming a big thing because of this.
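The clock-on/data-off scenario Kulkarni describes can be illustrated as a simple activity-trace check. This is a minimal sketch, assuming a hypothetical trace of per-window toggle counts; the trace format and the `find_power_bugs` helper are invented here for illustration, not any vendor tool’s API.

```python
# Hypothetical sketch: flag windows where a domain's clock keeps
# toggling while its data signals are idle -- a classic "power bug"
# that functional verification will not catch.

def find_power_bugs(trace, idle_threshold=0):
    """trace: list of per-window dicts like
    {"window": 0, "clock_toggles": 1000, "data_toggles": 420}.
    Returns windows where the clock is active but data is idle."""
    bugs = []
    for w in trace:
        if w["clock_toggles"] > 0 and w["data_toggles"] <= idle_threshold:
            bugs.append(w["window"])
    return bugs

trace = [
    {"window": 0, "clock_toggles": 1000, "data_toggles": 420},
    {"window": 1, "clock_toggles": 1000, "data_toggles": 0},  # clock on, data off
    {"window": 2, "clock_toggles": 0,    "data_toggles": 0},  # fully gated: fine
]
print(find_power_bugs(trace))  # -> [1]
```

Window 1 is the bug: the clock network burns dynamic power even though no useful data is moving, so it is a candidate for clock gating.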

LPE: Will we ever really be able to lower the voltage below 0.8 volts?
Hardee: That’s the limitation of bulk CMOS. There are other technologies like SOI and gate stack technology. If Moore’s Law is to continue, those changes will have to happen.
Agrawal: The voltage can go down below 0.8 volts. We may have to re-think architectures and topologies, but it can go down.
Kulkarni: We’re seeing islands of voltages. If you’re reading on an iPad in a certain mode as opposed to watching a video, there are different requirements. DVFS (dynamic voltage frequency scaling) and AVS (adaptive voltage scaling) might be more interesting because of this.
Murphy: DVFS is a popular term and a popular idea, but it dramatically increases the complexity of your synchronization problem. You’ve gone through a lot of work to make sure you’re not going to get handshake issues and suddenly that’s become orders of magnitude more complex.
Hardee: We’re seeing DVFS being confined to a separable block, but not generally across a design for that reason.
Kulkarni: You can envision different modes of operation across an SoC. You can have different schemes of operation.
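As a back-of-the-envelope view of why DVFS pays off: dynamic CMOS power scales roughly as V²·f, so lowering voltage along with frequency saves far more than frequency scaling alone. A minimal sketch with invented round numbers (the activity factor, capacitance, and operating points are illustrative only):

```python
def dynamic_power(alpha, cap, vdd, freq):
    """Classic CMOS dynamic power estimate: P = alpha * C * V^2 * f."""
    return alpha * cap * vdd**2 * freq

# Invented operating points for illustration.
full = dynamic_power(0.2, 1e-9, 1.0, 1e9)    # 1.0 V at 1 GHz
dvfs = dynamic_power(0.2, 1e-9, 0.8, 0.5e9)  # 0.8 V at 500 MHz

# Halving frequency alone would halve power; scaling V down to 0.8 V
# as well multiplies in another 0.64, for roughly a 3x reduction.
print(dvfs / full)  # -> ~0.32
```

The quadratic voltage term is the whole attraction, which is why designers accept the synchronization complexity Murphy mentions, even if only within a separable block.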

LPE: That’s particularly true in 3D ICs, right?
Agrawal: 3D makes the power problem even worse. Today you already have problems supplying enough power to these chips. So when you stack three, five or 10 chips, you will be supplying power to them with the same technology. Something has to change.
Hardee: And where do you put the heat sink?

LPE: Where is the solution? Is it one thing or 50 things?
Agrawal: No doubt it’s 50 things. There is no one solution. Historically we went up the power ramp with CMOS and came down with multicore—and we still haven’t quite conquered that.
Kulkarni: We’re seeing disparate chips in a 3D stack, but not logic, because of power and thermal management. The memory and RF sections are getting stacked. It’s a flexible PCB.
Murphy: It’s bandwidth to the chip.

LPE: One of the phrases we’re hearing more about is design for variability. What does that do for power? Does it make it harder to manage?
Murphy: Variability is built into the power problem from the beginning.
Agrawal: There is a danger of optimizing your particular design for a power corner and not realizing the power sensitivity of the design to the process corners. If you optimize for one process corner and you move to another your power may go crazy. It’s very difficult to optimize power for variability.
Murphy: For a cell phone there are so many ways you can use it that there’s a huge amount of variability right there.
Agrawal: With process variability you have a non-use mode and an active mode, but the process corner could move from one to the other and the leakage power can go up dramatically. Power is always an average across off and on modes. The danger is it might not be optimized for a particular corner.
Hardee: The number of corners is exploding, as well. At 40nm, we’re looking at about 20 corners. And with the cell libraries they’re working with, the difference in the leakage is orders of magnitude from worst case to best case. These aren’t small variations. There is a lot of variability, and that variability is showing up as leakage. Is it realistic to design for worst case? It’s not.
Agrawal: No. And as we move to 22nm, more and more people will have to do statistical timing. Worst case doesn’t leave you enough room to improve performance.

LPE: How much does process variability affect power? Seven atoms vs. eight atoms can affect performance, but will it affect power?
Hardee: Absolutely. That’s what’s giving such a wide variability in terms of leakage.
Agrawal: Leakage power varies widely with process variation.
Kulkarni: And temperature. It’s more than just power; it’s energy. Energy, which is power over time, has become critical to how people design chips. They want to look at several intervals to see what happens to power. It could be completely different power as you go to different modes of operation. That’s where leakage can cause havoc. The thermal gradient can change over time, and it becomes a circular computation.
Agrawal: If you run two wires in parallel, the temperature in one can increase the temperature of the neighboring wire.
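The “circular computation” Kulkarni mentions can be sketched as a fixed-point iteration: power raises die temperature, higher temperature raises leakage, and higher leakage raises power again. The thermal resistance, leakage-doubling interval, and power figures below are invented round numbers for illustration, not data for any real process.

```python
# Hedged sketch of the electrothermal loop: iterate
# power -> temperature -> leakage until the numbers settle.

def settle_leakage(p_dyn=1.0, p_leak0=0.2, theta=10.0, t_amb=25.0,
                   doubling=15.0, iters=50):
    """theta: thermal resistance in deg C per watt (invented);
    doubling: leakage doubles every `doubling` deg C of temperature
    rise (a common rule-of-thumb shape, coefficients invented).
    Returns the settled die temperature and leakage power."""
    p_leak = p_leak0
    temp = t_amb
    for _ in range(iters):
        temp = t_amb + theta * (p_dyn + p_leak)   # power heats the die
        p_leak = p_leak0 * 2 ** ((temp - t_amb) / doubling)  # heat raises leakage
    return temp, p_leak

temp, leak = settle_leakage()
print(round(temp, 1), round(leak, 2))  # settles near 38.8 C, 0.38 W
```

With gentler coefficients the loop converges, as here; with a higher thermal resistance or a faster leakage-doubling rate the same iteration diverges, which is thermal runaway.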

LPE: How do you define energy vs. power?
Kulkarni: It’s the holistic picture. It’s not just a snapshot but power over time, so you have to figure in all your components over time. You have to look at the impact of thermal effects and the heat around them.
Murphy: There are techniques available to deal with that. There are little PDP monitors you can plant around the chip to do local compensation and adjust for some of that.
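Kulkarni’s “power over time” framing reduces to a simple sum: energy is power integrated across the operating profile, mode by mode. A minimal sketch with an invented mobile-SoC profile (the durations and power figures are illustrative only):

```python
def energy_joules(intervals):
    """intervals: (duration_seconds, power_watts) pairs, one per
    operating mode. Energy is power integrated over time."""
    return sum(t * p for t, p in intervals)

# Invented profile: an hour mostly idle plus a one-minute video burst.
profile = [(3600.0, 0.005),  # idle: 5 mW for an hour  -> 18 J
           (60.0, 1.2)]      # video: 1.2 W for a minute -> 72 J
print(energy_joules(profile))  # -> 90.0
```

The point of the per-interval view: the one-minute active burst costs four times the energy of the entire idle hour, which is why a static power snapshot says so little about battery life.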