Second of two parts: The cost of mistakes, when to start using power domains and what to expect when you do.
By Ed Sperling
Low-Power Engineering moderated a panel featuring Bhanu Kapoor, president of Mimasic; John Goodenough, director of design technology at ARM; and Prapanna Tiwari, corporate applications engineer at Synopsys. What follows are excerpts of their presentations, as well as the question-and-answer exchange that followed.
Prapanna Tiwari: Traditional techniques like clock gating and multi-threshold cells are not enough, so at 65nm and beyond everyone is almost forced to look at voltage control as their primary power technique. You don't even have a choice. If you go back to before we worried about power management, we had one state. Everything was on. Now, with multiple power domains on a design, even with two domains the shared state space goes up. It doesn't just mean 4x verification. Everything has to be re-done. I can shut down a domain and bring it back up. But while I'm doing this, the regions around it shouldn't be affected. They should not suddenly get stuck waiting for an input from the powered-down domain. This isn't just about off and on. You also have to do a reasonable amount of functional verification. You have to run test vectors to make sure the functional modes are working.
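A quick way to see why the shared state space grows faster than intuition suggests is to enumerate it. The Python sketch below is purely illustrative: it treats each switchable domain as being on, off, or mid-transition, which is an assumption for the sake of the count, not any particular tool's model.

```python
from itertools import product

# Each switchable domain can be ON, OFF, or mid-transition (RAMPING).
# Real designs constrain this with a power state table, but the raw
# product shows how quickly the shared state space grows.
DOMAIN_STATES = ("ON", "OFF", "RAMPING")

def shared_state_space(num_domains):
    """Enumerate every combination of per-domain power states."""
    return list(product(DOMAIN_STATES, repeat=num_domains))

for n in (1, 2, 4, 8):
    print(f"{n} domain(s): {len(shared_state_space(n))} shared power states")
# Two domains give 9 combined states once transitions are counted, which is
# why it is not just a 4x multiplier on the old single-state verification.
```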
There also are a whole new series of 'buts' that the verification has to account for. You have new power structures like isolation cells, level shifters and power switches. You need to make sure they're hooked up correctly and sequenced properly, and that all the quirks of having them on your chip are handled. The difficult part is that this starts all the way back at RTL. Low-power verification is not something you do after synthesis, once these structures are part of the netlist. You can have millions of lines of legacy code that has no description of power, but in your verification you have to account for it. That's the key challenge. And it has to be accurate.
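To make that concrete, here is a minimal behavioral sketch, in Python rather than RTL, of what an output isolation cell does. The class name, clamp policy and port names are hypothetical, chosen only to illustrate the behavior the verification environment has to supply when the legacy RTL says nothing about power.

```python
class IsolationCell:
    """Behavioral sketch of an output isolation cell.

    While the source power domain is off and isolation is enabled, the
    output is clamped to a known value instead of floating or going X.
    """

    def __init__(self, clamp_value=0):
        self.clamp_value = clamp_value

    def resolve(self, data_in, domain_on, iso_enable):
        if not domain_on and iso_enable:
            return self.clamp_value   # clamped: the neighbor sees a defined value
        if not domain_on:
            return "X"                # bug: un-isolated output from a dead domain
        return data_in                # domain on: data passes through normally


iso = IsolationCell(clamp_value=0)
print(iso.resolve(data_in=1, domain_on=True,  iso_enable=False))  # 1 (normal operation)
print(iso.resolve(data_in=1, domain_on=False, iso_enable=True))   # 0 (clamped while off)
print(iso.resolve(data_in=1, domain_on=False, iso_enable=False))  # X (isolation missing or late)
```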
Some groups say they have inserted their isolation cells and level shifters correctly. As long as you make sure they can handle all the power modes, why do you need to simulate? That's not enough. There are dynamic cases that put these structures into situations for which they were not designed. You cannot model that statically. You might be able to do that for two domains, but you can't do that for 10 domains. With two domains you already have four power modes. With eight domains the state space is huge. Dynamic verification at the early stages of design is essential.
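One way to picture what a static check misses: a connectivity check can confirm that an isolation cell exists and is wired up, yet say nothing about the order in which its control signals toggle. The sketch below is a hypothetical sequence check, with invented event names, that flags power being removed before isolation is enabled.

```python
def check_shutdown_order(events):
    """Flag a shutdown sequence that cuts power before isolation is active.

    `events` is an ordered list of (time, event) tuples for one domain,
    e.g. [(0, "iso_enable"), (5, "power_off")].
    """
    iso_time = next((t for t, e in events if e == "iso_enable"), None)
    off_time = next((t for t, e in events if e == "power_off"), None)
    if off_time is not None and (iso_time is None or iso_time > off_time):
        return "FAIL: domain powered off before isolation was enabled"
    return "PASS"

# Structurally identical designs, dynamically very different sequences:
print(check_shutdown_order([(0, "iso_enable"), (5, "power_off")]))  # PASS
print(check_shutdown_order([(0, "power_off"), (5, "iso_enable")]))  # FAIL
```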
Making matters worse, you have a whole new effect. Voltage is a new parameter. We're used to four-value simulation: 1, 0, X and Z. Now it's a 1 at a certain voltage, so a 1 at 0.6 volts is not the same as a 1 at 0.2 volts. If these two interact there will be a problem. The simulation environment has to be aware of this.
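One way to picture voltage-aware simulation is to attach a supply voltage to every logic value and refuse to treat two 1s as equivalent when their voltages differ too much. The sketch below only illustrates the idea; the 30% tolerance and the resolve-to-X policy are assumptions, not any tool's actual rules.

```python
from collections import namedtuple

# A logic value is no longer just 0/1/X/Z; it also carries the supply
# voltage of the domain that drives it.
Value = namedtuple("Value", ["logic", "volts"])

def resolve_crossing(signal, receiver_vdd, tolerance=0.3):
    """Resolve a cross-domain signal at a receiver supplied at `receiver_vdd`.

    If the driver's 1 sits far below the receiver's supply and there is
    no level shifter in the path, treat the result as unknown.
    """
    if signal.logic == 1 and signal.volts < receiver_vdd * (1 - tolerance):
        return "X"            # a weak 1 the receiver may not register
    return signal.logic

print(resolve_crossing(Value(1, 0.6), receiver_vdd=0.6))  # 1 : same-voltage domains
print(resolve_crossing(Value(1, 0.6), receiver_vdd=1.0))  # X : 1 at 0.6 V is not 1 at 1.0 V
print(resolve_crossing(Value(1, 0.2), receiver_vdd=1.0))  # X : badly undervolted 1
```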
Even if you're only using two power domains, supplies can vary depending upon external interrupts and software algorithms, and you might have overlapping transitions between voltages that are not zero-delay events. Transitions can last tens of microseconds. That has implications for how a block behaves as it moves from one state to another. You need to account for these.
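Because a supply transition lasts a measurable amount of time, it helps to model it as a timed ramp rather than an instantaneous switch; then you can ask what each supply is doing at any given moment and find the windows where two transitions overlap. This is a minimal sketch with made-up voltages and ramp times, assuming a simple linear ramp.

```python
def supply_voltage(t_us, start_v, end_v, t_start_us, ramp_us):
    """Linear-ramp model of a supply transition (illustrative only)."""
    if t_us <= t_start_us:
        return start_v
    if t_us >= t_start_us + ramp_us:
        return end_v
    frac = (t_us - t_start_us) / ramp_us
    return start_v + frac * (end_v - start_v)

# Two domains start ramping down together but take different times to settle.
for t in (0, 5, 10, 20, 40):
    va = supply_voltage(t, 1.0, 0.6, t_start_us=0, ramp_us=20)  # domain A: 20 us ramp
    vb = supply_voltage(t, 1.0, 0.7, t_start_us=0, ramp_us=35)  # domain B: 35 us ramp
    print(f"t={t:>2} us  VA={va:.2f} V  VB={vb:.2f} V")
# For roughly 35 us neither supply sits at a nominal corner, and any signal
# crossing between the domains during that window has to be verified too.
```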
Whatever you're simulating, the simulation platform has to understand voltage as an analog value and its effect on the four-value logic simulation. Without that, you can run your regressions and it still will not be enough.
In one example, we had two power domains: a high-performance mode where the two domains run at 1.1 volts and 1.2 volts, and a low-performance mode where they run at 0.8 and 0.9 volts. The library guy didn't think there was a need for level shifters, and we thought he was right, too. But as you transitioned from one mode to the other, the voltage ramps were different, causing corruption in the chip and problems everywhere.
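The arithmetic behind that anecdote can be sketched in a few lines. The static difference between the two supplies is only 0.1 volt in either mode, but if the ramps differ, one domain can already sit at its low-performance voltage while the other is still at its high-performance voltage. The tolerance figure below is an invented number used only to make the comparison concrete.

```python
# Nominal supply pairs from the example above (volts).
HIGH_PERF = {"A": 1.1, "B": 1.2}
LOW_PERF  = {"A": 0.8, "B": 0.9}

# Worst-case transient pairing if the two supplies ramp at different rates:
# one domain has finished its transition while the other has not started.
worst_transient = max(abs(a - b)
                      for a in (HIGH_PERF["A"], LOW_PERF["A"])
                      for b in (HIGH_PERF["B"], LOW_PERF["B"]))

TOLERANCE_V = 0.2  # hypothetical limit a receiver tolerates without a level shifter
print(f"static delta: 0.10 V, worst transient delta: {worst_transient:.2f} V")
print("level shifters needed:", worst_transient > TOLERANCE_V)
# The static numbers look safe; the transient ones do not.
```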
There are corner cases you might think you've accounted for in your testbench and specs, but how do you make sure you actually have? You may have different modules coming in from various sources. Verification is the gatekeeper. You have to consider transient states.
You need a robust static checker methodology, too. If not, it's going to hurt you in dynamic verification. You're going to spend a lot of time trying to figure out all the bugs: the structure-related bugs, the connectivity-related bugs. Those lead to wrong simulation behavior. If you're lucky, you'll catch it in dynamic verification and you'll spend weeks fixing it. But if you don't hit that specific test vector, you won't catch it at all. You need a consistent methodology all the way from power intent.
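As an illustration of the kind of rule a static checker can enforce cheaply before any simulation runs, the sketch below applies one invented rule to a toy netlist: every signal leaving a switchable domain must have an isolation cell. The data structure and names are hypothetical.

```python
# Toy list of domain-crossing signals and whether an isolation cell was
# inserted on each one (purely illustrative data).
crossings = [
    {"signal": "cpu_req", "from": "PD_CPU", "switchable": True,  "has_iso": True},
    {"signal": "cpu_ack", "from": "PD_CPU", "switchable": True,  "has_iso": False},
    {"signal": "aon_clk", "from": "PD_AON", "switchable": False, "has_iso": False},
]

def missing_isolation(crossings):
    """Static rule: every output of a switchable domain must be isolated."""
    return [c["signal"] for c in crossings
            if c["switchable"] and not c["has_iso"]]

print("missing isolation on:", missing_isolation(crossings) or "none")
# Catching 'cpu_ack' here takes seconds; finding it later as an occasional
# X-propagation failure in dynamic simulation can take weeks.
```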
Even one inverter inserted incorrectly in a power signal is enough to kill a particular power mode. The annoying thing about power bugs is, more often than not, they’re going to kill your chip. With a functional bug, you may get away with it. With power, one wrong transition, one wrong polarity, and the chip may get stuck in a dead mode that it cannot come out of. That’s why it’s very important to start early and to be comprehensive. Even if you catch these later in the design flow, you’ll pay a higher penalty than if it was a functional bug you needed to fix.
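That stuck-in-a-dead-mode failure can be pictured as a reachability question over the power-mode transition graph: from every mode the chip can enter, there must be a path back to fully on. The modes and the missing edge in this sketch are hypothetical; a wrong polarity on one control signal is exactly the kind of bug that deletes such an edge.

```python
from collections import deque

# Hypothetical power-mode transition graph. The wake-up edge from SLEEP
# back to ON has been lost, e.g. through an inverted control polarity.
modes = {
    "ON":        ["SLEEP", "RETENTION"],
    "SLEEP":     [],
    "RETENTION": ["ON"],
}

def reachable(graph, start):
    """Return every mode reachable from `start` (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

for mode in reachable(modes, "ON"):
    if "ON" not in reachable(modes, mode):
        print(f"dead mode: {mode} can be entered but never exited back to ON")
```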
Before tapeout, you need to make sure what you have in your GDS is the same as what you started with. The methodologies, UPF and CPF, keep the power intent as a separate file. At signoff you want to take what is actually in the final netlist, compare it with the power intent, and see whether they match. Without that, you're not guaranteed proper behavior.
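Conceptually that signoff step is a set comparison: the power structures the intent file asks for versus the ones present in the implemented design. The sketch below uses invented domain names and a deliberately missing level shifter to show the shape of the check.

```python
# What the power intent (UPF/CPF) asks for, and what was extracted from
# the final netlist. Both sets are invented for illustration.
intent = {
    ("PD_CPU", "isolation"),
    ("PD_CPU", "power_switch"),
    ("PD_GPU", "level_shifter"),
}
implemented = {
    ("PD_CPU", "isolation"),
    ("PD_CPU", "power_switch"),
    # The GPU level shifter was optimized away somewhere in the flow.
}

missing = intent - implemented
extra = implemented - intent

print("missing vs. intent:", sorted(missing) or "none")
print("extra vs. intent:  ", sorted(extra) or "none")
# Signoff should fail here: the implementation no longer matches the
# power intent that was verified.
```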
In a nutshell, low-power verification depends on how complete your architecture and power intent are to start with. You will have to modify your testbench methodology to eventually get to the point where you have full coverage. Libraries are important because you have new cells, new behaviors, new attributes that were not important in the past, but which you now have to capture. And people have to pay attention to all of this. Teams need to be educated to understand the impact at every step. Low power is not a point solution. Somebody downstream is going to have a really bad day if it isn’t done right up front—and it’s only going to get worse and more complicated.
LPE: The majority of design engineers haven’t even attempted to work with this stuff because it’s too complicated. When is it required—at what node—and what’s the cost?
Goodenough: People start this when they've had a major mistake, not at a particular node. A lot of the existing ad hoc verification techniques have worked and do work. What's hard to see is that once you pass a certain point on the complexity curve you run a much higher risk of failure. There's no simple answer. If you're running a chip with 40 or 50 domains your likelihood of failure is higher than with two domains. The reason people don't do it is the cost, and the cost really tends to be the schedule impact of a new methodology on the first product.
Tiwari: The first attempt is always ad hoc. The second is when you want to fix the problems. The third time you’ve learned enough and add the necessary skills because you realize that power is here to stay. In terms of cost, you need to weigh the cost of a silicon bug against the cost of investing up front in methodology.
LPE: Once you make that investment, does it still take longer?
Goodenough: It’s a question of first chip or third chip. Once you’ve got an established flow, it doesn’t necessarily take longer because you’ve figured out how to sign off and how to fit it into schedules. It’s like rolling out any new methodology, but this one is in spades. You can trade off cost of silicon failure vs. that investment, but future cost of failure vs. NRE is an interesting discussion from a dollars standpoint.
Kapoor: It’s not that companies aren’t doing low-power checking. It may not be at the scale necessary. All of the cell phone makers were doing this in 2002. But the design complexity has increased so there is a need for a scalable methodology.