What goes wrong and what you need to know to get a head start on the next round of complexity.
By Ed Sperling
Power islands and multiple voltages used to be reserved for cell phone and processor companies, but as more companies move to the 65nm and 45nm process nodes these approaches to saving power, particularly in chips with multiple cores, are becoming mainstream.
The problem isn’t in the architecture of the chips, although that certainly brings its own set of challenges. More and more, the real holdup is at the verification level. While the percentage of time spent in verification has remained relatively steady, anywhere from 50% to 75% of the total time from architectural design to tapeout, the size of verification teams has doubled and in some cases tripled.
“Verification is the next big challenge,” said Naveed Sherwani, CEO of Open Silicon. “As an industry we have not done a good job managing verification. A new methodology would be very welcome. We have had to develop methodologies in-house to deal with this.”
Sizing up the problem
All of the major EDA vendors recognize the extent of the problem. They’ve been dealing with horror stories from the field since the 90nm process node. And according to TSMC, about two-thirds of the industry is now at that node or beyond.
The most advanced parts of the semiconductor industry are now working on 32nm and 28nm, with even more power states—on, off, sleep, and sometimes even more in-between states—more power islands and more processor cores. In the most advanced chips, some of those cores are even heterogeneous, which means they may have different voltages and states than the other cores. That allows a system to reduce power consumption overall and concentrate power where and when it’s most needed.
“When you cross 100nm, you’ve got to design this stuff in or you’re not competitive,” said Barry Pangrle, solutions architect for low-power design and verification at Mentor Graphics. “We’ve got a number of people well down the road on this. Larger companies with larger design teams can afford the engineering expense to make this work. But as more people go to more advanced nodes they’re going to be dealing with issues they never had to deal with before.”
The first thing that most designers encounter is complexity. What used to be done on a spreadsheet is much harder to manage now.
“There are a whole series of interrelated topics of increasing complexity,” said Srikanth Jadcherla, group director for R&D at Synopsys. “The state space is huge, and when you start dealing with three or four power islands it’s amazing how quickly the number of states and sequences explodes.”
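To see how fast that explosion happens, it helps to count. The back-of-the-envelope sketch below is purely illustrative; the island names and their three power states apiece are assumptions, not taken from any particular design.

```python
from itertools import product

# Hypothetical example: four power islands, each with three power states.
# Real designs may have more islands and more states per island.
islands = {
    "cpu":   ["on", "off", "sleep"],
    "gpu":   ["on", "off", "sleep"],
    "modem": ["on", "off", "sleep"],
    "dsp":   ["on", "off", "sleep"],
}

# Every combination of island states is a distinct chip-level power state.
states = list(product(*islands.values()))
print(f"{len(states)} chip-level power states")  # 3^4 = 81

# Sequences are worse: the number of ordered single-step transitions
# between distinct states grows with the square of the state count.
print(f"{len(states) * (len(states) - 1)} possible single-step transitions")
```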
And it compounds quickly in practice. Consider, for example, what happens when you’re checking e-mail on a mobile device. The processor wakes up a number of mixed-signal blocks, then turns off what’s not being used. But that sequence has to be ordered, which means the power islands have to be ordered, too.
“You may wire it from low to high when you need to go from high to low,” said Jadcherla. “The problem is that you’re trying to predict island orders. You can create a safe graph, which is a set of possible states, so you can look at a design and ask, ‘What are the safe ways this will work?’ But when you’re dealing with 36 to 40 islands, there’s no way you can set it up safely.”
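One way to picture that safe graph is as an explicit whitelist of legal states and transitions. The sketch below is a toy model under assumed names (two islands, cpu and radio), not any vendor’s tool; it rejects a wake-up ordering that passes through an unlisted state.

```python
# Hypothetical safe graph for two islands (cpu, radio).
# Each state is a tuple (cpu_state, radio_state); edges are legal transitions.
SAFE_STATES = {
    ("on",  "on"),
    ("on",  "off"),
    ("off", "off"),
    # ("off", "on") is deliberately absent: in this toy model the radio
    # must never be powered while the cpu island that services it is down.
}

SAFE_EDGES = {
    (("on", "on"),   ("on", "off")),
    (("on", "off"),  ("on", "on")),
    (("on", "off"),  ("off", "off")),
    (("off", "off"), ("on", "off")),
}

def check_sequence(seq):
    """Walk a proposed power-down/power-up ordering and flag violations."""
    for state in seq:
        if state not in SAFE_STATES:
            return f"illegal state: {state}"
    for a, b in zip(seq, seq[1:]):
        if (a, b) not in SAFE_EDGES:
            return f"illegal transition: {a} -> {b}"
    return "sequence is safe"

# Wired low-to-high when it needed to go high-to-low: the radio wakes first.
print(check_sequence([("off", "off"), ("off", "on"), ("on", "on")]))
# -> illegal state: ('off', 'on')
```

At 36 to 40 islands the whitelist itself becomes impractical to write and maintain by hand, which is exactly Jadcherla’s point.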
Tales from the crypt
One of the most common sources of mistakes in chip engineering is a design team’s internal organization and communication. The way the team is structured, and the way it communicates, has to reflect what’s going on in the chip’s design and verification.
“We’ve seen problems in a library group, for example, where they save power in a certain way that’s different from other groups,” said Mike Carroll, product marketing manager for front-end design at Cadence. “Communication between teams is not always the tightest loop. If one group instantiates a cell the wrong way, you may have power shutoff without state retention.”
In a library, that can be disastrous for a system—or at least some of the system’s functions.
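What power shutoff without state retention looks like can be modeled abstractly. The toy Register class below is a hypothetical illustration: a retention register shadows its value in an always-on latch across power-down, while an ordinary one comes back up with an unknown value.

```python
import random

class Register:
    """Toy model of a register inside a switchable power domain."""
    def __init__(self, retention=False):
        self.retention = retention
        self.value = 0
        self._shadow = None  # always-on shadow latch, used only with retention

    def power_down(self):
        if self.retention:
            self._shadow = self.value      # save state to the always-on latch
        self.value = None                  # domain contents are lost

    def power_up(self):
        if self.retention and self._shadow is not None:
            self.value = self._shadow      # restore the saved state
        else:
            self.value = random.randint(0, 1)  # powers up to an unknown value

good = Register(retention=True)
bad = Register(retention=False)   # e.g., a mis-instantiated library cell
for r in (good, bad):
    r.value = 1
    r.power_down()
    r.power_up()
print(good.value, bad.value)      # 1, and then 0 or 1 at random
```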
It’s also a big problem in flash memory. Consider, for example, a smart phone where the low-battery signal is flashing and the system is ready to shut down in order to keep enough charge in the device to maintain essential data in memory.
“If you get a phone call at that time and you pick it up, it can be disastrous for the system,” said Synopsys’ Jadcherla. “But how do you prove that? It’s not easy. You need to come up with a methodology to test it. That’s where random constraints and testing come in.”
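A minimal sketch of what such a randomized test might look like, assuming a made-up three-step shutdown sequence and a deliberately buggy policy that services the call immediately: randomizing when the call arrives eventually hits the interleaving where memory state has not yet been saved.

```python
import random

STEPS = ["save_memory", "isolate_outputs", "power_off"]

def shutdown_with_call(call_step):
    """Toy model: a low-battery shutdown interrupted by an incoming call.

    A buggy policy services the call immediately, even if memory state
    has not been saved yet; the checker flags that interleaving.
    """
    memory_saved = False
    for step, action in enumerate(STEPS):
        if step == call_step and not memory_saved:
            return "BUG: call serviced before memory state was saved"
        if action == "save_memory":
            memory_saved = True
    return "ok"

rng = random.Random(7)  # fixed seed keeps regression runs reproducible
outcomes = {shutdown_with_call(rng.randrange(len(STEPS) + 1))
            for _ in range(1000)}
print(outcomes)  # randomization hits both the clean and the buggy interleaving
```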
Another problem arises when engineers route signals across other blocks or power domains. Pangrle noted that this may not show up in the block diagram, particularly if the block being crossed is powered down.
“The key is to keep the logical hierarchy matching the physical hierarchy,” he said. “But design teams are not experienced with that. Another problem is that the signal may not be the same on one side as the other.”
That can also happen at advanced process nodes with process variations—an issue that no one even paid attention to at 130nm. At 45nm, it can be the difference between a functioning chip and a buggy one.
Advice from the experts
Low-power experts have consistently advised design teams to think about low power at the architectural level, and nothing has changed in that regard. What has changed is the number of possibilities verification has to cover. Adam Sherer, product marketing manager at Cadence, noted that the state space doubles with every power domain added: two domains yield four possible states, three yield eight, and so on.
“Verification does not have a theoretical limit, but pragmatically there are limitations,” Sherer said. “The problem is coverage. If you can manage to create a coverage loop, you can extend it to the power domains. We’re seeing the same from the functional teams. Randomized testing is where the functional coverage comes in. As long as there is coverage and you can see functional sequences, you have visibility into the power domain space. The design has to be able to come out of shutdown, and on the implementation side it has to work.”
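The arithmetic behind Sherer’s point, plus the coverage loop he describes, can be sketched in a few lines. Everything here, the domain names included, is hypothetical: the code enumerates the 2^N chip-level states for N on/off domains and reports which ones a batch of random tests actually visited.

```python
import random
from itertools import product

DOMAINS = ["cpu", "gpu", "modem"]            # N = 3 -> 2^3 = 8 states
ALL_STATES = set(product([0, 1], repeat=len(DOMAINS)))

visited = set()
rng = random.Random(1)
for _ in range(20):                          # each "test" lands in one state
    visited.add(tuple(rng.randint(0, 1) for _ in DOMAINS))

missed = ALL_STATES - visited
print(f"coverage: {len(visited)}/{len(ALL_STATES)} power states")
for state in sorted(missed):
    print("never exercised:", dict(zip(DOMAINS, state)))
```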
All of this means establishing power intent up front: specifying what gets shut off, and when.
All of the EDA companies say that a verification methodology helps as well, although each favors its own flavor, whether it’s OVM or VMM. Power-intent formats such as CPF and UPF, along with higher-level abstractions such as TLM 2.0, also help significantly.
“With TLM you can figure out what’s in hardware and what’s in software and which blocks run at which voltage,” said Pangrle. “Then you can put in which blocks to shut down entirely and specify the power states.”
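At that level of abstraction, the model can be little more than a table. The sketch below is hypothetical and far coarser than a real TLM 2.0 model; it records, per block, the hardware/software split, the supply voltage, and the supported power states, which is enough to reason about what can be shut down entirely.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """Coarse, TLM-style description of one block in the system model."""
    name: str
    implemented_in: str                       # "hardware" or "software"
    voltage: float                            # nominal supply, in volts
    power_states: list = field(default_factory=lambda: ["on"])

system = [
    Block("video_decode", "hardware", 0.9, ["on", "off"]),
    Block("audio_codec",  "hardware", 1.1, ["on", "sleep", "off"]),
    Block("ui",           "software", 1.0),   # runs on the always-on CPU
]

# Which blocks can be shut down entirely?
for b in system:
    if "off" in b.power_states:
        print(f"{b.name}: can be fully powered off ({b.voltage} V domain)")
```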
And if you can create an effective coverage model based upon those factors, then at least you have a chance of getting a chip out the door on time, possibly within budget, and one that actually works.