Less Room For Error

Design margins need to shrink with each new process node to meet lower power budgets.

By Ed Sperling

Say goodbye to fat design margins in advanced SoCs. The commonly used method of adding extra performance or area into semiconductors to overcome variability in manufacturing processes or timing closure issues has begun to create problems of its own.

While there was plenty of slack available at 90nm, adding margins at 45nm and 32nm disrupts performance, eats into an increasingly tight power budget, or both. And while this may seem like a relatively simple problem-solving exercise, margins are to a design engineer what a safety net is to a high-wire acrobat. They allow engineering teams to get to market on time and on budget, with an incredibly small number of bugs considering the complexity of current designs.

Cutting margins means substantially more up-front modeling and much more work in figuring out where the variability is in new manufacturing processes. It also means potentially more restrictive design rules and less creativity at the very front end of Moore’s Law.

Different approaches

“At 45nm and 32nm, you can’t put a margin on everything because your performance would go to zero,” said Rob Aitken, a research fellow at ARM. “For the relationship between design and low power, there are two approaches being advocated. One is to do a better job quantifying the margins. Instead of putting a finger in the air and saying, ‘Let’s worst case this and worst case that,’ the solution is more, ‘Let’s actually look at data and figure out where the worst cases lie, look for correlations and relationships between the amount of timing slack we have and our verification extraction methodology. Maybe we can use a better extraction technique and shave off some of that margin.’”

A second approach is a more adaptive one, where you know there will be some margins but you don’t know exactly what they are. “When you get your silicon you have adjustable parameters, whether they’re voltage or clock frequency or something else, that you can tune on a per-chip basis to boost up yield and achieve margin without necessarily putting it in the design,” Aitken said.
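The per-chip tuning Aitken describes can be pictured with a small sketch. It is only an illustration under stated assumptions: the linear delay model, the 400mV threshold, the voltage steps, the 450MHz target and the three example chips are all hypothetical and do not come from ARM or from any real silicon data.

```python
# Illustrative sketch of per-chip tuning after manufacturing.
# Every number and the delay model below are hypothetical assumptions.

THRESHOLD_MV = 400  # assumed threshold voltage for the toy model


def max_frequency_mhz(speed_factor, voltage_mv):
    """Toy model: achievable frequency rises linearly with supply voltage.

    speed_factor stands in for process variation (above 1.0 is a fast chip,
    below 1.0 is a slow chip).
    """
    return speed_factor * (voltage_mv - THRESHOLD_MV)


def tune_voltage(speed_factor, target_mhz, voltage_steps_mv):
    """Return the lowest voltage step that meets target_mhz, or None."""
    for voltage_mv in sorted(voltage_steps_mv):
        if max_frequency_mhz(speed_factor, voltage_mv) >= target_mhz:
            return voltage_mv
    return None  # this chip cannot meet the target at any allowed voltage


VOLTAGE_STEPS_MV = [800, 900, 1000, 1100]
chips = {"fast": 1.2, "typical": 1.0, "slow": 0.85}

# Each chip gets its own setting instead of one worst-case voltage for all.
settings = {
    name: tune_voltage(factor, 450.0, VOLTAGE_STEPS_MV)
    for name, factor in chips.items()
}
print(settings)
```

In this toy model the fast chip runs at the lowest voltage step while the slow chip needs a higher one. A single worst-case setting would have forced every chip to the slow chip's voltage, which is the power cost of building the margin into the design.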

There are other approaches being advocated as well. Bhanu Kapoor, founder of Mimasic, a consultancy in Richardson, Texas, said building workarounds into chips, such as classic fault tolerance, is an acceptable option.

“We need to start learning to live with errors,” Kapoor said. “Margin-related issues will lead to errors and they will not function correctly at times. That’s where you have to bring in techniques like fault tolerance, where you have error correction. That is a very useful technique for low power, too, because you can work at lower voltages. There will be times when your critical path timing will not be met and you will have errors. Then you try to detect the errors, correct them and learn to live with them.”
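As a rough illustration of the detect-and-correct idea, the sketch below uses a Hamming(7,4) code, a textbook single-error-correcting scheme. Kapoor did not name a specific technique, so the choice of code is an assumption made here for illustration, and the software functions only stand in for what would be hardware logic.

```python
# Illustrative Hamming(7,4) sketch: detect and correct a single flipped bit.
# The choice of code and all function names are assumptions for illustration.

def encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 1-based index of the bad bit
    if error_position:
        c[error_position - 1] ^= 1  # correct the error by flipping it back
    return [c[2], c[4], c[5], c[6]], error_position


word = encode([1, 0, 1, 1])
word[4] ^= 1  # simulate one bit corrupted by a missed critical-path timing
recovered, position = decode(word)
print(recovered, position)
```

The three extra parity bits are the overhead of this approach. A code like this corrects only one flipped bit per codeword, so it suits occasional errors rather than frequent ones.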

Still others say there should be no workarounds. Vinay Srinivas, group director for R&D at Synopsys, said the solution is eliminating variability up front so there is less need for margins and far fewer errors.

“You need better tools, modeling and methodology,” Srinivas said. “Having these guardbands is not acceptable. If you were to guardband everything when the system wakes up you would have so much latency that you couldn’t afford it in the design. At 45nm and 32nm, you need more voltage-aware modeling.”

What works?

While companies such as Synopsys are pushing for better designs up front, the majority of designs will still include some design margins—at least in the short term. Hamid Mahmoodi, assistant professor of electrical and computer engineering at San Francisco State University’s School of Engineering, said there are times when each approach works.

“There is a lot of variability and unpredictability in designs,” Mahmoodi said. “Adding margins is the easiest way to solve that. You can make the design faster than expected by adding in additional biasing or something to cope with the variation in processes. But adding margin means more silicon area and more power. There is cost in terms of additional sensors or voltage regulators. Even corrective action requires overhead.”

Sometimes, in fact, adding margin can be the most cost-effective solution.

“In a given process, which is more cost-effective depends,” Mahmoodi said. “If the variability is small, adding margins is the most cost-effective solution. When the variability is large, and there are variations in process parameters and voltage, then adding margins is too expensive. At that point, it’s best to consider fault tolerance schemes or adaptive asset calibration methods to make the design more reliable.”

Conclusion

The bottom line is that even the experts disagree on which route to take and when. That largely will be up to the design teams working under intense deadlines to get their chips out the door. But at each new process node, there clearly is less room for adding margins and more restrictive design rules for getting chips to yield properly and perform as planned within power limits defined by customers. And if you think it’s hard at 45nm, it’s only going to get more difficult over the next couple of nodes.