Worst-Case Results Causing Problems

At 10nm and 7nm, overdesign can affect power, performance and time to market.

The ability of design tools to identify worst-case scenarios has allowed many chipmakers to flag potential issues well ahead of tapeout, but as process geometries shrink that approach is beginning to create its own set of issues.

This is particularly true at 16/14nm and below, where extra circuitry can slow performance, boost the amount of power required to drive signals over longer, thinner wires—and subsequently cause thermal issues due to increased resistance and capacitance. It also can propel design teams to utilize more extensive power management schemes, which in turn require more time to implement, debug and verify.

“The schedule has not changed,” said Ruggero Castagnetti, distinguished engineer at Broadcom. “But many times, the results we get out of tools are very pessimistic. And sometimes a switch is running very fast—much faster than the rest of the chip, and faster than it needs to run. We need ways to address that. If it’s a manual process, sometimes you know if there’s a glitch and you can ignore that. We need tools with the knowledge of the frequency of this chip without someone having to say, ‘Here’s the voltage information for each instance.’”

This becomes more problematic at each new node below 16/14nm because even with a reduction in voltage for some components, the overall power budget is getting squeezed due to more functionality being added into chips.

“Total power may stay flat, but power density or current density is increasing,” said Castagnetti. “Associated with that, finFETs may lower leakage but the dynamic power is increasing. There is talk about starting earlier and using intelligent approaches so you don’t overdesign your mesh. What we really need is a methodology where we have a predictable turnaround time for a design.”
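Castagnetti’s point about density is simple arithmetic: hold total power flat while die area shrinks node over node, and power density climbs. A minimal sketch, where the node-to-area mapping and the 5W budget are made-up numbers for illustration, not foundry data:

```python
# Illustrative only: a fixed 5 W power budget spread over shrinking die area.
# The die areas per node are assumptions, not foundry data.

def power_density(total_power_w: float, area_mm2: float) -> float:
    """Power density in W/mm^2."""
    return total_power_w / area_mm2

die_area_mm2 = {"16nm": 100.0, "10nm": 70.0, "7nm": 50.0}
for node, area in die_area_mm2.items():
    # Total power is flat, but density doubles from 16nm to 7nm.
    print(f"{node}: {power_density(5.0, area):.2f} W/mm^2")
```

Even in this idealized case, the same power budget produces twice the power density two nodes later, which is what stresses the power delivery mesh.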

What works, what doesn’t
There are a number of issues that begin surfacing at advanced nodes. Power, while important, is just one of those issues. There are others involving everything from floorplanning (see The Ultimate Shift Left) to verification, mask writing, lithography, and multiple levels of test.

“The whole semiconductor business model is broken,” said Tobias Bjerregaard, CEO of Teklatech. “You could get more transistors per dollar at each new node until the 28nm node, and this was built into the business model. But you don’t see that anymore. You see a flattening of system costs due to the difficulties of addressing these underlying physical challenges required to harvest the benefits of scaling. That forces designers to work more intelligently with power integrity. It’s easy to do one thing here if it costs you something over there, but how do you find that perfect tradeoff in the design space? It’s becoming more complex. In the early days of IC design, it was all about speed and area, and less about power. If you went to the smallest area, which would give you the lowest power, then it was a two-dimensional tradeoff. But once you go to physical design you start running into all these different things. How do you manage those tradeoffs?”

This is where methodologies begin running into trouble, and it becomes more pronounced at each new process node. While it’s useful to view the design flow from a higher level of abstraction for getting chips to market more quickly, the hardest problems are way down in the details. Those are the ones that cannot be automated.

Melding those two visions has required a series of compromises. While existing tools can discover potential problems deep in the electronic circuitry, they tend to view those in absolute terms of violations. That is reinforced by restrictive design rules from the foundries, and the two sets of rules are interlaced to develop chips that are functional, reliable enough, and which can be manufactured with sufficient yield to make them economically viable.

Given enough time, experienced engineering teams can wring out improvements in both hardware and software that otherwise would be buried in extra margin. The question now is whether that kind of flexibility can, or should, be built into automation tools.

“It’s becoming exponentially more difficult to close the sign-off loop,” said Arvind Shanmugavel, director of applications engineering at ANSYS. “One way to address this is to make sure you’re not using the most pessimistic analysis. But we also have to change how we think about power integrity, from single-physics to multi-physics simulation, where you can see the impact of power integrity on various aspects of design. That impact ranges from timing to electromigration to thermal. The only way to achieve that is with a complete chip-package-system sign-off solution, because there is margin on the package side and margin on the die side. We have to be able to simulate all of these things within the same context. If every timing path has a voltage impact in it, then you have a different situation for each one.”

Shanmugavel said this data varies significantly from one design to the next, and from one application to the next. “As designers, we have to take hundreds of reports and write them and come up with unique solutions for every type of design.”

Silo behavior
A key hurdle to making that happen involves the organizational structure of semiconductor companies. While chip design is changing rapidly—there is more software-hardware co-design, new market requirements for automotive and IoT, and an emphasis on connectivity and moving large quantities of data—many companies developing those chips have invested time and money into flows and proven that they work. As such, they are reluctant to change those flows every time a new market opportunity crops up. In many cases, adopting new tools and techniques and learning how to use them effectively is as much of an issue as adapting existing tools and flows to new problems.

“Today, you have a core design team, a package design team and a chip design team,” said Broadcom’s Castagnetti. “Who’s going to give up their piece? And who’s going to be responsible at the end if something goes wrong? So multi-physics is a good approach for bringing in the big picture, but it’s very difficult to implement from a structural standpoint. In an ASIC environment, the design is often owned by someone else. You have to collaborate at that level. If you’re dealing with noise, you need an understanding of which noise hurts. And with IR drop, not all IR drop hurts.”

Market consolidation, and moves by systems companies to develop their own chips, have solidified some of the traditional silo behavior. But there is only so long that can continue before physics begins reshaping design. At 7nm, the time it takes signals to traverse wires will increase, and at 5nm that will increase yet again, only this time it will be coupled with quantum effects.

Shanmugavel contends that what is required is a totally different way of looking at design. “Once you start thinking about what is the product, rather than what is the chip or the package, then the picture changes.”

This is the route being adopted by companies such as Apple, Google and Amazon, which are developing their own chips according to their own specifications.

“There are two types of things that are siloed,” said Bjerregaard. “One is the chip and the package, and also the analog. Everyone knows that digital noise causes problems for analog. Nobody does anything about it except for guard-banding and deep n-wells. From the multi-physics perspective, the engineering community has been focused for a very long time on fixing specific problems instead of looking at what matters. Nobody really cares about IR drop as long as it’s stable. What does matter is the effect of IR drop, and we can only address that by looking at multi-physics. Timing, electromigration and aging are all affected by IR drop. We have to stop looking at technical problems in isolation and start looking at what matters for the chip. By doing that, it’s possible to filter out the problems that don’t matter.”

In effect, this separates design engineering into problems that do need to be addressed on a worst-case basis, and those that do not.
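A minimal sketch of that filtering idea in Python: rather than flagging every IR-drop violation, keep only the droops whose added delay would actually consume a timing path’s slack. The event fields, the linear delay-sensitivity model, and all of the numbers are illustrative assumptions, not a real sign-off flow.

```python
# Sketch of slack-aware IR-drop filtering. Fields, numbers, and the linear
# delay-sensitivity model are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class IrDropEvent:
    instance: str                 # cell instance with a supply droop
    drop_mv: float                # local IR drop, in millivolts
    path_slack_ps: float          # slack of the worst timing path through it
    delay_sens_ps_per_mv: float   # assumed added delay per mV of droop

def matters(event: IrDropEvent) -> bool:
    """Flag the droop only if its added delay would eat the path's slack."""
    extra_delay_ps = event.drop_mv * event.delay_sens_ps_per_mv
    return extra_delay_ps > event.path_slack_ps

events = [
    IrDropEvent("u_alu",  60.0,  10.0, 0.4),  # 24 ps of delay vs 10 ps slack
    IrDropEvent("u_fifo", 80.0, 120.0, 0.4),  # bigger droop, but ample slack
]
real_problems = [e.instance for e in events if matters(e)]
print(real_problems)  # ['u_alu']
```

In this toy example the larger droop is the one that gets filtered out, because the path through it has slack to spare, which is exactly the “not all IR drop hurts” behavior the quotes describe.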

“It’s changing the behavior of designs rather than fixing everything at the end of the line,” said Bjerregaard. “It’s still important to do that, because naturally you want to implement a chip that works. But the designs themselves are more complex. There are more dimensions to them. The only way to deal with that effectively is to make them correct by construction. That requires good analysis—not necessarily perfect—throughout the flow. At every stage there are so many unknowns that having perfect analysis would require too many changes.”

The cost of overdesign
But how much overdesign is acceptable? There is no single answer to that question. It can vary significantly by application, by process node, and by company. In a 7nm mobile phone SoC, for example, extra margin could require many months of engineering to manage the different power domains, but the cost can be amortized across the entire device. In a 2.5D package, power may be less of a concern than throughput and packaging, and extra margin in the interposer might be acceptable. In comparison, for an IoT edge device, even at 65nm extra margin might add enough cost to make a chip uncompetitive.
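One way to see why margin costs money is a toy cost model: extra margin grows die area, which cuts gross dies per wafer and raises unit cost. The wafer cost, die area, and 10% margin penalty below are all assumptions for illustration, and the model ignores yield, scribe lines, and edge loss.

```python
# Toy cost model (no yield, no edge loss): bigger die -> fewer dies per
# wafer -> higher unit cost. Wafer cost and die areas are assumptions.
import math

def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Crude gross-die estimate from the raw area ratio."""
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2.0) ** 2
    return int(wafer_area_mm2 // die_area_mm2)

def unit_cost_usd(die_area_mm2: float, wafer_cost_usd: float = 9000.0) -> float:
    return wafer_cost_usd / dies_per_wafer(die_area_mm2)

base = unit_cost_usd(80.0)            # baseline die
margined = unit_cost_usd(80.0 * 1.1)  # same die with 10% extra area of margin
print(f"baseline ${base:.2f} vs margined ${margined:.2f} per die")
```

Whether that per-die delta matters depends on the application, which is the point of the paragraph above: amortized across a premium phone it may be noise, while for a cost-driven IoT part it can decide competitiveness.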

“You may not need a certain level of accuracy,” said Broadcom’s Castagnetti. “We don’t have the luxury of that much overdesign these days, but we still do overdesign.”

ANSYS’ Shanmugavel sees a different picture emerging. “If you overdesign to get a product out, you’re leaving money on the table,” he said. “If your die size is bigger than it has to be, or you need one extra metal layer, that costs money. But overdesign is becoming more and more challenging at 10nm and 7nm. Threshold voltage has been pretty constant. If your threshold voltage is constant and you have a rapid decrease in supply voltage, the noise margin has decreased. Any kind of overdesign affects this directly, so you have to be really careful, especially at 10nm and 7nm, not to overdesign.”
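The noise-margin squeeze described here reduces to simple arithmetic: with threshold voltage roughly fixed and supply voltage scaling down, the headroom between the two shrinks, so every millivolt of IR drop from overdesign costs proportionally more. The voltages below are illustrative assumptions, not process data.

```python
# Illustrative numbers only: a roughly constant threshold voltage under a
# supply voltage that scales down with each node.

def headroom_mv(vdd_mv: float, vt_mv: float) -> float:
    """Crude noise-margin proxy: supply minus threshold, in millivolts."""
    return vdd_mv - vt_mv

VT_MV = 350.0  # assumed roughly constant threshold across nodes
for node, vdd in [("16nm", 800.0), ("10nm", 750.0), ("7nm", 700.0)]:
    print(f"{node}: Vdd={vdd:.0f} mV, headroom={headroom_mv(vdd, VT_MV):.0f} mV")
```

In this sketch a 100mV drop in supply removes more than a fifth of the headroom, which is why the same absolute amount of margin becomes harder to afford at each node.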

Which position is correct depends greatly on which market is being served, and by whom.

“There are two schools of thought,” said Bjerregaard. “One is the automated approach. It’s too conservative, but it works. The other is simulation, where you run millions of cycles. In my opinion, neither of them is sustainable in the long run. The too-conservative approach is not a good business model. The brute-force simulation model takes too long. And we know from machine learning and deep learning that we can learn from decisions, and so can machines. And they should be able to determine if IR drop is really a problem.”

Time will tell who is correct.
Related Stories
Have Margins Outlived Their Usefulness?
Why big data techniques are critical to building efficient chips.
Timing Closure Issues Resurface
Adding more features and more power states is making it harder to design chips at 10nm and 7nm.
Routing Signals At 7nm
Challenges of scaling and how to minimize IR drop and timing issues.
Tech Talk: Power Signoff
A look at the impact of margin in advanced designs and how to ensure there is sufficient coverage.
SoC Power Grid Challenges
How efficient is the power delivery network of an SoC, and how much are they overdesigning to avoid a multitude of problems?