Battling Over Shrinking Physical Margin In Chips

Increasing density and complexity makes it imperative to capture and integrate more data, from design through manufacturing and into the field.


Smaller process nodes, coupled with a continual quest to add more features into designs, are forcing chipmakers and systems companies to choose which design and manufacturing groups have access to a shrinking pool of technology margin.

In the past, margin largely was split between the foundries, which imposed highly restrictive design rules (RDRs) to compensate for uncertainties in new process technologies, and design teams, which built extra circuitry into their designs to ensure reliability. The RDRs added margin into a variety of processes in the fab, allowing the fabs to buffer everything from misshapen features to process variation — which is always more problematic with new processes than mature ones. And for design teams, that extra circuitry provided a fail-over in case something went wrong in the field.

But starting with the finFET nodes, just adding margin into a design was no longer an option. Increased transistor density and thinner wires reached a point where the total system margin — the sum of what foundries and design teams collectively built into chips — began to impact performance and power. In simple terms, it takes more energy to drive a signal longer distances over thin wires and through extra circuitry, and doing so degrades performance. Consequently, foundries began working much more closely with EDA companies to reduce guard-banding through better tooling, increasingly through the application of AI/ML and much more detailed simulation, as well as tighter integration of those tools with new process technology.
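A rough back-of-the-envelope model shows why stacked margin costs both power and speed: switching energy grows roughly with C·V², and distributed-wire delay with RC, so longer routes through guard-band circuitry and a padded supply voltage hurt on both axes. The sketch below uses illustrative per-micron wire values that are assumptions for this example, not foundry data.

```python
# Rough model of why stacked margin costs power and performance: switching
# energy scales with C*V^2 and distributed-wire delay with R*C. All numbers
# are illustrative assumptions, not foundry data.

C_PER_UM = 0.2e-15   # assumed wire capacitance, farads per micron
R_PER_UM = 2.0       # assumed wire resistance, ohms per micron

def switching_energy(length_um: float, vdd: float) -> float:
    """Energy per transition (joules) for a wire of the given length."""
    return C_PER_UM * length_um * vdd ** 2

def wire_delay(length_um: float) -> float:
    """First-order (Elmore) delay estimate for a distributed wire, in seconds."""
    return 0.5 * (R_PER_UM * length_um) * (C_PER_UM * length_um)

base   = (switching_energy(100, 0.70), wire_delay(100))
padded = (switching_energy(130, 0.75), wire_delay(130))  # +30% routing, +50 mV guard band

print(f"energy penalty: {padded[0] / base[0] - 1:.0%}")  # ~49% more energy per transition
print(f"delay penalty:  {padded[1] / base[1] - 1:.0%}")  # ~69% more wire delay
```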

The result has been a scramble among different groups to lobby for whatever margin is available in the design-through-manufacturing flow. That margin serves as a hedge against uncertainties in heterogeneous integration, as well as a buffer for various types of noise and physical effects due to increased transistor density. It also has changed the insertion points for test, metrology, and inspection, particularly for safety- and mission-critical designs, and it has extended testing beyond manufacturing and into the field, where margin can be used to reroute signals when data paths degrade due to aging or latent defects. In some cases, it also is prompting chipmakers to choose between technologies that are well proven in silicon or more resilient due to their inherent redundancy, and the latest, most advanced technologies.

“People are looking for variability-tolerant designs to insulate themselves from margin issues,” said John Kibarian, president and CEO of PDF Solutions. “Certain architectures lend themselves to that. So any array-like or intrinsically parallel elements — bitcoin mining chips, or GPUs, or TensorFlow chips, or any other IPUs (intelligent processing units) — tend to be variability-tolerant relative to a CPU or a single processing element. Those have taken up a majority of the workloads, and workloads are now shifting to things that are intrinsically more variability-aware. That insulates you from the variability in the fabs. But the fabs with the lowest variability still accumulate the most market share, because you’re still better off with less variable technology that will result in products that are less variable, and you will get paid for that.”

Less margin also puts a premium on improving existing manufacturing processes, and one of the key efforts there is integrating data from one or more steps together with other steps in the fab.

“Data integration is a key part of this,” said Jon Herlocker, president and CEO of Tignis. “There are a lot of data silos inside of fabs, particularly between the front-end and back-end, because a lot of the reliability and testing happens on the back-end, and a lot of times that data silo is not connected to the front-end data silo. Another interesting problem we’re seeing on the data silo side is that advanced packaging is becoming a very big deal. The kind of technology and data infrastructure that existed on the packaging side was low-tech compared to the front-end side — but that same group that had this low-tech infrastructure is starting to do some high-tech stuff. So now they’re asking themselves, ‘Do we take our back-end technology and level it up to the point where it can handle the complexity that we now have?’”
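As a minimal sketch of the kind of front-end/back-end integration Herlocker describes, the example below joins hypothetical inline metrology records with packaged-part test results on lot and wafer IDs. The table names, columns, and keys are assumptions for illustration; real fabs work with their own MES/YMS schemas.

```python
# A minimal sketch of connecting a front-end and a back-end data silo.
# All table names, columns, and values are hypothetical.
import pandas as pd

# Front-end silo: inline metrology per wafer (hypothetical columns)
frontend = pd.DataFrame({
    "lot_id":     ["L001", "L001", "L002"],
    "wafer_id":   [1, 2, 1],
    "cd_nm":      [14.2, 14.6, 13.9],   # critical dimension measurement
    "overlay_nm": [2.1, 2.8, 1.9],
})

# Back-end silo: packaged-part test results (hypothetical columns)
backend = pd.DataFrame({
    "lot_id":       ["L001", "L001", "L002"],
    "wafer_id":     [1, 2, 1],
    "burn_in_fail": [0, 1, 0],
    "fmax_ghz":     [3.1, 2.9, 3.2],
})

# Joining on lot/wafer lets reliability fallout be traced back to
# front-end process signatures instead of living in a separate silo.
merged = frontend.merge(backend, on=["lot_id", "wafer_id"], how="inner")
print(merged[["lot_id", "wafer_id", "overlay_nm", "burn_in_fail", "fmax_ghz"]])
```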

Every process in chip design and manufacturing needs to be tightened up to compensate for shrinking margin. That includes the obvious areas in manufacturing and test, metrology and inspection.

“If you look at copper-clad laminate, which is the current state of the art for advanced packaging fan-out, you may have up to 20 layers of RDL,” said Keith Best, director of product marketing for lithography at Onto Innovation. “You have to make sure the registration of those is accurate. But then, of course, people always are trying to get better [metrology and inspection] resolution performance. As resolution gets tighter, the overlay gets tighter, and then you’re worried about whether your substrate is stable. With copper-clad laminates, as you cure these layers, you can change the shape of the substrate. And as it changes over the many layers, the opening gets harder and harder to meet, and you end up with a yield loss.”
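A back-of-the-envelope model hints at why 20 RDL layers strain the overlay budget: if per-layer registration error and cure-induced substrate drift are treated as independent and accumulated in quadrature, stack-level overlay grows with the square root of the layer count. Both the root-sum-square assumption and the numbers below are illustrative only, not Onto Innovation's model.

```python
# A rough look at how overlay error can accumulate across an RDL stack.
# The RSS accumulation and the numbers are assumptions for illustration.
import math

PER_LAYER_OVERLAY_NM = 500.0   # assumed 3-sigma registration error per layer
SUBSTRATE_DRIFT_NM   = 300.0   # assumed distortion contributed by each cure step

def stack_overlay(layers: int) -> float:
    """3-sigma overlay of the full stack, assuming independent errors (RSS)."""
    per_layer = math.hypot(PER_LAYER_OVERLAY_NM, SUBSTRATE_DRIFT_NM)
    return per_layer * math.sqrt(layers)

for n in (4, 10, 20):
    print(f"{n:2d} layers -> ~{stack_overlay(n) / 1000:.1f} um stack overlay")
```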

This has created an opportunity for new materials used in manufacturing, including glass and different sacrificial and permanent bonding materials. But margin is needed there, as well, due to gaps in understanding exactly how those materials will behave when combined with other processes.

“Where we need help is figuring out how exactly our materials behave in a customer process,” said Rama Puligadda, CTO at Brewer Science. “If we had access to the processing conditions, we could simulate how our materials are going to behave or perform through those processes. This will help us predict failures and shorten the feedback loops.”

Making matters worse, the materials used today — like many of the manufacturing processes — are very different from those used five years ago.

“The materials used today in packaging are subject to higher standards of performance, stability, quality, environmental compatibility, and cleanliness,” Puligadda said. “Moving forward, PFAS- and PFOS-free materials will be required, and higher levels of cleanliness will be needed to support processes such as hybrid bonding. Packaging materials will see a shift towards front-end-level quality requirements.”

Better design tools, but more siloed data
On the design side, doling out margin has always been a challenge, but it is becoming more difficult in heterogeneous designs aimed at specific domains. That heterogeneity allows chipmakers to experiment with different options, and to enable engineering change orders for competitive reasons. But margins are now so thin that much more work needs to be done up front, which is why design technology co-optimization and system technology co-optimization are getting so much attention these days. Decisions need to be made earlier in the process, because physical margins are impacting everything from stochastics to atomic-layer processes.

“There’s been a lot of margin stacked on margin, which has been stacked on more margin for a long time,” said Simon Segars, board member at multiple companies and former CEO of Arm. “Some of the application of ML in designs was an opportunity to optimize across greater boundaries, squeeze out some of that margin, and understand failure mechanisms in a slightly different way.”

This has set up a point of contention, because while design teams always would like more margin, there is a physics-related penalty. At least at the leading edge of design, less margin equates to better performance and power, but it also requires a rethinking of various processes and approaches. Margin needs to be considered in the context of a whole system, not just individual blocks or processes.

“Everybody wants to reduce the margins,” said Mo Faisal, president and CEO of Movellus. “When you look at processors at 300 watts and above, you literally cannot find a package. Maybe you only have to reduce it by a few watts and it goes from impossible to possible. The way to do that is by reducing the margin. ‘Where did I over-margin?’ Every piece of over-margining increases Vmin, which pushes up the voltage, and power scales with V². So it all feeds back in. V is related to the timing, so there’s a push to squeeze every possible bit of margin out, and that all comes down to timing. But it requires a system view rather than just looking at a single block.”
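A quick calculation shows how directly Vmin margin converts into watts under the rough rule that dynamic power scales with V² (or V³ if frequency tracks voltage). The reference power and voltages below are assumptions chosen only to mirror the 300-watt class Faisal mentions.

```python
# A minimal sketch of margin-to-watts conversion: dynamic power scales
# roughly with V^2 (V^3 if frequency tracks voltage). All figures are
# illustrative assumptions.

def dynamic_power(p_ref: float, v_ref: float, v_new: float, freq_tracks_v: bool = False) -> float:
    """Scale a reference power figure to a new supply voltage."""
    exponent = 3 if freq_tracks_v else 2   # P ~ C * V^2 * f, with f roughly ~ V
    return p_ref * (v_new / v_ref) ** exponent

P_REF, V_REF = 300.0, 0.75   # a 300 W processor at 0.75 V nominal (assumed)
for margin_mv in (10, 25, 50):
    v = V_REF - margin_mv / 1000.0
    saved = P_REF - dynamic_power(P_REF, V_REF, v)
    print(f"recover {margin_mv:2d} mV of Vmin margin -> ~{saved:.0f} W saved")
```

Even 10 mV of recovered margin is worth several watts at this scale, which is exactly the difference Faisal describes between a package that works and one that does not.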

It also requires some visibility into how that margin impacts overall performance and power in real use cases.

“To make sure you have correct performance guard-bands when the chip is running in the field, and accurate binning, time-zero ATPG and memory BIST-based decisions are no longer enough,” said Alex Burlak, vice president of test and analytics at proteanTecs. “You need to monitor the actual timing margins, with agents that are connected to the actual logic paths within the design, not just a pass/fail test. In addition, customers have gained up to 14% power reduction by using margin agents while the chip was operating in mission. This is achieved by measuring the actual timing margin of the endpoint flip-flops under monitor. If it’s greater than the defined threshold, that means you can safely lower the voltage and still meet the performance. Without that visibility, there is no way to know how far away from failure you are in mission. The visibility into the timing margin of actual logic paths is also critical to prevent silent data corruption during operation by detecting close-to-failure timing margins, which may happen for different reasons, essentially extending the useful life of the chip where applicable.”


Fig. 1: Tracking the impact of margin in a design. Source: proteanTecs
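The control-loop idea behind such margin agents can be sketched in a few lines: read the slack reported by on-die path monitors and only step the supply down while the worst monitored path still clears a guard threshold. The agent interface, thresholds, and step sizes below are hypothetical and are not proteanTecs' actual API.

```python
# A minimal sketch of margin-driven voltage adjustment. The telemetry source,
# threshold, and step sizes are hypothetical assumptions.

MARGIN_THRESHOLD_PS = 40.0   # assumed minimum acceptable slack at monitored flops
VDD_STEP_MV = 5
VDD_FLOOR_MV = 650

def read_agent_margins_ps() -> list[float]:
    """Placeholder for reading path-margin agents; returns slack in picoseconds."""
    return [62.0, 55.5, 48.0, 71.2]   # hypothetical telemetry

def adjust_vdd(current_vdd_mv: int) -> int:
    """Lower Vdd one step only if every monitored path keeps enough slack."""
    worst_slack = min(read_agent_margins_ps())
    if worst_slack > MARGIN_THRESHOLD_PS and current_vdd_mv - VDD_STEP_MV >= VDD_FLOOR_MV:
        return current_vdd_mv - VDD_STEP_MV   # still safe: reclaim margin as power
    if worst_slack < MARGIN_THRESHOLD_PS:
        return current_vdd_mv + VDD_STEP_MV   # too close to failure: back off
    return current_vdd_mv

print(adjust_vdd(750))   # -> 745 with the hypothetical readings above
```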

The challenge becomes even more complicated with 3D-ICs. “That is the scary part, and why people are hesitating,” said Shekhar Kapoor, senior director of marketing for digital design at Synopsys. “The methodology and the tools are there, and we can actually help you partition the design today. We can tell you, purely from a connectivity point of view, what’s the best partition. You can put all the macros in one die, you can have the logic over here, and then you can have a memory there, and you likely will meet your big-ticket performance goals. But is this the most optimal approach? Have you looked at all these other things that come into the picture? What have you done with the thermal part of it? You’ve got a thermal margin and a power margin, and you have to add those together. We used to have 20 different corners. Now we have something like 200 timing corners for a typical monolithic design. So you’ve got to get into all these combinations for nominal worst case, and all of these things have a huge multiplicative factor. And that’s just for timing. You also have thermal issues, aging, power. How do you extend your timing sign-off, not just point-to-point, flop-to-flop, but also to take into account the effects of power and thermal? If you can do that right, then at least you’re handling the margin in one place.”


Fig. 2: Optimizing Vmin with path margin monitors. Source: Synopsys
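The corner explosion Kapoor describes is easy to reproduce on paper: every new sign-off axis multiplies the count. The sketch below starts from a classic process/voltage/temperature grid and adds assumed aging, thermal, and die-pairing axes purely for illustration.

```python
# A rough illustration of how sign-off corners multiply when new axes are
# added to the classic PVT grid. The axis values are illustrative assumptions.
from itertools import product

process = ["ss", "tt", "ff"]
voltage = ["vmin", "vnom", "vmax"]
temp    = ["-40C", "25C", "125C"]
classic = list(product(process, voltage, temp))
print(f"classic PVT corners: {len(classic)}")              # 27

aging   = ["fresh", "end-of-life"]
thermal = ["uniform", "hotspot"]
pairing = ["die0_fast_die1_slow", "die0_slow_die1_fast"]   # assumed 3D-IC pairing axis
stacked = list(product(process, voltage, temp, aging, thermal, pairing))
print(f"with aging/thermal/3D-IC axes: {len(stacked)}")    # 216
```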

Segars agreed. “You can worry about margin from ‘this block’ in your design or ‘this piece of IP.’ And with stacked dies or multiple dies on a different substrate, particularly if they’re coming from different foundries, everyone’s going to build in a margin of safety. But if you keep doing that, eventually you have no performance at all. That may lead to different ways of characterizing the building blocks.”

This also increases the need for power integrity analysis, which generally was deemed unimportant a decade ago. “Now it’s a first-level sign-off tool because voltage margins have gotten so thin,” said Marc Swinnen, director of marketing at Ansys. “The best way to reduce power is to reduce voltage, so there are ultra-low-voltage processes. But that means you have the side effect of no margin for voltage drop. You’ve pushed the voltage down so low that you really can’t afford to lose any on the path, so designs become very, very sensitive to voltage drop, and EM/IR becomes a first-level sign-off check. If you increase the voltage drop margin, your maximum frequency goes down because now you have to design for a lower voltage. So not only don’t you have much margin, but any margin you create goes straight to your bottom line on performance. That means you really don’t want to put that margin in there unless you absolutely have to. Despite that, people have been seeing chips come out with about a 10% lower Fmax than they originally simulated, and they can’t quite get the frequency they’re supposed to get. The most common reason is dynamic voltage drop. There are escapes in the voltage drop analysis, switching scenarios the tools don’t see, that in real chips cause local voltage drop that impacts the timing. They’re seeing a mysterious 10% drop in frequency due to voltage drop situations they didn’t anticipate, and dynamic voltage drop has become completely dominant over the good old static voltage drop. The challenge is to identify which switching combinations are realistic, which will cause the worst voltage drop, and how to mitigate those. But the idea of blanket margins across the chip to counter that is a non-starter. It has become a very difficult problem, and you need much smarter techniques to identify the realistic switching.”
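A first-order model makes the connection between unbudgeted dynamic droop and lost Fmax concrete: under an alpha-power delay law, a local dip in supply voltage stretches gate delay and caps the achievable clock. The threshold voltage, exponent, and droop values below are assumptions for illustration only.

```python
# A first-order sketch of how dynamic IR drop turns into lost Fmax, using the
# alpha-power delay model: delay ~ V / (V - Vth)^alpha. Model constants and
# droop values are illustrative assumptions.

VTH = 0.30     # assumed threshold voltage, volts
ALPHA = 1.3    # assumed velocity-saturation exponent

def gate_delay(vdd: float) -> float:
    """Relative gate delay under the alpha-power law."""
    return vdd / (vdd - VTH) ** ALPHA

def fmax_loss(vnom: float, droop_mv: float) -> float:
    """Fractional Fmax loss when a dynamic droop hits the critical path."""
    return 1.0 - gate_delay(vnom) / gate_delay(vnom - droop_mv / 1000.0)

for droop in (25, 50, 75):
    print(f"{droop} mV local droop at 0.70 V nominal -> ~{fmax_loss(0.70, droop):.0%} Fmax loss")
```

With these assumed constants, roughly 50 mV of unanticipated local droop is enough to account for the mysterious 10% frequency shortfall Swinnen describes.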

In addition, margin may determine which process — or in the case of advanced packaging, which processes — works best for a particular design, given that guard-banding is no longer an option. “Advanced nodes are not mature,” said Movellus’ Faisal. “There’s more variation and more resistance in the wires, and you pay for it by cranking up the voltage. You can go down to 0.6 volts for gates, but you have to stick around 0.75 even for 3 nanometers. That’s all going to margin.”

Conclusion
How margin is doled out, and to which groups, is becoming a significant challenge. It’s no longer confined to one process or part of a flow. Instead, margin needs to be considered in the context of a system, and sometimes even a system of systems, and it needs to be viewed as a total number that spans multiple groups.

The goal is improved reliability, and margin can affect choices for processing elements, memories, chip architectures, and ultimately the integrity of signals and the resiliency of systems. It is at the core of every device, even if it is not always obvious to different parts of the design-through-manufacturing chain. The chip industry today is grappling with the impact of less margin and how to compensate for the loss of a valuable shortcut.


