Margin is no longer effective, so now the problem has to be solved on the design side.
Variation is becoming a major headache at advanced nodes, and issues that used to be handled in the fab now must be addressed on the design side, as well.
What is fundamentally changing is that margin, which has long been used as a buffer for variation and other manufacturing process-related problems, no longer works in these leading-edge designs for a couple of reasons. First, margin impacts performance and power, which is the whole reason for moving to these advanced nodes in the first place. And second, margin doesn’t always control variation-related problems in these chips. Dielectrics are thinner, resistance and capacitance are higher in thinner wires, dynamic power is denser, and distances between transistors are smaller. So while variation may not have caused a problem at 28nm, shrinking the features in a derivative design can turn a second-order effect into a first-order problem one or two nodes later. Moreover, those problems can shift over time as machine learning is introduced into these designs.
“Traditionally, that progression in engineering is always very simple,” said João Geada, chief technologist at ANSYS. “To begin with, we ignore it until we can’t ignore it any longer. After you can’t ignore it, you margin it. And when margining becomes too painful and too expensive, then you measure it. That’s how everything has evolved. Any of the technologies we use in silicon manufacturing follow that track. So variation became noticeable at 90nm, but that was marginable. It became something that couldn’t easily be margined at 16nm, so it had to start to be measured. From 16nm onward it’s become more and more critical to understand it, deal with it, and take it into account at the system and design level.”
Typically, this is measured in time to yield, but as more advanced-node chips find their way into functional safety and mission-critical designs, that yield takes on another dimension. It’s not just whether a chip works after manufacturing. It’s also about how the chip works throughout its expected lifetime, which further boosts the time spent in simulation, testing, inspection, and post-manufacturing analysis. And it raises the stakes for anyone competing at the most advanced nodes, where the costs are already astronomical.
“Given the cost of producing, say, a 7nm or 5nm or below design, nobody wants to put in a $100 million investment to be in second place,” Geada said. “Ignore the engineering, just look at tooling and printing costs. If you add engineering verification, it’s absurd. You don’t want to do this and be second, so you want to push this as close to the edge as you can without crossing over. Margining isn’t the way of doing that. Margins are an engineering fail-safe that allow you to get there quickly, but they don’t allow you to get that close to the edge because the edge is rough.”
Variation used to be thought of as purely a manufacturing issue. Increasingly, however, it includes an element of hardware-software co-design, where algorithms need to behave within an acceptable distribution in conjunction with the hardware.
“This leads into the discussion about analyzing in bulk, because what the system is doing in any one unit of time interacts with what the physics of the hardware is actually able to keep up with,” Geada said. “It interacts with what the manufacturing is actually able to achieve, and it interacts with how that one particular copy of that chip got put into a package. All of those things have variable components. Engineering isn’t precise. Also, the system yesterday is not the same system today. In many respects, we’re at the beginning of the inflection of that world. As an industry, we have the beginning of the story. I don’t know that we have the full story yet. We’re still building it.”
Modeling variation
What’s missing from that picture, at least from the design perspective, is an understanding of how all of the different pieces in a complex system interact — particularly under various use scenarios — and how variation can impact the entire system operation. That includes multiple constraints, dependencies, and variation across multiple process, voltage and temperature corners.
“Circuit behavior near-threshold has always been highly susceptible to process variations,” said Seena Shankar, senior principal product manager at Cadence. “Non-Gaussian distributions have become the norm in many regular supply designs, but at near-threshold, the skewness of these distributions is extremely high.”
Several years ago, approaches such as advanced on-chip variation (AOCV) and the Liberty Variation Format (LVF) were introduced to enable more accurate variation-aware timing signoff. Those models were not able to capture non-Gaussian behavior, however.
“OCV is a single de-rating factor for all instances, which ended up being grossly optimistic or pessimistic, and therefore insufficient to model variation at advanced nodes,” Shankar said. “Hence, LVF is the industry standard for finFETs and below. It captures slew- and load-dependent sigma for every timing arc, including delay, transition, and constraints. Cadence proposed the introduction of additional Liberty constructs, such as mean shift, standard deviation and skewness, which have been approved by the LTAB (Liberty Technical Advisory Board). These constructs describe the first three moments of the distribution, modeling the non-Gaussian behavior more precisely. To model the highly non-Gaussian behavior, LVF with moments is highly accurate and correlates precisely with Monte Carlo results, both for advanced nodes and for mature nodes supporting near-threshold voltage domains targeting IoT applications.”
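As a rough illustration of what those moment constructs encode, the following sketch (illustrative Python, not any vendor’s characterization flow) extracts the mean shift, standard deviation, and skewness of a single timing arc’s delay distribution from Monte Carlo samples:

    # Hypothetical sketch: the first three moments that LVF "moments" constructs
    # capture (mean shift, standard deviation, skewness), computed from
    # Monte Carlo delay samples for one timing-arc table entry.
    import numpy as np

    def lvf_moments(mc_delays, nominal_delay):
        """Return (mean_shift, std_dev, skewness) for one table entry."""
        samples = np.asarray(mc_delays, dtype=float)
        mean_shift = samples.mean() - nominal_delay   # shift of MC mean vs. nominal delay
        std_dev = samples.std(ddof=1)                 # second moment (sigma)
        centered = samples - samples.mean()
        skewness = (centered**3).mean() / std_dev**3  # third moment, unitless
        return mean_shift, std_dev, skewness

    # Example: a right-skewed (non-Gaussian) delay distribution, as often seen near-threshold.
    rng = np.random.default_rng(0)
    mc_samples = 0.50 + rng.lognormal(mean=-3.0, sigma=0.8, size=5000)   # delays in ns
    print(lvf_moments(mc_samples, nominal_delay=0.55))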
There are alternative views on this. “While improved but ever-more-complex modeling of variation is available and being further developed, it remains a difficult area, especially for advanced finFET nodes with increased process variation across large die sizes and increasing power per unit of silicon area at ever-lower core VDD voltages,” said Richard McPartland, technical marketing manager at Moortec. “One challenge is that large, multi-core processor chips have software-driven workloads. But worst-case workloads can be difficult to predict, especially if the software is written later by another team.”
Moortec advocates embedding a fabric of in-chip sensors, which provide visibility of conditions at critical circuits across what are often very large die. Those sensors track hotspots, voltage droops and process variations. “This will enable SoC teams to validate their designs, especially in the bring-up phase of new silicon,” said McPartland. “For example, how much process variation do I actually see across my large advanced-node finFET die? How much voltage droop occurs at critical circuits? Where are the hotspots and how are they behaving under load? Improved variation models are essential to keep up with the increased process variation and other challenges in advanced nodes, but in-chip monitoring will give you insights into how effective that really was. In addition, it enables optimization of performance and/or power.”
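As a simple illustration of how such monitor data might be consumed, the following sketch flags voltage-droop and hotspot events from one sweep of per-sensor readings. The sensor names, thresholds, and data layout are invented for this example and do not represent Moortec’s actual interfaces:

    # Hypothetical post-processing of readings from an in-chip monitor fabric.
    from dataclasses import dataclass

    @dataclass
    class SensorSample:
        sensor_id: str      # e.g. "core3_pvt" (name invented for this sketch)
        temp_c: float       # locally sensed junction temperature
        vdd_mv: float       # locally sensed supply voltage

    def flag_events(samples, nominal_vdd_mv=750.0, droop_limit_pct=5.0, hot_limit_c=105.0):
        """Flag voltage-droop and hotspot events from one telemetry sweep."""
        events = []
        for s in samples:
            droop_pct = 100.0 * (nominal_vdd_mv - s.vdd_mv) / nominal_vdd_mv
            if droop_pct > droop_limit_pct:
                events.append((s.sensor_id, f"voltage droop {droop_pct:.1f}%"))
            if s.temp_c > hot_limit_c:
                events.append((s.sensor_id, f"hotspot {s.temp_c:.0f}C"))
        return events

    print(flag_events([SensorSample("core3_pvt", 108.2, 702.0),
                       SensorSample("noc_pvt", 84.5, 742.0)]))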
Still, for 22nm and smaller process nodes, LVF is today’s leading standard format for encapsulating variation information for standard cells and custom macros.
“If you have an advanced process node library, chances are that your variation modeling for delays and constraints is described in an LVF .lib,” said Wei-Lii Tan, product manager for AMS Verification at Mentor, a Siemens Business.
But creating LVF .libs is significantly more challenging than nominal .libs, Tan explained. Instead of a single nominal value for each table entry, LVF tables store the early and late statistical variation (sigma) values of each measurement, which require Monte Carlo-equivalent simulation to generate. This results in a significant runtime impact during the characterization process. With the introduction of Moments, LVF now contains additional attributes such as standard deviation, as well as non-Gaussian attributes such as skewness and mean shift. This enables accurate modeling of the statistical distribution of each measured value, at an additional runtime cost.
Fig. 1: LVF .libs with Moments contain the standard deviation values for each measured entry. Source: Mentor, a Siemens Business
Fig. 2: LVF .libs with Moments also describe non-Gaussian statistical values (e.g. skewness) for each measured entry. Source: Mentor, a Siemens Business
As the figures above show, LVF variation models contain significantly more information than nominal-value timing models alone. LVF models require Monte Carlo analysis to produce, resulting in a lengthier characterization process.
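A rough back-of-envelope sketch shows why. Using purely illustrative library sizes and sampling counts (none of these figures come from the vendors quoted here), brute-force statistical characterization multiplies every nominal table entry by the number of Monte Carlo samples:

    # Illustrative arithmetic: why LVF characterization costs so much more than
    # nominal characterization. All counts below are assumptions for this sketch.
    cells          = 1500      # standard cells in the library
    arcs_per_cell  = 8         # timing arcs (delay, transition, constraints)
    slew_points    = 7         # input-slew indices per table
    load_points    = 7         # output-load indices per table
    pvt_corners    = 12        # characterization corners
    mc_runs        = 500       # Monte Carlo-equivalent samples per table entry

    nominal_sims = cells * arcs_per_cell * slew_points * load_points * pvt_corners
    lvf_sims     = nominal_sims * mc_runs   # brute force, before any reduction tricks

    print(f"nominal: {nominal_sims:,} SPICE runs")       # ~7.1 million
    print(f"brute-force LVF: {lvf_sims:,} SPICE runs")   # ~3.5 billion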
“To make LVF characterization feasible, characterization tools use various techniques such as netlist reduction and sensitivity-based approximations. However, these approximations may introduce inaccuracies to the resulting .libs. These inaccuracies lead to incorrect static timing analysis (STA) results, potentially causing silicon failure,” Tan said.
Fig. 3: A common example of an LVF issue found in characterization results — inaccuracies in long tail values. This leads to timing differences and potential silicon failure. Source: Mentor, a Siemens Business
Many of today’s variation modeling flows lack a reliable method to validate variation data in .libs, resulting in faulty or noisy LVF values that may sway timing results by 50% to 100% outside of production-accurate ranges.
“A key step in effective variation modeling for standard cells and custom macros for advanced process nodes is a highly reliable validation methodology for the variation models,” Tan said. “The verification methodology should have broad coverage to account for contributors to variation effects across all process, voltage and temperature (PVT) corners, and also be able to provide full Monte Carlo-equivalent verification for any problem areas.”
This, in turn, opens the door for machine learning to analyze a library’s full variation model data set across all PVTs in order to identify outliers and potential issues, which then can be followed up by drilling down into potential problem areas and running full Monte Carlo-equivalent verification on those data points. The goal here is accuracy equivalent to running brute-force Monte Carlo, but with many fewer simulations.
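As a highly simplified stand-in for that idea, the following sketch flags suspicious sigma entries with a robust z-score. A production flow would apply a trained machine-learning model across the full library and all PVT corners and then re-verify flagged entries with full Monte Carlo; the data and threshold here are illustrative:

    # Minimal stand-in for ML-based outlier screening of LVF sigma data.
    import numpy as np

    def flag_outliers(sigma_table, threshold=4.0):
        """Return indices of sigma values that deviate strongly from their neighbors."""
        x = np.asarray(sigma_table, dtype=float)
        med = np.median(x)
        mad = np.median(np.abs(x - med)) or 1e-12     # robust spread estimate
        robust_z = 0.6745 * (x - med) / mad
        return np.flatnonzero(np.abs(robust_z) > threshold)

    sigmas = np.array([4.1, 4.3, 4.2, 4.4, 9.8, 4.2, 4.5])   # ps; one noisy entry
    for idx in flag_outliers(sigmas):
        print(f"entry {idx}: sigma={sigmas[idx]} ps -> re-run Monte Carlo verification")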
And while it would be nice to think that the manufacturing piece of variation within process nodes is settling down, that’s not exactly the case.
“It can’t settle down,” said Geada. “We’re dealing with atomic tolerances and quantum systems. It has gotten better in certain respects, and particularly with e-beam technology, there are certain things that can be more reliably manufactured. But we hit the limits of physics a while back.”
Conclusion
As technology continues to advance, new solutions are needed to support it.
“More than the manufacturing side, what we’ve learned is how to design this and that,” said Geada. “We’ve adopted standards such as LVF, so the timing analysis understands how variability is impacted by individual transistors. But what hasn’t happened yet, for example, and is starting to become a hot topic, is dealing with metal variation. The metallization itself has variables. It has very different properties than the individual transistors. Things that we’re looking at with the voltage are variable. The power supply isn’t clean, temperature isn’t clean. And particularly when you start talking about full 3D and chiplet domains, all of a sudden a number of assumptions that existing tools have always had, such as the temperature being uniform, are no longer true. You’re going to have explicit thermal gradients across the chip that are persistent and movable.”
These issues can be challenging even for a vertically integrated manufacturer, because they change the perspective on the system. “It’s something that can be understood,” he said. “But what if I take chiplets that were manufactured in isolation, put them together in this system, and somebody puts something that’s hot on top of it, but only half of it is hot? Uniform temperatures are easy to deal with. Temperature gradients are not.”
This is also where things start to get really funky with the algorithms, because as the algorithms begin to adapt and optimize, the gradient will change. This harkens back to the first Intel Itanium processor, Geada recalled, which could never keep all of its cores active. “It actually had to have one cold core that rotated, because any one core that was active generated too much heat, and you couldn’t have too much of a thermal gradient because that would crack the package. So it had to be rotated to keep the chip evenly heated.”
That throttled performance simultaneously, because there’s always a performance tradeoff, he said. “Now, with these heterogeneous environments, you have that problem but in spades. The Itanium was a 2D problem. It was only the edges that potentially had a gradient. When you’re going into a full 3D environment, with HBM stacks and SerDes, it’s challenging and it’s something that you can no longer safely margin.”
Comments

Modeling variation is an old problem. It uses standard languages like Verilog-AMS.
You also need to move to asynchronous logic when working in high-variability silicon. There are not too many tools around for that.
Good point, Cameron. So what are the options for this today? Everything custom and hand-written?
That makes less sense without the links –
patents.google.com/patent/US8478576B1
http://www.linkedin.com/company/29351052
High-variability silicon requires an asynchronous logic approach that only one company has managed to make work: ETA Compute.
Asynchronous logic has a lot of disadvantages. The biggest of them: 1. the async approach is almost incompatible with STA-based tools (99.9% of them), 2. async is very vulnerable to crosstalk and noise, and 3. in some cases (so-called dual-rail async logic) it can have very high power consumption compared to traditional synchronous design.
Over the past 20 years a few companies declared they offered an async solution, but they later disappeared without any remarkable achievements. There are a few academic institutions around the world that study the async approach to design. The most significant achievements of their work were for A4A (asynchronous for analog) applications, a few secure chips, and harsh-environment and energy-harvesting applications. But no one there talks about async development for consumer markets.
On the other hand, recent years have revealed new needs for NTV (near-threshold voltage) design: neural networks and low-speed NoCs. These applications have already shown renewed interest in async design. But apart from these applications, there is no other room for async design in commercial electronics, I am afraid. Just wishful thinking.
Concerning the subject: the situation really is as serious as described. Non-Gaussian distributions often lead to a big difference between SSTA results and SPICE-level simulations (with Monte Carlo), especially near threshold. That is why Synopsys even uses machine learning approaches to achieve better correlation with SPICE-level MC simulation.
Thanks for the article!