Variability Becoming More Problematic, More Diverse

Increased density, heterogeneous designs, and longer lifetimes make it critical to reduce variation early in the design process.


Process variability is becoming more problematic as transistor density increases, both in planar chips and in heterogeneous advanced packages.

On the basis of sheer numbers, there are many more things that can go wrong. “If you have a chip with 50 billion transistors, then there are 50 places where a one-in-a-billion event can happen,” said Rob Aitken, a Synopsys fellow.

And if Intel’s recent projection of a trillion transistors by 2030 comes to pass, process variability will increase exponentially. But variability also is becoming more complicated and difficult to assess because the sources of that variability are increasing, and in some cases chips are expected to last longer than in the past. Moreover, not all of the components in an advanced package are developed at the same node, and in some cases they are not even manufactured by the same foundry.

“With SoCs today, you can have a situation where there could be different dies on a single chip and they could be in different process technologies completely,” said Pradeep Thiagarajan, a principal product manager at Siemens Digital Industries Software. “That introduces a whole new level of variation across processes.”

Fig. 1: Planning for variation based on density of cells. Source: Siemens EDA

Variability issues can be complex and subtle, and they can be additive. So while any one issue by itself may be problematic, at advanced nodes and in advanced packages they can compound. This is evident with silent data errors, for example, which stem from manufacturing defects but show up only occasionally, and then only after a series of specific operations done in a certain order.

“Consider DVFS and binning,” said Aitken. “Historically, the vast majority of devices live in the center of a distribution somewhere, and something really bad has to happen before they’re going to fail. You also have a smaller number of devices that are very marginal. Anything will tip them over. If you’re doing DVFS, or something like it, your goal is to push as many chips as possible into this marginal place. Then you have a larger population of things where something can change and break it. For example, say there’s a resistive defect that slows the device down a bit. When you test for it, you won’t even know it’s there. But when you run the device for a while, the resistance might change due to any number of effects, and suddenly it fails.”
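To make that population argument concrete, here is a minimal, purely illustrative Python sketch with invented numbers (not any real product's data): a generous timing margin absorbs post-test drift almost entirely, while DVFS-style binning that leaves each part only a sliver of slack lets the same drift tip a visible fraction of the population into failure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative path-delay distribution for a population of parts (ns).
N = 1_000_000
delay = rng.normal(loc=1.00, scale=0.03, size=N)
limit = 1.20                                   # timing limit at nominal voltage

# DVFS-style binning deliberately pushes parts toward the marginal region,
# leaving each one only a small leftover slack (ns).
margin_after_dvfs = rng.uniform(0.005, 0.03, size=N)

# Field drift: e.g. a resistive defect whose delay contribution grows after test.
drift = rng.exponential(scale=0.01, size=N)

fails_generous_margin = np.mean(delay + drift > limit)
fails_binned_to_edge = np.mean(drift > margin_after_dvfs)

print(f"failure rate with generous margin : {fails_generous_margin:.2e}")
print(f"failure rate when binned to edge  : {fails_binned_to_edge:.2%}")
```

The absolute numbers mean nothing; the point is how quickly a small, untestable shift matters once large parts of the population sit near their limits.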

One traditional strategy to try and avoid this problem is simply to build in margins that account for random errors. However, adding in extra circuitry can impact performance at advanced nodes, and it can reduce energy efficiency because signals need to be driven further through increasingly thin wires. That also increases heat, because of the resistance in those wires, which can further impact performance.

“The correct way to address that is to make sure the validation is more accurate so that the required over-design is minimized,” said WeiLii Tan, principal project manager at Siemens EDA.

Engineers need to build models and monitor effects. But even that won’t be enough. To maximize reliability, companies must incorporate on-chip monitoring. “We’ve gone from a ‘giant margin/don’t need to care’ situation, to smaller margins that are calibrated very carefully, to observing issues in real-time while a system is running,” said Aitken.

The on-chip monitoring market has grown accordingly. Once purpose-built by foundries and OSATs for specific use cases, on-chip monitors are now offered by proteanTecs, Synopsys (through its 2020 acquisition of Moortec), and Siemens (through its 2020 acquisition of UltraSoC), among others.

Collaboration
All of this is leading to increasingly tight collaboration between foundries and EDA houses. “The foundry owns the characterization defining the behavior of the transistor through a model,” said Jayacharan Madiraju, product management director at Cadence. “It’s up to the simulator to implement the model, and how each model behaves is customized for that process node through model parameters. Every model has parameters, and the foundry gives you a ‘model card’ that lays them out.”

Additional parameters are included to deal with issues such as stress, which due to variation can transform latent defects into real ones. These parameters may be added directly to the model card or go into a separate model library, such as an aging library.

“The critical relationship between the EDA provider and the foundry is the SPICE model, because a lot of those factors of reliability and variability are modeled directly by the foundry in an industry-specific SPICE model language,” said Brandon Bautz, senior group director, product management at Cadence. “That feeds into simulation and verification of all types of analog and mixed signal componentry. On the digital side, we use that same simulation capability. In fact, we’ve embedded fast SPICE simulation technology into a lot of our tools to do characterization, which levels up that variability information into another industry specific format, Liberty, which is widely used in the digital realm for sign-off. There is a whole flow and format hierarchy to how variability, and to a lesser extent reliability, are modeled and then characterized and then ultimately deployed in the digital sign-off realm.”

Physics can’t be ignored, but designers play a significant role in overall reliability and variability reduction.

“It all comes down to building a robust test case, and that test case has to grow,” Siemens’ Thiagarajan said. “You’re going to have bigger test cases that cover different physically extracted areas, across locations, and you have to simulate those. You’re talking about larger element capacities that need to be accounted for, along with a very smart, variation-aware simulation system. You need a combination of both better test cases as well as a smart simulator.”

Data
If engineering and physics created the problems, better data analytics can help to minimize them.

“Statistical analysis can be classic GIGO (garbage in, garbage out),” said Aitken. “It’s important to remember that different domains, such as memory and digital logic, involve different values. In an SRAM, you have millions of identical paths. The statistics that you have to look at are the statistics of large numbers of identical things. What you’re actually looking for is deviation that’s in the large-number-of-sigma realm, such as five or six sigma. All the statistics you learn in college are concerned with the centers of distributions, not the tails. For the tails, different statistics apply, such as extreme value theory. That’s how memories get margined, and that’s how you make sure that when people design them and sell them, they work the way you anticipate.”
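A rough sketch of the “large numbers of identical things” point, using only the standard normal tail and an illustrative array size (a full treatment of the tail shape would use extreme value statistics, as Aitken notes):

```python
import math

def normal_tail(k_sigma: float) -> float:
    """One-sided tail probability P(X > k*sigma) for a standard normal."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2))

n_bits = 64 * 2**20          # e.g. a 64 Mb SRAM: millions of identical cells

for k in (3, 4, 5, 6):
    p_bit = normal_tail(k)
    # Probability that at least one of n identical cells falls in the tail.
    p_array = 1.0 - (1.0 - p_bit) ** n_bits
    print(f"{k} sigma: per-bit tail {p_bit:.1e}, chance of a tail cell in array {p_array:.3f}")
```

Even a 5-sigma per-bit event is a near certainty somewhere in a 64 Mb array, which is why memories are margined at six sigma and beyond.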

Accurately assessing variability issues in digital logic requires a different approach. “If a digital path fails, it’s likely not because one element of that path is 12 sigma out of spec,” he said. “It’s because all or many of the elements are a little bit out of spec, but in the same direction. The way we deal with that has evolved over time. There’s now a POCV (parametric on-chip variation) characterization approach to describing the variation and how it changes. Interestingly, in both the extreme value theory used for memory and the POCV used for digital logic, the variation is asymmetric, so you don’t have this beautiful bell-shaped curve.”

Instead, while there might be some underlying physics that follows the bell-shaped curves, the effects don’t, and the result is subtly different halves of the distribution. “The tools hide a lot of that from you,” said Aitken. “You’ve got a combination of a model that was generated by a foundry that’s gone into a standard cell library, with a characterization around it to model that behavior in POCV, which limits the description to a region of operational interest. Then the timing tool has to be able to read that model, make sense of it, and apply it in a sensible way.”
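As a simplified illustration of the POCV idea (not the exact Liberty/LVF math, and with invented per-stage numbers), per-stage sigmas are combined statistically along a path rather than added as fixed margins, with separate late and early sigmas capturing the asymmetry:

```python
import math

# Each stage: (nominal delay, late sigma, early sigma), all in ps, illustrative.
stages = [
    (42.0, 3.1, 2.4),
    (55.0, 4.0, 3.0),
    (38.0, 2.8, 2.2),
    (61.0, 4.5, 3.3),
]

nominal = sum(d for d, _, _ in stages)
sigma_late = math.sqrt(sum(sl**2 for _, sl, _ in stages))   # root-sum-square
sigma_early = math.sqrt(sum(se**2 for _, _, se in stages))

k = 3.0   # sign-off sigma multiplier
print(f"nominal path delay   : {nominal:.1f} ps")
print(f"late arrival @3sigma : {nominal + k * sigma_late:.1f} ps")
print(f"early arrival @3sigma: {nominal - k * sigma_early:.1f} ps")
print(f"linear worst-case    : {nominal + k * sum(sl for _, sl, _ in stages):.1f} ps")
```

The statistical combination is noticeably tighter than the old linear worst-case stack-up, which is exactly the over-design that variation-aware sign-off tries to claw back.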

Ultimately, the answer to the complexity will be a mix of advanced statistics and practical choices.

“You’ve got hundreds of sign-off corners on chips where you have to keep track of temperature, metal thickness, voltage, and other variations that all interact with each other. So you look at all these extreme behaviors to bound the space, because as long as the design lives inside the space, then it should work. You have to make sure that your library circuits actually have their worst-case behavior at boundaries,” said Aitken.
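A toy corner sweep along those lines, with hypothetical delay multipliers rather than real library data: if behavior is monotonic in each axis, the worst case falls on a boundary of the space, which is exactly the assumption the flip-flop caveat below can break.

```python
from itertools import product

# Hypothetical delay multipliers per corner axis (illustrative only).
process = {"ss": 1.15, "tt": 1.00, "ff": 0.88}
voltage = {"0.65V": 1.25, "0.75V": 1.00, "0.90V": 0.82}
temp    = {"-40C": 0.97, "25C": 1.00, "125C": 1.06}

def delay_ps(p, v, t, base_ps=100.0):
    return base_ps * process[p] * voltage[v] * temp[t]

corners = sorted(product(process, voltage, temp),
                 key=lambda c: delay_ps(*c), reverse=True)
worst = corners[0]
print("worst corner:", worst, f"{delay_ps(*worst):.1f} ps")
```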

With something like a NAND gate, ensuring the worst case lands at a boundary isn’t much of a concern. With a flip-flop, on the other hand, it matters a great deal.

“If you’re not careful in your flip-flop design, you will find that its worst-case operation is not at its minimum voltage or its maximum voltage. It’s somewhere in between, which means all your corner methodology starts to fall apart. You want to design your circuit so that it fails at one extreme or the other, but doesn’t fail in the middle somewhere, because then you may not have simulated your worst-case design point. One way this happens is with CMOS temperature inversion, where circuits operating near their threshold voltage speed up at higher temperatures, as opposed to slowing down as we normally expect. This becomes an issue when you have both N and P devices simultaneously trying to charge or discharge a node, and one exhibits temperature inversion while the other doesn’t. This can happen with keeper devices, transmission gates, or even large leaky transistors connected to a small active transistor. It’s generally best to steer clear of circuits like that, but when they can’t be avoided, careful design is needed,” he explained.
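A toy alpha-power-style model (illustrative constants, not a real device model) shows where the inversion comes from: mobility degrades with temperature, but the threshold voltage also drops, and near threshold the Vth effect wins, so the gate gets faster when hot. At nominal voltage the usual slowdown returns.

```python
# Toy model of CMOS temperature inversion; all constants are illustrative.
def relative_delay(vdd, temp_c, vth0=0.45, k_vth=0.0008, alpha=1.3, t0=25.0):
    mobility = (temp_c + 273.15) ** -1.5            # mobility falls with T
    vth = vth0 - k_vth * (temp_c - t0)              # Vth falls with T
    i_on = mobility * max(vdd - vth, 1e-6) ** alpha # alpha-power drive current
    return vdd / i_on                               # delay ~ C * Vdd / I_on

for vdd in (0.55, 1.0):
    d_cold = relative_delay(vdd, -40.0)
    d_hot = relative_delay(vdd, 125.0)
    trend = "faster when hot (inversion)" if d_hot < d_cold else "slower when hot"
    print(f"Vdd = {vdd:.2f} V: {trend}")
```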

To model and predict variability, the primary applied statistical method is Monte Carlo simulation (MCS), which uses repeated random sampling to estimate the likelihood of a range of outcomes. It offers several advantages over predictive models with fixed inputs, such as the ability to conduct sensitivity analysis or calculate the correlation of inputs.

Fig. 2: Characterization methods using Monte Carlo and sensitivity-based analysis. Source: Synopsys

“You take input variation, and you check what the output variations are. That’s basically Monte Carlo,” Cadence’s Madiraju said. “You take the model card and add specific model parameters with variation. For example, they’ll say the mobility of a transistor has a nominal value of 0.5, but with ±2% variation. When you do the Monte Carlo simulation, it needs to generate samples that show that variation within that range on the SDK.”
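A minimal sketch of that flow, treating the ±2% as a 3-sigma spread (an assumption for illustration) and using a stand-in function in place of the actual SPICE run:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sample a model-card parameter: nominal mobility 0.5, ±2% taken as 3 sigma.
n_samples = 10_000
mobility = rng.normal(loc=0.5, scale=0.5 * 0.02 / 3, size=n_samples)

def simulate(mu):
    # Hypothetical response standing in for a SPICE run: in the real flow each
    # sample is a full circuit simulation with the perturbed model card.
    return 25.0 * (0.5 / mu)          # gate delay in ps, inverse in mobility

delays = simulate(mobility)
print(f"mean delay  : {delays.mean():.2f} ps")
print(f"sigma       : {delays.std():.3f} ps")
print(f"worst sample: {delays.max():.2f} ps")
```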

Monte Carlo can be applied with simple, repetitive brute force, or refined through machine learning, said Siemens’ Tan. “Sometimes we have circuit designers or chip companies that target six sigma. In a statistical sense, that means one failure in about a billion. If we use a brute-force approach, that means we need to check a billion times. To be more confident, we actually need to run about 10 billion samples to see if our circuit fails 10 times, because with a single expected failure it might or might not show up. Of course, even with the computing resources we have today, it’s not feasible to run billions of simulations.”
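A quick back-of-envelope check on those numbers:

```python
import math

# Single-sided six-sigma failure rate and the brute-force Monte Carlo sample
# count needed to expect roughly 10 observed failures.
p_fail = 0.5 * math.erfc(6 / math.sqrt(2))   # ~9.9e-10, about 1 in a billion
samples_for_10_fails = 10 / p_fail

print(f"six-sigma failure probability : {p_fail:.2e}")
print(f"samples to expect 10 failures : {samples_for_10_fails:.1e}")
```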

Machine learning allows engineering teams to run fewer simulations, checking whether the predicted outputs are correct, then running more simulations to get an accurate view of the complete probability density function.

“You also get the long tail where the higher sigma regions are, which makes a non-feasible task become feasible,” said Tan. “You can get high-sigma pass/fail in daily production routines.”
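The general idea can be sketched as follows. This is a simplified surrogate-plus-rescreen loop under invented assumptions, not any vendor's actual algorithm: fit a cheap model on a modest simulation budget, screen a very large sample set with it, and re-simulate only the candidates it flags near the failure boundary.

```python
import numpy as np

rng = np.random.default_rng(7)

def expensive_sim(x):
    # Stand-in for a full SPICE run: a hypothetical nonlinear response to a
    # varied parameter, purely for illustration.
    return 1.0 + 0.2 * x + 0.05 * x**2

SPEC = 2.4                                   # failure if output exceeds SPEC

# 1) Spend a modest simulation budget to fit a cheap surrogate model.
x_train = rng.standard_normal(2_000)
y_train = expensive_sim(x_train)
surrogate = np.poly1d(np.polyfit(x_train, y_train, deg=2))

# 2) Screen a huge sample set with the surrogate (cheap), keeping only
#    candidates near or past the spec.
x_all = rng.standard_normal(10_000_000)
suspects = x_all[surrogate(x_all) > SPEC - 0.05]

# 3) Re-simulate only the suspects to confirm true failures.
fails = np.sum(expensive_sim(suspects) > SPEC)
print(f"screened {x_all.size:.0e} samples, re-simulated {suspects.size}, "
      f"confirmed {fails} failures (rate ~{fails / x_all.size:.1e})")
```

The simulation budget goes from tens of millions of runs to a few thousand, which is what makes high-sigma verification fit into daily production routines.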

Conclusion
One major contributor to variability is thermal effects, which are increasing with 3D designs. EDA tool providers are taking those issues more into account, but it’s a two-way street, and more is still needed from the foundry side.

“Today there are parameters that allow simulators to model self-heating effects,” said Cadence’s Madiraju. “But if they want to go beyond that to see heat propagation and other issues, what is not available is a technology file or information in that form from the foundries that helps you create a thermal model for the propagation. What’s not available to us is that thermal technology file that tells us the material properties and all the things that pertain to that process. Companies have that information, but they don’t have it in a form they can easily give to us.”


