Adding Circuit Aging To Variability

Traditionally these effects were analyzed separately, but they’re now heavily intertwined.


Moving to a smaller node usually means another factor becomes important. The industry has become accustomed to doing process, voltage, and temperature (PVT) corner analysis, but now it has to add aging into that mix.

The problem is that planning for circuit aging is no longer a purely statistical process. Aging is dependent on activity over the lifetime of the device. Tools need to be modified and new methodologies developed. Until that happens, something is being left on the table because extra margining is required.

How big an issue is this? “Deviations due to aging were traditionally about single-digit percentages,” says Roland Jancke, department head for design methodology at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “If variation is in the same range, the resulting effect is below 1% of the characteristic value. However, such reasoning used to be valid for technologies down to 20nm. In the latest sub-10nm technologies, aging as well as variation have significantly larger values.”

These variation factors are growing. “It was not a problem in the past, but for technology nodes starting at 28nm, we started to see some aging models,” says Ahmed Ramadan, product engineering director for Siemens EDA. “By the time we get to 3nm technologies, it will be a must. Aging will become a much bigger problem.”

It is a significant enough problem today that it has already become a must for some industries. “We have often seen a double-digit percentage impact of aging on path delays,” says Hitendra Divecha, product management director in the Digital & Signoff Group at Cadence. “This is something designers cannot ignore anymore. The failure to model differential aging accurately can lead to silicon failures. At the same time, modeling aging too pessimistically will negatively affect PPA [power, performance, area]. That is why it is important to have an accurate, yet practical aging-aware static timing analysis solution.”

Aging is particularly concerning for safety-critical designs, which until recently were not done at the leading-edge nodes. “Automotive is becoming a bigger sector, and they care about this,” says Marc Swinnen, director of product marketing at Ansys. “There is a lot of concern in the industry as people realize that the methodology we have today is insufficient. But they do have something. They do have aging libraries, and that is what is being used. There is a workaround, so in that sense people are limping by with what’s available. But there is a drive for a better solution.”

What makes aging so difficult to deal with? “For complex advanced node SoCs, the topic of circuit aging and degradation continues to battle against the singular fundamental challenge — it’s hard to predict the future,” says Stephen Crosher, director of SLM Hardware Strategy at Synopsys. “This effort has seen great gains within the design stages of semiconductor development, with the continual evolution of age simulation capabilities and enhanced model accuracy. However, we still face the fact that production-induced process variability is hard to predict, as well as lifetime chip activity. So the killer problem is simply ‘unpredictability.’”

The basics
Variability acts in different ways. “Today, we analyze for process, voltage, and temperature (PVT) variability,” says Siemens’ Ramadan. “Process is statistical in nature, and it is a static source of variability. If you look at the voltage and temperature, these are dynamic sources of variability. An increase in voltage means that variability will take place. If you have higher power in certain circuits, then temperature will increase and variability will take place. These dynamic factors go down when the temperature goes down or the voltage biases change. When talking about the variability that is coming from aging, it is a more permanent source of variability over time, because when degradation is happening to the device it is not recoverable. When threshold voltage is increased, it will not decrease again.”

Except when it does. “Some aging actually reverses over time,” says Ansys’ Swinnen. “Some aging heals, like hot carrier injection under the right conditions. Over time those problems can relax and anneal out, so if you don’t use it for a while it may actually improve again. There are a lot of complications and that has been the problem.”

Ignoring that complication, the industry is using the current methodology. “There are pretty good aging models available that will give you the transistor BSIM models for any aged transistor,” continues Swinnen. “You characterize the standard library, and you characterize a 5-year-old library, a 10-year-old library, a 20-year-old library based on your assumptions of activity, temperature, and so on. Then you re-analyze your design at each one of those states. What is my timing going to look like in 5 years and 10 years or 20 years? Nobody is happy with that.”
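
In practice, the flow Swinnen describes amounts to re-running timing against each aged library. The minimal sketch below shows that loop, with made-up derate factors and path delays standing in for fully characterized 5/10/20-year libraries and a real static timing run:

```python
# Illustrative only: re-analyze timing against fresh and aged libraries.
# The derate factors and path delays are invented, not from any foundry library.

CLOCK_PERIOD_NS = 1.0

# Fresh critical-path delays (ns), standing in for a full STA run.
fresh_path_delays = {"cpu_alu": 0.82, "ddr_phy": 0.94, "uart": 0.35}

# One characterized library per assumed age, expressed here as a delay multiplier.
aged_library_derate = {"fresh": 1.00, "5y": 1.03, "10y": 1.05, "20y": 1.08}

for age, derate in aged_library_derate.items():
    worst = min(CLOCK_PERIOD_NS - delay * derate for delay in fresh_path_delays.values())
    status = "OK" if worst >= 0 else "VIOLATION"
    print(f"{age:>5}: worst setup slack = {worst:+.3f} ns  [{status}]")
```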

The biggest problem is that not everything ages uniformly. “We need to capture the non-uniformity of aging degradation in time and space,” says Cadence’s Divecha. “This is sometimes referred to as differential aging. The non-uniformity in time is important when it comes to reflecting realistic operations on modern chips, which have multiple modes with different voltages. Differential aging, on the other hand, stems from various factors. For example, transistors on a chip have different parameters (threshold voltage, size) that react differently to aging stress. In addition, different instances of the same cells may experience different activities like duty cycles and toggle rates. As a result, two instances of the same library cell will age differently. Until now, it has been challenging to model these effects with acceptable costs for industrial solutions.”

Not all solutions are elegant. “I have seen major foundries do aging plus Monte Carlo,” says Ramadan. “Running brute force Monte Carlo is time-consuming, and running aging by itself is also going to be time-consuming. There are studies showing that running them together will give you a more accurate estimation of how aging is going to impact your design. Most over-design ends up increasing the power consumption of the circuit. This can be avoided with an accurate methodology that runs Monte Carlo and aging together.”

But this is not a general solution. “You start with a fresh simulation, or sometimes a stress simulation for a short period of time, and measure the activity on each and every device in the circuit,” explains Ramadan. “Based on that activity, you calculate an aging factor for that specific device. This gives a measure of how much stress is applied on that device, perhaps because of bias conditions and activity. With that amount of stress, through the aging equations, model parameters for all the devices in the design will be updated based on the amount of stress on each device. Then you run another simulation, which will represent the actual state of the device after 10 years or so. From this final analysis you will be able to figure out which devices have caused the most degradation in the circuit.”
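
In code form, that two-pass flow looks roughly like the toy sketch below. The power-law degradation model, device names, and every constant are illustrative assumptions, not the foundry’s actual aging equations:

```python
import math

# Toy two-pass aging flow with the same shape as the one described above:
# (1) a "fresh" simulation yields per-device activity and stress,
# (2) an aging equation turns that stress into a parameter shift,
# (3) the shifted parameters feed a second, "aged" simulation.

def fresh_simulation():
    # Stand-in for a transient run: per-device duty cycle and gate overdrive (V).
    return {"M1": {"duty": 0.90, "overdrive": 0.55},
            "M2": {"duty": 0.10, "overdrive": 0.40}}

def vth_shift_v(duty, overdrive, years, a=0.001, n=0.2, k=2.0):
    # Made-up NBTI-like power law: shift grows with stress time and overdrive.
    stress_hours = duty * years * 8760
    return a * math.exp(k * overdrive) * stress_hours ** n

activity = fresh_simulation()
aged_params = {}
for dev, act in activity.items():
    dvth = vth_shift_v(act["duty"], act["overdrive"], years=10)
    aged_params[dev] = {"delta_vth_mV": 1000 * dvth}
    print(f"{dev}: duty={act['duty']:.2f}  dVth ~ {1000 * dvth:.1f} mV after 10 years")

# aged_params would now drive the second ("aged") simulation run.
```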

Combining the analysis
However, there are some pitfalls to be aware of when combining traditional Monte Carlo simulation with aging. “When talking about statistical analysis, people might think about worst-case corners and think that is the worst case for aging,” says Ramadan. “It is not. There may be some correlation, because the best-case corner may be driving more current, and that may impact the stress on the device and in turn affect aging. But you cannot just take the most pessimistic corner and use it to find out how aging is going to impact your design. Maybe the best-case statistical corner can lead to higher aging.”
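
A small Monte Carlo sweep illustrates the point. All constants below are assumptions; the only claim is the shape of the correlation, in which the statistically fast sample sees more overdrive and therefore a larger aging shift:

```python
import random
random.seed(1)

# Illustrative Monte Carlo + aging sweep; all constants are assumptions.
# The point: the statistically "fast" sample (low Vth, high overdrive) is
# stressed hardest and therefore ages the most, so the worst post-aging
# device is not simply the pessimistic fresh corner.

VDD, VTH_NOM, SIGMA = 0.75, 0.30, 0.02

samples = []
for _ in range(1000):
    vth0 = random.gauss(VTH_NOM, SIGMA)   # fresh threshold voltage (V)
    stress = VDD - vth0                   # overdrive as a crude stress proxy
    dvth = 0.05 * stress                  # toy 10-year aging shift (V)
    samples.append((vth0, dvth))

fast = min(samples, key=lambda s: s[0])   # lowest fresh Vth: "best case" corner
slow = max(samples, key=lambda s: s[0])   # highest fresh Vth: "worst case" corner
print(f"fast corner: fresh Vth = {fast[0]:.3f} V, aging shift = {1000 * fast[1]:.1f} mV")
print(f"slow corner: fresh Vth = {slow[0]:.3f} V, aging shift = {1000 * slow[1]:.1f} mV")
```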

The biggest factor is activity. “A lot of aging is activity- and thermal-dependent,” says Swinnen. “There may be parts of the chip that are rarely used, perhaps only used at startup, and that happens once a month. And there are other parts that are switching continuously. Aging varies across the chip, and temperature varies across the chip. So it’s not that we can’t determine how these parts will react to aging, but we have to find out which parts age and how much. And then, if you have two parts of the chip that have aged differently, you start getting clock skew issues.”

This is a non-trivial problem. “One part of the clock path is fast and the other slow, and whereas before we were just meeting hold times, now suddenly you are not after 5 years,” Swinnen says. “So saying the entire chip is 5 or 10 years old is not good enough. You really need individual factors. What you really need is to be able to select the appropriate timing models for the pieces of the logic based on inputs over time, on activity, on temperature, that can then be used on the fly to do timing simulations with different models for different parts of the logic.”
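
A toy hold-time check shows how that plays out. The delays and region-specific derates below are invented for illustration:

```python
# Sketch of how differential aging becomes a hold problem; all numbers invented.
# The launch and capture clock branches sit in regions with different activity
# and temperature, so after ten years they carry different aging derates.

T_HOLD_NS = 0.05          # flop hold requirement
DATA_PATH_MIN_NS = 0.08   # fastest data path, kept fresh (pessimistic for hold)

fresh_clock_ns = {"launch": 1.00, "capture": 1.00}
derate_10y = {"launch": 1.01, "capture": 1.05}   # region-specific aging (assumed)

def hold_slack(launch, capture):
    # Earliest data arrival must beat the capture edge plus the hold requirement.
    return (launch + DATA_PATH_MIN_NS) - (capture + T_HOLD_NS)

fresh = hold_slack(fresh_clock_ns["launch"], fresh_clock_ns["capture"])
aged = hold_slack(fresh_clock_ns["launch"] * derate_10y["launch"],
                  fresh_clock_ns["capture"] * derate_10y["capture"])
print(f"fresh hold slack:    {fresh:+.3f} ns")
print(f"10-year hold slack:  {aged:+.3f} ns")
```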

How big an impact is this? “In advanced node designs, aging is a first-order problem, and it deserves attention,” says Divecha. “It is common to see a 5% to 10% degradation, even in the first two years of a product’s lifetime. For high-performance products like GPUs, server CPUs, etc., operating at higher voltages and temperatures, degradation can be more rapid. It is also a large problem for products that are designed for long lifetimes, such as automotive and industrial parts. Simplistic approaches, such as timing derates, no longer suffice to address the problem.”

What else can be done?
Until complete analysis becomes possible, there are a couple of ways to minimize the impact.

“Sometimes in circuit design, there are techniques where a perfect match between two or more devices is required for the functional behavior, e.g., in differential pairs,” says Fraunhofer’s Jancke. “Now, if both devices of such a pair are unsymmetrically stressed, they will age differently, and sooner or later the differential pair will lose its matching property. Therefore, mitigation techniques have been developed that switch the input of a differential pair between both input terminals on a regular basis to avoid unsymmetrical stressing.”
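
A toy model of that chopping scheme shows why it works: alternating which device carries the stressful input keeps the accumulated stress, and therefore the drift, balanced between the two halves of the pair. The stress model and numbers below are illustrative assumptions:

```python
# Toy model of the input-chopping mitigation described above.
# Whichever device carries the stressful input accrues one unit of stress per
# hour. Regularly swapping the inputs shares that stress between the two
# devices, so the pair stays matched as it ages.

HOURS = 10_000
SWAP_PERIOD = 100   # swap the inputs every 100 hours

def accumulated_stress(chopping: bool):
    stress = {"M_plus": 0.0, "M_minus": 0.0}
    active = "M_plus"
    for hour in range(HOURS):
        if chopping and hour > 0 and hour % SWAP_PERIOD == 0:
            active = "M_minus" if active == "M_plus" else "M_plus"
        stress[active] += 1.0
    return stress

for label, chop in (("no chopping", False), ("with chopping", True)):
    s = accumulated_stress(chop)
    mismatch = abs(s["M_plus"] - s["M_minus"])
    print(f"{label:>14}: M+ = {s['M_plus']:.0f}, M- = {s['M_minus']:.0f}, mismatch = {mismatch:.0f}")
```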

Another possibility is to track aging as it happens in the device. “By seeing, measuring, and quantifying the effects of aging within operational silicon, you are not only able to determine the degradation rate, but you are also able to see the effects of localized, differential aging,” says Synopsys’ Crosher. “The attractiveness of such visibility is that not only can the trends of aging be seen across entire product ranges, but also within each individual chip. This allows for algorithms to predict when maintenance is required or when failure may likely occur. Another benefit is that such in-field, operational degradation information can be fed back into the age models for better simulation accuracy.”

Measuring heat is particularly helpful. “Although thermal measurement can be considered a traditional sensor, it is by far the most potent resource for managing aging effects, especially if done accurately,” says Crosher. “We’ve not only seen that circuit activity will create accelerated aging through self-heating and stress, but also accelerated aging through circuit complexity. Complexity is especially relevant for localized aging and electromigration when considering the heating effects of cross thermal conductivity between high density signals and power tracking, which is compounded by the increasing number of metal layers made available with the aim of mitigating routing congestion for advanced node SoCs.”

Still, some things cannot be monitored or even tested for. “When you look at electromigration, this is an aging effect,” says Swinnen. “But you can’t test for it, and it is temperature-dependent. We now have electromigration analysis that is temperature-sensitive. We can feed back the temperature of each part of the chip, and we can have the electromigration limits vary by local thermal regions. Then we can flag areas that normally would seem to have wires that are wide enough, but because they’re so hot they are not wide enough.”
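
Conceptually, such a temperature-sensitive check derates the allowed current density in hot regions. The sketch below uses an Arrhenius-style derating loosely in the spirit of Black's equation; the activation energy, limits, and wire data are illustrative assumptions, not any foundry's sign-off numbers:

```python
import math

# Sketch of a temperature-aware electromigration width check.
# The allowed current density is derated with local temperature using an
# Arrhenius-style factor; the same wire passes in a cool region and fails
# in a hot one.

K_BOLTZMANN_EV = 8.617e-5   # eV/K
EA_EV = 0.85                # assumed activation energy
J_LIMIT_REF = 2.0e6         # A/cm^2 allowed at the reference temperature
T_REF_C = 105.0

def j_limit(temp_c):
    t_k = temp_c + 273.15
    t_ref_k = T_REF_C + 273.15
    # Same lifetime target -> allowed current density shrinks as temperature rises.
    return J_LIMIT_REF * math.exp((EA_EV / K_BOLTZMANN_EV) * (1.0 / t_k - 1.0 / t_ref_k))

wires = [  # (name, current A, width cm, thickness cm, local temperature C)
    ("vdd_spine", 0.003, 2.0e-4, 1.0e-5, 105.0),
    ("hot_strap", 0.003, 2.0e-4, 1.0e-5, 135.0),
]

for name, i_amp, width, thick, temp in wires:
    j = i_amp / (width * thick)
    limit = j_limit(temp)
    flag = "OK" if j <= limit else "EM VIOLATION"
    print(f"{name}: J = {j:.2e} A/cm^2, limit at {temp:.0f}C = {limit:.2e} -> {flag}")
```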

Aging models
As with so many things in EDA, it starts with models. “Until recently, accurate aging analysis was only possible with SPICE,” says Divecha. “That requires an expert to bring in knowledge from various disciplines, such as semiconductor physics, electrical analysis, logical analysis, timing analysis, etc., to thoroughly study the problem and its impact. Another approach sometimes employed in the industry is to characterize libraries under fixed aging stress conditions. A library is generated using a fixed value of parameters such as stress, voltage, temperature, switching activity, etc., and this is used in static timing analysis to account for aging effects. However, it has hardly been a practical or sufficiently accurate solution, because it assumes uniform aging in time and space and requires very high compute resources for additional library generation.”

Progress is being made, though. “Recently, we have been pushing a standard solution that comes from the Compact Model Coalition (CMC) under Si2,” says Ramadan. “This solution is called the Open Model Interface. It is a flow that enables foundries, or CAD teams in any design house, to integrate an aging model inside that interface so that it works with any of the simulators they are using today.”

Conclusion
Both variability and aging are getting worse, and they are to some degree dependent on each other. However, traditional Monte Carlo simulation is not sufficient because aging is activity-dependent, though often in complex ways. There are few viable solutions in the industry today, but the problem is growing to the point where some industries, such as automotive, are very likely to push hard for more comprehensive solutions.



