Aging must be understood, analyzed, and mastered. Until then, additional margins are the only way out.
Nothing lasts forever, but in the semiconductor world things used to last long enough to become obsolete long before their end of life. That’s no longer the case with newer nodes, and it is raising concerns in safety-critical markets such as automotive.
Fully understanding what happens inside chips is still a work in progress, and analysis approaches are trying to keep up. Until they do, additional margins are the only protection.
“Historically, we have just been very conservative with design,” says Dan Lee, product management director at Cadence. “We do the analysis when possible, but so far we haven’t heard headlines or a lot of reports about aging being the culprit for failures.”
Some of that is due to a lack of good data. “Silent data corruption has suddenly become a hot topic,” says Nilanjan Mukherjee, senior engineering director for Tessent at Siemens EDA. “Now, when people are getting failing parts, they have to RMA those, and they have to do diagnosis to determine the cause of the failure. There is a consensus across the industry that the failures they are seeing are mostly related to small delay defects, meaning there are some defects that are causing certain paths in the circuit to exceed the clock period, resulting in a failure.”
Until it is fully understood, a margin tax is required. “Aging was always considered, but they did this by putting guard-bands in STA,” says Manoz Palaparthi, senior staff product manager at Synopsys. “Those can be as much as 5% to 10%. They do a derate that applies to the full design, like a flat tax across all the cells in the design, and then that is used as a threshold for any aging impact on the design. But now we’re at a point where customers want to reduce those guard bands and improve their PPA.”
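To make the flat-tax idea concrete, here is a minimal sketch of how such a derate might be applied at timing sign-off. The derate value, path names, delays, and clock period are all invented for illustration, not taken from any real flow.

```python
# Minimal sketch of a flat aging derate in STA, using made-up numbers.
# A single multiplier is applied to every path in the design,
# regardless of how stressed its cells actually are.

AGING_DERATE = 1.08          # flat 8% margin, in the 5-10% range cited
CLOCK_PERIOD_NS = 1.0        # illustrative 1 GHz clock

# Hypothetical path delays (ns) as reported by fresh-silicon STA.
fresh_path_delays_ns = {
    "cpu_alu_path": 0.88,
    "cache_tag_path": 0.91,
    "io_sync_path": 0.75,
}

for path, delay in fresh_path_delays_ns.items():
    aged = delay * AGING_DERATE  # the same "flat tax" on every path
    status = "OK" if aged <= CLOCK_PERIOD_NS else "VIOLATION"
    print(f"{path}: fresh {delay:.3f} ns, derated {aged:.3f} ns -> {status}")
```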
Understanding failure
There are multiple reasons why semiconductors age. “You have multiple components like hot carrier injection (HCI) and bias temperature instability (BTI),” says Synopsys’ Palaparthi. “These effects are accelerated by voltage and temperature, the stress condition that you are applying. Stress also comes from signal activity — the duty cycle. All these components accelerate the aging of the design.”
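Those dependencies are commonly captured in empirical models. The sketch below uses a generic BTI-style form, with a power law in time and voltage, an Arrhenius temperature term, and duty-cycle scaling. The functional form is typical of the literature, but every coefficient is a placeholder assumption rather than characterized foundry data.

```python
import math

# Illustrative empirical BTI-style aging model. All coefficients below
# are made-up placeholders, not characterized foundry values.

K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K

def delta_vth_mv(v_stress, temp_c, duty_cycle, years,
                 a=50.0, gamma=3.0, ea_ev=0.1, n=0.2):
    """Threshold-voltage shift (mV) after `years` of stress."""
    temp_k = temp_c + 273.15
    t_sec = years * 365 * 24 * 3600
    return (a
            * duty_cycle                                     # fraction of time under stress
            * v_stress ** gamma                              # voltage acceleration
            * math.exp(-ea_ev / (K_BOLTZMANN_EV * temp_k))   # thermal acceleration
            * t_sec ** n)                                    # power-law time dependence

# A hotter, busier device ages faster than a cool, mostly idle one.
print(delta_vth_mv(0.75, 105, 0.8, 5.0))  # heavily stressed device
print(delta_vth_mv(0.75, 55, 0.1, 5.0))   # lightly used device
```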
While these effects always have existed, they are being magnified by the latest technology nodes. “We are looking at features that are quite small,” says Kelly Morgan, senior principal application engineer at Ansys. “If there’s any variation in them, what does that mean for reliability, both thermal and mechanical? What if we vary some dimensions a little bit and run a sensitivity study? From a manufacturing perspective, something may come back a little bit too thin or a little bit too thick. Does that affect performance in some way?”
That requires a particular type of analysis. “With aging, as with thermal, you tend to deal with averages over time, so you’re less worried about small peak events,” says Marc Swinnen, director of product marketing at Ansys. “It’s more about long-term activity, the long-term thermal picture. Because aging is a cumulative effect, it doesn’t require cycle-accurate modeling. It’s not like timing, where a single event can cause a timing failure. That’s not the case with aging. Any single event is not going to make that much difference. It’s more the average over time.”
Fig. 1: Thermal-aware statistical EM simulation for predicting aging. Source: Ansys
Failures always have followed the traditional bathtub curve. There are early-life failures, called time-zero (t=0) failures, which usually are caught during test and which appear somewhat random. But those are different from aging, or wear-out, failures.
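One way to picture the bathtub curve is as the sum of a falling infant-mortality hazard, a constant random-failure floor, and a rising wear-out hazard, as in the sketch below. All parameters are invented and chosen only to reproduce the shape.

```python
# Illustrative bathtub curve: the overall hazard rate is the sum of a
# decreasing infant-mortality term, a constant random-failure term, and
# an increasing wear-out term. Parameters are made up for shape only.

def hazard(t_years,
           infant_scale=0.05, infant_beta=0.3,   # decreasing early-life term
           random_rate=0.001,                    # flat random-failure floor
           wear_scale=0.0005, wear_beta=3.0):    # increasing wear-out term
    t = max(t_years, 1e-6)  # avoid a singularity at t = 0
    infant = infant_scale * infant_beta * t ** (infant_beta - 1)
    wearout = wear_scale * wear_beta * t ** (wear_beta - 1)
    return infant + random_rate + wearout

for t in (0.05, 0.5, 2, 5, 8, 10):
    print(f"t={t:>5} yr  hazard={hazard(t):.4f}")
```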
“The majority of failures in the field are happening in the early life period,” says Mukherjee. “These can be divided into two different types of failures. One is latent defects that are escaping, meaning we are not stressing the transistors enough as part of manufacturing test. The second is intermittent defects, where the parts have been tested in manufacturing test, but the environmental conditions — such as the voltage droop on the power rails and software workloads — are impossible to mimic. These are conditions that exist when running in a data center, and those failures show up at a specific time. If you change the workload, the failures don’t show up. But if that same workload is running, the failures do show up. These are called intermittent defects. This is happening within the first two weeks or a month, but there are also failures six months or a year down the road. That is because of transistor aging, particularly attributed to the workload, because they are running workloads for months continuously.”
3D-IC impact
One of the big issues associated with 3D-ICs is heat generation and how to get rid of it. “Thermal is interesting because a lot of electrical properties and material properties tend to be temperature-dependent,” says Ansys’ Morgan. “As things get hotter, that will affect the electrical property, which will in turn affect the heat that’s being generated, and then back into thermal. Having a good understanding of that loop can be influential.”
Aging is proportional to heat. “The problem is that we’re talking about a complex structure,” says Cadence’s Lee. “While you can apply techniques such as extrapolation, or you can try to piece analysis together, you’re dealing with a highly complex problem that’s challenging to analyze from a scale perspective. The methodologies and the tools may or may not have been keeping up with this, but the scale is just so huge when you try to analyze the entire 3D stack.”
It requires planar techniques to be rethought. “Previously, designers assumed a uniform thermal gradient profile across a single die,” says Synopsys’ Palaparthi. “But when you have two dies sitting on top of each other, that assumption is no longer true. You have very different local thermal effects. And that change in thermal profile also can impact your stress condition for aging. One of the key accelerators for device aging is the stress condition, both on the PVT side and in signal activity factors and duty cycles. If, as part of PVT, your temperature component has a higher impact in your multi-die design, that accelerates aging for the whole design.”
And all of this adds a layer of complexity. “Designers haven’t done real thermal analysis with meshing and everything,” says Ansys’ Swinnen. “They’ve used power density as a proxy for thermal. The traditional chip tools can calculate power density, meaning how much power is being used per square micron, and they use that power density as a proxy for temperatures. They assume higher power density means higher temperature, and they go with that.”
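In sketch form, that proxy amounts to binning per-cell power onto a coarse grid and dividing by tile area. The cell coordinates, powers, and grid size below are invented for illustration.

```python
# Minimal sketch of power density as a thermal proxy: bin per-cell power
# onto a coarse grid and report power per unit area. Hotspots are simply
# the tiles with the highest density; no real thermal meshing is done.

GRID_UM = 100.0  # tile size in microns (illustrative)

# Hypothetical cells: (x_um, y_um, power_mw)
cells = [
    (30, 40, 1.2), (70, 60, 0.8), (150, 50, 2.5),
    (160, 70, 2.2), (220, 180, 0.4),
]

density = {}  # (tile_x, tile_y) -> mW per square micron
for x, y, p in cells:
    tile = (int(x // GRID_UM), int(y // GRID_UM))
    density[tile] = density.get(tile, 0.0) + p / (GRID_UM * GRID_UM)

# Rank tiles from hottest proxy to coolest.
for tile, d in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"tile {tile}: {d * 1e6:.1f} nW/um^2")
```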
Key problem areas
Some parts of the design are likely to be more vulnerable to aging. “Even if you have a leading-edge digital design, you’re going to have some analog for maintaining your voltage or your power supply,” says Lee. “Those are fairly hefty analog devices. Aging simulation of those is pretty common because there’s so much current draw, and that causes a lot of stress. That’s a place where we really need to focus our aging analysis to make sure the sizing is appropriate, to make sure that we are getting five years out of this instead of six months. In a mission-critical application, the PHYs are another area of concern. They are mixed-signal designs that are going to be under a lot of electrical stress all the time, and it makes a lot of sense to focus your aging analysis on them.”
The clock tree is likely to see a lot of activity. “People want to see aging on the clock network,” says Palaparthi. “Clocks are often running at 3GHz or more. What is my fresh clock jitter? What is my aged jitter two years down the road? What is my uncertainty, and what does my duty cycle distortion look like? These are effects that are always present in the design. As with aging, customers insert margins to counter these effects. With high-frequency designs, people want to quantify and adjust those margins.”
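The arithmetic behind those margins is straightforward. The sketch below subtracts jitter and duty-cycle distortion from a 3GHz clock period to show how much of the timing window aging consumes; the fresh and aged numbers are invented assumptions.

```python
# Back-of-the-envelope clock budget at 3 GHz, with invented numbers,
# comparing fresh silicon against an assumed aged condition.

PERIOD_PS = 1e6 / 3000.0  # 3 GHz -> ~333 ps period

budgets = {
    "fresh":    {"jitter_ps": 8.0,  "dcd_ps": 4.0},
    # Hypothetical degradation after two years of stress.
    "aged_2yr": {"jitter_ps": 12.0, "dcd_ps": 7.0},
}

for name, b in budgets.items():
    usable = PERIOD_PS - b["jitter_ps"] - b["dcd_ps"]
    print(f"{name}: usable window {usable:.1f} ps "
          f"({100 * usable / PERIOD_PS:.1f}% of period)")
```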
Things are getting worse on the wire side, as well. “The physical dimensions are shrinking at a certain factor, but the power supply voltages do not scale,” says Lee. “You could tolerate small imperfections in interconnect at larger technologies, but today even a single defect can compromise the capacity of the interconnect. The amount of current, or charge, that you need to toggle a couple of MOSFET gates hasn’t shrunk that much, either. The punchline is, ‘The demand for current density, the amount of current moving through a cross section, is going up.’ We are getting to the point where the density is approaching what the interconnect can support.”
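The squeeze Lee describes falls out of simple arithmetic: if switching current shrinks slowly while the wire cross-section shrinks quickly, the current density J = I/A climbs. The sketch below works through that with invented dimensions and currents.

```python
# Illustrative current-density arithmetic across two hypothetical nodes.
# The drive current shrinks slowly while the wire cross-section shrinks
# quickly, so J = I / A rises sharply. All numbers are invented.

nodes = [
    # (label, drive current uA, wire width nm, wire thickness nm)
    ("older node", 60.0, 100.0, 180.0),
    ("newer node", 45.0,  20.0,  40.0),
]

for label, i_ua, w_nm, t_nm in nodes:
    area_cm2 = (w_nm * 1e-7) * (t_nm * 1e-7)   # nm -> cm on each axis
    j_ma_cm2 = (i_ua * 1e-6) / area_cm2 / 1e6  # A/cm^2 -> MA/cm^2
    print(f"{label}: J = {j_ma_cm2:.2f} MA/cm^2")
```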
There are several other areas where people are beginning to get worried, but at the moment there is insufficient data to know how big an issue they may be. While the industry has not heard of issues with 3D-IC assemblies, it is not clear they would be announced. Upcoming changes in transistors, backside power and clocking, warpage, mismatched coefficients of expansion, and thinner substrates are all on the radar.
Doing the analysis
Standard practice today is to take pre-characterized aged libraries from the foundries. These attempt to represent how devices would have degraded under typical conditions. “The issue is that it assumes aging is consistent for all transistors. But aging is dependent on temperature and activity,” says Swinnen. “Active parts of the design will age faster than parts of the design that are rarely activated, and warmer parts of the design will age faster. Aging is not uniform across the design. That can lead to setup and hold issues if one set of transistors slows down and the other ones don’t. Capturing that has been a problem because you need to consider average activity over time, then assign different libraries to different cells. Aging everything equally is not worst case. That’s the problem. It’s somewhat optimistic.”
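Capturing that non-uniformity amounts to giving each cell its own derate, driven by its average activity and local temperature, rather than one flat number. The linear sensitivity model and cell data below are invented for illustration; a real flow would draw on foundry-characterized aged libraries.

```python
# Sketch of per-cell aging derates driven by average activity and local
# temperature, instead of one flat derate. The linear sensitivity model
# and all cell data below are invented for illustration.

BASE_DERATE = 1.02       # residual margin even for idle, cool cells
ACTIVITY_SENS = 0.04     # extra derate at 100% activity (assumed)
TEMP_SENS = 0.0004       # extra derate per degree C above 25C (assumed)

# (cell, average activity 0..1, average local temperature C)
cells = [
    ("clk_buf_17", 0.95, 105.0),  # hot, busy clock buffer
    ("alu_nand_3", 0.40,  85.0),  # moderately active datapath cell
    ("cfg_reg_9",  0.02,  55.0),  # rarely toggled configuration register
]

for name, act, temp_c in cells:
    derate = BASE_DERATE + ACTIVITY_SENS * act + TEMP_SENS * (temp_c - 25.0)
    print(f"{name}: derate {derate:.3f}")
```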
The industry is providing a range of solutions, depending on the accuracy required, from flat derates to pre-characterized aged libraries to activity-based, per-cell analysis. By moving up through those levels of accuracy, guard bands can be reduced from the 5% to 10% range down to about 2%.
Most companies provide some kind of sensitivity analysis based on activity, but it is likely that this form of analysis will bring in more aging factors, such as local temperature or other forms of stress.
There is a new twist coming into play, as well. “We are starting to consider effects like local layout effects (LLE),” says Palaparthi. “Foundries are adding requirements for LLE, which capture how the layout near a given cell affects its switching or other parameters, such as its delay.”
Handling errors
The chip industry is at the early stages of intelligently handling aging. Not only is it possible to detect when aging may become a problem, but there are some strategies that can be deployed to ensure the device continues to operate, even if in a partially degraded manner.
A lot of this is made possible by using in-built monitors. “There are three things that need to be done,” says Mukherjee. “First, we have to figure out where to place the monitors. Second, how do you share those monitors across paths so that it maximally covers the transistors that are critical? Third, how do I get the data on and off the monitors?”
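The placement and sharing steps are essentially a coverage problem. A minimal greedy set-cover sketch is shown below, with invented monitor sites and critical paths; production tools use far richer cost models.

```python
# Greedy sketch of monitor placement as a set-cover problem: repeatedly
# pick the candidate site that covers the most still-uncovered critical
# paths. Site and path names are invented for illustration.

candidate_sites = {
    "site_A": {"path1", "path2", "path5"},
    "site_B": {"path2", "path3"},
    "site_C": {"path4", "path5", "path6"},
    "site_D": {"path6"},
}
critical_paths = {"path1", "path2", "path3", "path4", "path5", "path6"}

chosen, uncovered = [], set(critical_paths)
while uncovered:
    best = max(candidate_sites,
               key=lambda s: len(candidate_sites[s] & uncovered))
    if not candidate_sites[best] & uncovered:
        break  # remaining paths have no candidate monitor site
    chosen.append(best)
    uncovered -= candidate_sites[best]

print("monitor sites:", chosen)       # e.g. ['site_A', 'site_C', 'site_B']
print("uncovered paths:", uncovered)  # empty if full coverage is possible
```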
Those monitors can be used to assess aging, and the data also can be fed back into control systems for adaptive voltage or frequency control. A device could either be slowed down, or the voltage increased to bring the system back into an operational range.
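In its simplest form, that feedback loop compares the monitor's reported timing margin against a target and nudges the supply voltage, as in the sketch below. The thresholds, step size, and voltage limits are invented assumptions.

```python
# Toy adaptive-voltage loop driven by an aging monitor: if the reported
# timing margin drops below a target, raise the supply a step (within a
# cap); if there is ample margin, relax it. All numbers are invented.

TARGET_MARGIN_PS = 20.0
V_STEP, V_MIN, V_MAX = 0.0125, 0.70, 0.85

def adjust_voltage(v_now, margin_ps):
    if margin_ps < TARGET_MARGIN_PS:
        return min(v_now + V_STEP, V_MAX)   # buy back timing margin
    if margin_ps > 2 * TARGET_MARGIN_PS:
        return max(v_now - V_STEP, V_MIN)   # recover power when safe
    return v_now

# Simulated monitor readings as the part ages.
v = 0.75
for margin in (45.0, 32.0, 18.0, 15.0, 22.0):
    v = adjust_voltage(v, margin)
    print(f"margin {margin:5.1f} ps -> VDD {v:.4f} V")
```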
Monitors also can be used to determine what has failed. “If I can apply structural, deterministic patterns in-field, and make sure I target the paths that are most critical, then I can start to do correlation,” says Mukherjee. “Today, when a monitor says something is failing, it has already failed. But you have no way to figure out why it failed. Once you have structural patterns, they can give you that information. There is much more diagnostic information. I can start to see which parts of the design, or which cones of logic, are more susceptible to failures, versus which cones are not. Now I’m starting to do prediction. I don’t need to wait until a path fails. I can predict how close they are to failure.”
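Prediction, at its simplest, is a trend fit: track a monitored path's slack over time and extrapolate when it crosses zero. The least-squares sketch below uses invented measurements, and real degradation is rarely linear, so it only illustrates the idea.

```python
# Minimal failure-prediction sketch: fit a line to periodic slack
# measurements from a monitored path and extrapolate the zero crossing.
# The measurements below are invented for illustration.

months = [0, 3, 6, 9, 12]
slack_ps = [42.0, 38.5, 35.5, 33.0, 30.0]  # shrinking timing slack

# Ordinary least-squares fit of slack vs. time.
n = len(months)
mx = sum(months) / n
my = sum(slack_ps) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(months, slack_ps))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx

fail_month = -intercept / slope  # where the fitted line hits zero slack
print(f"slope {slope:.2f} ps/month; "
      f"projected zero slack at month {fail_month:.0f}")
```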
Conclusion
Aging is a significant tax on semiconductor performance, and in an age where every percent counts, aging is a problem that must be better understood. Without deploying the right methodologies and tools, that tax will increase with every node. The industry is attempting to keep up with the problem, but nobody truly knows how close to the edge a design is, or when it may reach that point. While more detailed analysis is inevitable, it is also possible to do too much for too little gain.
Related Reading
Chip Aging Becoming Key Factor In Data Center Economics
Rising thermal density, higher compute demands, and more custom devices can impact processing at multiple levels.
Heat-Related Issues Impact Reliability In Advanced IC Designs
Retaining data in memories and processors becomes more difficult as temperatures rise in advanced packages and under heavy workloads.