Assessing the reliability of a device requires adding more physical factors into the analysis, many of which are interconnected in complex ways.
Chip aging is a growing problem at advanced nodes, but so far most design teams have not had to deal with it. That will change significantly as new reliability requirements roll out across markets such as automotive, which require a complete analysis of factors that affect aging.
Understanding the underlying physics is critical, because it can lead to unexpected results and vulnerabilities. And the usual approach of overdesigning a chip is no longer a viable option, particularly when competitors are utilizing better design and analysis techniques that limit the need for overdesign.
“Semiconductor devices age over time, we all know that, but what is often not well understood are the mechanisms for aging or the limits that will cause a chip to fail,” says Stephen Crosher, CEO for Moortec. “In addition, there is bound to be a requirement for a minimum lifetime of a device, which will depend on the application. This could be 2 or 3 years for a consumer device, and up to 10 years for telecommunications devices. Given that aging processes are complex and often difficult to fully predict, many chip designs today are often over-designed to ensure adequate margin to meet requirements for reliable lifetime operation.”
The types of devices that demand high reliability are growing. “The advanced node devices that go into a base station or into a server farm have pretty stringent reliability requirements,” points out Art Schaldenbrand, senior product manager at Cadence. “They operate 24 hours a day, 7 days a week. That is continuous stress. Then there are mission-critical applications. There is a lot of focus on automotive, but the same is true for industrial applications, or space applications where the cost of failure is very high. Once a satellite is sent into space, you want it to work until the end of its useful lifetime.”
What makes this all the more troubling is that some failure modes are statistical. “If aging processes could become more deterministic, or better still if you can monitor the aging process in real-time, then you can reduce the over-design,” says Crosher. “You could develop chips that react and adjust for aging effects, or even predict when chip failure may occur.”
The physics of aging
But first we have to understand the underlying causes of aging. “Damage is caused when the design is subjected to electric stress,” explains João Geada, chief technologist for ANSYS. “There are things that happen to metals and things that happen to transistors.”
Transistors are vulnerable on multiple fronts. The three major issues that make them vulnerable are:

- Hot carrier injection (HCI), in which high-energy carriers are driven into the gate dielectric and become trapped there.
- Bias temperature instability (BTI), in which charge traps accumulate in the gate dielectric under voltage and temperature stress. NBTI affects PMOS devices, while PBTI affects NMOS devices.
- Time-dependent dielectric breakdown (TDDB), in which the gate dielectric is progressively damaged by electric field stress until it eventually fails.

“These are the three major degradation mechanisms that can affect a MOSFET, a finFET, or an FD-SOI device,” says Ahmed Ramadan, senior product engineering manager from the AMS group of Mentor, a Siemens Business. “The impact is that they will alter the threshold voltage of the device. That can affect the drive current of the device and lead to a slowdown of the device, consequently slowing down the whole circuit.”

Eventually, under continued stress, the device may stop operating altogether.
Fig 1. Effect of NBTI on an SRAM cell. Source: Synopsys
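Ramadan’s chain from threshold shift to drive current to circuit slowdown can be illustrated with a back-of-the-envelope calculation. Below is a minimal Python sketch using the alpha-power law for drain current; the supply, threshold, shift, and exponent values are all assumptions for illustration, not foundry data.

```python
# Back-of-the-envelope look at how a threshold-voltage shift slows a gate.
# Uses the alpha-power law I_d ~ (Vgs - Vth)^alpha. All values are
# illustrative assumptions, not foundry data.

VDD = 0.75        # supply voltage in volts (assumed advanced-node value)
VTH_FRESH = 0.35  # nominal threshold voltage in volts (assumed)
DELTA_VTH = 0.05  # aging-induced threshold shift in volts (assumed)
ALPHA = 1.3       # velocity-saturation exponent, typically ~1.2-1.5

def relative_drive(vth: float) -> float:
    """Drive current as a function of overdrive, per the alpha-power law."""
    return max(VDD - vth, 0.0) ** ALPHA

i_fresh = relative_drive(VTH_FRESH)
i_aged = relative_drive(VTH_FRESH + DELTA_VTH)

# Gate delay scales roughly as C*V/I, so the delay ratio is the
# inverse of the current ratio.
slowdown = i_fresh / i_aged
print(f"Drive current loss: {(1 - i_aged / i_fresh) * 100:.1f}%")
print(f"Approximate gate slowdown: {(slowdown - 1) * 100:.1f}%")
```

With these assumed numbers, a 50mV shift costs roughly a sixth of the drive current, which is why timing must be budgeted for end-of-life rather than day-one silicon.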
The research community is aligned on the underlying mechanisms that cause HCI and NBTI, but there are different explanations for TDDB. This creates difficulties modeling it.
In addition, advanced technology nodes scale both device dimensions and voltage. “However, the voltage is not scaling as much as the physical dimension of the devices, which is causing an increase in the electric field that causes these effects,” notes Ramadan. “Some of them also are affected by temperature, such as NBTI. With high temperature added together with a negative bias on a PMOS device, the NBTI is significant. There is also PBTI, which can happen to an NMOS transistor.”
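A commonly used empirical form for BTI-induced threshold shift combines an Arrhenius temperature term, a power-law voltage acceleration, and a power law in time. The sketch below is illustrative only; the prefactor, activation energy, and exponents are placeholder assumptions, since real values are extracted from stress measurements for each specific process.

```python
import math

# Illustrative empirical NBTI model:
#   dVth(t) = A * exp(-Ea / (k*T)) * V^gamma * t^n
# All constants are placeholders; real values are extracted from
# stress measurements for a specific process.

K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

def nbti_dvth(v_gs: float, temp_k: float, t_hours: float,
              a: float = 0.5, ea: float = 0.1, gamma: float = 3.0,
              n: float = 0.2) -> float:
    """Threshold shift in volts after t_hours of stress (assumed model)."""
    return (a
            * math.exp(-ea / (K_BOLTZMANN * temp_k))
            * v_gs ** gamma
            * t_hours ** n)

# Same device, same stress time, two operating temperatures:
for temp_c in (25, 125):
    dvth = nbti_dvth(v_gs=0.75, temp_k=temp_c + 273.15, t_hours=10 * 8760)
    print(f"{temp_c:>3} C: dVth ~ {dvth * 1000:.1f} mV after 10 years")
```

Even with made-up constants, the structure of the model shows the point: the same device, run hotter, accumulates a substantially larger shift over the same lifetime.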
André Lange, group manager for quality and reliability at Fraunhofer IIS/EAS, sees multiple new challenges as we move to these nodes. “First, these technologies tend to be slightly less reliable than larger technology nodes. Second, current densities might rise and locally exceed critical values. Third, recent technology advances mainly target digital circuits, so analog design gets more and more complicated. Fourth, new applications, such as autonomous driving, will introduce completely new usage scenarios where they could see approximately 22 hours of operation per day compared to about 2 hours of operation today.”
The industry is still learning. “At advanced nodes, the challenges are that the technology is new, and we don’t have as good an understanding of them,” says Schaldenbrand. “Thus, predicting the device physics is a bit more of a challenge. We have done a lot of work modeling these devices, and we have seen that some characteristics that we saw at legacy nodes are now a little different.”
There is an additional problem. “Just because you apply a particular voltage for a particular amount of time to a particular transistor does not mean that it will automatically break,” warns Geada. “It just has pretty good odds of breaking. Partly it is quantum. You are dealing with really small geometries. You are dealing with a gate that is a molecule or two thick to begin with, and you are dealing with quantum effects. There is no getting around some of the randomness.”
Schaldenbrand agrees. “Some devices will age faster than others and you have to account for the statistical variation in the aging. It becomes more important to account for all of the sources of variation, not just electrical variation.”
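This is why dielectric breakdown in particular is handled statistically rather than deterministically. Time-to-breakdown is conventionally fit with a Weibull distribution, and a quick Monte Carlo sketch shows how a mission life translates into an expected failure fraction. The characteristic life and shape parameter here are made-up values, purely for illustration.

```python
import random

# TDDB is conventionally modeled with Weibull statistics: identical
# devices under identical stress break down at different times.
# eta (characteristic life) and beta (shape) below are made up.

ETA_YEARS = 15.0    # characteristic life: 63.2% failed by this time (assumed)
BETA = 1.5          # Weibull shape parameter (assumed)
MISSION_YEARS = 10.0
N_DEVICES = 100_000

random.seed(42)
failures = sum(
    1 for _ in range(N_DEVICES)
    if random.weibullvariate(ETA_YEARS, BETA) < MISSION_YEARS
)
print(f"Fraction failed within {MISSION_YEARS:.0f} years: "
      f"{failures / N_DEVICES:.1%}")
```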
Temperature is becoming a much larger issue. “All of these factors affect planar devices as well, but the effects are less pronounced,” adds Anand Thiruvengadam, senior staff product marketing manager at Synopsys. “With a planar device, you do not have to bother about self-heating. There are a lot of ways to dissipate heat with a planar device, but with finFETs that is not the case. The heat gets trapped, and there are few chances for it to be dissipated. This has an impact on the device itself, and also on the overlying metal.”
Wires
Dropping down a level, wires are the source of many problems related to aging. Wires do not scale as well as transistors, and at advanced nodes that leads to a host of issues related to resistance and capacitance.
One of the key impacts for aging is electromigration (EM), which is caused by the transport of materials within a conductor. “Electromigration is one of the issues that affects aging and has become very important since 16/14nm finFET,” says Mo Faisal, CEO for Movellus. “Now, with 7nm and 5nm the wires have become very skinny and they can be damaged over time as current flows through them.”
This can create big headaches throughout the design flow. “Physically, as the wires get smaller, the effects become more important and the margins are becoming smaller,” says Schaldenbrand. “So we are seeing a lot more demand for high accuracy in the analysis. At 28nm, plus or minus 30% might have been good enough. But as we get into advanced nodes, people want plus or minus 10% accuracy. The margins are shrinking, so people want more accurate predictions.”
All of this needs to be considered in context, too. “If I step away from aging and look at reliability, device self-heating is an important factor to consider even for electromigration,” adds Thiruvengadam. “At 7nm this is even more so and has essentially become a factor for signoff.”
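The classic starting point for electromigration lifetime is Black’s equation, which relates median time-to-failure to current density and temperature. A small sketch shows why self-heating matters for EM signoff: a modest rise in local temperature, or in current density, takes a large bite out of lifetime. The prefactor, current exponent, and activation energy below are illustrative placeholders, not values for any real metal stack.

```python
import math

# Black's equation for electromigration median time-to-failure:
#   MTTF = A * J**(-n) * exp(Ea / (k*T))
# A, n, and Ea below are illustrative; real values are extracted
# per metal stack and process.

K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

def em_mttf(j_rel: float, temp_k: float,
            a: float = 1.0, n: float = 2.0, ea: float = 0.9) -> float:
    """Relative MTTF for a normalized current density j_rel (assumed model)."""
    return a * j_rel ** (-n) * math.exp(ea / (K_BOLTZMANN * temp_k))

base = em_mttf(j_rel=1.0, temp_k=105 + 273.15)

# Effect of 20 degrees of extra self-heating at the same current density:
hot = em_mttf(j_rel=1.0, temp_k=125 + 273.15)
print(f"MTTF at +20C self-heating: {hot / base:.2f}x of baseline")

# Effect of 20% higher current density at the same temperature:
dense = em_mttf(j_rel=1.2, temp_k=105 + 273.15)
print(f"MTTF at 1.2x current density: {dense / base:.2f}x of baseline")
```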
Some of the same issues also affect memories. “To do a write requires injecting some charge through a gate into the underlying capacitor,” explains Geada. “Because it requires slightly more voltage than normal, it causes damage, and eventually you can’t clear it out. The fundamental cause of damage is traps getting embedded into the gate and acting as if there is a permanent voltage across the gate. That degrades the ability of the device to operate, whether that is clearing a charge or making a transition. It will not operate nearly as well as it did in its original, unstressed form.”
Variation
Process variation has become a persistent problem below 28nm, and it gets worse at each new node. It now has to be accounted for in multiple steps of the design flow, and for each specific design.
“Because we are trying to make accurate predictions across a long lifecycle, we have to think about how process variation will affect the lifetime,” says Schaldenbrand. “Some phenomena, such as hot carrier injection, where we see electrons being injected into the gate, are related to gate thickness, and gate thickness varies from device to device. You have to account for the statistical effects of process variation on aging.”
This requires a different mindset for design teams. “Realistically, just like for regular timing, we have to deal with variance as a first-order effect and make designs that can tolerate variance as opposed to try and engineer variance away,” adds Geada. “You can’t engineer variance away. The devices are too small. The effects are not controllable in that way. The same is true for aging. This is not something that the fab can make go away. It is an inherent property of the devices and the physics we are dealing with.”
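One practical consequence is that aging analysis becomes a statistical exercise. A minimal Monte Carlo sketch, assuming a made-up sensitivity of threshold shift to gate-oxide thickness, shows why the tail of the distribution matters more than the mean. Every number here is invented for illustration.

```python
import random
import statistics

# Monte Carlo sketch: process variation (gate-oxide thickness) feeds
# into aging sensitivity. The sensitivity model and all numbers are
# made up for illustration only.

random.seed(7)
N = 100_000
TOX_NOM = 1.0       # nominal gate-oxide thickness (normalized)
TOX_SIGMA = 0.05    # 5% sigma from process variation (assumed)
DVTH_NOM_MV = 40.0  # nominal 10-year threshold shift in mV (assumed)

shifts = []
for _ in range(N):
    tox = random.gauss(TOX_NOM, TOX_SIGMA)
    # Assumed sensitivity: thinner oxide means a stronger field and
    # faster aging; model the shift as scaling with (1/tox)**3.
    shifts.append(DVTH_NOM_MV * (TOX_NOM / tox) ** 3)

shifts.sort()
print(f"Mean 10-year shift : {statistics.fmean(shifts):.1f} mV")
print(f"99.9th percentile  : {shifts[int(0.999 * N)]:.1f} mV")
```

The design has to survive the 99.9th-percentile device, not the average one, which is exactly why variance must be tolerated rather than engineered away.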
The impact
Understanding the impact of aging requires separating analog and digital. Digital is the easier case.
“Consider a simple inverter,” says Movellus’ Faisal. “If the threshold voltage of the transistors within an inverter shifts by 50mV over 4 years, it will still invert. It will be slower than it was designed for, and that can be budgeted for. As delays grow, it may become a problem at some point. The faster the circuit, the more active the circuit, the faster it will age. However, even with a clock, you just have an edge, and the amplitude of the waveform has to be enough to trigger the circuit – basically Vdd/2. All of these things can be budgeted. If you are going to run a 1GHz clock and expect 10% degradation, I can design enough margin such that even if it degrades I am still within the specified speed range.”
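That budgeting argument is simple arithmetic: pick the delay degradation expected over the product’s life, and close timing against the end-of-life clock rather than the day-one clock. A minimal sketch using the 1GHz/10% example above:

```python
# Budgeting a clock for aging, per the 1 GHz / 10% example above.
# Design closure targets the end-of-life (slowest) condition.

F_TARGET_GHZ = 1.0     # shipped clock frequency
AGING_SLOWDOWN = 0.10  # expected end-of-life delay growth (assumed)

# Paths get 10% slower, so fresh silicon must close timing at a
# proportionally higher frequency for the aged part to still meet 1 GHz.
f_design_ghz = F_TARGET_GHZ * (1 + AGING_SLOWDOWN)
period_budget_ps = 1000.0 / f_design_ghz

print(f"Design-time frequency target : {f_design_ghz:.2f} GHz")
print(f"Max fresh-silicon path delay : {period_budget_ps:.0f} ps")
```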
This simplifies aging models. “The nice thing about digital is that currents only flow for a very limited amount of time,” says Geada. “So even though we have to be power-conscious, digital by comparison to analog is a lot more static. It has brief high-activity intervals, but then waits until the next cycle of the clock. Analog circuits are never on or off. They are always active, and they accumulate thermal stress differently. Analog has to deal with higher voltage swings and higher currents, which make the metal susceptible. It has to deal with a different collection of things, including thermal effects, because current is always flowing.”
Analog circuits age over time in a similar manner. “Analog devices typically drift in their performance characteristics,” says Schaldenbrand. “At the individual device level, they are more sensitive to aging. There might be some rare cases where you are worried about the gain changing, and if you design it properly, you can make designs that are relatively insensitive to those effects. There are things you can do in analog design to desensitize it to aging, but because you are directly dependent on device parameters, analog devices are more sensitive.”
But that can become very difficult to achieve. “Consider an op-amp, which is the building block for many things,” says Faisal. “An op-amp has to be biased correctly, and you have to build some margin into the overdrive voltage. Then you have to make sure that you have left enough margin, such that over time as the op-amp ages it will stay within the saturation region of the transistor. The overdrive margins for the transistors are shrinking because the supply voltage at 7nm is 750mV and the threshold is about 350mV, so there is barely any room to leave lots of margin. With aging, the threshold voltage can shift by as much as 50mV. If the op-amp bias circuit shifts by 50mV, it could go from the saturation region to the linear region or triode, where a transistor becomes a resistor and no longer has gain. The function of an op-amp is to provide gain, so that is pretty drastic. At that point the circuit is rendered useless.”
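Faisal’s numbers can be checked directly. With a 750mV supply and a 350mV threshold, a bias network that drifts by 50mV can push a device in the stack out of saturation. The bias-point values in this sketch are assumptions for illustration; only the 50mV drift and the saturation condition itself come from the discussion above.

```python
# Headroom check for a device in a biased stack, using the numbers
# quoted above. If the bias network drifts by 50 mV, the drain-source
# voltage available to a device in the stack shrinks by that amount.
# The bias-point values are assumed for illustration.

VOV_MV = 150.0        # overdrive the device is biased at (assumed)
VDS_FRESH_MV = 180.0  # Vds available at the bias point (assumed)
BIAS_DRIFT_MV = 50.0  # aging-induced shift in the bias network

for label, vds in (("fresh", VDS_FRESH_MV),
                   ("aged ", VDS_FRESH_MV - BIAS_DRIFT_MV)):
    # Square-law saturation condition: Vds >= Vov.
    in_sat = vds >= VOV_MV
    region = "saturation" if in_sat else "linear/triode (gain is lost)"
    print(f"{label}: Vds = {vds:.0f} mV, Vov = {VOV_MV:.0f} mV -> {region}")
```

With these assumed values the fresh part has 30mV of headroom, and the 50mV drift consumes all of it, which is exactly the saturation-to-triode failure Faisal describes.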
Analog design was difficult to begin with. “Aging and reliability are challenges for analog designers,” says Ramadan. “Designs of today might not operate the same tomorrow because of the degradation that can happen to these designs. You have to make sure that all of the aging and reliability requirements are fulfilled.”
Related Stories
Chip Aging Accelerates
As advanced-node chips are added into cars, and usage models shift inside of data centers, new questions surface about reliability.
Transistor Aging Intensifies At 10/7nm And Below
Device degradation becomes limiting factor in IC scaling, and a significant challenge in advanced SoCs.
Will Self-Heating Stop FinFETs?
Central fins can be up to 50% hotter than other fins, causing inconsistent threshold behavior and reliability problems.
Thank you for this excellent article.
This is a very timely article.
My biggest gripe right now is that the reliability models are very limited. They generally only consider a device being operated in digital modes, but analog operating conditions vary dramatically, with VGS not equal to VGD or VDS. Aging is affected accordingly, but it is rarely taken into account.
Thanks Stephen. There will be a second part to this article and it will discuss some of the problems getting good reliability models.
In 2011, Synopsys provided a comprehensive treatment of the major device aging mechanisms, including hot-carrier injection (HCI), negative bias-temperature instability (NBTI) for p-channel MOSFETs, and positive bias-temperature instability (PBTI) for n-channel devices, although it was primarily focused on aging models for 28nm MOSFETs (https://www.sciencedirect.com/science/article/pii/S0026271411005221).
For high-power (and often high-speed) MOSFET devices used in automotive, aerospace/avionic, communications, etc., the main design-for-reliability (DFR) concern should be about the electro-thermal and/or thermo-mechanical aging of metallic parts within these devices. For example, such aging phenomena lead to excessive metallization degradation at wire (bonding) pads or between TSV interfaces, delamination within and/or cracking of die-attachment structures, etc.
Aging also makes the presence of thermal noise and its impact on the system’s signal integrity (e.g. kT/C noise for mixed-signal or switched-capacitor circuits) more pronounced over time.
To derive a more systematic approach to device-aging prognostics for automotive applications, one may consider referring to this NASA playbook (https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20170011538.pdf).
Thanks Michael. Those are some good resources.
Thanks for writing this article, Brian.
“…they could see approximately 22 hours of operation per day compared to about 2 hours of operation today.” This will be a real push for the industry to face the issues rather than playing around with the numbers.
Yes – when you change anything by an order of magnitude, it can mean that a problem needs to be rethought.
Wow, this article covers most of the issues in semiconductor reliability.
It is a terribly annoying problem that NBTI and PBTI mostly show up in highly accelerated voltage tests, such as bathtub aging.
I have also suffered from EM problems, and built my own verification tool that works with Synopsys StarRC and Cadence Virtuoso.