
Dealing With Device Aging At Advanced Nodes

Gaps remain, but understanding about how circuits age and what to do about it is improving.


Premature aging of circuits is becoming troublesome at advanced nodes, where it increasingly is complicated by new market demands, more stress from heat, and tighter tolerances due to increased density and thinner dielectrics.

In the past, aging and stress largely were separate challenges. Those lines are starting to blur for a number of reasons. Among them:

  • In automotive, advanced-node chips are being used in extreme environments and expected to function reliably for up to 18 years. Likewise, in carrier-class equipment, chips are being subjected to heat and cold and expected to last at least a decade or two.
  • Margin, which has been used to buffer some of the causes of aging and stress, is shrinking in many designs at 7/5/3nm due to relentless demands for lower power and higher performance.
  • More designs are becoming heterogeneous, raising the possibility that not everything in a multi-chip package will age at the same rate, particularly under different use cases and applications.

If design teams are lucky, problems will show up early in production and they can be fixed before chips are released to the market. But more often than not, issues will take years to materialize as circuits age and something eventually breaks down.

“Everything has to line up correctly to even begin to see this, and it’s hard to duplicate it on a tester because the moment you take the chip out, re-solder it, and put into the tester, the problem goes away,” said Mo Tamjidi, president of Dolphin Technology. “The problem is temporary. It has to run for 7, 8, or 10 years before you see it. Imagine it’s something you did 15 years ago in a field in northern China, on some tower, in a very cold place, running at high voltage, running at the maximum possible speed. Even if you have a little bit of a timing margin, it will work because it just slows down, and if you have some room to fluctuate you’ll never see this. You only see this when you have a very tight timing budget, in a very cold environment at max speed, and at very high voltage.”

There are nearly unlimited causes for failures over time. Frequently it starts with one component, such as an IP block or a transistor, that runs out of spec after repeated use. This can be due to a latent defect that wasn’t caught during the inspection or test phases. Or, it can be due to excessive wear and tear from constant switching due to a unique use case, or dielectric breakdown that results in leakage and thermal damage. Failures also can be caused by various types of stress, either mechanical as in the case of a solder ball in the corner of a chip caused by excessive vibration, or by a variety of electrical stresses.

Mechanical stresses can result in physical shorts or signal interruptions. Electrical stresses, meanwhile, can impact anything from drain-source voltage to saturation current and threshold voltage, eventually causing switching delays that fall outside of spec tolerance, according to Ashok Alagappan, lead consulting engineer at Ansys. As dynamic power density increases, and as leakage rises, any or all of these can affect reliability, and the likelihood that something will go wrong increases with time and use.

Another complication is that while overstress accelerates aging, even normal operation can result in device degradation.

“Depending on the type of stress, the gate oxide may fail and rupture, or key device parameters, for example, threshold voltage, may shift so much the design no longer operates. The challenge is that there are several mechanisms that can result in device degradation. The result is that different types of overstress can result in different types of aging,” said Art Schaldenbrand, senior product marketing manager in the Custom IC and PCB Group at Cadence.

What is reliability?
Reliability is a relative term. A chip needs to perform as expected throughout its lifetime. The most accurate way to measure that is through in-circuit sensors, but use cases and aging also need to be modeled up front and simulated and tested in the context of a system.

“We can use our mobile phone for 10 years, but we cannot measure it in the laboratory over 10 years,” said Meng Duan, senior TCAD application engineer at Synopsys. However, methods do exist to accelerate the degradation through enhanced burn-in conditions, for example, to measure degradation over a much shorter time span, such as 1,000 seconds (16.7 minutes). “Modeling is the bridge to link the overstress with the real use condition,” he said.
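
The bridge between a short, accelerated measurement and real use conditions can be illustrated with a temperature acceleration factor. The sketch below uses the Arrhenius relation, which is a common textbook model and not necessarily the one used in the tools discussed here. The activation energy (0.7 eV) and the use and stress temperatures are illustrative assumptions, not values from the article.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c, t_stress_c, ea_ev):
    """Acceleration factor of a stress temperature relative to a use temperature.

    ea_ev is the activation energy in eV. It is mechanism-specific and is an
    assumed input here, not a value taken from the article.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


def stress_hours_for_lifetime(lifetime_years, acceleration_factor):
    """Stress hours that correspond to the given field lifetime under this model."""
    field_hours = lifetime_years * 365.25 * 24
    return field_hours / acceleration_factor


if __name__ == "__main__":
    # Assumed example: 55 C in the field, 125 C in burn-in, Ea = 0.7 eV.
    af = arrhenius_acceleration(55.0, 125.0, 0.7)
    hours = stress_hours_for_lifetime(10.0, af)
    print(f"acceleration factor: {af:.1f}")
    print(f"stress hours equivalent to 10 years: {hours:.0f}")
```

A single activation energy covers only one degradation mechanism, which is one reason the modeling step matters more than the arithmetic.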

Building an accurate model is essential, and TCAD tools often are used for that. Then, at the circuit level, burn-in stresses are applied to sort out poorly performing chips and reduce the fault rate in the field. In effect, circuits are stressed beyond what they normally would experience in real-world use.


Fig. 1: Modeling behavior to predict reliability. Source: Synopsys

“In reliability measurements, overstress is used to accelerate aging,” said André Lange, group manager for quality and reliability in Fraunhofer IIS’s Engineering of Adaptive Systems Division. “This has to be done to achieve reasonable measurement times. In applications, overstress can lead to sudden failures, but this is beyond aging. Combining overstress and aging, the amount of overstress and its duration need to be carefully examined — preferably based on simulations. A detailed investigation of stress waveforms with respect to safe operating areas (SOA) or aging simulations is a feasible approach to predict the impact of overstress on aging with a high level of confidence.”

Overstress can occur in the field after months or years of use, as well, and lead to a device failure. “Hot carrier injection, predominant for the N-type MOSFET transistors, is caused by electrons near the drain region experiencing impact ionization under high lateral electric field, gaining high kinetic energy that enables them to surmount the Si-SiO2 barrier and get injected into the gate oxide, generating interface or oxide defects,” said Ahmed Ramadan, senior product engineering manager at Mentor, a Siemens Business. “That results in threshold voltage, transconductance, and other characteristics degrading over time. The same will take place for holes in a P-type MOSFET.”

Negative bias temperature instability (NBTI), predominant in P-type MOSFET transistors, is another issue in aging. When a P-type MOSFET transistor gate voltage is negatively biased for a long time under high temperature, the Si-H bonds break along the Si-SiO2 interface, causing the generation of interface traps. These interface traps cause threshold voltage to increase and channel carrier mobility to degrade.

Process variation adds yet another cause of failures in chips, creating new and sometimes unexpected causes of stress. Any part of a manufacturing process can be affected by variation, from exposure on a photomask to irregularities in polishing a wafer. As dimensions shrink at each new node, tolerances to variation shrink proportionately. An irregularity or latent defect that might not cause problems at 28nm could push a chip well outside of spec at 3nm. It also can result in stress to other components, accelerating aging in both predictable and unexpected ways.

“In IC design, this can be considered by combining Monte Carlo (MC) and aging simulations to simultaneously investigate the impacts of process variations and stress-specific shifts in device and circuit performance,” Lange said.
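
A toy version of combining Monte Carlo sampling with an aging model might look like the sketch below. It is a minimal illustration, not a foundry model: the power-law threshold shift, the time exponent of 0.2, the distributions, and the 0.40 V limit are all assumptions chosen for the example.

```python
import random


def aged_vth_samples(n_samples, years, seed=0,
                     vth_mean=0.35, vth_sigma=0.02,
                     a_median=0.015, a_sigma=0.3, n_exp=0.2):
    """Sample threshold voltages (V) after `years` of stress.

    Process variation: fresh Vth drawn from a normal distribution.
    Aging: delta_Vth = A * years**n_exp, with a lognormal prefactor A so the
    shift is always positive. All parameter values are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed keeps the device population repeatable
    samples = []
    for _ in range(n_samples):
        vth_fresh = rng.gauss(vth_mean, vth_sigma)
        prefactor = a_median * rng.lognormvariate(0.0, a_sigma)
        samples.append(vth_fresh + prefactor * years ** n_exp)
    return samples


def fail_fraction(samples, vth_limit):
    """Fraction of samples whose threshold voltage exceeds the allowed limit."""
    return sum(v > vth_limit for v in samples) / len(samples)


if __name__ == "__main__":
    for years in (0, 1, 5, 10):
        frac = fail_fraction(aged_vth_samples(5000, years), vth_limit=0.40)
        print(f"{years:>2} years: {frac:.1%} of samples above limit")
```

Because the same seed is reused for every time point, each run ages the same virtual population of devices, so the change in the out-of-limit fraction reflects aging rather than sampling noise.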

Models help
One thing that does help is that models from the foundries/fabs are more readily available than in the past. For each of the anticipated degradation mechanisms there are models, but generally each model focuses on only one mechanism. Duan noted, however, that the framework for these models can be the same or similar. So while aging models themselves may only capture voltage, temperature or current, they can be combined under the same framework.

This helps because as tolerances tighten with each new node, one degradation mechanism can affect one or more other mechanisms. But it also can cause problems in other parts of a chip or system that themselves are not the result of aging. “A relationship could be established between the conducted EMI of a circuit based on an aged transistor model,” said Ansys’ Alagappan. “In other words, component aging can degrade the physical parameters, resulting in increased sensitivity to EMI. Technology scaling does result in increased sensitivity to electromagnetic interferences of circuits due to aging effects.”

And this is where modeling becomes complicated, because not everything is covered under one umbrella. “Today, different physical mechanisms are usually modeled separately,” Fraunhofer’s Lange said. “Corner and statistical models capture process variations while degradation models capture the impact of degradation mechanisms for transistors, mainly hot carrier injection (HCI) and bias temperature instability (BTI). Especially from a characterization effort perspective, statistical and degradation models will remain separate, and I assume that this will also hold for other factors.”

Mentor’s Ramadan agreed. He said researchers, foundries, IDMs and design houses have been developing aging models to account for the key aging mechanisms impacting MOSFETs. Those include HCI, BTI and time-dependent dielectric breakdown (TDDB). They also are introducing models for interconnect aging mechanisms such as electromigration (EM).

He noted that the Si2 Compact Modeling Coalition (CMC) introduced a standard aging model in 2018 to address the key MOSFET degradation mechanisms. This Open Model Interface (OMI) serves as a platform enabling model developers/designers to use their aging models across different simulators.

That’s a starting point. But aging also is foundry and process-specific, which makes things far more complicated, said Frank Ferro, senior director of product marketing for IP cores at Rambus. “You definitely have to have the correct models, and each application has different requirements. With many of today’s 2.5D systems going into networking applications, they have long life and high reliability requirements, so we have to have the right models to support those types of applications.”

This is particularly challenging with heterogeneous designs. Saman Sadr, vice president of product marketing for IP cores at Rambus, noted that every foundry has their unique way of stacking up dies, and then going about the 2.5D or 3D technology between the die and the interposer.

“That’s actually where the biggest struggle is, because these cannot be easily mixed and matched,” said Sadr. “What it means from an IP interface provider or a chip provider point of view is that you want to have a robust solution that works with these. Today, one complication is that, yes, you have to rely on the models that you’re getting from one foundry. From the IP design perspective, we want to make sure that we’re covering all applications. We have to put enough configurability and flexibility without compromising the power and area in the IP design, so that if one foundry had a little bit more resistive solution versus capacitive solution, or the microbumps had more capacitance moving from one silicon to interposer layer, it can be managed. That has implications on the IP design, so we have configurability and margins built into the design in a creative way so that it doesn’t burden the design.”

Design techniques to mitigate stress
From the device point of view, many of the reliability issues are linked to various dielectric layers. But as these layers become thinner, a single contaminant in one or more materials can have a much bigger impact.

One way to deal with this is to add monitors into devices. Those can take measurements and compare them with a reference model.

“Adaptive voltage scaling schemes can use this information to adjust the supply to maintain performance at the required level,” said Ramsay Allen, marketing manager at Moortec. “The information can also support predictive maintenance and avoid costly unplanned downtime. Aging is very complex and very dependent on use case and environment. In many applications, neither of those is always well-known, and can vary over time itself. If we take the smartphone as an example, there will be modes where it is doing very little – where the clock frequency is low, the voltage supply is low. At the other extreme it might be playing HD video for extended periods on a hot summer day. The clock will be run at high rates and the supply will be correspondingly high to support the high clock rates. Obviously if you took that device and left it in the low power state, it would age at a significantly lower rate than if you left it in the high power state.”

The problem is that at design time, that ratio isn’t always obvious. “This example is actually already a simplified case because more often than not there will be more than two states, so you have to make assumptions about time spent in each state and build margins in to cope with the unknowns,” Allen said. “By allowing the system to monitor that aging, then potentially you can optimize DVFS schemes, you can predict lifetime, or perhaps even rein in certain modes to ensure that a particular lifetime is met. Another example is bitcoin mining. This is at the other end of the scale, where devices are manufactured to sit in large arrays. Each chip will vary with process and they will age differently partly as a result of process variation, and partly because their loads won’t always be equal. If you can monitor all those conditions, then you can optimize each of those chips to run at peak performance.”
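
The assumption-making Allen describes, estimating time spent in each operating state, can be sketched as a weighted sum. This is a deliberately simplified illustration: the state names, time fractions, and relative aging rates below are hypothetical, and real aging is not necessarily linear in time or independent across states.

```python
from dataclasses import dataclass


@dataclass
class OperatingState:
    name: str
    time_fraction: float        # share of lifetime spent in this state (0..1)
    relative_aging_rate: float  # aging rate relative to the worst-case state (1.0)


def effective_aging_years(states, calendar_years):
    """Years of worst-case-equivalent aging accumulated over `calendar_years`.

    Assumes aging adds linearly across states, which is a simplification.
    """
    total = sum(s.time_fraction for s in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1.0")
    weighted_rate = sum(s.time_fraction * s.relative_aging_rate for s in states)
    return calendar_years * weighted_rate


if __name__ == "__main__":
    # Hypothetical smartphone profile: mostly idle, occasionally playing video.
    profile = [
        OperatingState("low-power idle", 0.9, 0.1),
        OperatingState("HD video, hot day", 0.1, 1.0),
    ]
    print(effective_aging_years(profile, calendar_years=10.0))
```

If the assumed fractions are wrong, the estimate is wrong by the same proportion, which is the case for replacing design-time guesses with in-chip monitors.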

Others have similar views. “People also try to optimize the device structure to reduce the degradation,” Duan said. “In the past decades, hot carrier degradation was a key reliability issue, and light-doped drain (LDD) process had been adopted to reduce HCD. But nowadays, HCD has reappeared as a dominant reliability issue in sub-28nm. The mechanism is also evolving and becomes more complicated compared with the past. It now contains both components from NBTI and hot carrier, instead of pure hot carrier. These issues bring more challenges to researchers, and so far we have not found an effective way to completely remove these degradation effects. From an application point of view, optimization of ambient temperature and Vdd can be another way to mitigate aging, but it’s at the price of performance.”

Mentor’s Ramadan agreed. “Mitigation by guard-banding, through reducing the operating frequency or increasing the supply voltage, are implemented at the expense of performance and power. Techniques to reduce the NBTI impact were implemented by putting PMOS devices into the recovery mode, others by using dynamic Vdd and threshold voltage tuning. It is crucial to accurately model all aging mechanisms and run aging simulation to study which mitigation techniques are most effective.”

The responsibility for providing coverage for aging failures belongs to IC designers, Alagappan said. It’s up to them to identify stress conditions on high-risk circuits, identify critical transistors prior to manufacturing, and verify functionality using proper stimulus conditions.

“Specific design techniques depend on the type of circuit, but typically it involves using appropriate aging device models for full-chip simulation, and having a simulation methodology in place for evaluating design margins against worst-case operating specs after aging assessments during all stages of the design validation process,” Alagappan said. “Some design techniques include adjusting the device dimensions so the transistors do not age. At a circuit level, care must be taken to ensure the variations in voltage distribution and the operating temperatures do not result in cumulative degradation effects on the individual transistors.”

The mitigation technique is dependent on the type of stress that is causing degradation, Schaldenbrand noted. “For example, adding a cascode transistor to reduce the drain-source voltage is often used in analog gain stages. However, if the source of degradation is the gate-oxide overstress, then using a cascode won’t help.”

Reducing stress strength also is a very effective way to reduce aging effects, so periodically turning off device operation will help it recover from reliability impacts, explained Jushan Xie, senior software architect at Cadence. “Temperature impacts the speed or rate of reliability degradation, so controlling device operating temperature can mitigate reliability effects.”

Still, aging has never been so challenging. In the data center, the lifetime of a circuit is estimated to be three years before being replaced, while in antennas and carrier-grade equipment, it’s expected to last 5 to 10 years because it’s costly to exchange or repair those, Sadr said.

One consideration here is how to stress test them enough to mimic that operation. In the past, this was done by putting chips in ovens or subjecting them to vibration, but those techniques are no longer possible at advanced nodes.

“One of the ways to stress the device to mimic that 10-year effect is while the device power rating or supply rating was the same (design for +/- 10%), they stress it with something at about 50% to 70% higher than what it was designed for, which means that functionally you have to be able to sustain that. They want to stress it in weeks or days rather than years,” Sadr said. “However, a lot of circuits were not even functionally designed to handle a signal that was applied to it at 70%. Another angle that was not well understood years ago is that everyone assumed that those scenarios of aging will happen when you’re actually using a device, and it turns out that a lot of devices that have long life in terms of their application haven’t been forward-looking use cases. Let’s say you had a switch or router that had 10 ports. Maybe today only one port was being deployed, and the other nine ports were switched off. They were expecting that in five years they would deploy the entire 10 ports of that system. It turned out that parking a design some in ‘off,’ some in ‘on,’ those designs that have been turned off or parked, the way they’re powered down, the signals are often differential. The concept of ‘on’ or ‘off’ or a differential signal is that you park one side of the design at ‘wide’ or ‘high,’ and you park the other side ‘low.’ It was not simulated to be for that particular scenario that I’ve designed something, and it was meant to be alternating constantly, but now I’m parking it one side ‘high.’ It’s almost like a mechanism that was intended to be balanced, or to hang it one side ‘high,’ one side ‘low,’ maintain it for 10 years, and then try in year nine to deploy it.”
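
The arithmetic behind compressing years into weeks or days with a higher supply can be sketched with an exponential voltage acceleration model. This is one commonly used form, assumed here for illustration and not attributed to Sadr or Rambus. The nominal supply (0.8 V) and the voltage acceleration coefficient `gamma_per_volt` are hypothetical values.

```python
import math


def voltage_acceleration(v_use, v_stress, gamma_per_volt):
    """Acceleration factor under an assumed exponential voltage model:
    AF = exp(gamma * (V_stress - V_use)).
    """
    return math.exp(gamma_per_volt * (v_stress - v_use))


def stress_days_for_lifetime(lifetime_years, acceleration_factor):
    """Days of stress equivalent to the given field lifetime under this model."""
    return lifetime_years * 365.25 / acceleration_factor


if __name__ == "__main__":
    v_nominal = 0.8           # hypothetical nominal supply in volts
    gamma = 12.0              # hypothetical coefficient in 1/V
    for overdrive in (0.5, 0.6, 0.7):
        v_stress = v_nominal * (1.0 + overdrive)
        af = voltage_acceleration(v_nominal, v_stress, gamma)
        days = stress_days_for_lifetime(10.0, af)
        print(f"+{overdrive:.0%} supply: about {days:.1f} stress days for 10 years")
```

The model only holds if the circuit still functions, and fails by the same mechanism, at the elevated voltage, which is the limitation Sadr points to.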

Avoiding problems with stress and premature aging is very complicated, Dolphin’s Tamjidi said. “It’s almost like tightrope walking. You’re trying to balance. If some voltage is going up too much, you have to find something that can dampen it by coupling or otherwise. It all has to be tuned, almost in tandem, so that the stress gets limited. It is an iterative process because you increase one thing, then something else changes. You change another thing, and something else changes. You have to run hundreds of different layout changes, extract it, make the change, run it again. It didn’t work? Do it again. Thicken this line. Thin that one. Put this one next to this, put that one there. You have to go through hundreds of iterations to come up with one version that works. And now for 5nm and 7nm, for every I/O you have to do two versions because the horizontal and verticals are different. You have to go through the exercise twice to make sure that both meet what you want them to.”

While the models can’t precisely and accurately reflect those results, they can point in the right direction, he said. “What is needed is good judgment, some device physics, and understanding of it. The models try to be very conservative, and even though they are incorrect, they push you in the right direction so if you address it using the model, you still suppress the problem. The models basically say, ‘Don’t push this voltage up because you’re aging.’ So you bring it down, and the models are so conservative that once they say you’ll make 10 years, you’ve probably made 20 or 30. They can’t predict if NMOS will die first or PMOS, but because there is so much safety margin built into the models, just meeting them should resolve the problem.”



