Understanding aging factors within a design can help reduce the likelihood of product failures.
Aging kills semiconductors, and it is a growing problem for an increasing number of semiconductor applications, especially as they migrate to more advanced nodes. Additional analysis and prevention methods are becoming necessary for safety-critical applications.
While some aspects of aging can be mitigated up front, others are tied to the operation of the device. What can an engineering team do to understand where their designs may have aging problems and what can be done to reduce their impact? Some of the answers are counterintuitive.
It helps to put the problem into context. “At 7nm, you are working at 0.5V,” states João Geada, chief technologist for ANSYS. “You don’t think about that as being a large voltage. It feels as if this could not harm a fly. But the dielectric is about 1nm thick, and you are applying 0.5V across 1nm. That is 500MV/m, and it is enough to arc on a high-power line, which means that you are applying an enormous amount of stress.”
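The arithmetic behind that comparison is easy to check. The sketch below uses the 0.5V and 1nm figures from the quote and treats the dielectric as a uniform parallel-plate field. The roughly 3MV/m breakdown field of dry air is a common textbook approximation added here only for scale, and the function and constant names are illustrative.

```python
def field_strength_v_per_m(voltage_v: float, thickness_m: float) -> float:
    """Uniform-field approximation: E = V / d."""
    if thickness_m <= 0:
        raise ValueError("thickness must be positive")
    return voltage_v / thickness_m


# Assumption: approximate breakdown field of dry air, used only as a yardstick.
AIR_BREAKDOWN_V_PER_M = 3e6

# Figures from the quote: 0.5 V across a dielectric about 1 nm thick.
gate_field = field_strength_v_per_m(0.5, 1e-9)

print(f"gate dielectric field: {gate_field:.1e} V/m")
print(f"ratio to air breakdown: {gate_field / AIR_BREAKDOWN_V_PER_M:.0f}x")
```

The result is on the order of 5×10^8 V/m, more than a hundred times the field at which air breaks down.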
Risk areas
Understanding aging mechanisms can focus your areas of concern. For example, negative bias temperature instability (NBTI) is a problem when you have a charge that isn’t moving. “Switching is not the problem,” explains Geada. “If you remove the stress, a circuit recovers over time as charges leak back out of the dielectric. However, if you leave applied stress long enough, the damage becomes permanent. In conventional designs, we call that clock gating. Clock gating is a wonderful technique for power saving, but it comes with a reliability side effect.”
When part of the design that is clock gated wakes up after being turned off for a period of time, all of the stresses that were embedded from a long period of non-activity with the applied voltages are released instantaneously. After a few cycles of voltages moving up and down through switching, the stress is relieved a bit.
Designers can focus their efforts around this. “A designer knows, based on the switching characteristics of the device, which gates will be impacted,” says Anand Thiruvengadam, senior staff product marketing manager at Synopsys. “Similarly, with transistor sizes or dimensions, and the current going through the devices, they have a sense for which is more vulnerable from a self-heating point of view.”
While such rules of thumb may have been sufficient in the past, that approach is no longer good enough. “With the advent of safety-critical applications like automotive and space, the reliability requirements are more stringent,” adds Thiruvengadam. “The onus is on the designer to ensure that their designs are reliable across a wide range of operating conditions and for an extended period of time. They do not have the luxury of judiciously looking at certain devices. Their strategy cannot be judicious. They have to have a broader approach, and it needs to be across the entire design.”
Basic analysis
You cannot fix a problem until you understand it. “Front-end-of-line (FEOL) degradation can be analyzed by aging simulations,” says André Lange, group manager for quality and reliability at Fraunhofer IIS/EAS. “Back-end-of-line (BEOL) degradation can be analyzed using EMIR (electromigration and IR drop) tools. However, additional approaches will have to be established.”
The first stage of aging analysis is basically safe operating checks. “The concept goes back to checking that the chip does not blow up,” says Art Schaldenbrand, senior product manager at Cadence. “You consider safe operating area checks to make sure devices do not get into breakdown regions. That technology has been extended, and people use it for reliability analysis. They will check that the devices do not exceed a certain stress level so that it should be good for the operating lifetime.”
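A safe operating area check of this kind can be sketched in a few lines. This is only an illustration, not how any particular simulator implements it: the limits, the waveform format and the names `SoaLimits` and `soa_violations` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoaLimits:
    """Hypothetical per-device stress limits, in volts."""
    max_vgs: float
    max_vds: float


def soa_violations(waveform, limits):
    """Return (time, terminal, value) for every sample that exceeds a limit.

    `waveform` is an iterable of (time_s, vgs, vds) samples, for example
    exported from a transient simulation.
    """
    violations = []
    for time_s, vgs, vds in waveform:
        if abs(vgs) > limits.max_vgs:
            violations.append((time_s, "vgs", vgs))
        if abs(vds) > limits.max_vds:
            violations.append((time_s, "vds", vds))
    return violations


samples = [(0.0, 0.5, 0.4), (1e-9, 0.9, 0.5), (2e-9, 0.3, 1.2)]
print(soa_violations(samples, SoaLimits(max_vgs=0.8, max_vds=1.0)))
```

A device that never appears in the violation list stayed inside its limits for the simulated stimulus, which is only as representative as the stimulus itself.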
Stress is a key part of this equation. “You cannot actually measure the aging of a device over five years by putting it under normal voltage condition and observing what happens,” says Ahmed Ramadan, senior product engineering manager from the AMS group of Mentor, a Siemens Business. “So you put the device under increased stress conditions by applying a higher voltage to the device and see the amount of degradation that will happen and try to extrapolate saying that a higher voltage for a shorter period of time can be extrapolated to nominal voltage applied for a longer period of time.”
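The extrapolation Ramadan describes can be illustrated with an acceleration factor. The sketch below assumes a simple exponential voltage-acceleration law, which is one of several forms used in reliability work and is not attributed to any vendor quoted here. The constant `gamma` is a hypothetical fitted parameter that would have to come from measured data.

```python
import math


def voltage_acceleration_factor(v_stress: float, v_nominal: float, gamma: float) -> float:
    """Assumed model: AF = exp(gamma * (V_stress - V_nominal)), gamma in 1/V."""
    return math.exp(gamma * (v_stress - v_nominal))


def equivalent_nominal_hours(stress_hours: float, v_stress: float,
                             v_nominal: float, gamma: float) -> float:
    """Hours at nominal voltage that the stress test is assumed to represent."""
    return stress_hours * voltage_acceleration_factor(v_stress, v_nominal, gamma)


# Illustrative numbers only: 100 hours at 1.0 V standing in for 0.5 V operation.
hours = equivalent_nominal_hours(100.0, v_stress=1.0, v_nominal=0.5, gamma=12.0)
print(f"equivalent nominal operation: {hours / (24 * 365):.1f} years")
```

The answer is extremely sensitive to `gamma`, which is why the quality of the foundry's characterization matters so much.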
This implies that you need stress models for these physical effects. The problem is that there are no standard models that can describe any of these mechanisms.
“Fabs calibrate their findings with measured data, then they try and make the models fit the data and provide those models as part of an EDA solution,” Ramadan says. “But not all design houses have the luxury of getting the models from the foundry or developing their own models and doing their own characterization. So some of them will have to overdesign, which means that they will put in guard-bands.”
But at advanced nodes, guard-banding has an adverse impact on power and performance. “When people are really concerned about a block and want to do more detailed analysis, they will do aging analysis of the design where they apply electrical stress and will look at how the device characteristics change after the stress has been applied,” adds Schaldenbrand. “We are working on making that stress simulation more accurate.”
That is a non-trivial task. “Aging is state-dependent,” says Geada. “The amount of degradation that each transistor sees is dependent on the stress pattern that it has observed over its lifetime. So the real question becomes how do you simulate the worst-case stress pattern?”
Another problem involves when this analysis is performed. “Aging analysis is done at the end of the design, and as with any other tool when you find a problem at the sign-off stage, it means problems have a large impact on the design cycle,” says Schaldenbrand. “People would like tools that enable them to look at these issues earlier in the design. Based on your aging scenarios, you generate worst-case corners for aging and include that in the design cycle. That should correspond to the worst-case electrical stress. From the stress we can calculate the change in the device characteristics at the end of lifetime.”
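Picking a worst-case aging corner from a set of scenarios might look like the sketch below. The degradation model is an assumption: an empirical form with exponential voltage dependence, Arrhenius temperature dependence and a power law in stressed time. All coefficients are placeholders, not foundry data, and the names are hypothetical.

```python
import math
from dataclasses import dataclass

BOLTZMANN_EV_PER_K = 8.617333262e-5


@dataclass(frozen=True)
class AgingScenario:
    name: str
    vdd: float          # supply voltage in volts
    temp_k: float       # junction temperature in kelvin
    stress_duty: float  # fraction of lifetime spent under stress, 0..1


def delta_vth(scenario, lifetime_s, a=1e-2, gamma=3.0, ea_ev=0.1, n=0.2):
    """Assumed end-of-life threshold shift (volts); coefficients are placeholders."""
    stressed_s = scenario.stress_duty * lifetime_s
    thermal = math.exp(-ea_ev / (BOLTZMANN_EV_PER_K * scenario.temp_k))
    return a * math.exp(gamma * scenario.vdd) * thermal * stressed_s ** n


def worst_case_corner(scenarios, lifetime_s, **model):
    """Return the scenario with the largest modeled shift at end of life."""
    return max(scenarios, key=lambda s: delta_vth(s, lifetime_s, **model))


TEN_YEARS_S = 10 * 365 * 24 * 3600
scenarios = [
    AgingScenario("typical", vdd=0.50, temp_k=330.0, stress_duty=0.3),
    AgingScenario("hot_high_v", vdd=0.55, temp_k=398.0, stress_duty=0.9),
    AgingScenario("mostly_idle", vdd=0.50, temp_k=300.0, stress_duty=0.05),
]
print(worst_case_corner(scenarios, TEN_YEARS_S).name)
```

With a monotonic model like this one the hottest, highest-voltage, most heavily stressed scenario always wins. Real models may not be monotonic in every variable, which is part of why the worst-case stress pattern is hard to identify.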
Advancing analysis
Being able to perform that kind of analysis may take some time. “Static timing and Monte Carlo simulation, alongside electromigration analysis, provide detailed information on the conditions under which a circuit may fail for given supply and temperature conditions,” says Stephen Crosher, CEO for Moortec. “But designing for the worst-case is problematic. In physical circuitry terms, it is statistically rare that a device will leave the fab at a full worst-case condition.”
Still, there are two bottlenecks that must be overcome. “Basic transistor degradation models or electromigration models are no longer sufficient and need to be enhanced,” says Fraunhofer’s Lange. “Second, inter-dependencies of multiple physical effects, such as combinations of thermal, mechanical and electrical effects, gain importance and will have to be taken into account in the near future in order to meet the quality and reliability targets of IC-based products.”
For example, most of the circuitry on any given chip will be operating at, or even better than, nominal parameters. “However, achievable performance will be limited by circuitry that is closer to worst-case conditions,” adds Crosher. “The overall reliability of the device may be affected by circuitry that operates closer to the best-case corners, leading to issues with setup and hold timing. Designers can account for these problems by designing additional checks into the circuitry, but this creates an overhead, eating into available die area and the energy efficiency of the design.”
It gets complicated very quickly. “For analog circuits, we are adding the ability to do circuit aging,” says Schaldenbrand. “As their characteristics change, it accelerates the aging of the device, so we have added the ability to add incrementally aged devices. In the future we are looking at adding how device temperature interacts with aging and how process variation affects the design.”
The scale of the problem can also grow. “Foundries are currently looking at aging and aging-plus-heating because they tend to interact with each other,” explains Ramadan. “Local heating of the device and aging impact each other. Aging can cause the threshold voltage to shift, which lowers the current. These two phenomena interact with each other, and most models today need to account for both. We are talking about self-heating, but heat is not just a local thing. Nearby devices can cause a temperature increase on the device, so you need a thermal map of the whole design to understand how neighbors can impact local devices.”
That implies a flat analysis, which tends to require very large amounts of compute resources. “There are two ways to do back-end analysis,” says Schaldenbrand. “You can do flat or an abstracted view in the analog world. That works for blocks, but not much more. The next step is to use an iterated method where we solve the non-linear part of the circuit, the transistors, to understand what the currents are going to look like going through the wires, and then we analyze the wires. That provides more capacity because you have separated the hard part, which is the non-linear part, from the high-capacity part of the simulation. That allows you to do full chip simulations. The problem is that once you have separated the two parts, there are things you have to do to improve the accuracy.”
Others are looking at hierarchical methods. “I have seen design teams start by running aging or EM analysis at the leaf-cell level and then build it up,” says Thiruvengadam. “If they have a completely clean leaf cell and have satisfied the aging requirement, then they will take it up for next-level analysis. We can help them to get a look at this earlier in the lifecycle. In-design analysis gives you a sense for the resistance of the net, the capacitance as you are drawing the polygons on the canvas. So you get immediate information on things such as EM. Aging tends to be more of a netlist-based approach. Once you have a partial design, you can run aging analysis and get a sense of its impact.”
Aging mitigation
The standard techniques for aging are very similar to those related to reliability. They are divided into three camps: detecting that something failed, or at least has degraded beyond the required operational range; being able to continue operation after something fails; and reducing the rate at which something degrades.
Given that aging processes are complex and often difficult to fully predict, many chip designs today are over-designed to ensure adequate margin for reliable lifetime operation.
“If aging processes could become more deterministic, or better still, if you can monitor the aging process in real time, then you can reduce the over-design and potentially develop chips that react and adjust for aging effects, or even predict when chip failure may occur,” says Moortec’s Crosher. “We expect to see the emergence of aging-specific circuit control and management to optimize device lifetime, which is able to take into account regional supply and temperature throughout the SoC design.”
One such solution comes from UltraSoC. “A concern in multiprocessor designs is that one core gets used heavily and wears out prematurely,” says Rupert Baines, CEO for UltraSoC. “Our project looked at cumulative/integrated temperature across cores using the Moortec temperature sensors. We added some smarts that monitored usage and influenced decisions of a load balancer. This extended life by avoiding premature wear from electromigration over time, and also enables preventive maintenance by spotting cores that are consistently getting too hot too fast.”
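The idea of steering work away from the most thermally worn core can be sketched as follows. This is not UltraSoC's or Moortec's implementation. The class name, the "degree-seconds above a reference temperature" wear metric and the reference value are all assumptions made for illustration.

```python
class WearAwareBalancer:
    """Tracks integrated temperature per core and prefers the least-worn core."""

    def __init__(self, core_ids, reference_c=25.0):
        self.reference_c = reference_c  # assumed baseline below which no wear accrues
        self.exposure = {core: 0.0 for core in core_ids}  # degree-seconds

    def record(self, core, temp_c, dt_s):
        """Accumulate one temperature-sensor sample lasting dt_s seconds."""
        self.exposure[core] += max(0.0, temp_c - self.reference_c) * dt_s

    def pick_core(self):
        """Choose the core with the lowest cumulative exposure (ties go to lowest id)."""
        return min(self.exposure, key=lambda core: (self.exposure[core], core))


balancer = WearAwareBalancer(core_ids=[0, 1, 2, 3])
balancer.record(0, temp_c=95.0, dt_s=60.0)
balancer.record(1, temp_c=70.0, dt_s=60.0)
balancer.record(2, temp_c=70.0, dt_s=30.0)
balancer.record(3, temp_c=85.0, dt_s=60.0)
print(balancer.pick_core())
```

A production scheduler would weigh this against performance and power, but the same counters can also flag cores that run consistently hot for preventive maintenance.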
Other devices can be replicated to mitigate the effects of aging. “Instead of putting down one PLL, you can put four down,” says Mo Faisal, CEO for Movellus. “How to leverage those is an architectural and design decision. For example, for the first two years of its life you can turn on PLL 1, then for the next two PLL 2, and so on. So you can control the aging. In addition, we have information about device speed, just like a processor that has process monitors, to detect if things are slowing down. We can provide health monitoring and provide a status indication that could prompt switching to another circuit. This is minimal overhead. It may be just a couple of hundred gates to provide health monitoring.”
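The rotation scheme in the quote, combined with a health flag, can be modeled with a small state machine. This is a behavioral sketch only. The two-year budget follows the example in the quote, while the class name, the `healthy` input and the policy of permanently retiring a PLL are assumptions.

```python
class PllRotator:
    """Uses one of several redundant PLLs at a time and retires worn or slow ones."""

    def __init__(self, count=4, budget_hours=2 * 365 * 24):
        if count < 1:
            raise ValueError("need at least one PLL")
        self.on_hours = [0.0] * count
        self.retired = [False] * count
        self.budget_hours = budget_hours
        self.active = 0

    def tick(self, hours, healthy=True):
        """Log operating time on the active PLL; switch if it is worn or unhealthy.

        Returns the index of the PLL that should be active from now on.
        """
        self.on_hours[self.active] += hours
        if not healthy or self.on_hours[self.active] >= self.budget_hours:
            self.retired[self.active] = True
            self._switch()
        return self.active

    def _switch(self):
        for index, is_retired in enumerate(self.retired):
            if not is_retired:
                self.active = index
                return
        raise RuntimeError("no usable PLL left")


rotator = PllRotator(count=4, budget_hours=10)
print(rotator.tick(5))                  # still within budget
print(rotator.tick(5))                  # budget reached, moves to the next PLL
print(rotator.tick(1, healthy=False))   # health monitor trips, moves on again
```

How the `healthy` signal is derived, for example from a speed monitor, is exactly the architectural decision the quote leaves to the design team.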
But unnecessary duplication can have a competitive cost. “When you are competing, and if time-to-market is a key, or if you are going to lose some of your design margins because you are not able to predict how your design will perform in regards to aging or how it will be affected by aging, this can be a competitive challenge,” warns Ramadan.
The more you know, the more options are available to you. “One approach is to look at clock gating and to have a periodic toggle of clock-gated areas,” says Geada. “Instead of staying static until the next burst of activity, wake them up every once in a while, flip state, and go back to sleep. This ensures that the stress isn’t accumulating in one direction all of the time. You need to relieve the stress that is causing the problem, and that tends to be straightforward if you know where to do it. You don’t need to do it all over the place because that will cost you power, but if you know where the design is vulnerable you can target those.”
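A targeted toggle policy of that kind reduces to a simple selection rule. In the sketch below, the block names, the idle-cycle counters and the threshold are hypothetical. The set of vulnerable blocks would come from aging analysis, not from this code.

```python
def blocks_to_toggle(idle_cycles, vulnerable, max_idle_cycles):
    """Return the clock-gated blocks that should be briefly woken and toggled.

    idle_cycles: mapping of block name to cycles spent gated since last activity
    vulnerable:  set of block names that analysis flagged as aging-sensitive
    """
    return sorted(
        name
        for name, idle in idle_cycles.items()
        if name in vulnerable and idle >= max_idle_cycles
    )


idle = {"fpu": 2_000_000, "dsp": 500, "crypto": 3_000_000}
print(blocks_to_toggle(idle, vulnerable={"fpu", "dsp"}, max_idle_cycles=1_000_000))
```

Here `crypto` has been idle the longest but is left asleep because it was not flagged as vulnerable, which is the power-saving point of targeting the toggles.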
There are also some tradeoffs that can be made. “You can spread the load by making the transistors bigger and more resilient,” adds Geada. “High Vt transistors tend to be a lot more vulnerable than low Vt transistors. They are also lower power, so you have to make tradeoffs. You cannot do it everywhere.”
In designs that use lots of IP or multi-chip designs, system integrators have to keep track of higher level impacts. “It is important to be able to have replacement compatible parts should a part prove defective or to create new updates of the device whenever needed,” says Ranjit Adhikary, vice president of marketing for ClioSoft. “Managing the various IPs, the associated foundries, the process nodes, license agreements, customers, and partners becomes an interesting conundrum. To respond quickly, design engineers need to be able to track the design data associated with the IP along with documentation, verification suite and associated knowledge-base to be able to identify any problems and create solutions or updates to the IP/SoC. Consequently, it becomes important to have IP traceability within the enterprise, as well, and be able to track the different versions of the IP/SoC—no matter where it is stored within the company.”
Reliability is everyone’s business
It used to be that reliability was the domain of a small group of experts within a company. That is no longer the case. Safety-critical designs, be it ADAS or IIoT, require reliability. “The responsibility of reliability is moving into the design team, and this is causing new interest in aging,” says Geada. “It is also casting doubts on previous methodologies. Previously the methodology used in SoCs was to add a percentage guard-band to every cell to deal with timing. It assumed that every cell was slower. But people have realized that the conventional way of dealing with timing in the presence of aging just doesn’t work that way. New solutions are required.”
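One way to see why a uniform guard-band can mislead is a hold check in which only the capture clock path has aged. The sketch below is a toy model with made-up delays in picoseconds. The function names and the 10% figures are assumptions for illustration.

```python
def path_delay(cells):
    """Sum of cell delays, each scaled by its own aging derate (0.1 = 10% slower)."""
    return sum(delay * (1.0 + derate) for delay, derate in cells)


def hold_slack(launch_clk, data, capture_clk, hold_time):
    """Positive slack means the hold requirement is met."""
    return path_delay(launch_clk) + path_delay(data) - path_delay(capture_clk) - hold_time


def uniform_derate(delays, derate):
    """Apply the same derate to every cell, as a flat guard-band would."""
    return [(delay, derate) for delay in delays]


launch, data, capture, hold = [100.0], [50.0], [140.0], 5.0

# Flat 10% guard-band: every cell is assumed to slow down together.
flat = hold_slack(uniform_derate(launch, 0.1), uniform_derate(data, 0.1),
                  uniform_derate(capture, 0.1), hold)

# State-dependent aging: only the capture clock path has degraded.
uneven = hold_slack(uniform_derate(launch, 0.0), uniform_derate(data, 0.0),
                    uniform_derate(capture, 0.1), hold)

print(f"flat guard-band slack: {flat:.1f} ps, uneven aging slack: {uneven:.1f} ps")
```

With these numbers the flat guard-band reports a passing hold check while the unevenly aged circuit fails it, because scaling every cell by the same factor cannot represent paths that age at different rates.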
The industry is working on solutions, but at the same time it is trying to fully understand what is happening so that better models can be constructed. “There is no standard model and there was no standard interface until recently,” says Ramadan. “The Compact Model Council (CMC) recently produced an Open Model Interface that supports aging and with this interface, a foundry can integrate their own aging equations to model the degradation that can happen specifically for their technology nodes and technology.”
Until complete solutions exist, a mix of old and new methodologies has to be applied.
Excellent article. Aging in RF devices is such a concern now with smaller packages, and it is not just driven by bias-current heating.
We also need EDA tools to support design for reliability, e.g. analyzing stress for periodic clock toggling, or adding guard-bands to timing closure.
In addition, wafer foundries could work on new materials to make semiconductors more resilient.