Design For Reliability

How long a chip is expected to function raises questions that design teams need to think about, including how much they trust aging models.


Circuit aging is emerging as a mandatory design concern across a swath of end markets, particularly those where advanced-node chips are expected to last for more than a few years. Some chipmakers view this as a competitive opportunity, but others question whether we fully understand how those devices will age.

Aging is the latest in a long list of issues being pushed further left in the design flow. In the past, the fab hid many of these problems from design teams. But as margin shrinks at each new node, the onus has shifted to the design side to solve the problems earlier in the flow, as evidenced by the widespread adoption of back-end implementation software and sign-off tools. And power and thermal issues have become so constraining that they have been pushed even further forward in the development flow, starting at the architecture level.

Reliability is the latest concern to surface, and while it may not be as prominent yet, it is no less important. Left unchecked, devices may not survive their intended lifetime, a problem made worse by the fact that many of these devices are expected to last longer today than in the past. Reputations are at stake across the supply chain. On top of that, in-field replacements are expensive. And margining, the traditional approach to improving reliability, makes those products uncompetitive at advanced nodes.

“Aging has long been considered by designers in high-reliability design areas,” says Stephen Crosher, director for SLM strategic programs at Synopsys. “Maybe they were designing for particular applications, such as automotive, or for extreme stress environments. But it wasn’t necessarily a mainstream consideration for designers. Now it is transitioning into normal practice, and your standard designer will need to become aware of it.”

Several factors are changing. “In more advanced nodes and with increasing speed requirements, it has become much more of a critical design consideration,” says Ashraf Takla, president and CEO of Mixel. “The aging impact needs to be evaluated and accounted for in the timing budget early in the design stage, and verification needs to be done to ensure the final design meets its aging budget.”

The importance of addressing aging has spread beyond just safety-critical and mission-critical applications. “Several of our IP providers, especially on the lower process technology nodes, are asking about aging models and the aging capability of EDA simulation,” says Greg Curtis, product manager for Analog FastSPICE at Siemens EDA. “It is not just automotive anymore. We’re seeing it in mobile communications, we’re seeing it in Internet of Things. It is becoming good practice that companies start looking at the aging of their IP.”

For many companies, this is unavoidable. “The largest variability factor that impacts lifetime is temperature,” says Brian Philofsky, principal technical marketing engineer at Xilinx. “Operating electronics at reduced temperatures often has a measurable effect on the lifetime and aging of the circuits. Another factor an engineer has control over within the device is the amount of current consumed. Higher current draw can significantly reduce lifetime due to electromigration and other undesirable effects. Unfortunately, modern circuit design is put at odds as compute density increases with every node shrink. Simultaneously, the reduced voltage has the impact of increasing current draw within the same power envelope. Over the last few years, the trend is higher operating currents at higher operating temperatures, which makes reliability more challenging.”
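
To make that temperature and current trade-off concrete, below is a minimal sketch using Black's equation for electromigration lifetime, MTTF = A * J^-n * exp(Ea/kT). The constants A, n, and Ea are process-specific; the values here are illustrative placeholders, not vendor data.

```python
import math

K_EV = 8.617e-5  # Boltzmann constant, eV/K

def em_mttf(j_ma_um2, temp_c, a=1.0, n=2.0, ea_ev=0.9):
    """Black's equation: MTTF = A * J**-n * exp(Ea / (k*T)).
    A, n, and Ea are process-specific; these defaults are placeholders."""
    t_k = temp_c + 273.15
    return a * j_ma_um2 ** -n * math.exp(ea_ev / (K_EV * t_k))

# Relative lifetime when current density and temperature both rise:
ratio = em_mttf(1.5, 105) / em_mttf(1.0, 85)
print(f"lifetime ratio vs. baseline: {ratio:.2f}")  # well below 1.0
```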

Geometry impact
While the underlying mechanisms that cause aging have not changed, their importance has become more significant at each new process node. “Reliability is related to device size, and in particular the channel length,” says Jushan Xie, senior software architect at Cadence. “As the channel length gets shorter, the effects become more pronounced. The electric field inside the channel can become strong. Devices at 45nm and below have to consider reliability.”

That does not mean that designs at older nodes can safely ignore the impacts. “While it is more predominant in advanced technology nodes, such as 28nm and below, we have seen it on 40nm devices as well,” says Ahmed Ramadan, AMS foundry relations manager at Siemens EDA. “Recently, specialty foundries that are providing technologies on the 130nm and 180nm nodes are starting to consider providing aging models for their customers. This is coming because of pressure from the customers. It is a need they are seeing in the types of designs and applications they are working on.”

New device technologies are making it a larger issue. “At 28nm, people already were aware of some of the mechanisms of over-stressing devices,” says Oliver King, director of engineering at Synopsys. “Gates were very thin. They were susceptible to being overstressed. Along with continued shrinking of dimensions, designs switched to finFETs, which bring in new mechanisms such as the fin structure, and this just made it much more prominent.”

One of the big problems is that not everything scales equally in new geometries. “You are scaling the dimensions of the length and the width of the transistors,” says Siemens’ Ramadan. “But you’re not able to scale the gate oxide at the same pace. This is going to add additional stress on the device. You’re not able to scale the voltage at the same pace because scaling the voltage will not leave enough room above the threshold voltage of the device. This increases the amount of stress that the device will be facing.”
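
A back-of-the-envelope illustration of that mismatch: the lateral field in the channel is roughly Vdd/L, so when the channel length shrinks faster than the supply voltage, the field climbs node over node. The scaling factors below are illustrative.

```python
# Lateral channel field is roughly E = Vdd / L. If length shrinks by the
# classic ~0.7x per node while the supply drops only ~5%, the field grows:
vdd_scale = 0.95   # supply barely scales (headroom above Vth is needed)
l_scale = 0.70     # dimension shrink per node
print(f"field increase per node: ~{vdd_scale / l_scale:.2f}x")  # ~1.36x
```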

Gate scaling without voltage scaling is a huge issue. “If the average transistor is consuming the same amount of current as older transistors on larger nodes, then by scaling device density you’ve increased the power density,” says Synopsys’ Crosher. “That relates to heat, and heat can be the big culprit in this equation. Self-heating in finFETs is also contributing to this. The transition from planar into finFET is where we really started to see those sorts of stress issues applying to consumer products and broadening the concerns for reliability. They are exposing themselves to those kinds of stress conditions, which need to be mitigated to try and get any reasonable lifetime from those devices.”

This gets even more complicated as more die are included in the same package. In the past, process-related issues could be solved with enough volume and time. But many of the advanced package implementations are unique, and the chips within them may age at different rates.

“The biggest problem we see is the huge number of implementation options,” said Andy Heinig, group leader for advanced system integration and department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “It’s not clear how you compare these different options. Chip design is easier if you have previous chip generations that are similar. But now we have a huge range of options, both for packages and the software within them.”

The number of things that can go wrong increases inside a package, as well. “There is mechanical stress, in addition to warpage, and the potential for thermal mismatch,” said Marc Swinnen, director of product marketing at Ansys. “You also still have to get the power from Chip A to Chip B. Even if a bump fails, you still have to get the power through. But if you have a current spike as a result of this, other bumps can fail, too.”

Understanding how the various pieces go together requires much more in-depth analysis throughout the design process.

Aging models
Analysis starts with suitable models and it can be tricky to model phenomena that may not be fully understood. “There’s an element of we don’t actually know whether the models are accurate enough,” says Synopsys’ King. “Ultimately the models predict a certain aging for a given circuit. And only really time will tell whether they were correct in that prediction. It is a complicated issue. It’s not just the aging mechanisms that we already know about. It’s also self-heating effects, process variations, Monte Carlo, and other effects that you need to take into account as part of analyzing any given circuit. Maybe the models are right, maybe they’re not.”

It can be easy to dismiss the current state of the models, but the industry has to work with something. “We have not seen any indication that the models are inadequate,” says Mixel’s Takla. “That said, foundries, in cooperation with tool providers, are continuously tweaking their aging models to improve accuracy based on silicon measurements.”

While techniques like burn-in have been successful for traditional devices, it is not known exactly how they apply to these new effects. “You cannot wait for 10 years. You have to find ways to get the results you need quicker,” says Cadence’s Xie. “You will be using some theory or equations for acceleration and you want to get some equivalent to 10 years of aging in a short period of time. Calibration is important, and there are theories about how to accelerate aging.”
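
The standard workhorse for that acceleration is the Arrhenius model. The sketch below computes how many hours of elevated-temperature stress approximate 10 years of use; the activation energy is mechanism-specific, and the 0.7 eV here is only a placeholder.

```python
import math

K_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    """Acceleration factor AF = exp((Ea/k) * (1/T_use - 1/T_stress)),
    with temperatures in kelvin. Ea = 0.7 eV is a placeholder."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_EV) * (1.0 / t_use - 1.0 / t_stress))

# How many hours at 125C approximate 10 years at 55C?
af = arrhenius_af(t_use_c=55, t_stress_c=125)
ten_years_h = 10 * 365 * 24
print(f"AF = {af:.0f}, stress time = {ten_years_h / af:.0f} hours")
```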

The industry is trying to reach consensus. “I’ve been working with Compact Model Coalition (CMC) for more than 20 years,” says Ramadan. “It was probably 7 years ago when we first started having discussions about a standard aging model. At that time, we were not able to converge on a single standard for hot-carrier injection (HCI) or negative-bias thermal instability (NBTI) that would satisfy all foundries and design communities. They felt they had to make customizations and modifications to fit their processes.”

But that can leave device companies holding the ball. “We guarantee our commercial devices for 10 years of operation when maintained within their operating specifications,” says Xilinx’s Philofsky. “There are two situations that may require further consideration or analysis — a design requiring an operating lifetime greater than 10 years, or a design that may exceed the operating conditions and wishes to understand the impact on lifetime. In these situations, we have simulation models, analysis tools, and reliability data that can be used and applied to specific operating conditions for particular devices. This can fine-tune the lifetime specifications, sometimes allowing for a more efficient operating range. We’ve done this for decades and evolved our models to the point of having a high degree of confidence in them. Yet we are continuously improving them based on the latest theoretical and empirical data collected.”

Work continues within the CMC. “Still not yet there,” says Ramadan. “Each and every foundry and design house is creating its own models. Some of them are initially physics-based models. But there are a lot of empirical formulations that also are taking place in order to be able to fit their current process and the target applications. How confident are we in these models? We should be confident enough for them to give a good estimate for the amount of degradation that’s going to happen on the device.”

Even with accurate models, there are other sources of inaccuracy. “The nature of aging simulation itself utilizes a lot of approximations,” Ramadan notes. “Consider that you run the simulation for aging over a short period of time, and then you do an extrapolation for the intended period. With this extrapolation there are a lot of approximations. But so far, we haven’t heard any complaints from customers that the aging models provided by the foundries are far off. These things will need some years to validate. If you are running aging analysis today, you need maybe five years to make sure that what happens in real life matches the prediction.”
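
As a sketch of that extrapolation step, the snippet below assumes degradation follows the power law dVth(t) = A * t^n commonly used for NBTI, fits the parameters to a short simulated stress window, and projects to 10 years. The sample points are hypothetical.

```python
import math

def fit_power_law(times_s, dvth_mv):
    """Least-squares fit of log(dVth) = log(A) + n * log(t)."""
    xs = [math.log(t) for t in times_s]
    ys = [math.log(v) for v in dvth_mv]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    n = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return math.exp(my - n * mx), n

# Short simulated stress window (hypothetical points):
a, n = fit_power_law([1e3, 1e4, 1e5], [4.0, 5.8, 8.4])  # seconds, mV
ten_years_s = 10 * 365 * 24 * 3600
print(f"n = {n:.2f}, dVth at 10 years = {a * ten_years_s ** n:.1f} mV")
```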

Aging cannot be considered on its own. “Variability also plays a part,” says Crosher. “It goes hand in hand with increased gate density, the manufacturing process and greater variability. We haven’t seen them mature in the field for 15 years to actually know what the aging impacts and effects are. So that’s why there is a reliance, and a critical need, that within advanced devices you need some form of embedded sensing to try and manage those issues. If you can measure the conditions of the chip in real time and see how devices are degrading and how they’re aging, then you are able to take some mitigating steps to try and manage that.”

He’s not alone in identifying this as an issue. Fraunhofer’s Heinig pointed to system variation as one of the big challenges as more devices are integrated into systems and into packages, and as those devices are expected to last longer in the field. “There are no tools today to solve this problem,” he said. “It’s difficult to verify, because with software updates the product also changes over time.”


Fig. 1: Adding aging sensors into the design, tied to lifecycle analysis. Source: Synopsys

Where to focus
Digital and analog will be affected differently, as will devices subject to frequent change — and in some cases, infrequent change. “Any place where there’s a lot of activity will be more sensitive to device aging,” says Art Schaldenbrand, senior product manager at Cadence. “For devices, you can look at the clock tree and look at what is happening. Digital designs are sensitive to delay changes. The other place where this becomes a challenge is within analog designs. An example would be in a bias tree. With the bias transistors moving and aging, it can potentially accelerate the aging of other devices in the bias network. There’s always going to be some different elements in the design, and you have to look at them a little bit differently to be able to analyze the reliability.”

Designs that employ dynamic voltage and frequency scaling may have to be very careful. “Problems often arise when you are trying to optimize devices, maybe reducing supplies,” says King. “It may be tied to adaptive voltage schemes, and it is a question of how low you can go on the supply, with your logic still meeting timing. There could be designs that push the supply up when they detect that they need to. If performance drop-off can’t be corrected, then at least a graceful bow-out may be an important design consideration.”

Sensitivity analysis is one way to approach the problem. “Let’s say that there is a certain design parameter they are concerned about, such as the gain for an amplifier,” says Ramadan. “They would want to see how much each transistor contributes to a change in that gain. Then they can consider the change in the threshold voltage or Ids due to aging. With sensitivity analysis, they can understand how big the impact of aging will be on specific devices in the design compared to the rest of the devices, and then start doing some guarding for those.”
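
A minimal sketch of that sensitivity sweep, assuming a hypothetical simulate() hook that returns the amplifier gain for a given set of per-transistor threshold-voltage shifts:

```python
def gain_sensitivity(transistors, simulate, dvth=0.01):
    """Finite-difference sensitivity of gain to a Vth shift per device."""
    nominal = simulate({t: 0.0 for t in transistors})
    sens = {}
    for t in transistors:
        shifts = {u: 0.0 for u in transistors}
        shifts[t] = dvth  # perturb one device at a time
        sens[t] = (simulate(shifts) - nominal) / dvth
    return sens

# Devices with the largest |sensitivity| are the ones to guard:
# ranked = sorted(sens.items(), key=lambda kv: abs(kv[1]), reverse=True)
```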

But you have to be careful to consider all of the important areas. “There is a phenomenon called non-conductive stress,” says Cadence’s Schaldenbrand. “Consider a device such as a watchdog or monitor. It will be sitting idle, potentially for years, and you want it to spring into action if there’s some sort of condition that occurs. Even those circuits that you think are just sitting there doing nothing are being stressed. They can age and potentially fail due to the aging that occurs while they’re sitting idle.”

How to tackle the issue
There are several ways to take the issues into account during the design, implementation and sign-off stages of development. Schaldenbrand lists three levels of analysis that can be performed:

  1. Monitoring conditions a device operates under. This is effectively monitoring things like the electric field by looking at device size and other factors. These checks are called device asserts (a minimal example is sketched after this list). It may show that a device sees a lot of voltage, so it’s a place that is sensitive, and a potential problem.
  2. Run analysis. You can conduct aging analysis and say a device will operate under certain conditions for a given time period, and at the end of life it will have certain characteristics. If you do corner analysis, or Monte Carlo analysis, you also can do aging analysis at the same time.
  3. Gradual aging. This makes piece-wise approximations for an operating lifetime. Usually, designers are relatively experienced and know which blocks are more sensitive to those kinds of phenomena. You do not have to run those tests everywhere because they tend to be relatively expensive.
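
As a rough illustration of the first level, a device assert boils down to comparing estimated fields against process limits. The limits and the device record below are illustrative placeholders, not foundry values.

```python
def check_device_asserts(devices, e_ch_max=5e5, e_ox_max=4e6):
    """Return names of devices whose lateral channel field or gate-oxide
    field (both in V/cm) exceeds the given limits."""
    violations = []
    for d in devices:
        e_channel = d["vds"] / d["l_cm"]   # lateral field across channel
        e_oxide = d["vgs"] / d["tox_cm"]   # vertical field across oxide
        if e_channel > e_ch_max or e_oxide > e_ox_max:
            violations.append(d["name"])
    return violations

# Hypothetical device: Vds = Vgs = 0.9 V, 30nm channel, 2nm gate oxide.
devices = [{"name": "M1", "vds": 0.9, "vgs": 0.9,
            "l_cm": 30e-7, "tox_cm": 2e-7}]
print(check_device_asserts(devices))  # ['M1']: 4.5 MV/cm oxide field
```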

Process migration is becoming expensive. “For every process migration, say from 16nm to 10nm, to 7nm, to 5nm, all the way down to 3nm, every process node according to our customers requires three times more simulations because of the additional PVT corners they need to run,” says Siemens’ Curtis. “It puts a tremendous burden on their simulation needs to ensure first-time silicon success.”

But even this level of analysis does not provide certainty. “Reliability is statistical,” says Xie. “You need to look at it as a Monte Carlo problem. You have 100 devices, and they are identical when first fabricated. Even if you apply the same stress to those devices over 10 years, and measure the device degradation, it will have a distribution. This distribution for relative aging is not being considered by most companies.”
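
A toy Monte Carlo along those lines: sample per-device variation in the degradation parameters, age each sample to end of life, and look at the distribution rather than a single number. All parameters are illustrative.

```python
import random

random.seed(1)
N = 10_000
TEN_YEARS_S = 10 * 365 * 24 * 3600.0

samples = []
for _ in range(N):
    a = random.gauss(1.3, 0.2)    # per-device prefactor variation (mV)
    n = random.gauss(0.16, 0.01)  # per-device time-exponent variation
    samples.append(a * TEN_YEARS_S ** n)  # dVth in mV at 10 years

samples.sort()
print(f"median dVth: {samples[N // 2]:.1f} mV")
print(f"99th percentile: {samples[int(0.99 * N)]:.1f} mV")
```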

Nobody wants to design for worst case. “When you embed sensors, you don’t have to predict aging,” says King. “You can measure it. You can see what is aging and make adjustments to that circuit, or highlight that the chip is close to failure and make the decision to go into a safe state. That may enable you to pull a failing computer from a data center, or ensure safe operation of your self-driving car.”
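
A minimal sketch of that measure-and-mitigate loop, assuming hypothetical read_ring_osc_mhz() and set_supply_mv() hooks exposed by an on-chip monitor; the drift thresholds are illustrative.

```python
FRESH_FREQ_MHZ = 1000.0  # ring-oscillator frequency measured at time zero
WARN_DRIFT = 0.05        # 5% slowdown: compensate
FAIL_DRIFT = 0.10        # 10% slowdown: flag for safe state / replacement

def check_aging(read_ring_osc_mhz, set_supply_mv, enter_safe_state,
                supply_mv=750):
    """Measure aging via ring-oscillator slowdown and react to it."""
    drift = 1.0 - read_ring_osc_mhz() / FRESH_FREQ_MHZ
    if drift >= FAIL_DRIFT:
        enter_safe_state()             # e.g. pull the part from service
    elif drift >= WARN_DRIFT:
        set_supply_mv(supply_mv + 25)  # nudge supply up to recover timing
```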

Built-in analysis can be changed over time. “Xilinx provides a system monitor circuit to allow users to monitor temperature and voltage to ensure safe operation,” says Philofsky. “Having programmability for the device will enable us to further extend this measurement and allow a more comprehensive view of reliability over many fixed-function devices.”

At the least, it means that margins can be squeezed. “The trend the industry was taking, before actually focusing on having good aging models and implementing an aging simulation flow, was to insert a lot of margin,” says Ramadan. “They were leaving a lot on the table, which they cannot afford any more. By doing some aging simulation, they are able to tighten the margins to compete in the market without taking on too much risk. They will leave some, but not as much as they did before.”

There remains hope within the CMC. “Back in 2018, the CMC released a standard that supports an aging simulation flow through the Open Modeling Interface (OMI),” he says. “There is more development to include additional models in that flow. It has gained a lot of adoption from different design houses, and most importantly from different foundries. This interface is simulator agnostic, meaning that foundries do not need to create a different interface for different simulators. We have seen a lot of pressure from design houses and foundries to provide an aging interface. And more and more foundries are currently starting to adopt the standard and OMI interface.”

Conclusion
While the mechanisms that contribute to aging are understood, the industry continues to struggle with the creation of models that provide sufficient accuracy. Part of the problem is that there has not been enough time to collect data that can be used to assess those models and to fine-tune them. That process is ongoing. Until the accuracy of those models is fully understood, design teams either have to leave some margin on the table, or they have to incorporate adaptive schemes into their devices to be able to mitigate any aging problems when they arise.

Related
Design Issues For Chips Over Longer Lifetimes
Keeping systems running for decades can cause issues ranging from compatibility and completeness of updates to unexpected security holes.
Performance And Power Tradeoffs At 7/5nm
Experts at the Table: Security, reliability, and margin are all in play at leading-edge nodes and in advanced packages.
Making Chips To Last Their Expected Lifetimes
Lifecycles can vary greatly for different markets, and by application within those markets.


