As advanced-node chips are added into cars, and usage models shift inside of data centers, new questions surface about reliability.
Reliability is becoming an increasingly important proof point for new chips as they are rolled out in new markets such as automotive, cloud computing and industrial IoT, but actually proving that a chip will function as expected over time is becoming much more difficult.
In the past, reliability generally was considered a foundry issue. Chips developed for computers and phones were designed to operate at peak performance for an average of two to four years of normal use. After that, functionality began to degrade and users upgraded to the next rev of a product, which boasted more features, better performance and longer periods between battery charges. But as chips are developed for new markets, or markets where there were less-sophisticated electronics in the past—automotive, machine learning, IoT and IIoT, virtual and augmented reality, home automation, cloud, cryptocurrency mining—this is no longer a simple checklist item.
Each of those end markets has unique needs and characteristics, which affect how chips are used and under what conditions. That, in turn, has a big impact on aging, safety and other factors. Consider the following:
Across the spectrum of electronics, use cases are shifting. This is happening even inside of data centers, which historically have been extremely conservative when it comes to adopting new technologies and methodologies.
“Aging is a function of clock speed and power, but in the past the servers would come on occasionally when there was a job to do, and they sat idle most of the time,” said Simon Segars, CEO of Arm. “When you move to the cloud, the design criteria needs to be different because it’s based on how long it’s used. That raises a lot of questions about how you design for longevity.”
At the start of the millennium, average utilization of servers was about 5% to 15%, a trend that had persisted throughout the 1990s because IT managers were reluctant to run more than one or two applications on a single commodity server in case of equipment failure. Two things happened to change that. First, energy costs began rising. Second, and probably more important, companies reorganized so that IT departments, rather than corporate facilities departments, were responsible for their own energy costs. Both factors drove a spike in sales of virtualization software, which increased server utilization and meant fewer server racks to power and cool.
The cloud takes that kind of operational efficiency to an even higher level. The goal for cloud operations is to maximize utilization by balancing compute jobs across an entire data center. That can push utilization rates significantly higher for all servers in a data center, not just those in a single rack, or power them down quickly when they are not needed. The approach is energy-efficient, but it has a big impact on degradation and aging of electronic circuits.
“We are seeing an acceleration of aging where the chip breaks down,” said Magdy Abadir, vice president of marketing at Helic. “They may be missing clocks or there is extra jitter. Or there is dielectric breakdown. And anytime something breaks down, there is an avalanche of new things you have to worry about. A lot of aging models advanced in an era where electronics were used sporadically. Now chips are running all the time. Inside of a chip, blocks are heating up, so aging is accelerated. From that you get all types of weird phenomena. A lot of companies have not revised their aging models, either. They assumed these devices would last three to four years, but they may fail sooner. And given that design margins from the beginning can be flimsy, aging can throw them off.”
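The intuition behind that acceleration can be illustrated with a simple Arrhenius-style temperature-acceleration estimate. The sketch below uses placeholder activation energy and temperatures, not values from any particular foundry aging model, and it captures only the thermal part of the story.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration(t_use_c, t_stress_c, ea_ev=0.7):
    """Acceleration factor between a nominal and an elevated junction temperature.

    ea_ev is an illustrative activation energy; real values depend on the
    failure mechanism (NBTI, dielectric breakdown, electromigration, ...).
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# A part projected to last ~4 years at a 55C junction temperature under sporadic
# use, re-examined for continuous operation at 85C (illustrative numbers only):
af = arrhenius_acceleration(55, 85)
print(f"acceleration factor: {af:.1f}")           # ~8x with these placeholders
print(f"projected lifetime: {4 / af:.1f} years")  # well under a year
```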
Utilization trends are shifting inside of automobiles, as well, and that will continue until fully autonomous vehicles replace human drivers. The vehicles are processing an increasing amount of data, some of which is being streamed from sensors such as radar, LiDAR and cameras. And all of that data needs to be processed in less time than in the past, with a high degree of accuracy, which puts enormous stress on electronics.
“The reliability for ADAS is a minimum of 15 years, which is a lot different than the 2 to 5 years for modules in the past,” said Norman Chang, chief technologist at ANSYS. “Aging isn’t just about time. It’s also NBTI (negative bias temperature instability), electromigration, which can be thermal-related, ESD (electrostatic discharge), and thermal coupling.”
Fig. 1: Thermal modeling for chip and package. Source: ANSYS
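Electromigration lifetime, one of the mechanisms Chang cites, is commonly estimated with Black's equation, MTTF = A · J⁻ⁿ · exp(Ea/kT). Below is a minimal sketch with placeholder constants; real coefficients are fitted per process and metal layer by the foundry.

```python
import math

BOLTZMANN_EV = 8.617e-5  # eV/K

def em_mttf(current_density, temp_c, a_const=1.0, n=2.0, ea_ev=0.9):
    """Black's equation: MTTF = A * J^-n * exp(Ea / kT).

    a_const, n, and ea_ev are placeholders here; foundries fit them per
    interconnect layer from stress-test data.
    """
    temp_k = temp_c + 273.15
    return a_const * current_density ** (-n) * math.exp(ea_ev / (BOLTZMANN_EV * temp_k))

# Relative lifetime of the same wire at 105C vs. 150C junction temperature
# (absolute values are meaningless with placeholder constants; ratios are not):
print(em_mttf(1e6, 105) / em_mttf(1e6, 150))  # roughly 19x with Ea = 0.9 eV
```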
While many of the automotive Tier 1 suppliers are used to building chips to withstand extreme temperatures, as well as mechanical vibration and various types of noise, these kinds of stresses have never been applied to advanced-node CMOS for extended periods of time. Numerous industry sources confirm that carmakers are developing chips at 10/7nm to manage all of this data, working at the leading-edge nodes to avoid obsolescence of their designs, which often are built to last several generations of vehicles. The problem is that there is very little real-world data to show how reliably these devices will behave over time under these environmental conditions.
“You have to do designs differently,” said Segars. “There is one school of thought that you will need fewer cars because they won’t be idle all the time. But there’s another school of thought that says self-driving cars will be run more and wear out sooner. Everything will wear out eventually. The challenge is to make sure the electronics don’t wear out faster than the mechanical parts, and that requires you to design differently. That includes everything from taking noise more seriously to minimizing current spikes.”
Thinner insulation, thinner substrates
One of the ironies of increasing reliability is that it runs counter to five decades of semiconductor progress, where the goal has been to shrink features every couple of years in an effort to reduce cost. That generally means thinner dielectrics, thinner wires and an increase in dynamic power. Increasingly, it also involves thinner substrates. And at the most advanced nodes, this has resulted in higher leakage current, more noise, more electromigration and other electrical effects.
“From a circuit perspective, you know you have to deal with process variation,” said André Lange, group manager for quality and reliability at Fraunhofer EAS. “But from a design features perspective, it’s about what might happen to cope with a known defect in a system. If you look at autonomous cars, there is a central processing unit that has to decide which information it will use from which sensor. One of them may be dirty or not working.”
That makes degradation modeling much more complex because it needs to be done in the context of the system. “Most things contribute to degradation of circuits, whether it’s NBTI or more defects per given area or more process variation,” said Lange. He noted that a big challenge is identifying what causes a defect, not sifting through all of the data available, which can be enormous.
Fig. 2: What can go wrong. Source: Fraunhofer
Different approaches
Process variation increases at each new node. Over the past decade, it was smartphones that drove the scaling roadmap (the iPhone was introduced in 2007). Today the biggest users of advanced-node technologies are servers for data mining, machine learning, AI and cloud.
The connection between process variation and reliability is well documented, but variation makes it harder to accurately model aging effects. That has produced a number of different approaches to this problem, ranging from sophisticated statistical modeling and simulation to placing sensors on chips or inside of packages.
“With a heat source, you have to keep track of temperature using a ‘random walk’ approach that is both local and global,” said Ralph Iverson, principal R&D engineer for 5nm extraction at Synopsys. “With random walk, the voltage is the average of the voltage around it, so the delta is zero.”
That helps with modeling, but resistivity isn’t always clean at 5nm and beyond, said Iverson. There are surface effects, and data doesn’t necessarily represent the connectivity of copper, which requires more localized data. And this is where a hybrid type of approach is beginning to show up, because this level of uncertainty is difficult to abstract.
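The averaging property Iverson describes is the basis of random-walk field solvers: on a discretized Laplace problem, the value at a node is the average of its neighbors, so the expected boundary value reached by a random walk equals the potential at the walk's starting point. Here is a toy sketch of that idea on a small grid, not a representation of Synopsys' extraction engine:

```python
import random

def random_walk_potential(boundary, start, n_walks=2000):
    """Estimate the potential at an interior node of a Laplace problem.

    boundary maps boundary (x, y) nodes to fixed potentials. Each walk steps
    to a random neighbor until it lands on a boundary node, and the estimate
    is the average boundary potential reached -- the discrete version of
    "the voltage at a node is the average of the voltage around it."
    """
    total = 0.0
    for _ in range(n_walks):
        x, y = start
        while (x, y) not in boundary:  # keep walking until the boundary absorbs the walk
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary[(x, y)]
    return total / n_walks

# Toy case: an 11x11 grid with the left edge held at 1.0 V, other edges at 0.0 V.
bc = {(0, j): 1.0 for j in range(11)}
bc.update({(10, j): 0.0 for j in range(11)})
bc.update({(i, 0): 0.0 for i in range(1, 10)})
bc.update({(i, 10): 0.0 for i in range(1, 10)})
print(round(random_walk_potential(bc, (5, 5)), 2))  # ~0.25 by symmetry
```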
“The automotive world has this pretty well dialed up for BCD (bipolar CMOS DMOS), but we’re now seeing requirements and requests for advanced CMOS,” said Mick Tegethoff, director of AMS product marketing at Mentor, a Siemens Business. “We’re seeing more interest from foundries, and EDA companies are simulating aging due to stress. Is it enough? Any type of modeling is an approximation of the real world. So you do circuit simulation and do what you can to build a chip that will last, but then you need to go back to physical testing and do things like put it in an oven to create physical stress. We’re seeing a lot more electronics going through that kind of testing now.”
Analog vs. digital
Most of the aging/degradation modeling so far has focused on digital circuitry. Analog adds a whole different perspective to aging.
“With a leading-edge chip around the engine compartment, aging and process variability are well understood by companies, so they’re not going forward blindly,” said Oliver King, CTO at Moortec. “With analog, though, the effects are more variable. A digital chip will just stop working. But with analog, it may be slightly less good or the circuit is not as accurate, so you have to adjust for that. Analog developers traditionally have not pushed on geometries as hard as the digital side. Electromigration is still an issue, and so is current density. But the aging effects don’t show up quite as much. Still, chips need to be more proactive about the state of repair and whether action needs to be taken.”
Frank Ferro, senior director of product management at Rambus, has a similar view. “With a PHY, the biggest challenge is ambient temperature,” he said. “As temperature drifts, performance drifts, so you need to do recalibration. On the consumer side, there’s something called the ‘Christmas Day test.’ In cold weather, you store a PlayStation or other electronic device in the garage, and then you turn it on Christmas morning and the circuits need to be able to go from cold to operational instantly. It’s the same for memory systems in a car or a base station. Aging has an effect on these systems, and you have to recalibrate the system to negate those effects.”
PHYs undergo the same kinds of qualifications as digital components, including burn-in and testing for voltage and temperature variation, Ferro said. But the PHYs are designed to change with those variations, which is difficult to design into digital circuitry, particularly at advanced nodes, where margining has an impact on power and performance.
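In real systems that retraining is handled by the memory controller or PHY firmware, but the control idea can be sketched simply: retrain whenever the temperature drifts far enough from the point at which the interface was last calibrated. The sensor read and retraining hooks below are hypothetical placeholders, not a real PHY API.

```python
import time

RECAL_THRESHOLD_C = 5.0  # hypothetical drift budget before retraining

def monitor(read_temp_c, recalibrate, poll_s=1.0):
    """Retrigger calibration whenever temperature drifts past the threshold.

    read_temp_c and recalibrate are placeholder callables standing in for
    whatever the platform actually exposes; real designs typically do this
    in hardware or firmware rather than in a software polling loop.
    """
    recalibrate()
    last_cal_temp = read_temp_c()
    while True:
        if abs(read_temp_c() - last_cal_temp) > RECAL_THRESHOLD_C:
            recalibrate()
            last_cal_temp = read_temp_c()
        time.sleep(poll_s)
```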
Analog circuits often are designed based on what are known as “mission profiles.” So a specific function in an autonomous vehicle would represent a mission profile for IP designed for self-driving cars.
“One of the big problems we’re seeing is that depending on how they are being operated, there is not just one use case,” said Art Schaldenbrand, senior marketing manager in Cadence’s IC and PCB Group. “There are multiple ways a device can fail. So we look at different stresses for how something might fail. BTI (bias temperature instability) might only fail on 10% of devices, but that’s the worst case stress. So we need to have better ways to express degradation. A finFET is going to have different stress than a planar device, so you’re modeling different phenomena.”
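One way to fold a mission profile into a lifetime estimate is linear damage accumulation, summing the fraction of life consumed at each operating condition. The numbers below are invented purely to show the bookkeeping, not real ADAS qualification data.

```python
def accumulated_damage(mission_profile, hours_to_failure):
    """Linear (Miner's-rule-style) damage accumulation over a mission profile.

    mission_profile lists (hours, condition) segments describing how the part
    is actually used; hours_to_failure(condition) returns the modeled lifetime
    if the part ran continuously at that condition. Damage >= 1.0 means the
    profile exceeds the modeled lifetime. All values here are illustrative.
    """
    return sum(hours / hours_to_failure(cond) for hours, cond in mission_profile)

# Hypothetical 15-year ADAS profile (~131,000 hours total)
profile = [
    (2_000,   "high load, 125C"),   # sustained compute in summer heat
    (18_000,  "normal, 85C"),       # typical driving
    (111_000, "parked, 40C"),       # ignition off, mild standby stress
]
lifetimes = {"high load, 125C": 20_000, "normal, 85C": 150_000, "parked, 40C": 2_000_000}
print(f"accumulated damage: {accumulated_damage(profile, lifetimes.get):.2f}")  # 0.28 here
```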
Packaging and other unknowns
As device scaling slows down, more companies are turning to advanced packaging to improve performance and provide more flexibility in designs. So far, it’s not entirely clear how to model advanced packaging to determine stresses and aging. This is partly because there are so many packaging options available that no one is sure which ones will win, and partly because many of these packages are relatively new, so what goes on inside them still needs to be characterized over time.
“Package layers may be too close to other components or stresses from the other side,” said Helic’s Abadir. “That needs to be modeled. Even before it ages, it has to be modeled with aging built in because the number of effects is increasing. So placement becomes important. If you move things around, you move the frequency of resonance. There are no simple rules to go by. You have to analyze the design, and if you see a problem, you may need to move things around.”
There are other anomalies in complex designs that can affect reliability over time. Some use models, for example, may turn circuits on and off more frequently than others, which puts stress on circuits.
“If something is idle for too long, it will experience different aging than other circuits,” said Jushan Xie, senior software architect at Cadence. “And the smaller a device, the stronger the aging effects. The stress will be higher and the aging will be faster.”
How all of this will be addressed isn’t entirely clear yet. At least some of this will involve new materials and new technologies.
“For the power electronics, this is driving the move from silicon-based devices to SiC and GaN, which can operate at much higher switching frequencies, with higher efficiency, and withstand higher temperatures,” said John Parry, industry marketing manager for electronics at Mentor. “In some applications, this is allowing the power electronics to be brought closer to the motor drive, and so into a higher temperature environment. In other cases, the semiconductor being able to withstand higher temperatures means less cooling is required. However, the semiconductor has to be packaged, and the package then also has to withstand these higher temperatures. There is huge investment in new technologies, such as sintered silver for use as a die attach material, and clips instead of conventional wire bonds, so the packaging of power devices such as IGBTs is experiencing massive changes in materials, processing technologies, and design.”
Conclusion
What is changing is the awareness that aging, stress and other effects are becoming more problematic as designs move to advanced nodes or are used for extended periods of time in new markets where safety is a factor.
“The baseline is that customers are asking questions today,” said Fraunhofer’s Lange. “Their starting points are different, depending on who you talk to, but there is a higher frequency of questions. Most people are just getting started. They’re seeing higher voltages or higher temperatures and there are some experiments underway to extrapolate for overstress. But understanding how degradation can affect a complete circuit is harder. There is still work to be done for complex chips.”
But as awareness grows, so does investment in solving these problems. Degradation modeling and aging are just beginning to show up on chip designers’ radar. As with power a decade ago, that’s all about to change.
Related Stories
Transistor Aging Intensifies At 10/7nm And Below
Device degradation becomes limiting factor in IC scaling, and a significant challenge in advanced SoCs.
Will Self-Heating Stop FinFETs?
Central fins can be up to 50% hotter than other fins, causing inconsistent threshold behavior and reliability problems.
How Reliable Are FinFETs?
Chipmakers wrestle with EOS, ESD and other power-related issues as leading-edge chips are incorporated into industrial and automotive applications.
Quality Issues Widen
Rising complexity, diverging market needs and time-to-market pressures are forcing companies to rethink how they deal with defects.