Reliability challenges are changing for engineering teams, and they are crossing traditional boundaries.
Chip reliability is coming under much tighter scrutiny as IC-driven systems take on increasingly critical and complex roles. So whether it’s a stray alpha particle that flips a memory bit, a long-dormant software bug, or a latent hardware defect that suddenly causes problems, it’s now up to the chip industry to prevent these problems in the first place, and to solve them when they do arise.
By the time these systems reach manufacturing — or worse still, when they malfunction in the field — the ability to fix issues is both limited and costly. So systems vendors and foundries have shifted the problem left in the design-through-manufacturing flow, all the way back to the initial architecture and layout, followed by much more intensive verification and debug.
Reliability depends on fixing issues that may crop up at every step of the flow. The challenge at the chip level is ensuring that increasingly complex chips are also capable of functioning throughout their lifetimes in deeply nuanced applications and use cases.
“We’ve gone from the traditional semiconductor concepts of reliability to engineering teams wanting to analyze more on the system side of things, to interactions with things like soft errors, as well as software,” said Simon Davidmann, CEO of Imperas Software. “For example, in automotive ISO 26262 qualification, one of the things that’s really worrying for developers is that due to the small geometries of the silicon, there is the potential for random bit flips in memory caches from cosmic rays, and they want to know if the software is resilient enough. Will the system survive if certain errors occur? With a certain level of randomness, how does the software survive? Will the car keep steering? Will the brakes keep working if the caches get damaged?”
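That kind of resilience question is typically explored with fault-injection campaigns on an instruction-set simulator or virtual platform. The sketch below is purely illustrative, not any vendor's flow; `run_workload` is a hypothetical callback standing in for executing the software on a simulated memory image that can be corrupted.

```python
import random

def flip_random_bit(memory: bytearray) -> None:
    """Flip one randomly chosen bit in a simulated memory image (models a cosmic-ray upset)."""
    byte_index = random.randrange(len(memory))
    bit_index = random.randrange(8)
    memory[byte_index] ^= 1 << bit_index

def fault_injection_campaign(run_workload, golden_result, image: bytes, trials: int = 1000) -> float:
    """Return the fraction of injected single-bit faults the software survives.

    run_workload(memory) -> result is a hypothetical stand-in for running the
    application on a simulator whose memory/cache contents we can corrupt.
    """
    survived = 0
    for _ in range(trials):
        memory = bytearray(image)      # fresh copy of the memory image
        flip_random_bit(memory)        # inject the upset
        try:
            if run_workload(memory) == golden_result:
                survived += 1          # error was masked, corrected, or tolerated
        except Exception:
            pass                       # a crash counts as a failure
    return survived / trials
```

The survival rate from such a campaign is one input into arguing that the software layer provides the resilience the standard expects.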
Traditional metrics like bathtub curves, CMP modeling, and SEM pitches constituted the bulk of reliability benchmarks a decade ago. Since then, more metrics have been added from design through manufacturing, and even into the field where real-time monitors can measure how a device is performing at any given time. And there are many more people who are utilizing those metrics.
“One of these interested parties is the material scientist,” said Matthew Hogan, product management director for reliability applications at Siemens Digital Industries Software. “They are looking at electromigration, for instance. ‘What’s the latest metal alloy that we can use that’s harder, that reduces electromigration, that helps with the design, but is also compatible with the rest of the design ecosystem, and the sleeves and the inserts that we use for vias? We might want to use that on certain specific metal layers.’ A couple of years ago, there was a big front page spread [in an industry journal] on how Intel was using metal alloys, and it was going to be the next best thing. There’s been lots of research and ‘the sky is falling’ proclamations for electromigration because the nodes are getting smaller. FinFETs can push current at significantly higher densities, but the wire thicknesses are getting thinner. And yet, we still seem to be able to make chips generation after generation after generation. What’s happening now is the design margins we used to have are being eroded, so we as an industry are trying to understand with greater clarity the actual design margins that we have to be looking at for the successful use of this design.”
That complicates reliability analysis. While the term still defines a set of measurements and statistical techniques for estimating the likelihood a given product, circuit, or device will fail, achieving confidence that it will work consistently and predictably across a broad set of variables is a huge challenge.
“Since there are several mechanisms by which a piece of hardware can fail, there are many different types of reliability tests engineers perform,” said Matthew Ozalas, application development engineer and scientist at Keysight Technologies. “Many common tests are accelerated, whereby devices are subjected to stress conditions beyond normal operation and monitored to infer failure metrics over a much longer period than the test. Some common accelerated reliability tests are high temperature operating life (HTOL), where a sample set of parts are run at a high temperature under electrical operation; high temperature storage (HTS), where a sample of parts are stored in an ‘off’ state at a high temperature; and highly accelerated temperature and humidity stress test (HAST), where a device is subjected to high humidity and temperature levels, possibly under electronic stimulus.”
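Accelerated test results such as HTOL are commonly extrapolated back to use conditions with an Arrhenius thermal-acceleration model. The snippet below is a generic illustration of that arithmetic; the activation energy, temperatures, and stress duration are placeholder values, not figures from any specific qualification.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Thermal acceleration factor AF = exp[(Ea/k) * (1/T_use - 1/T_stress)], temperatures in Kelvin."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Placeholder example: 1,000 hours of HTOL at 125°C, 0.7 eV activation energy, 55°C use condition.
af = arrhenius_acceleration_factor(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
print(f"Acceleration factor ~{af:.0f}; 1,000 stress hours ~ {1000 * af:,.0f} equivalent use hours")
```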
Other types of reliability analysis subject the device directly to well-known failure conditions, such as electrostatic discharge. “That involves applying a specific number of high-voltage test signal ‘zaps’ to an externally accessible node under normal or modified electrical operation, and then monitoring failure after the stress signals have been applied,” Ozalas said. “If the device passes, the voltage is increased until it fails. Then it’s given a rating.”
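The pass/fail-then-step procedure Ozalas describes amounts to a simple loop. In this sketch, `zap_and_check` is a hypothetical callback representing the tester that applies the discharge pulses and then verifies the device; it is not a real instrument API.

```python
def esd_step_stress(zap_and_check, start_voltage: float, step: float,
                    max_voltage: float, pulses_per_level: int = 3) -> float:
    """Step the ESD test voltage up until the device fails; return the last passing level.

    zap_and_check(voltage, pulses) -> bool is a hypothetical callback that applies
    `pulses` discharge pulses at `voltage` and returns True if the device still passes.
    """
    last_pass = 0.0
    voltage = start_voltage
    while voltage <= max_voltage:
        if not zap_and_check(voltage, pulses_per_level):
            break                      # device failed at this level
        last_pass = voltage            # highest level the device has survived so far
        voltage += step
    return last_pass                   # basis for assigning the ESD rating
```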
Additionally, some mechanical stress tests might be relevant to electronics, like flex and vibration. These are usually more relevant to package or board designs, as opposed to semiconductors — but not always. These tests do add to the complexity of trying to figure out exactly what can go wrong before a device is shipped, and what did go wrong after it’s in the field.
Much of this falls under the general heading of failure analysis. “This is the concept that everything that comes in gets categorized before they actually know what the real value is,” said Siemens’ Hogan. “There’s a push to call it electrically induced physical damage (EIPD), instead of calling it electrical overstress (EOS) or electrostatic discharge (ESD), or something else. If it is put in this category of EIPD, it means that once you figure out the failure analysis, you have to go back and re-categorize that because if you first call it ESD or EOS, people go running around with their hair on fire saying, ‘We’ve got to talk to this team and that team.’ But the failure analysis person, they’re still figuring out the actual cause. So, with this category EIPD, now you’ve got a category that you can actually research, understand, and find the true fault mechanism.”
Further, Hogan noted there have been lots of graphs on failure returns. “’What does that graph of why are we getting these chips back look like?’ We are really big on this idea of verification before the chip goes out to make sure that we’re avoiding those problematic areas, either by leveraging the foundry rule decks, which are brilliant in many cases, or by adding plus-one checks that you have internally. Those foundry rule decks provide a baseline of reliability for you, and then you complement that baseline with your additional checks.”
From a chip perspective, one of the key measures of reliability is signal integrity. This may sound straightforward enough, but there are a lot of moving pieces in a complex system.
Consider what happens with higher data rates in DDR5, for example. “You have a very wide parallel bus that is pseudo-single-ended, in terms of the signaling,” said Rami Sethi, vice president and general manager at Renesas Electronics. “But as you start trying to run at 4.8 gigabits per second, which is the starting point for DDR5, and combine that with the fact that now we’re designing chips that are going to run at 5.6 and 6.4 giga-transfers per second (GT/s), you start running into a lot of challenges around signal integrity and data timing. As a result, we’re implementing techniques that you would see more in the high-speed serial world. The goal is speed and data integrity. Those go hand in hand. Also, there’s the under-appreciated element of the DIMM server model. It’s a multi-drop bus, so you’re not just going point-to-point. You’re actually going point-to-multipoint, so you have to deal with all of the classic signal integrity problems, and even power integrity problems.”
This will be especially key as system design becomes less deterministic and more probabilistic. That raises the issue of what level of accuracy is needed for a particular application, and how to measure reliability if that accuracy shifts.
“In the server world, the notion of the classic five nines availability and the RAS requirements drives, especially on the signal integrity side, a pretty high bar,” Sethi said. “As engineering teams try to add additional memory or a larger memory footprint to CPUs, this often is done by adding more memory channels. But it’s very difficult to scale beyond the two DIMM slots per channel that most servers operate with today. So what do you do? You add more channels. But that means the physical area that the DIMM slots occupy is much larger, and they move further away from the CPU on the board just by virtue of having more of them. The signal integrity problems continue to compound as more memory channels are added.”
Vertical segmentation matters
Different industries have different reliability techniques and requirements. Keysight’s Ozalas said in some cases, the tests are the same, but the specifications are more stringent. “In other cases, the tests are different or unique, too. For example, test and measurement products typically have longer operating lives than cellular user equipment (UE). So the HTOL test setup may be the same for an IC used in both types of products. But if the IC is going into a test and measurement application, it will have more stringent specs for mean time to failure (MTTF), which will require design engineers to adhere to different boundary conditions in their design. For space electronics, these parts need to meet higher MTTF specs, but they also need to meet radiation hardening requirements, and test and measurement or cellular UE products are not subjected to these specifications.”
From a tools standpoint, not much changes from one market segment to the next. What does change is how much time is spent with those tools.
“Utilize your automated tools, be consistent,” said Hogan. “Do the same thing every time. But what you are checking for is very different, depending on industry verticals. If you’re doing electronics for one application, you may have different failure modes, and different design requirements, and different reliability checks that you want to do compared to someone else in an adjacent vertical.”
In automotive, for example, the tool chains used might be exactly the same. “But the rule decks and the checks, and the expectations for longevity and how much you care about these variances, could be vastly different depending on how much time you expect this product to be used in the market,” he said. “What is the cost of recalls? Is it a kid’s toy that’s only going to be used for six months, and you really don’t care because it’s a throwaway item? Is it a car that needs a recall even 5 or 10 years later? Depending on what industry you’re in, the ICs used in a consumer product would have very different care-abouts than automotive, which might be used in functional safety or an infotainment system. So even within automotive, there are these factions.”
The same holds true for different consumer or industrial components, as well as IoT.
“If you’re looking for certain types of analysis for certain types of reliability, you have to define the buckets that you’re going to put things in — the terminology, etc., along with the thresholds of what you determine as unreliable and reliable,” Davidmann noted. “It’s about how well tested and verified is this piece of technology? Is it a prototype? Is it a research thing? Has it been tested in the real world, and somehow that will be related to this?”
Davidmann pointed to NASA’s Technology Readiness Level (TRL) scale, which assesses readiness on a scale of one to nine. TRL 9 is limited to technology that has been “flight proven.”
Fig. 1: NASA’s technology readiness levels. Source: NASA
Reliability analysis for analog versus digital
Two of the major causes of reliability failure are physics and circuit design, but those are very broad areas with lots of possible permutations, and engineers working on those designs have very different goals and expectations.
“Analog and digital circuits often use the same devices with the same physics, but the designs are different, so they stimulate different failure mechanisms in the devices,” Keysight’s Ozalas explained. “For example, at a high level, both an analog and digital circuit might undergo HTOL testing, but the failure induced by the test could be due to an entirely different mechanism within the semiconductor (i.e. electromigration versus hot carrier injection), because the type of circuit determines the type of stress applied to the device. This means engineers must consider different types of failure physics when designing analog versus digital circuits.”
Even for the same application, reliability needs can change. “Over the last four or five years, there’s been a greater interest in voltage-aware DRC (design rule checking),” said Hogan. “That makes sure trace spacing is good for manufacturing, but underneath each of those wires there’s oxide, and you can have time-dependent dielectric breakdown of that oxide on the signals. If I’ve got a 1.5 volt line next to a 1.8 volt line, what’s the spacing that I need from those versus other 1.8 volt lines or 0.95 volts? And the 0.95 volts might be just the minimum manufacturing rule. That’s great. But now if you’ve got a 1.5 volt or even a 0.5 volt signal that’s floating next to those, what extra spacing do you need to avoid that dielectric breakdown and make sure the design is going to be more reliable? That’s more on the functionality of the chip, and what you care about as a designer to make sure that you’re going to capture that.”
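In practice, voltage-aware spacing rules are often expressed as a table keyed by the voltage difference between adjacent nets. The sketch below only illustrates the idea; the table entries are made-up placeholders, and real values come from the foundry rule deck.

```python
# Hypothetical voltage-aware spacing table: (max voltage difference in volts, required spacing in nm).
# These numbers are placeholders for illustration, not foundry rules.
VOLTAGE_SPACING_RULES = [
    (0.5, 40),    # up to 0.5 V difference: minimum manufacturing spacing applies
    (1.0, 60),
    (1.8, 90),
    (3.3, 150),
]

def required_spacing_nm(v_net_a: float, v_net_b: float) -> int:
    """Return the spacing required between two nets, given their operating voltages."""
    delta_v = abs(v_net_a - v_net_b)
    for max_delta, spacing in VOLTAGE_SPACING_RULES:
        if delta_v <= max_delta:
            return spacing
    raise ValueError(f"No rule covers a {delta_v:.2f} V difference")

# A 1.5 V line next to a 1.8 V line needs less clearance than a 0.5 V line next to a 1.8 V line.
print(required_spacing_nm(1.5, 1.8))   # 0.3 V delta -> 40 nm (placeholder)
print(required_spacing_nm(0.5, 1.8))   # 1.3 V delta -> 90 nm (placeholder)
```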
This is also one of the biggest differences between analog and digital designs. “For analog, you’re constantly thinking about the subtle design issues, including symmetrical failures,” Hogan said. “How do I have an array of devices for an airbag, for example, that are going to age consistently? You might have to put some dummy devices at the edge of those, such that when you’re closer to wells or other design structures, it’s the dummy devices that have different aging criteria rather than the active devices in the center of that cluster. In analog constraint checking, analog designers must address these nuanced ideas of making sure there’s symmetry within the design, that you’re taking care of these dummy devices, that the analog structures you’re creating have the right patterns and structures through their implementation so they behave the way that you want.”
This is very different from the digital perspective, where a lot of the focus is on timing, leakage, and multiple power domains. “How do I manage the power envelope that I have? With so many applications on batteries now, how do I make the battery last longer? Battery technology has not accelerated as fast as anyone would like, so what we’ve had to do on the design side of things is be smarter and more elaborate in the way we manage the different powers, power structures, and power domains in the design, turning parts of the chip off, running them at slower speeds,” Hogan said. “There’s lots of innovative thinking on how to stretch the life and longevity out of the structures we have so they can meet the power requirements. But from a reliability perspective, when we go and switch through these different parts of the design, how do we make sure that we’ve got the right structures in place so we can seamlessly be doing those switches and not tripping into design issues?”
Electromigration is another part of the analog reliability analysis equation. “We’re big into voltage drop and electromigration on both the digital and analog sides,” said Marc Swinnen, director of product marketing for the semiconductor division of Ansys. “For analog, we have a dedicated version of a tool that has the same fundamental algorithms and solvers, but is targeted at the transistor level. It looks at the design at the transistor level and generates SPICE-level reports. The inputs, the outputs, and some of the questions you ask are somewhat different.”
That’s just one piece of the puzzle, though. “At the chip level, we also look at electrostatic discharge, which is another reliability problem,” Swinnen said. “There’s specific transistor-level checking that needs to occur, and traditionally that’s been done really late as part of the LVS run. But customers really want to do it during the design cycle, so they use an ESD checker.”
Fig. 2: Output life prediction plot showing semiconductor wearout. Source: Ansys
Conclusion
What is different today is just how much these increasingly critical systems now depend on chips. In cars, the most critical functions used to be entirely mechanical. On top of that, the electronics now do more than the mechanical systems ever did, such as preventing accidents caused by blind spots or by drivers failing to recognize brake lights quickly enough.
“Since nearly every system that we care about starts with an IC, we’ve redefined the ‘reliability analysis’ term to ‘reliability verification,’” said Hogan. “Analysis is a review of the results that have happened. You got a chip back to your FEM lab, they pulled it apart for you, and told you what happened. Or you did some simulations, fancy or not, and you’re using that for directional guidance of what may happen. From a verification perspective, what we’re trying to do is encourage the foundries and the design companies to use those learnings and that experience to create design rules that will avoid these problematic areas of design.”
And whereas traditional checks covered quite a bit under the “reliability analysis” terminology, today’s complex systems require a lot of other analyses to make sure they are reliable, including some that go beyond verification.
“Verification is just analyzing for correctness,” said Imperas’ Davidmann. “Reliability is analyzing for correctness over time. How long is this system going to stay up and running? Also, how do you know things are bug-free? Every now and again my iPhone reboots. Why is that? It’s because it’s detected something is not right. You can’t prove software’s not buggy, so you write lots of software around that, and include monitors that say, ‘That’s not right. Let me reboot.’ Or you build monitors to help you have more uptime. If my Linux machine crashes, it’s down. If my phone crashes, it comes back. There is a lot we as an industry have to worry about when it comes to analyzing the reliability of systems.”
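The monitor-and-reboot pattern Davidmann describes is essentially a watchdog: the application must keep proving it is healthy, and a supervisor restarts it when it stops doing so. A minimal, purely illustrative sketch of that idea, with the restart action left as a caller-supplied callback:

```python
import threading
import time

class Watchdog:
    """Restart the monitored system if it stops 'kicking' the watchdog in time."""

    def __init__(self, timeout_s: float, restart):
        self.timeout_s = timeout_s
        self.restart = restart             # callback that logs the fault and reboots/restarts
        self.last_kick = time.monotonic()
        self._lock = threading.Lock()

    def kick(self):
        """Called periodically by healthy application code to signal liveness."""
        with self._lock:
            self.last_kick = time.monotonic()

    def monitor(self):
        """Supervisor loop: if no kick arrives within the timeout, trigger a restart."""
        while True:
            time.sleep(self.timeout_s / 2)
            with self._lock:
                stalled = time.monotonic() - self.last_kick > self.timeout_s
            if stalled:
                self.restart()             # recover instead of staying down
                self.last_kick = time.monotonic()
```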