Failure Analysis Becoming Critical To Reliability

Once confined to analyzing returns, it now is shifting left and right as more data analytics are applied to both digital and analog.


Failure analysis is rapidly becoming a complex, costly and increasingly time-consuming must-do task as demand rises for reliability across a growing range of devices used in markets ranging from automotive to 5G.

From the beginning, failure analysis has been about finding out what went wrong in semiconductor design and manufacturing. Different approaches, tools and equipment have improved over the decades, most recently with in-circuit monitoring, new equipment to test and inspect more circuits in parallel, and more data analysis to find patterns and improve coverage. But chipmakers and systems companies now are more dependent on failure analysis for everything from digital to analog ICs and IP, and they are applying it to multiple devices in a package, as well.

“We’ve seen customers looking at the tradeoffs between test cost versus quality, and figuring out how many chips they need to burn in,” said Matthew Knowles, silicon learning product marketing manager at Mentor, a Siemens Business. “We see a lot more dependence on test of operational software and adaptive test, where it can automatically decide what test patterns to use based on Tcl (tool command language) quality data. That’s opening up a lot of opportunity to do things better, because the way people have tested devices in the past is astonishingly unsophisticated. It’s an alternative to extending burn-in, which is capital- and time-prohibitive.”
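As a rough illustration of the adaptive-test idea, the sketch below picks a test plan for each die from upstream quality data, so that only suspect dies are sent to the expensive steps. It is a minimal sketch under stated assumptions, not any vendor's flow: the `DieQuality` fields, the thresholds, and the plan names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DieQuality:
    """Hypothetical per-die quality record gathered before final test."""
    die_id: str
    wafer_yield: float   # fraction of passing dies on the same wafer
    margin_sigma: float  # distance of a key parametric from its limit, in sigma


def choose_test_plan(q, min_yield=0.90, min_margin=3.0):
    """Start from a base pattern set and escalate only when the data look risky.

    The thresholds are illustrative assumptions, not industry values.
    """
    plan = ["base_scan"]
    if q.margin_sigma < min_margin:
        plan.append("extended_parametric")
    if q.wafer_yield < min_yield:
        plan.append("extended_scan")
    # Burn-in is reserved for dies that trip both risk indicators.
    if len(plan) == 3:
        plan.append("burn_in")
    return plan


healthy = DieQuality("A", wafer_yield=0.97, margin_sigma=5.2)
marginal = DieQuality("B", wafer_yield=0.97, margin_sigma=2.1)
risky = DieQuality("C", wafer_yield=0.82, margin_sigma=1.5)

for die in (healthy, marginal, risky):
    print(die.die_id, choose_test_plan(die))
```

In this toy example only die C is routed to burn-in, which is the cost argument in miniature: the longest step is applied where the data suggest it is needed rather than to every part.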

This is far different from failure analysis in the past, which was largely done with returned items rather than proactively analyzing devices on a continuous basis throughout their lifetime.

“A lot of this stuff couldn’t be done before,” said Keith Arnold, who focuses on early life failures at PDF Solutions. “Now it’s much more data-focused, so we have physical failure analysis, electrical failure analysis and data failure analysis. With all of this, you can start to predict when a failure will occur rather than just figuring out why a device failed. This is just the tip of the iceberg for where this can go, too. If you look across multiple process steps, from ingots to finished products, there are very few people who have systems in place to do this kind of stuff at this point. But if you can truly do early life failure predictions, which is the likelihood of each die failing in the field, then you can steer the manufacturing process. You also can develop a more dynamic process because of that.”

Others are taking a similar route.

“Until recently, many of the classical solution providers have relied on simulation for reliability,” said Kiki Ohayon, vice president of business development at OptimalPlus. “They would base their models on design data assumptions. That is no longer sufficient. You also need production data. We are now collaborating with simulation companies to help them understand the blind spot in their models.”

It’s also more difficult to detect failures in multi-chip modules and packages, and in complex systems.

“You need to understand the overall integrated impact on a package and on a chip,” said Ohayon. “If you’re using a device in a different environment, that can have an effect. But you also need to understand that in the context of a system. You cannot analyze this in an isolated manner, which is what happened in the past. We’re hearing from more and more customers that they want to integrate data from different stages of a device’s lifecycle.”

Parametric failure analysis
Alongside that data, failure analysis techniques also are being applied in more places in the supply chain than in the past. They are shifting far left into the lab and the materials used in manufacturing, and right all the way into post-manufacturing monitoring. This is easier said than done, however. Challenges for analyzing data and preventing failures, particularly in the field, grow significantly at each new process node.

“More people are looking at parametric deltas as opposed to what’s good and what’s bad,” said Carl Moore, yield management specialist at yieldHUB. “These are subtle changes rather than just pass/fail. It could be the delta of shifts in output for two or more tests, or you may be looking at the delta on one pin versus eight other pins. And if you have multiple repeated blocks on a chip, you can compare the results on those blocks. This is a whole new area of analyzing data to define deltas.”
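A small sketch can make the difference between pass/fail and parametric deltas concrete. It compares each of nine repeated blocks on a die against the median of its siblings. The readings, the spec limit, and the delta limit are invented for illustration, and the median-of-siblings comparison is just one simple way to define a delta, not a method described by yieldHUB.

```python
from statistics import median


def block_deltas(readings):
    """Return each block's deviation from the median of its sibling blocks."""
    deltas = {}
    for name, value in readings.items():
        siblings = [v for n, v in readings.items() if n != name]
        deltas[name] = value - median(siblings)
    return deltas


def flag_outliers(readings, limit):
    """Names of blocks whose delta from their siblings exceeds the given limit."""
    return sorted(n for n, d in block_deltas(readings).items() if abs(d) > limit)


# Hypothetical leakage readings (microamps) for nine identical blocks on one die.
readings = {
    "blk0": 1.02, "blk1": 0.98, "blk2": 1.01, "blk3": 1.00, "blk4": 0.99,
    "blk5": 1.03, "blk6": 0.97, "blk7": 1.00, "blk8": 1.90,
}
SPEC_LIMIT = 5.0   # assumed pass/fail limit
DELTA_LIMIT = 0.5  # assumed limit on deviation from sibling blocks

all_pass = all(v <= SPEC_LIMIT for v in readings.values())
suspects = flag_outliers(readings, DELTA_LIMIT)
print(all_pass, suspects)  # prints: True ['blk8']
```

Every block passes the spec limit, so a pass/fail screen sees nothing unusual. The delta view singles out `blk8`, which reads nearly twice as high as its eight siblings.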

That also opens the door to analyzing analog and mixed signal devices in the context of other chips or systems. “Digital is almost like analog, anyway, at the smallest scale,” said Moore. “But you also can look at parasitic frequencies from switching and RF combined with digital.”

Analyzing analog failures
That doesn’t mean traditional test approaches are less important, but they do need to be put into the context of how a device will operate in the real world. Some of that can be done ahead of time to ensure reliability.

“Scan test is still valid for a number of digital parts of the devices that we’re testing,” said Alejandro Buritica, senior solutions marketing manager at National Instruments. “But as devices become integrated with a highly digital controller with a number of analog components, and it’s becoming more of a mixed signal device, then scan will not be able to provide the coverage, or exercise those analog pieces of mixed-signal devices. So you can write to the registers and try to configure the device in specific ways, but you still need to be able to force DC and RF signals through the device to get the measurements that you’re looking for to verify the performance of the device.”

Some of it also can be done to predict failures in analog, which is a new development in the failure analysis field.

“What we’re looking for are more parasitic frequencies from switching, in addition to RF combined with digital,” said yieldHUB’s Moore. “Analog and MEMS are harder to test, because you’re getting tiny signals from sensors, and those are getting more dense. So for designers who develop a chip, now what matters is what’s around that chip and how that chip will be used. The difference is that with digital you can do this with a highly parallelized test. With analog, it’s not as easy. For power and MEMS chips, you may get 8 or 16 pieces on a tester at once, not thousands, so it’s slower. The reliability of power chips is more challenging than logic, too. Different temperatures on a die cause more stress, so you see more and more analysis of tiny subtleties in the data. There’s also a problem with different power modules integrating with other things. So you may have 10-micron metal lines integrating with 2-micron lines because analog doesn’t need state-of-the-art line widths. Still, they have shrunk over the years. Layouts are tighter and they are now switching at different frequencies. On top of that there are different materials, which can be a factor.”

Beneath all of this are the beginnings of a move toward predictive analytics rather than failure analysis, and this is especially challenging in the analog world.

“One of our customers is developing hydraulic pumps for industrial uses,” said Olaf Enge-Rosenblatt, group manager for computational analytics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “They start with a system that is okay, because there is no failure, and they make an initial measurement. Over time, they learn to detect trends and changes in patterns and signals, and those provide a trend analysis that you can use to identify anomalies. But the first step is to describe a good system, and that means you not only have to look at classical measurements like vibration. Along with that, you have to include all of the conditions and parameters in operation. So you have a shaft rotating at a certain speed. If you change that, you can interpret a change in the vibration and take that into account. But if all you’re doing is looking at a single vibration number, that won’t tell you much.”

Enge-Rosenblatt said the measurements and math are the same as in the past, but the interpretation of that data in context can provide much more insight into what is happening in a device or system. That is especially valuable on the analog side, where data tends to be much less structured than on the digital side.
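The point about interpreting a measurement in its operating context can be sketched in a few lines. The example below first describes a "good" system by fitting vibration against shaft speed on healthy data, then judges new readings by how far they stray from that baseline rather than by the raw vibration number. The data, the linear model, and the four-sigma threshold are all assumptions chosen for illustration. Fraunhofer's actual method is not described in this article.

```python
from statistics import mean, pstdev


def fit_baseline(speeds, vibrations):
    """Least-squares line vibration = slope * speed + intercept, plus residual spread."""
    mx, my = mean(speeds), mean(vibrations)
    sxx = sum((x - mx) ** 2 for x in speeds)
    sxy = sum((x - mx) * (y - my) for x, y in zip(speeds, vibrations))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(speeds, vibrations)]
    return slope, intercept, pstdev(residuals)


def is_anomalous(speed, vibration, model, k=4.0):
    """Flag a reading whose distance from the baseline exceeds k residual sigmas."""
    slope, intercept, sigma = model
    return abs(vibration - (slope * speed + intercept)) > k * sigma


# Hypothetical initial measurements on a healthy pump: shaft speed (rpm), vibration (mm/s).
healthy_speeds = [600, 900, 1200, 1500, 1800, 2100]
healthy_vibrations = [1.21, 1.79, 2.42, 2.98, 3.61, 4.19]
model = fit_baseline(healthy_speeds, healthy_vibrations)

# A high raw value at high speed is expected; a lower raw value at low speed is not.
print(is_anomalous(2100, 4.20, model))
print(is_anomalous(900, 2.60, model))
```

With these numbers the 4.20 mm/s reading at 2,100 rpm is consistent with the baseline, while the smaller 2.60 mm/s reading at 900 rpm is flagged. A single vibration threshold would have ranked the two the other way around.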

Equipment changes
The equipment for developing that kind of diagnostic data has evolved, as well. Microscopy has long played a leading role in failure analysis, and the field has improved significantly. Basic optical microscopes have given way to atomic force microscopes, scanning electron microscopes, scanning acoustic microscopes, photoemission electron microscopes, infrared microscopes, stereomicroscopes, scanning SQUID (superconducting quantum interference device) microscopes, and USB microscopes, among other tools.

“Failure analysis evolved from scientists and engineers using microscopes to look at metal lines and to look at quality issues,” said Mentor’s Knowles. “There were optical microscopes, at first, before moving on to SEMs and other types of microscopes. Now it’s evolved to a very sophisticated, multi-staff process, where you may have 28 pieces of equipment and techniques to use, including electrical and electron microscopes, and design information. It’s evolved from someone looking through an optical microscope to someone in a lab with tens and tens of millions of dollars’ worth of equipment. Now the question is how do you use that best.”

All of these tools are still in use. What’s different is they are being inserted at various points throughout the design-through-manufacturing flow, with post-manufacturing data collected and processed either in real-time or, if a failure is not imminent, during routine maintenance. The challenge at this point is being able to structure that data to make sense of it, and that begins to cross over into the problem of access to data, particularly the data in the foundries.

“Our customers give us just enough data to improve sensors,” said Subodh Kulkarni, president and CEO of CyberOptics. “And we offer tools in software where they can do more than just look at the raw data. That data can be used to add intelligence into different layers and, with the use of AI, make things much more predictive.”

More data is better, of course. There are more potential interactions than in the past, and there will be many more as advanced packaging begins to proliferate as a way of improving performance with lower power in a post-Moore’s Law world. And that, in turn, will impact failure analysis and up-front failure prevention.

“A big new focus for us is the extension of front-end factory process control methodologies and technologies into package and printed circuit board (PCB) manufacturing,” said Chet Lenox, senior director of industry and customer collaboration at KLA. “This was the thinking behind our purchase of Orbotech. The overall trend that we are banking on is that the increased emphasis on sophisticated packages to combine die and improve overall system performance will drive more emphasis on quality and process control. Packaging, PCB, and system integration used to be a relatively low-cost low-value-add part of the overall flow, so it wasn’t really conducive for our technologies. But that’s changing rapidly with fan-out based packages for mobile parts and increased use of sophisticated multi-chip modules for high-performance computing. Those processes will need solutions that look a lot more like front-end semiconductor fabs and less like a cheap wire-bonder followed by a macro optical review.”

Failure analysis is becoming more sophisticated and timely, and it is increasingly blending into failure prevention and reliability planning. This represents a dramatic shift for a task that used to resemble after-the-fact forensics. It is now evolving into a state-of-the-art analysis of what can and does go wrong.

“Failure analysis was used primarily if you had a failing chip, so we would go to the lab and they would look at it with a few techniques, mostly optical microscopy and some SEMs, electron microscopy, and maybe some electrical test,” said Mentor’s Knowles. “People would always test their chips with functional test and parametric test. That was pretty good. There always has been a tension between the laboratory doing the failure analysis and the factory itself. There was never enough time. And then, as things got more complex and the cost and time of test were growing exponentially, the industry introduced scan test, and this is where the design-for-test technology comes in. We put the design elements in the scans all over the chip, and then you have the automatic test pattern generation, where there are simulation tools that can understand the design of the chip and say, ‘This is the set of patterns we want that tester to apply.’ That reduced the cost of test quite significantly. What it also did was to provide an opportunity for the failure analysis folks to leverage that for diagnosing issues and understanding the problems at the chip level.”

It’s unlikely that after-the-fact failure analysis will ever disappear, particularly as demand for reliability continues to increase as chips are used across safety-critical types of applications. But finding the cause of those failures is no longer a months-long investigation. Answers are required in days, and sometimes hours, and that information needs to be looped back directly into the manufacturing process as quickly as possible to limit the number of failures and potential recalls. This is now state-of-the-art analysis, and both the turnaround time and the accuracy of those results are becoming increasingly valuable on all sides.

—Jeff Dorsch contributed to this article.
