Reducing Costly Flaws In Heterogeneous Designs

Why finding and fixing errors in AI and automotive chips is so difficult.


The cost of defects is rising as chipmakers begin adding multiple chips into a package, or multiple processor cores and memories on the same die. Put simply, one bad wire can spoil an entire system.

Two main issues need to be solved to reduce the number of defects. The first is identifying the actual defect, which becomes more difficult as chips grow larger and more complex, and whenever chips are packaged together with other chips. The second problem is figuring out the root cause of a defect and how to prevent it from occurring again.

Both of these issues need to be set in the context of new device architectures, advanced packaging, increasingly smaller features with more noise and potential for signal disruption, and new applications of technology in different markets. For example, there is little historical data about how some of these chips will perform over time, particularly in automotive, industrial and aerospace applications where the chips may be required to run for decades. And not all defects are severe enough to kill a chip initially, but a latent defect that might not cause a problem in a smart phone might cause a total system failure after a decade or more of abuse in a car or in an AI chip where some circuits are always on.

“There are two big factors in test,” said Ron Press, technology enablement director at Mentor, a Siemens Business. “The first is lifespan. The second is size. Some of these chips have thousands of disparate blocks. The challenge today is to identify a block that went bad, and in some cases how to use one of the other blocks as a spare. Processor companies have been doing this for years with multi-core designs, where they decide the performance spec of the chip. There may be four cores, but one doesn’t perform. But now you have thousands of cores and there is no standard way companies are reconfiguring them.”

There also is no standard way of tracing the cause of a defect back to a single die on a wafer. While that may not matter in a consumer application, it can have a big impact in a safety-critical application. To make matters worse, the problem may materialize at different points in a device’s lifecycle, depending upon how and where that device is used.

“Unless you have some kind of ECID (electronic chip identification) on these chips, they cannot be traced back to a wafer,” said Dave Huntley, who heads business development at PDF Solutions and serves as the single device tracking task force leader at SEMI. “In the past you might see one of the advanced chips with an ECID, but that doesn’t take into account all of the other devices around it. And as you add in process steps and assembly, it gets more complicated. So if you’re wirebonding a part, the bonding might be electrically good, but one wire may have had a slightly different profile. If you get any indication of a failure, you need to track that back to the device and track where it fits in the wafer and assembly process. But what’s been missing is single device tracking through assembly.”

What to do about that isn’t always clear, because defects may only show up in some die on a wafer during one step in the manufacturing process. As the number of steps increases with packaging, and as tolerances tighten for each of those steps, it becomes increasingly difficult to pinpoint where a defect originated.
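
The single-device tracking Huntley describes can be pictured as a small genealogy table. The Python below is only a sketch under stated assumptions: the ECID strings, the lot/wafer/x/y fields and the Genealogy class are hypothetical names chosen for illustration, not the format of any SEMI standard or vendor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DieOrigin:
    """Where a die came from in the fab (all fields are hypothetical)."""
    lot_id: str
    wafer: int
    x: int
    y: int


class Genealogy:
    """Links each die's ECID to its wafer origin and to the package it went into."""

    def __init__(self):
        self._origin = {}    # ECID -> DieOrigin
        self._packages = {}  # package ID -> list of ECIDs

    def register_die(self, ecid, origin):
        self._origin[ecid] = origin

    def assemble(self, package_id, ecids):
        self._packages[package_id] = list(ecids)

    def trace_package(self, package_id):
        """Return the wafer origin of every die inside a package."""
        return {ecid: self._origin[ecid] for ecid in self._packages[package_id]}

    def wafer_neighbors(self, ecid):
        """Return ECIDs of the other dies that share a lot and wafer with this die."""
        source = self._origin[ecid]
        return sorted(
            other for other, origin in self._origin.items()
            if other != ecid
            and (origin.lot_id, origin.wafer) == (source.lot_id, source.wafer)
        )


g = Genealogy()
g.register_die("E1", DieOrigin("LOT7", 3, 10, 12))
g.register_die("E2", DieOrigin("LOT7", 3, 11, 12))
g.register_die("E3", DieOrigin("LOT9", 1, 4, 5))
g.assemble("PKG-A", ["E1", "E3"])
print(g.trace_package("PKG-A"))   # origin of each die in the package
print(g.wafer_neighbors("E1"))    # other dies from the same wafer
```

With records like these, a field return on one package can be traced to the wafers involved, and the other dies from those wafers can be flagged for a closer look.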

Inside out, outside in
Following the data from the test and inspection side is essential to ensuring quality, but increasingly it’s only one piece of the solution. The other piece involves watching what is happening inside the chip or package, or inside various manufacturing processes. There is a big push toward in-circuit and in-process data collection, and getting all of the pieces to line up hasn’t been easy.

One iteration of this approach involves embedding sensors into a chip, package or system to monitor behavior during production and throughout its lifetime.

“There are two ways to mitigate the risks of a failed package — test each die better in the first place, and select dies that work well together for each package,” said Alex Burlak, vice president of test and data analytics at proteanTecs. “By monitoring the margin to the frequency of millions of paths in each IC, chip and system, vendors gain performance coverage visibility, early yield fluctuation detection, correlation with pre-silicon models, margin and hot-spot detection, on-chip variation, etc., which ultimately can lead to a 10X DPPM (defective parts per million) improvement.”

That data is then fed into machine-learning analytics to improve yield and performance.

“Vast parametric data is extracted and inferred from each die,” Burlak said. “Die classification, including what to expect from each die from power/performance point of view, is performed at a very early stage — wafer sort, even before packaging. With this data, you can start optimizing the selection and interaction of the chips in each package or system. Vendors can, for instance, choose the faster dies for the high-performance bin, the slower dies for the low power bin. That way they can avoid having the weakest link determine the power and performance of the assembled package. Of course, they also can mix high-performing dies with low-performing dies to level out the interactions between them, and that way increase yields.”
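
The weakest-link effect Burlak mentions is easy to see in a toy example. The Python sketch below assumes each die has a single measured maximum frequency from wafer sort and that a package runs at the speed of its slowest die. The die names, frequencies and function names are hypothetical, and real die matching would weigh power, margin and many other parameters.

```python
def package_speed(dies):
    """A package runs only as fast as its slowest die (simplifying assumption)."""
    return min(fmax for _, fmax in dies)


def group_in_order(dies, dies_per_package):
    """Assemble dies into packages in whatever order they arrive."""
    usable = len(dies) - len(dies) % dies_per_package
    return [dies[i:i + dies_per_package] for i in range(0, usable, dies_per_package)]


def group_by_speed(dies, dies_per_package):
    """Rank dies by measured speed first, so fast dies end up packaged together."""
    ranked = sorted(dies, key=lambda die: die[1], reverse=True)
    return group_in_order(ranked, dies_per_package)


# (die ID, max frequency in GHz) from wafer sort; values are made up.
dies = [("d1", 3.2), ("d2", 2.1), ("d3", 3.0), ("d4", 2.2)]

unmatched = [package_speed(p) for p in group_in_order(dies, 2)]
matched = [package_speed(p) for p in group_by_speed(dies, 2)]
print(unmatched)  # [2.1, 2.2]
print(matched)    # [3.0, 2.1]
```

In arrival order, both packages are limited by a slow die. Grouping by speed produces one package for the high-performance bin and one for the low-power bin from the same four dies.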

At the base of most of these solutions is a hybrid approach to test and data analytics. Increasingly, data is being viewed as the common thread between various process and design steps, with machine learning as the best way to interpret and utilize that data.

“Everyone is trying to find things sooner,” said Doug Elder, vice president and general manager at OptimalPlus. “What customers want us to do is collect data earlier in the process as you move up into the fab. If you have a return from an automotive customer, that can cost you $50,000 to $100,000 to deal with an automotive application. If you get it before it goes out the door, it may only be $30,000. So your overall yield and efficiency will go up, and your costs will go down. The key is an open architecture that allows you to bring in data from any source and run analytics on that at any point in the process.”

That approach is essential, because with new technologies such as 5G, not everything is accessible at every step of the design-through-manufacturing flow.

“The challenge at 5G is that some of the testing has to be done at the system level, rather than back at functional test or earlier in the process,” Elder said. “If you collect that data, then you can look at it and draw some conclusions and feed some of that back into your process.”

Structural abstraction
One of the major considerations in all of this is how to restructure various flows to be able to collect and utilize data from all parts of the flow. This requires breaking down some of the silos that have been created to compartmentalize and optimize various processes and operations. The problem is that many of those silos have quickly become outdated or obsolete as new markets emerge and design approaches, architectures and design priorities shift.

“The starting point is whoever is feeling pain,” said George Zafiropoulos, vice president of solutions marketing at National Instruments. “The guys in the lab are the first downstream part of the flow. So it’s pre-silicon EDA, post-silicon validation, lab bring-up, validation, characterization, and production test. The typical approach has been that everything lives in a silo, and after that you don’t know what else is out there. We’ve seen cases where people stop using a simulator because they don’t have time to understand the results or learn all of the pitfalls.”

Even where there are obvious linkages between different silos, gaps still exist.

“What you need to do is build a base layer, which is basically the infrastructure to use machine learning,” Zafiropoulos said. “Once you have the data platform, you can take data from different sources, clean it and distribute it. Then you can use machine learning, and in the test world machine learning is applicable to modifying test sequences. What AI also allows you to do is figure out whether you need to test all permutations at every voltage and temperature, or whether you just need to test a subset. You don’t want to do 100 tests only to find out that you fail on the last one. If you think about automotive, what this allows you to do is correlate results across an entire flow.”
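
The point about failing on the last of 100 tests can be illustrated without any machine learning at all. The Python sketch below swaps in a simple heuristic: it assumes tests fail independently, that testing stops at the first failure, and that each test has a known duration and a historical fail rate. The test names and numbers are hypothetical. Under those assumptions, running tests in descending order of fail rate per second minimizes the expected time spent on each part.

```python
def expected_test_time(tests):
    """Expected seconds per part when testing stops at the first failing test."""
    expected, p_still_passing = 0.0, 1.0
    for _, seconds, p_fail in tests:
        expected += p_still_passing * seconds   # this test runs only if all earlier ones passed
        p_still_passing *= 1.0 - p_fail
    return expected


def order_fail_fast(tests):
    """Run the tests most likely to fail per second of test time first."""
    return sorted(tests, key=lambda t: t[2] / t[1], reverse=True)


# (name, duration in seconds, historical fail rate); all values are made up.
tests = [("leakage", 2.0, 0.01), ("scan", 5.0, 0.10), ("vmin", 1.0, 0.05)]

reordered = order_fail_fast(tests)
print([name for name, _, _ in reordered])
print(expected_test_time(tests), expected_test_time(reordered))
```

A learned model could replace the fixed fail rates with per-part predictions, but the ordering logic would stay the same.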

And this is where some of the fundamental changes are occurring.

“In the past, automotive has done P-PAT (parametric part average testing) screening, where they use parametric parts average testing to not only screen out based on the spec of devices and their electrical performance, but also any devices that look like they’re outside of the normal distribution even if they are in spec,” said Chet Lenox, director of process control solutions for new technology and R&D at KLA. “That tends to reduce the latent reliability failures using a purely electrical test format. You basically have all of this test data already, but you apply a more stringent requirement. This is not necessarily tighter specs. It’s statistical parts average testing. So you lose some yield — you throw away 1%, 2% or 5% of your die — but you get much more reliable die.”
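
A rough sketch of that kind of statistical screen appears below in Python. It is one possible formulation, not Lenox’s or KLA’s method: the limits are set at the median plus or minus six robust sigma, with sigma estimated from the interquartile range. The factor of six, the estimator, the die IDs and the readings are all assumptions made for illustration.

```python
import statistics


def pat_limits(values, k=6.0):
    """Outlier limits from the population itself: median +/- k robust sigma."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    robust_sigma = (q3 - q1) / 1.35   # the IQR of a normal distribution is about 1.35 sigma
    center = statistics.median(values)
    return center - k * robust_sigma, center + k * robust_sigma


def pat_screen(readings, spec_lo, spec_hi, k=6.0):
    """Classify each die as spec_fail, pat_outlier (in spec but abnormal) or pass."""
    in_spec = {die: v for die, v in readings.items() if spec_lo <= v <= spec_hi}
    pat_lo, pat_hi = pat_limits(list(in_spec.values()), k)
    verdicts = {}
    for die, value in readings.items():
        if die not in in_spec:
            verdicts[die] = "spec_fail"
        elif pat_lo <= value <= pat_hi:
            verdicts[die] = "pass"
        else:
            verdicts[die] = "pat_outlier"
    return verdicts


# Twenty well-behaved dies near 1.00, one in-spec outlier, one out-of-spec die.
readings = {f"d{i}": 1.00 + 0.01 * ((i % 5) - 2) for i in range(20)}
readings["odd"] = 1.30
readings["bad"] = 1.60

verdicts = pat_screen(readings, spec_lo=0.5, spec_hi=1.5)
print(verdicts["d0"], verdicts["odd"], verdicts["bad"])
```

The die labeled "odd" is inside the 0.5 to 1.5 spec window, yet it is rejected because it sits far from the rest of the population. That is the yield given up in exchange for reliability.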

And that’s where the emphasis is shifting, particularly in markets such as automotive and in devices that are expected to last more than a couple years of sporadic use.

“What’s exciting now is doing the same thing for in-line defect data with I-PAT, which is inline part average testing,” Lenox said. “You literally take all of the data off of a die, typically using a fast optical inspection tool, and you screen out a few percent of the die purely based on defect data in-line. Even if it passes sort, you don’t want to ship it because it may have latent defect reliability data later, or the test program coverage may not have been good enough to have caught that particular defect. We’re seeing a lot of new techniques in this area, and we are developing the analytical techniques and the tools to screen these defects.”
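
The inline version of the screen can be sketched the same way. The Python below assumes inspection produces a list of (die, defect class) records, weights each defect class by how worrying it is, and rejects the worst-scoring dies up to a fixed fraction of the wafer. The defect classes, weights and 2% budget are hypothetical placeholders, not values from any inspection tool.

```python
# Hypothetical severity weights per defect class.
DEFECT_WEIGHTS = {"particle": 1.0, "scratch": 3.0, "bridge": 5.0}


def die_scores(defects):
    """Sum weighted defects per die; defects is an iterable of (die_id, defect_class)."""
    scores = {}
    for die_id, defect_class in defects:
        scores[die_id] = scores.get(die_id, 0.0) + DEFECT_WEIGHTS.get(defect_class, 1.0)
    return scores


def ipat_screen(all_dies, defects, reject_fraction=0.02):
    """Return the dies to pull, limited to reject_fraction of the wafer.

    Only dies with at least one detected defect can be rejected.
    """
    scores = die_scores(defects)
    budget = int(len(all_dies) * reject_fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:budget])


all_dies = list(range(100))
defects = [(7, "bridge"), (7, "scratch"), (42, "bridge"), (3, "particle")]
print(ipat_screen(all_dies, defects))   # the two worst-scoring dies
```

Dies 7 and 42 are pulled here regardless of whether they later pass electrical sort, which mirrors the idea of screening purely on inline defect data.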

Human in the loop…sometimes
This dovetails with some broad changes underway in the AI world, where recognition is growing that machine learning and deep learning can be very useful tools, but they don’t always stand on their own. So rather than AI functioning as an independent system, which would require massive amounts of programming and reprogramming, it is most effective when coupled with human expertise and deep domain knowledge.

Machine learning is a key component in all of this because it can be used to identify connections and patterns in data that are far beyond the level of detail a human can follow. But just applying machine learning to a problem doesn’t necessarily yield good results. It needs to be used with a purpose in mind, and it needs to be combined with deep domain knowledge.

“The ideal approach is a blend of machine learning and domain expertise,” said Kevin Robinson, director of customer service at yieldHUB. “Machine learning can sift through different dimensions in different ways. But it’s not that simple. If you present this data as output, it’s often not that enlightening, and it’s very easy to get lost. Before you start machine learning, you need to consider what you’re trying to do and what you want out of that data. If you do that right, you get results that can help you prioritize actions. It’s more straightforward.”

The emphasis is on using machine learning as a tool, rather than as a black-box technology that can be used to solve any problem.

“The industry is shifting toward active learning, where you use machine learning to augment processes you are already doing,” said Jeff David, vice president of AI solutions at PDF Solutions. “So if you’re using machine learning to predict the yield on wafers, a human can review the results. Machine learning will take it 80% of the way, and humans take it the last 20%. So basically we’re using machine learning to make analytics better. This approach also helps take the opaqueness out of machine learning and gives control back to the customers.”
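
One simple way to picture that division of labor is uncertainty sampling, a common active-learning tactic. The Python sketch below is an assumption-laden illustration rather than PDF Solutions’ method: it supposes a model has already produced, for each wafer, a probability of being low-yield, and it sends the wafers the model is least sure about to an engineer for review. The wafer IDs, probabilities and review budget are made up.

```python
def pick_for_review(predictions, budget):
    """Return the wafers whose predicted probability is closest to a coin flip.

    predictions maps wafer ID -> model probability that the wafer is low-yield.
    """
    by_uncertainty = sorted(predictions, key=lambda wafer: abs(predictions[wafer] - 0.5))
    return by_uncertainty[:budget]


predictions = {"W1": 0.95, "W2": 0.52, "W3": 0.10, "W4": 0.40}
print(pick_for_review(predictions, budget=2))   # ['W2', 'W4']
```

The confident calls on W1 and W3 are handled automatically, while the borderline wafers get human attention. The engineer’s decisions can then be fed back as new training labels.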

This also requires a different way of thinking about machine learning. Rather than thinking of it as a single tool, the new approach treats it as a collection of tools in a toolbox.

“If you think about deep learning, it’s good if you apply it to the right circumstances,” said David. “But if you apply it to a shallow data set it does not work well. So you need a good understanding of the machine learning marketplace, what you’re using the machine learning for, and you also need to understand that it’s not black magic.”

Looking to the future
What comes next is a progression toward predictive analytics and ultimately self-correction.

“What you’re doing with SPC (statistical process control) analysis is looking for patterns and trends,” said Tim Skunes, vice president of technology and business development at CyberOptics. “The first decision is if it’s a go/no-go. Maybe it’s out of spec or the build is not good or the package is warped. Next you want to look for patterns in the data to determine if there’s uniformity. You can screen printers through a feedback loop, for example, and make adjustments based on a sea of data. From there you use deep learning techniques to identify the sources of variation and patterns and trends. There may be 50 different solder pastes on a board. You need to figure out which one is causing the problem. This is particularly important for high-reliability parts, such as in the automotive market.”
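
The first two steps Skunes describes, a go/no-go check followed by a search for trends, can be sketched as a basic control chart. The Python below assumes a baseline of known-good measurements, control limits at three standard deviations, and a drift rule that fires after eight consecutive points on one side of the center line. The solder paste volume numbers, the three-sigma factor and the run length are illustrative assumptions, not CyberOptics’ settings.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Lower limit, center line and upper limit from known-good measurements."""
    center = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return center - k * sigma, center, center + k * sigma


def check_process(measurements, baseline, run_length=8):
    """Return (indexes outside the limits, index where a one-sided run is first detected)."""
    lcl, center, ucl = control_limits(baseline)
    out_of_control = [i for i, v in enumerate(measurements) if not lcl <= v <= ucl]

    drift_at, run, side = None, 0, 0
    for i, value in enumerate(measurements):
        current = (value > center) - (value < center)   # +1 above, -1 below, 0 on the line
        if current != 0 and current == side:
            run += 1
        else:
            run = 1 if current != 0 else 0
        side = current
        if drift_at is None and run >= run_length:
            drift_at = i
    return out_of_control, drift_at


# Hypothetical solder paste volumes, as a percentage of nominal.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
measurements = [101, 99, 107, 100.5, 101, 102, 101, 103, 102, 101, 104]

print(check_process(measurements, baseline))   # ([2], 9)
```

The third reading fails the go/no-go check outright. The later readings are all within limits, but the unbroken run above the center line flags a drift that a feedback loop to the printer could correct.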

Test will catch some of these defects, but not all of them.

“We’re working on a project for a company where their end customer is finding defects in silicon that do not show up in test,” said yieldHUB’s Robinson. “They’re not generating enough data yet, but once they do they will be able to use machine learning to look at relationships between siloed data. That could be thousands or millions of potential relationships. The key is being able to separate out issues and drill down and isolate them.”

The ultimate goal is to self-correct problems discovered in the field, and relay them back into production to prevent future problems. But that kind of capability is still in early research for most electronics. Breaking down silos for data and finding relationships in data is an important first step, but it’s just the beginning of a broad shift toward using this kind of data analysis more effectively.
