
Chasing Test Escapes In IC Manufacturing

Data analytics can greatly improve reliability, but cost tradeoffs are complicated.


The number of bad chips that slip through testing and end up in the field can be significantly reduced before those devices ever leave the fab, but the cost of developing the necessary tests and analyzing the data has sharply limited adoption.

Determining an acceptable test escape metric for an IC is essential to improving the yield-to-quality ratio in chip manufacturing, but what exactly is considered acceptable can vary greatly by market segment — and even within the same market depending upon a specific use case or time frame. The goal has been to reduce the number of failures in the field as chips have become more complex and as they become an essential part for safety-critical and mission-critical applications, but the emphasis on quality has been creeping into other markets, as well.

In the 1990s, quality engineers set the limit for desktops and laptops at 500 defects per million (DPM). With volumes of 1 million units per week, a computer system company can easily detect escapes at that level. Today, automotive OEMs are demanding 10 ppm for much more complicated devices, even though car makers may find it challenging to measure escapes at this DPM level. Finding those escapes involves a deeper look into data, which in turn requires investments in data management, data analysis tools, and the engineering effort needed to make this all work.

With every test-time reduction decision, test content deliberation, and response to a test escape, the engineering teams that determine test content must grapple with the constructive tension in the yield/quality/cost triangle. And fundamental to all of this is having enough good data.

“We have an interesting problem within the semiconductor industry in that generally production yields are extremely high, which means there just isn’t that much fail data,” said Keith Schaub, vice president of technology and strategy at Advantest America. “So how do you develop a model to detect failures when it almost never fails? That’s a difficult problem. You must come up with some creative data-blind techniques, where you try to get the model to look for something different than the norm or out of the ordinary.”

The primary driver of these predictive models to detect test escapes has been feedback from customers. The reason is that “out of the ordinary” failed parts may be perfectly good in a customer system. If test escapes go down and the yield impact of failing good parts is minimal, then the new test metric is good enough.

How much data do you need?
In responding to customer returns, product, quality, and yield engineers revisit the yield/quality/cost triangle tradeoffs. Quality issues need to be addressed, and if this means that some good parts get thrown out, the quality engineer generally deems this an acceptable loss to make the customer happy.

This may sound odd, yet it makes sense to quality and yield engineers. To begin with, yield is assessed in terms of percentage, while quality is measured in ppm.

Moreover, to effectively chase test escapes, engineers need enough production volume to have the feedback from the end customer’s system. The more escapes, the less production volume engineers need to determine whether an issue exists. From there, assessing whether a new test adequately screens for test escapes requires just enough volume. Those numbers do not have to be the same.

No test is perfect. All of them are likely to fail a few good dies or units. Those falsely rejected good parts commonly are referred to as overkill. If the fallout from a new test is in the range of 100 ppm, a yield engineer won’t blink an eye. A fallout of 1,000 ppm could be the battle line over which a yield engineer and a quality engineer argue. Yet in response to a customer failure, the quality engineer normally will win. If the yield loss is excessive, then the product engineer needs to investigate other possible tests to distinguish bad from good parts.

Ratio of bad to good parts failed
How many good parts get thrown out when you apply a test? You can only measure this if you bother to look.

A system test, or an engineering-level characterization of a parameter, remains the final arbiter in labeling a true failure. Consider two different real-world scenarios involving good parts that were failed. The first revolves around a measurement of I/O timings. Comparing the ATE pass/fail determination for timing measurements against a characterization of the failed parts showed a ratio of 1 true fail to 2 truly good parts. The second involved implementing an outlier detection technique to detect escapes. The escapes measured on the order of 100 ppm. The outlier detection technique caught the escapes and also failed approximately twice as many good parts, as measured by a system-level test. By coincidence, both examples found a ratio of 1 failing part to 2 good parts. For the second example, detecting 100 ppm of customer fails results in approximately 300 ppm total fails, with 200 ppm as yield loss.
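The arithmetic in the second example can be written out as a short sketch. The function name and dictionary keys are illustrative assumptions, not part of any tool mentioned here. It simply splits a screen's total fallout into true fails and overkill for a given ratio of good parts failed per bad part caught.

```python
def screen_fallout_ppm(true_fail_ppm: float, good_per_bad: float) -> dict:
    """Split a screen's total fallout into true fails and overkill (in ppm).

    good_per_bad is the number of good parts failed for each true fail
    caught, e.g. 2 for the 1-fail-to-2-good ratio described above.
    """
    overkill_ppm = true_fail_ppm * good_per_bad
    total_fail_ppm = true_fail_ppm + overkill_ppm
    return {
        "true_fail_ppm": true_fail_ppm,
        "overkill_ppm": overkill_ppm,
        "total_fail_ppm": total_fail_ppm,
        # Yield is quoted in percent: 10,000 ppm equals 1%.
        "yield_loss_pct": overkill_ppm / 1e4,
    }


# 100 ppm of true escapes caught at a 1:2 bad-to-good ratio
result = screen_fallout_ppm(true_fail_ppm=100, good_per_bad=2)
print(result)
```

Expressed as yield, the 200 ppm of overkill is 0.02%, which illustrates why quality is tracked in ppm while yield is tracked in percent.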

So how much data do you need to determine a test limit or predictive model that distinguishes between good and bad parts?

“The short and simple answer is, ‘How accurate do you want to be?’” said Jeff Roehr, IEEE senior member and 40-year veteran of test. “You can start implementing lot-based adaptive test limits after about 30 parts, if you can accept 10% error. The accuracy improves significantly (about 1% error) when the sample reaches 300 parts.”

These numbers assume a Gaussian distribution for the parameter of interest. Such errors would change if the distribution is bimodal, for example.
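As a rough illustration of lot-based adaptive limits, the sketch below estimates limits from the first parts tested in a lot and clamps them to the datasheet spec. The function name, the 6-sigma multiplier, the spec values, and the sample readings are all assumptions chosen for illustration. Like the numbers above, it presumes a roughly Gaussian parameter.

```python
import statistics


def adaptive_limits(sample, spec_lo, spec_hi, k=6.0):
    """Return (lo, hi) limits at mean +/- k*sigma of the sample.

    The limits never extend beyond the fixed spec limits. k=6.0 is an
    assumed multiplier, not a value taken from the article.
    """
    if len(sample) < 2:
        raise ValueError("need at least two readings to estimate sigma")
    mean = statistics.fmean(sample)
    sigma = statistics.stdev(sample)  # sample standard deviation
    lo = max(spec_lo, mean - k * sigma)
    hi = min(spec_hi, mean + k * sigma)
    return lo, hi


# Hypothetical readings from the first 30 parts of one lot
first_30 = [1.00 + 0.01 * ((i % 5) - 2) for i in range(30)]
lo, hi = adaptive_limits(first_30, spec_lo=0.80, spec_hi=1.20)
print(f"adaptive limits: {lo:.4f} to {hi:.4f}")
```

With a larger sample the estimates of the mean and sigma settle down, which is the effect Roehr describes. A bimodal lot would inflate sigma and widen the limits, so a real flow would check the distribution shape first.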

If engineers have previous product history to base their test methods on — i.e., always do static part average testing on this product — they can be comfortable with 30,000 units, which has an approximate error of 0.01%.

It’s not always necessary to have a large data set to verify the effectiveness of a new test screen. Engineers can have confidence even with smaller data sets if they have feedback from a customer system. What is required, though, are unique IDs.

Ken Butler, strategic business creation manager at Advantest America, highlighted the differences between large SoCs and analog products. “With large SoCs, there is almost always an electrical chip ID (ECID) available, so you can track it all the way through manufacturing. For analog devices, ECIDs are less common because the die sizes are extremely small, and you just can’t afford the die area to do it,” said Butler. “So for outlier analyses, you often have to run open loop, meaning that you don’t have specific failing chips you can use as targets to develop your outlier screens. In such situations, you will want to use as many wafers as possible to determine your screening parameters. But not every IC product line has a large amount of material available, so you use whatever you have. The concern is that if you create a screen based on [only a few] wafer lots, the likelihood that you’re going to see enough process variation in the sample is low. Then you’re likely to miss some defect mechanisms that you might otherwise catch with more data.”

The challenge, then, is that failures happen at such a low incidence that you need enough volume to discern they exist. Once you know they exist, you can study them and figure out what makes them different from good units. In the case of test escapes that impact the customer, the failures may seem random, which makes it seem impossible to determine a test screen.

Determining a test to detect escapes
For an escape rate of 100 ppm, a customer needs a minimum volume of only 30,000 units to see the problem, although 300,000 units gives engineers more confidence in its magnitude. This provides enough information to go into the detailed data analysis needed to determine “one of these things is not like the other.”
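A quick calculation shows why these volumes are plausible. At 100 ppm, 30,000 units yield about 3 expected escapes and 300,000 units yield about 30. The sketch below assumes escapes occur independently and uses a Poisson approximation, which is a modeling assumption and not something stated by the engineers quoted here. The function names are illustrative.

```python
import math


def expected_escapes(rate_ppm, volume):
    """Expected number of escapes at a given ppm rate and shipped volume."""
    return rate_ppm * 1e-6 * volume


def prob_at_least(k, rate_ppm, volume):
    """Poisson probability of observing at least k escapes."""
    lam = expected_escapes(rate_ppm, volume)
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer


for volume in (30_000, 300_000):
    print(
        f"{volume:>7} units: "
        f"expected escapes = {expected_escapes(100, volume):.1f}, "
        f"P(at least 1) = {prob_at_least(1, 100, volume):.3f}, "
        f"P(at least 5) = {prob_at_least(5, 100, volume):.3f}"
    )
```

Under this model a handful of returns is likely at the smaller volume, but the count is too small to size the problem well, which is consistent with the preference for the larger volume.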

The number of publicly documented cases for how to manage test escapes is extremely limited. This is understandable, because such stories expose both the IC vendor and the end customer. But the value cannot be overstated. These cases provide the evidence that outlier detection testing works, even when engineers cannot find the physical evidence.

“In 2005 we were having a field return problem with a product that represented an escape of 100 ppm. Our analysis indicated that these field returns just simply didn’t work in the customer system, yet it passed all of our tests applied on ATEs,” said Roehr. “System-level test (SLT) was not part of our production flow and we couldn’t afford to add SLT. We did isolate the nature of the field return to know that an extensive engineering characterization could distinguish the field returns from parts that passed both SLT and ATE-based tests. We couldn’t afford the test time to run that engineering characterization type test on our ATE.”

So the question becomes: Can some other test parameter be used to distinguish the field returns from good parts that passed system-level test?

“We started digging into the data,” Roehr said. “This is one of the first cases where we found that when you look at the parts on a wafer-lot-by-wafer-lot basis, or wafer-by-wafer basis, we can start to see something. If you looked at the part across the spec range, you don’t see a problem. But when you looked at the individual parts within a lot, there were a few parts in a lot that don’t look quite like their sisters even though these parts were well within spec.”

He noted that failure analysis on selected parts never determined a definitive defect mechanism, and he surmised that the change in behavior was due to a timing-related failure — a signal path with a bit more delay. In addition, a small sample of parts that failed the new test was run through the system test. Not all parts failed the system test, but enough of them failed to provide confidence there now existed a sufficient screen to detect all the field returns.
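The within-lot comparison Roehr describes can be sketched as follows. This is not his team's actual method, which the article does not detail. It is a minimal example that scores each part against the median and median absolute deviation (MAD) of its own lot. The function name, the threshold of 6, the spec limits, and the data are hypothetical.

```python
import statistics
from collections import defaultdict


def flag_lot_outliers(parts, threshold=6.0):
    """Flag parts that sit far from their own lot's distribution.

    parts: iterable of (part_id, lot_id, value) tuples.
    Returns the part_ids whose robust z-score within the lot exceeds
    the threshold (an assumed value).
    """
    by_lot = defaultdict(list)
    for part_id, lot_id, value in parts:
        by_lot[lot_id].append((part_id, value))

    flagged = []
    for rows in by_lot.values():
        values = [v for _, v in rows]
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread to compare against; skip this lot
        scale = 1.4826 * mad  # scales MAD to sigma for Gaussian data
        for part_id, v in rows:
            if abs(v - med) / scale > threshold:
                flagged.append(part_id)
    return flagged


SPEC_LO, SPEC_HI = 0.80, 1.20  # hypothetical datasheet limits

# Two lots centered at different values, both well inside spec
parts = [(f"A-{i}", "A", 0.90 + 0.005 * ((i % 5) - 2)) for i in range(20)]
parts += [(f"B-{i}", "B", 1.10 + 0.005 * ((i % 5) - 2)) for i in range(20)]
# One lot-A part that would look unremarkable across the whole population
parts.append(("A-odd", "A", 1.00))

flagged = flag_lot_outliers(parts)
print(flagged)
```

Every part here passes the fixed spec limits, and the odd part even falls between the two lot centers, so it only stands out when compared with its siblings in lot A. Median and MAD are used so that the outlier itself does not distort the lot statistics.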

ROI for data collection and analytic platforms
Looking at all test data for a differentiator can provide engineers with a magnet for finding the needles in the haystack, which are the test escapes. Yet without adequate investment this may not be possible. Recounting test escape stories similar to Roehr’s, other product engineers said it may take 9 to 12 months before they learn of a test escape issue. Then they need to delve into the test data archives. To do so with ease requires an investment in data collection, storage, and analytics. In addition, due to data alignment issues and business barriers to data sharing, this is an easier task for product engineers at IDMs than at fabless companies.

“Segmented supply chains and the lack of data sharing are still general data management gaps to be overcome in the classic data flow: Customer design to foundry to OSAT to customer. To help address this today, we are seeing more ‘turnkey’ manufacturing options for fabless customers,” said Mike McIntyre, director of software product management at Onto Innovation. “These build options help with data consolidation, but unfortunately these options are limited both in breadth of supported technologies, application, and number of participants.”

Semiconductor data analytic companies sell their yield management platforms to fabless companies, foundries, IDMs, and OSATs, because these customers want to understand their respective roles in IC performance and quality. Rarely can anyone predict up front the new outlier detection techniques that will be needed for a product.

The question engineering managers ask teammates who want to invest in outlier detection up front is, “What’s the return on investment?” It’s a challenge to answer with no prior engineering experience showing the value. The cost side of the yield/quality/cost triangle comes up. Managers want to know how much money they will save if their team spends the engineering effort to find outliers up front. Another question engineers will ask is how they will know these outliers are true failures when feedback from the system application takes 9 to 12 months.

Industry sectors with products that have safety concerns may warrant an upfront identification of potential outlier tests. For these products, the risk mitigation has a return on investment. For large SoCs going into computing systems and ASIC devices with lower part volumes, it is harder to justify because the ROI is not as clear.

“We can improve DPM by getting rid of outliers. Well, how much does it really improve quality?” stated Phil Nigh, R&D test engineer at Broadcom. “So, let’s look at testing a typical digital SoC/ASIC. How much additional DPM do you detect by avoiding outliers? My experience has been maybe as much as 10%. And 10% is not a lot. I would state a lot of customers would not be able to measure that 10% DPM change for relatively low-volume products.”

Conclusion
Test escapes from customer returns will continue to occur, and product, yield, and quality engineers will need to respond. With today’s yield and test data analytic platforms, assessing test data for possible outliers that may impact a customer system is now possible. Identifying them ahead of time seems pointless to most product engineers because they already are applying all known tests.

Test data analytic platforms can identify test parameter combinations that show distinct population differences. However, most engineers remain skeptical without proof that an outlier part fails in a customer system, and ultimately DPM only can be measured at the end customer system. Not all outliers will be indicative of a part that will fail a system.



