Using Analytics To Reduce Burn-in

A data-driven approach can significantly reduce manufacturing costs and time, but it’s not perfect.

Silicon providers are using adaptive test flows to reduce burn-in costs, one of the many approaches aimed at stemming cost increases at advanced nodes and in advanced packages.

No one likes it when their cell phone fails within the first month of ownership. But the problems are much more pressing when the key components in data warehouse servers or automobiles fail. Reliability expectations of complex SoCs have only grown over the past few decades, and demand for known good die in packages is increasing, particularly with the emphasis on chiplets.

Semiconductor makers strive to meet these expectations at the end of the manufacturing flow by accelerating defect mechanisms. Burn-in has been a standard test step that screens out early-life failures. The problem is that it’s expensive. So by combining adaptive test flows and statistical post-processing data analytics, engineering teams can significantly reduce that cost — sometimes to zero.

Over the past two decades engineers have exploited data from wafer probe to:

  • Modify the burn-in recipe to reduce burn-in times;
  • Screen for reliability failures at wafer sort; and
  • Reduce the percentage of devices receiving burn-in.

These techniques have been applied to both ASIC devices and complex SoCs. Work began in the early 2000s, prompted by Iddq testing becoming less effective as quiescent background current increased with each tick on the CMOS process node roadmap. As a result, relying on a single Iddq pass/fail limit created tension between yield loss and escapes. Engineers turned to adaptive test flows and statistical post-test analysis to balance the scales.

Just like the performance binning adaptive test flows in the 1990s, those early burn-in reduction test flows involved lots of custom development. In the past decade, data analytic companies have eased the customization burden for engineers by providing standardized statistical analysis reports, tools that execute the dynamic test limits on ATE, and the capability to connect data between wafer, burn-in, and package-test steps. This enables even small companies to leverage manufacturing data and adaptive test flows to reduce burn-in costs.

“We have a fabless AI startup that is a user of our entire analytics platform, and they chose it specifically because it allowed them to focus their efforts on building the best chips and systems. They relied on our products and services to collect, clean, and manage all of their data, and quickly deliver information and insights to their engineers,” said Greg Prewitt, director of Exensio Solutions at PDF Solutions. “There is no reason why any semiconductor company, IDM or fabless, public or private, could not take full advantage of big data and adaptive test today.”

To appreciate the progress made in applying adaptive testing to increasing IC reliability and reducing burn-in cost, one needs to understand why burn-in has been a necessary expense for complex digital devices.

Accelerating failure rates
All devices under sufficient stress will wear out. Seven years has been a typical lifetime spec for the microprocessors that AMD and Intel develop. An IC device may last longer than seven years, but that is never guaranteed.

No chipmaker waits seven years before launching a newer and better version, but that’s the generally accepted lifespan for servers. Electronics reliability engineers use the high-temperature operating life (HTOL) process to understand the early-life failures and the functional lifetime of the part, as measured in months and years. Reliability engineers often describe the failure rate over a part’s life with the bathtub curve. HTOL makes use of the fact that solid-state device wear-out mechanisms (aka aging mechanisms) can be accelerated by applying temperatures and voltages above their normal operating range for extended periods of time. Aging mechanisms for CMOS include negative bias temperature instability (NBTI), hot carrier injection (HCI), electromigration (EM), and time-dependent dielectric breakdown (TDDB).
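To give a feel for how elevated temperature compresses time, the sketch below uses the Arrhenius model, a common textbook way to estimate thermal acceleration. The article does not say which model any particular team uses, so the model choice, the function name `arrhenius_acceleration`, the 0.7 eV activation energy, and the 55°C/125°C temperatures are all illustrative assumptions. Voltage acceleration is not modeled here.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c, t_stress_c, activation_energy_ev):
    """Thermal acceleration factor between use and stress temperatures.

    Assumes a single failure mechanism that follows the Arrhenius model.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


# Illustrative values only: 55 C use, 125 C stress, 0.7 eV activation energy.
af = arrhenius_acceleration(55.0, 125.0, 0.7)

# Stress hours that would age a part by one year of use under these assumptions.
stress_hours_for_one_year = 365 * 24 / af
print(f"acceleration factor: {af:.1f}")
print(f"stress hours per year of use: {stress_hours_for_one_year:.0f}")
```

Real qualification plans weigh several mechanisms, each with its own activation energy, so a single factor like this is only a first-order estimate.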

Fig. 1: The ‘bathtub curve’ hazard function (blue, upper solid line) is a combination of a decreasing hazard of early failure (red dotted line) and an increasing hazard of wear-out failure (yellow dotted line), plus some constant hazard of random failure (green, lower solid line). Source: Wikimedia

Defects that only manifest themselves early in the evaluation process are called infant mortality failures. In CMOS, oxide pinholes and narrowed metal lines (which are prone to electromigration) are examples of physical defects that often result in infant mortality failures.

For all silicon products, engineers use HTOL for new product introduction evaluations. For large SoC devices, using HTOL as a production step has been part of doing business. In the latter context, engineers refer to this test flow step as the burn-in.

A burn-in module consists of a temperature-controlled chamber and PCBs that control the power to the IC devices. Placing a burn-in step into a production test process incurs the following costs: equipment, factory footprint, energy, and manufacturing time. Burn-in chambers have significantly lower throughput than the test cells (combination of handler, ATE, and associated software) used for wafer probe and final test because there is less parallelism. The test flow initially requires ATE testing before and after burn-in (see figure 2). ATE and burn-in chambers cost seven and six figures, respectively.

Fig. 2: Production burn-in flow. Source: Anne Meixner/Semiconductor Engineering

Such costs have motivated engineering teams to reduce burn-in, or eliminate it altogether.

Choosing what and how to burn-in
Using data from wafer test, engineers have modified the burn-in recipe, identified the parts most likely to fail after burn-in, and fully eliminated burn-in. To support such decisions in CMOS devices, test engineers primarily relied on Iddq test measurements. Understanding how these measurements relate to early-life failures requires an understanding of Iddq testing.

A defect in silicon can manifest electrically in multiple ways. Relying upon burn-in to accelerate a failure permitted stuck-at-fault (S@0, S@1) testing to detect the failures afterwards. As CMOS became the predominant process for computing devices, the use of Iddq testing to screen for failures became part of a test engineer’s toolbox. It detected failure modes that stuck-at-fault testing missed, and this included early-life reliability failure modes.

Iddq is the measurement of quiescent current. It is measured after an input stimulus has been applied, but not during its application. Defects result in elevated Iddq values. Starting around 1985, product and quality engineers began using Iddq testing at wafer test to achieve 0% production burn-in. For those process nodes, defects resulted in Iddq values at least one order of magnitude higher than those of defect-free devices. So with relative ease, engineers could set a pass/fail limit to successfully screen reliability failures and not cause significant yield loss.

As mentioned, shrinking process nodes made Iddq testing less effective because the quiescent current increased and its distribution became wider. Engineers responded in creative ways to keep using this measurement as a screen, and at least one engineering team cleverly used the increased current to reduce burn-in times. Both used adaptive test methods and flows to achieve their goals.

Leakier parts draw more power and hence run at a higher junction temperature, which in turn translates into shorter burn-in times. Intel engineers used this property to lower burn-in times. In a 2006 ITC paper, Intel researchers described evaluating each die’s static current and other wafer test measurements to determine the optimal burn-in recipes (time, temperature, voltage). Next, an automated feed-forward test flow directed the die into several distinct buckets, each with an optimized burn-in recipe. Segregating die into buckets by static power reduces the required stress time and the overall variation in stress temperature within each bucket.
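The feed-forward idea can be sketched in a few lines. This is a minimal illustration, not Intel’s implementation: the bucket boundaries, recipe values, and the names `BurnInRecipe`, `assign_recipe`, and `plan_burn_in` are all hypothetical, and a real flow would draw on more wafer test measurements than static current alone.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnInRecipe:
    name: str
    hours: float
    chamber_temp_c: float
    voltage_v: float


# Hypothetical bucket boundaries, in mA of static current measured at wafer test.
BUCKET_EDGES_MA = [50.0, 150.0]

# Hypothetical recipes, one per bucket. Leakier buckets get shorter stress times
# because those die self-heat more under the same stress conditions.
RECIPES = [
    BurnInRecipe("low_leakage", hours=12.0, chamber_temp_c=125.0, voltage_v=1.4),
    BurnInRecipe("mid_leakage", hours=6.0, chamber_temp_c=115.0, voltage_v=1.4),
    BurnInRecipe("high_leakage", hours=2.0, chamber_temp_c=100.0, voltage_v=1.4),
]


def assign_recipe(static_current_ma):
    """Pick the recipe for one die; a value on a boundary goes to the leakier bucket."""
    return RECIPES[bisect_right(BUCKET_EDGES_MA, static_current_ma)]


def plan_burn_in(die_currents_ma):
    """Group die IDs by recipe name so each bucket can be loaded together."""
    plan = {recipe.name: [] for recipe in RECIPES}
    for die_id, current_ma in die_currents_ma.items():
        plan[assign_recipe(current_ma).name].append(die_id)
    return plan


wafer_test_data = {"die_001": 10.0, "die_002": 75.0, "die_003": 400.0}
print(plan_burn_in(wafer_test_data))
```

Keeping the boundaries and recipes in data rather than in code makes it easier to retune the buckets as the process matures.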

Intel achieved a stunning reduction in burn-in time, greater than 90% for a high-volume 90nm product. Yet the feed-forward test flow was not the sole contributor to this reduction. A new burn-in equipment cell with a slot architecture enabled this level of segregation per burn-in board, permitting individual control of power and time in the burn-in chamber. The new cell also obviated batch processing, so the continuous handling of burn-in boards further optimized the burn-in recipe buckets.

Fig. 3: Intel’s adaptive burn-in recipe flow. Source: Anne Meixner/Semiconductor Engineering

The wide variation in Iddq currents posed a problem for test engineers who wanted to increase the test’s effectiveness. Even with the introduction of delta-Iddq test techniques by the early 2000s, it became increasingly difficult to balance yield and quality.

Unlike stuck-at tests, Iddq gives engineers a numerical value to check against a limit. With a numerical value, engineers can apply more advanced statistical methods to discern defects.

Fig. 4: Adaptive test flow to downgrade die that are highly likely to fail burn-in. Source: Anne Meixner/Semiconductor Engineering

In their 2002 VLSI Test Symposium paper, LSI engineers and a Portland State University (PSU) researcher shared how they used post-processing of wafer test data and wafer sort maps to identify likely reliability failures and customer-visible escapes. For burn-in related testing, they looked at Iddq data to identify parts that would most likely fail burn-in. Reporting the results on 0.18µm products, they described a test flow that required making decisions regarding burn-in after wafer sort and prior to final test.

The authors noted that having a single-threshold test limit for Iddq resulted in edge die being marked as fails. However, these were just faster die rather than defective ones. Plotting Iddq versus speed measurements, they observed, “Clearly the outliers are visible but setting the limit on the tester without causing high yield loss becomes difficult.” They resolved their dilemma by post-processing the wafer-level test data with a number of statistical analytical methods to determine limits (aka, virtual test).
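One simple way to see why a second variable helps is to fit Iddq against speed and flag die whose Iddq sits far above the trend for their speed. The sketch below does this with a least-squares line and a robust spread estimate. It is an illustration of the general idea only, not the method from the LSI paper, and the data, the threshold `k`, and the function names are assumptions.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_iddq_outliers(speeds, iddqs, k=5.0):
    """Return indices of die whose Iddq is far above the Iddq-versus-speed trend.

    Spread is estimated with the median absolute deviation (MAD) so that the
    outliers themselves do not inflate the threshold.
    """
    slope, intercept = fit_line(speeds, iddqs)
    residuals = [y - (slope * x + intercept) for x, y in zip(speeds, iddqs)]
    med = median(residuals)
    mad = median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad  # scales MAD to match a normal standard deviation
    return [i for i, r in enumerate(residuals) if r - med > k * scale]


# Made-up data: speed (GHz) and Iddq (mA). Die 2 is slow but leaky.
speeds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
iddqs = [10.1, 10.9, 17.0, 12.9, 14.1, 14.9, 16.1, 16.9, 18.1, 18.9, 20.1]

flagged = flag_iddq_outliers(speeds, iddqs)
single_limit_fails = [i for i, y in enumerate(iddqs) if y > 18.0]
print("trend-based outliers:", flagged)
print("single-limit fails:", single_limit_fails)
```

In this made-up data set, a single 18 mA limit rejects the three fastest die and passes the leaky slow one, while the trend-based screen does the opposite.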

Using statistical methods, they downgraded die that had passed all the simple test limits but were deemed suspicious based upon their wafer position and the wafer test result population. For both burn-in and yield, their results were compelling. To assess the improvement on burn-in reduction, they ran an experiment using 14 lots and a total of 60,105 die passing wafer sort, subjecting all downgraded die and a sample of non-downgraded die to a 24-hour burn-in. Of the 171 burn-in failures, their statistical downgrading method identified 168 (about 98%).

As knowledge of these methods spread, engineers at other companies such as IBM and Texas Instruments began to apply them to their products. They found them attractive due to the cost savings from eliminating burn-in or reducing the percentage of product going through burn-in. They did so despite the engineering investment needed to apply the complex statistical analytics and to create the customized tools to manage the product flow through the factory.

Balancing overkill and underkill
So how did the LSI team do it? They used multiple variables and rigorous statistical analytic methods to distinguish between good and bad die.

Establishing a single variable pass/fail limit from a parametric measurement always has the statistical risk of Type I and Type II errors.

In IC manufacturing, engineering teams never use the statistical terms. Equivalent terms one would hear are:

  • Type I errors = overkill, failing good parts, yield loss.
  • Type II errors = underkill, passing bad parts, escapes.

Minimizing overkill and underkill in a manufacturing test process pits yield against quality and reliability, respectively. With advanced CMOS process nodes, setting a single limit at production launch neglects two facts: distributions shift with the health and maturity of a manufacturing process, and there is higher variation around individual measurements. Engineers might call the latter the problem of dealing with noisy data. Statistical analytical methods permit using multiple measurements and die attributes (geo-spatial) to identify outliers. Overkill and underkill remain present with these more sophisticated statistical pass/fail decisions, but the risks become smaller.
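The trade-off is easy to see by sweeping a single limit across a small labeled data set. In the sketch below, the measurements, the defect labels, and the function name `overkill_underkill` are made up for illustration. Overkill is computed as the fraction of good die that fail the limit, and underkill as the fraction of defective die that pass it, which is one reasonable definition among several.

```python
def overkill_underkill(measurements, is_defective, limit):
    """Return (overkill_rate, underkill_rate) for a single upper pass/fail limit.

    A die fails the screen when its measurement is above the limit.
    """
    good = [m for m, bad in zip(measurements, is_defective) if not bad]
    defective = [m for m, bad in zip(measurements, is_defective) if bad]
    overkill = sum(m > limit for m in good) / len(good)
    underkill = sum(m <= limit for m in defective) / len(defective)
    return overkill, underkill


# Made-up Iddq values (mA): six good die followed by three defective die.
iddq = [5, 6, 7, 8, 9, 12, 10, 15, 20]
defective = [False] * 6 + [True] * 3

for limit in (8.0, 11.0, 13.0):
    over, under = overkill_underkill(iddq, defective, limit)
    print(f"limit {limit:>4}: overkill {over:.2f}, underkill {under:.2f}")
```

Because the good and defective distributions overlap here, no single limit drives both rates to zero. Tightening the limit trades escapes for yield loss, and loosening it does the reverse.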

Burn-in results combined with wafer and final test data enabled engineers at Texas Instruments to build very targeted statistical analytic models as described in their IRPS 2006 paper co-authored with Rob Daasch of PSU.

“Burn-in data combined with the Iddq data is a very rich source of information. Not only just in terms of voltage stress and response to a burn-in stress, but particularly because you can do things with outlier identification techniques. You can do pre-stress and post-stress Iddq measurements, compute deltas, and look for movement,” said Ken Butler, IEEE fellow and former test systems architect at Texas Instruments. “Then you can run all that data through an outlier algorithm to pick out the subtle mechanisms because those are the ones that are going to pop up when you get into burn-in. In order to eliminate your burn-in you have to predictively eliminate the devices that are likely to fail burn-in.”
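The pre-stress and post-stress comparison Butler describes can be sketched as follows. This is a generic robust outlier check on the Iddq deltas, not the algorithm used at Texas Instruments. The data, the threshold `k`, and the function name `flag_delta_iddq_outliers` are assumptions for illustration.

```python
from statistics import median


def flag_delta_iddq_outliers(pre_stress, post_stress, k=5.0):
    """Return indices of die whose Iddq moved unusually far across the stress.

    Movement is judged against the population's median delta, with spread
    estimated by the median absolute deviation (MAD).
    """
    deltas = [post - pre for pre, post in zip(pre_stress, post_stress)]
    med = median(deltas)
    mad = median([abs(d - med) for d in deltas])
    scale = 1.4826 * mad  # scales MAD to match a normal standard deviation
    return [i for i, d in enumerate(deltas) if abs(d - med) > k * scale]


# Made-up Iddq readings (mA) before and after a voltage stress.
pre = [10.0, 30.0, 20.0, 25.0, 15.0, 12.0]
post = [10.2, 30.1, 20.3, 27.5, 15.2, 12.1]

flagged = flag_delta_iddq_outliers(pre, post)
print("die with unusual Iddq movement:", flagged)
```

Note that the flagged die in this made-up example does not have the highest absolute Iddq. It stands out only because its current moved much more than its neighbors’ did.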

This isn’t perfect, of course. “You never catch everything,” Butler said. “In the early days (circa 2002 to 2006) when our goal was burn-in avoidance or burn-in minimization for big digital SoC devices we could use unique die ID to track the die all the way through burn-in and final test. If it failed after burn-in, you looked at the wafer data to develop a screen. For example, here’s five burn-in failures that occurred on this wafer, go and find a correlating parameter that would allow me to predict those failures.”

Others agree. “The increased complexity of manufacturing data and its volume, together with the need to increase quality, result not only in the need for efficient data management platform, but also require complex analytical solutions,” said Alon Malki, head of data science at National Instruments. “Correlating burn-in results to a small set of testing parameters has become insufficient for screening purposes. The engineers now need to consider analyzing thousands of parameters from multiple stages of the manufacturing process. To deal with these new challenges, we must look at the entire product’s lifecycle of data.”

For advanced process nodes (<14nm), Malki noted that applying advanced analytical methods to the big data that the manufacturing process generates can reduce burn-in costs by up to 40% (recall that for 90nm, Intel achieved a greater than 90% reduction in burn-in time with simpler analytics and an adaptive test flow). But achieving those savings requires taking into account the full life cycle of a device, from model creation through distributed deployment to continuous monitoring of model performance, and the approach must be able to adapt quickly to change.

While IDMs have the engineering resources to develop custom tools, this investment requires continual maintenance and development. So companies that specialize in providing the framework and tools for these analysis methods, and smart manufacturing flows, have continued to grow.

“The manufacturing process is increasingly complex and spread across multiple facilities and operational groups,” said PDF’s Prewitt. “Just as this distribution creates logistic challenges, it also complicates the timely collection and alignment of data sources and types. Solving these challenges benefits from collecting data directly at the processing tool, automated timely data transport, and establishing a single data repository for these related but disparate data sources to be coalesced into a single source of truth for product engineering.”

Conclusion
For complex SoC devices, the burn-in step has been required to meet the high reliability demands of end users. It’s a costly manufacturing step that test engineers would like to eliminate, yet engineers responsible for reliability metrics cautiously watch over such elimination. This tension between yield loss and quality runs throughout the whole manufacturing test flow process.

Meeting both metrics in an economical manner is the third point in the triangle of yield/quality/cost. Together, test and reliability engineers can use adaptive test flows and sophisticated statistical analytics to effectively meet their respective metrics of interest. Historically, these engineering efforts could be achieved only by silicon companies with large engineering teams.

“One could argue that it’s even maybe worse in a low volume situation, because you’ve got all the overhead of creating the test setup, maintaining the equipment and everything like that,” noted Butler. “Maybe that’s sustainable when you can amortize that over a much larger volume of material. But now if you’re a small, you have to create all that stuff.”

Yet everyone should benefit from these methods, and that situation has changed. In the past 5 to 10 years, analytic platforms that incorporate silicon manufacturing test expertise have delivered the analysis tools to engineers.

Complex manufacturing designs, coupled with complex IC designs, necessitate using more than one variable to see that one of these things is not like the other. Engineers equipped with data, analytic tools, and more automated test processes now can do that for burn-in.


