Employing more stress testing at the wafer level improves quality while reducing burn-in time and cost. So why isn’t it happening?
Considered something of a necessary evil, burn-in of IC packages during production weeds out latent defects so they don’t turn into failures in the field. But as AI and multi-chiplet packages become more common, and concerns about aging circuitry heighten, shifting stress testing to the wafer level looks increasingly attractive from a quality, throughput, and cost standpoint.
The shift is motivated by the higher quality this approach can deliver when combined with advanced outlier screening.
“There is a perception from automotive customers that package-level burn-in is required to ensure automotive quality,” said Chen He, senior director of product enablement and fellow at NXP Semiconductors. “We showed that an advanced wafer-level stress methodology, using enhanced high voltage stress tests and advanced outlier screening, can achieve superior acceleration and detection of latent defects, with less reliability risk, such as thermal runaway, than package-level burn-in.” [1]
Part of the drive to replace burn-in is the need to place only known good die (KGD) in multi-die packages, such as those used in AI and mission-critical assemblies. “As the technology continues to scale down, and with the emergence of chiplets, package-level burn-in will become either technically infeasible or prohibitively expensive,” said He. “Shifting left to a wafer-level stress methodology, including wafer-level stress and ML-based outlier detection algorithms, will be the right direction.”
While many of these designs are highly targeted, the approach is essential for many domains. “Even though AI devices in data centers are not mission-critical per se, screening out those early infant mortalities before they get into the data center is important because they are going to change the process in a year, anyway,” said Davette Berry, senior director of business development and customer program management for Advantest’s Test Solutions Group.
The majority of AI assemblies and mission-critical devices are burned in, while consumer devices rarely are.
“Burn-in is a risk calculation based on the end product’s quality level, DPPM level you need to meet, the process node, the maturity of the process, etc. So your burn-in strategy is driven by these factors,” said Nitza Basoco, technology and market strategist at Teradyne. “For high-performance computing, for instance, you have individual control of the environment around that device, and that’s where our test platforms shine. You might need to deliver a large amount of power into the device, but this is carefully controlled using ATC (active temperature control) and loading to prevent thermal runaway.”
Cost is always a consideration here. “In recent years, time spent on stress, especially during post-package burn-in for data center products, has been on the rise. Intel and Siemens are focused on generating and applying effective stress stimuli to the products during manufacturing,” said Suriya Natarajan, technologist, Tessent Division of Siemens EDA. “The test side is innovating to detect more marginal defects during time-zero while proactively monitoring for impending aging failures. Delay fault models can reveal these marginal defects through timing-aware transition tests and timing-aware cell-aware tests.”
Interestingly, the industry has the tools to perform most stress tests at the wafer level. But because the complete assembly must pass reliability testing with all the solder bump connections, many chipmakers prefer to keep burn-in in production. “We’ve had the technology for years,” said Claude Castiglione, director of global test services at Amkor Technology. “It can be done, but we just don’t see customers moving in that direction. A couple of customers have inquired about it, and we do it to some level, but package test remains the dominant approach.”
New technologies, such as SiC and GaN devices and co-packaged optics, typically undergo burn-in to provide insight into device failure modes, which are then fed back to designers and the fab. Power devices operate at higher temperatures and must be burned in at correspondingly higher temperatures than silicon CMOS devices.
Still, a look at recent improvements in outlier approaches, testing platforms, burn-in, and system-level test provides insights into why burn-in strategies are changing, albeit gradually.
Not your grandparent’s burn-in
Burn-in is still the go-to screening method for catching early life failures in automotive and other mission-critical devices. It has been a mainstay in semiconductor processing for decades. By stressing parts with higher voltage and temperatures than they would normally encounter during actual use, burn-in mimics the aging process caused by defects and variability in the fab process.
What’s changing now is that new technologies, like automotive ADAS and AI accelerators, are so complex that reaching the necessary quality levels requires a rethinking of stress methodologies. Add to that the threat of device damage due to over-stress (see figure 1), and shifting stress tests to the wafer level starts to make more sense.
Fig. 1: Optical image of a device after burn-in shows delamination in the failing I/O bond pad area. Source: NXP Semiconductors
“Burn-in is essentially a fault analysis tool that shows where in the reliability lifecycle a part will fall out,” said Castiglione. “Normally you wouldn’t burn in every part, but when you want to optimize your fault coverage you do burn-in test on every device. If after about three months you see very little fallout, customers might go to 10% or 20% burn-in for quality monitoring.”
But before the devices go into high-volume manufacturing, typically they are qualified.
“Most manufacturers of semiconductor devices want to qualify new devices, whether it’s a new process or a new design. So they go through this high temperature operating life (HTOL) test early on,” said Advantest’s Berry. “What’s changed is adding burn-in to a production flow. So once the process is qualified, do you continue burning in parts? Is that economically viable for the market that your device is serving?”
What are HTOL and burn-in tests?
When a device is first undergoing validation, prototyping, and yield ramping, engineers perform high temperature operating life (HTOL) tests. For automotive applications, HTOL takes place at 125°C for 1,000 hours (roughly 42 days) using an applied voltage bias that exceeds the nominal range.
Conversely, burn-in during production traditionally has been performed on the assembled package, which sits in burn-in sockets on trays in the burn-in oven, typically for 24 to 48 hours (potentially longer, depending on the quality target). Burn-in precedes final test or system-level testing.
Fig. 2: Distinct uses of burn-in (catching early life failures) and high temperature operating life (HTOL) testing. Source: Aehr Test Systems
Burn-in systems are available to process wafers or packaged devices. “If you look at that AI chip from a very large brand company, you have their logic, their GPU, and then you have memory around it. And the reality is, all of those have to get some level of burn-in or stress,” said Vernon Rogers, executive vice president of sales and marketing at Aehr Test Systems. “So if I do burn-in at the package level and I have a failure, versus burning it in at the wafer, it’s our understanding that you could suffer a 10X to 100X cost penalty for finding the fault later rather than sooner.”
The distinction between HTOL testing and production burn-in is that HTOL qualifies new devices or processes, while burn-in screens assembled packages to catch the maximum number of latent defects.
The goal of burn-in is to eliminate these early failures without significantly reducing the lifespan of the reliable components. In other words, after devices are burned in, they are shipped and sold. Burn-in also ensures the quality of a chip from lot-to-lot, or even fab-to-fab. Unfortunately, it also can damage devices, whether through thermal runaway (including the effects of device self-heating), an ESD event, or over-voltage.
With the massive number of pins on today’s devices, poor connection in the socket also can result in resistive failures. “You need solutions that are compatible with high-volume manufacturing,” said Castiglione. “For instance, one customer had a device with 7,000 pins and was using an elastomer in burn-in. There is tremendous force being put on these devices, and some materials are better able to withstand production test than others.”
High pincount testing relies on the condition of the test socket. “With the increase in AI, GPU, and other high-speed devices, the need to measure the goodness of these coaxial sockets has also become ever so important,” said Glenn Cunningham, director of test and characterization at ModusTest. “For example, a shorts/leakage test method using a non-conductive device simulator will check for shorts/leakage between the signal/power pins to the shielding block. This method grounds all other points in the socket and measures for shorts/leakage to all points in the array.” Known good sockets are an increasing challenge in burn-in and test.
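The scan Cunningham describes maps naturally onto a simple loop. The sketch below is a minimal, hypothetical version: `ground_all_except()` and `measure_leakage_na()` stand in for whatever instrument driver a given test cell exposes, and the 100nA limit is an assumed threshold, not a ModusTest specification.

```python
# Toy sketch of the shorts/leakage socket check described above, run with a
# non-conductive device simulator seated in the socket. The driver functions
# and the 100 nA pass limit are illustrative assumptions.

LEAK_LIMIT_NA = 100.0  # assumed pass/fail threshold, in nanoamps

def socket_leakage_scan(pins, ground_all_except, measure_leakage_na):
    """Ground every point except the pin under test, then measure leakage
    from that pin to the grounded array; collect any pins over the limit."""
    failures = []
    for pin in pins:
        ground_all_except(pin)           # tie all other socket points to ground
        leak_na = measure_leakage_na(pin)
        if leak_na > LEAK_LIMIT_NA:
            failures.append((pin, leak_na))
    return failures
```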
Which defects burn-in catches
The types of defects engineers find depend on temperature and voltage. In general, voltage biasing is more efficient because failure acceleration follows the power law:
AF_V = (V_stress / V_use)^N
Because N ranges from 20 to 40 for most defects, a 1.33X increase in voltage corresponds to up to a 90,000X acceleration to failure. Many defects, such as bridging and gate leakage, require different levels of bias based on where the defect occurs in the device, whether at the transistors, metal contacts, middle-of-line interconnects, or global interconnects (see figure 3).
Fig. 3: Acceleration factors for defects at different levels and nodes. Source: NXP Semiconductors
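To make the power law concrete, here is a short sketch that evaluates the acceleration factor for the 1.33X overdrive cited above across the quoted range of N. The function name is ours; the numbers follow directly from the formula.

```python
# Voltage acceleration factor per the power law AF_V = (V_stress / V_use)^N.
# The 1.33X overdrive and the N = 20 to 40 range come from the text above.

def voltage_acceleration(v_stress: float, v_use: float, n: float) -> float:
    """Acceleration factor for a given stress-to-use voltage ratio."""
    return (v_stress / v_use) ** n

for n in (20, 30, 40):
    print(f"N = {n}: AF_V = {voltage_acceleration(1.33, 1.0, n):,.0f}")

# N = 20: AF_V = 300
# N = 30: AF_V = 5,195
# N = 40: AF_V = 89,963
```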
For instance, voltage acceleration can precipitate source/drain to gate leakage defects, contact-to-poly (metal gate), poly-to-poly, and metal-to-metal bridging, which eventually break down the neighboring dielectric (time-dependent dielectric breakdown, TDDB).
But via and metal voids are more effectively driven to failure by high temperatures alone, making wafer-level baking more efficient than the later package-level burn-in. These electrical opens fail by stress migration. The other advantage at the wafer level is that higher temperatures (200°C to 300°C) can be reached relative to the package-level bake temperatures of 150°C to 175°C.
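Thermal acceleration between the two bake conditions can be estimated with an Arrhenius model. The sketch below assumes an illustrative activation energy of 0.7eV for void-driven opens (the article does not give one); the bake temperatures come from the text.

```python
import math

# Arrhenius acceleration between two bake temperatures:
#   AF_T = exp((Ea / k) * (1/T_low - 1/T_high)), temperatures in kelvin.
# The 0.7 eV activation energy is an illustrative assumption.

K_BOLTZMANN_EV_PER_K = 8.617e-5

def thermal_acceleration(t_low_c: float, t_high_c: float, ea_ev: float) -> float:
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV_PER_K) * (1 / t_low_k - 1 / t_high_k))

# Wafer-level bake at 250°C vs. package-level bake at 150°C:
print(f"{thermal_acceleration(150, 250, ea_ev=0.7):.0f}X")  # ~39X faster to failure
```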
Eliminate/minimize burn-in
Getting away from package-level burn-in is a major shift for production floors. Changes to existing flows and equipment are costly, and any new approach must meet or surpass the reliability of existing methods. Makers of electric vehicles and military and aerospace systems are unlikely to change testing procedures without verification of even higher reliability, lower cost, or both.
“It is non-trivial to switch from package-level burn-in to wafer-level stress across a production line,” said He. “The wafer-level stress and testing need to be designed to have sufficient coverage for latent defects, and a large-volume validation needs to be executed to prove that DLBI (device-level burn-in) can be removed without affecting quality. It took NXP several years to explain the theory and collect large volumes of silicon data to convince our automotive customers to accept the approach. Since then, field performance data from hundreds of millions of NXP devices shipped using the wafer-level stress methodology have demonstrated its effectiveness, with both improved quality and reduced test cost.”
NXP’s wafer-level stress methodology employs enhanced high voltage stress testing (eHVST, both static and dynamic) for voltage-sensitive defects and high-temperature bake for via and interconnect void defects. Pre- and post-stress testing indicates the failure mechanism.
Next, advanced outlier algorithms process the test data, using die location on the wafer map and statistical screens to identify which devices should still undergo package-level burn-in, including dies at the wafer edge, memory-repaired dies, and marginal parametric outlier dies. The outlier screens can be multivariate or ML-based, and they complement other stress tests, such as ESD human body model, charged device model, and latch-up. The feedback loop in figure 4 further reduces the number of devices undergoing package-level burn-in over time.
Fig. 4: Typical production flow (above) includes advanced outlier screening and package-level burn-in (PLBI), versus a machine-learning-based flow in which most packages do not require burn-in. Source: NXP Semiconductors
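NXP has not published its algorithms, but the flavor of an ML-based outlier screen can be sketched with generic tools. Below is a minimal example using scikit-learn's IsolationForest on synthetic per-die features (wafer-map position plus a couple of parametrics); the feature set, model choice, and contamination rate are all illustrative assumptions, not NXP's method.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Minimal sketch of ML-based outlier screening at wafer sort. Each row is
# one die: wafer-map (x, y) plus parametric test results. All values here
# are synthetic, and the model choice is illustrative.

rng = np.random.default_rng(0)
n_die = 500
features = np.column_stack([
    rng.integers(0, 30, n_die),    # die x position on the wafer map
    rng.integers(0, 30, n_die),    # die y position on the wafer map
    rng.normal(10.0, 0.5, n_die),  # e.g., Iddq reading (mA)
    rng.normal(0.62, 0.01, n_die), # e.g., Vmin reading (V)
])

model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(features)  # -1 = outlier, 1 = inlier

# Flagged dies (plus known-risk groups such as wafer-edge and
# memory-repaired dies) would still be routed to package-level burn-in.
flagged = np.where(labels == -1)[0]
print(f"{flagged.size} of {n_die} dies flagged for PLBI")
```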
On-chip monitors also play a critical role in outlier detection. “There is a need to ‘shift left’ in testing, especially for chiplet-based designs. You need to be able to do smart testing and outlier detection at wafer sort to avoid detecting a defective chiplet device after assembly, due to its cost. Also, there’s the additional challenge of testing the die-to-die interconnect that can only be done after assembly, where the ATE no longer has access to these interconnects from the outside,” said Nir Sever, senior director of business development at proteanTecs. “By leveraging our on-chip agents, we monitor signal timing, per lane, in mission mode to identify marginal lanes which could be candidates for lane repair on the ATE and in-field. This is done with no impact on area or the signal, and during mission mode.”
Controlling burn-in is becoming more challenging, though. “Having adequate heat dissipation during burn-in is critical to prevent thermal runaway,” said Amkor’s Castiglione. “Customers use sensors in the heat sink to perform active temperature control. The sensor signal tells the fan to turn on or turn up, depending on the conditions.”
Tight process control is essential. “With the smaller process geometries, the same design on the same wafer could vary by 30% in power consumption when it is just idling,” said Rogers. “Literally, we’ve seen up to 30% variation due to manufacturing process variability. So now your part requires special consideration, because if one device runs hotter than the other, we have to control each die to make sure the power, the temperature, and the voltage are the same. It gets to be very tricky. We need to adjust the power level dynamically because there’s a thermal junction temperature that customers want to test at, so each and every die is monitored during burn-in and customers collect and process that data.”
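The per-die control loop Rogers describes can be pictured as a simple feedback step. The sketch below is a toy proportional controller: `read_tj_c` and `set_power_w` stand in for a hypothetical burn-in system driver, and the setpoint and gain are assumed values, not any vendor's numbers.

```python
from dataclasses import dataclass

# Toy per-die active temperature control step for burn-in. The driver
# callables, setpoint, and gain are illustrative assumptions.

TJ_SETPOINT_C = 125.0  # target junction temperature
KP_W_PER_C = 0.5       # proportional gain: watts of trim per degree C of error

@dataclass
class DieChannel:
    power_w: float  # power currently delivered to this die

def atc_step(die: DieChannel, read_tj_c, set_power_w) -> None:
    """One control step: trim this die's power toward the Tj setpoint."""
    error_c = TJ_SETPOINT_C - read_tj_c(die)  # positive => die running cool
    die.power_w = max(0.0, die.power_w + KP_W_PER_C * error_c)
    set_power_w(die, die.power_w)  # each die is monitored and trimmed independently
```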
Advantest’s Berry adds that when burn-in is implemented in production, some companies opt to use that test time more fruitfully by adding functional testing. “We’ve noticed that some customers are designing DFT into their chips to allow the stimulus of all of the inside structures to be done through a conventional port like USB or PCIe, whereas typically burn-in had been done through JTAG. That is a big change. Where conventional burn-in used a few digital pins and some power supplies, now I’ve got the ability to stimulate my entire structure inside the chip through a conventional USB or PCIe port. So as long as you are doing burn-in, a relatively long process, you can add more testing capability.”
Advances in DFT, such as built-in self-test (BiST), help implement the burn-in strategy. “This architecture usually requires few or no interface pins, and little or no data I/O with the device being burned in,” said Adam Cron, distinguished architect at Synopsys. “In addition, high signal toggle rates at internal nodes are usually required. BiST satisfies these constraints.”
BiST and DFT approaches potentially can reduce post-package burn-in time if used effectively during wafer-level stress, said Siemens’ Natarajan. “However, the need for burn-in is dictated more by the health of the process technology and product complexity. Also, BiST and DFT approaches can save stress test time if they can be shown to eliminate the component of functional tests used in stress, both at the wafer level and during burn-in.”
Advanced fault models play a key role in reducing defective devices per million. Of these, cell-aware models are commonly used in industry, but there are others such as cell neighborhood test, interconnect bridge and open test, total critical area test (measures the quality of test patterns not in terms of test coverage but by the detected critical area), and user-defined fault models.
“The challenge is to find the right combination of ATPG (automated test pattern generation) patterns and defect-oriented test patterns to achieve the maximum defect coverage within the budgeted test patterns, vector memory, and test time,” said NXP’s He.
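That balancing act is essentially a coverage-maximization problem under a budget. The toy sketch below greedily picks pattern sets by marginal fault coverage per millisecond of test time; the pattern names echo the fault models mentioned above, but every fault count and cost is invented for illustration.

```python
# Toy greedy selection of test pattern sets under a test-time budget.
# Fault IDs and per-set costs are invented; a real flow would use ATPG
# coverage reports instead.

patterns = {
    "stuck-at":    {"faults": set(range(0, 700)),     "ms": 40},
    "transition":  {"faults": set(range(500, 1100)),  "ms": 70},
    "cell-aware":  {"faults": set(range(900, 1400)),  "ms": 90},
    "bridge/open": {"faults": set(range(1300, 1600)), "ms": 60},
}

def select_patterns(patterns, budget_ms):
    chosen, covered, spent = [], set(), 0
    remaining = dict(patterns)
    while remaining:
        # Pick the set with the best new-coverage-per-millisecond ratio.
        name, info = max(remaining.items(),
                         key=lambda kv: len(kv[1]["faults"] - covered) / kv[1]["ms"])
        if spent + info["ms"] > budget_ms or not (info["faults"] - covered):
            break
        chosen.append(name)
        covered |= info["faults"]
        spent += info["ms"]
        del remaining[name]
    return chosen, len(covered), spent

print(select_patterns(patterns, budget_ms=180))
# (['stuck-at', 'transition', 'bridge/open'], 1400, 170)
```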
GaN and SiC burn-in
Environmental stress testing for GaN and SiC devices is performed at higher temperatures and voltages, and process deviations are more extreme in power devices. Burn-in occurs in the 180°C to 200°C range, which in some cases requires different wafer-level or package-level burn-in ovens.
DRAM stacks in high-bandwidth memory heat up during production burn-in because of their proximity to the SoC or GPU. “The only challenge with DRAM now is it’s requiring more and more power because it goes into AI assemblies,” said Rogers. “As much as 3,500 watts must be taken out of a wafer at a time.”
A well-thought-out design-for-test strategy is becoming critical, as there is increasing demand to test devices not just prior to shipment but also during field use. “DFT is an implementation automation feature in EDA tooling that is easy to enable and results in circuitry that satisfies most burn-in requirements,” said Synopsys’ Cron. “In this sense, design time and costs are saved, as this architecture for some implementations is also being used for other manufacturing test purposes and in-field test applications. Leveraging the same gates for functional and manufacturing test uses is a big savings. So is amortization of tooling investments, some of which are required to ship a product.”
Co-packaged optics burn-in
These are early days in the world of co-packaged optics, but companies are already working to determine how and where to burn in the photonic devices. This is important because most lasers are made from compound semiconductors, which can have a variety of defects in the crystalline lattice.
“Co-packaged optics is a bit of the wild west right now because there are so many implementations,” said Aehr’s Rogers. “From a photonic standpoint, the laser is clearly a very high failure rate device. With compound semiconductors, whether it’s indium phosphide or gallium arsenide, the lattice structure just has problems, so it has to be burned in. You need a handler to do burn-in. And some people actually want to attach it and do the burn-in there to get a feel for the whole integration. There’s an SOA (semiconductor optical amplifier) in there, too, and there’s a debate whether those should be burned in or not.”
Conclusion
Burn-in is a mainstay in semiconductor manufacturing, and several factors favor moving burn-in from the package level to the wafer level, with or without wafer-level stressing. While the change to process flows and approaches is large, the payoff is better-quality parts and perhaps some throughput advantages. Given the speed at which new AI devices and multi-chiplet modules are being rolled out, revised testing approaches, including more DFT, enhanced outlier screening, and high-voltage stress testing, are providing a stepping stone toward better quality.
Reference