Mission-Critical Devices Drive System-Level Test Expansion

SLT walks a fine line between preventing more failures and rising test costs.

System-level testing is becoming essential for testing complex and increasingly heterogeneous chips, driven by rising demand for reliable parts in safety- and mission-critical applications.

More and more chip manufacturers are jumping on the SLT bandwagon for high-volume manufacturing (HVM) of these devices. Unlike ATE and packaged-device testing, SLT mimics actual semiconductor system operation. Only by testing devices, peripherals, and software together, under real-world conditions, can companies drive escape rates down to acceptable defective parts per million (DPPM) levels.
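The DPPM arithmetic behind these quality targets is straightforward. The sketch below (illustrative numbers, not from any specific product) shows how escape counts translate into a DPPM figure:

```python
# DPPM arithmetic: escapes observed per million shipped parts.
# Numbers here are illustrative, not from any real product line.
def dppm(escapes: int, shipped: int) -> float:
    """Defective parts per million among shipped units."""
    return escapes / shipped * 1_000_000

# 3 field escapes out of 1,000,000 shipped parts is 3 DPPM.
rate = dppm(3, 1_000_000)
print(rate)
```

Even a handful of escapes per million parts can be unacceptable in automotive or data-center applications, which is why an extra screening insertion such as SLT can pay for itself.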

The semiconductor world includes more mission-critical applications than ever before. What started in aerospace and high-reliability military devices has spread to HPC servers, high-end mobile devices, cloud storage, automotive, and AI applications.

“Products that require very high-quality screening in high volumes are turning to system-level testing,” said Sri Ganta, director of business development for SLT at Advantest. “Previously, it was more bench-level or random screening at the system level, but now it’s become part and parcel of the regular production screen.”

Interestingly, there’s increasing interplay between wafer probe and system-level testing, though they test differently. “Extensive testing at probe ensures high-quality KGD and lowers scrap cost as much as possible,” said Peter Reichert, system architect for system-level test at Teradyne. “SLT confirms that the end product works and gets the last bit of DPPM by exercising all the in-field test and resiliency features.”

At the same time, SLT requires longer test times — on the order of minutes to more than an hour, versus tens of seconds for ATE. For this reason, parallelism is essential to reduce overall cost while mitigating the handling limitations of pick-and-place technology. “It comes down to how you scale SLT, and how you best mimic the specified operating conditions. We offer integrated test cells, which have high-speed pick-and-place handlers and can run up to 720 parallel test sites,” said Ganta. At-temperature and at-speed operation are also desirable. “With active thermal controls, users can accurately maintain a given set-point temperature, whether hot or cold, at the individual site.”
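The throughput argument for parallelism can be made concrete with some simple arithmetic. The sketch below uses illustrative numbers (the 720-site figure comes from the quote above; the test and index times are assumptions):

```python
# Throughput math for a fully parallel, synchronous SLT cell.
# Site count of 720 is cited in the article; test/index times are assumed.
def devices_per_hour(test_time_s: float, sites: int, index_time_s: float = 0.0) -> float:
    """Devices completed per hour when all sites test in lockstep."""
    cycle_s = test_time_s + index_time_s  # one test pass plus handler index time
    return sites * 3600.0 / cycle_s

# A 10-minute SLT run on one site completes 6 devices/hour;
# the same run across 720 parallel sites completes 4,320 devices/hour.
single = devices_per_hour(600, sites=1)
parallel = devices_per_hour(600, sites=720)
print(single, parallel)
```

This is why minutes-long SLT insertions are only economical at high site counts: throughput scales linearly with parallelism, while the per-site test time stays fixed.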

Differing test times and test requirements drive the optimization of the overall test cell. “For very low test times of several seconds to a minute, traditional functional-test handlers offer the best approach. This might be simply running a fast boot program,” said Ganta. “But at mid-level to long test times of several minutes to an hour, especially when you add a lot of these test requirements to meet this zero DPPM, the ecosystem changes. Now it’s no longer just horizontal footprint — you also need to leverage the vertical space.”

Challenges within SLT
Traditional test platforms, like wafer-level or package-level test, can provide very good test coverage at the individual component level, but not the more cohesive integrated test coverage offered by SLT.

Still, concerns over SLT’s lack of fault coverage mean first-time customers may question the system test’s value. “The lack of a fault coverage metric creates ambiguity around ‘what should be included in a test program,’ and if SLT is worth it or not,” said Teradyne’s Reichert.

“There are some tools that can do some sort of fault grading, but it’s very challenging and it’s not very accurate,” said Ganta. “This is based on a simulation model that crosses into the actual functional test patterns. And then you grade how much of this logic is actually interface logic and is actually tested based on the functional patterns.”

SLT goes beyond interconnectivity tests to a true functional mode of operation, exercising the device at its actual at-speed frequencies.

However, as more chipmakers embrace SLT in production, test engineers also require flexibility, such as the ability to combine burn-in with SLT in a semi-automated fashion to improve long-term reliability. Along with the need for parallelism, testing is morphing into a more dynamic environment to ensure fault-free operation before systems move into cars, mobile phones, cloud/edge storage, and AI computing systems. In fact, engineers in different segments are learning from one another.

“I look at the situation more as a convergence of automotive and data center. Cars are becoming data centers on wheels, while reliability is more and more important for data centers. It’s a real ‘you-got-chocolate-in-my-peanut-butter kind of thing,’” said Thomas Koehler, product marketing manager for automotive at Teradyne. “Automotive devices are adopting system-level test, which is something learned from the data center world. In addition, both automotive and data center devices have effectively made the end product — the car or server — a test insertion on its own. Structural test and other capabilities are becoming accessible from ‘mission-mode’ data interfaces.”

Faults increasingly occur in the interfaces between devices, and that is where SLT really shines. “Even if you get 99.9% coverage at the individual core levels, that remaining 0.1% could be in this interface logic,” said Ganta. “For digital devices, the digital blocks are often tested by the structural tests (ATE), and then you have analog blocks that are tested by specific analog or RF tests. But the fault coverage on the interface logic between these blocks or components is often overlooked, and that’s where some of the critical defects lie, because these cores will be in different power domains and voltage domains.”

Indeed, Reichert notes that new fault models are complicating coverage calculations. “Newer faults are more analog, making it harder to actually measure coverage,” he said. “It isn’t simply ‘stuck’ or ‘bridge.’ There are things like path delay and transition delay, and timing errors that can happen under some workloads versus others.”

This is driving up the cost of scan test. “To get convergence on new fault models, there is a need to run even more scan vectors (and presumably, longer BIST tests),” Ganta noted. “SLT provides the lower-cost/higher-parallelism method.”


Fig. 1: Unlike ATE, system-level testing does not scale with more complex processes. By catching failures using different methods, ATE and SLT are complementary. Source: Teradyne

Shrinking technology nodes, and movement toward 2.5D and advanced packaging, also encourage growth in fault models. “There are multiple dies in a package, and these complex devices have multiple power domains, as well as variable frequency scaling operating modes,” Ganta said. “So the industry is coming up with these new fault models to address complex defects associated with, for instance, the 4nm node and finFET transistors.”

Footprint, test speed, cost
“SLT operating efficiency is impacted by several interconnected factors that affect cost of test per device (see Fig. 1),” Reichert said. “In theory, the higher the site count per tester (and test floor area), the lower the cost of test (COT) per device. But it’s limited by the application board size — like MAP, for instance, where the small board size makes super-high parallelism possible. In the meantime, COT is balanced by factors like test changeover time, footprint per site count, and asynchronous testing, which allows independent test flows on each device.”

Changeover time affects test system uptime, while footprint per site count can be dictated by the application. “For segments like HPC, where the board size is larger, the site count per unit of footprint area is lower,” said Reichert.
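The interplay between these factors can be sketched as a simple cost model. The function below is a hypothetical illustration (the rates and lot sizes are assumptions, not industry figures) of how site count and changeover time feed into cost of test per device:

```python
# Hypothetical cost-of-test model combining the factors discussed above:
# site count, hourly tester/floor-space cost, and changeover time.
# All rates and sizes are illustrative assumptions.
def cost_per_device(test_time_s: float, sites: int,
                    tester_cost_per_hr: float, floor_cost_per_hr: float,
                    changeover_s: float = 0.0, lot_size: int = 1000) -> float:
    """Approximate cost of test (COT) per device, in the same units as the rates."""
    throughput_per_hr = sites * 3600.0 / test_time_s
    hourly_cost = tester_cost_per_hr + floor_cost_per_hr
    # Amortize one changeover across the whole lot.
    changeover_cost = (changeover_s / 3600.0) * hourly_cost / lot_size
    return hourly_cost / throughput_per_hr + changeover_cost

# Doubling site count at the same hourly cost halves the per-device share.
low_parallel = cost_per_device(600, sites=360, tester_cost_per_hr=200, floor_cost_per_hr=40)
high_parallel = cost_per_device(600, sites=720, tester_cost_per_hr=200, floor_cost_per_hr=40)
print(low_parallel, high_parallel)
```

A model of this shape makes the trade-off in the quote visible: larger boards (fewer sites per footprint) raise the per-device share of hourly cost, while long changeovers matter less as lot sizes grow.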

Conclusion
System-level testing is a cost-effective means of improving quality and driving toward zero defective parts, especially for mission-critical devices. But in practice, cost depends on factors such as die area, package cost, and the company’s test strategy.

“In general, if there is a common data format for each test insertion (wafer sort, final test, and SLT), each company can analyze and decide which tests are more efficiently run at each insertion based on their scrap cost, test cost, and cost of quality, and then decide the best test methodology,” said Reichert. The earlier that chipmakers discover defects, the more they can reduce package scrap cost. “Final test may be shifting either to wafer sort to discover defects earlier, or to SLT to take advantage of the high parallelism. Easy portability of tests to/from SLT is a key enabler in the ability to shift left or right.”

In the future, Ganta says, machine learning may be instrumental in optimizing test platforms. “ML or AI-type algorithms can help choose and optimize overall test cost and strategy, starting from design and closing the timing to probe, to package test and burn-in test, and on to field testing. Such algorithms will help with the dynamic decision-making. Of course, there are data security and risk mitigation issues, but the industry is working toward a uniform test environment where you can actually feed forward or feed back all the data to make those decisions.”

Related Stories 
Software-Driven And System-Level Tests Drive Chip Quality
A new system-level test for SoCs is gaining traction because it catches problems not detected at wafer probe and package test.

System Level Test — A Primer: White Paper
System Level Test (SLT) is becoming essential as semiconductor geometries shrink.

Optimizing Scan Test For Complex ICs
New techniques for improving coverage throughout a chip’s lifetime.
