What Data Center Chipmakers Can Learn From Automotive

Higher quality, lower cost, and faster time to market are requirements for both as rising complexity in vehicles overlaps with defectivity concerns in data centers.


Automotive OEMs are demanding their semiconductor suppliers achieve a nearly unmeasurable target of 10 defective parts per billion (DPPB). Whether this is realistic remains to be seen, but systems companies are looking to emulate that level of quality for their data center SoCs.

Building to that quality level is more expensive up front, although ultimately it can save costs versus having to fix problems in the field. But it’s also much more test-intensive and slower to screen every facet of every product, both of which drive up costs. The question now is whether this level of screening can be done quickly enough, and inexpensively enough, for all of these chips.

Today, the accepted escape rate for consumer and data center ICs is in the range of 300 to 500 defective parts per million (DPPM). But even meeting those goals has been challenging. Recent reports from Meta and Google indicate the escape rate is as high as 1,000 DPPM for just the ALUs in multi-core SoCs.
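
To put those escape rates in perspective, a quick back-of-the-envelope calculation shows what they imply at fleet scale. The fleet size below is a hypothetical figure chosen only for illustration, not from any operator's disclosure:

```python
# Expected shipped-defective parts at a given escape rate. The fleet
# size is a hypothetical assumption used only for illustration.
def expected_escapes(fleet_size: int, dppm: float) -> float:
    """Expected number of defective parts in a deployed fleet, given an
    escape rate in defective parts per million (DPPM)."""
    return fleet_size * dppm / 1_000_000

fleet = 200_000  # hypothetical number of SoCs in one hyperscaler fleet
for rate in (0.01, 300, 500, 1000):  # 0.01 DPPM = 10 DPPB, the automotive target
    print(f"{rate:>8.2f} DPPM -> {expected_escapes(fleet, rate):8.2f} expected escapes")
```

At 500 DPPM, a 200,000-unit fleet can expect roughly 100 defective parts in service. At the automotive 10 DPPB target, the expected count drops to a small fraction of a single part.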

In screening IC products for defects, chipmakers for both automotive and hyperscaler data centers continue to push and tug at the yield-quality-cost triangle from different directions.

Automakers have lowered the maximum number of defective parts per billion from 10,000 to 10. But chips used in vehicles also are becoming more complex, including some 5nm SoCs for the central compute in a vehicle. Still, hitting quality targets for manufacturing these devices, especially at the low ASPs that automotive OEMs are demanding, will be difficult.

In contrast, makers of hyperscaler SoCs have more cushion in the selling price, but they are under extreme pressure to churn out their devices more quickly. Screening for defects in billions of transistors and contacts, and 10-plus kilometers of interconnect, is a massive challenge under any circumstances, let alone a tight market window. And the power/performance goals for these chips require continued scaling, which is becoming significantly harder at each new node. Third-order effects are now first-order effects, and design margins are so tight that defects can cause mathematical miscalculations and other compute execution errors.

“The complexity of data center CPUs, GPUs, and AI processors is usually much higher than that of standard processors in terms of transistor counts and permutations of software circuit operations. A ‘test hole,’ due to a missing permutation of test vectors, is commonly suspected as the main reason why a silent data error occurs,” said Keith Schaub, vice president of technology and strategy at Advantest America. “Simply put, it is impossible to exercise the device under test with all the possible permutations, operating modes, and external signals that might affect operation (e.g., by causing interrupts), as well as noise that can be exaggerated with certain internal states.”

Converging problems
Automotive IC suppliers have decades of practice in detecting the subtle failures found in customer returns, driven by a mindset of delivering zero defects. With dogged persistence, engineers track down failures and identify additional screening techniques, work that in the past was accommodated by longer design cycles, mostly at mature nodes. A 2021 paper authored by NXP automotive engineers [1] illustrates the detailed nature of such detective work. For a 16nm finFET microcontroller, engineers needed a new fault model to detect a failure occurring at 5 DPPB.

Automakers’ concerns also drive more inspection, test content, test conditions, DFT, and now in-system detection of failures. 

“For automotive chip providers, the cost of failure is clearly defined financially, with product and end-user impacts. It has driven a high level of rigor with respect to both functional safety and test requirements. A great example of this is the mission profiles used in automotive that are used to flow down requirements from OEM all the way down to each SoC,” said Marc Hutner, senior director of product marketing, proteanTecs. “These concepts are being extended into the data center to ensure the hardware, software, and environment stay within their power, performance, and reliability envelopes. These requirements will then drive further adoption of enhanced DFT, visibility, and diagnosing hardware for in-system use.”

Others agree. “By and large, the more exhaustive test flow used by automotive IC suppliers with multiple insertions at high and low temperatures, as well as burn-in, has been the traditional path to greater reliability,” said Ken Lanier, director of strategic business development at Teradyne. “The other thing that automotive devices have relied on is more parametric testing, such as IDDQ. All of these are applicable to data center devices, except cold insertion. Data center devices likely don’t require a cold insertion.”

During manufacturing, screening for defective parts usually comes down to inspection, electrical test, and analytics. To be effective, inspection needs to be at 100% of production. Judicious use of test conditions and more complex ATPG assists in detecting marginal behavior. Finally, applying analytics to determine the appropriate limits for outlier detection algorithms helps engineers manage the yield-quality-cost triangle.
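
As a concrete illustration of that last step, here is a minimal sketch of static part average testing (PAT) limits in the spirit of AEC-Q001, using robust statistics so the outliers themselves don't inflate the limits. The ±6-sigma multiplier and the IQR-based sigma estimate are common choices, not values prescribed by anyone quoted here:

```python
import numpy as np

def static_pat_limits(values, k=6.0):
    """AEC-style static PAT limits: robust mean +/- k robust sigma,
    using the median and interquartile range so a few extreme readings
    don't widen the screening window."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    robust_sigma = (q3 - q1) / 1.35  # IQR is ~1.35 sigma for normal data
    return med - k * robust_sigma, med + k * robust_sigma

# Example: leakage readings (uA) with one suspicious die
leakage = np.array([1.1, 1.0, 0.9, 1.2, 1.05, 0.95, 1.15, 4.8])
lo, hi = static_pat_limits(leakage)
outliers = leakage[(leakage < lo) | (leakage > hi)]
print(f"limits = ({lo:.2f}, {hi:.2f}) uA, outliers = {outliers}")
```

The 4.8µA die passes an absolute spec limit of, say, 10µA, yet it is statistically abnormal relative to its peers, which is exactly the population outlier screening is designed to catch.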

Fig. 1: Manufacturing screening opportunities. Source: Semiengineering.com/A. Meixner

100% inspection
In both wafer and assembly factories, inspection is used after selected manufacturing steps on a sample of wafers/die/units. Those results can guide process/equipment improvements and determine if gross misprocessing occurred. With the growing need to detect defects earlier, engineering teams can perform inspection on 100% of production wafers and parts during manufacturing. This requires faster physical inspection that can accurately distinguish good from bad images.

For mature semiconductor processes, automotive suppliers have led the way in using optical wafer inspection to identify suspect die. NXP and onsemi engineering teams presented results at the 2019 Automotive Electronics Council Reliability Workshop [2,3]. Combining inspection data with electrical test data can help distinguish relevant defects from benign ones, as noted in a 2020 European Test Symposium paper from onsemi [4].

But for advanced CMOS process nodes, optical inspection is not enough, because the defects that matter are no longer optically visible. With automakers starting to use finFET technology for AI/ML, both industry sectors now need additional techniques to achieve higher product quality.

“If you look at ADAS chips, the optical techniques are not going to work very well,” said Andrzej Strojwas, chief technology officer at PDF Solutions. “You don’t have the sensitivity. And if you reduce the pixel size, you are seeing all the nuisance defects. Starting with the bleeding-edge nodes, before you see the eventual reliability failure you’ll see leakages between the gate and the source/drain contacts.”

The need to detect leakages at the device level has prompted the use of e-beam metrology techniques for inspection purposes [5]. This approach can measure leakage via voltage contrast. For the most advanced SoCs, there is particular interest in inspecting the 10-plus billion contacts for defects. This requires analyzing the physical design for layout patterns vulnerable to DFM variability, creating e-beam-measurable test structures placed near actual circuitry, and making measurements fast enough (within two hours) to enable in-line inspection.
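
Conceptually, voltage-contrast screening reduces to outlier detection on per-contact image intensity: a contact with a leakage path charges differently under the e-beam and images at a different gray level than its neighbors. The sketch below is purely illustrative (production tools do this in firmware at far higher throughput), and the gray-level model and threshold are assumptions:

```python
import numpy as np

def flag_leaky_contacts(gray_levels, k=5.0):
    """Toy voltage-contrast screen: flag contacts whose image gray level
    deviates more than k robust sigmas from the population median."""
    g = np.asarray(gray_levels, dtype=float)
    med = np.median(g)
    sigma = np.median(np.abs(g - med)) * 1.4826  # MAD -> sigma for normal data
    return np.where(np.abs(g - med) > k * sigma)[0]

# 10,000 simulated contacts, three of them leaky (imaging brighter)
rng = np.random.default_rng(0)
grays = rng.normal(128, 3, 10_000)
grays[[42, 1337, 9000]] += 40
print(flag_leaky_contacts(grays))  # flags contacts 42, 1337, 9000
```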

The need for 100% inspection doesn’t end with wafers. Inspection now includes individual die after singulation and bump applications, the bonds between dies and package substrates, and the package-to-board bonds. Automotive IC suppliers lead data center suppliers in shifting to these higher levels of inspection.

“I have heard from automotive customers that they’re already pushing their suppliers to do that, if they aren’t doing it already,” said Frank Chen, director of applications and product management at Bruker. “And that’s something data centers can adopt. Especially with more investment in AI, they can see a lot of commercial gain from the higher quality and the 100% inspection that automotive has been pushing for a couple of years.”

As a well-established tool in assembly houses, optical inspection is fast and relatively cheap, but it only catches surface defects. X-ray imaging can see through layers, which makes it well suited to inspecting package-to-substrate bump interconnects.

“Bruker’s X-ray imaging product line (formerly SVXR) is fast enough to support in-line inspection,” said Chen. “But it is more expensive than optical methods, so factories need to consider the total cost of ownership perspective. There’s a certain threshold on device complexity and cost of defects for 100% inspection to make sense commercially.”

With high ASPs, data center products in advanced packaging fall into this category. However, there remain some concerns with X-ray technology.

“It’s great to have a high-speed tool now, but a barrier for some companies is the concern on dosage impact to high-bandwidth memory dies surrounding the processor die,” noted Chen. “With X-ray imaging, that’s still under investigation, and there are techniques to mitigate this risk such as energy filtering and reducing exposure time. Meanwhile, it is still valuable for setting up the process on a thermocompression bonding (TCB) tool with a dummy or sacrificial memory die. In an advanced packaging fab there can be hundreds of TCB tools to qualify daily, so there needs to be a solution fast enough to scan through samples made by each TCB tool.”

Testing for 10 DPPB
During the test process, suppliers of automotive ICs use advanced ATPG tools, multiple environmental test conditions, burn-in insertions, and outlier detection algorithms. These techniques long enabled them to strive for 10 defective parts per million, and they are now being pressed into service for the 10 DPPB target. Not all of them are used by makers of advanced SoCs bound for data center OEMs.

First, adding test content that more precisely targets small delays has benefits. But due to its expense, it needs to be selectively applied.

“Slack-based transition delay ATPG aims to cause the transition to follow the path with the least amount of slack,” said Adam Cron, distinguished architect at Synopsys. “It’s technically not a path delay test, but it’s trying to be one for every node in that signal path. And so that means the smaller the defect you have, the bigger the opportunity to detect that defect. Adding cell-aware logic gate fault models gets you a couple more defect types than a traditional transition fault model would. This is especially important with the multi-gate logic cells, which have re-emerged in finFET technologies.”

Fig. 2: The benefits of slack-based cell-aware ATPG. Source: Synopsys

With automotive test processes, engineers apply test content at a multitude of environmental conditions — voltage, temperature, frequency — because they either change the defect behavior or reduce design margin.

“Some spot defects result in relatively small changes in behavior that will push a chip to near failure, but not actual failure,” said Rob Aitken, distinguished architect at Synopsys. “These marginal parts can be pushed over the edge with changes in test voltage or temperature, with different defect mechanisms being more or less susceptible to extremes both high and low of both. Existing or targeted patterns applied at multiple test conditions can help identify those.”
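
The logic of multi-condition testing can be summarized in a few lines: a part that passes at nominal conditions but fails a voltage or temperature corner is marginal, not healthy. A toy sketch, in which the condition names and pass/fail table are hypothetical:

```python
# Hypothetical results of the same pattern set applied at multiple
# test conditions: nominal, minimum voltage, maximum voltage, and hot.
RESULTS = {
    "U01": {"nom": True, "vmin": True,  "vmax": True, "hot": True},
    "U02": {"nom": True, "vmin": False, "vmax": True, "hot": True},
    "U03": {"nom": True, "vmin": True,  "vmax": True, "hot": False},
}

def classify(results):
    """Separate hard fails, marginal parts, and clean passes."""
    for part, by_cond in results.items():
        if not by_cond["nom"]:
            yield part, "hard fail"
        elif not all(by_cond.values()):
            failed = [c for c, ok in by_cond.items() if not ok]
            yield part, f"marginal (fails at {', '.join(failed)})"
        else:
            yield part, "pass"

for part, verdict in classify(RESULTS):
    print(part, verdict)  # U02 and U03 pass nominal but are marginal
```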

To drive out early life failures, engineers use either burn-in (high temperature and voltage stress) or wafer voltage stress.

“Another technique automotive suppliers have used to improve quality is voltage stress testing. With this technique, all device pins are brought at or near their process voltage limit, followed by post-stress leakage measurements. The post-stress leakage measurement is compared with the same measurement pre-stress,” said Thomas Koehler, product marketing manager for automotive at Teradyne.  “Currently, large digital devices typically only do Vdd stress testing. While I/O stress testing has been an increasingly valuable technique for automotive devices, it is not yet clear whether this will be as valuable to data center IC testing due to the difference in process technology used.”
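
The pre/post-stress comparison Koehler describes amounts to a simple delta screen. A minimal sketch follows, with a placeholder shift limit; in practice the limit would come from characterization data:

```python
def post_stress_outliers(pre_ua, post_ua, max_delta_ua=0.5):
    """Flag parts whose leakage shifts by more than max_delta_ua after
    voltage stress, even if the absolute post-stress reading is still
    within spec. The 0.5 uA limit is a placeholder assumption."""
    return [i for i, (pre, post) in enumerate(zip(pre_ua, post_ua))
            if post - pre > max_delta_ua]

pre  = [1.0, 1.1, 0.9, 1.2]   # leakage before stress, uA
post = [1.1, 1.2, 2.1, 1.3]   # part 2 shifts by 1.2 uA after stress
print(post_stress_outliers(pre, post))  # -> [2]
```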

Some defect failure modes are exacerbated by temperature, and automotive suppliers are required to test at both cold and hot temperatures.

“The automotive suppliers test over wider temperature ranges,” said PDF’s Strojwas. “The wider temperature range might not be necessary for data centers. But we are finding there is actually a pretty good correlation between failure mechanisms and final-test hot or cold conditions, and these actually may affect the early infant mortality.”

Outlier detection algorithms
In the early 1990s, the Automotive Electronics Council (AEC) observed that parts which passed electrical tests but later failed in vehicles tended to be statistical outliers in the electrical test data. This motivated the use of outlier detection methods to screen for potentially defective parts.

“The key thing to bear in mind is that automotive manufacturers don’t just have to find which devices are bad now,” said Teradyne’s Lanier. “They also have to find devices that are likely to fail in the future. This is where the burn-in and outlier detection come into play. Again, DC parametric data plays a key role here, since it provides statistical insight that a pass/fail functional test would not. Measurements that focus on leakage currents or unacceptable behavior under stress are key.”

Used in nearly all CMOS defect test programs, leakage tests lend themselves well to statistical outlier analysis. “Dynamic part average testing (D-PAT) and leakage provides a good example. If you have an unexpected leakage path it should show up as an outlier in your leakage distribution,” said Mark Laird, senior staff application engineer at Synopsys.  “A dead short would fail your leakage test, but a soft defect causing an elevated leakage — that’s definitely a part you want to filter out.”
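
A minimal sketch of how D-PAT differs from static PAT: the limits are recomputed from each wafer’s own distribution, so they track normal lot-to-lot drift while still flagging dies that leak more than their wafer-mates. The MAD-based sigma estimate and ±6-sigma window are illustrative choices:

```python
import numpy as np

def dpat_limits(wafer_values, k=6.0):
    """Dynamic PAT window computed from one wafer's own distribution,
    using median and MAD so the outliers don't distort the limits."""
    v = np.asarray(wafer_values, dtype=float)
    med = np.median(v)
    sigma = np.median(np.abs(v - med)) * 1.4826  # robust sigma via MAD
    return med - k * sigma, med + k * sigma

wafer = np.random.default_rng(1).normal(2.0, 0.1, 500)  # leakage, uA
wafer[7] = 3.4  # elevated-but-not-shorted leakage: a classic soft defect
lo, hi = dpat_limits(wafer)
flagged = np.where((wafer < lo) | (wafer > hi))[0]
print(f"D-PAT window ({lo:.2f}, {hi:.2f}) uA flags dies {flagged}")
```

Die 7 here is exactly the case Laird describes: it would pass a hard leakage spec, but its elevated reading relative to the rest of the wafer marks it for filtering.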

While data center suppliers have yet to apply outlier detection algorithms, high-volume consumer IC suppliers are doing just that.

“These automotive algorithms are already applied to consumer products,” said Dieter Rathei, CEO at DR Yield. “For instance, the big cell phone manufacturers like Apple require similar quality checks. People don’t accept it if their smart phone breaks down. The big players are aware of this, and that’s why they ask their suppliers to use part average test (PAT) algorithms. We have Apple suppliers among our customers. They use YieldWatchDog for exactly that purpose.”

There are a number of outlier detection techniques, including good die in bad neighborhoods, cluster detection, cluster boundaries, D-PAT, Z-axis PAT (Z-PAT), and nearest neighbor residual. “All those techniques are equally good for any market segment that requires higher quality,” said Laird. “It is interesting to see what the tradeoff would be when applying these outlier techniques for data center ICs, because that’s always the balance. How much yield are you willing to give up pre-package versus your reliability? The economics of the data center are very different from automotive.”
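
Good-die-in-bad-neighborhood is the most intuitive of these to sketch: a die that passes every test but sits in a cluster of failing dies gets flagged as a reliability risk. A toy implementation on a pass/fail wafer map, where the neighbor-count threshold is an assumed parameter:

```python
import numpy as np

def gdbn_flags(fail_map, threshold=3):
    """Flag passing dies with 'threshold' or more failing neighbors
    (8-connectivity). Such dies passed every test but are statistically
    more likely to harbor a latent defect."""
    rows, cols = fail_map.shape
    flagged = []
    for r in range(rows):
        for c in range(cols):
            if fail_map[r, c]:
                continue  # only passing dies can be flagged
            neighborhood = fail_map[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if neighborhood.sum() >= threshold:
                flagged.append((r, c))
    return flagged

# 1 = failing die, 0 = passing die; die (1,1) passes amid four failures
wafer_map = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [1, 0, 0]])
print(gdbn_flags(wafer_map, threshold=3))  # -> [(1, 1)]
```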

In looking for that balance between yield loss and test escapes, the development of more sophisticated algorithms is ongoing.

“In the past 10 years, we have received a lot of requests to add particular features to our quality module. It’s a very sophisticated tool with configuration options for the classic algorithms, and for some algorithms that we have improved on. For instance, the classic AEC algorithms assume normally distributed data, and electrical test data are rarely normally distributed,” said Rathei. “That’s why we made variants that don’t require a Gaussian distribution for the test data. Then there are algorithms like Z-PAT, cluster analysis, and value shift. For value shift analysis you can look at low-temp versus high-temp measurements, or pre-burn-in versus post-burn-in measurements. If there’s a value shift that’s outside the expected range, we flag those devices.”
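
One way such a non-Gaussian variant could work (an illustrative assumption, not DR Yield's actual algorithm) is to derive screening limits from empirical quantiles rather than a ±6-sigma model, so skewed parameters such as leakage aren't over-flagged on one tail:

```python
import numpy as np

def quantile_limits(values, lo_q=0.001, hi_q=0.999, guardband=1.5):
    """Distribution-free screening limits: take extreme empirical
    quantiles of the observed data, then widen them by a guardband
    factor around the median. Parameters are assumptions for
    illustration, not a published algorithm."""
    v = np.asarray(values, dtype=float)
    lo, hi = np.quantile(v, [lo_q, hi_q])
    center = np.median(v)
    return (center - (center - lo) * guardband,
            center + (hi - center) * guardband)

# Log-normal leakage: heavily right-skewed, where a Gaussian +/-6-sigma
# model would set a lower limit below zero and over-flag the upper tail
data = np.random.default_rng(2).lognormal(mean=0.0, sigma=0.4, size=5000)
print(quantile_limits(data))
```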

Conclusion
Automakers have long demanded higher quality and reliability for their components than desktop and data center computer makers. But these two worlds are now dealing with many of the same problems. Hyperscale data center operators, such as Meta and Google, have reported subtle faulty behaviors that seep through the ATE and system-level test flows.

The big questions now are whether data center chipmakers can apply the same level of scrutiny that automotive suppliers exercise, and whether automotive suppliers can deliver that kind of quality assurance for their most advanced chips at an acceptable price point. With data-center-class computing arriving in cars, suppliers in both markets need to step up their efforts to screen out as many potentially defective parts (time-zero and latent) as possible, and they need to do it quickly and efficiently while keeping costs under control.

References
[1] J. Corso, S. Ramesh, K. Abishek, L. T. Tan and C. Hooi Lew, “Multi-Transition Fault Model (MTFM) ATPG patterns towards achieving 0 DPPB on automotive designs,” 2021 IEEE International Test Conference (ITC), 2021, pp. 278-283

https://ieeexplore.ieee.org/document/9611352

[2] O. Anilturk et al., “Inline Part Average Testing (I-PAT) to Reduce Escapes from both Gaps in Test and Latent Reliability Defects: Continuing Feasibility Study Results at NXP,” Second European Automotive Electronics Council Reliability Workshop, October 15, 2019.

[3] Bruneel, G. et al., “Implementation of I-PAT Using High Speed Defect Screening,” Second European Automotive Electronics Council Reliability Workshop, October 15, 2019.

All presentations from the 2019 Automotive Electronics Council Reliability Workshop are available at http://www.aecouncil.com/AECWorkshop.html

[4] A. Coyette, W. Dobbelaere, R. Vanhooren, N. Xama, J. Gomez and G. Gielen, “Latent Defect Screening with Visually-Enhanced Dynamic Part Average Testing,” 2020 IEEE European Test Symposium (ETS), 2020, pp. 1-6,

https://ieeexplore.ieee.org/document/9131593

[5] M. Strojwas et al., “Advanced High Throughput e-Beam Inspection with Direct Scan,” NANOTS 2021 Conference.

https://www.pdf.com/resources/nanots-2021-advanced-high-throughput-e-beam-inspection-with-directscan/

Related Stories

Auto Chipmakers Dig Down To 10ppb
Driving to 10 defective parts-per-billion quality is all about finding, predicting nuanced behavior in ICs.

The Race To Zero Defects In Auto ICs
100% inspection, more data, and traceability will reduce assembly defects plaguing automotive customer returns.

Why Silent Data Errors Are So Hard To Find
Subtle IC defects in data center CPUs result in computation errors.

Screening For Silent Data Errors
More SDEs can be found using targeted electrical tests and 100% inspection, but not all of them.


