Fundamental changes are needed in failure analysis to keep pace with advances in chip technology.
Failure analysis (FA) is an essential step for achieving sufficient yield in semiconductor manufacturing, but it’s struggling to keep pace with smaller dimensions, advanced packaging, and new power delivery architectures.
All of these developments make defects harder to find and more expensive to fix, which affects the reliability of chips and systems. Traditional failure analysis techniques — optical fault isolation, electrical probing, and scan-based test — were adequate when transistors were larger and interconnects more accessible. But as nodes shrink to 2nm and below, and as chiplets, backside power delivery, and hybrid bonding gain traction, those approaches are no longer sufficient.
Debug cycles are getting longer, yield learning is slowing, and manufacturers are struggling to diagnose failures buried deep within multi-die packages. Electrical probing and scan-based test methods also have become less effective as interconnects shrink and disappear beneath packaging layers.
“The challenge that we have is that failure analysis is not advancing as fast as the technology itself,” said Lesly Endrinal, silicon failure analysis engineering lead at Google. “We’re limping along with what we have, and while great engineers are making it work, the reality is that our ability to isolate defects is becoming increasingly constrained.”
At the same time, manufacturing costs are skyrocketing, so for high-value multi-die packages built with techniques like hybrid bonding, early defect detection is essential to avoid scrapping good chips along with the bad. Without a shift in strategy, failure analysis risks becoming a bottleneck in semiconductor scaling, driving up costs and limiting the industry’s ability to bring next-generation devices to market.
“The node size is shrinking really fast, and so is the pitch size, which makes optical inspection increasingly difficult,” says John Hoffman, computer vision engineering manager at Nordson Test & Inspection. “If you shrink the pitch by a factor of two, inspection time can increase by a factor of four, but fabs still expect the same throughput. Keeping up with these demands without exponentially increasing cost is one of the biggest challenges we face.”
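Hoffman’s point about inspection scaling follows from simple geometry. The sketch below, using purely illustrative numbers and a hypothetical `inspection_pixels` helper, shows why halving the pitch at a constant resolution per feature roughly quadruples the pixel count, and therefore the inspection time at fixed camera throughput.

```python
# Illustrative only: rough scaling of optical inspection effort with pitch.
# Real inspection time depends on optics, stage speed, and algorithms.

def inspection_pixels(die_area_mm2: float, pitch_um: float, pixels_per_pitch: int = 4) -> float:
    """Pixels needed to cover a die when each pitch is sampled with a fixed pixel count."""
    pixel_size_um = pitch_um / pixels_per_pitch        # finer pitch -> smaller pixels
    pixels_per_mm2 = (1000.0 / pixel_size_um) ** 2     # pixels per square millimeter
    return die_area_mm2 * pixels_per_mm2

die_area = 100.0                      # mm^2, hypothetical die
for pitch in (40.0, 20.0, 10.0):      # feature/bump pitch in microns
    print(f"pitch {pitch:5.1f} um -> {inspection_pixels(die_area, pitch):.2e} pixels")
# Each halving of the pitch quadruples the pixel count, which at fixed camera
# throughput roughly quadruples inspection time.
```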
Failure analysis gets an upgrade
The FA techniques that were developed for simpler, planar CMOS structures are proving inadequate in today’s complex landscape. Failures now can be buried deep within multi-die stacks or hidden under novel power delivery schemes, making them much harder to detect and analyze.
The nature of defects also is changing. Mechanisms that once followed predictable failure patterns now behave differently. Silent data errors, for instance, are intermittent and may only manifest as failures under particular workload, power, or thermal conditions. This shift in defect behavior demands a corresponding shift in failure analysis strategies.
Design for test (DFT) strategies can help minimize such risks. “As heterogeneous integration gains momentum, ensuring high-quality known-good die (KGD) is critical, as the cost of discarding a packaged part post-integration is prohibitive,” says Nilanjan Mukherjee, senior engineering director for Tessent at Siemens EDA. “A well-planned DFT strategy must be implemented at both the die and package levels to effectively test and repair high-speed interconnects, including TSVs, minimizing failures and improving yield.”
Design for debug
Historically, FA has been treated as a secondary concern in semiconductor design. This has led to significant hurdles in diagnosing and isolating defects, particularly at advanced nodes and with the adoption of complex packaging architectures such as chiplets and 3D-ICs.
Google’s Endrinal argues that the industry needs to move beyond the traditional concepts of design for test and diagnosis (DFTD) and explicitly incorporate debug into the design process. “It shouldn’t just be ‘DFTD’ anymore,” she said. “It needs to be ‘DFTDD’— design for test and diagnosis and debug. For too long, failure analysis has been an afterthought in chip design. But now, as architectures become more complex and harder to debug, we have to start embedding diagnostic features from the beginning.”
Advanced chips rely on built-in testability features that, together with diagnostic tools, aim to detect and isolate failures. However, without robust debug capabilities, engineers are left without the means to fully characterize and understand the root causes of these failures within the chip. This limitation is becoming increasingly problematic as transistor density continues to increase and advanced architectures create new obstacles for traditional fault isolation techniques.
“Testability is not enough. You also need diagnosability,” says Nitza Basoco, technology and market strategist for semiconductor test at Teradyne. “Designers must consider how defects will be identified and analyzed long before a chip reaches production. This means incorporating accessibility into the design process, ensuring visibility into package interactions, and planning for fault isolation at the earliest stages.”
Building that accessibility into the design would allow engineers to circumvent some of the limitations inherent in physical failure analysis and move more directly to advanced techniques like nano-probing. In addition, new fault models, such as cell-aware and layout-aware models, can improve the resolution and accuracy of defect localization. But these approaches require deliberate co-optimization between design and failure analysis tools.
“What we lose in advanced architectures is observability and controllability,” says Endrinal. “You need to be able to see inside the chip, and without that visibility, diagnosing failures becomes significantly harder. The key is intentional design. Chips must be designed with debug in mind from the outset to ensure we can still isolate and analyze defects effectively.”
Without proper debug mechanisms in place, the ability to effectively isolate and rectify yield issues will be severely compromised. This inevitably will lead to project delays and escalating costs. DFTDD is not simply an enhancement to DFTD. It is a fundamental necessity for ensuring the continued reliability of semiconductors in the years to come.
“Design for testability allows us to detect failures,” Endrinal notes. “Diagnosis allows us to locate them. But debug features allow us to extend the learning further and understand what is going on inside the chip to characterize it.”
In the absence of integrated debug capabilities, failure analysis teams often find themselves reliant on external test methods that offer only a limited view of the failure mechanisms at play. This is a growing concern, particularly for 3D-stacked architectures and chiplet-based designs, where failures can involve multiple dies, buried interconnects, and complex thermal and electrical interactions that are not fully captured by conventional test methods.
“The increasing complexity of semiconductor designs — high-speed interconnects, 3D stacking, and heterogeneous integration — is driving both the use of our existing diagnosis solutions and research into emerging fault models,” says Marc Hutner, director of product management for Tessent Yield at Siemens EDA. “Detecting faults early helps reduce field reliability issues and improves overall product quality.”
A significant challenge in modern FA lies in the difficulty of analyzing failures in advanced packaging and chiplet-based architectures. Unlike monolithic SoCs, which provide direct access to scan chains and test structures, chiplets rely on buried interconnects that are inaccessible to traditional electrical test methods. This makes it extremely challenging to diagnose failures using standard scan-based techniques, forcing manufacturers to explore alternative solutions, such as real-time embedded monitoring and in-situ diagnostics.
“A lot of failure analysis techniques rely on line-of-sight methods, but with 2.5D and 3D packaging, identifying defects deep within a stacked structure is increasingly difficult,” says Basoco. “Without standardized approaches, FA remains the ‘Wild West’— each design introduces unique obstacles that engineers must solve on a case-by-case basis.”
These emerging challenges underscore the urgent need for a fundamental shift in failure analysis strategy. Some promising approaches involve prioritizing data-driven analytics, embedded diagnostics, and AI-driven defect detection to effectively address the complexities of modern semiconductor devices.
The financial implications of improving failure analysis are more significant than ever. With skyrocketing manufacturing costs, the ability to diagnose defects early in the production process has a direct impact on profitability, time-to-market, and overall production efficiency. The industry’s shift toward chiplet-based designs, hybrid bonding, and ultra-high-density interconnects introduces new failure mechanisms that often remain undetected until late in the production cycle, making them considerably more expensive to rectify.
“FA has always been an afterthought, and that has to change if we want to have good yield learning in the future,” says Endrinal. “When new architectures are designed without considering how defects will be detected, it becomes exponentially harder to debug failures once production starts.”
Failure analysis must transition from a reactive function to a proactive one, taking a larger part in yield optimization and process refinement. That means moving away from after-the-fact debugging toward preemptive defect detection, leveraging real-time monitoring, in-depth data analysis, and AI-enhanced diagnostics to identify and address potential issues before they escalate into major problems.
“Test quality remains one of the biggest challenges at smaller nodes, particularly with new technologies like gate-all-around (GAA),” says Siemens’ Mukherjee. “Many defects escape manufacturing test because current methods struggle to replicate in-field environmental conditions and software workloads. Addressing this requires new test models, better stress testing, and the ability to correlate sensor data for predictive maintenance of complex heterogeneous ICs.”
Furthermore, failure mechanisms such as contact resistance drift and warpage require increasingly precise characterization. Detecting these failure modes before the final assembly stage is critical to preventing substantial yield losses.
“Contact resistance has a direct impact on planarity, interconnect reliability, and overall package performance,” said Jack Lewis, CTO of Modus Test. “By measuring resistance at extremely high precision, we can detect bonding issues, warpage, and non-wetting defects that traditional methods often miss. This level of characterization is critical as advanced packaging pushes toward finer pitches and higher-density interconnects.”
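As a rough illustration of the kind of screening Lewis describes (a hypothetical sketch with invented resistance values, not Modus Test’s method), per-contact four-wire measurements can be checked both for isolated outliers that suggest non-wetting joints and for a systematic gradient that hints at warpage.

```python
# Hypothetical sketch: screening per-bump contact resistance for anomalies.
# Assumes a grid of four-wire resistance measurements in ohms (not a real data format).
import numpy as np

rng = np.random.default_rng(0)
rows, cols = 20, 20
r = rng.normal(loc=0.010, scale=0.0002, size=(rows, cols))   # nominal ~10 mOhm contacts
r += np.linspace(0, 0.001, cols)                              # simulated warpage gradient
r[5, 7] = 0.050                                               # simulated non-wetting joint

# Isolated spikes: robust z-score against the median absolute deviation.
med = np.median(r)
mad = np.median(np.abs(r - med)) + 1e-12
robust_z = 0.6745 * (r - med) / mad
suspects = np.argwhere(robust_z > 6)
print("suspected non-wetting joints at:", suspects.tolist())

# Systematic tilt: fit a plane R ~ a*x + b*y + c; a large slope hints at warpage.
y, x = np.mgrid[0:rows, 0:cols]
A = np.column_stack([x.ravel(), y.ravel(), np.ones(r.size)])
(slope_x, slope_y, _), *_ = np.linalg.lstsq(A, r.ravel(), rcond=None)
print(f"resistance gradient: {slope_x*1e6:.1f} uOhm/col, {slope_y*1e6:.1f} uOhm/row")
```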
Adding to the complexity, failure analysis teams now must manage an exponentially growing volume of test and process data. A single 5nm chip can contain billions of transistors wired through more than a dozen metal routing layers, and characterizing it can generate terabytes of data per wafer. This data deluge makes traditional failure analysis workflows inefficient and poses a significant bottleneck to yield improvement.
“The biggest shift we’re seeing is that analysis — both defect identification and debugging — is now based on deep data,” said Nir Sever, director of business development at proteanTecs. “Unlike conventional methods that relied on external test results, deep data allows us to see inside the chip in real time, capturing how and why defects occur instead of just identifying them after the fact.”
To address these challenges, failure analysis needs to evolve beyond its reliance on conventional fault isolation tools. It must embrace the integration of real-time monitoring, embedded sensors, and AI-driven analytics directly into chip designs. Instead of solely depending on pass/fail criteria at the final testing stage, manufacturers are increasingly turning to in-situ diagnostics, deep data analytics, and AI-assisted defect prediction to anticipate and resolve failures much earlier in the production process.
Fig. 1: Outlier detection machine learning flow. Source: proteanTecs
“AI and machine learning are invaluable for correlating data across multiple tests, identifying anomalies, and pinpointing failure trends,” adds Basoco. “The key is collecting enough meaningful data. Without it, even the best AI models can’t provide actionable insights. The challenge is integrating sensors and data collection mechanisms that enable AI-driven defect detection in increasingly complex semiconductor packages.”
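As one illustration of what an outlier-detection flow of the kind shown in Fig. 1 might look like (a generic sketch with made-up parametric data, not proteanTecs’ actual pipeline), an unsupervised model can flag dies whose combined test signature deviates from the population even when each individual measurement passes its spec limits.

```python
# Generic outlier-detection sketch on made-up parametric test data.
# Not any vendor's actual flow; features and thresholds are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
n_dies = 5000
# Correlated parametrics per die: idd (mA), ring-oscillator freq (MHz), vmin (V).
idd  = rng.normal(50, 2, n_dies)
rosc = 2000 - 5 * (idd - 50) + rng.normal(0, 4, n_dies)      # faster silicon leaks more
vmin = 0.55 + 0.002 * (idd - 50) + rng.normal(0, 0.004, n_dies)

X = np.column_stack([idd, rosc, vmin])
# A die can sit inside every single-parameter spec yet break the idd/rosc correlation.
X[42] = [53.0, 2030.0, 0.545]   # simulated marginal outlier

model = IsolationForest(contamination=0.002, random_state=0).fit(X)
scores = model.decision_function(X)            # lower = more anomalous
flagged = np.argsort(scores)[:10]              # top-10 suspects for FA follow-up
print("dies flagged for failure analysis:", flagged.tolist())
```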
The future of failure analysis is data monitoring
As semiconductor technology continues its relentless advance, FA needs to keep pace. Emerging architectures, such as CFETs, backside power delivery, and advanced 3D integration, undoubtedly will introduce new challenges that traditional failure analysis methods are ill-equipped to handle. Simultaneously, the increasing complexity of devices and the industry’s push for higher reliability necessitate a shift towards real-time monitoring, predictive diagnostics, and AI-driven failure analysis.
Semiconductor manufacturers are moving towards integrated failure sensors that continuously monitor the health of the chip during operation. “With chiplet-based designs, most I/Os aren’t exposed at the package level, making traditional failure analysis methods extremely limited,” Sever explained. “The only viable solution is embedding ‘eyes’ inside the chip itself — monitoring interconnects and power integrity in real-time, rather than relying on limited external access.”
Artificial intelligence (AI) and machine learning (ML) are also poised to play a pivotal role in the future of failure analysis. As the volume of test data continues to grow exponentially, manual failure analysis is becoming increasingly unsustainable. AI-driven tools will be instrumental in helping engineers correlate massive datasets, identify patterns, and predict failures before they impact yield.
“At these nodes, failure analysis isn’t just about collecting more data — it’s about making sense of it,” said Guy Gozlan, AI/ML lead at proteanTecs. “We use machine learning to move beyond raw defect detection, creating a personalized model for each chip. Instead of just flagging anomalies, we predict what should be happening, compare it to real-time performance, and amplify the defect signal above the noise.”
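The general idea Gozlan describes can be sketched as follows (the features, model, and numbers here are invented for illustration and are not proteanTecs’ implementation): fit a per-chip baseline that predicts a monitored parameter from operating conditions, then track the residual between prediction and measurement, which is where a weak defect signal stands out from normal workload-driven variation.

```python
# Illustrative per-chip baseline model; features, units, and values are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 500                                   # telemetry samples from one chip
temp = rng.uniform(40, 90, n)             # degrees C
volt = rng.uniform(0.70, 0.85, n)         # supply voltage, V
delay = 100 - 40 * (volt - 0.70) + 0.05 * (temp - 40) + rng.normal(0, 0.2, n)  # path delay, ps

# Fit this chip's own baseline: delay ~ a*volt + b*temp + c (ordinary least squares).
A = np.column_stack([volt, temp, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, delay, rcond=None)

# Later samples: the same chip develops a subtle resistive via, adding ~0.6 ps.
temp2, volt2 = rng.uniform(40, 90, 200), rng.uniform(0.70, 0.85, 200)
delay2 = 100 - 40 * (volt2 - 0.70) + 0.05 * (temp2 - 40) + rng.normal(0, 0.2, 200) + 0.6

predicted = np.column_stack([volt2, temp2, np.ones(200)]) @ coef
residual = delay2 - predicted             # defect signal, separated from V/T variation
print(f"mean residual: {residual.mean():.2f} ps (baseline noise ~0.2 ps)")
# The raw delays still fall inside the normal operating spread, but the residual
# against the chip's own prediction exposes the ~0.6 ps degradation clearly.
```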
This AI-driven approach will be crucial for high-frequency devices, where signal integrity failures can be intermittent and difficult to isolate using conventional methods. High-speed circuits operate at multi-GHz frequencies and beyond, so even subtle mismatches in impedance, crosstalk, or power noise can trigger unpredictable failures. AI-powered failure analysis tools can analyze frequency-domain data in real time, flagging deviations before they escalate into catastrophic failures.
“AI and machine learning can help process the massive amounts of test data generated at high frequencies, identifying patterns and predicting defects before they become costly failures,” said Adrian Kwan, business development manager at Advantest. “The key is integrating robust data collection methods and real-time analytics to improve both yield and reliability.”
EDFAS Roadmap: Guiding the evolution of failure analysis
Failure analysis is struggling to keep pace with rapid advancements in technology. Shrinking geometries, complex packaging architectures, and the emergence of backside power delivery and 3D-ICs are pushing traditional FA techniques to their limits. To navigate this landscape, the industry has developed the Electron Device Failure Analysis Society (EDFAS) roadmap, which outlines the key challenges in failure analysis and provides a framework for innovation.
“The EDFAS Failure Analysis Technology Roadmap is a crucial step in addressing the growing gap between semiconductor complexity and FA capabilities,” says Endrinal. “Our traditional methods are no longer sufficient, and without an industry-wide approach to advancing FA, we risk falling behind in yield learning and defect resolution. This initiative is about bringing together the entire ecosystem — chipmakers, tool vendors, and research institutions — to ensure that failure analysis evolves in tandem with semiconductor technology.”
The EDFAS roadmap acknowledges the limitations of current methodologies and highlights the need for new failure isolation techniques, improved automation, and greater integration between design, debug, test, and failure analysis tools. It also underscores the importance of collaboration between foundries, EDA vendors, and tool providers to ensure that failure analysis keeps pace with emerging architectures and new materials.
Conclusion
Failure analysis is undergoing a profound transformation, from a post-mortem process to one that can deconvolve root causes, improve yield and reliability, and help control costs. As devices continue to increase in complexity, traditional fault isolation techniques are proving less effective, forcing manufacturers to adapt. There is a steady trend toward embedded real-time monitoring, AI-driven analytics, and non-destructive imaging in failure analysis strategies.
To maintain a competitive edge, semiconductor companies must adopt Design for Test, Diagnosis, and Debug (DFTDD) principles from the earliest stages of development. This requires a cultural shift within the industry, where collaboration between design, test, and failure analysis teams becomes standard practice rather than an afterthought.
The future of failure analysis will be shaped by new technologies like AI, machine learning, non-destructive imaging, and real-time failure monitoring. The EDFAS roadmap highlights the importance of industry-wide collaboration in developing next-generation failure analysis methodologies.
“Failure analysis is becoming more vital than ever,” adds Endrinal. “At advanced nodes, defects are harder to isolate, debug cycles take longer, and traditional methods are breaking down. If we don’t rethink FA strategies now, yield learning and quality control will suffer in the coming years.”
Related Reading
Yield Management Embraces Expanding Role
From wafer maps to lifecycle management, yield strategies go wide and deep with big data.
Simulation Closes Gap Between Chip Design Optimization And Manufacturability
Rigorous testing is still required, but an abstraction layer can significantly reduce errors in the fab while optimizing device behavior.