AI is enabling timelier and more accurate data, but AI-based command and control has yet to appear.
IC manufacturers are increasingly relying on intelligent data processing to prevent downtime, improve yields, and reduce scrap. They are integrating that with fault detection and classification (FDC) to trace faults to their cause.
Today’s FDC systems feature better sensors, variability control, and both predictive and prescriptive modeling. In the future, FDC will enable real-time decision-making using tools like LLMs and agentic AI.
A big part of what has changed in FDC revolves around fault prediction. “Traditional FDC wasn’t predictive,” said Jon Herlocker, vice president and general manager of Tignis, a Cohu Analytics Solution. “It relied on humans and their process/equipment expertise to identify a precondition to a fault and then monitor for that.” Such approaches were manual, engineering-intensive, and slow to respond. “Modern fault detection systems leverage machine learning to continuously analyze equipment signals and recognize precursors to a fault, effectively predicting the fault before it happens.”
Engineers use FDC to respond in real time to the changes that have the greatest impact on known good die production. With current tools, FDC can take advantage of AI capabilities to better classify killer defects while speeding root cause analysis.
Modern FDC also enables better outlier detection, a critical step in differentiating marginal die from good die. “The biggest improvement has been with the shift from univariate to multivariate analysis. Having the ability to work across multiple variables improves sensitivity, which reduces false positives,” said Joe Fillion, director of product management at Onto Innovation.
FDC is inherently tied to electrical testing, which determines yield. At the same time, there is an increasing need for real-time processing to determine whether an outlier is good. “Another key change is real-time data handling. This provides immediate and predictive fault detection to better protect the product. It also gives the ability for dynamic (and almost in-situ) recipe adjustments – particularly when using run-to-run control,” said Fillion.
“Where we see opportunity is for a class of problems that really benefit from real-time data and real-time processing,” said Regan Mills, president of Teradyne’s Product Test Group. “You obviously make better decisions about whether the device is good or bad. Or maybe you’re speed grading. But you need to make a decision about that device as fast as possible, using information you’re getting from that device, as well as information from its peers over time. So you’re using aggregated data in a way that typically hasn’t been done before.” This is performed using edge computing resources at the testing site.
Faster reaction is made possible by the growing level of computing resources in the fab, but many experts caution that the benefit is not industry-wide. “With traditional FDC, you would find a problem six or seven weeks after it occurred, at wafer test,” said Aftkhar Aslam, CEO of yieldWerx. “But by then you’ve got all this material that’s already gone through fabrication. It’s bad and cannot be reworked. That is the major improvement with today’s 300-millimeter fabs. The equipment going into the fab includes high-performance data centers that can ingest this data and make decisions in real-time. But we cannot say this has helped the industry overall because the older fabs do not have that capability.”
And while semiconductor fabs have long utilized FDC procedures and analyses, the new frontier in FDC is occurring in assembly and advanced packaging. “Where we’re seeing the biggest implementation for FDC is in advanced packaging, among TSMC, Intel, Samsung, even less well-known names that are building these multi-chip advanced packages,” said Jonathan Holt, manager of Fab Applications Solutions at PDF Solutions. “This encompasses all the complexity of building a package on another substrate, precisely placing multiple components, fabricating through silicon vias, and face-to-face bonding — they have to have FDC and real-time process control.”
FDC progress: a brief history
Early fault detection and classification required a great deal of manual effort by process engineers, who gathered data from multiple sources such as wafer processing, metrology, and different test insertions. FDC on any given tool was accomplished by first tracking the minimum, maximum, and mean of sensor trace data. The engineer then assigned alarm thresholds and monitored the tool going forward.
But placement of thresholds is not as easy as it seems. “Finding the right thresholds is challenging, because if the thresholds are too tight, you get spammed with alerts, and if they are too generous, faults are missed,” said Cohu’s Herlocker.
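The summary-statistic approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function names, the pressure trace, and the limit values are all hypothetical and chosen only to show the mechanics.

```python
from statistics import mean

def summarize(trace):
    """Reduce a sensor trace to the three summary features named in the text."""
    return {"min": min(trace), "max": max(trace), "mean": mean(trace)}

def check_limits(features, limits):
    """Return the names of features that fall outside their (low, high) limits."""
    return [name for name, value in features.items()
            if not (limits[name][0] <= value <= limits[name][1])]

# Hypothetical chamber-pressure trace and hand-picked limits (illustrative only).
trace = [10.0, 10.2, 9.9, 10.1, 12.5, 10.0]
limits = {"min": (9.5, 10.5), "max": (9.5, 11.0), "mean": (9.8, 10.6)}

alarms = check_limits(summarize(trace), limits)
print(alarms)  # ['max']
```

Here the single spike trips only the maximum limit. Tightening the mean limit would also catch it, but would raise more nuisance alarms on normal runs, which is the tradeoff Herlocker describes.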
Deploying FDC across an entire factory in this manner can take years, because engineers must choose the right correlations (e.g., between contact resistance and a sensor signal) and the best thresholds. “Furthermore, equipment and processes are changing regularly, so FDC systems need regular staff to monitor and adjust thresholds or else the alerts become a nuisance and are ignored,” Herlocker added.
Indeed, in the big data era, data is likely to be ignored because engineers are busy running their processes and equipment. And because engineers needed to capture anomalies as soon as possible in production, they turned to continuous monitoring of time-series data.
“There was still the need to control the tool at the point of use, so we began taking the full trace data from the sensor and then putting guard bands around that at the tool to make sure it’s not drifting,” said PDF’s Holt. “So now I don’t have to extract the features. And I may miss data with a mean or slope or min or max, but I would capture it in the full time series data trace, which is also used for tool-to-tool matching in the fab.”
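A guard band on the full trace can be sketched as follows. This is one simple way to build such a band (per-time-step mean plus or minus k standard deviations across known-good runs), offered as an assumption rather than the method Holt's team uses. The traces, `k`, and `min_width` are hypothetical.

```python
from statistics import mean, stdev

def build_guard_band(reference_traces, k=3.0, min_width=0.05):
    """Per-time-step band: mean +/- k*std across known-good reference traces."""
    band = []
    for samples in zip(*reference_traces):
        center = mean(samples)
        # min_width keeps the band from collapsing where references agree exactly.
        half_width = max(k * stdev(samples), min_width)
        band.append((center - half_width, center + half_width))
    return band

def violations(trace, band):
    """Indices where the trace leaves the guard band."""
    return [i for i, (x, (lo, hi)) in enumerate(zip(trace, band))
            if not lo <= x <= hi]

# Hypothetical known-good traces of a step that turns on at index 2, off at index 6.
references = [
    [0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 0.0, 0.0],
    [0.1, 0.0, 5.1, 4.9, 5.0, 5.1, 0.1, 0.0],
    [0.0, 0.1, 4.9, 5.0, 5.1, 4.9, 0.0, 0.1],
]
band = build_guard_band(references)

# Same min, max, and mean as the first reference, but the step is one sample late.
shifted = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 0.0]
print(violations(shifted, band))  # [2, 6]
```

The shifted trace would pass any check based only on minimum, maximum, and mean, yet it leaves the band at both edges of the step. That is the kind of miss the full-trace approach is meant to avoid.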
This time-series data also needs to include preventive maintenance steps. Recording this complete lifecycle requires large amounts of data and data storage. “For example, we looked at the trace data for 30,000 wafers at TSMC and it was 100 terabytes. They run 250,000 wafers through each of their factories. So you’re talking about petabytes of data that you have to analyze and maintain when you go to time series data monitoring. That is a challenge to support,” added Holt.
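The scale Holt describes follows from simple arithmetic. The sketch below uses only the figures in the quote and assumes, purely for illustration, that trace volume scales linearly with wafer count and that units are decimal (1 PB = 1,000 TB).

```python
# Figures taken from the quote above.
SAMPLE_TB = 100               # trace data for the sampled wafers, in terabytes
SAMPLE_WAFERS = 30_000
FACTORY_WAFERS = 250_000      # wafers run through one factory

# Assumption: data volume scales linearly with wafer count.
gb_per_wafer = SAMPLE_TB * 1000 / SAMPLE_WAFERS
tb_per_factory = SAMPLE_TB * FACTORY_WAFERS / SAMPLE_WAFERS
pb_per_factory = tb_per_factory / 1000

print(f"{gb_per_wafer:.2f} GB per wafer")               # 3.33 GB per wafer
print(f"{tb_per_factory:.0f} TB per 250,000 wafers")    # 833 TB per 250,000 wafers
print(f"{pb_per_factory:.2f} PB per 250,000 wafers")    # 0.83 PB per 250,000 wafers
```

Under these assumptions a single factory's 250,000 wafers generate close to a petabyte of trace data, so multiple factories or retained history quickly reach the multi-petabyte range.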
FDC initially relied on single-variable (univariate) correlations, which quickly proved insufficient. Engineers then devised multivariate models to capture complex relationships among variables, which improved sensitivity and reduced the number of false positives.
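The difference between univariate and multivariate checks can be illustrated with a squared Mahalanobis distance on two correlated signals. This is a textbook technique used here as a stand-in, not a description of any product mentioned in this article. The sensor history is hypothetical, and the chi-square limit is only an approximation for so few samples.

```python
import math
from statistics import mean, stdev

def within_3_sigma(value, samples):
    """Classic univariate check: is the value inside mean +/- 3 sigma?"""
    m, s = mean(samples), stdev(samples)
    return m - 3 * s <= value <= m + 3 * s

def mahalanobis_sq_2d(point, data):
    """Squared Mahalanobis distance of a 2-D point from the cloud `data`."""
    xs, ys = zip(*data)
    mx, my = mean(xs), mean(ys)
    n = len(data)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T S^-1 d, with the 2x2 inverse written out explicitly.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical history of two strongly correlated signals (roughly y = 2x).
history = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
           (5.0, 10.1), (6.0, 11.9), (7.0, 14.0)]
suspect = (2.0, 12.0)   # each value is individually plausible, the pair is not

xs, ys = zip(*history)
univariate_ok = within_3_sigma(suspect[0], xs) and within_3_sigma(suspect[1], ys)

# Chi-square quantile for 2 degrees of freedom at 99.73% has a closed form.
limit = -2 * math.log(1 - 0.9973)
multivariate_ok = mahalanobis_sq_2d(suspect, history) <= limit

print(univariate_ok, multivariate_ok)  # True False
```

Both single-variable checks pass the suspect point because each value lies inside its own range. The multivariate check rejects it because the pair breaks the learned relationship between the two signals.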
The ML/AI revolution
As quality requirements increased, so did the need to adopt ML-based modeling. Machine learning, a subset of AI, can analyze large datasets, identify patterns, and improve fault detection accuracy.
“Traditionally, FDC relied on static thresholds and SPC,” said Onto’s Fillion. “Over time this has been replaced by ML, and eventually AI to model more complex, nonlinear relationships between process conditions and variables with faults and errors. ML models enhance both accuracy and timeliness. Additionally, advanced computational ability allows prescriptive and predictive modeling as well as anomaly detection.”
Predictive models forecast what might happen, while prescriptive models recommend specific actions to achieve optimal outcomes based on those predictions.

Fig. 1: Communication among elements of FDC, digital twins, analytics, third-party systems, and MES in factories. Source: PDF Solutions
Now, the biggest change involves incorporating AI tools to speed analysis and catch anomalies or outliers that humans could miss. “There have been very significant improvements around deep classification, using AI and ML to look for signatures and connect the dots that would never have been previously connected,” said yieldWerx’s Aslam.
Remote command and control of fab equipment became a must-have feature during the COVID-19 pandemic, when engineers could not travel to troubleshoot problems on-site, and it has been utilized since then. A connection between EDA and ATE tools is especially powerful.
“Our Remote Connect tool allows users to remotely connect to the ATE to use EDA tools, SoC code debuggers, and custom bench scripts,” said Richard Fanning, lead software engineer at Teradyne. “This allows teams to investigate issues with the tool and expert of their choice, streamlining the process of getting the best person using the right tool in real-time. We have worked with industry leaders to make this integration as simple as possible. Customers have expressed to us that the main way for them to cut back on faults early is to eliminate differences between the design bench setup, simulation, and the ATE test program.”
Sensor data analysis and digital twins are especially important for leading-edge devices. “The number of advanced sensors that are required to support these advanced nodes is exploding,” said John Behnke, general manager of Smart Manufacturing at Inficon. “You just can’t make a sub-2nm process tool without embedded sensorization. AI augmentation of smart sensors is a real requirement in FDC. And then, the ability to actually integrate that information across the supply chain or in the factory is becoming more and more important. Also, digital twins are becoming more accepted, and we see the need for them to communicate with each other.”
Digital twins are expensive to build, especially when they include the critical sensors on process equipment. A lack of standards also has impeded acceptance. NIST and SEMI are in the process of setting standards for digital twins. “I’m sure they’re going to recommend some standards, possibly for communication. But they’ll include all the FDC SEMI standards, probably all the way up through EDA and all the IoT stuff,” said PDF’s Holt. “But then, how do you transfer that data at the speed that’s needed? That may require multiple protocols, and it’s probably going to be containerized with a certain amount of security and overhead required.”
Outlier detection
The main advances recently in outlier and anomaly detection have come in feature extraction and signal isolation, rather than the core algorithms themselves. “For these systems to be successful, it is essential to expose focused data signals to outlier detection algorithms to reduce the number of detections,” said Ryan Stoddard, director of data science at Tignis, a Cohu Analytics Solution. “Modern feature extraction algorithms, such as deep learning auto-encoders, allow extraction of subtle trace shape differences that are not easily detectable by simple statistical features. LLMs can easily search historical OCAP datasets and expose key signals that have most often been early warnings of prior issues. Collectively, a domain expert can deploy these new technologies to make anomaly detection systems more focused and broadly relevant.”
The concept of shift left is becoming increasingly important for detecting faults earlier in the process flow, particularly for chiplet-based modules. “We can apply machine learning technologies to predict failures at an earlier stage by utilizing historical data and analyzing recent datasets. However, it involves a tradeoff between avoiding overkill and ensuring accurate fault detection,” said Kotaro Hasegawa, leader of applied research at Advantest. “One application of machine learning is outlier detection using Dynamic Part Average Testing (DPAT), where we dynamically adjust the limits based on various trends observed in test results. This approach is already heavily used in automotive devices, but now we’re also applying it to other devices,” added Hasegawa.
DPAT is a manufacturing technique that uses statistical analysis to dynamically set test limits, rather than using fixed ones, to identify and remove outlier parts that are likely to cause quality and reliability issues. DPAT removes process-related outliers. But companies are going beyond DPAT to improve product quality and yield.
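A minimal sketch of dynamic limits, assuming one commonly used robust formulation: the median plus or minus k robust sigmas, where robust sigma is the interquartile range divided by 1.35 (for a normal distribution the IQR is about 1.35 sigma). The choice of k = 6, the static limits, and the measurement values are illustrative assumptions, not figures from the companies quoted here.

```python
from statistics import quantiles

def dpat_limits(values, k=6.0):
    """Dynamic limits from the population itself: median +/- k * robust sigma."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    robust_sigma = (q3 - q1) / 1.35
    return q2 - k * robust_sigma, q2 + k * robust_sigma

def dpat_outliers(values, k=6.0):
    """Indices of parts that fall outside the dynamic limits."""
    lo, hi = dpat_limits(values, k)
    return [i for i, v in enumerate(values) if not lo <= v <= hi]

# Hypothetical test results for one wafer, with fixed spec limits of 0.5 to 1.5.
STATIC_LIMITS = (0.5, 1.5)
values = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 1.00, 1.01, 1.30]

passes_static = all(STATIC_LIMITS[0] <= v <= STATIC_LIMITS[1] for v in values)
print(passes_static)          # True
print(dpat_outliers(values))  # [9]
```

Every part passes the fixed limits, but the last one sits far from its peers and is removed by the dynamic limits. Because the limits are recomputed for each population, they tighten or shift as the distribution of results changes.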
“Part average testing is a critical aspect of IATF16949 — you must have it for automotive manufacturing,” said Boyd Finlay, director of Solutions Engineering at Tignis, a Cohu Analytics Solution. “Also, Out Of Family (OOF), Good Die Bad Neighborhood (GDBN), Statistical Bin Limits (SBLs), and Statistical Yield Limits (SYLs) are all needed. And I don’t hear anybody talking about Part Average Predictive Test or Virtual Test. Replacing the actual test with a prediction seems like an opportunity to reach zero defective parts per billion.”
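Of the methods Finlay lists, Good Die Bad Neighborhood is the easiest to illustrate. The sketch below assumes a simple rule (flag a passing die when more than a set number of its eight surrounding die failed). Production rules are typically weighted and tuned per product, and the wafer map and threshold here are hypothetical.

```python
def gdbn_suspects(wafer_map, max_bad_neighbors=3):
    """Flag passing die (1) with more than `max_bad_neighbors` failing (0) neighbors."""
    rows, cols = len(wafer_map), len(wafer_map[0])
    suspects = []
    for r in range(rows):
        for c in range(cols):
            if wafer_map[r][c] != 1:      # only passing die are candidates
                continue
            bad = sum(
                1
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
                and wafer_map[r + dr][c + dc] == 0
            )
            if bad > max_bad_neighbors:
                suspects.append((r, c))
    return suspects

# Hypothetical 5x5 map: 1 = passed test, 0 = failed test.
wafer_map = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
print(gdbn_suspects(wafer_map))  # [(2, 2)]
```

The die at row 2, column 2 passed its own tests but is surrounded by six failures, so it is flagged as a reliability risk even though its own data looks clean.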
When yield excursions do occur, it’s important to react quickly — and to know how to react. “So the question is really around, ‘When do you react?’ And how much of a financial impact would a yield excursion have on your organization? If it’s a quarter percent, what is the cost associated with that,” said Marc Hutner, director of Product Management at Siemens Digital Industries. “You’re taking a whole series of measurements as part of your test program, and there’s the standard binning that you’re doing associated with it, but then you’re analyzing the data and asking, ‘What happened here?’ You could do something like retest. Or, after you’ve taken a look at the data, you might realize there is something going on in the setup.”
That could be something as simple as a faulty contact to the DUT. “For logic devices, we’ve defined a workflow for ATPG scan that includes how the test data gets collected and formatted,” Hutner said. “So we provide patterns that have both drive data and expectations. We then ask the customer to format the data, either in a standard data test or STDF, and then that feeds directly into our volume diagnosis workflow, where it can be analyzed to figure out where the yield problem exists, and then a human looks at the reports.”
When selecting an appropriate model for a dataset, engineers can develop supervised or unsupervised models. “There are many cases where unsupervised learning makes sense, because the processes and the technology have been known for years,” said yieldWerx’s Aslam. “But there are new technologies like AI chips and photonics. In these cutting-edge areas, so much of the processing is new that I wouldn’t trust unsupervised learning, so it does not apply across the board.”
Conclusion
The ability to interface with LLMs or other forms of AI on process tools will eventually facilitate even faster reactions to yield excursions in manufacturing. FDC systems that are mature on fab tools are making their way to assembly and testing lines, where tool-to-tool matching on die pick-and-place or singulation tools, for example, will ensure better process control.
FDC is still a process that engineers must direct, monitor, and react to, while also performing retraining exercises to keep the models relevant. Increasing levels of cooperation among multiple companies are enabling rapid progress in this critical area of smart manufacturing.