High-Quality Data Needed To Better Utilize Fab Data Streams

Engineers require timely and aligned data with just the right level of granularity.


Fab operations have wrestled with big data management issues for decades. Standards help, but only if sufficient attention is paid to detail during data collection.

Semiconductor wafer manufacturing represents one of the most complex manufacturing processes in the world. With each generation of process improvement comes more sophisticated fab equipment, new process recipes, and exponential increases in measurement and operational data.

To manage such a complex environment, fabs rely on factory automation systems and equipment process control at each process step. An alphabet soup of analysis methods and operations software supports the detection of defect anomalies, the prediction of impending maintenance needs, and material management. That analysis requires high-quality data, without which engineering teams and ML algorithms could misdirect efforts to improve yield and maintain equipment. But while communication standards for data and operations assist in meeting data quality expectations, they only go so far.

A factory’s data management system needs to make the most of the fab equipment data generated. “Most important is the continuous monitoring that this system is operational and all sources are still in communication with the larger data retention system,” said Mike McIntyre, an independent consultant. “This is closely followed by the requirement that the data being ingested is of good quality.”

Data quality encompasses several attributes, including the parameter measured, its accuracy, precision, date/time stamp, and additional metadata for context. Such attributes assist when merging data from the myriad of fab equipment sensors, hundreds of process steps, and activity in the sub-fab. Without industry standards, merging fab data becomes a messy, gargantuan, and nearly impossible task.

Standards are tedious to develop, yet they save time and effort when managing fab data generated on the order of petabytes per day. SEMI standards committees document in great detail the requirements with respect to formats, data rates, and now variable names. [1] Engineering teams need to fully understand the meaning of data sent and received to support operations, as well as decisions based upon interpretation of that data. That understanding is necessary to create reliable and maintainable AI/ML-based algorithms that guide process control, interpret fault detection and classification (FDC) data, and set maintenance alarms.

Fig. 1: Semiconductor fab automation systems and associated communication standards. Source: James Moyne, IRDS Factory Automation Chapter co-chair

Fab automation standards are often categorized as one of the following:

  • Interface A covers communication with fab equipment/tools (e.g., data, diagnostics, and control);
  • Interface B covers communication between software applications within a factory; and
  • Interface C covers communication between factory systems and the external world (i.e., remote access).

How much data is enough?
Fab engineers need equipment data for process control, predictive maintenance, and anomaly detection. However, more data is not always better. It is more important to focus on having salient information at the appropriate granularity to meet the engineering objective. Otherwise, the data networks and associated data storage systems will include reams of unused data that potentially delay time-to-detection and rectification.

The relevant questions here are what data is needed, and at what frequency. Analyzing fab equipment performance relies upon data from a myriad of sources.

Data generated in the sub-fab has been an underutilized source. Such data can be culled from the pipes carrying water, clean dry air, chemicals, and gases, and from their associated systems. Likewise, useful data comes from the power utility systems that provide low-, medium-, and high-voltage power to various process tools. Factory data is collected from the sub-fab systems, but connecting it to the process tools those systems serve is not so easy.

“Fabs could benefit a lot from having much tighter equipment design and integration between sub-fab and tool,” said Boyd Finlay, director of solutions at Tignis. “We see this at customer sites a lot. We have deployed ML anomaly detection models to fleets of tools where we are measuring the ‘health of the line.’ We are seeing real fleet and tool cluster anomalies that are 100% traced back to known utilities issues. So it makes sense that the sub-fab data publication ought to be streamed through the main process tool. Doing so would avoid the additional data engineering work that so often just doesn’t get done in fabs — especially older fabs with more limited resources.”

In general, the balance of data frequency and data storage needs to be carefully considered. The data collection rate needs to be determined by domain experts. In addition, filtering data with meaningful statistics results in more actionable information for decisions made by either ML-based algorithms or engineering teams.

“An important part of understanding these systems is knowing the required data collection rates. You’re dealing with chambers that have time constants that are sometimes on the order of seconds. But some data needs to be collected quickly,” noted James Moyne, associate research scientist at University of Michigan, IRDS co-chair and technology contractor at Applied Materials. “For instance, if you’re looking for electrostatic chuck issues that might cause defects on the wafer because of sparks, you’ll need high-speed data collection on the order of 2 kHz. But if you’re monitoring pressure in the chamber to determine when you’ve reached a settling point, collecting at 1 Hz is probably sufficient. Many tool collection strategies employ a broad-brush approach — for example, a 100 Hz data collection rate on everything. A better solution is more selective, focusing on the nitty gritty of what information is needed, and thus which sensor values you want to get quickly and which ones are not so important in terms of speed.”
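As a rough illustration of that selective approach, the sketch below expresses a per-sensor collection plan in Python. The sensor names, rates, and rationales are assumptions loosely based on Moyne's examples, not values taken from any particular tool interface.

```python
# A minimal sketch (not a real tool interface): a selective, per-sensor
# collection plan with a rationale recorded for each rate choice.

from dataclasses import dataclass

@dataclass
class SensorPlan:
    name: str        # sensor or trace parameter identifier (hypothetical)
    rate_hz: float   # requested sampling rate
    rationale: str   # domain-expert reasoning behind the rate

COLLECTION_PLAN = [
    SensorPlan("esc_leakage_current", 2000.0, "arcing events last only milliseconds"),
    SensorPlan("chamber_pressure", 1.0, "settling behavior plays out over seconds"),
    SensorPlan("heater_temperature", 0.2, "drifts slowly; 5-second samples suffice"),
]

def samples_per_hour(plan):
    """Rough volume estimate, useful for comparing against a blanket 100 Hz rate."""
    return {p.name: int(p.rate_hz * 3600) for p in plan}

if __name__ == "__main__":
    for sensor, count in samples_per_hour(COLLECTION_PLAN).items():
        print(f"{sensor}: {count:,} samples/hour")
```

Even this toy estimate makes the spread obvious. The electrostatic chuck signal dominates the data volume, which is exactly why a blanket high rate on every channel is wasteful.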

Others agree that a strategic focus on the necessary information is beneficial.

“It’s essential to catch information, not only raw data,” said Dieter Rathei, CEO of DR Yield. “Sensors in the fab can provide measurements at a high frequency. I’d rather collect some meaningful statistics from a sensor, like averages, minimum and maximum values, along with any abnormal observed values, than 15,000 repetitions of the same pressure measurement in a chamber. But all these measurements or statistics really only become useful if we have the full context of these measurements — what wafers have been in the chamber at the time, how they were aligned towards the sensor, and so on.”
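A minimal sketch of that kind of reduction is shown below, assuming a single pressure trace and an illustrative 3-sigma rule for flagging abnormal values. The context fields and threshold are hypothetical.

```python
# Minimal sketch: reduce a raw in-chamber pressure trace to summary statistics
# plus any abnormal readings, instead of storing every repeated sample.
# The 3-sigma rule and context fields are illustrative assumptions.

from statistics import mean, stdev

def summarize_trace(samples, context, sigma_limit=3.0):
    """Return summary statistics and out-of-limit samples for one trace segment."""
    mu = mean(samples)
    sd = stdev(samples) if len(samples) > 1 else 0.0
    abnormal = [x for x in samples if sd and abs(x - mu) > sigma_limit * sd]
    return {
        "context": context,      # e.g., wafer ID, recipe step, chamber
        "n": len(samples),
        "mean": mu,
        "min": min(samples),
        "max": max(samples),
        "abnormal": abnormal,    # keep the outliers, drop the repetition
    }

if __name__ == "__main__":
    trace = [101.3] * 50 + [150.2] + [101.4] * 49   # one spike among 100 samples
    print(summarize_trace(trace, {"wafer": "W-07", "step": "etch_main"}))
```

Rathei's caveat still applies to the output: the summary is only useful if the context dictionary genuinely identifies the wafers in the chamber and their position relative to the sensor.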

Advanced process control of sensitive process steps demands a significantly higher data rate than is specified in current equipment data acquisition (EDA) standards. Consequently, engineering teams use software applications that cull real-time data directly from equipment log files.

“We never standardize log files,” said Tignis’ Finlay. “And yet, the advanced fabs have to parse these regularly for higher-speed data insights and additional parameters that do not get published via GEM300 or EDA (a.k.a. Interface-A). Also, SEMI standards on the tool side are only as good as the OEM software execution. There is a high degree of variability across tool makers, even in the advanced process application space. Some advanced 300mm tool software was first architected in 1997.”
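Because those log formats are vendor-specific and unstandardized, any parser is necessarily bespoke. The sketch below shows the general pattern with a hypothetical line format and hypothetical parameter names; it is not modeled on any particular OEM's logs.

```python
# Minimal sketch of culling extra parameters from a tool log file. The line
# format and parameter names are hypothetical; real logs vary by OEM and
# often by tool generation.

import re
from datetime import datetime

# Assumed line format: "2024-03-01T12:00:00.123 RF_REFLECTED=4.7 W"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<name>[A-Z_]+)=(?P<value>-?\d+(\.\d+)?)"
)

def parse_log(lines, wanted=frozenset({"RF_REFLECTED", "THROTTLE_VALVE_POS"})):
    """Yield (timestamp, parameter, value) for parameters absent from the EDA stream."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("name") in wanted:
            yield (datetime.fromisoformat(m.group("ts")),
                   m.group("name"),
                   float(m.group("value")))

if __name__ == "__main__":
    sample = [
        "2024-03-01T12:00:00.123 RF_REFLECTED=4.7 W",
        "2024-03-01T12:00:00.133 CHAMBER_PRESSURE=101.3 Pa",
    ]
    for record in parse_log(sample):
        print(record)
```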

Data quality
With any data set and model-building effort, the old saying "garbage in, garbage out" typically holds true. Big data scenarios in a wafer fab heighten this concern.

An ongoing challenge with making the most of the fab data generated is data quality, which depends on data alignment. Alignment requires people to understand the data details in order to merge disparate data sources with context. Otherwise, there can be confusion and misinterpretation, which could prove unforgiving when building a complex algorithm that makes real-time decisions.

In big data circles, data quality is referred to as data veracity, which is defined as follows:

“The truthfulness or reliability of the data, which refers to the data quality and the data value. Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it. The data quality of captured data can vary greatly, affecting an accurate analysis.” [2]

As fab engineering teams build more complex models that need to be maintained (i.e., updated/corrected), the necessity of data quality sometimes gets lost.

“A big problem in semiconductor manufacturing is we focus too much on the AI/ML model building aspect of the data lifecycle and not enough on the other aspects,” said University of Michigan’s Moyne. “The data lifecycle starts when you extract, transform, load (ETL) the information. Then you do your data processing and model development, and this is where there is a heavy focus on AI/ML. Then, after model verification and validation, you deploy the model and then maintain it. But in the beginning of this process, if the data going into your system has low quality, your solution probably won’t work, and it definitely will be difficult to maintain. One piece of bad data destroys 10 pieces of good data. We realized this issue in the SEMI standards community and came up with the E151 guideline and E160 specification for data quality. As we move forward with more complex and predictive solutions, including digital twins, we will have to focus more on the data quality aspect if we hope to have effective solutions that are maintainable.”

Other industry roadmap efforts spell out the need for data quality.

The factory integration chapter for the International Roadmap for Devices and Systems (IRDS) considers big data veracity. [3, p. 35] It highlights the challenges of data time stamps, equipment vs. “human entered data,” data merged from different sources collected at different data rates, as well as the different levels of data context richness.

Data alignment
Measurement data without the appropriate context severely limits the reliable actions that can be taken by an engineering team or an ML algorithm. Discussions with multiple industry experts highlight the need for finely tuned data alignment to ensure that accurate and precise metadata is recorded. This requires standards experts and engineering teams that fully comprehend the details. Otherwise, misinterpretation can result.

All data has a time context. The attributes of accuracy, precision, and granularity matter for a data packet and its time stamp, but there are tradeoffs.

“It becomes very expensive to carry time stamps around with every single piece of data,” noted Brian Rubow, director of solutions engineering at PDF Solutions’ Cimetrix Connectivity Group and co-chair of the SEMI NA Information & Control Committee. “Originally, before we finished the EDA standards, we planned to time stamp every single piece of data, but it just killed the throughput. We couldn’t do it.”

The expansion of data buffering options with EDA Freeze 3.0 will enable more data to be sent per packet. But tradeoffs remain, and those require an understanding of seemingly mundane details.

Rubow said that engineers and technicians typically ask the following questions (a data alignment sketch follows the list):

  1. What’s governing the data?
  2. How is this time stamp actually saved?
  3. What is the data format?
  4. How old is each piece of data that I’m attributing to this time stamp?
  5. What’s the precision and accuracy of the time stamp?
  6. When collecting data from different sources, are they buffered with the same time stamp data?
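The last of those questions, in particular, comes down to joining streams that were sampled or buffered at different rates. Below is a minimal sketch of one common approach, a nearest-preceding-time-stamp join, assuming numeric time stamps in seconds; the column names, rates, and one-second tolerance are illustrative only.

```python
# Minimal sketch: attach the most recent slow-rate reading to each fast-rate
# sample via a backward as-of join. Column names, rates, and the 1-second
# tolerance are assumptions for illustration.

import pandas as pd

# Fast stream: 10 Hz pressure readings (must be sorted by the key column).
fast = pd.DataFrame({
    "t_sec": [i / 10 for i in range(20)],
    "pressure": [101.3 + 0.01 * i for i in range(20)],
})

# Slow stream: 1 Hz chamber temperature readings.
slow = pd.DataFrame({
    "t_sec": [0.0, 1.0],
    "temperature": [350.0, 350.4],
})

# Each pressure sample picks up the latest temperature reading that is no
# more than 1 second old; anything older is left empty rather than guessed.
aligned = pd.merge_asof(fast, slow, on="t_sec", direction="backward", tolerance=1.0)
print(aligned.head(12))
```

Carrying the slower reading forward gives every high-rate sample its context, but the join is only as trustworthy as the accuracy and precision of the time stamps on both sides, which is exactly what the earlier questions probe.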

Data alignment goes beyond time stamps. Attention to granularity and context captures the highly interactive nature of semiconductor wafer manufacturing.

“Time stamp matching is only the starting point. Once the data is ‘paired’ and ‘parsed,’ it still must be aligned for granularity to the sources that need it. These operations can and do become complex in fully integrated factories,” observed McIntyre. “For example, FDC trace data needs to be summarized, likely across multiple users. FDC data needs to be summarized for the recipe, the recipe segment, the material being processed, and the maintenance logs, just to name a few.  Data frequently has to be converted from time-based to material-based [format], and within the material it would need to be applied or aggregated across die, region, wafer, cassette, batch, tool, etc., just to name a few.”
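A simplified sketch of the time-based to material-based conversion McIntyre describes appears below. It assumes wafer start and end times are already available from equipment events; the parameter, windows, and aggregation choices are illustrative.

```python
# Minimal sketch: convert a time-based FDC trace into per-wafer (material-based)
# summaries using wafer processing windows derived from equipment events.
# Parameter values, windows, and aggregations are hypothetical.

from statistics import mean

# Trace samples: (seconds, value), e.g., an RF power reading every 2 seconds.
trace = [(t, 500.0 + (t % 30) * 0.1) for t in range(0, 90, 2)]

# Wafer processing windows: (wafer_id, start_sec, end_sec).
wafer_windows = [("W-01", 0, 30), ("W-02", 30, 60), ("W-03", 60, 90)]

def aggregate_by_wafer(trace, windows):
    """Summarize the trace per wafer so it can be joined to wafer-level data."""
    summaries = {}
    for wafer_id, start, end in windows:
        values = [v for t, v in trace if start <= t < end]
        if values:
            summaries[wafer_id] = {
                "n": len(values),
                "mean": mean(values),
                "min": min(values),
                "max": max(values),
            }
    return summaries

if __name__ == "__main__":
    for wafer_id, stats in aggregate_by_wafer(trace, wafer_windows).items():
        print(wafer_id, stats)
```

The same pattern repeats at each level of the hierarchy McIntyre lists, with die, cassette, batch, or tool windows substituted for the wafer windows.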

Consistent naming helps data alignment, and naming conventions are now being defined in SEMI standards.

“A challenge with SECS/GEM interfaces is that implementors can name variables, events, and exceptions any way they want, even when these variables are defined by SEMI standards. This results in a lot of back-and-forth trying to figure out if similarly named items are actually the ones defined in the SEMI standard,” said Albert Fuchigami, senior standards specialist at PEER Group. “We are defining well-known names (WKN) associated with these items. By standardizing the WKNs to use, we make it easier to link to the content defined in SEMI standards, and we simplify the integration efforts between the fabs and equipment suppliers.”

Fuchigami added that the affected standards groups are incorporating WKNs in two steps. First, they have begun updating the SECS/GEM interface standards to define WKNs; so far, SEMI E30, E40, E87, E90, E94, E116, and E157 have been updated. [4] Next, they will update standards to require the use of WKNs. For instance, in the upcoming EDA Freeze 3, the metadata standard SEMI E164 will require WKNs.
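Until WKNs are required across the board, host-side software typically has to map each implementer's names onto a common vocabulary before data from different tools can be merged. The sketch below illustrates that normalization step; the vendor labels and mappings are hypothetical, with the real WKNs defined in the updated SEMI standards.

```python
# Minimal sketch: normalize implementer-specific SECS/GEM variable names to
# well-known names (WKNs) before merging data across tools. The mappings and
# vendor labels below are hypothetical; actual WKNs come from the SEMI standards.

VENDOR_TO_WKN = {
    "vendor_a": {"PPExecName": "ProcessProgramExecName", "CarrierIDRead": "CarrierID"},
    "vendor_b": {"RecipeName": "ProcessProgramExecName", "CarrID": "CarrierID"},
}

def normalize(vendor, report):
    """Rename a tool report's keys to WKNs; keep unknown names but flag them."""
    mapping = VENDOR_TO_WKN.get(vendor, {})
    normalized, unmapped = {}, []
    for name, value in report.items():
        wkn = mapping.get(name)
        if wkn is None:
            unmapped.append(name)   # candidates for back-and-forth with the supplier
            normalized[name] = value
        else:
            normalized[wkn] = value
    return normalized, unmapped

if __name__ == "__main__":
    print(normalize("vendor_b", {"RecipeName": "ETCH_A1", "CarrID": "LOT42-C3"}))
```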

Conclusion
Fab engineering teams and ML-based algorithms require the delivery of high-quality, aligned data geared to the specific problem at hand. SEMI standards are providing guidance and a common language for the industry in the form of well-known names. Nevertheless, data preparation goes beyond standards; data must also be collected with the appropriate level of granularity.

“The single best practice I see is to have proper data matching and have that data either represented or aggregated to the needed physical analytical level,” observed McIntyre. “The more of this work that can be done in the background by intelligent systems, the more it will be used by engineers to problem-solve.”

References:

  1. https://www.semi.org/en/about/SEMI_Standards
  2. https://en.wikipedia.org/wiki/Big_data
  3. https://irds.ieee.org/images/files/pdf/2023/2023IRDS_FAC.pdf
  4. https://www.semi.org/en/products-services/standards


Related Reading
Integrating Digital Twins In Semiconductor Operations
The industry must collaborate to develop a common understanding of digital twin technology across various hierarchical levels.
IC Equipment Communication Standards Struggle As Data Volumes Grow
Timely engineering fixes rely on communications standards, but data inconsistencies are getting in the way.
Aftermarket Sensors Boost Yield In Wafer Fabs
More data improves throughput, helps extend the life of equipment.


