Infrastructure Impacts Data Analytics

Gathering manufacturing data is only part of the problem. Effectively managing that data is essential for analysis and the application of machine learning.


Semiconductor data analytics relies upon timely, error-free data from the manufacturing processes, but the IT infrastructure investment and engineering effort needed to deliver that data is enormous, expensive, and still growing.

The volume of data has ballooned at every point of generation as equipment makers add more sensors to their tools, and as monitors are embedded in the chips themselves. The resulting data needs to be cleaned or discarded, and often some of each, before its value can be understood.

In the case of advanced packaging, where there are multiple chips in a package, this is a complex decision with lasting repercussions.

“The biggest change is the management of data growth,” said DeukYong Yun, member of the Amkor Automation Team at Amkor. “We are seeing more data generated in the last 6 months than over the previous 15 years. The primary drivers are unit-level traceability and real-time equipment parameter data, as IC designs become more complex and require extensive detail levels of data for quality control.”

Most big data discussions focus on AI/machine learning (ML). But without the engineers building the data pipes, harmonizing the data between silos, and guaranteeing the data integrity, an ML application is worthless.

Data needs to travel from its source to an accessible database before an engineer can leverage data generated from semiconductor manufacturing steps. Throughout the supply chain there are networks and clusters of IT infrastructure that support management of data. It’s nearly impossible to even visualize a simple final test failure pareto without a robust data management system. Outdated infrastructure impedes prompt reactions to yield and quality issues. Leaks within the transfer of data can result in missing, misaligned, and inaccurate data. And such data integrity issues can interrupt the cadence of regular reports, while misdirecting an engineering team’s attention.
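The final test failure pareto mentioned above is a good illustration of how little analysis is possible without collected, aligned data. A minimal sketch in Python, assuming test results have already landed in an accessible store as (unit ID, failure bin) records (the field names and bin labels are illustrative, not a real MES schema):

```python
from collections import Counter

def failure_pareto(test_records):
    """Rank failure bins by count, highest first.

    test_records: iterable of (unit_id, bin_label) tuples,
    where bin_label identifies the failing test bin.
    """
    counts = Counter(bin_label for _, bin_label in test_records)
    return counts.most_common()

records = [("u1", "BIN7"), ("u2", "BIN7"), ("u3", "BIN12"),
           ("u4", "BIN7"), ("u5", "BIN12"), ("u6", "BIN3")]
print(failure_pareto(records))  # [('BIN7', 3), ('BIN12', 2), ('BIN3', 1)]
```

The computation itself is trivial; the hard part, as the article argues, is the pipeline that delivers complete, correctly attributed records to it.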

“Many companies today still regard data management as needing to be siloed for the efficiencies of the first line users of data,” said Mike McIntyre, director of software product management at Onto Innovation. “Many companies continue to look at data and data retention as a cost to be minimized. As a result, they place data in the most inexpensive repositories and data hierarchies possible.”

This style of data management impacts the ability to connect data between manufacturing steps. As a result, there is a push to connect data silos in a centralized structure.

“There exists a strong drive toward centralized data management and high data quality,” said Paul Simon, director of analytics for Silicon Lifecycle Management at Synopsys. “We see not only higher data volumes. We see more data types and broader data collection.”

The IT infrastructure necessary to support the centralization of data has hardware and software components. The hardware stores and processes data and moves it from the generation points to the point of engineering access. The software is used to monitor, manage, and secure that data.

Building that infrastructure has a visible cost. Maintaining it is an invisible cost that is all too often ignored.

“Every system you build, every piece of code you develop, and all the people who start to use it, creates technical debt,” said Michael Schuldenfrei, fellow at National Instruments. “And there is a tremendous underestimation of the amount of effort it takes over time to build or maintain these systems.”

Big data management design
With all this drive toward centralization, semiconductor companies need to actively attend to their data management and the supporting infrastructure.

“All aspects of data management – data generation, data ingestion, data storage and data consumption – must be considered during the design phase, and must be domain-driven,” emphasized Rao Desineni, director of analytics for manufacturing and operations at Intel.

Domain-driven knowledge impacts much more than the database storage framework. It needs to be considered when selecting the hardware solution for storing the data in a manner that facilitates the subsequent data requests and analysis to support the engineering objective.

“What kind of database schema are you going to use to describe that data?” asked Schuldenfrei. “How are you going to build the processes that load that data? What is the performance you need from that data store in order to solve the use cases that you want to solve? And are you going to be able to do that without replicating all of your data all the time?”
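Schuldenfrei's schema question can be made concrete with a small sketch. The table and column names below are hypothetical, and SQLite stands in for whatever database a site actually runs; the point is that the schema and its indexes should be designed around the queries the domain experts will issue:

```python
import sqlite3

# In-memory database stands in for the production store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_result (
        lot_id     TEXT NOT NULL,
        wafer_id   TEXT NOT NULL,
        unit_id    TEXT NOT NULL,
        param_name TEXT NOT NULL,
        value      REAL,
        tested_at  TEXT NOT NULL
    );
    -- Index chosen for an assumed dominant query pattern:
    -- fetch one parameter across a lot for trend charts.
    CREATE INDEX idx_param_lot ON test_result (param_name, lot_id);
""")
conn.execute("INSERT INTO test_result VALUES "
             "('L1', 'W1', 'U1', 'vdd_leak', 0.12, '2021-01-01')")
rows = conn.execute(
    "SELECT value FROM test_result "
    "WHERE param_name = 'vdd_leak' AND lot_id = 'L1'"
).fetchall()
print(rows)  # [(0.12,)]
```

A different dominant query, say per-unit genealogy for traceability, would call for a different index and possibly a different schema, which is exactly why domain knowledge has to enter at design time.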

With some facilities generating petabytes of data on a daily basis, that last question is particularly pertinent.

“More and more data than ever are being generated and retained,” said Onto’s McIntyre. “Case in point: Fault detection and classification (FDC) and IoT data had nearly zero presence in yield analytics 15 years ago, yet today these data feeds are considered must-have sources for line control and overall problem solving. Retaining terabytes of data in an organized and structured environment was considered excessive 15 years ago, whereas today it is an absolute requirement.”

The ability to store and manage this amount of data requires an infrastructure of storage, computers, routers, and networks. There may be storage local to the equipment cell/station. The infrastructure ports the data from an equipment’s point of generation to a data storage system, then moves the data to a centralized data storage in the factory/facility, such as a manufacturing execution system (MES). This infrastructure, in turn, manages the data quality, integrity, and security.

That’s the general idea. But exponential growth in data has caused factories to reconsider storage options to balance immediate needs against a long shelf life. The latter supports the archival requirements of mission-critical ICs, and it supports engineers and data scientists looking for trends over longer time periods (months and years).

“Storing large volumes of data presents both technical challenges and cost overhead,” said Greg Prewitt, director of Exensio Solutions at PDF Solutions. “We are migrating customers to distributed and multi-tier backend data stores to meet their data retention objectives.”
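A multi-tier retention scheme of the kind Prewitt describes can be sketched as a simple age-based classification. The tier names and thresholds below are assumptions for illustration, not PDF Solutions' actual policy:

```python
from datetime import datetime, timedelta

# Illustrative three-tier retention policy; real thresholds vary by site.
TIERS = [("hot", timedelta(days=30)),    # SSD: immediate engineering queries
         ("warm", timedelta(days=365)),  # spinning disk: recent trend work
         ("cold", None)]                 # tape/object archive: long-term retention

def storage_tier(created_at, now):
    """Return the tier a record belongs in, based on its age."""
    age = now - created_at
    for tier, limit in TIERS:
        if limit is None or age <= limit:
            return tier

now = datetime(2021, 6, 1)
print(storage_tier(datetime(2021, 5, 20), now))  # hot
print(storage_tier(datetime(2020, 9, 1), now))   # warm
print(storage_tier(datetime(2018, 1, 1), now))   # cold
```

In practice the migration between tiers is what the backend data store automates; the policy itself stays this simple only if someone has decided, up front, which analyses need which data at which latency.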

This is not dissimilar to the way many enterprise data centers have operated in the past. In some cases, old data is still stored on tape, while more recent data is stored on spinning media. Data that needs immediate access usually is stored on solid state drives with high-speed optical interconnects. What’s changing is not so much the basic buckets as the volume of data that needs to be sorted, and the speed with which that needs to occur.

Factory IT departments do recognize this shift to managing more data because they need it for improved operations. “For example, machine logs and parameters are essential elements for data processing and analysis,” said Amkor’s Yun. “Our recent efforts have involved building the scalable data management platform and infrastructure to keep up with the speed of data growth. In the past, typical infrastructure refresh/extension used to be a four-to-five year cycle. Now it should be revisited every year.”

Storing data has become less expensive, but having more data to store brings on other challenges.

“In the last 10 to 15 years, raw storage capacity is no longer considered a cost barrier,” said Doug Suerich, product evangelist at PEER Group. “The limiting factor now is we have too much data, and it’s the ability to actually process it intelligently. It’s just gargantuan volumes of data because engineers hope that machine learning can help them chew through it and then find those needles in the hayfield.”

With the pressures of data volume and processing, semiconductor companies have begun to move their data management to cloud-supported technologies.

Who’s in charge?
Managing that infrastructure poses a dilemma for both small and large companies. It’s now a requirement, but it’s also not as easy as it sounds.

“There is an underestimation on how complicated it is,” said NI’s Schuldenfrei. “The complication arises from multiple factors. First, truly big data is difficult to manage. Second, it takes a lot of expertise and understanding of the data to really structure it properly, in whatever database you’ve chosen. In fact, even making the right technology decisions requires a lot of knowledge and understanding.”

There is general agreement on that last point. “There is the infrastructure team responsible for setting up the storage solution, and database admins who must continuously optimize the storage stack,” said Intel’s Desineni. “In setting up the data for consumption, both these players often do not have the necessary domain expertise, meaning they depend heavily on the domain experts (the test expert or the product engineers) to define detailed requirements for them, which may or may not happen effectively.”

Missing data stops analysis, but wrong data creates bigger problems. Hence the need for data quality policies, codified in software, to assure complete and accurate data.

“A data management system that assumes perfection will result in bad data. Clean data is so important for engineers to base their yield responses on. Yet it requires recognition that things go wrong,” said John O’Donnell, CEO of yieldHUB. “For instance, network issues can cause an incomplete data load. Simply comparing the amount of data generated by the tester to the value uploaded to MES can detect such an occurrence.”
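The count-comparison check O’Donnell describes is straightforward to codify. A minimal sketch, with made-up lot IDs and record counts:

```python
def load_is_complete(tester_record_count, mes_record_count):
    """Flag incomplete uploads by comparing record counts.

    A mismatch suggests a dropped network transfer; the lot's
    data should be re-pulled before analysis, not silently used.
    """
    return tester_record_count == mes_record_count

# Record counts per lot from the tester and from MES (illustrative values).
tester_counts = {"LOT-001": 2500, "LOT-002": 2500}
mes_counts = {"LOT-001": 2500, "LOT-002": 2311}  # partial load

incomplete = [lot for lot in tester_counts
              if not load_is_complete(tester_counts[lot], mes_counts.get(lot, 0))]
print(incomplete)  # ['LOT-002']
```

The check is cheap precisely because it assumes imperfection: it runs on every load, and its output feeds a re-pull rather than a report.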

Maintainability, security, economy
Investing in the IT infrastructure involves building and maintaining a system. It also requires IT professionals, process engineers and product engineers to jointly design a secure and cost-effective system.

“Not only are these critical infrastructures under-invested at the start of a factory’s life, but many companies still treat IT infrastructure like an investment where once purchased, it is then made static and operated until it fails,” observed Onto’s McIntyre. “Advanced manufacturing, with its ever more complicated supply chain, demands that these IT systems be part of a continuous investment and renewal strategy.”

So how many people does it actually take to maintain these systems? The answer is, more than it takes to build them.

“Suppose you need 10 people to build the system,” said Schuldenfrei. “After 5 years, you’re going to need more than 10 people to maintain it. The reason is that you continue to maintain your legacy code, which is getting old and stale. When no one knows how to deal with it, it becomes more expensive to maintain. At the same time, you evolve your next generation, because the technology on which you built your system is going obsolete. You’re constantly increasing your investment. It never goes down.”

Part of this investment is just moving the data. More data is harder to move, both in raw dollars and in resources. A petabyte of data involves electronic transmission via cables, optical channels, wireless, and possibly satellites. Only so many bits can travel over a channel per second.
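A back-of-envelope calculation shows why channel capacity matters at this scale. The link speed and efficiency figures below are assumptions for illustration:

```python
PETABYTE_BITS = 8 * 10**15   # 1 PB = 10^15 bytes = 8 * 10^15 bits
LINK_BPS = 10 * 10**9        # a dedicated 10 Gb/s link (assumed)
EFFICIENCY = 0.8             # protocol/overhead factor (assumed)

seconds = PETABYTE_BITS / (LINK_BPS * EFFICIENCY)
print(f"{seconds / 3600:.0f} hours")  # 278 hours, i.e. over 11 days
```

So a site producing a petabyte a day cannot simply ship raw data offsite over a 10 Gb/s line; it must either filter and summarize at the edge or provision far fatter pipes, which is exactly the infrastructure cost the article describes.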

“The challenge just grows with the data,” said PEER Group’s Suerich. “It’s not as simple as buying a faster Internet connection. It’s everything that goes with it, like the straight up capacity of your lines in the factory in various countries. Not every country, particularly where manufacturing is going on, is equally well-connected. Then the security story gets harder when you get these massive volumes of data, because just scanning that amount of data for IP leakage or viruses takes time.”

In the past, many companies would not even consider moving their data to the cloud. But the cloud has emerged as a good option over the past few years because data hosting companies recognized that security would be a critical selling point. Data analytics companies often educate their customers on their security capabilities, which in many cases exceed on-premises security.

“Data is more secure on the cloud versus on the premises,” said yieldHUB’s O’Donnell. “In addition, the massive scalability of databases in the cloud comes at lower cost than an on-premises system.”

Others also have noted the cost impact of on-premises data management. “This quickly becomes really inefficient and very costly,” said Synopsys’ Simon. “Companies look at the cost of data analytics software, yet they underestimate the cost of IT infrastructure. There’s a cost of IT infrastructure plus database plus the management of that database locally. You multiply that by 20 sites and this becomes really costly, even for large companies, compared to a centralized solution.”

Using cloud technology for centralization does not mean that a factory gives up its local data, however.

While people speak of Big Data and machine learning in one breath, such analyses of manufacturing data are still in their infancy. Given the perceived lack of IT infrastructure investment, companies generating the data will need to build up their investment prior to investing in ML.

“Data is the new oil is a colloquially used idiom lately, often to encourage adoption of AI analytics,” said Intel’s Desineni. “The argument should rather be to treat data as an asset first. Massage it, pamper it, store it in beautiful barrels. Splurge on it. AI analytics can wait.”

Engineers effectively utilize the available data to solve yield and quality issues, and to optimize manufacturing and test steps. Yet even today, simple statistical correlations between two manufacturing steps or between two parameters can be inaccurate due to not having complete and clean data. Resolving that issue requires a greater investment in the whole data management system.
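The correlation problem can be seen in a small sketch: correlating two manufacturing steps is only meaningful after the readings are aligned on the units measured at both steps. The unit IDs and values below are made up for illustration:

```python
# Parametric readings keyed by unit ID at two steps (illustrative data).
step_a = {"U1": 1.0, "U2": 1.2, "U3": 1.4, "U4": 1.6}
step_b = {"U1": 2.1, "U2": 2.4, "U4": 3.1, "U5": 9.9}  # U3 missing, U5 extra

# Align on units measured at both steps; correlating misaligned
# or incomplete columns silently biases the statistic.
common = sorted(step_a.keys() & step_b.keys())
xs = [step_a[u] for u in common]
ys = [step_b[u] for u in common]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(common)  # ['U1', 'U2', 'U4']
print(round(pearson(xs, ys), 3))
```

The join is the part that depends on the data management system: unit-level traceability has to survive every hop from tester to database, or U3's reading is lost and U5's is unmatchable, and the statistic quietly degrades.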

“Solutions all start with proper and efficient data organization,” said Onto’s McIntyre. “The recognition that data storage costs are minimal when compared to the costs of slower yield ramps, excursions, and redirection of human resources to problem solving, will go far in covering the capital investment needed for data retention. There also needs to be a recognition that data retention without understanding of its content and use is ineffective.”

Fab and test engineering teams know the data they want. IT departments can successfully partner with them — as long as upper management invests in data infrastructure.


Andris Ogrins, PwC Strategy&, says:

A great article incorporating the views of multiple experts in the field.

The challenges of data management in semicon are definitely covered. However, I feel that several aspects are not touched on sufficiently, aspects which are key to achieving breakthroughs in data-centricity in a field that is already world-leading when it comes to automation.


1. The key reason why data management in semicon is so difficult (and difficulty “scales” exponentially) is because business logic and data models are hard-coded in software as opposed to explicitly described, e.g. as knowledge graphs.

2. Overall vision of cross-boundary data centricity is rarely described and maintained, as are ROI estimates for investing into data management. Without quantifying possible gains (and tracking them!), management finds it difficult to allocate investments.

3. Also management-related, breakthroughs in data management need to be linked with a vision for changing the way semicon manufacturing and R&D work. Additional data will be useless if actual business processes are not changed concurrently.

All three of these are hard to change, but can give tremendous advantage to those who do. Examples of Facebook, Amazon, Tesla are quite telling.

Anne Meixner says:

Thanks for the compliment and I appreciate your detailed observations.
