Making The Most Of Data Lakes

Why data organization and a well-designed data architecture are critical to effectively using manufacturing and design data.


Having all the semiconductor data available is increasingly necessary for improving manufacturability, yield, and ultimately the reliability of end devices. But without sufficient knowledge of relationships between data from different processes and computationally efficient data structures, the value of any data is significantly reduced.

In the semiconductor industry, reducing waste, decreasing cost, and increasing product quality continues to drive chipmakers to improve equipment, processes, and defect screens. Data from across the supply chain, and particularly from various manufacturing processes, can assist engineers in understanding what needs to be improved and exactly how to do it.

This has always been the promise of data optimization and analysis, and over the past two decades the prospect of connecting disparate data has been evolving. There is a well-recognized benefit in connected data, but there are also a couple of challenges that need to be addressed to transform it into a usable format — including organizationally adequate documentation of data relationships and computationally optimized data structures to minimize data movement.

The value of reducing data movement is probably the better understood of the two, and one being dealt with by everything from hyperscaling data centers to autonomous vehicles. There is simply too much data being generated during various manufacturing steps, and in the field, to try to move all of it to a single location. Data must be prioritized up front so it can be partitioned according to however and wherever that data will be utilized. That requires an understanding of how that data will be used, ingested, stored, and consumed. This is made more difficult by the fact that no one person has the expertise to design all of these pieces.

Joe Pong, senior director at Amkor, noted that to be successful requires input from three different sources:

  1. End users (typically engineers), the consumers of the project, who provide the company's use cases and guidance.
  2. Data scientists, who will help with devising and exploring the treatment of data through sophisticated statistical graphs, and
  3. IT experts, who design and implement data pipelines and computing architectures to ensure performance of the project is acceptable and the retention of data is sufficient.

Others point to the need for two additional types of expertise. One is a domain expert, who understands and conveys the domain context that needs to be represented in the data. The second is a knowledge engineer, who knows how to make that context machine-readable so that all relevant available sources can be integrated clearly without ambiguities.

Before any analysis can be performed, the relationships between data sources need to be captured in a manner that permits both automation and exploration. That requires an expert who documents the relationships between the various data sources to support subsequent analysis.

“All this data comes in different layers, and you have to find the relationship between them,” said Andrzej Strojwas, CTO at PDF Solutions. “For instance, I may find that five in-situ data sources on a specific fab equipment will be very relevant for a particular inline measurement. Then, this inline measurement will have an impact on a particular IC’s performance. In the semantic model, I want to make sure that at each of those layers we have a representation of the data that will allow me to build the relationships between the various data sets.”
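
The layered relationships Strojwas describes can be pictured as a small directed graph. The sketch below, with invented sensor and measurement names, shows how documented links let an analysis walk from a device-performance metric back to every relevant upstream data source:

```python
# Toy sketch of a semantic model linking data layers, as described above.
# All dataset names are invented for illustration.

# Each entry records that a source dataset is relevant to datasets one
# layer up: in-situ sensor -> inline measurement -> device performance.
relationships = {
    "insitu/etch_chamber_A/pressure": ["inline/cd_measurement"],
    "insitu/etch_chamber_A/rf_power": ["inline/cd_measurement"],
    "inline/cd_measurement": ["performance/ring_osc_freq"],
}

def upstream_sources(target, rels):
    """Walk the relationship graph backward to find every dataset
    that may influence the given target."""
    sources = set()
    for src, targets in rels.items():
        if target in targets:
            sources.add(src)
            sources |= upstream_sources(src, rels)
    return sources

print(sorted(upstream_sources("performance/ring_osc_freq", relationships)))
```

With the relationships documented up front, only these three datasets need to be pulled for the analysis, rather than everything in the lake.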

But with data generated and stored at each point in the data chain, combining it into one massive database is neither practical nor necessary. The familiar relational database architecture chokes on the huge volumes of data that engineering analyses must process. Thus, it's paramount that engineering teams know the relationships between various data types in order to extract maximum value with minimal storage and computing cost.

For data analytics platforms to effectively use all this data, the engineering team needs to comprehend the use cases that drive which data is being pulled to perform the analysis. This, in turn, determines the algorithms used and the associated data structures needed to maximize the computing performance on the chosen hardware.

Integration requires a data lake
Data management and data analysis necessitate understanding the data storage and data compute options to design an optimal solution. This is made more difficult by the sheer volume of data generated by the design and manufacturing of semiconductor devices. There are more sensors being added into equipment, more complex heterogeneous chip architectures, and increased demands for reliability — which in turn increase the amount of simulation, inspection, metrology, and test data being generated.

Connecting different data sources is extremely valuable. It allows feed-forward decisions on manufacturing processes (package type, skipping burn-in), and feedback in order to trace causes of excursions (yield, quality, and customer returns).
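
A feed-forward decision of the kind mentioned above can be as simple as a rule keyed to upstream data. This is a hedged sketch, with an invented threshold and field names, of using wafer-sort results to decide whether a unit can skip burn-in:

```python
# Illustrative feed-forward rule: use upstream (wafer-sort) data to decide
# downstream processing. Threshold and record fields are invented.

def burn_in_decision(unit, yield_threshold=0.95):
    """Skip burn-in only for units from high-yielding wafers with no
    marginal bin hits recorded at wafer sort."""
    if unit["wafer_yield"] >= yield_threshold and not unit["marginal_bins"]:
        return "skip_burn_in"
    return "full_burn_in"

print(burn_in_decision({"wafer_yield": 0.98, "marginal_bins": []}))
print(burn_in_decision({"wafer_yield": 0.91, "marginal_bins": [7, 12]}))
```

The rule only works if the wafer-sort data is connected to the unit in the first place, which is exactly the data-relationship problem this article describes.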

Fig. 1: Multiple data sources for semiconductors. Source: PDF Solutions


“An understanding of the semiconductor manufacturing process and relationships throughout are essential for some applications,” said Jeff David, vice president of AI solutions at PDF Solutions. “For example, how can I use wafer equipment history and tool sensor data to predict the failure propensity of a chip at final test?  How does time delay between process and test steps determine what data is useful in finding a root cause of a failure mode? What failure modes are predictable with which datasets? How do preceding process steps affect the data collected at a given process step?”

This is complicated by the fact that trade-offs need to be understood in the context of the engineering goals. “Domain expertise and experience with datasets, either individually or as a collective team, will drive architecture needs,” David said. “This spills over into defining the requirements properly, which will also drive the architecture decision. For example, how much data do I need in order to train a model effectively? Do I need to partition the data across test programs, or is partitioning across chip product sufficient to meet my needs? And how quickly do I need to make a prediction to make it useful — 1 hour, 1 minute, or 10 milliseconds?”

One of the key metrics with data models and computing is time to prediction. “We’re exploring the different options for using collected data,” said Eli Roth, smart manufacturing product manager at Teradyne. “For instance, how can you execute something in real-time to make a real-time decision? For some cases it involves basic math, and in others it involves CNN algorithms. But for other options, we also are looking at unsupervised algorithms that maybe are running in a data lake. We are looking at big data for trends that we wouldn’t spot by limiting the data scope.”

The expansion of the data chain into in-field data is being enabled by internal circuitry measurements, which improve the engineer’s understanding of system performance.

“From my perspective, we are too early to say that data should be stored in a particular way,” said Gajinder Panesar, fellow at Siemens EDA. “My preference is to have it in a general format. For what we’re looking at, it’s basically coming down to time-series data. And as we learn more about how to use this across the whole spectrum of data — fab to in-field — we refine things. When we get to that point, then we’ll be able to add semantics to guide analysis. ‘With this data, when this set of rules is applied to it, you get this.’ Or, ‘You treat it as this metadata, and that means something else in the context that you’re using it for.'”

But to effectively leverage all the data, it also needs to be accessible. Moreover, it requires a flexible organization that simultaneously supports established relationships between data sources and permits exploring new ones. Hence the move away from the traditional data warehouse toward the trend of storing all data in its native form in a data lake.

Think of a data lake with multiple data streams feeding it. Storing in native format does not negate the need for clean and reliable data. It also doesn’t obviate the need to store the associated schema and metadata. A data lake includes structured, semi-structured, and unstructured data.

“All in place” can take different forms, as well:

  • A single data lake;
  • A data lake with multiple data ponds;
  • Multiple, decentralized data lakes, and
  • A virtual data lake to reduce data movement.

A data lake approach facilitates integrating data from different data sources because it’s a schema-on-read data framework. Extracting only the data needed for an analysis results in faster computation. In addition, it doesn’t lock engineers and data scientists into only looking at certain combinations. That’s especially useful because exploring what distinguishes a set of failing dies from passing dies often requires a large dataset consisting of multiple parameters.
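
Schema-on-read can be sketched in a few lines: records sit in the lake in their native form, and a schema is applied only at extraction time, pulling just the fields an analysis needs. The record contents and field names below are invented for illustration:

```python
import json

# Minimal schema-on-read sketch: raw records stay in native (JSON) form;
# structure is imposed only when data is read for an analysis.
# Field names and values are invented for illustration.

lake = [
    '{"die_id": "W1-3", "bin": 1, "vdd_min": 0.71, "temp_c": 25}',
    '{"die_id": "W1-4", "bin": 7, "vdd_min": 0.86}',               # no temp recorded
    '{"die_id": "W1-5", "bin": 1, "vdd_min": 0.73, "temp_c": 25}',
]

def read_with_schema(raw_records, fields):
    """Parse native-format records and keep only the fields the current
    analysis needs; fields absent from a record become None."""
    for rec in raw_records:
        parsed = json.loads(rec)
        yield {f: parsed.get(f) for f in fields}

# A pass/fail study pulls only bin and vdd_min; nothing else is extracted.
subset = list(read_with_schema(lake, ["die_id", "bin", "vdd_min"]))
print(subset[0])
```

The same raw records can later be read with a different schema for a different question, which is the flexibility that distinguishes a lake from a schema-on-write warehouse.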

Capturing data relationships
The appeal of a data lake is schema-on-read, which makes knowing the possible relationships essential. Otherwise, you end up with a data swamp. Multiple industry experts noted that saving data without understanding the relationships, context, and metadata is ill-advised. Basically, engineering teams need to document their data model before any extraction or computation. Thus the need for a knowledge engineer and a domain expert working on each team.

“It’s not just sufficient to store all this data in a data lake. A semantic model for combining this data is required,” said Kimon Michaels, executive vice president of products and solutions at PDF Solutions, at a recent workshop. “We need to understand the spatial, time, and hierarchical relationship across the data, such that the advanced analytics can be applied to different combinations of this data, and turn that disparate data into actionable information.”

There is growing consensus on this point. “Without some kind of context, data is worthless,” said Alan Morrison, an independent data technologies consultant. “The reason data is not frequently shared is that the context is sparse and not designed to live and evolve organically. How many times have you looked at someone else’s spreadsheet and couldn’t figure out what the spreadsheet was about? At a minimum, the rows and columns were under-described. In general, the less perishable data needs to be, the better it needs to be described. The description and predicate logic need to include set theory and graph theory.”

In other words, a data lake without a map is not very helpful. “Throwing your hook into a data lake without knowing these relationships is like saying, I’m going to throw my hook in the ocean and try to catch a fish,” said Mike McIntyre, director of software product management at Onto Innovation. “If I know where the fish breed, their migratory path and their feeding grounds, my odds of catching a fish go up tremendously.”

Context is captured at the point of generation and at the points of interoperability.

“The question is where to find the data,” said Michael Ford, senior director of emerging industry strategy at Aegis Software. “We don’t want to have people searching in a haystack. When you have people running around looking in different systems and databases, it could take hours or even days to find things. In our approach, we automate that. We build contextualization as we put things together. The data has already been enriched through this contextualization. There is a defined ontology within the organization so that we understand exactly where things are, how things are related.”

Fig. 2: SLM analytics platform showing data layers. Source: Synopsys


So it’s not just about the data. It requires a data architecture to make that data useful. “We have one database that we maintain, develop, and support. Then there’s a semantic layer above that, between data ingestion and the database that we can configure,” said Paul Simon, group director of silicon lifecycle analytics at Synopsys. “The semantic layer is a document that describes how all kinds of data get mapped into a data structure of the data model in this database. We can keep the database relatively fixed, so we only have to change semantics for data ingestion to load only that specific data.”
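
One way to picture the configurable semantic layer Simon describes is as a mapping document that translates each source's field names into the fixed database model, so only the mapping changes per source. This is an illustrative sketch; the source names, fields, and mappings are invented:

```python
# Sketch of a configurable semantic layer: a per-source mapping document
# translates raw field names into the fixed data model, so the database
# schema stays stable while ingestion mappings vary. Names are invented.

semantic_layer = {
    "tester_A": {"DUT_ID": "die_id", "HB": "hard_bin", "SB": "soft_bin"},
    "tester_B": {"device": "die_id", "bin_hard": "hard_bin", "bin_soft": "soft_bin"},
}

def ingest(source, record):
    """Map a raw record from a given source into the fixed data model,
    dropping any fields the semantic layer does not describe."""
    mapping = semantic_layer[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

row_a = ingest("tester_A", {"DUT_ID": "W1-3", "HB": 1, "SB": 101})
row_b = ingest("tester_B", {"device": "W2-9", "bin_hard": 7, "bin_soft": 712})
print(row_a)
```

Both rows land in the same model despite different source vocabularies, which is what lets the database stay "relatively fixed" while only the ingestion semantics change.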

There are myriad ways data relationships can be stored in a semantic layer. A few semantic technologies exist that help machines understand the data.

“Knowledge graphs can be the most effective and scalable means of constructing and sharing meaning by creating interacting contexts,” said Morrison. “Applied semantics on its own could be any method that adds machine readable meaning to data. The challenge is scaling meaningful data, as well as using just enough semantics to achieve your objectives.”

Fig. 3: Underneath a deep learning algorithm is a knowledge graph and a data lake. Source: A. Meixner/Semiconductor Engineering


The more you know upfront about the use case, the better you can organize your data.

“Our platform imports data obtained by embedded agents, strategically placed during design, and provides advanced analytics in the cloud or at the edge. The agents monitor performance margins, application stress, material distribution, supply voltage, and additional parameters. By distributing agents at extremely high coverage and having ML data structures that understand and comprehend the map, the algorithms can provide a detailed understanding of performance and reliability, with cross-stage correlation, precise RCA and valuable predictions,” said Marc Hutner, senior director of product marketing at proteanTecs. “That’s why it’s difficult to perform the analysis without designing the data structures purposefully. It’s difficult to back annotate measurements after the fact.”

Data structures and computing
Once data is pulled from a data lake, it needs to be stored in a data structure for the computation. As computing hardware options become more tailored to big-data applications, there are questions about what the optimal data structures are. A data structure defines data organization at a much lower granularity than a data schema, and it directly impacts the performance of the computer hardware. Effectively defining it requires a shift from the users who define the use cases to the data scientists and data engineers who determine the optimal data structure within a database.

“Individuals in manufacturing who understand equipment and facilities are not usually the same people who understand the larger data structures, or more importantly how that information is used in the pursuit of knowledge about the factory’s overall behavior,” said Onto’s McIntyre. “This gap is filled by others well versed in operations and methods. Also, just because an individual knows about automation and data structures does not necessarily imply they have sufficient knowledge or expertise in the analytics that must be attached to these data structures for them to be effective.”

Fig. 4: Levels of data organization. Source A. Meixner/Semiconductor Engineering


Data structure defines the data organization, optional index, and the algorithms that dictate the basic operations — store, access, and update. In The Periodic Table of Data Structures, the authors point out, “…a data analytics pipeline is effectively a complex collection of such algorithms that deals with the increasingly growing amounts of data. Overall, data structures are one of the most fundamental components of computer science and broadly data science. They are the core of all subfields, including data analytics and the tools and infrastructure needed to perform analytics ranging from data systems (relational, NoSQL, graph), compilers, and networks, to storage for machine learning models and practically any ad-hoc program that deals with data.”

Data structures can be optimized for read, write, or space. Commonly used data structures include arrays, lists, sorted lists, binary trees, multi-way trees, and hash tables. For read-intensive workloads, data scientists often choose binary trees. For write-intensive workloads they favor LSM-trees. And while a hash table may be sufficient for tracing a product’s equipment genealogy, a different data structure might be better suited to training a deep learning network.
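
The read/write trade-off can be shown with two minimal structures: a sorted array gives fast binary-search reads but costly inserts, while an append-only log (the idea behind LSM-trees) makes writes cheap and defers the ordering work to a later compaction step. A hedged sketch:

```python
import bisect

# Read-optimized vs. write-optimized structures, in miniature.

# Read-optimized: keep keys sorted so lookups are O(log n) binary searches.
sorted_keys = [3, 8, 15, 42, 99]

def contains(keys, k):
    """Binary-search a sorted list for key k."""
    i = bisect.bisect_left(keys, k)
    return i < len(keys) and keys[i] == k

# Write-optimized: append blindly (O(1) per write), then sort once
# ("compact") before a read-heavy phase, amortizing the ordering cost
# across many writes -- the core intuition behind LSM-trees.
log = []
for k in (42, 3, 99, 8, 15):
    log.append(k)
log.sort()  # one-time compaction

print(contains(sorted_keys, 42), contains(sorted_keys, 7))
```

Which side of the trade-off to favor depends on the workload, which is why the use case has to be understood before the structure is chosen.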

Larger data sets have spurred engineering teams to rethink where that data gets processed, and whether some of that processing is moved into memory. The reason is that computation speed has increased faster than the speed at which data can be moved. Optimizing performance requires reducing latency caused by moving large blocks of data as well as assessment of the algorithm’s impact on data movement at the data structure level.

As Hutner noted, “Data organization is a critical aspect to algorithms working efficiently and effectively. Designing the algorithms with the types of problems we want to solve in mind needs to come before the data structure.”

“The likelihood that a single off-the-shelf data structure fits such arbitrary and sometimes unpredictable data access patterns is close to zero, yet data movement is the primary bottleneck as data grows and we do need more data to extract value out of data analytics,” write the authors in The Periodic Table of Data Structures.

Data experts agree with that observation. “Amkor tried a few on-prem big data solutions/platforms, such as Hadoop, Cassandra, and MongoDB, but these were not able to meet data scientists’ expectations for speed and scalability,” said Pong. “So we adopted a cloud-based big data solution that entails cloud-native data collection and APIs to enable real-time data processing and scalability without being delayed by infrastructure limitations.”

Others report similar optimization paths. “Our domain experts work very closely with the data architects to make sure our data models are designed in a way that we actually achieve the best performance for particular use cases,” said Synopsys’ Simon. “If the data models and the semantics are not correct, then the performance is slow. The value of a good analytic system lies not in the UI, but the semantics to support aligning the data and mapping to a data model that performs correctly.”

Vast amounts of data are available at just about every point in semiconductor design, manufacturing, and in-field use. Leveraging all this data continues to spur vendors to provide solutions to their customers. For companies to use data solutions effectively, those solutions need to support both everyday engineering oversight with established reports, and the detective work of uncovering insights in this multi-parameter space.

“Rather than just trying to weed out the bad devices if the problem was big enough, you would get a bunch of engineers to figure it out,” said Ira Leventhal, vice president of applied research and technology at Advantest America. “But we don’t have the luxury of being able to wait on that anymore. It’s really important to get on these issues right from the start. And this is where we see a lot of potential in terms of having all this data up in the cloud within the ACS infrastructure and being able to crank on that and find correlations and relationships in the data that humans wouldn’t be able to pick out. Things are just too complex for the typical path of a human looking over the data and saying, ‘Yeah, that’s the problem.’ The problems are often multifaceted, and very complex in nature.”

Executing effectively with a data lake requires documenting the data relationships so the data can be fully leveraged, and designing efficient data structures that meet the analysis and time-to-result objectives.

Related Stories
Infrastructure Impacts Data Analytics
Gathering manufacturing data is only part of the problem. Effectively managing that data is essential for analysis and the application of machine learning.

Too Much Fab And Test Data, Low Utilization
For now, growth of data collected has outstripped engineers’ ability to analyze it all.

Changing Server Architectures In The Data Center
Sharing resources can significantly improve utilization and lower costs, but it’s not a simple shift.

Sweeping Changes Ahead For Systems Design
Demand for faster processing with increasingly diverse applications is prompting very different compute models.

Shifting Toward Data-Driven Chip Architectures
Rethinking how to improve performance and lower power in semiconductors.

Will In-Memory Processing Work?
Changes that sidestep von Neumann architecture could be key to low-power ML hardware.

The Periodic Table of Data Structures
