Preventing Failures Before They Occur

Combining data can help to predict failures rather than just replacing equipment after it fails.


A decade or so ago, when MEMS sensors were in the limelight, one of the touted applications was to install them on industrial or other equipment to get an advance warning if the equipment was approaching failure. Today, in-circuit monitoring brings the same promise.

Are these competing technologies? Or can they be made to work together?

“Almost all advanced tool manufacturing companies rely on sensors, including MEMS sensors and circuit sensors, within their tools to perform monitoring and provide signals when a tool approaches a maintenance interval,” said Mike McIntyre, director of software product management at Onto Innovation.

The key to both approaches is software. Analytics packages help to transform the sensor or monitor data into operational information that can be used to optimize uptime. But systems integrators will be responsible for unifying much of that software.

Historically, it’s been treated as an unfortunate fact of life that at some point, a home appliance, an industrial pump, an automobile — pretty much any mechanical system — will go down with little or no warning. If one were to know ahead of time that some component was about to fail, replacement of that component could take place in an orderly fashion at a relatively convenient time.

With consumer items, unexpected failures are typically an annoyance that injects unplanned activity into an otherwise routine day. But for industrial equipment, it can mean unexpected downtime for production lines, which carries a significant cost. And with safety-critical equipment like automobiles, if safety mitigations aren’t adequate, there is a risk of injury or death.

Julia Fichte, manager, global application marketing, power and sensor systems division at Infineon Technologies, provided an example of the kind of industry that can suffer from this. “Facility managers and building owners who have to deal with the HVAC equipment are actively asking for a solution,” she said. “Those users seek, first of all, to maximize the equipment uptime, but also to optimize the performance of the equipment. They also want to make sure the equipment lives much longer than it’s made to do now. They want to save both maintenance and operation costs.”

Watching as failure approaches
This motivates the notion of both predictive and preventive maintenance. The idea is that sensor data generated during the operation of the equipment will provide clues as to whether a breakdown is imminent.

There are two broad categories of items to be monitored. “The first category is components that will fail through normal wear and tear,” said McIntyre. “These components are expected to last, day in and day out. They are usually maintained on a timed cycle of maintenance. The second category is consumables. These components are expected to have a defined life in the equipment.”

Predictive and preventive maintenance can complement scheduled maintenance, but they also can replace it. If equipment shows no sign of needing replacement at its scheduled time, why waste a good unit when it can be replaced at some future date when it truly is worn out?

This is evident in chip manufacturing, which has long been generating data from both internal and external monitors. “When it comes to mixing old and new equipment in the fab, there are standard preventive maintenance schedules, which assume that all equipment is used the same way,” said David Park, vice president of marketing at PDF Solutions. “It provides a general rule of thumb. But it’s like comparing a grandmother who drives her car once a week to church on Sundays, and someone who drives like a ricky racer, starting and stopping 100 miles every day on the freeway. The averages will work out, but the individual cases will cause you problems. The grandmother changes her oil every 7,500 miles, while the ricky racer is going to blow a head gasket sooner rather than later. So one of the reasons for collecting all this data and having all these sensors is to do predictive maintenance as well as preventive maintenance.”

This is one of the big shifts underway in how data is used. “Based on the collected data from sensors, smart algorithms can predict the type and time of the failure so that you can trigger maintenance activities before these failures occur,” said Fichte. “It is superior to other maintenance techniques because it allows you to optimize your maintenance costs by avoiding replacing equipment too early.”

“We’re going to be able to say, ‘This cell needs a new calibration, and it’s going to be offline for a day,’” said Benjamin Lobmueller, director, cloud solutions at Advantest. “‘When you do that, go check on these parts as well. And maybe exchange these two elements here, because they have a high likelihood that they’re going to come down, as well.’”

Wear-out can be affected by the operating environment and external factors, including software, so it’s important to understand the context. “You have operating conditions, you have software, you have software updates,” explained Uzi Baruch, chief strategy officer at proteanTecs. “Software is becoming a super-component, so it impacts the stress of the system a lot.”

The needed data can be generated by at least two different kinds of technology, external sensors and in-chip monitors and sensors. “There’s a lot of value in combining the different signals from the different kinds of monitors inside and outside,” noted Baruch.

These two approaches arise out of different ecosystems. The trick is unifying them so they work together to generate a single credible alert.

External sensors work for mechanical systems
Sensors took the lead on this idea back when MEMS technology was in the spotlight early in the millennium. The idea is that various sensors can monitor some mechanical aspect of equipment for evidence that all is not well.

Temperature and pressure may be two indicators of deviation from normal. In addition, moving parts create vibrations, and accelerometers can measure that vibration to detect when its pattern changes.

These approaches have been around for a while, but they ran into challenges. The “normal operation” profile was established by looking at aggregate devices, and their behaviors simply weren’t uniform enough. What was normal for one device might be off for another. The result was a huge number of false alarms that had a dampening effect on acceptance.

“Oftentimes, you weren’t really sure whether those insights could be actionable,” said Lobmueller. “And the reason was always the same: Did you get the right data? Could you trust that sensor reading? If you have a sensor reading from an engine that tells you one cylinder is running three degrees hotter than the other five cylinders, you probably have something wrong with the gasket running in that cylinder. However, oftentimes it’s that the sensor is specified to run plus-one to plus-three degrees,” meaning it could be a false indication based on normal variation.

What’s more effective is learning the behaviors of specific equipment. For that cylinder sensor, it means learning what “typical” is for each individual sensor. For an accelerometer mounted on a motor, that might mean running for a while to learn what “normal” looks like for that individual motor.
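The per-device learning approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor’s actual algorithm: it assumes an accelerometer supplies an RMS vibration reading per sample, and the training length and sigma threshold are placeholder values.

```python
import statistics

class VibrationBaseline:
    """Learn one motor's own 'normal' vibration level, then flag deviations.

    Hypothetical sketch: readings would come from an accelerometer's RMS
    vibration output; the training length and sigma limit are illustrative.
    """

    def __init__(self, training_samples=100, sigma_limit=4.0):
        self.training_samples = training_samples
        self.sigma_limit = sigma_limit
        self.history = []

    def observe(self, rms_vibration):
        # During the learning phase, just record what "normal" looks like
        # for THIS unit.
        if len(self.history) < self.training_samples:
            self.history.append(rms_vibration)
            return "learning"
        mean = statistics.fmean(self.history)
        stdev = statistics.stdev(self.history) or 1e-9
        z = abs(rms_vibration - mean) / stdev
        # Alarm only on a large deviation from this unit's own baseline,
        # not from a fleet-wide average -- which is what caused the
        # false alarms with aggregate profiles.
        return "alarm" if z > self.sigma_limit else "normal"
```

The key design point is that the threshold is relative to the individual unit’s learned statistics, so two motors with very different absolute vibration levels can both be monitored with the same code.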

At present, system designers are responsible for building the software stack that will both learn what “normal” looks like and then decide whether to fire an alarm. That may be a solved problem for legacy external sensors. “You have pre-trained models and algorithms that work in these situations that you can use off-the-shelf today,” said Lobmueller. “There is not that much rocket science in that anymore.”

In-chip monitors get a better view of electronics
More recently, in-chip monitoring has emerged as a new technology. The kinds of monitors and internal sensors available vary by physical structure, type of data generated, and how the data is processed by analytics. Analytics are a much more prominent notion now than was the case when external sensors started doing this, so many of the in-chip approaches come with analytics tools.

These monitors can look at low-level voltages, temperatures, performance, bus behavior, traffic, and other characteristics of an operating chip. They’re not intended only for preventive maintenance, but that is one of the applications they enable. This has become more important for semiconductors in a time when chips are expected to function for longer lifetimes, and where those aging mechanisms eventually will cause a failure.

“In-chip monitors can help to measure aging effects like degradation of transistor performance and interconnect electromigration,” said Andy Heinig, department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “The advantage is the possibility of measuring these things directly in the system in the context of environmental conditions like temperature and mechanical stress.”

One indirect way of monitoring aging is to look at some exemplary circuit that doesn’t have a function in the chip. “You can have aging monitors where you’re monitoring an aging control circuit to know what it looks like over a period of time,” said Randy Fish, director of marketing for silicon lifecycle management in Synopsys’ Digital Design Group.

But in-circuit monitors let you look directly at the active circuits and their environment. “The other way to do this is to monitor the activity of devices that are actually in use in the design, so you can monitor the environment to see how much switching is going on, etc., to indicate aging,” Fish explained.
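One common way to turn such monitor readings into a maintenance signal is trend extrapolation. The sketch below is an illustration under assumed inputs, not a production aging model: it takes a series of path-timing-margin samples from a hypothetical in-chip monitor and fits a simple least-squares line to estimate when the margin would be exhausted.

```python
def estimated_hours_to_limit(hours, delay_margins_ps, limit_ps=0.0):
    """Extrapolate when a monitored path's timing margin runs out.

    Hypothetical sketch: delay_margins_ps would come from an in-chip
    path-margin monitor sampled at the given operating hours; a plain
    least-squares line stands in for a real aging model.
    """
    n = len(hours)
    mean_t = sum(hours) / n
    mean_m = sum(delay_margins_ps) / n
    # Ordinary least-squares slope of margin vs. time.
    num = sum((t - mean_t) * (m - mean_m)
              for t, m in zip(hours, delay_margins_ps))
    den = sum((t - mean_t) ** 2 for t in hours)
    slope = num / den
    if slope >= 0:
        return None  # no measurable degradation trend
    intercept = mean_m - slope * mean_t
    # Solve margin(t) = limit_ps for t, then report hours remaining
    # after the most recent sample.
    t_limit = (limit_ps - intercept) / slope
    return max(0.0, t_limit - hours[-1])
```

For example, margins of 100, 90, 80, and 70 ps sampled at 0, 1,000, 2,000, and 3,000 hours project margin exhaustion at 10,000 hours, leaving roughly 7,000 hours in which to schedule intervention.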

Fig. 1: An example data visualization page looking at degrading delays as an indicator of aging. Source: proteanTecs


Internal and external sensors can have complementary functions. While an accelerometer may assess the health of a mechanical system, it can’t peer inside the electronic chips to see how they’re aging. That’s where the in-chip monitors shine. Then again, those in-chip monitors are of little use when trying to decide if a set of bearings is wearing out.

Having originated in different eras through different ecosystems, however, they are aligned in intent and little else. But a new semiconductor fact of life might make integrating the two approaches easier: newer silicon processes need analytics for managing yield.

A semiconductor data flood
The lifecycle of a silicon chip is being more closely watched than ever before. Every phase of its existence generates data. It starts during chip design with reams of verification data that establish the desired contours of operation. It continues through the manufacturing of the chip, test and assembly, and then throughout its lifetime.

All of the data up through packaging and deployment principally serves as learning for yield improvement. Data generated during manufacturing itself helps to ensure that the equipment building the chips is operating in spec. The data can even contribute to predictive and preventive maintenance of that equipment.

Fig. 2: An example of a data visualization page for correlating reliability failures with yield. Source: PDF Solutions


That learning happens thanks to analytics packages that distill the huge number of data points. Given the power of these packages, they can also be used to help anticipate equipment failure based on data generated during live operation. It still may require work to understand which data to use and how to manipulate it, but at least the engines that ingest and process the data are already in place – which was not initially the case for external MEMS sensors.

The good news is that many of these systems can accept data from anywhere. Even in a predictive/preventive maintenance application, live operation can be correlated with inspection and test results from when an individual chip was built.

These systems provide a basis for integrating internal and external sensor data. The software needs to be able to roll it all together to give a single alarm on pending problems even while maintaining the discrete data realms that indicate where the problem lies.
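The “single alarm, separate data realms” idea can be made concrete with a small fusion routine. This is a hedged sketch, not any vendor’s API: it assumes each data realm (accelerometer, in-chip thermal monitor, and so on) has already been reduced to a normalized 0-to-1 anomaly score, and the 0.8 trigger threshold is illustrative.

```python
def combined_alert(scores):
    """Roll per-realm anomaly scores into one alert, keeping provenance.

    Hypothetical sketch: 'scores' maps a data realm name (e.g.
    "accelerometer", "in-chip thermal") to a normalized 0..1 anomaly
    score; the trigger threshold is illustrative.
    """
    THRESHOLD = 0.8
    offenders = {name: s for name, s in scores.items() if s >= THRESHOLD}
    if not offenders:
        return None
    # One alarm goes to the operator, but the contributing realms stay
    # visible so maintenance knows where the problem lies.
    return {
        "alert": "maintenance_needed",
        "severity": max(offenders.values()),
        "sources": sorted(offenders),
    }
```

Keeping the source list in the alert is what preserves the discrete data realms: a single notification still tells the technician whether to look at the bearings or at the silicon.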

The cloud is well provisioned to handle heavy statistical number crunching, but streaming huge volumes of raw data to the cloud can chew up bandwidth. Response times can be an issue if an automated alarm is needed. That suggests a more local instantiation of the software. The easiest place for that would be a local on-premises server. Local networks would be far more able to handle the data volumes than the internet would be.

The fastest response would come from integrating the analytics into the system’s software stack. But that would require greatly paring down the software. Many of these packages have broad capabilities for engineers to view and analyze data to help make decisions, and the analysis they support may be automated or ad hoc.

Implemented in the system software, it would need to be specific to the selected analytic algorithm and data stream. The approach used could be developed with the assistance of an off-line system, but it would need to be streamlined for an online implementation. That effort would largely fall to the system integrator.

Some in-chip monitoring companies already make an in-system version available. “It will be up to the vendors of the third-party sensors or monitors to provide software to the chip vendors or system integrators that will allow them to access and make use of the data,” noted Richard Oxland, product manager at Siemens EDA.

Digital twins could take up the load
For systems with critical consequences of failure, digital twins could be used to digest the many streams of data. A digital twin is a virtual model of a physical device. It differs from a simulation model in that simulation is intended to represent the entire universe of such devices. For chips, that could mean fast and slow versions, with all manner of process variation included.

A digital twin represents a specific instance of a device. For a chip, that means that it comprehends all of the historical data so that it knows where it lies on the many distribution curves for the many parameters that characterize the device. For example, a slow device that’s operating slowly might not raise any eyebrows, but a fast device that is now operating slowly might be a cause for concern.
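The fast-chip-running-slow example can be expressed as a simple two-part check. This is a minimal sketch under assumed inputs, not a real digital-twin engine: it assumes the unit’s frequency at final test (its “birth” value) was recorded, along with fleet statistics, and the 5% drift limit and 3-sigma fleet bound are illustrative.

```python
def instance_anomaly(current_mhz, birth_mhz, fleet_mean_mhz, fleet_std_mhz,
                     drift_limit=0.05):
    """Judge a chip against its OWN history, not just the fleet average.

    Hypothetical sketch: birth_mhz is the unit's frequency recorded at
    final test; the drift limit and fleet sigma bound are illustrative.
    """
    fleet_z = (current_mhz - fleet_mean_mhz) / fleet_std_mhz
    drift = (birth_mhz - current_mhz) / birth_mhz
    if drift > drift_limit:
        # A fast part that has slowed noticeably is suspicious even
        # while it still beats the fleet mean.
        return "investigate"
    if fleet_z < -3.0:
        # A genuinely slow outlier vs. the whole population.
        return "investigate"
    return "ok"
```

A unit born at 1,200 MHz now running at 1,100 MHz is still two sigma above a 1,000 MHz fleet mean, yet its 8% drift from its own baseline flags it for investigation, which is exactly the distinction a per-instance twin captures that a fleet-average check misses.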

Given actual data from the real device as it operates, the twin can project forward the implications of any data trends or anomalies. “With a digital twin, the data from any number of different sensors or monitors is fed into a central data store to construct a virtual representation of the real system,” said Oxland.

If failure is indicated, then the digital twin can figure that out in advance of the failure of the real device, making corrective intervention possible.


Fig. 3: An example digital twin of a rocket booster. Source: Wilmjakob, CC BY-SA 4.0, via Wikimedia Commons


Digital twins can be created at any level. A digital twin of a complex piece of equipment could include digital twins of its components. And data can stream from any source, whether in-chip or external to a chip.

That said, digital twins can be incredibly complex, requiring a lot of effort to create. They’re more than just analytics engines, and preventive maintenance is but one of their possible uses. So creating digital twins requires a strategic commitment to build and maintain the models over the long term, churning out one instantiation for every live unit and maintaining them all as improvements are found.

Smarter, more efficient maintenance
It’s likely that the need for both predictive and preventive maintenance will continue to grow. The question, however, is what data will feed it. Right now, external and internal data streams are likely to come together to improve existing predictability. Where legacy external sensors are used, in-circuit monitors can improve results.

Consider accelerometers, which are an external mechanical sensor. “Initially, they will live alongside each other, with accelerometers providing continuity in the dataset and in-chip monitors providing a new level of fidelity that will lead to improvements in the accuracy of the predictive model,” Oxland said.

But as we learn more about these systems, we may find that we can make do with less data over time, reducing system costs. “Whether there is justification to continue deploying accelerometers will depend on whether the extra data they collect is important for model accuracy or for detecting failure modes that are not picked up by on-chip monitors — and the associated economic return,” he said.

Once engineers know how each data source contributes to the overall solution, they can start paring back to be more efficient. External sensors are unlikely to disappear entirely, leaving in-chip sensors to handle everything, because internal sensors aren’t going to be as good at watching the performance of mechanical components. And as long as internal monitors can be hidden in the whitespace, there’s probably no reason to get rid of them. But the ones that cost extra might be subject to future scrutiny.

Predictive maintenance works
While it’s early days for the more advanced types of predictive maintenance, there are indications it can provide real results. “We have seen that predictive maintenance can lead to 70% fewer equipment breakdowns, and it can also reduce maintenance costs by 25% and extend equipment lifetimes by 20%,” said Fichte.

But, historically, that requires a fair bit of effort by a wide range of engineers. “It requires a lot of different fields of expertise to successfully implement predictive maintenance,” she said. “One needs sensor expertise to know which sensors to use. You also need to know where to place them, as well as expertise in data analytics. Software expertise is needed so you can develop the algorithms that predict upcoming failures. And all of this information must be stored securely in the cloud, so you need cloud integration expertise, as well.”

This is what newer analytics packages are attempting to simplify. Moving forward, expertise will still be needed, but hopefully less work will be required for each case.

