Data centers and automotive chips begin using on-die circuitry to predict silicon failures.
The chip industry is starting to add technology that can predict impending failures early enough to stave off serious problems, both in manufacturing and in the field.
Engineers increasingly are employing in-circuit monitors embedded in SoC designs to catch device failures earlier in the production flow. But for ICs in the field, data tracing from design to application use only recently has become available, and the methods for tying together that data in useful ways are new. Engineering teams are still figuring out when and how predictive maintenance approaches should be applied to complex electronic systems.
“What I’ve seen done today has been more reactive in nature,” said Randy Fish, director of silicon lifecycle management at Synopsys. “You have what they call catastrophic trip monitors, where if a temperature reaches a certain level you actually shut off the chip because you’re entering a destructive region. That is common. So that is done for reliability reasons. But you’re not really predicting anything in that sense.”
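The reactive behavior Fish describes can be sketched in a few lines. This is only an illustration: the trip point, the function name and the sample readings are all hypothetical, and real trip monitors are implemented in hardware rather than software.

```python
from typing import Iterable, Optional

TRIP_C = 125.0  # hypothetical trip point; real limits come from the chip's specification


def first_trip_index(temps_c: Iterable[float], trip_c: float = TRIP_C) -> Optional[int]:
    """Return the index of the first sample at or above the trip point, else None."""
    for i, temp in enumerate(temps_c):
        if temp >= trip_c:
            return i  # a trip monitor would shut the chip off at this sample
    return None


readings = [71.0, 88.5, 104.2, 126.3, 90.0]  # made-up temperature samples in Celsius
print(first_trip_index(readings))
```

The check only looks at the current sample. It says nothing about how close earlier samples were to the limit or how fast they were rising, which is why this kind of monitor protects the part without predicting anything.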
Predictive maintenance is touted as a key application within the realm of silicon lifecycle management (SLM). Identifying changes in performance at the block or transistor level can help detect impending IC failures, prompting a change in a chip or subsystem before it fails or impacts system performance. For SoCs that contain identical IP blocks, in-field notification of impending failure can enable a switch to a spare functional block.
Modern SoCs contain a diversity of in-circuit monitors that could provide time-stamped data, which could then be correlated with system performance – expected from design or measured in the field. Comparisons and correlations also can be made across a fleet of ICs. When the appropriate edge and cloud computing frameworks are in place, detecting IC outliers across a fleet of cars or data centers becomes both plausible and desirable.
Automotive, data centers take the lead
“Hyperscalers and car OEMs will determine the pivotal use cases for predictive maintenance, which will drive data sourcing requirements down to the manufacturers for both chip and system,” said Uzi Baruch, chief strategy officer at proteanTecs.
Automakers and cloud providers rely on scheduled maintenance procedures to keep their complex systems running. In addition, their systems have built-in warnings or responses to some failure modes. Consider a car’s check-engine indicator that lights up when a sensor reading is out-of-spec. Or a data center’s high-speed data bus with 16 lanes that reverts to operating on only 8 lanes when 1 lane fails.
Moving to a predictive or just-in-time maintenance approach is attractive to both parties, though they have different mission profiles and operating conditions. The reliability metrics also differ. Automotive reliability is measured longitudinally, over time, while cloud service reliability is measured in real time.
New game, new process nodes
“In the automotive sector the cost of recall is so brutally high. They design their components for a long life,” said Synopsys’s Fish. “Previously, they designed ICs using process nodes that had been around for 10 years or more. This provided good historical data. With automotive ICs moving to 10, 7 and 5nm, there’s not going to be 10 years of data for them to leverage. Their desire to see what’s really happening is strong, because they don’t have history on those process nodes.”
Put simply, a lack of historical data requires them to learn in the field.
“We all grew up on the implications of the reliability bathtub curve,” said Walter Abramsohn, director of product marketing at proteanTecs. In this curve, device failure rates are high during initial fabrication of a product (infant mortality or pre-production), reach a steady level (bottom of bathtub, production), then elevate again at end of life. “But now, as processes shrink, the curve is closing in and getting steeper. In advanced nodes, you can go from one end of the curve to the next in three years. That means that we are seeing more infant mortality and faster wear-out. Visibility is needed at every stage to capture these phenomena in advance. Predictive maintenance is fundamental to enabling these designs in mission-critical and uptime-critical markets.”
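One textbook way to picture the bathtub curve is as the sum of three failure-rate terms: a falling Weibull hazard for infant mortality, a constant random-failure rate, and a rising Weibull hazard for wear-out. The sketch below uses that model with made-up parameters. Neither the model choice nor the numbers come from proteanTecs. They are assumptions chosen only to produce the familiar shape.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(
    t: float,
    infant=(0.5, 2.0),    # hypothetical (shape, scale); shape < 1 gives a falling rate
    random_rate=0.01,     # hypothetical constant failure rate
    wearout=(5.0, 10.0),  # hypothetical (shape, scale); shape > 1 gives a rising rate
) -> float:
    """Sum of infant-mortality, random and wear-out failure rates at time t (years)."""
    return weibull_hazard(t, *infant) + random_rate + weibull_hazard(t, *wearout)


for years in (0.1, 3.0, 12.0):
    print(years, round(bathtub_hazard(years), 3))

# Shrinking the wear-out scale pulls the right-hand wall of the bathtub inward.
print(round(bathtub_hazard(3.0, wearout=(5.0, 4.0)), 3))
```

In this toy model, a smaller wear-out scale raises the failure rate at year three. That is one way to read the comment that the curve is "closing in" at advanced nodes.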
The automotive industry has been planning for this challenge, and its focus goes beyond maintenance.
“There are already standards bodies and automotive OEMs exploring how this data can be used for preventive maintenance,” said Richard Oxland, product manager for Tessent embedded analytics at Siemens EDA. “But also, how it can be used to tune the operating parameters of devices throughout their life, constantly monitoring and adjusting these parameters as the silicon ages? This can help predict when the failures will occur, and help extend the lifetime of the silicon. The utilization of a car over its lifetime will dramatically increase as we change from a personal usage model to a shared usage model. And this will put even more demands on the reliability of the silicon.”
Adjusting operational parameters is also of interest to data center owners. “Hyperscalers want to be able to squeeze out as much operational performance as possible, Vmin and Fmax,” said Fish. “How much [performance] can they really get out of those parts within the power envelope?”
In addition, predicting a failure means data centers can reduce their downtime. “If you look at the public cloud providers, there is a service level agreement (SLA),” said Guy Cortez, product marketing manager in the digital design group at Synopsys. “There’s money associated with uptime and access time. So when something fails, there’s real money to pay if they can’t meet those agreements with their customers. They want to know in advance if something will fail so they can act upon it.”
Yet despite all the anticipated engineering and operational benefits, barriers do exist. Data ownership issues and the complexity of the supply chain are impeding adoption of predictive maintenance.
“One challenge to overcome is the complexity of the supply chain,” said Oxland. “The end users of the chip will likely be owners of a very complex system of systems. Who owns the sensor data in that scenario? If a single chip self-reports that it is operating out of normal limits, that is a straightforward scenario. The board containing the offending chip can be swapped out or a technician called to replace it. But in a complex supply chain, sensor data must be collected from a large number of chips, potentially from different suppliers, and data must be correlated from all the chips on the same wafer, which might have ended up in systems owned by different operators. The economic benefits of predicting failures need to be balanced against the cost of setting up the infrastructure to support this, and the organizational resistance to letting an outside party (such as the sensor vendor) collect that data and do the analysis.”
There is accessing the data, and then there is understanding the value of new data. Managers who believe in the expected ROI are more likely to invest in the data sharing infrastructure and forge agreements to make it so.
“Some of the challenges are the true efficacy and proving it out,” said Synopsys’s Fish. “For instance, how accurate can you get on predictive analytics? We think there’s significant value there. Getting the data is another challenge. Which data do the semiconductor manufacturers and the IC suppliers own, and are they willing to share all that data? In the automotive world there are a number of players for data ownership. Is it the fleet owner, the Tier 1, or the OEM? In the case of the hyperscaler data centers, it’s a little clearer. Effectively, they can own the whole stack — particularly for the chips they’re designing. The transport of data and sharing of data is not an issue.”
In-field monitoring
Predictive maintenance requires a proactive strategy. “In-field predictive maintenance strategies need to start with the right forward-looking plan. Reliability monitoring measures need to be accounted for from the get-go, so they’re able to provide the right data,” said Marc Hutner, senior director of product marketing at proteanTecs. “So we need to start thinking about optimizing in-field maintenance starting at design, and make sure we have the mechanisms in place. At the end of the day, all maintenance strategies rely on good data, and predictive maintenance starts with understanding the deepest level of parameters and failure mechanisms. This includes visibility of performance margins, application stress, aging effects, random or silent errors, latent production defects and more. Once we have the right telemetry stream, models can be built to actually calculate performance degradation, and that’s what ultimately leads to predictive maintenance.”
A variety of on-die circuitry is available to measure parameters of interest, ranging from transistor-level properties to IP block path delays to system workloads. Jon Holt, manager of volume manufacturing solutions and worldwide fab applications solutions manager at PDF Solutions, spelled out typical measurements as:
“The end customer can compare t<0 (wafer sort) to t>0 (system or field) and build a predictive model to determine when the component should be replaced,” Holt said.
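A minimal sketch of the comparison Holt describes might treat the wafer-sort value as the time-zero baseline, fit a straight line through it and the field readings, and extrapolate to a replacement limit. The linear-drift assumption, the function name, the limit and the sample values are all hypothetical. Real aging is often nonlinear, and the source does not say which model is used.

```python
from typing import List, Optional, Tuple


def hours_until_limit(
    baseline: float,
    field_samples: List[Tuple[float, float]],
    limit: float,
) -> Optional[float]:
    """Estimate the remaining field hours before a monitored value reaches `limit`.

    baseline      -- value measured at wafer sort (t<0), used as the hour-0 point
    field_samples -- (hours_in_field, measured_value) pairs (t>0), in time order
    Returns None when the fitted trend is flat or improving.
    """
    points = [(0.0, baseline)] + list(field_samples)
    if len(points) < 2:
        raise ValueError("need at least one field sample")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = sxy / sxx  # least-squares drift per hour
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope  # hour at which the fitted line hits the limit
    return max(0.0, crossing - points[-1][0])


# Hypothetical path-delay readings in picoseconds
remaining = hours_until_limit(100.0, [(1000, 101.0), (2000, 102.0), (3000, 103.0)], 110.0)
print(remaining)
```

With these made-up numbers the delay drifts by about 1 ps per 1,000 hours, so the estimate is about 7,000 more hours before the 110 ps limit. A production model would also need confidence bounds and a check that the drift really is linear.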
Others point to additional factors. “Generally, the measurements we’re talking about fall into three categories — physical (parametric) monitoring, such as PVT or path delays; structural monitoring, which is primarily scan test data from DFT infrastructure; and functional monitoring data, such as information about bus transactions that can be derived from embedded analytics monitors.”
The architecture used for BiST-based self-repair of an automotive chip (figure 1) shows how preventive maintenance can be implemented.
Fig. 1: System level architecture of an automotive chip using BiST results to repair an IC. Source: Siemens EDA
“It is also common today to have both memory repair and, for AI or hyperscaler-type devices, even logical core repair,” said Oxland. “Using the various mechanisms described earlier, the data can be used to repair any failing logic and essentially create self-healing devices.”
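As a rough illustration of logical core repair, the sketch below maps logical core numbers onto whichever physical cores passed their most recent BiST run, skipping any that failed. The data layout and function name are hypothetical. Actual repair schemes are implemented in hardware and firmware, and the source does not describe them at this level of detail.

```python
from typing import Dict, List


def remap_cores(bist_pass: List[bool], logical_count: int) -> Dict[int, int]:
    """Map logical core IDs to physical cores that passed BiST.

    bist_pass[i] is True when physical core i passed its latest self-test.
    Raises RuntimeError when too few healthy cores remain.
    """
    healthy = [core for core, ok in enumerate(bist_pass) if ok]
    if len(healthy) < logical_count:
        raise RuntimeError("not enough healthy cores to repair")
    return {logical: physical for logical, physical in enumerate(healthy[:logical_count])}


# Five physical cores, one spare; physical core 2 has failed BiST.
mapping = remap_cores([True, True, False, True, True], logical_count=4)
print(mapping)
```

Here logical cores 2 and 3 shift to physical cores 3 and 4, so the system still sees four working cores. Once the spare is used up, the next failure can no longer be hidden this way.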
Detecting outliers in the field
Several industry experts envision comparing data from a single IC with data from the same ICs in other systems. This helps engineers understand differences in actual mission profiles and detect significant differences in the measured parameters. The latter lets them hunt for outliers and flag them for action. Just as manufacturing test engineers use outlier algorithms such as part average testing (PAT) or ratioed metrics to identify potentially failing parts, automotive and data center engineers can use outlier algorithms to flag anomalous IC behavior and take the appropriate action.
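A PAT-style check carried over to field data could look something like the sketch below. It uses one common robust formulation: the median as the center, the interquartile range divided by 1.35 as a robust sigma, and limits at plus or minus six robust sigma. The chip IDs, the readings and the choice of six sigma are illustrative assumptions, not values from any of the vendors quoted here.

```python
import statistics
from typing import Dict, List


def pat_outliers(readings: Dict[str, float], k: float = 6.0) -> List[str]:
    """Return IDs of chips whose reading falls outside median +/- k * robust sigma."""
    values = list(readings.values())
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    robust_sigma = (q3 - q1) / 1.35  # IQR-based sigma estimate, less sensitive to outliers
    low = median - k * robust_sigma
    high = median + k * robust_sigma
    return sorted(chip for chip, value in readings.items() if value < low or value > high)


# Hypothetical readings of one monitored parameter from the same IC across a fleet
fleet = {
    "chip-00": 0.49, "chip-01": 0.50, "chip-02": 0.51, "chip-03": 0.50, "chip-04": 0.52,
    "chip-05": 0.48, "chip-06": 0.50, "chip-07": 0.51, "chip-08": 0.49, "chip-09": 0.80,
}
print(pat_outliers(fleet))
```

Because the limits come from the fleet itself rather than from a fixed spec, a part can be flagged while still technically in spec, which is the point of outlier screening.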
“A more sophisticated methodology is to aggregate across many chips over time and correlate changes in physical parameters with information about die location on the wafer, process lot, manufacturing date and so on,” noted Siemens EDA’s Oxland. “This allows IC vendors and system operators to infer the probability of a given chip failing at some point in the future, and also the probabilities for dies from the same neighborhood on the wafer, or same wafer in the lot. We are now seeing this methodology applied by some vendors.”
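The aggregation Oxland describes can be approximated, very roughly, by grouping field records by lot and wafer and computing the share of chips on each wafer already flagged as drifting. The record fields and values below are hypothetical. The resulting rate is a simple empirical fraction, not a calibrated failure probability, and a real system would also use die coordinates and manufacturing dates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def flagged_rate_by_wafer(records: Iterable[dict]) -> Dict[Tuple[str, int], float]:
    """Return the fraction of flagged chips for each (lot, wafer) pair."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        key = (rec["lot"], rec["wafer"])
        totals[key] += 1
        if rec["flagged"]:
            flagged[key] += 1
    return {key: flagged[key] / totals[key] for key in totals}


# Hypothetical field records joined with manufacturing genealogy
records = [
    {"chip": "c1", "lot": "A1", "wafer": 3, "flagged": True},
    {"chip": "c2", "lot": "A1", "wafer": 3, "flagged": False},
    {"chip": "c3", "lot": "A1", "wafer": 3, "flagged": True},
    {"chip": "c4", "lot": "A1", "wafer": 3, "flagged": False},
    {"chip": "c5", "lot": "A1", "wafer": 7, "flagged": False},
    {"chip": "c6", "lot": "A1", "wafer": 7, "flagged": False},
]
rates = flagged_rate_by_wafer(records)
print(rates)
```

In this made-up data, the two unflagged chips from wafer 3 would deserve closer watching than those from wafer 7, even though none of them has shown a problem yet.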
The goal is to at least isolate, if not eliminate, failures. “Ultimately what we’re trying to do is prevent an epidemic,” said proteanTecs’ Baruch. “The most valuable benefits will come from monitoring not just one chip at one time, but the entire fleet of chips deployed in a similar application for their entire lifecycle. So asking how many devices will suffer from the same phenomena is crucial to assessing the liability risk.”
The automotive and cloud sectors are just beginning to travel this path, and no one is quite there yet.
“The ultimate goal is to be able to reliably predict reliability,” said Fish. “I don’t think anybody’s delivering on that today. The monitors and the information across thousands or hundreds of thousands of components really hasn’t been gathered yet and modeled in that way. There’s a bit of a ramp to get there.”
Conclusion
The value of predictive maintenance over scheduled maintenance is multifaceted. Engineers can detect an impending failure sooner, change components when needed, and better optimize the performance of their systems.
Data center operators and automakers will be the early adopters of preventive maintenance for electronics. Hence, they will lead the way in showing how data from within ICs can be used to enable proactive responses. Ramping up is just beginning, and no company has fully implemented such schemes. Once the investments in business agreements and data infrastructure are in place, engineers can consider the many technical possibilities.