Identifying Sources Of Silent Data Corruption

Rooting out the causes of silent data corruption errors will require testing improvements and much more.


Silent data errors are raising concerns in large data centers, where they can propagate through systems and wreak havoc on long-duration programs like AI training runs.

Silent data errors (SDEs), also called silent data corruption (SDC), are individually rare. But in fleets of many thousands of servers, containing millions of processors running at high utilization rates, these damaging events become common. And while mission-mode testing is catching more SDEs, detecting all corruption errors is proving more complex than expected, potentially requiring changes to design, manufacturing, DFT, test, and hardware and software operations.

“Silent data corruption happens when an impacted device inadvertently causes unnoticed errors in the data it processes,” said Jyotika Athavale, director of engineering architecture at Synopsys. “An impacted CPU might miscalculate data silently (i.e., without any indication of the data corruption). Given that today’s compute-intensive machine learning algorithms are running on tens of thousands of nodes, these corruptions can derail entire datasets without raising a flag, and they can take many months to resolve. This can result in massive cost implications. And the complexity and scale of the problem also make it difficult to take proactive measures. Moreover, since chips have long production cycles, SDC fixes can take several years before they are reflected in new hardware.”

Silent data errors are particularly vexing because they do not originate from any one source or mechanism. “There is a plethora of possible root causes when it comes to SDCs,” said Andrzej Strojwas, CTO at PDF Solutions. “People claim that the most likely culprit is test escapes, but a lot of these faults are not going to manifest themselves until they are exercised in real-world conditions. So leakage is one systematic defect you have at the transistor level because of the ridiculous tolerances and all the different layout patterns. The sensitivity to particular patterns can be missed in the testing and become reliability issues. Yet another category is aging, which results in changes in threshold voltages. All of these can be addressed with appropriate test structures.”

SDCs can occur at any point in the silicon lifecycle, which is why many silicon lifecycle management approaches are being applied to address these errors. “To eliminate early life failures, you have to apply stress test to accelerate aging,” said Janusz Rajski, vice president of engineering for the Tessent Division at Siemens EDA. “Next is high quality, deterministic test used in production, but we also use the same test in-system. Sometimes companies will do the test when a core is idle, or sometimes they will do it as preventive maintenance on a regular basis, weekly or once a month, but then there’ll be very thorough tests in-system. So that is a big change.”

Fig. 1: Semiconductor failure rate over lifetime of a device. Source: Siemens EDA

“SDCs are a big problem,” said Rajski. “Data published by several companies already indicate that 1 in 1,000 servers might be affected by this type of behavior. So obviously the implications of this could be even more severe in mission-critical or safety-critical applications. It was first noticed in the hyperscalers because of the sheer number of processing units, but it is happening in other places, as well.”

Also on the testing side, engineers are taking a long look at chip architectures. “There’s a need for what I call architecture-aware testing, because there are really only certain computational elements in the logic chip that can potentially propagate out to the entire network,” said Ira Leventhal, vice president of applied research and technology at Advantest. “So then it becomes a case of, ‘Let’s focus certain test vectors on those particular areas of the core — not just using classical scan and reviewing the results, but also bringing portions of functional test into ATE insertions.'”

The closer a test can come to mission mode, the more likely it is to catch a faulty result from silent data errors. “We can do scan over high-speed interfaces using a LinkScale card on our 93k tool, essentially operating the part like it would be operated in actual mission mode,” Leventhal said. “In this case you would not run the full set of scenarios that you would run at system-level test, instead just focusing on particular regions of the core and exercising the part in certain ways to trigger the SDC problems. This is especially important in heterogeneous integration, where we want to find all problems at the die level and ensure silent data corruption-hardened die. These are the kinds of things that would allow us to get the upper hand.”

But even with an upper hand in test, companies are realizing that cooperation throughout the supply chain is truly needed to solve the SDC problem. While collaboration between device manufacturers, test and DFT companies is enabling better screening and mitigation approaches, long-term strategies are needed because the silent data error problem will only become worse with increasing device and system complexity. For example, Meta is exploring ways to make its applications more fault tolerant to silent data corruptions. “There is a longer term horizon effort around improving and incentivizing architectural solutions and design patterns that are inherently SDC resilient,” said Sriram Sankar, director of engineering at Meta.

Acknowledging the urgency of resolving SDCs throughout the supply chain, the Open Compute Project launched its Server Component Resilience Workstream, which includes members from AMD, Arm, Google, Intel, Microsoft, Meta and NVIDIA. Last June it awarded funding for six research projects to tackle SDEs. [1]

Others agree that help from the research community is needed. “Doing more of what we have been doing in the past will not significantly move the needle,” said Rama Govindaraju, engineering director at Google, during a recent panel discussion. [2] “We need more research in this space because this needs a more holistic solution, and new ideas, creative ideas, have to be brought to bear. [SDC] is a very, very hard problem, where a lot of research and end-to-end solutions need to be developed.”

Tracing the causes of SDCs goes all the way back to design. “One of the chip design houses we spoke with indicated that even design errors can become sources of SDEs,” said Adam Cron, distinguished architect at Synopsys. “A post-silicon validation tool can generate corner-case workloads for threaded applications. Then, a post-silicon exerciser can be used to find design errors on silicon during manufacturing test and in-field test. These tests can also be used in simulation and emulation to determine whether the design logic is actually wrong. But sometimes it takes real silicon to find these peculiar errors.”

Cron emphasized the importance of taping out actual chips to identify new failures, especially at new technology nodes. “Memories sometimes need new technology-specific BIST algorithms to find these new defect syndromes. Taping out real silicon is a great hedge to finding design styles or physical layout schemes that may later become SDEs.”

So far, the semiconductor industry has made the greatest progress in more effectively screening out defects at test and containing SDE damage through software. However, marginality in design and variability in processes are likely causes of SDEs that are particularly challenging to find. What starts out as a latent defect that passes all tests and inspections can fail in the field once it is subjected to real-world conditions.

“With an SDE, you may have an IP that is marginal, but it is good enough to pass at time zero,” said Nitza Basoco, technology and marketing director at Teradyne. “But with a certain combination of signal paths and environmental conditions, a marginal defect may become a critical defect. And because the defects are sensitive to different combinations of factors, they may or may not result in a failure.”

Unlike testing, which catches failures after they occur, some strategies take a preventive approach. “We are focused on predicting these failures, because today the big problem is that most of these issues come back to the semiconductor vendors, and they spend a long time and a lot of money doing failure analysis. And in many cases, they cannot reproduce the failure,” said Evelyn Landman, co-founder and CTO of proteanTecs. “The goal is to avoid failures in the first place. We are seeing that we can find defects in some chips that are coming back, in cases where our methodology was not applied.”

For example, specific process monitors that are sensitive to leakage current can be compared against a per-chip model of expected leakage. A measured leakage current that exceeds the model’s prediction flags a potential silent data error defect.
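The model-based comparison can be sketched as follows. This is an illustrative Python sketch, not proteanTecs’ actual methodology: the simple least-squares fit, the relative margin, and the function names are all assumptions made for clarity.

```python
# Sketch: flag chips whose measured leakage exceeds a model-predicted
# value by a relative margin. The "model" here is a least-squares line
# fit of leakage vs. a single process-monitor reading; production
# systems use far richer, proprietary models.

def fit_leakage_model(monitor_readings, leakages):
    """Least-squares line: expected leakage as a function of a monitor reading."""
    n = len(monitor_readings)
    mean_x = sum(monitor_readings) / n
    mean_y = sum(leakages) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(monitor_readings, leakages))
    var = sum((x - mean_x) ** 2 for x in monitor_readings)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def flag_outliers(chips, slope, intercept, margin=0.2):
    """Return IDs of chips whose leakage exceeds the prediction by > margin."""
    flagged = []
    for chip_id, monitor, leakage in chips:
        expected = slope * monitor + intercept
        if leakage > expected * (1 + margin):
            flagged.append(chip_id)
    return flagged
```

The key property is that a chip is judged against what chips with its own process signature should leak, not against a fixed spec limit, so a device can pass all datasheet limits and still be flagged.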

A second method uses telemetry monitors to track timing margin, because changes in timing margin can be a strong predictor of failures. A change in timing margin can result, for instance, from resistive metal lines caused by weak connections, or from slow transistor switching caused by feature roughness.

The impact of a timing delay also depends on where the fault propagates. If it propagates along a short path, a small delay will go unnoticed. If it propagates along a lengthy critical path, even a small delay may cause a malfunction.
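The idea of treating margin telemetry as a degradation trend can be sketched in a few lines. This is a hypothetical illustration, assuming periodic margin samples per monitored path and a simple linear extrapolation; the horizon and threshold are made-up parameters, not any vendor’s actual algorithm.

```python
# Sketch: treat periodic timing-margin telemetry (e.g. picoseconds of
# slack) as a time series per monitored path, and flag paths whose
# margin is eroding toward zero before an actual failure occurs.

def margin_slope(samples):
    """Per-sample slope of timing margin via least squares over sample index."""
    n = len(samples)
    mean_t = (n - 1) / 2
    mean_m = sum(samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in enumerate(samples))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return cov / var

def paths_at_risk(path_margins, horizon=100):
    """Flag paths whose extrapolated margin goes non-positive within `horizon` samples."""
    risky = []
    for path, samples in path_margins.items():
        slope = margin_slope(samples)
        projected = samples[-1] + slope * horizon
        if projected <= 0:
            risky.append(path)
    return risky
```

A path with ample slack and a flat trend is ignored, while a critical path with steadily shrinking margin is flagged long before the delay actually exceeds the clock period.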

Yet all of these monitors come at a cost in silicon area. Only so many sensors can be added to a device, particularly at a leading-edge node, before the available space runs out. For this reason, telemetry sensors must be placed judiciously where they are most needed.

Part of the reason silent data errors are happening more frequently may have to do with the increasing amount of time chips spend in high stress modes. “An SoC wasn’t meant to be run 24/7 at the maximum voltage, maximum frequency, high power consumption,” said Teradyne’s Basoco. “It was meant to be at these levels for shorter periods of time. And now it’s spending the majority of its time in a high stress environment, so things are going to break down. We need to look at what we are going to be running, and what things we need to tweak in order to ensure the longevity of these devices in an environment that is very different than the environment at qualification.”

“With silent data corruption, there are three ways in which we’ve gotten things under control — by detecting these errors, minimizing them, and building defect-tolerant systems,” said Advantest’s Leventhal. “You have to be able to do all three of these things. I liken it to the way in which communications are dealt with. We never expect a communication link to be perfect, so you always have this error checking going on. If the system detects an error, you perform a retry. That is the expected mode of operation.”
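The detect-and-retry pattern Leventhal describes can be sketched as follows. This is a hypothetical illustration of the principle, borrowing the communications analogy: run the computation redundantly, compare results, and retry on mismatch. The function names and retry count are assumptions.

```python
# Sketch: redundant execution with compare-and-retry, analogous to
# error checking and retransmission on a communication link. A single
# silent miscalculation produces a mismatch between the two runs and
# triggers a retry instead of propagating corrupted data.

def run_with_retry(compute, inputs, max_retries=3):
    """Execute `compute` twice per attempt; accept only when both runs agree."""
    for _ in range(max_retries):
        a = compute(inputs)
        b = compute(inputs)
        if a == b:
            return a
    raise RuntimeError("persistent mismatch: possible silent data corruption")
```

The cost is obvious (every accepted result is computed at least twice), which is why real systems reserve this kind of redundancy for the computational elements most likely to propagate errors, rather than applying it everywhere.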

And testing for SDCs doesn’t happen in isolation. “Any structure implemented to help detect defective components would help detect devices that would otherwise be silent upon failure,” said Synopsys’ Cron. “There is not a specific tool for the silent data errors, but any features that indicate general silicon quality issues are useful. For example, process monitors combined with outlier detection analytics can help filter out chips that might become problematic in the field.”
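One common form of such outlier analytics is screening in the spirit of part-average testing: a chip whose monitor reading sits far from its own population’s distribution is flagged even though it passes the spec limits. The sketch below is an illustrative simplification; the population grouping, the mean/standard-deviation statistic, and the `k` threshold are assumptions, and production flows typically use more robust statistics.

```python
# Sketch of part-average-style outlier screening: flag chips whose
# process-monitor reading falls outside mean +/- k*sigma of their
# population (e.g. the same wafer), even if within spec limits.

from statistics import mean, stdev

def part_average_outliers(readings, k=3.0):
    """Return IDs of readings outside mean +/- k standard deviations."""
    mu = mean(readings.values())
    sigma = stdev(readings.values())
    return [cid for cid, r in readings.items() if abs(r - mu) > k * sigma]
```

Because a gross outlier inflates the mean and standard deviation it is judged against, real implementations often compute the statistics on a trimmed or median-based population; that refinement is omitted here for brevity.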

The alarm bells around silent data errors have quieted because companies like Meta and Google have found ways to contain the problem using software. “For now, there is containment, but if SDE frequency increases to the point where they don’t have containment or the current workarounds no longer suffice, the industry needs to be prepared,” said Basoco.

And the industry is being proactive with multi-die assemblies. “A DFT (design for test) architecture for 3D-ICs is emerging that takes into account silent data corruption errors and the growing size of designs,” said Siemens’ Rajski. “Test compression is not new, but its use in these cores is ubiquitous. Second, the streaming scan network, which delivers packetized data from one core to another, is used on most large designs and delivers data at very high speed. And we’re developing iJTAG to help program a large number of instruments in parallel. At last year’s ITC we introduced in-system test, which provides deterministic capability especially for customers that are concerned either by silent data errors or which have a specific problem in RAS (reliability, availability and serviceability). Finally, you need monitors to understand the process corners, the PVT corners, structural sensors, like a slack or path sensor, and you need to correlate the sensor readings with the test results.”

Conclusion
Though the number of silent data errors caught during manufacturing and testing is increasing through DFT, process monitors and more thorough testing, the industry has a long way to go in terms of identifying all the root causes of SDEs, mitigating their impact, and preventing them from propagating in data centers.

Nonetheless, leading companies are responding with more comprehensive and mission-mode based testing as well as in-system testing methods. Better sharing of data between hyperscalers, IC manufacturers, test companies, DFT providers and EDA companies will lead to more holistic solutions while preventing a duplication of efforts within the supply chain.

References

  1. Open Compute Project press release, https://www.opencompute.org/blog/ocps-server-resilience-initiative-sdc-academic-research-awards-announced
  2. “Speaking Up About Silent Data Corruption – Part I,” Synopsys panel discussion, https://www.youtube.com/watch?v=FgpqDPf0mZA



