Silent Data Errors Still Slipping Through The Cracks

Expanded DFT and test strategies are catching more SDEs, but this rare problem in server fleets is far from solved.

Silent data corruption errors in large server farms have become a major concern of cloud users, hyperscalers, processor manufacturers and the test community.

Silent data errors (also called silent data corruption errors) are hardware errors that occur when an incorrect computational result from a processor core goes undetected by the system. The data is silently corrupted because neither software nor hardware catches the incorrect calculation of, say, 1+2=4. That error then propagates through the system, maybe for several weeks on long training runs, resulting in a waste of costly resources.
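To make the failure mode concrete, the toy Python sketch below models a core that silently returns a wrong sum and shows how one classic containment technique, dual-modular redundancy (running the operation twice and comparing), would expose the disagreement. The "cores" here are ordinary functions and the injected fault is artificial; this illustrates the principle, not how any particular fleet actually detects SDEs.

```python
# Toy model only: a "core" that silently returns a wrong sum, and a
# dual-modular-redundancy check that exposes the disagreement.

def add_on_healthy_core(a, b):
    return a + b

def add_on_marginal_core(a, b):
    # Injected fault for illustration: the wrong answer comes back with no
    # exception and no error flag, which is what makes the error "silent."
    return a + b + 1          # e.g., 1 + 2 = 4

def checked_add(a, b):
    """Run the operation on two cores and compare the results."""
    r1 = add_on_marginal_core(a, b)
    r2 = add_on_healthy_core(a, b)
    if r1 != r2:
        raise RuntimeError(f"mismatch detected: {r1} != {r2}")
    return r1

# Without the check, the bad value simply flows into later computation.
silent = add_on_marginal_core(1, 2)       # 4, and nothing complains
downstream = silent * 1000                # the corruption propagates

# With the check, the disagreement is caught at the point of computation.
try:
    checked_add(1, 2)
except RuntimeError as err:
    print("caught by redundancy:", err)
```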

“The whole issue with these errors is that they are silent,” said Adam Cron, distinguished architect at Synopsys. “The program you’re running doesn’t hear about it. The OS doesn’t hear about it. The user doesn’t hear about it. The only reason we know they exist is that eventually the data issue is discovered, but perhaps via a strange path. For example, a user might discover that they are missing a file. Debugging determines that the chips or cores storing these files have made an incorrect calculation that caused the file to be deleted. Based on this seemingly random issue, data centers have begun testing their chips in-system to see what other random defects they can find. And they are finding them at alarming rates of 1,000 DPM.”

Silent data errors are not a new type of failure or caused by a new type of defect, but they are more noticeable. “Everyone agrees this is not some new phenomenon,” said Ira Leventhal, vice president of applied research and technology at Advantest. “But it has just now come to the fore, given the massive scale at which these systems are being deployed. When you’re doing these long training runs, in some cases, you’re involving tens of thousands of servers. So you have a massive number of cores that get involved in just getting one particular job done.”

Both Google and Meta sounded the alarm about SDC errors a few years ago, and the extent of the problem boils down to 1 in 1,000 machines in a data center fleet having a silent data error. So far, SDCs/SDEs are rare hardware events that cannot be traced to any single root cause. Nevertheless, several approaches are being taken, especially on the testing front. This makes sense because some 80% of SDCs are believed to be time-zero test escapes — errors that elude testing. And while the hyperscalers have containment strategies in place, SDEs are widely expected to become an even greater issue as AI algorithms become more complex, and as an increasing number of GPUs, CPUs, and memory devices work together to solve the world’s greatest computational problems.

These silent data errors also have become more prevalent due to ever-shrinking transistors fabricated near their physical limits, huge numbers of cores in data centers running identical code, and the limitations in testing hugely complex systems.

Where do SDEs come from?
The root cause of SDEs can be traced to at least two parts of the chip. “When it comes to where in the silicon SDEs come from, it’s usually from resistive opens or weak transistors,” said Janusz Rajski, vice president of engineering, Tessent Group at Siemens EDA. “By that I mean transistors performing a logic function but they are slower. They have weaker drive strengths so they cause subtle, small timing delays. And depending on the voltage, temperature, environmental conditions, sometimes you may detect it and sometimes you may not detect it. It also depends on where the fault propagates, because if it propagates along a short path, a small delay will not be seen. If it propagates along a long path, like a critical path, then even a small delay may result in malfunction.”
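Rajski’s point about path length comes down to simple timing arithmetic: a defect-induced delay only causes a failure when it pushes the total path delay past the clock period, so the same weak transistor can be invisible on a short path and fatal on a near-critical one. A rough sketch with made-up numbers:

```python
# Rough arithmetic sketch with made-up numbers: a small defect delay is only
# visible when the path it lands on has little slack left to absorb it.

CLOCK_PERIOD_PS = 500    # hypothetical clock period in picoseconds

def path_fails(nominal_delay_ps, defect_delay_ps, period_ps=CLOCK_PERIOD_PS):
    """A timing failure occurs when total path delay exceeds the clock period."""
    return nominal_delay_ps + defect_delay_ps > period_ps

defect_ps = 40   # extra delay from a resistive open or weak transistor

print(path_fails(300, defect_ps))   # False: a short path hides the defect
print(path_fails(480, defect_ps))   # True: a near-critical path exposes it
```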

There is no simple solution that solves SDCs. “The causes are multi-faceted and we need to bring to bear multiple solutions together to resolve this problem,” said Advantest’s Leventhal. “You’re not going to get there without a high level of collaboration in the industry, because it’s testing the corner cases. But then, it’s also in test program development, and going after certain kinds of process problems that might lead to SDC errors.”

One method for identifying SDEs involves on-chip monitors that can track how a chip is aging. “Silent data corruption has many sources. It may be defects that were not found. Or maybe there was not enough guard-band to account for lifetime effects, especially with all these new applications that are being released to the field,” said Evelyn Landman, co-founder and CTO at proteanTecs. “There’s aging that may impact the chip itself, but there’s also the chips around it that are delivering power or clocks to the main chip. We use monitoring agents to measure where a chip is in relation to its expected life; for instance, you may expect a certain degradation after 5 years for a server chip, and you can measure at 3 years and see if it is performing as expected for that level of aging.”

The timing margin of devices is especially relevant when it comes to failures. “We often look at the timing margin, because timing margin is a great predictor of failure. With our real-time health monitoring application we provide what we call a ‘performance index,’ showing how close this chip is to a failure. And because we can predict this, we can completely avoid reaching the point of failure,” added Landman.


Fig. 1: Comparison between detecting SDC after a failure occurs or detecting a degradation fault in time to avoid the failure using a Real Time Health Monitor (RTHM). Source: proteanTecs
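A minimal sketch of the idea behind such a margin-based “performance index” follows: sample the timing margin reported by on-chip monitors over time, fit a trend, and estimate when the margin will cross a failure threshold so the part can be serviced first. The threshold, the linear-degradation model, and the telemetry values are assumptions for illustration, not proteanTecs’ actual algorithm.

```python
# Illustrative sketch only (not proteanTecs' actual method): fit a linear
# trend to periodic timing-margin readings and estimate when the margin will
# cross an assumed failure threshold, so the part can be serviced beforehand.

FAIL_THRESHOLD_PS = 20.0    # assumed: margin below this risks a timing failure

def months_until_failure(readings):
    """readings: list of (month, margin_ps) samples from an on-chip monitor."""
    n = len(readings)
    xs = [m for m, _ in readings]
    ys = [v for _, v in readings]
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope >= 0:
        return None                       # no degradation trend observed
    intercept = y_mean - slope * x_mean
    return (FAIL_THRESHOLD_PS - intercept) / slope   # month the threshold is hit

samples = [(0, 80.0), (6, 74.0), (12, 69.0), (18, 63.0)]   # made-up telemetry
print(months_until_failure(samples))      # ~64, i.e. plan service well before then
```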

Effective screening and testing can help to detect SDCs, both during fabrication and during field use. “The ability to identify outliers in datasets and diagnose symptoms such as time delays or voltage degradation can help expose indications of impending SDC errors,” said Jyotika Athavale, architect at Synopsys. “AI/ML algorithms could flag when certain conditions are met that show early signs of SDC. Silicon lifecycle management is a solution that allows chip designers to monitor, analyze, and optimize semiconductor devices throughout their lives. This makes it easier for designers to track and gain actionable insights on their devices in real time, and ultimately to detect SDC before it’s too late.”
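One simple version of that outlier idea is a robust fleet-wide comparison: score each device’s telemetry against the fleet median and flag anything far out in the tail for deeper diagnosis. The median/MAD score and all readings below are illustrative assumptions, not a specific vendor’s method.

```python
# Illustrative only: flag fleet outliers with a robust median/MAD score.
# All device names, readings, and thresholds are made up.

from statistics import median

def flag_outliers(fleet_delays_ps, score_threshold=3.5):
    """fleet_delays_ps: {device_id: measured path delay in ps}."""
    values = list(fleet_delays_ps.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)    # median absolute deviation
    if mad == 0:
        return []
    return [dev for dev, v in fleet_delays_ps.items()
            if 0.6745 * abs(v - med) / mad > score_threshold]

fleet = {"node-01": 412, "node-02": 418, "node-03": 409, "node-04": 414,
         "node-05": 415, "node-06": 417, "node-07": 411, "node-08": 520}
print(flag_outliers(fleet))   # ['node-08'] is a candidate for deeper diagnosis
```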

Understanding what is going on inside a device is critical. “Typically, we always look from the outside into the chip to improve testing, like doing cell-aware testing and then device-aware testing,” said Advantest’s Leventhal. “It might make sense to look at it from the inside out, perhaps at something like application-aware or propagation-aware testing, because there’s only certain things that can happen within that processor device that could potentially propagate throughout the entire network. So this concept looks outward to understand how the device is going to be used, and what things can actually propagate.”

This requires more testing at more insertion points. “We’ve been relying on power-on test, which is typically running some kind of BIST,” said Steve Pateras, vice president of marketing and business development at Synopsys. “That gives you some kind of sanity check. But if you’re concerned about RAS issues and looking at really high coverage, or you’re worried about SDCs, then you can’t just have 80% or 90% coverage. You’ve really got to be doing full manufacturing test.”

That kind of coverage needs to be done during manufacturing, but it also needs to be done in the field.

“More and more, we’re seeing data centers taking full manufacturing test patterns and applying them to a system in the field,” said Pateras. “That’s gaining traction. For example, we have our USB PCIe-based high-speed test interface, which we call HSAT. That drives our internal DFT infrastructure, including scan chain. Customers take ATPG patterns and apply them through the HSAT interface. They’re using it for system-level test, but they’re also using it in the field and applying it as needed. So you have a full manufacturing test, getting full coverage for all the defects, and applying this on whatever periodicity is needed — once a day, once a week, once a month.”
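The flow Pateras describes can be sketched at a high level: quiesce a node, apply the stored manufacturing ATPG patterns through the test access interface, and pull the node from service if anything fails. The functions and file names below are hypothetical placeholders, not the actual HSAT API.

```python
# High-level flow only. The quiesce/apply/restore hooks are hypothetical
# placeholders, not the actual HSAT API; pattern file names are made up.

def run_field_scan_test(node, pattern_files, quiesce, apply_patterns, restore):
    """Apply stored manufacturing ATPG patterns to one node in the field."""
    quiesce(node)                                     # drain workloads first
    failing = [p for p in pattern_files if not apply_patterns(node, p)]
    restore(node)                                     # return node to service
    return failing                                    # empty list == clean pass

def fleet_test_pass(nodes, pattern_files, quiesce, apply_patterns, restore):
    """One sweep over the fleet; schedule daily, weekly, or monthly as needed."""
    suspects = {}
    for node in nodes:
        failing = run_field_scan_test(node, pattern_files,
                                      quiesce, apply_patterns, restore)
        if failing:
            suspects[node] = failing                  # pull node for diagnosis
    return suspects

# Dummy hooks so the sketch runs; real ones would drive the test access port.
quiesce = lambda node: print(f"draining {node}")
restore = lambda node: print(f"restoring {node}")
apply_patterns = lambda node, pattern_file: True      # pretend every set passes

print(fleet_test_pass(["node-01", "node-02"],
                      ["atpg_stuck_at.stil", "atpg_transition.stil"],
                      quiesce, apply_patterns, restore))
```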

DFT plays a critical role here. “The software people found SDEs because they determined that some operation didn’t make sense, and initially they developed functional tests to account for it,” said Siemens’ Rajski. “But functional tests are very expensive to develop. There’s no automation, and the coverage is not known. So we introduced in-system test, which provides deterministic capability for customers concerned about SDE or RAS. In-system test has all the advantages of structural approaches, and companies can use any advanced fault models. Plus, it’s fully automated. So companies are now interested in increasing the lifecycle of servers from two or three years to four or five years. In order to do this, they will have to know which servers are actually still sound and can be kept.”

In addition to DFT, another part of SDE isolation involves monitoring. “You need monitors to identify subtle defects and to understand the process corners — for instance, PVT corners. The sensor readings can correlate with test results, and they can work both in functional mode as well as in structural mode,” Rajski noted. “In functional mode, when a sensor indicates that the slack is getting very small, meaning the propagation delays are getting close to the clock cycle, that may result in an error. Then, an alert is signaled, and the system goes from functional operation to structural operation.”

Not all SDEs are caught, however. “Silent data errors typically manifest in areas of the design with less testing or checking performed on them, either at the manufacturing stages or in-field,” said Synopsys’ Cron. “The blocks with large datapath content are prime candidates — the floating-point or arithmetic units, for example. Testing for them is difficult because different defect types manifest in different environments. For example, via defects might manifest at cold temperatures. Resistive defects might manifest at hotter temperatures. Pushing the limits of performance of every single path in the design is too expensive. But focusing on areas of a design with low slack that will fail with small delay defects is a good start. Synopsys TestMax uses slack-based cell-aware ATPG, for example, to help weed out defective components before they ship.”

On the test side, it is helpful to be able to identify which tests at ATE and SLT are best for detecting SDEs. “There’s lots of stuff that’s being implemented at the end product level,” said Nitza Basoco, technology and marketing director at Teradyne. “One of the strategies today involves ensuring that what our customer is doing at the end-product level is doable either on the ATE or the system-level tester level, and vice versa.”

At first, many observers pointed to advanced process nodes as a source of SDEs. “The defects do not occur because we have more advanced nodes per se, or because the process or the design itself has a problem,” said proteanTecs’ Landman. “It’s the increasing workload and the stress inflicted on these complex systems. That’s why measuring workloads and remaining margins is critical to understanding the interaction of SW with the HW.”

Nonetheless, there is a certain marginality at the advanced nodes that affects SDCs. “Marginality may have two components,” said Andrzej Strojwas, CTO at PDF Solutions. “One is that under those heavy loads and continuous usage, these errors are suddenly popping up. The second issue — and this is really scary — is that the marginality could have existed at the time of testing, even final test, but only becomes a problem in the field. One example of this is shorts between the gate and source/drain contacts, because the margins there are basically almost negligible. This is really tough for designers to meet, and as a result, you actually may have leakages that become reliability failures in the field.”

One method for testing these marginalities is with electron beam inspection. “The eProbe can detect some leakages, some opens, not exactly the full opens, but soft opens at a rate of 10 to 15 billion structures tested per hour,” said Strojwas. “And you think, ‘Hey, that’s pretty good.’ But for processors with 100 billion vias, that’s still difficult to do. And you cannot test every wafer.”

Data errors are not new to the industry, and error correction mechanisms exist for just that reason. “You could be computing along and, for whatever reason, a bit flips in memory. It could be a row hammer event, it could be an error in data transmission,” said Steven Woo, fellow and distinguished inventor at Rambus. “Then, if the error exceeds the capability of the error correction mechanism to detect it, it becomes silent. You don’t see it. So the compute, when it reads the data, thinks it’s a valid value even though the bit has changed. That’s called silent data corruption.”
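The simplest way to see “exceeding the capability of the error correction mechanism” is with a single parity bit, which catches any odd number of flipped bits but is blind to an even number. Real server ECC (SECDED and stronger) is far more capable, but the principle is the same: every code has error patterns it cannot see. A toy sketch:

```python
# Toy illustration: a single parity bit catches any odd number of flipped bits
# but is blind to an even number, so a double flip reads back as "valid."

def parity(bits):
    return sum(bits) % 2

def store(bits):
    return bits + [parity(bits)]          # append a parity bit on write

def read_checks_out(word):
    *bits, p = word
    return parity(bits) == p              # True means "looks valid"

word = store([1, 0, 1, 1, 0, 0, 1, 0])

single_flip = word.copy()
single_flip[2] ^= 1                       # one bit flips

double_flip = word.copy()
double_flip[2] ^= 1
double_flip[5] ^= 1                       # two bits flip

print(read_checks_out(single_flip))       # False: error detected
print(read_checks_out(double_flip))       # True:  silent data corruption
```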

Redundancy and detection methods must be deployed strategically. “The concern is how much redundancy or detection capability you need to avoid silent data corruption. Nothing is ever perfect. So what techniques can I develop to infer that SDC has possibly happened?” said Woo. “If you’ve got an iterative solver, or something like that, it should eventually converge. If it’s not converging, it’s probably a sign that some data corruption has happened. And it only gets worse as more components interact with each other.”
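Woo’s iterative-solver example can be turned into a simple watchdog: if the residual stops shrinking the way it should, something, possibly corrupted data, is worth investigating. The tolerance and patience values in this sketch are arbitrary assumptions:

```python
# Sketch of a convergence watchdog; the tolerance and patience values are
# arbitrary assumptions, not a standard recipe.

def solve_with_watchdog(step, residual, max_iters=10_000, tol=1e-8, patience=50):
    best = float("inf")
    stalled = 0
    for i in range(max_iters):
        step()                            # one iteration of the solver
        r = residual()
        if r < tol:
            return "converged", i
        if r < best:
            best, stalled = r, 0
        else:
            stalled += 1                  # no progress this iteration
        if stalled > patience:
            return "suspected corruption or divergence", i
    return "hit iteration limit", max_iters

# Toy demo: the residual shrinks for a while, then stalls short of tolerance.
residuals = iter([10.0 ** -k for k in range(6)] + [1e-6] * 100)
state = {"r": 1.0}
step = lambda: state.update(r=next(residuals))
residual = lambda: state["r"]

print(solve_with_watchdog(step, residual))   # flags the stall, not convergence
```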

Stress testing is not a new approach, but it’s a useful tool in the toolbox when it comes to weeding out SDC errors. “Stress testing is a powerful play here. So when you elevate the temperature, elevate the voltage, and put the device in an environment where it doesn’t really want to be, you start to see where the cliff is,” said Teradyne’s Basoco. “And the more that you can do that early, the more feedback you can give to the designers, and to the system owners, and to the software guys, to know where the safe space is and where it isn’t. Plus, you check both ends of stress, both hot and cold, to identify the worst-case corners, especially at the lower process nodes. You want to make sure you’re exercising the device in the conditions or worst-case corners for that particular device.”

One of the most gnawing aspects of silent data errors is that they are conditional, only precipitating into failures given certain voltage, frequency, temperature, workload, and aging combinations. “Latent defects are not symptomatic until after the components have been operational for a certain duration,” said Synopsys’ Athavale. “Monitoring environmental changes in the silicon, as well as application stress and timing margin changes for memory and logic paths over time, allows for prediction of an SDC error before it manifests.”

AI to the rescue…sometimes
Machine learning is known for surfacing patterns that are not obvious and that may depend on multiple variables coming together in just the right way. That would seem to make ML perfectly suited to the silent data corruption problem.

But first one needs to have a robust and large data set.

“Assume I’ve collected a bunch of data from devices that had SDC errors, and so now I can say I have these 50 devices, 100 devices, 1,000 devices, and I have millions of other devices that don’t have errors,” said Basoco. “I run machine learning using all the data that I take in from my monitor, sensors, internal tests, mission mode type tests, etc. Now, can I then use it to say, ‘Oh, I should be careful about these devices that I already have in my fleets?’ But you have to test lots of devices in order to get there.”
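A stripped-down version of what Basoco describes might look like the sketch below: train a classifier on telemetry from devices known to have produced SDC errors versus healthy ones, then score the rest of the fleet to prioritize in-field testing. The features, data, and use of scikit-learn here are illustrative assumptions; a real effort would need far more labeled failures and much richer telemetry.

```python
# Illustrative sketch using scikit-learn: features, values, and labels are all
# made up, and a real effort would need far more labeled failing devices.

from sklearn.linear_model import LogisticRegression

# Each row: [min_timing_margin_ps, max_temp_C, avg_voltage_droop_mV, age_months]
X_train = [
    [85, 72, 12, 10], [80, 75, 15, 22], [78, 70, 11, 30],   # healthy devices
    [35, 88, 34, 26], [28, 91, 41, 18],                     # had SDC errors
]
y_train = [0, 0, 0, 1, 1]

clf = LogisticRegression().fit(X_train, y_train)

# Score in-service devices and prioritize the riskiest for in-field testing.
fleet = {"node-17": [40, 86, 30, 24], "node-42": [82, 71, 13, 12]}
for node, features in fleet.items():
    risk = clf.predict_proba([features])[0][1]
    print(node, f"SDC risk score = {risk:.2f}")
```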

Nonetheless, ever since Google and Meta engineers sounded the alarm regarding SDE/SDC, the industry has been rapidly developing additional testing strategies and DFT programs. “It’s beneficial to be a hyperscaler, because those companies are doing their own chips and have the chip data,” Basoco said. “You have the SLT data, you have the server data, and the module data at all the points along the way. You can do machine learning-based analysis, pulling in data that you have through partnerships and data sharing across the board. That’s what a lot of people are doing today — developing and trying to figure out whether they have the data-sharing partnerships needed to do more detailed investigations.”

Conclusion
Systems are likely to get more complex, so one key in dealing with SDEs may lie in developing more fault-tolerant systems with superior error correction mechanisms and built-in resiliency.

“In the future, we’re going to be talking about reliability as a first-class design parameter in architectures,” said Rambus’ Woo. “So you’re going to need ways to detect errors as best you can, and you’re going to need ways to recover from them and be tolerant of them. Architects are going to have to factor that in going forward. Some of these techniques are being developed right now, and some are well-established, like error detection and correction mechanisms. But we will have to go beyond that.”

There is plenty of support behind solving SDC issues. “The good news is people are being very aggressive about it,” said PDF Solutions’ Strojwas. “Google, Meta, and AWS made the whole community aware of it, and they made this a big issue for the entire supply chain. So there are new approaches to testing, diagnosis and usage going forward.”

And while it’s not clear that the industry will unravel all the root cause contributors to silent data errors, strategies to identify, check/verify, diagnose and test for SDEs are well underway.

— Ed Sperling contributed to this report.

Related Reading
Strategies For Detecting Sources Of Silent Data Corruption
Manufacturing screening needs improvement, but that won’t solve all problems. SDCs will require tools and methodologies that are much broader and deeper.
Why Chips Fail, And What To Do About It
Improving reliability in semiconductors is critical for automotive, data centers, and AI systems.


