Strategies For Detecting Sources Of Silent Data Corruption

Manufacturing screening needs improvement, but that won’t solve all problems. SDCs will require tools and methodologies that are much broader and deeper.

Engineering teams are wrestling with how to identify the root causes of silent data corruption (SDC) in a timely and cost-effective way, but the solutions are turning out to be broader and more complex than simply fixing a single defect.

This is particularly vexing for data center reliability, availability, and serviceability (RAS) engineering teams, because even the best tools and methodologies cannot prevent randomly occurring hardware-related errors. And simply throwing more testing and inspection at SDCs, which increases the overall cost of developing a chip, doesn't guarantee that problems won't crop up later in a chip's lifetime.

With SDCs [1], the silent part is deadly to data centers. In 2021, Meta and Google engineers [2, 3] separately shared their findings, which included their in-depth software stack-to-hardware interaction investigations. Their results pointed to manufacturing test escapes.

To reduce SDC test escapes from thousands to 10 DPPM, engineering teams investigated a range of test techniques. Simply applying more test conditions (e.g., frequency) to existing content has reduced test escapes. Feedback from hyperscalers like Meta on the nature of the failures has guided SoC device suppliers to add targeted content at system-level test (SLT). For instance, Intel DCDIAG [4] includes content that increases data randomization in computations, which can be used to address the known data dependencies of these failures.
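
The data-randomization idea can be illustrated with a minimal sketch (this is not DCDIAG itself, whose content is proprietary): repeatedly run the same computation on randomized operands on the core under test and compare against an independently computed reference, so a data-dependent arithmetic error surfaces as a mismatch. In real test content the two paths would exercise different native instruction sequences; here both use Python arithmetic purely to show the structure.

```python
import random

def randomized_multiply_check(iterations=100_000, seed=None):
    """Run randomized 64-bit multiplies and cross-check each result.

    A data-dependent defect (e.g., a marginal carry chain) may only
    corrupt results for specific operand patterns, so wide randomization
    raises the chance of hitting a sensitizing pattern.
    """
    rng = random.Random(seed)
    mismatches = []
    for i in range(iterations):
        a = rng.getrandbits(64)
        b = rng.getrandbits(64)
        product = a * b                                  # computation under test
        # Independent check: rebuild the product from 32-bit halves of a.
        ref = ((a >> 32) * b << 32) + (a & 0xFFFFFFFF) * b
        if product != ref:
            mismatches.append((i, a, b, product, ref))
    return mismatches

if __name__ == "__main__":
    bad = randomized_multiply_check(iterations=10_000, seed=42)
    print(f"{len(bad)} mismatches detected")
```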

Over the past three years technical discussions at test conferences have highlighted additional options to pursue:

  • More test patterns on SLT systems as a cost-effective option;
  • More functional test patterns;
  • More test patterns targeted at timing-related behaviors; and
  • Exercising nearby cores and/or IP blocks to create noisy environments (a sketch of this idea follows the list).
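
The last option can be sketched in a few lines: pin stress workloads to cores adjacent to the core under test so their switching noise and heat are present while a checked computation runs. The core IDs, the stress loop, and the trivial self-check below are illustrative assumptions, not a standard recipe.

```python
import multiprocessing as mp
import os
import random

def _stressor(stop_event):
    """Busy-loop on a neighboring core to generate switching/supply noise."""
    x = 1.0
    while not stop_event.is_set():
        x = x * 1.000001 + 0.5   # arbitrary floating-point churn

def run_check_with_noisy_neighbors(neighbor_cores=(1, 2, 3), iterations=1_000_000):
    stop = mp.Event()
    workers = []
    for core in neighbor_cores:
        p = mp.Process(target=_stressor, args=(stop,))
        p.start()
        # Best-effort CPU pinning (Linux only); illustrative, not required.
        try:
            os.sched_setaffinity(p.pid, {core})
        except (AttributeError, OSError):
            pass
        workers.append(p)
    try:
        rng = random.Random(0)
        errors = 0
        for _ in range(iterations):
            a, b = rng.getrandbits(32), rng.getrandbits(32)
            if (a + b) - b != a:          # trivial self-checking computation
                errors += 1
        return errors
    finally:
        stop.set()
        for p in workers:
            p.join()
```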

According to root cause analyses, SDCs often precipitate as open defects. And because SoCs can contain 10+ billion contacts and vias, inspection for poorly formed interconnects could add an effective additional screen.

But screening won’t be sufficient to meet the 10 DPPM goal. “A combination of screening and architectural techniques will be required to address this problem,” said Sankar Gurumurthy, director of silicon design engineering at AMD. “As screening techniques improve over time, it might be possible to screen our way out, but we are not there yet.”

Even with architectural changes, full containment of SDCs is unlikely. It's up to the RAS teams to identify and fix the remaining issues through the software infrastructure [5].

SDC occurrences can be explained by four possible scenarios:

  1. Test escapes;
  2. Marginal defects, as defined by Intel engineers [6];
  3. Latent defects, also known as early life failures; and
  4. Circuit degradation, also known as wear-out failures.


Fig. 1: Semiconductor failure rate over lifetime of a device. Source: Siemens EDA

“Timing error defects — these escapes are our biggest contributor, and it is a long random tail of failures,” said David Lerner, Intel senior principal engineer, during a 2023 ITC panel. “We don’t see any novel defects. We do observe a bias toward open defects. Latent defects certainly are a factor. Today, they are the minority of in-field returns. As we get better at cleaning up time zero with the test coverage, I’m more concerned with latent defects in the future. I’m not too concerned with aging degradation, because we don’t see much evidence of that.”

Hyperscaler data center operators agree that data dependency makes these failures difficult to detect, and that it is challenging to distinguish whether SDC errors are caused by a test escape or an aging-related issue.

“We noticed that in the initial stages of SDC detection that a large percentage of SDCs were due to test escapes and architectural choices, and a smaller portion was due to degradation,” said Sriram Sankar, director of engineering at Meta. “However, with iterative partnerships across our vendors, and by sharing valuable field data over time, we’ve been able to fix a large percentage of the faults due to the test escapes using vendor system-level test (SLT), as well as integrator testing. This has helped plug some, but not all, of the test escapes. As a result, along with the remaining test escapes, we observe failures are in significant part due to marginalities and failures over time. For the latter we can’t conclusively say degradation or data dependency.”

Manufacturing screening improvements
The general consensus is that targeted screening will reduce escapes, and that hyperscalers are willing to pay for the extra inspection and/or testing. That improved screening can be implemented throughout the manufacturing flow, starting with 100% wafer inspection during production.

The key is identifying exactly where that extra screening should happen.

For instance, real-time screening with e-beam voltage contrast can identify resistive contacts/vias and perhaps even latent defects, yet it can significantly impact throughput. Targeting specific areas can speed up the process by at least an order of magnitude.

“There are a lot of unnecessary features inside of a chip, especially when you look at the micro-features,” said John Kibarian, CEO of PDF Solutions, at the recent PDF user conference. “If you think about searching for an open contact or via, contact/via densities are less than 10% of the wafer. So, 90% of the wafer surface is not worth rastering.”


Fig. 2: Comparison between e-beam measurement for the full die vs. direct scan. Source: PDF Solutions

“At the lower metal and transistor layers, if I’m trying to inspect the contacts that land on a transistor, there are a minimum of three per transistor,” noted Indranil De, general and engineering manager of e-beam tools at PDF Solutions. “With 30+ billion transistors on a die for a big NVIDIA chip, that’s well north of 100 billion contacts per die. But if I’m inspecting the lower metal layers (metal-1 to metal-3), the number of vias is much fewer. Today, we have customers inspecting these metal layers. These layers are more amenable simply because they don’t have that many vias. As you get down to the transistor level, your throughput begins to go down because now you’re looking at a gargantuan number of contacts. The question is whether our e-beam tool, or any tool out there, is capable of inspecting that many contacts for opens in a reasonable amount of time.”

When inspecting vias/contacts for opens, inspection time needs to be weighed against which layers are most critical, trading off achievable coverage against a throughput that is reasonable for production. That's why layout analysis for those critical failures is so essential.
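
The tradeoff De describes comes down to simple arithmetic. The numbers below (contact counts, critical fraction, scan rate) are illustrative assumptions rather than tool specifications, but they show why restricting e-beam inspection to the sparser metal-1 to metal-3 vias, or to layout-identified critical regions, is what makes in-line inspection plausible.

```python
def inspection_hours(features_to_inspect, features_per_second):
    """Rough e-beam inspection time for a single die, in hours."""
    return features_to_inspect / features_per_second / 3600

# Illustrative numbers only (not tool specifications).
CONTACTS_PER_DIE = 100e9     # transistor-level contacts, per the estimate above
M1_M3_VIAS_PER_DIE = 2e9     # assumed sparser via count at lower metal layers
CRITICAL_FRACTION = 0.10     # assumed fraction flagged by layout analysis
SCAN_RATE = 5e6              # assumed features per second for voltage-contrast scanning

print(f"All contacts:     {inspection_hours(CONTACTS_PER_DIE, SCAN_RATE):8.1f} h/die")
print(f"M1-M3 vias only:  {inspection_hours(M1_M3_VIAS_PER_DIE, SCAN_RATE):8.1f} h/die")
print(f"Critical regions: {inspection_hours(CRITICAL_FRACTION * M1_M3_VIAS_PER_DIE, SCAN_RATE):8.1f} h/die")
```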

Test conditions
All the root cause behavioral evidence points to defects impacting timing relationships. Existing structural and functional test content can reduce these test escapes by stressing the localized electrical environment, as evidenced by published and anecdotal reports. In decreasing order of value/effectiveness are voltage margining, multiple frequency points, and temperature.

At SLT, Intel test engineers lowered power rail voltage by as little as 10mV and detected SDC-attributable escapes. They shared the results in an ITC 2022 poster [7].
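
A negative-voltage-margining loop of the kind described in that poster can be sketched as follows. The `set_rail_voltage` and `run_functional_content` hooks are placeholders for whatever board-management and test-execution interfaces a given SLT setup provides; they are assumptions, not a real API.

```python
def voltage_margin_sweep(set_rail_voltage, run_functional_content,
                         nominal_v=0.75, step_v=0.010, max_steps=5):
    """Lower a supply rail in small steps and re-run functional content.

    Returns (voltage, pass/fail) pairs so marginal units that only fail
    with a few tens of millivolts of negative margin can be flagged.
    """
    results = []
    for step in range(max_steps + 1):
        v = nominal_v - step * step_v          # e.g., 10 mV per step
        set_rail_voltage(v)                    # placeholder hook
        passed = run_functional_content()      # placeholder hook
        results.append((round(v, 3), passed))
        if not passed:
            break                              # first failing margin found
    return results
```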

“We looked very closely to why we’re missing these defects,” said Sandeep Bhatia, DFX lead for TPUs at Google, during an ITC 2023 panel. “On selection of chips that we have examined closely, our current fault models tend to focus on the logical sensitization. We need to introduce electrical sensitization. In the lab we ran functional patterns and measured the power voltage sensors on one of these chips. With these specific patterns running, the power grid is noisy — peaks touching on overshoots and undershoots. Now, imagine a marginal defect that only triggers an error if the power grid happens to reach a specific voltage level.”


Fig. 3: Power noise can be identified using functional test. Source: Google/Sandeep Bhatia

Bhatia added that high fault coverage for fault models can improve screening. “But you also need to look at the stress conditions,” he said. “What are your test conditions? How much voltage margins are you putting? Are you doing hot and cold temperature? Some of these techniques are not necessarily new, but to what extent are we applying those? Can we introduce switching noise during tests, power noise, and signal integrity noise? We need to look at the additional factors that are playing a role in screening but are often not captured in fault coverage numbers.”
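
Bhatia's point about power-grid noise can be made concrete with a small post-processing sketch: given voltage samples from an on-die power sensor captured while a functional pattern runs, flag undershoot excursions and how close they come to an assumed marginal-defect trigger level. The trace values and thresholds below are illustrative.

```python
def undershoot_events(samples_v, nominal_v=0.75, trigger_v=0.70):
    """Return (index, voltage, margin-to-trigger) for samples below nominal."""
    events = []
    for i, v in enumerate(samples_v):
        if v < nominal_v:
            events.append((i, v, v - trigger_v))
    return events

# Illustrative trace: mostly nominal, with a brief droop during a noisy pattern.
trace = [0.75, 0.74, 0.72, 0.705, 0.71, 0.74, 0.75]
for idx, v, margin in undershoot_events(trace):
    flag = "NEAR TRIGGER" if margin < 0.01 else ""
    print(f"sample {idx}: {v:.3f} V, margin to trigger {margin*1000:+.0f} mV {flag}")
```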

To find those additional factors, engineers turn to more sophisticated data analysis, leveraging data across the silicon lifecycle.

“The key to identifying and catching more complex defect profiles is connecting data from across the product lifecycle so screening models have more information to identify predictive patterns,” said Ron Chaffee, senior director of applications engineering at NI. “With silent data corruption, the source and the impact of the defect is convoluted — hidden within a variety of variables or in the subtle interactions of multiple components and environmental conditions. To improve, these systems must have the right data sets, in the right format, to connect silent data corruption events to the historical data from that unit.”

The right data can make the connection between test conditions in the field versus manufacturing. “As these variables become evident, those now lead into what aspects of the product should be tested more rigorously,” said Chaffee. “Test engineers can design new tests to provide more granular data into those predictive areas, making the system continually adapt and improve.”
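
Chaffee's point about connecting lifecycle data can be illustrated with a minimal join between field SDC events and per-unit manufacturing records, keyed on a unit identifier. The column names and the use of pandas are assumptions for the sketch; real deployments involve far richer schemas and far more units.

```python
import pandas as pd

# Hypothetical per-unit manufacturing records (parametric and test-condition data).
mfg = pd.DataFrame({
    "unit_id":    ["U1", "U2", "U3", "U4"],
    "vmin_mV":    [680, 702, 675, 710],
    "slt_temp_C": [85, 85, 25, 85],
})

# Hypothetical field events reported by fleet-level SDC detection.
field = pd.DataFrame({
    "unit_id":    ["U3", "U1"],
    "sdc_events": [4, 1],
})

# Join field outcomes back to manufacturing data so screening models can look
# for predictive patterns (e.g., low-Vmin units dominating the SDC reports).
joined = mfg.merge(field, on="unit_id", how="left").fillna({"sdc_events": 0})
print(joined.sort_values("sdc_events", ascending=False))
```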

The reports from hyperscalers and their SoC suppliers show some defects only occur in low frequency corners. At first glance that may not make sense. Some of this is attributed to dynamic voltage and frequency scaling (DVFS), which is used to ratchet down power consumption during different states of operation.

“In the world of voltage frequency scaling, we now know that when we look at a set of 100 functional tests, it may be number 1 and 5 that are critical for high frequency, and 5 and 7 are critical for low frequency,” said Klaus-Dieter Hilliges, platform extension manager, at Advantest Europe. “So different functional tests can be critical for low frequency, which means low voltage. Then, accordingly, that defect that you have, in that particular exercise path at low voltage, is showing up at that particular functional test.”
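
Hilliges' observation, that different functional tests become critical at different voltage/frequency corners, maps naturally onto a corner-by-test execution matrix rather than a single nominal run. The operating points, test names, and launch hook below are illustrative assumptions.

```python
from itertools import product

# Illustrative DVFS operating points (frequency in MHz, core voltage in volts).
corners = [(3500, 0.95), (2400, 0.80), (1200, 0.65)]

# Illustrative functional tests; in practice each corner would run the subset
# of tests known (from characterization or fleet feedback) to be critical there.
functional_tests = ["ft_001_matmul", "ft_005_crypto", "ft_007_memstream"]

def run_test(test_name, freq_mhz, volt):
    """Placeholder for launching a functional test at a given DVFS corner."""
    print(f"running {test_name} at {freq_mhz} MHz / {volt:.2f} V")
    return True  # assume pass for the sketch

results = {(t, c): run_test(t, *c) for t, c in product(functional_tests, corners)}
```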

Hence, engineers are applying more functional test content in addition to existing structural test content.

Using more effective test content
There is widespread agreement that by targeting the tests at observed behaviors, engineers will reduce test escapes. How to get there is less obvious.

“There are numerous ways to approach this,” observed Harish Dattatraya Dixit, principal engineer at Meta. “First thing would be to evaluate the return on test investment, and continuously tune content based on fleet feedback. Manufacturing testing time is a direct function of cost and affects yield. At the end of the day, we have to bear the cost for these SDCs at some point in the silicon lifecycle. Increasing test time for capturing larger data dependencies and randomizations can be beneficial. Earlier, there was little to no data on the failures with vendors. But now, with collaborative partnerships with Meta, other hyperscalers, and vendors, we have seen improvements in the content selection, testing sequence approaches, and test execution strategy.”

The behavioral evidence indicates defects are causing small timing delays, which are exacerbated under certain conditions. As first described by Intel engineers [6], test/DFT engineers refer to these as marginal defects.

For structural test, engineers can generate new content using advanced fault models that reveal these small timing delays. The digital/logic fault models listed below are in rough order of simulation effort and pattern count (a sketch of the slack-based path-selection idea follows the list):

  • Stuck at faults
  • Transition faults
  • Path delay faults
  • Slack-based transition faults
  • Cell-aware faults
  • Slack-based cell-aware transition faults
  • Pseudo-exhaustive physically-aware faults [8]
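
A simplified way to see what "slack-based" adds over plain transition faults: instead of launching a transition through any path that reaches the fault site, pattern generation prefers the reaching path with the least timing slack, so a small defect-induced extra delay is more likely to violate timing at speed. The sketch below (hypothetical path records, hypothetical clock period) only illustrates that selection step, not a real ATPG engine.

```python
def select_paths_for_slack_based_transition(paths, clock_period_ns, max_slack_ns=0.2):
    """Pick, per fault site, the reaching path with minimum timing slack.

    `paths` is a list of dicts: {"fault_site": str, "delay_ns": float}.
    Paths whose slack exceeds `max_slack_ns` are dropped, because a small
    defect-induced delay on them would not violate timing at speed.
    """
    best = {}
    for p in paths:
        slack = clock_period_ns - p["delay_ns"]
        if slack > max_slack_ns:
            continue
        site = p["fault_site"]
        if site not in best or slack < best[site]["slack_ns"]:
            best[site] = {"path": p, "slack_ns": slack}
    return best

# Illustrative data: two paths reach fault site "U17/A"; the longer one is chosen.
paths = [
    {"fault_site": "U17/A", "delay_ns": 0.95},   # slack 0.05 ns -> selected
    {"fault_site": "U17/A", "delay_ns": 0.70},   # slack 0.30 ns -> too relaxed
    {"fault_site": "U42/Z", "delay_ns": 0.40},   # slack 0.60 ns -> skipped
]
print(select_paths_for_slack_based_transition(paths, clock_period_ns=1.0))
```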

At first, it seems possible to generate test content for all of these advanced fault models. But it is more complicated than it sounds.

“Several decades ago, a set of papers co-authored by Li C Wang [9,10] showed that if your fault models don’t perfectly match your defects as you get toward the end, where you’re getting the last bit of your coverage, you are actually biased at that point in favor of your model as opposed to your defect,” said Jennifer Dworak, professor of electrical and computer engineering at Southern Methodist University, during the ITC panel. “Advanced fault models can certainly help, but only if they are very well matched to the defects that are actually occurring. Otherwise, you end up wasting time and resources going after things that are not going to happen. You end up with a fault number explosion and the corresponding pattern explosion.”

Others agree with regard to test cost and detecting faults that matter. “No one fault model is going to be the solution,” said AMD’s Gurumurthy at the ITC panel. “Adding a fault model, adding tests and data for the fault modeling, normally gets you better patterns than just random patterns. But you have to keep in mind the cost of things. So you always have to make sure that whatever pattern you came up with, and with nice specific experiments going on, that it’s more effective than random patterns. It needs to [target] real bad parts.”

Transition faults are inadequate for detecting small timing delays. Slack-based cell-aware faults, in contrast, consider the circuitry within a logic gate, resulting in more targeted tests.

“For SDCs, time on the tester seems negotiable,” said Adam Cron, distinguished architect at Synopsys. “Slack-based cell-aware is one of your more expensive fault models in terms of pattern count, pattern creation, etc. But once you have these patterns, you can figure out which ones might be more useful. Slack-based cell aware requires a tremendous amount of computing resources. If it’s a pattern count issue, you can just do slack-based transition delay. It’s better than just transition delay, and has been proven in many published works. At SNUG, I’ve seen dozens of papers that show devices fail at a lower frequency with slack-based patterns versus non-slack-based patterns.”

The reality is that digital circuits at advanced CMOS nodes have become much more sensitive to process variation. Janusz Rajski, vice president of engineering for Tessent at Siemens EDA, said during the ITC panel that test content could be generated across process corners. “In design, we have the paradigm of multi-mode and multi-corner. Yet in testing, we refrain from generating patterns for multiple corners — voltage, temperature, and process. This approach was mentioned many years ago, but it was considered prohibitively expensive. Now, we have a different understanding of the tradeoff between cost and quality. Tests will be applied under realistic conditions, but the patterns also should be generated for those realistic conditions. Currently, customers will generate patterns for the nominal process corner, and they will apply them at different voltage levels. If you change the process corner, the timing characteristics are different, and new paths become critical timing paths.”

Google and Meta RAS teams found SDCs by tracing to the failing sections of code, which is mission-mode exercising of circuitry. Often referred to as functional test, these can be applied at the wafer and unit levels on traditional ATEs or on an SLT.

“My perspective on functional tests is that scan is the foundation of structural tests. Because of the di/dt noise contribution, functional test is needed due to defect sensitivity,” said Intel’s Lerner. “The goal is to eventually derive scan tests with advanced fault models such that we don’t need functional tests. But we won’t know that until we’re able to measure with functional tests that we’re not finding anything.”

Evidence shows that creating a noisy environment increases likelihood of SDC detection, which favors running functional content on SLTs.

“In a lot of the SDC discussions people say, ‘We can’t just add more scan/structural tests. We need to supplement with functional testing.’ And even defining functional — what does that mean? Are these targeted functional tests or full software-driven mission-mode tests? That matters,” said Ed Seng, strategic marketing manager for advanced digital at Teradyne. “It’s not just running this math equation. It comes back to the activity level, and I’m sure this will get worse with advanced packaging and chiplets. The main thing we’re focused on is enabling more mission-mode type of testing for our customers, and removing barriers for them to adopt SLT.”

For complex SoCs, applying functional test content on an SLT platform is the most economical option. Test engineers can apply functional tests in several modes, running software applications and mission-mode assembly-based code. Test engineers refer to the latter as bare metal content, i.e., running assembly code without an operating system.

“Everybody has their best practices,” said Advantest’s Hilliges. “It’s just that we don’t think we have one answer. We need more structured content. And I’m very confident all will work on that. Meanwhile, we need to work with functional content and make it way more systematic than it is today in terms of writing targeted bare metal content, not just system-level tests via an operating system.”

Improving on-line error detection and monitoring
Given the dependency on specific code sequences to trigger an SDC-attributable defect, it is not cost-effective to guarantee 100 DPPM with manufacturing test. This accounts for the increased interest in online error detection and telemetry circuit data. Meta noted that it already collects some telemetry data.

HPC SoCs with billions of transistors are not simply gates and architectural blocks. They are complex networks of transistors, resistors, and capacitors, subject to electromagnetic and thermal physics. Telemetry circuits can measure a range of parameters, including power rail voltage droop, localized temperature, and transactional activity at the micro-architectural level.

Engineers typically use telemetry circuits to monitor and characterize environmental parameters for correct operation, such as the power grid, and to detect micro-architectural faults with performance counters.

“However, given the ‘silent’ nature of SDCs, these two focus areas no longer suffice,” said Noam Brousard, vice president of solutions engineering at proteanTecs. “Characterization requires a more precise understanding of the source of the problem. Environmental sensors are too broad. Counters may only find the problem after it becomes a problem. What we need is the monitoring of the actual in-chip circuits’ performance — specifically, the ability of these circuits to propagate signals from source to target to meet timing constraints. These are great precursors to failure. Temperature, aging, latent defects, voltage droops, and even increased software stress will eventually lead to an increased delay in the signal propagation, edging the circuits closer to a timing failure.”


Fig. 4: Continuous health and performance monitoring in mission-mode. Source: proteanTecs

A path delay monitor placed in a critical data path can flag anomalous behavior related to SDCs. However, this only works when monitors are placed in those paths, and it's not possible to place them everywhere.
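
A sketch of how such in-field timing-margin telemetry could be consumed: each monitored path periodically reports its remaining timing margin, and a simple trend check raises an alert before the margin erodes to a failure. The monitor names, sampling, and thresholds are assumptions; real monitor and agent interfaces differ by vendor.

```python
from collections import deque

class MarginTrendWatcher:
    """Track per-monitor timing margin and flag degrading paths early."""

    def __init__(self, window=32, alert_margin_ps=15.0, alert_slope_ps=-1.0):
        self.window = window
        self.alert_margin_ps = alert_margin_ps     # absolute floor
        self.alert_slope_ps = alert_slope_ps       # margin loss per sample
        self.history = {}                          # monitor_id -> deque of margins

    def update(self, monitor_id, margin_ps):
        hist = self.history.setdefault(monitor_id, deque(maxlen=self.window))
        hist.append(margin_ps)
        if len(hist) < 2:
            return None
        slope = (hist[-1] - hist[0]) / (len(hist) - 1)
        if margin_ps < self.alert_margin_ps or slope < self.alert_slope_ps:
            return {"monitor": monitor_id, "margin_ps": margin_ps, "slope_ps": slope}
        return None

# Illustrative use: one path slowly loses margin (aging or a latent defect).
watcher = MarginTrendWatcher()
for sample in range(40):
    alert = watcher.update("core3_path_112", 40.0 - 0.7 * sample)
    if alert:
        print("early warning:", alert)
        break
```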

“Telemetry can help root out some potential root causes,” said Google’s Bhatia. “But at least for the determinations of the classes of chips that we looked at, the issue was random defects. And in these cases, I’m not sure how telemetry can help us. If the defect is random, it can be anywhere on the chip, and it’s going to be in a different location for every chip.”

Because telemetry can't guarantee early detection of SDCs, RAS and design engineers recognize the need to change 'silent' data corruption into 'audible' data corruption, relying not on error correction but on error detection, so that execution can cease within microseconds. This requires hardware and software working together.

“These errors are silent, and thus very hard to detect,” Bhatia said. “They’re doing damage all along until we happen to find it, and that’s when we say all the previous computations that we were doing were actually erroneous. How do we recover from that? We need to investigate design techniques that are at least self-checking. If you trigger that defect by actual workload, then you have a feature on the chip that will trigger and flag it. You convert silent data corruption to loud data corruption. You still have a defect. You still have the problem of pulling the chip out, but you’ve drastically reduced its impact by exposing it. There are a whole set of techniques that have been published, but it’s time to look at them and see how we can cost-effectively introduce them into our current designs: parity, ECC, chip monitors, compute replay, algorithmically based fault tolerance, and residue checkers. It may be time to introduce some of those concepts into design.”
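
One of the self-checking techniques Bhatia lists, a residue checker, can be sketched in a few lines: carry a small modular residue alongside an arithmetic operation and trap the instant it disagrees, turning a silent corruption into a loud, immediately visible error. This is a software model of what would be a hardware checker, written for illustration; the fault injector is purely a way to demonstrate the trap.

```python
class SilentCorruptionDetected(RuntimeError):
    """Raised the moment a checked operation disagrees with its residue."""

MOD = 3  # small modulus, as in classic mod-3 residue checking

def checked_add(a, b, fault_injector=None):
    """Add two integers while verifying the result with a mod-3 residue.

    `fault_injector` lets the sketch emulate a corrupted datapath for testing.
    """
    result = a + b
    if fault_injector:
        result = fault_injector(result)
    # Residue check: (a + b) mod 3 must equal (a mod 3 + b mod 3) mod 3.
    if result % MOD != (a % MOD + b % MOD) % MOD:
        raise SilentCorruptionDetected(f"{a} + {b} produced {result}")
    return result

print(checked_add(11, 31))                                   # 42, check passes
try:
    checked_add(11, 31, fault_injector=lambda r: r ^ 0x4)    # flip one bit
except SilentCorruptionDetected as e:
    print("loud error:", e)
```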

Other SoC providers agree. “We need a multi-layered approach, a sort of ‘all of the above’ answer,” said Intel’s Lerner. “The bottom of the pyramid is sub-basement, because this is really a test conversation. But it’s necessary to talk about architecture here because we need to maximize architectural detection: parity, fabric protection, residue, etc. These are key tradeoffs of performance and cost. And we all need to work with our design partners, making it clear that if you don’t make the right decisions up front in your design, you’re going to be exposed.”


Fig. 5: A multi-layered approach required to achieve low rates of SDCs. Source: Intel/David Lerner

Conclusion
Containing SDCs will require changes throughout the design-through-manufacturing flow, as well as in in-field detection. It's a multi-faceted problem revolving around subtle manufacturing defects, and no single solution will solve all of it.

“Should we aim to screen out? Absolutely, we should,” said Meta’s Sankar. “Will we get zero SDCs? Most likely not. But we may get asymptotically to zero with efforts across the stack. And this is the goal that we expect our manufacturing partners to pursue. However, we don’t think screening alone is a solution. Design, architectural and software solutions are required to mitigate this at scale. While we can optimize for one node and one chip and then invest in building the best test content, ensuring computational integrity is too important for us to solely rely on this one method. Instead, we should explore how we can make our applications fault tolerant and how an infrastructure can actually tolerate SDCs. Meta is committed on this part of the journey in the stack. In addition, there is a longer-term horizon effort around improving and incentivizing architectural solutions and design patterns which are inherently SDC resilient.”

Related Stories

Why Silent Data Errors Are So Hard To Find

Screening For Silent Data Errors

Hunting For Hardware-Related Errors In Data Centers

Pinpointing Timing Delays in Complex SoCs

References

  1. https://www.opencompute.org/documents/external-ver-0-3open-compute-specification-server-component-resilience-workstream-pdf
  2. “Silent Data Corruptions at Scale,” Harish Dattatraya Dixit et al., Facebook, arxiv.org/abs/2102.11245, Feb. 2021
  3. “Cores that Don’t Count,” Peter H. Hochschild et al., Google Research, Jun. 2021
  4. Intel’s diagnostic tool for data centers, DCDIAG.
  5. “Detecting silent data corruptions in the wild,” H.D. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, S. Sankar, March 2022, arXiv, https://arxiv.org/abs/2203.08989
  6. P. G. Ryan, I. Aziz, W. B. Howell, T. K. Janczak and D. J. Lu, “Process defect trends and strategic test gaps,” 2014 International Test Conference, https://ieeexplore.ieee.org/document/7035276
  7. “Improving System Level Screening Efficiency Through Negative Voltage Margining,” Luis D. Rojas et al., Intel poster, ITC 2022.
  8. W. Li, C. Nigh, D. Duvalsaint, S. Mitra and R. D. Blanton, “PEPR: Pseudo-Exhaustive Physically-Aware Region Testing,” 2022 IEEE International Test Conference, https://ieeexplore.ieee.org/document/9983894
  9. L. C. Wang, T. W. Williams and M. R. Mercer, “On efficiently and reliably achieving low defective part levels,” 1995 International Test Conference, https://ieeexplore.ieee.org/document/529890
  10. L. Wang, M. R. Mercer and T. W. Williams, “Using target faults to detect non-target defects,” 1996 International Test Conference, https://ieeexplore.ieee.org/document/557120

