Subtle IC defects in data center CPUs result in computation errors.
Cloud service providers have traced the source of silent data errors to defects in CPUs — as many as 1,000 parts per million — which produce faulty results only occasionally and under certain micro-architectural conditions. That makes them extremely hard to find.
Silent data errors (SDEs) stem from random defects introduced during manufacturing, not from a design bug or software error. Those defects generate software failures, which Meta and Google separately reported in papers last year. What makes these errors so concerning is their randomness, the low defect rates involved, and the difficulty of detecting the failing CPUs. That has prompted a call to action in the semiconductor test community.
Simply put, manufacturing testing needs to improve. But understanding the nature of the errors, why they are hard to detect, and why we are seeing them now underscores the rising compute intensity inside data centers, the complex interactions between various components and software, and the importance of getting this right.
“The growth of both the scale of data centers — millions of CPUs installed and managed by a single cloud service provider — and the increase in silicon integration, with higher core count, larger silicon area in a single SoC package, has increased the visibility of all failure modes, including SDE,” said David Lerner, senior principal engineer for product development quality and reliability at Intel. “In addition, the development by Intel of DCDIAG and related tests, along with increased focus, monitoring, and test writing by data centers, has made the SDE phenomenon more apparent.”
But it hasn’t simplified the process of finding them. Different engineering teams have access to different levels of data, which has led to a number of theories about the causes of these errors, as well as about the electrical stimulus and environmental conditions that would result in a better test. These manufacturing defect escapes are due either to missing test coverage at time zero or to the latent nature of the defect. Some industry experts speculate the errors may also be due to early degradation characteristics as a consequence of actual workloads.
Understanding the physical defects provides engineers guidance on screening methods. But first it’s essential to know the stimulus and environmental conditions that point to the errors.
What data centers see
Cloud service providers use millions of server chips that run 24 x 7 with various applications, workloads, and maintenance. In the past few years, software engineers began to notice erroneous behavior traced to a small subset of server machines, known as silent data corruption (SDC), corrupt execution errors (CEE), or silent data errors (SDE).
Google engineers described the mysterious behavior’s impact in their 2021 paper, “Cores That Don’t Count” as follows:
“Imagine you are running a massive-scale data-analysis pipeline in production, and one day it starts to give you wrong answers — somewhere in the pipeline, a class of computations are yielding corrupt results. Investigation fingers a surprising cause: an innocuous change to a low-level library. The change itself was correct, but it caused servers to make heavier use of otherwise rarely used instructions. Moreover, only a small subset of the server machines are repeatedly responsible for the errors.”
CPUs are essential to managing operations and computations inside data centers, and incorrect computing can have huge consequences. In their 2021 paper, “Silent Data Corruptions at Scale,” Meta engineers explained as follows:
“It manages the devices, schedules transactions to each of them efficiently and performs billions of computations every second. These computations power applications for image processing, video processing, database queries, machine learning inferences, ranking and recommendation systems. However, it is our observation that computations are not always accurate. In some cases, the CPU can perform computations incorrectly. For example, when you perform 2 x 3, the CPU may give a result of 5 instead of 6 silently under certain microarchitectural conditions, without an indication of the miscomputation in system event or error logs.”
In their respective papers, both Meta and Google engineers listed examples of the application-level consequences they have seen from SDEs.
With silent data errors, the faulty values pass on to the next computation, and the consequences often surface only indirectly, which makes them harder to debug. To determine the source, engineering teams executed a diligent debug process and learned that the errors were highly dependent upon the numerical values fed into the computation. These values differed for each defective core. Such behavior indicates a subtle failure mode that requires very specific excitation.
The impact of the resulting faults eventually caused both companies to investigate the cause. Both papers noted that SDEs can be traced to a specific core on specific server chips. Still, locating root causes for hardware defects in a huge software stack is a non-trivial effort, which Meta engineers described in detail. The Meta infrastructure team initiated its investigation in 2018 to understand multiple detection strategies and the data center performance cost associated with each. After three years of work they had learned a lot, and they illustrated how the errors depend on the data. Root-causing one failure to Core 59 of a specific CPU, they noted that the core calculated INT(1.1^53) = 0, while INT(1.1^52) correctly returned 142.
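To make that data dependence concrete, the following is a minimal sketch in Python, not Meta's actual tooling, of the kind of consistency check that exposes this class of error: run the suspect computation on the core under test and compare it against a reference result obtained another way. Only the 1.1^53 example comes from the published description; the function names and structure are illustrative.

```python
import math

# Minimal illustrative sketch (not Meta's production screening code).
# Idea: compute the suspect expression on the core under test and compare it
# against a reference value obtained independently, e.g. with exact integer
# arithmetic or on a known-good core. Meta reported that a defective Core 59
# returned int(1.1**53) = 0 instead of the expected value.

def reference_int_pow(num, den, exp):
    """Exact reference for int((num/den)**exp) using integer arithmetic."""
    return (num ** exp) // (den ** exp)

def check_core():
    suspect = int(math.pow(1.1, 53))            # result from the core under test
    expected = reference_int_pow(11, 10, 53)    # 1.1 expressed exactly as 11/10
    # For this exponent the floating-point and exact results floor to the same
    # integer (156), so a healthy core passes; a defective core would not.
    if suspect != expected:
        print(f"miscomputation: int(1.1**53) = {suspect}, expected {expected}")
        return False
    return True

if __name__ == "__main__":
    check_core()
```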
Computer engineers have long held the assumption that escaped defects will become fail-stop or fail-noisy errors, and hence trigger machine checks or generate wrong answers for many instructions. With these silent data errors, that is no longer the case. They clearly are a cause for concern for both data center owners and IC suppliers.
Nature of the failures
To be successful with any detection strategy, whether it's inspection-based, ATE test, or system-level test, engineers need to know much more about the nature of these failures. All investigations rule out overall circuit design marginality; these failures are due to processing defects. However, because the ultimate tests involve customers exercising their systems in mission mode, thermal, voltage, and frequency conditions can exacerbate a localized defect in a part that would pass under other conditions. Targeted experiments performed at data centers have shown that a small percentage of these SDEs only show up after a specific length of operating time.
To develop detection solutions, a full understanding of the failure mechanism is needed. The CSPs can narrow a failure down to code snippets, but they don't necessarily know all the other parameters. That is due to the difficulty of tying a software code snippet to specific circuit faults, to the environmental conditions at the time of failure, and to a process window that can be localized precisely to the circuit defect.
“It’s very possible that we’re facing defects coupled with circuit design margin,” said Adam Cron, distinguished architect at Synopsys. “Just a little push in any one variable puts them over the edge. Those pushes are a little bit of heat, a little bit of voltage droop, a little bit more resistance in a contact. These small perturbations are not modeled in the models given to designers. And they’re not tested in that manner until they’re put into a customer’s system.”
The key is in the details of the various conditions and use cases. “If you understand how these conditions vary under different workloads and connect the various views in space and time, you will be able to root cause issues more quickly,” said Nitza Basoco, vice president of business development at proteanTecs. “The root causes of silent data corruption effects are an important topic for our industry. If we could diagnose the physical and electrical causes, we can come up with better detection methods. Right now, there are many postulations on the causes of these time-zero escapes and reliability failures. Understanding the mix and contribution level for each is our industry’s next area of focus.”
For a 5nm CMOS ASIC product at a data center — not Google or Meta — one industry source said product engineers shmooed the customer returns over temperature, voltage, and clock frequency using the ATE test content. (Shmooing is sweeping a test condition parameter through a range.) The vast majority of the units failed at specific points in the shmoo. The source also noted that a small percentage of the units failed the production-level ATE test, indicating reliability-related failures.
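For readers unfamiliar with the practice, the sketch below shows the general shape of a shmoo in code. It is a hypothetical Python loop, not any vendor's ATE software; set_voltage, set_frequency, and run_test_content are placeholders for whatever hooks a real tester exposes.

```python
# Hypothetical shmoo sketch: sweep voltage and clock frequency, apply the same
# test content at each point, and record pass/fail. Temperature would typically
# be a third swept axis. Marginal defects often fail only in specific corners
# of the resulting map.

def shmoo(voltages_v, frequencies_mhz, set_voltage, set_frequency, run_test_content):
    results = {}
    for v in voltages_v:
        for f in frequencies_mhz:
            set_voltage(v)
            set_frequency(f)
            results[(v, f)] = run_test_content()   # True = pass, False = fail
    return results

def print_shmoo(results, voltages_v, frequencies_mhz):
    # Render a simple pass/fail map, one row per voltage, one column per frequency.
    for v in voltages_v:
        row = "".join("P" if results[(v, f)] else "." for f in frequencies_mhz)
        print(f"{v:0.2f} V  {row}")
```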
With the transition from planar to finFET CMOS transistors, it’s reasonable to expect the increased processing complexity to uncover defects that exhibit more subtle behaviors.
“There’s definitely some contribution from finFETs, because if you have multiple fins and one of them is defective, it causes parametric behavior,” said Janusz Rajski, vice president of engineering for Tessent at Siemens EDA. “So depending on the test conditions and the path tested, you may or may not detect it. With planar transistor defects there would be a higher probability that it will result in a more substantial faulty behavior that you would be able to detect.”
These problems are not new, however. In a 2014 International Test Conference paper titled “Process Defect Trends and Strategic Test Gaps,” Intel authors highlighted emerging defects with much more complex faulty behavior. Their description of defects with marginal behavior matches current knowledge of SDEs. In their abstract, they summarized the situation as follows:
“Lithography challenges are beginning to replace ‘FAB dirt’ as the more challenging source of defects to detect and screen at test,” Intel engineers wrote. “These new defects often cause marginal behaviors, not gross failures, with subtle signatures that differ significantly from both traditional defects and from parametric process variation. Testing for subtle marginalities in effectively random locations exposes important gaps in prevalent test strategies. The strongest marginality screens focus on fixed locations, and the strongest random defect screens look for grosser signatures. These trends and gaps will drive critical new requirements for fault modeling, test generation, and test application, and implementing them effectively will require a new level of collaboration between process and product developers.”
In investigating customer failing units with SDE behaviors, Intel’s Lerner noted that even properly identifying a defect as an SDE can be a challenge. “From what we have observed, the defects that may cause silent data errors (SDE) are generally no different than those defects that cause detectable errors, both correctable and uncorrectable,” Lerner said. “Whether a defect causes an SDE as opposed to a machine check error (MCE) is primarily a function of where the defect occurs.”
Why SDEs are hard to detect
Compared to consumer computers, microprocessors bound for data centers have longer logic test patterns run at wafer and final test. In addition, 100% of server chips are tested using a system-level test, which runs functional test patterns with long test times — 40 to 60 minutes is typical. Yet despite the volume of test content in the flow, parts exhibiting SDE behavior escaped.
To understand the challenges in creating test content for either traditional digital testing or functional testing, consider that the reported SDEs involved arithmetic computations: addition, subtraction, division, and multiplication. For a 16-bit adder to functionally calculate all input combinations, it needs to run 2^32 unique combinations. Assuming one operation per cycle at a 1 GHz clock, this represents about 4.3 seconds. For just an ALU's adding function, this is an inordinate amount of test time. Decades ago, similarly long test times for far simpler logic ASICs prompted IC suppliers to use fault models based upon manufacturing defects. While including more intelligence about the adder design at the transistor and interconnect level can reduce the functional combinations to something smaller, this remains a non-trivial engineering task.
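The arithmetic behind that estimate is simple, as the short sketch below shows. It assumes one addition per clock cycle, which is itself an idealization.

```python
# Back-of-the-envelope check of the adder example: two 16-bit operands give
# 2^16 * 2^16 = 2^32 input combinations. At one add per 1 GHz clock cycle,
# exhaustively applying them takes roughly 4.3 seconds -- for a single adder.

combinations = (2 ** 16) * (2 ** 16)       # 4,294,967,296
clock_hz = 1e9                             # assumed 1 GHz, one add per cycle
print(combinations)                        # 4294967296
print(combinations / clock_hz, "seconds")  # ~4.29 seconds
```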
Structural tests based upon fault models can be graded. Within the universe of possible faults for a device under test (DUT), you can simulate structural tests to determine the coverage. At one time, 98% stuck-at fault coverage was considered high coverage and sufficient. That’s no longer true. In addition, defects impacting the circuit path timing need to be detected. Today, multiple test coverage metrics exist for stuck-at faults, transition faults, and cell-aware faults.
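Grading itself reduces to a simple ratio once fault simulation has determined which faults a pattern set detects. The sketch below, with made-up numbers, shows that calculation; real flows report it per fault model (stuck-at, transition, cell-aware) and per pattern set.

```python
# Illustrative only: fault coverage is the fraction of the fault universe that
# the simulated pattern set detects. The fault sets here are hypothetical.

def fault_coverage(detected_faults, total_faults):
    return 100.0 * len(detected_faults) / len(total_faults)

total = {f"fault_{i}" for i in range(10_000)}     # hypothetical fault universe
detected = {f"fault_{i}" for i in range(9_800)}   # faults the patterns catch
print(f"{fault_coverage(detected, total):.1f}% coverage")  # 98.0%
```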
Calculating the coverage of functional test patterns and linking it to digital fault models has been an area of research, but the problem has yet to be solved.
“You can’t say, ‘Well, I booted the OS and determined what potential faults that covered,'” said Peter Reichert, system-level test system architect at Teradyne. “There’s not a path to get from one to the other. For traditional tests, we have a way of identifying faults and a way of grading our test programs against the faults. With instruction tests performed at system-level test, we don’t really have that ability. It’s very hard to map an action like running a specific program on a processor to what faults will it find.”
Functional test patterns run instructions within a CPU. Test engineers can detect unique defects with them because the patterns create conditions that differ from structural tests, and thus activate the defective circuit in a way that the faulty behavior can be observed. In effect, a functional test pattern more closely mirrors the behavior in a customer system. Functional patterns can be applied during ATE test and system-level test. An ATE tester provides tighter control of temperature and voltage levels than a customer system. So while system-level testers look more like a customer system, they still offer more precise control of voltage levels and temperature than the field does.
So why doesn’t SLT detect these faulty server chips? Part of the answer is in the behavior. These are silent data errors. If you don’t specifically look for them, you won’t find them.
“I work with high performance digital customers, U.S. and worldwide, and certainly this is a serious and frustrating topic for a lot of them,” said Dave Armstrong, principal test strategist at Advantest America. “Why is system-level test inadequate at finding these 100 to 1,000 PPM defects? You’ve got to remember that any tester, ATE or system-level or something else, is a programmed general-purpose resource. If the resources exist on the tester, which I believe they do, then the software is not there or the test programming has not been performed to find the errors.”
While the resources probably do exist on the tester, the right test conditions are required to find the problems. SDEs occur at very low defect rates, and they occur randomly on any core inside a server chip or any single computational circuit block on that core. Running all possible computations takes a huge amount of time.
“Why doesn’t this system level test catch it? The only answer I have for you is because the test isn’t focused enough to capture it, but I don’t know how to do this. This is the challenge of system level test that right now I don’t know how to improve upon. Like you said, if I have to do every numeric computation possible that’s obviously not practical,” said Teradyne’s Reichert.
It took Google and Meta engineers thousands of engineering hours to find SDEs. Recently, a cloud service provider engineer reported that using a specific detection strategy, it takes 15 days to screen the company’s fleet of 1 million servers.
Translating software application failures into system-level test content can be done.
“Historically, testing specifically targeting failures that manifest solely as SDE was limited,” said Intel’s Lerner. “While there was certainly some system-level test in place for detection of SDE failures, the low rate of observance in the field did not suggest there was a high rate of SDE occurring due to defects. Recently, Intel increased its investment in tests specifically targeted to detect SDE. Today, these tests are screening defects that may have been missed previously.”
Fig. 1: Rate of defect screening with DCDIAG test on third-generation Intel Xeon Scalable SoCs. Source: Intel
But SDEs remain difficult to find due to the need for randomness in the SLT patterns, which affects the length of time to detection. In addition, the repeatability of tests can be low for some defects. In a 2022 ITC paper, “Optimization of Tests for Managing Silicon Defects in Data Centers,” Intel authors noted there are ‘easy-to-detect’ and ‘hard-to-detect’ defects, based on the cumulative DCDIAG time needed to detect them.
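The general shape of such randomized functional content, emphatically a sketch and not DCDIAG itself, looks something like the loop below: generate random operands, execute the operation on the core under test, and compare against a reference result. Detection time then depends on how long the random stream takes to hit operand values that sensitize a particular defect, which is why 'hard-to-detect' defects accumulate so much test time.

```python
import random

# Illustrative sketch only -- not DCDIAG. In a real flow the multiply would be
# executed on the core under test, while the reference would come from a
# known-good core or a precomputed table; here both run on the same host, so
# the code only shows the shape of the test, not a usable screen.

def reference_mul(a, b):
    return a * b  # stand-in for a reference computed off the suspect core

def random_mul_test(iterations=1_000_000, seed=None):
    rng = random.Random(seed)
    for _ in range(iterations):
        a = rng.getrandbits(64)
        b = rng.getrandbits(64)
        if a * b != reference_mul(a, b):   # result from the core under test
            return (a, b)                  # failing operand pair, if ever seen
    return None                            # no sensitizing combination found yet
```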
What else can IC suppliers do?
Cloud service providers have worked closely with their server SoC suppliers to develop system-level test content that captures SDE occurrences. To improve screening further, many industry experts believe more can be done during the manufacturing inspection and test processes.
“The screening has to be much better,” said Andrzej Strojwas, CTO at PDF Solutions. “Can the test quality be improved? Definitely. We actually see quite a bit of data that shows front end is a big contributor, and front-end systematic defects can be translated into logic cells. Two things are absolutely necessary — cell-aware diagnosis, and cell-aware fault models that take into account these front-end systematic defects.”
Understanding circuit design margin across a range of test environment conditions (frequency, voltage, temperature) can provide additional insight. First, you need test content that duplicates, in some manner, the customer's software code.
“Software-derived tests will allow us to focus in and find the problems and the margins more effectively,” said Advantest’s Armstrong. “You need to know this upfront in pre-silicon validation, first silicon, and post-silicon. In addition, you need to continuously learn from your RMAs. Also, being able to rerun a test based on the new software, whether it’s a new library or new code that identifies a certain problem, is really important.”
Once you have this, then you need to understand the test environment as localized as possible to the circuit block with the defect. “Creating arrays of sensors internal to the design, and combining analytics, enhances your understanding of device operation,” said proteanTecs’ Basoco. “Information on power, voltage droop, timing margin and noise sensors can provide context on an application being run. Other metadata on when the activity occurs, and measurement location, is also necessary. If you understand how these conditions vary under different workloads and connect the various views in space and time, you will be able to root cause issues more quickly.”
Such data was not available to the data center engineers reporting these errors. “At a workshop I listened to a Google engineer present about these SDEs,” said Synopsys’ Cron. “I typed a question, ‘Do you have any SLM data that goes with your time of issue?’ The answer: ‘No, we have no such data. So we couldn’t even guess what’s going on in there.’ You know, maybe there is a hot temperature issue.”
Engineers always appreciate more data. Detection, and then diagnosis based on that data, is essential.
“This is where SLM comes in, because now if you’re monitoring all the various environmental conditions — process, voltage, temperature, and path delays — and you’re doing this on a regular basis, you can see what was going on in that system when that failure occurred,” said Steve Pateras, vice president of marketing at Synopsys. “And then you’re getting data to help you diagnose whether there was a temperature spike and/or a voltage droop.”
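In practice, that means being able to join the monitor data with the moment a miscomputation was observed. The snippet below is a hedged illustration of that join; the field names and sampling scheme are assumptions, not any particular SLM product's API.

```python
from bisect import bisect_right

# Sketch of correlating in-field monitor samples (temperature, voltage droop,
# timing-margin readings) with the timestamp of an observed miscomputation,
# so the failure can be examined in the context of the conditions around it.
# Field names and values are illustrative only.

def conditions_at(failure_time, samples):
    """Return the most recent monitor sample at or before failure_time.

    samples: list of (timestamp, reading_dict) sorted by timestamp.
    """
    times = [t for t, _ in samples]
    i = bisect_right(times, failure_time) - 1
    return samples[i][1] if i >= 0 else None

samples = [(100, {"temp_c": 71, "vdroop_mv": 18, "margin_ps": 42}),
           (200, {"temp_c": 88, "vdroop_mv": 35, "margin_ps": 11})]
print(conditions_at(205, samples))  # conditions surrounding a failure at t=205
```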
Conclusion
The intersection of 10/7/5nm CMOS nodes, lower operating voltages, billions of transistors in data center CPUs with large core counts, and data center owners managing fleets of millions of cores creates the ultimate stress test for these complex devices. In addition, while reports have been specific to CPUs, there is no reason to believe that accelerators (e.g., GPUs, TPUs) are impervious to the defect mechanisms causing these silent data corruption errors during computation.
While there is every indication that you can't fully test your way out of SDEs in customers' systems, the test engineering teams will respond. An unprecedented collaboration between process, product, test, and system development teams, along with end customers, is needed.
References
P. H. Hochschild, et al., “Cores That Don’t Count,” Proc. Workshop on Hot Topics in Operating Systems (HotOS), 2021.
H. D. Dixit, et al., “Silent Data Corruptions at Scale,” 2021.
“Process Defect Trends and Strategic Test Gaps,” Proc. IEEE International Test Conference (ITC), 2014.
“Optimization of Tests for Managing Silicon Defects in Data Centers,” Proc. IEEE International Test Conference (ITC), 2022.
Related Stories
Silent Data Corruption
How to prevent defects that can cause errors.
Silicon Lifecycle Management’s Growing Impact On IC Reliability
SLM is being integrated into chip design as a way of improving reliability in heterogeneous chips and complex systems.