More SDEs can be found using targeted electrical tests and 100% inspection, but not all of them.
Engineers are beginning to understand what causes silent data errors (SDEs) and the data center failures they trigger, and how both can be reduced by increasing test coverage and boosting inspection of critical layers.
Silent data errors are so named because if engineers don’t look for them, they don’t know they exist. Unlike other kinds of faulty behavior, these errors can cause intermittent functional failures in data centers. Both Meta and Google engineering teams said software practices can identify and contain chips with SDEs. [1,2] And even though these errors cannot be 100% screened during inspection and test, experts acknowledge the need for improved manufacturing screening.
Engineers are improving both ATE test coverage of path delay defects and system-level test content that focuses on computation circuitry. Altering test parameters such as frequency and supply voltage also has been used. And by implementing 100% inspection of contacts, vias, and other critical layers, even more potential SDE escapes can be caught.
Hyper-scaler data center operators Meta and Google have reported finding SDEs that specifically impact CPU computations. They traced error rates of 100 and 1,000 DPPM (defective parts per million) down to single cores on data center CPUs, and engineers attributed those errors to manufacturing defects.
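For a rough sense of scale, the short Python sketch below converts those DPPM rates into expected defective CPUs across a large fleet; the fleet size is an assumption for illustration, not a figure reported by either company.

# Illustrative only: what a DPPM rate implies for a large server fleet.
# The fleet size is an assumption, not a figure reported by Meta or Google.
fleet_size = 1_000_000                      # assumed number of server CPUs in the fleet
for dppm in (100, 1_000):                   # reported rates, defective parts per million
    expected_bad = fleet_size * dppm / 1_000_000
    print(f"{dppm} DPPM -> roughly {expected_bad:.0f} CPUs with an SDE-prone core")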
SoCs bound for data center applications are large, complex devices containing billions of transistors and hundreds of billions of contacts. Any such computing device — CPU, TPU, GPU — manufactured at 14/10/7/5nm nodes will exhibit SDEs at similar rates. Put simply, manufacturing high-quality parts becomes harder with each CMOS process shrink. And the data center analyses note that the SDE problem does not just manifest as test escapes.
“Undoubtedly, this is a complex problem. A reasonable approach is to divide it into sub-problems,” said Janusz Rajski, vice president of engineering for Tessent at Siemens EDA. “I see three subproblems — time zero defects, which represent test escapes, early mortality defects (a.k.a. latent defects), and aging related defects that occur later during the lifecycle of the product in its system.”
By breaking down the defect behavioral types, engineering teams can focus their efforts. Better tests and better screening can help. Better tests imply new test content. Better screening can be implemented by altering test conditions or pass/fail criteria, or by taking advantage of 100% wafer inspection capabilities.
Defects at advanced nodes
In the past three decades, defect types in CMOS processes have not changed significantly. With each shrink, however, defects occur at a higher rate. Design sensitivity also changes significantly, in both physical structures and timing parameters. So a defect that didn’t matter at 22nm will have an effect at 14nm, a defect that didn’t matter at 14nm will have an effect at 10nm, and so on.
“Layout sensitivity comes from pushing the design rules to the edge,” said Andrzej Strojwas, CTO of PDF Solutions. “For example, consider the design rule tip-to-tip size for gates. If you look at the margins for source/drain to contact to gate spaces, they are virtually mission impossible to accomplish. Next, consider design rules for contact. People had to shrink the contacted gate pitches and the spacing between the source/drain contact and the gate. The contact size and the distance between the contact and the gate have all been squeezed.”
All the test evidence for SDEs points to an increased path delay, which typically means an increased interconnect path resistance. The primary defect mechanisms that cause increased path resistance are a narrowing of interconnect metal and a marginal contact or via. Narrowing of an interconnect can be due to the presence of a random defect or to a layout sensitivity to lithographic variability. For contact- and via-related defects, one needs to consider that holes always have been challenging to create, fill, and land on underlying metal or silicon.
Contacts require a metallurgically sound ohmic connection between the metal and silicon, which is accomplished by forming a silicide. Annealing at high temperatures allows the metal and silicon to create this interfacial layer. Missing silicide, or insufficient silicidation, results in higher contact resistance. M1-to-M2 vias are metal-to-metal connections, but at 10/7/5nm the holes are tiny and difficult to create.
Making a contact or via requires etching a hole and filling it with metal to create a metallurgical contact. The process steps for making contacts and vias are fraught with opportunities for a poor ohmic connection, and that has only increased with the introduction of EUV lithography at critical layers. EUV requires thinner resists, which erode more easily during the etching process, resulting in hole-edge roughness. Plasma etching into the dielectric is followed by wet chemical cleaning, which relies on capillary action to clear residues from the hole. As contacts shrink, residues left behind create an insulating barrier that interferes with the silicide reaction. These additional challenges create a higher probability of increased resistance.
Increase in subtle defective timing behaviors
Along with increased defectivity at advanced nodes come increased path delay variability and decreased timing margin along some paths. With path delay variability, identical defects can result in an SDE on one chip but not on another. Increased path delay contributes to race conditions and timing glitches at flip-flops, which result in metastable behavior.
Other engineers have noted that finFET circuits demonstrate increased Miller capacitance, which can result in a data dependence at standard cell inputs. These subtle circuit-level behaviors align with the system behavior observed by hyper-scaler users of data center ICs.
The data presented by Meta, Google, and Intel indicate SDEs are due to defects that manifest under different system conditions or different computation data inputs. This evidence does point to path delays. In addition, low repeatability of some SDE failures points to timing glitches, which can result from longer or shorter path delays.
“It’s very possible we’re seeing design marginality,” said Adam Cron, distinguished architect at Synopsys. “They’re running things here too fast, or they haven’t given enough leeway in the design. Just a little push and it’s over the edge. And those pushes are transistors — a tiny bit smaller, a little bit of heat, a little bit of voltage drop, a little bit more path resistance. They’re not modeled in the models given to designers. And they’re not tested in that manner until they’re put in the system.”
But the SDE problem is also not as simple as defects impacting timing directly.
The behaviors are more subtle. As Intel engineers highlighted in 2014 [3], these defects behave unusually with respect to frequency and voltage. Surprisingly, they noted that failures can occur at lower frequencies.
Fig. 1: Conceptual shmoo plots for marginal defect behavior that Intel engineers noted [3]. Source: Anne Meixner/Semiconductor Engineering, redrawn with permission
The reported low repeatability of standard test content and system-level test content speaks to the subtlety of the defect behavior. Such low repeatability may be due to metastability of digital flip-flops.
“If you asked me which one of the faults are causing SDEs, my answer is soft faults are the most likely because that’s what the data shows,” said Dave Armstrong, principal test strategist at Advantest America. “It may very well end up being a delay fault. But it also could be a non-deterministic situation caused by metastable flops. If you have an input that’s too close to the clock edge, it violates either the setup or the hold time, and then the output may be indeterminate. It may actually go to neither a one nor a zero, but mid-band. Then, depending upon how it’s next interpreted, it can go to a one. Then you can repeat it and it will go the other way. And that goes back to the workload of the neighboring core, which can create tiny and localized perturbations in temperature and power rails. When you repeat operations on the core with the SDE you get different results with workload 1 versus workload 2 versus workload 3.”
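As a rough illustration of the nondeterminism Armstrong describes, the toy Python model below treats a data edge that lands inside a flop’s setup/hold aperture as able to resolve either way. The aperture width and delay numbers are invented for the sketch and do not model real silicon.

import random

APERTURE_PS = 10  # assumed setup/hold aperture around the clock edge, for illustration

def captured_value(arrival_ps, clock_edge_ps, new_bit, old_bit):
    """New data is captured only if it arrives safely before the clock edge;
    arrivals inside the aperture resolve unpredictably (metastability)."""
    if abs(arrival_ps - clock_edge_ps) < APERTURE_PS:
        return random.choice([old_bit, new_bit])           # may go either way
    return new_bit if arrival_ps < clock_edge_ps else old_bit

# A marginal path whose delay wanders a few picoseconds with local temperature
# and supply noise can straddle the aperture, so repeated runs disagree.
for trial in range(5):
    arrival = 996 + random.uniform(-8, 8)                  # clock edge at 1,000 ps
    print(trial, captured_value(arrival, 1_000, new_bit=1, old_bit=0))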
Manufacturing screening options
Compared to consumer products, microprocessors for the server market have longer test patterns run at both wafer probe and package test. In addition, every part receives a system-level functional test, often requiring 40 minutes to an hour. Yet SDE failures escape these tests.
“Debugging SDEs is a challenge by itself, since it involves all software and hardware stack levels, including design, test, and usage,” said Walter Abramsohn, director of product marketing at proteanTecs. “As opposed to bugs, these errors are inconsistent, and even when reproducible, they could appear in different forms at different places with different alterations of the data. It could take months to be able to track them down.”
A characteristic of the SDE-failing cores in server CPUs is that the failures occur randomly, and very specific data values are needed to drive them. Applying random data can assist in testing, but even seemingly exhaustive testing with random numbers to ferret them out remains incomplete.
In a September 2022 talk, Harish Dixit, release-to-production engineer at Meta, provided details on the company’s software identification processes. “[Dixit] put up a Venn diagram of their testing for 15 days (in-production test, i.e., ripple) and for 6 months (out-of-production test, i.e., fleet),” said Synopsys’ Cron. “The intersection of those two was 70%. After 15 days when you test another 5.5 months, you get 23% more detects. But the shorter, 15-day test uniquely detects another 7%. They did mention they’re testing with random data. With the six months, they’re doing a more exhaustive test across a longer period of time, and so every test is different. But the shorter test detected parts that the longer test did not. What they never really said was, ‘We never actually ran the test that caught the 7% during those 6 months.’ So six months isn’t exhaustive, either, which tells me that you cannot test your way out of it because it’s potluck.”
Fig. 2: Venn diagram of all SDEs found by Meta’s 6-month (fleet) and 15-day (ripple) screens. Source: Anne Meixner/Semiconductor Engineering
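One way to read Fig. 2 is as fractions of all the SDE parts either screen found; the short sketch below simply restates the percentages Cron cited.

# Fractions of all detected SDE parts, as summarized from the Meta talk.
both_screens = 70   # % caught by both the 15-day ripple and the 6-month fleet test
fleet_only   = 23   # % caught only by the 6-month fleet test
ripple_only  =  7   # % caught only by the 15-day ripple test

print(f"Ripple test catches {both_screens + ripple_only}% of known SDE parts")    # 77%
print(f"Fleet test catches {both_screens + fleet_only}% of known SDE parts")      # 93%
print(f"Together they account for {both_screens + fleet_only + ripple_only}%")    # 100%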
The need for random data and instructions indicates the sheer challenge for manufacturing test. However, by improving manufacturing screening, engineers can improve outgoing quality and reduce the failure rate in data centers.
“Because of the vast data, address, and instruction space of modern processors and other SoCs, random combinations of data/instructions are needed to cover all possible failure modes, including those that manifest as SDE,” said David Lerner, senior principal engineer at Intel. “Because SDE failures are, by definition, ‘silent’ and only apparent if every bit/digit of every computation is verified, screening to sub-100 PPM levels requires far more extensive testing than what has been required previously to meet customer expectations of quality. [4] Historically, testing that specifically targeted failures manifesting solely as SDE was limited. While there was certainly some system-level test (SLT) in place for detection of SDE failures, the low rate of observance in the field did not suggest there was a high rate of SDE occurring due to defects. Recently, Intel increased its investment in tests specifically targeted to detect SDE. Today, these tests are screening defects that may have been missed previously.”
Improving test patterns
In the manufacturing setting, test content can be expanded at wafer probe test, final unit test, and then system-level test.
With ATE sockets, engineers evaluate the applied test patterns based upon their digital circuit fault coverage. The baseline test provides stuck-at fault coverage, in which the inputs or outputs of logic gates are fixed at either a 1 or a 0. These patterns can be run at speed or slower, and fault coverage targets are usually set very high, e.g., 98%. The next step up is transition fault coverage, in which unsuccessful changes from 1 to 0, or 0 to 1, are tested at speed. Path delay testing takes a closer look at the speeds, but it is computationally challenging both to determine all the delay fault paths and to apply them. In addition, there is an assumption that simulation can accurately predict which paths will be the longest. Over the past few decades, microprocessor design teams have shown that the longest paths predicted by simulation typically are not what comes out on top during post-silicon validation.
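For readers unfamiliar with the metric, fault coverage is simply the fraction of modeled faults a pattern set detects. The counts in the sketch below are invented to show the arithmetic, not taken from any real design.

# Fault coverage = detected faults / total modeled faults, per fault model.
# The counts are illustrative only.
fault_models = {
    "stuck-at":   {"total": 4_200_000, "detected": 4_116_000},
    "transition": {"total": 4_200_000, "detected": 3_990_000},
}
for model, counts in fault_models.items():
    coverage = counts["detected"] / counts["total"]
    print(f"{model} coverage: {coverage:.1%}")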
Testing based on cell-aware fault models is becoming more common due to the need for more precise coverage. Multiple industry experts point to cell-aware testing as a way to increase detection of SDEs.
“Two things are absolutely necessary — cell-aware diagnosis, and within cell-aware, taking into account systematic defects,” said PDF’s Strojwas.
Running multi-cycle patterns, using two or more cycles, can detect more defects. “With cell-aware patterns, we’re looking inside cells, and we’re finding defects based on the cell structure. Defects include transistor-level properties and interconnect defects — opens, shorts, internal bridges,” explained Rajski. “Developed patterns could be single-cycle patterns or multi-cycle patterns. If the transistor is weak or a connection is weaker (higher resistance), then it can result in increased propagation delay. With multi-cycle patterns, one can test for propagation delays. For instance, are the cell outputs switching fast enough? Two-cycle or multi-cycle patterns will stress the timing relationships. Then, one obviously wants to propagate it along a longer path, so that the likelihood of detecting it is higher. This is an important aspect, because we are seeing that with our patterns we actually have a much higher ability to detect defects than with just combinational cell-aware patterns.”
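A minimal sketch of the launch/capture idea behind such two-cycle delay patterns: a weak cell or resistive connection adds delay, and the pattern fails if the transition no longer arrives within one clock period. All delay values here are invented for illustration, not taken from any real cell library.

# Toy two-cycle (launch/capture) delay check: a transition launched on one clock
# edge must propagate through the sensitized path before the capture edge.
CLOCK_PERIOD_PS = 500

def path_delay_ps(cell_delays_ps, defect_extra_ps=0):
    """Sum of cell delays along the path, plus any defect-induced slowdown."""
    return sum(cell_delays_ps) + defect_extra_ps

cases = {
    "healthy":  path_delay_ps([120, 90, 150, 110]),                      # 470 ps
    "marginal": path_delay_ps([120, 90, 150, 110], defect_extra_ps=60),  # 530 ps
}
for name, delay in cases.items():
    verdict = "PASS" if delay <= CLOCK_PERIOD_PS else "FAIL (transition not captured)"
    print(f"{name}: {delay} ps -> {verdict}")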
In a manufacturing setting, system-level test replicates an end-customer’s system.
“System-level test is an attempt to get to more of the real-life scenarios,” said Peter Reichert, system-level architect at Teradyne. “So this problem (SDEs) is not going to be stuck-at. It’s going to be something like crosstalk between two circuits, or different thermal heating between two parts of the die, that only manifests when you use the part in a specific operation.”
The challenge is that an exact mapping from a functional test to the faults it covers doesn’t really exist. Engineers apply the content, and if a unit fails it is not passed on to the customer. System-level test should be capable of replicating data center failures by applying the appropriate test content. However, as highlighted earlier, the randomness in the data needed to provoke these single-occurrence errors means they may not even be repeatable.
Through the Intel Data Center Diagnostics Tool, a large suite of system-level test sets optimized to detect SDEs now exists. In their ITC 2022 paper, Intel engineers reported on detection capability versus test length for three test sets. Because random data is used in each test set, repeating a test set with new random data increases the detection rate.
Fig. 3: Intel’s reported results on time-to-fail distribution using the Eigen tests. [4]
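The benefit of repetition follows from simple probability: if a single pass of a randomized test set detects a given marginal part with probability p, then N passes with fresh random data detect it with probability 1 − (1 − p)^N. The sketch below uses assumed per-run probabilities, not Intel’s measured data.

# Cumulative detection probability across repeated runs of a randomized test set.
# Per-run detection probabilities are assumptions for illustration only.
def cumulative_detection(p_per_run, runs):
    return 1 - (1 - p_per_run) ** runs

for p in (0.05, 0.20):
    curve = [round(cumulative_detection(p, n), 3) for n in (1, 5, 10, 20)]
    print(f"p = {p}: detection after 1/5/10/20 runs = {curve}")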
Pinpointing environmental stresses
Another approach involves varying the manufacturing test conditions — clock frequency, power rail settings, temperature, etc.
“One of the points that was driven home to me at ITC 2022 is that ICs become less reliable in their lower-speed power saving modes,” noted Reichert. “Since Vdd is reduced to save power, the transistors are not only slower, but there is more variation in speed. And there isn’t a feasible way to test path delays, at least using scan. One thought is this question — ‘Have the researchers who isolated SDEs to a specific core or operation in their computers identified what speed/power mode the processor is running in?’ If not, it seems like an important thing to do.”
If known, that information could be used to run shmoos over those modes. The 2014 Intel ITC paper highlighted that a customer’s failing parts can behave distinctly differently from good parts at lower frequencies. This could be due to a combination of path delay faults and CPUs using dynamic voltage and frequency scaling (DVFS).
Thus, changing test conditions of existing test content to align with the voltage/frequency modes observed in the field could be useful. And some testing indicates the change doesn’t have to be too sophisticated to have an effect.
In a poster presented at ITC 2022, Intel engineers demonstrated the effectiveness of reducing the power rail settings by tens of millivolts to detect customer-failing SDEs with standard system-level test (SLT) content. [5] In effect, removing voltage margin during production-level test applies a stress that can exacerbate a latent defect that could cause an SDE. Applying 6% of the SLT content detected 3% of the SDE DPPM. They then ran 90% of the SLT content and detected 30% of the SDE DPPM.
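Structurally, this kind of screen is just existing SLT content rerun at reduced supply settings. The sketch below is a hypothetical illustration of that loop; the voltages, offsets, and the toy pass/fail model stand in for a real test executive and are not Intel’s values.

# Hypothetical undervoltage screen: rerun existing SLT content at slightly
# reduced Vdd. The pass/fail model is a stand-in for a real test executive.
NOMINAL_VDD_MV = 750                  # assumed nominal core supply
OFFSETS_MV = (0, -20, -40, -60)       # "tens of millivolts" of undervoltage

def slt_passes(vdd_mv, min_passing_vdd_mv):
    """Toy model: a unit passes only above its lowest workable supply voltage."""
    return vdd_mv >= min_passing_vdd_mv

def screen_unit(min_passing_vdd_mv):
    for offset in OFFSETS_MV:
        vdd = NOMINAL_VDD_MV + offset
        if not slt_passes(vdd, min_passing_vdd_mv):
            return f"FAIL at {vdd} mV (offset {offset} mV)"
    return "PASS at all margins"

print(screen_unit(min_passing_vdd_mv=660))   # healthy unit survives the margin
print(screen_unit(min_passing_vdd_mv=720))   # marginal unit caught by undervoltage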
Inspecting 100% of die
When engineers know the defect mechanism, they can screen for it during wafer inspection.
Traditionally, production wafer inspection is performed by sampling multiple sites on perhaps two wafers in a wafer lot. To meet production-level process times between manufacturing steps, engineering teams have about two hours to perform an inspection. But with recent technology improvements, 100% inspection has become possible. Wafer inspection vendors have developed the equipment, physical design analysis, and computer vision algorithms to support more sophisticated targeting and interpretation of inspection measurements. This capability can support detecting the most likely SDE defect mechanisms found within standard cells by looking at contacts, vias, and the first few metal layers, which also create the interconnect between standard cells.
Contacts and vias are important to inspect. An SoC with 2 billion-plus transistors contains billions of contacts and vias at M1 and M2 alone. An e-beam tool can be used to look for contact and via defects, because different resistance values result in different voltage-contrast emissions.
“With optical proximity correction (OPC) there is retargeting, but there are still problems,” said PDF’s Strojwas. “We are seeing that even in volume-produced products, you can have 2% to 3% yield losses because of very specific layout systematics. Customers can use our extraction tool, called Fire, to identify the areas in which those systematics exist. Next, in the white space, they can insert identical structures in close proximity to these areas. Then, with our vector-based e-beam tool, they can measure at these areas and identify failures. These types of defects occur at parts per billion and require inspecting billions of locations of contacts or stacked vias.”
With 100% inspection, engineers now can search for defects that could narrow an interconnect. In addition, outlier detection strategies, commonly used in automotive testing, have been adapted to evaluate defect inspection data.
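A minimal sketch of the kind of outlier screen borrowed from automotive test, applied here to a per-die inspection metric such as a voltage-contrast gray level; the data values, the metric, and the 3.5 threshold are illustrative assumptions.

import statistics

# Robust (median/MAD) outlier screen on a per-die inspection metric.
def robust_z_scores(values):
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values]) or 1e-9
    return [(v - med) / (1.4826 * mad) for v in values]

die_metric = [1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 1.37, 0.97]   # one suspicious die
outliers = [i for i, z in enumerate(robust_z_scores(die_metric)) if abs(z) > 3.5]
print("Dies flagged for review:", outliers)                     # [6]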
Conclusion
Both Meta and Google engineering teams invested thousands of engineering hours to identify the causes of silent data errors, as well as the specific software applications, computations, and data values that drive SDEs. Checking a fleet of hundreds of thousands to perhaps millions of servers requires a strategy that makes efficient use of the time allocated to check for SDEs.
Identifying server CPUs with faulty behavior does result in containment, but it comes at a cost. Engineers are evaluating anything that can be done during the IC supplier’s manufacturing test process to reduce the escapes. The options begin with wafer inspection, involve additional test content, and extend to better system-level testing during manufacturing.
Fig. 4: Screening options available for data center SDE detection. Source: Semiconductor Engineering
While screening options can improve detection of the defects that cause silent computational errors, industry experts concur that 100% containment during manufacturing is not economically feasible. Thus, additional steps need to be taken to mitigate the impact.
“For the various reasons discussed earlier, it is unlikely that test alone is a sufficient mitigation for SDE, especially at data center scale,” said Intel’s Lerner. “At a minimum, incremental resiliency and redundancy features and periodic in-field testing, preferably utilizing efficient BiST-like structures, will be necessary to achieve a sufficiently low SDE FIT rate.”
—Katherine Derbyshire contributed to this report.
References