Hunting For Hardware-Related Errors In Data Centers

Why tracking defects is so difficult in the fab, and what’s being done to change that.


The semiconductor industry is urgently pursuing design, monitoring, and testing strategies to help identify and eliminate hardware defects that can cause catastrophic errors.

Corrupt execution errors (CEEs), also known as silent data errors (SDEs), cannot be fully isolated at test — even with system-level testing — because they occur only under specific conditions. To sort out the environmental conditions that produce errors, engineers need data internal to SoCs, ideally time-stamped, so they can trace a failure back to the failing lines of code. That time-stamped data isn’t available yet, and it will take time to provide the capability. In the meantime, pressure is building inside large data centers to solve this problem — particularly among the hyperscalers that have encountered these issues.

The data center computational errors that Google and Meta engineers reported in 2021 have raised concerns regarding an unexpected cause — manufacturing defect levels on the order of 1,000 DPPM. Because they are specific to a single core in a multi-core SoC, these hardware defects are difficult to isolate during data center operations and manufacturing test. In fact, SDEs can go undetected for months because the precise inputs and local environmental conditions (temperature, noise, voltage, clock frequency) have not yet been applied.

For instance, Google engineers noted that ‘an innocuous change to a low-level library’ started to give wrong answers for a massive-scale data analysis pipeline. They went on to write, “Deeper investigation revealed that these instructions malfunctioned due to manufacturing defects, in a way that could only be detected by checking the results of these instructions against the expected results; these are ‘silent’ corrupt execution errors, or CEEs.” [1]

However, once these errors become evident, data center operators can quantify the impact on their operations, such as reduced reliability and availability. Google listed the computational impact of CEEs in increasing order of risk as:

  • Wrong answers that are detected nearly immediately, through self-checking, exceptions, or segmentation faults, which might allow automated retries.
  • Machine checks, which are more disruptive.
  • Wrong answers that are detected, but only after it is too late to retry the computation.
  • Wrong answers that are never detected.

All four possibilities have impacted data center operations, as both Google and Meta engineering teams noted in 2021. [1, 2] The first two issues, while troublesome, may be recoverable. The last two are not recoverable because the incorrect computation has been passed on to a subsequent piece of code, which eventually disrupts system operation.

Each randomly occurring manufacturing defect maps only to specific calculations and data inputs. The SDE’s low repeatability indicates that environmental conditions play a role. Due to an SDE’s specificity, finding these hardware defects is extremely challenging. It requires a concerted effort over time, which Google characterizes as “many engineer-decades.”

Industry experts have concluded that data internal to the failing core could provide insight and potentially lead to improved screening procedures during chip manufacturing. At advanced CMOS process nodes, in particular, SoCs bound for data center applications are more vulnerable to interactions between system operation conditions and the challenges of fabricating tiny devices, interconnects and vias.

“At a high level, SDE issues are very subtle in nature,” said Steve Pateras, vice president of marketing and business development at Synopsys. “They’re not hard faults. So having a more stringent test is not going to necessarily solve this problem. I don’t think these are things you can find in a testing environment where it is about playing with more patterns. You’re going to find that it’s a result of an environmental set of conditions, whether it’s the environment in which the system finds itself, or whether it’s a certain level of compute processing that is creating stress on the system. We need something beyond test and DFT — beyond manufacturing test in my view.”

Others agree. “The SDE problem space is growing faster than our ability to solve it. They are not easy to trace at the hardware level and manage to propagate all the way through the stack to the application level — either crashing or disrupting the system operation,” said Walter Abramsohn, director of product marketing at proteanTecs. “We need new telemetry sources that can monitor these devices in the field and predictively raise a flag before SDE occurs. Aging and degradation must be accounted for when maintaining IT infrastructure. This is the only way we can scale our compute resources reliably.”

Telemetry sources are on-die circuit monitors that provide internal data on environmental conditions, circuit performance, and even specific functional block operations. This data may be insightful, which is one premise of silicon lifecycle management (SLM) infrastructure. Engineers who want longitudinal data on circuit and functional performance must link on-die monitor data from manufacturing test to first system power-up, and all the way through end of life.
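To make that longitudinal linkage concrete, here is a minimal sketch (hypothetical field names and records, not any vendor’s actual SLM schema) of time-stamped monitor readings keyed by a die identifier, so that data from final test and later field operation can be joined into a single per-die history:

```python
# Minimal sketch of longitudinal SLM data linkage -- hypothetical schema,
# not any vendor's actual silicon lifecycle management API.
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MonitorReading:
    die_id: str          # unique die identifier (e.g., read from an e-fuse)
    stage: str           # "wafer_test", "final_test", "system_bringup", "field"
    monitor: str         # "path_delay_margin_ps", "temp_c", "vdd_mv", ...
    value: float
    timestamp: datetime  # time-stamped so it can later be tied to system events

def build_history(readings):
    """Group readings per die and order them in time across lifecycle stages."""
    history = defaultdict(list)
    for r in readings:
        history[r.die_id].append(r)
    for die_id in history:
        history[die_id].sort(key=lambda r: r.timestamp)
    return history

# Example: the same die observed at final test and again months later in the field.
readings = [
    MonitorReading("DIE-0042", "final_test", "path_delay_margin_ps", 48.0,
                   datetime(2023, 1, 10, 8, 0, tzinfo=timezone.utc)),
    MonitorReading("DIE-0042", "field", "path_delay_margin_ps", 39.5,
                   datetime(2023, 9, 2, 14, 30, tzinfo=timezone.utc)),
]
for r in build_history(readings)["DIE-0042"]:
    print(r.stage, r.monitor, r.value, r.timestamp.isoformat())
```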

Based on high-level observations, it’s possible that workloads in neighboring cores can influence SDE repeatability. Due to the lengthy time needed before a complex SoC exhibits an SDE, some surmise that degradation due to accelerated aging is a potential cause. However, without more internal data tied to actual system workloads and tracking of device degradation, these causes remain speculation.

You don’t know what you don’t know
What’s consistent about the Meta and Google engineering teams’ reports is that they lack additional data about what’s going on in the circuits during these failures.

“At one of the workshops I attended on this topic, I listened to a Google engineer explain their analysis,” said Adam Cron, distinguished architect at Synopsys. “I asked if they had any silicon lifecycle management data to go with the ‘time of issue.’ The answer: They had no such data. Maybe there is a hot temperature issue. They don’t even know. And Meta engineers have stated they don’t even have logic BiST to apply. I’m not saying logic BiST is a solution. But at least it gives you some sense that the same thing that could have passed on the tester still does, in fact, pass in the system.”

Engineers at Google further confirmed their need for internal data, “Our understanding of CEE impacts is primarily empirical. We have observations of the form, ‘This code has miscomputed (or crashed) on that core.’ We can control what code runs on what cores, and we partially control operating conditions (frequency, voltage, temperature). From this, we can identify some mercurial cores. But because we have limited knowledge of the detailed underlying hardware, and no access to the hardware-supported test structures available to chip makers, we cannot infer much about root causes.” [1]

Detection is indeed vital, but without information for diagnosis, little can be done to improve manufacturing screening or to respond with a resilient design solution.

“A very important distinction to be made is first detecting these problems in the system, but also — more important — is diagnosing them,” said Synopsys’ Pateras. “This is where silicon lifecycle management comes in, because now if you’re monitoring all the various environmental conditions, PVT, path delays, and you’re doing this on a regular basis, you can see what was going on in that system when a particular failure occurred. And then you’re getting data to help you diagnose based on information like if there was a temperature spike or a voltage spike at the time of the failure.”

In other words, the internally collected data needs to be time-stamped to connect it to the execution of the failing lines of code.

Understanding conditions to improve screening
To fully comprehend what’s happening in the field requires on-die data about design margin, environmental conditions, and functional level operation at or near the time of failure. Such data then can guide both manufacturing screens and future circuit and architectural design choices.

The conditions of failure are sometimes counterintuitive. Consider clock frequency, for example.

In their paper, Google engineers noted, “Temperature, frequency, and voltage all play roles, but their impact varies: e.g., some mercurial core CEE rates are strongly frequency-sensitive, some aren’t. Dynamic Frequency and Voltage Scaling (DFVS) causes frequency and voltage to be closely related in complex ways, one of several reasons why lower frequency sometimes (surprisingly) increases the failure rate.”

An increase in failure rate at lower frequency corresponds with observations from others. In a 2014 paper, Intel engineers wrote that some CPUs (22nm and 14nm) exhibited failures at lower frequencies. [3]

More recently, an industry source shared that their analysis of 5nm HPC RMAs determined that a majority of these chips failed standard production at-speed scan tests when executed at lower clock frequencies.

These observations point to the complex relationship between clock frequency and functional path delay margin. In the presence of a defect, certain temperature and voltage conditions, and a lower clock frequency, a specific calculation can produce the wrong answer. Such scenarios suggest that changing the clock frequency outside of the DFVS-constrained envelope could alleviate the failing behavior.
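A toy calculation (illustrative numbers only, not measured silicon data) shows how this counterintuitive behavior can arise: when DFVS pairs a lower frequency with a lower voltage, the longer clock period can be outweighed by the slower path and by a resistive defect whose added delay grows even faster at low voltage.

```python
# Toy illustration of why a lower DFVS point can have *less* timing slack.
# Illustrative numbers only -- not measured silicon data.

def slack_ps(freq_ghz, path_delay_ps, defect_delay_ps):
    """Timing slack = clock period minus total path delay, in picoseconds."""
    period_ps = 1000.0 / freq_ghz
    return period_ps - (path_delay_ps + defect_delay_ps)

# High-performance point: 3.0 GHz at nominal voltage.
print("3.0 GHz:", round(slack_ps(3.0, path_delay_ps=300.0, defect_delay_ps=20.0), 1), "ps")

# Low-power point: 2.0 GHz, but DFVS also lowers the voltage, so the intrinsic
# path slows down and the defect's added delay grows even more -> negative slack.
print("2.0 GHz:", round(slack_ps(2.0, path_delay_ps=420.0, defect_delay_ps=90.0), 1), "ps")
```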

“The clocks are a lot more adjustable and controllable in these products,” said Dave Armstrong, principal test strategist at Advantest America. “One of the questions that might give us a clue is, ‘How many VCOs are there on a design?’ From my recent experience, there are multiple time domains. And there’s tracking between the time domains and noise triggers on those time domains. That’s an area that could allow us some adjustability to provide more margin. Again, there are certain triggers that might occur based on the SDE system-level data we’re seeing.”

Indeed, the triggers can be difficult to decipher without additional knowledge about the hardware and the internal chip data.

On-die monitor data during system operation
The electrical and thermal environment for a device differs between ATE and an end customer’s system. In the mid-1990s, in response to customer returns, CPU suppliers started using system-level test (SLT) as an additional screen.

“The faults that can cause SDE fall into three main categories — test escapes, latent faults that are not present at test but manifest once the part is in the field, and faults that occur in the field due to wear-out or damage,” said Peter Reichert, system-level test system architect at Teradyne. “SLT test can help with test escapes in that it provides two advantages over traditional test. The first advantage is that functional test is a good way, possibly the best way, to find path delay faults. The second advantage is the low cost of an SLT test insertion allows for longer test times at reasonable cost, thus providing greater fault coverage.”

Almost all empirical evidence about SDEs points to path delay-related failures. Functional test has been part of the engineer’s arsenal of test content at both ATE and SLT. With longer test times, system-level test can cover more functional paths. But with about 1 hour of manufacturing test time, it’s simply impossible to replicate the system workloads that hyperscaler data center operators execute. Thus, an alternative to testing all possible failing paths is to measure path delay margin in the field.

Path delay monitors measure the timing relationship between signal paths and clocks into latches. And such measurements could identify anomalous behavior contributing to an SDE.

“The path delay monitor itself is a very small IP. So in a typical design, you can have hundreds if not thousands of them spread across a die,” said Firooz Massoudi, solutions architect at Synopsys. “They’re all connected through a scan chain to the central controller. The monitor connects to the actual functional paths, especially critical paths. It continuously monitors the timing margin available on those paths under different temperature and voltage conditions. Also, you can set triggers, i.e., thresholds. So if the margin of an individual path drops below the threshold, the monitor sends a signal to the central controller. Putting this in combination of voltage and temperature monitors plus ring oscillators gives a holistic view of where the silicon is at any one time. When a monitor detects a failure, you have a profile of the silicon data that really helps to analyze what is happening.”
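A behavioral sketch of that threshold-and-trigger scheme is shown below (hypothetical path names and values; the real monitor is a hardware IP configured through its central controller, not software):

```python
# Behavioral model of a path delay monitor with a programmable margin threshold.
# Illustrative sketch only -- not the actual monitor IP or its interface.
from dataclasses import dataclass

@dataclass
class PathDelayMonitor:
    path_id: str
    threshold_ps: float  # margin threshold programmed by the central controller

    def check(self, measured_margin_ps, vdd_mv, temp_c):
        """Return a trigger record if the timing margin drops below the threshold."""
        if measured_margin_ps < self.threshold_ps:
            return {
                "path": self.path_id,
                "margin_ps": measured_margin_ps,
                "vdd_mv": vdd_mv,   # voltage context at the moment of the event
                "temp_c": temp_c,   # temperature context at the moment of the event
            }
        return None

# Central-controller view: poll many monitors and collect any triggers.
monitors = [PathDelayMonitor("alu_critical_0", 30.0),
            PathDelayMonitor("lsu_critical_3", 30.0)]
samples = {"alu_critical_0": (26.5, 742.0, 88.0),   # (margin_ps, vdd_mv, temp_c)
           "lsu_critical_3": (41.0, 748.0, 71.0)}
triggers = [m.check(*samples[m.path_id]) for m in monitors]
print([t for t in triggers if t])   # only the path that violated its threshold
```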

Currently the monitors and the associated embedded analytics are tied through a central controller and can be observed during manufacturing test and in-field usage. And this data can be used in conjunction with on-die BiST.

“Built-in self-test (BiST) and embedded analytics structures are important because they open up opportunities to gain a more sophisticated view of how the chip is functioning,” said Richard Oxland, product manager for Tessent Embedded Analytics at Siemens EDA. “At a system level, they provide the kind of data that can be used to build a signature of normal system operation. You can configure the on-chip monitors at run time to measure what matters and use the device’s own processing power to build the signature — an edge-based use case. And you can pull the data off-chip and do a cloud-based analysis of the behavior of a fleet of chips.”
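The ‘signature of normal operation’ idea can be illustrated in a simplified form (hypothetical monitor names and a plain z-score check, not Siemens’ actual embedded analytics implementation): build a baseline from healthy-operation samples, then flag later samples that deviate strongly from it.

```python
# Simplified sketch of baselining "normal operation" from monitor data and
# flagging deviations -- not any vendor's embedded analytics product.
from statistics import mean, stdev

def build_baseline(samples):
    """samples: list of dicts such as {"temp_c": 65.0, "margin_ps": 45.0}."""
    keys = samples[0].keys()
    return {k: (mean(s[k] for s in samples), stdev(s[k] for s in samples)) for k in keys}

def anomalies(sample, baseline, z_limit=4.0):
    """Return the monitors whose reading deviates more than z_limit sigma from baseline."""
    flagged = {}
    for k, (mu, sigma) in baseline.items():
        if sigma > 0 and abs(sample[k] - mu) / sigma > z_limit:
            flagged[k] = sample[k]
    return flagged

healthy = [{"temp_c": 60 + i % 5, "margin_ps": 45 - 0.1 * (i % 3)} for i in range(50)]
baseline = build_baseline(healthy)
print(anomalies({"temp_c": 62, "margin_ps": 31.0}, baseline))  # margin_ps gets flagged
```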

In fact, feeding back cloud-based analysis to improve manufacturing test is an emerging application.

“Advantest Cloud Solutions (ACS), when used in combination with embedded sensors in chip design (IP provided by multiple partners), provides comprehensive insights for root cause analysis of silent data corruption (SDC) issues,” said Keith Schaub, vice president, technology and strategy at Advantest America. “Embedded sensors enable the collection and monitoring of health data across the entire chip area throughout the device’s test lifecycle, as well as field use. This data then can be fused together in the cloud and analyzed to detect and isolate the sources of SDC. Additionally, this cloud-based service can be used to automatically generate test patterns or screens, which can further aid in proactively identifying and resolving SDC issues. With the ability to collect, integrate, and analyze data from multiple sources throughout the device’s life cycle, engineers greatly enhance their understanding, and thus, resolution of SDC issues in a continuous and proactive manner.”

Fleet behavior already is being assessed using manufacturing telemetry data. And while in-field data collection is now possible, that telemetry data remains disconnected from system failures. Meta and Google observed SDE failures that occur for specific calculations. Connecting telemetry data to system failures that are detected hours, days, or weeks after they occur requires two things — timestamps on the telemetry data, and logging of that telemetry data by data center operators to enable correlation when SDEs are detected.
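Once time-stamped telemetry exists and is logged, the correlation step itself is straightforward. As a rough operator-side sketch (hypothetical record formats; only the time-window join matters), the telemetry surrounding the estimated time of the failing computation can be pulled for the suspect core:

```python
# Sketch of correlating an SDE report with time-stamped telemetry.
# Hypothetical record formats -- shown only to illustrate the time-window join.
from datetime import datetime, timedelta, timezone

def telemetry_near(telemetry_log, core_id, suspected_time, window_minutes=30):
    """Return telemetry records for the suspect core within +/- window of the event."""
    window = timedelta(minutes=window_minutes)
    return [rec for rec in telemetry_log
            if rec["core_id"] == core_id
            and abs(rec["timestamp"] - suspected_time) <= window]

telemetry_log = [
    {"core_id": 17, "timestamp": datetime(2023, 5, 4, 3, 10, tzinfo=timezone.utc),
     "temp_c": 92.0, "vdd_mv": 735.0, "margin_ps": 33.0},
    {"core_id": 17, "timestamp": datetime(2023, 5, 4, 9, 40, tzinfo=timezone.utc),
     "temp_c": 71.0, "vdd_mv": 748.0, "margin_ps": 44.0},
]
sde_report = {"core_id": 17,
              "suspected_time": datetime(2023, 5, 4, 3, 25, tzinfo=timezone.utc)}
print(telemetry_near(telemetry_log, sde_report["core_id"], sde_report["suspected_time"]))
```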

Several companies are working toward this SLM capability. “Right now, this is not implemented, but that’s one of the features that we’ll be enabling — specifically adding a timestamp to monitor measurements as they are sent to the controller,” said Synopsys’ Massoudi.

As noted earlier, there are multiple on-die monitors that can add context to SDE failures. Their implementation into large SoCs needs to be approached in a systematic manner while addressing the feasibility of managing the data they generate. “When it comes to SDEs, test alone is not enough,” said Oxland. “We need a secondary methodology, comprising five further elements:

  • Some a priori knowledge of the manifestation of SDEs in ‘system performance’ parameters that can be measured, because we need to instrument the chip to detect;
  • Forensic data recording, including multiple data sources, because any anomaly we detect will need to be attributed to a silicon or software fault;
  • Automatic detection of anomalies using baseline data from BiST and embedded analytics;
  • Automatic triggering of detailed forensic data dump, because we don’t want to sift through unmanageable volumes of data, and
  • The ability to run test patterns on DFT structures in a deployment environment, because we need a cost-effective way to zero in on the cause once the SDE has been detected.”
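A minimal sketch of how the third, fourth, and fifth elements of that list might chain together is shown below (hypothetical hooks; real forensic capture and in-field DFT access go through the device’s own debug and test infrastructure):

```python
# Sketch of chaining anomaly detection -> forensic data dump -> in-field test request.
# Hypothetical hooks only -- real implementations use the device's debug/test fabric.
from datetime import datetime, timezone

def capture_forensics(core_id, monitors):
    """Snapshot the monitor state at the moment an anomaly is flagged."""
    return {"core_id": core_id,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "monitors": dict(monitors)}

def handle_anomaly(core_id, monitors, flagged, test_queue):
    """On an anomaly, record forensics and queue a targeted in-field structural test."""
    if not flagged:
        return None
    dump = capture_forensics(core_id, monitors)
    dump["flagged"] = flagged
    # Element five: request a DFT/BiST run on the suspect core (placeholder only),
    # so the cause can be narrowed down without pulling the machine from service.
    test_queue.append({"core_id": core_id, "test": "targeted_structural_test"})
    return dump

queue = []
dump = handle_anomaly(17, {"temp_c": 92.0, "margin_ps": 31.0}, {"margin_ps": 31.0}, queue)
print(dump["flagged"], queue)
```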

Detecting latent defects and premature aging in the field
It may be months before an SDE manifests in a core. This leads to speculation that an interconnect path or a transistor has degraded during that time. However, it takes very specific data and instructions to excite an SDE failure, which also can take months to occur. And without additional data, the evidence that some SDEs are due to reliability causes remains inconclusive.

What do the IC suppliers know? “We judge that the risk of SDE for reliability or intrinsic degradation to be low,” said David Lerner, senior principal product quality engineering at Intel. “This is due to our pre-production qualification methods, which characterize all wear-out and degradation modes to provide confidence that every part has sufficient margin for the lifetime of the part.”

Yet with data centers operating nearly 24/7 and the MOS devices on their SoCs continuing to shrink, engineers wonder whether early wear-out is at least partially to blame. This could be due to a marginal defect, or to switching a transistor more frequently than the reliability engineers modeled.

The classic bathtub curve used by semiconductor reliability engineers shows three stages: early-life failures, failures during the useful lifetime, and end-of-life failures.

Fig. 1: Semiconductor device bathtub curve. Source: Ansys

But other factors can accelerate these failures, resulting in increased resistance along the interconnect or degradation in transistor switching time, which in turn can increase path delay. Interconnect thinning, due to a combination of electromigration and defects, can increase resistance over time. Reliability engineers use various stress tests to precipitate failures. The transistor failure mechanisms — hot carrier injection (HCI), negative-bias temperature instability (NBTI), and time-dependent dielectric breakdown (TDDB) — all relate to oxide degradation.
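To make the connection between gradual degradation and timing margin concrete, here is a toy model (a generic power-law degradation term with illustrative coefficients, not a calibrated reliability model) of how an initially small slack can erode over a device’s lifetime, and how a path already narrowed by a marginal defect gets there sooner:

```python
# Toy model: path delay margin eroded by a power-law degradation term.
# Illustrative coefficients only -- not a calibrated reliability model.

def margin_ps(t_hours, margin0_ps=40.0, a_ps=6.0, n=0.2):
    """Remaining slack after t hours, with added delay ~ a * t^n (generic power law)."""
    return margin0_ps - a_ps * (t_hours ** n)

for t in (1, 1_000, 10_000, 50_000):   # roughly the first hour out to ~6 years
    print(f"{t:>6} h  margin = {margin_ps(t):6.1f} ps")

# A path that starts with less slack (e.g., narrowed by a marginal defect)
# crosses zero margin far earlier than the healthy path above.
print(f"defective path at 10,000 h: {margin_ps(10_000, margin0_ps=25.0):.1f} ps")
```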

Understanding the cause of these effects is important for the lifecycle of devices, but they won’t solve the problem with silent data errors. “Finding an issue with aging is not going to help with random defects that were missed in testing,” said Andrzej Strojwas, CTO at PDF Solutions. “Now, at very specific locations, things may be happening because it was compromised at the start by having the random defects that narrowed the interconnect path.”

Reliability engineers work with the production test engineering team to screen early-life failures, a.k.a. latent defects, at time zero in manufacturing test. Better screening could help to some extent, according to Intel’s Lerner, but it won’t catch everything.

“There is a fraction of latent defects that can be accelerated through manufacturing stress and screened. This fraction may be increased through increases in test coverage and application of a more aggressive stress,” Lerner said. “However, there are limitations, and latent defects that may cause SDE cannot be eliminated by manufacturing, necessitating in-field mitigations.”

Without in-field data over the usage of the part, it’s difficult to determine whether the failures that took months to appear at Meta and Google are caused by aging, or by the appropriate environmental conditions finally occurring for a specific execution. The reliability question has sparked interest in using monitors to detect aging-specific mechanisms.

“Silicon aging is a huge challenge,” said proteanTecs’ Abramsohn. “We see increasing evidence of faster and more widespread effects of aging in the field, resulting in performance issues and SDCs. The ability to determine the root cause of these issues depends on the ability to localize where the aging is faster and why. We need to be able to measure it and create a database that helps us pinpoint the mechanisms, and then study the patterns to identify where most of the issues are coming from.”

Numerous companies offer on-die monitors that specifically measure transistor properties associated with wear-out. These monitors typically are separate from functional circuitry, and designers can spatially distribute them across large die. In the field, periodic logging of their measurements can provide longitudinal data about variation in aging.

“We have an aging sensor that measures all the transistor reliability mechanisms (e.g., NBTI, HCI),” said PDF Solutions’ Strojwas. “It’s a small circuit, 50 x 50 microns, that we can put in the die. It has its own voltage regulator, so we can do a local stress of just the sensor, separate from the rest of the circuitry. You can increase the voltage from 0.7 to 1.2 to 1.5 volts to do the accelerated measurements while the device is operating in the field. If there is anything that’s happening in terms of aging, these sensors will provide very precise information.”
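A sketch of how such periodic readings might be consumed is shown below (hypothetical readout values and drift budget, not tied to any specific sensor IP): compare each die’s sensor value against its time-zero baseline and flag dies whose drift exceeds a budget, marking them for deeper diagnosis.

```python
# Sketch of consuming periodic aging-sensor readings logged in the field.
# Hypothetical readout values and drift budget -- not a specific sensor IP.

def drift_percent(baseline, current):
    """Relative shift of a sensor metric (e.g., ring-oscillator frequency) vs. time zero."""
    return 100.0 * (baseline - current) / baseline

def flag_accelerated_aging(log, budget_pct=3.0):
    """log: {die_id: (reading_at_time_zero, reading_now)} -> dies over the drift budget."""
    flagged = {}
    for die_id, (t0, now) in log.items():
        drift = drift_percent(t0, now)
        if drift > budget_pct:
            flagged[die_id] = round(drift, 2)
    return flagged

field_log = {
    "DIE-0042": (1.250, 1.239),   # GHz-equivalent readout: modest, expected drift
    "DIE-0107": (1.248, 1.190),   # much faster drift -> candidate for deeper diagnosis
}
print(flag_accelerated_aging(field_log))   # -> {'DIE-0107': 4.65}
```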

Conclusion
While all of these insights into accelerated mechanisms for SDEs are useful, having a timestamp that corresponds to faulty system behavior is still required.

The CPU SDEs/CEEs reported by hyperscalers heighten engineering teams’ interest in the value of internal device data. As proteanTecs’ Abramsohn summed it up, “SDC is a symptom, not a cause. The urgent need is to understand the root causes, but that requires more visibility in areas that are currently obscure.”

Several changes are needed. First, SoC design engineering teams need to intelligently embed on-die monitors that provide insight into path timing margin, local voltage and temperature conditions, and aging. Second, data logging infrastructure needs to be put in place to enable connectivity to localized analytics, along with external connectivity to manufacturing testers, data center systems, and cloud platforms. Finally, engineers require time-stamped failure and telemetry data to perform diagnosis accurately.

References

  1. H. Hochschild, et al., “Cores That Don’t Count,” Proc. 18th Workshop on Hot Topics in Operating Systems (HotOS), 2021.
  2. D. Dixit, et al., “Silent Data Corruption at Scale,” Feb. 2021.
  3. Ryan, et al., “Process Defect Trends and Strategic Test Gaps,” International Test Conference, 2014.

Related Stories
Screening For Silent Data Errors
More SDEs can be found using targeted electrical tests and 100% inspection, but not all of them.

Why Silent Data Errors Are So Hard To Find
Subtle IC defects in data center CPUs result in computation errors.



2 comments

Jan Hoppe says:

Hardware Trojans can be inserted in FPGAs or FPGA-based SoCs. FPGAs are very common in cloud, 5G, SDRs, etc. They can bring denial of service or data stealing, and that is on a large scale. Another problem overlooked and of growing danger to homeland security. Your article is instructive. What is the probability of such failures?

Anne Meixner says:

Jan,
Glad you liked the article.
Security was not a focus of this article.
These are manufacturing defects.

The Meta paper cited stated they observed hardware defects on the order of 1,000 PPM. The Google paper cited didn’t directly share numbers, though they did state they observed similar numbers to the Meta team.


