Silent Data Corruption Considerations For Advanced Node Designs

Detecting when degrading defects impact the useful life of the chip.

Ensuring reliability, availability, and serviceability (RAS) has long been an important consideration for many types of electronic systems, with major implications for chip design. Clearly, military hardware must be very reliable, and servers and automotive systems are also expected to be available constantly. Some amount of failure is inevitable, so being able to repair, avoid, or mitigate faults is also important. In recent years, the demand for RAS has increased and the ability to achieve target metrics is a growing challenge.

There are many factors at play in this evolution. The sheer scale of today’s huge chips adds complexity, while the advanced processes required to build them have exceedingly high transistor density and greater manufacturing variability. In the push to maximize performance, intrinsic and extrinsic degradation have become bigger concerns, affecting silicon health in different phases of the silicon lifecycle than might be expected. The increasing use of multi-die packaging adds more thermal issues to the mix.

At the system level, the tight integration of hardware and software introduces new vulnerabilities and increases the threat surface. Workloads of diverse applications are unpredictable and have ever-higher peak requirements. Despite these challenges, RAS expectations are growing, and targets continue to get more stringent. Users are demanding better guarantees of reliable, safe, and secure operations of devices, software, and systems. Traditional manufacturing test and runtime diagnostics are simply not good enough.

The number one RAS concern for hyperscalers is silent data corruption (SDC), in which data errors go undetected by the overall system. An error might be masked and cause no issues but, if it propagates, it can lead to a system or application crash or hang, or an incorrect result for an application. Any of these results can severely compromise RAS metrics. Unavailability and wrong answers are both highly undesirable outcomes.

SDC sources include permanent, intermittent, transient, and degrading faults. Root causes can be extrinsic manufacturing defects, intrinsic silicon aging, or radiation-induced transient errors. Severe defects are easily detectable by manufacturing tests, but weak defects can create circuit marginalities that fail only under a certain combination of operating conditions. Some latent defects are not symptomatic until after chips have been operational for a certain duration in the field. Weak and latent defects are not easily detected during manufacturing.

The importance of detecting errors and avoiding SDC events in the field, during chip mission usage, is underscored by a surprising characteristic of finFET technology used in sub-20nm processes. As shown in the figure above, degrading defects shift into the useful life of the chip. It is imperative to prevent these defects from causing SDC events. Fortunately, it is possible to detect such faults and predict impending failure by monitoring whether critical timing and voltage parameters exceed a pre-defined threshold.
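As a rough illustration of threshold-based detection, the sketch below flags a monitor reading whose timing margin or minimum operating voltage crosses a preset limit. The `Reading` type, the limit values, and the alert names are all hypothetical, chosen only for illustration; they are not taken from any real monitor IP.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sample from hypothetical in-chip monitors."""
    timing_margin_ps: float  # slack on a monitored path, in picoseconds
    vmin_mv: float           # minimum operating voltage, in millivolts


# Assumed limits; real limits would come from design and characterization data.
MIN_MARGIN_PS = 20.0
MAX_VMIN_MV = 650.0


def check_reading(reading: Reading) -> list[str]:
    """Return the names of any limits the reading has crossed."""
    alerts = []
    if reading.timing_margin_ps < MIN_MARGIN_PS:
        alerts.append("timing_margin_low")
    if reading.vmin_mv > MAX_VMIN_MV:
        alerts.append("vmin_high")
    return alerts
```

A healthy reading returns an empty list, while a degraded one returns one or both alert names, which a higher-level policy could then act on.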

Since key error mechanisms have been observed to manifest themselves in the field as timing issues, one of the best predictors for potential errors is reduced timing margins. Monitoring environmental changes in the silicon, monitoring application stress, and tracking timing margin changes for critical speed paths over time can allow for prediction of SDC events. A prognostics solution that monitors timing paths in mission mode can be used to detect degrading faults and predict remaining useful life (RUL) before manifestation of a failure. The RUL is calculated based on the measured rate of timing degradation compared with a reference baseline.
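To make the RUL idea concrete, here is a minimal sketch that assumes timing margin degrades roughly linearly with operating hours. It fits a least-squares slope to margin samples, extrapolates to an assumed failure margin, and compares the measured rate with a reference baseline rate. The linear model, the function names, and all numbers are assumptions for illustration only, not the method used by any particular prognostics product.

```python
def degradation_rate(hours, margins_ps):
    """Least-squares slope of margin vs. time, returned as ps lost per hour."""
    n = len(hours)
    if n < 2 or n != len(margins_ps):
        raise ValueError("need at least two paired samples")
    mean_t = sum(hours) / n
    mean_m = sum(margins_ps) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(hours, margins_ps))
    return -sxy / sxx  # positive when the margin is shrinking


def remaining_useful_life(hours, margins_ps, fail_margin_ps):
    """Hours until the latest margin reaches fail_margin_ps at the fitted rate."""
    rate = degradation_rate(hours, margins_ps)
    if rate <= 0:
        return float("inf")  # no measurable degradation
    headroom = margins_ps[-1] - fail_margin_ps
    return max(headroom, 0.0) / rate


def rate_vs_baseline(rate, baseline_rate):
    """How many times faster than the reference baseline the part is degrading."""
    return rate / baseline_rate


# Hypothetical samples: margin drops 2 ps every 1,000 hours.
hours = [0.0, 1000.0, 2000.0, 3000.0]
margins = [50.0, 48.0, 46.0, 44.0]
rul_hours = remaining_useful_life(hours, margins, fail_margin_ps=20.0)
```

With these made-up samples the fitted rate is about 0.002 ps per hour, so the estimate comes out at roughly 12,000 hours. A production flow would likely use a physics-informed aging model rather than a straight line.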

Monitoring voltage and timing during field operation are two important examples of paying attention to what is happening inside the chip. This is a key part of silicon lifecycle management (SLM), which extends from design through manufacturing and in-field deployment to end of life. Successful use of SLM techniques during field deployment of chips requires software to perform analytics on both individual chips and “fleets” of chips, thereby also enabling the detection of outliers. Collecting the data on the state of the silicon requires a set of IP that monitors Vmin, timing, and more.
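Fleet-level outlier detection can be sketched with a robust statistic. The example below is one possible approach, not a description of any vendor's analytics: it scores each chip's timing margin against the fleet median using the median absolute deviation (MAD) and flags chips whose score exceeds a cutoff. The chip names, margin values, and the 3.5 cutoff are assumptions.

```python
from statistics import median


def fleet_outliers(margin_by_chip: dict[str, float], cutoff: float = 3.5) -> list[str]:
    """Return chip IDs whose margin deviates strongly from the fleet median."""
    values = list(margin_by_chip.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the fleet, so nothing can be scored
    # 0.6745 rescales MAD so the score is comparable to a z-score.
    return sorted(
        chip
        for chip, value in margin_by_chip.items()
        if 0.6745 * abs(value - med) / mad > cutoff
    )


# Hypothetical fleet: timing margin in picoseconds per chip.
fleet = {
    "chip_a": 50.0,
    "chip_b": 51.0,
    "chip_c": 49.0,
    "chip_d": 50.5,
    "chip_e": 30.0,
}
suspects = fleet_outliers(fleet)
```

The median and MAD are used instead of the mean and standard deviation so that a single badly degraded chip does not distort the reference it is compared against.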

The Synopsys Silicon Lifecycle Management Family provides a complete solution for all stages of SLM, including in-chip monitoring to predict impending failures and avoid SDC. The process of using Synopsys SLM IP involves four steps:

  • Monitor: Embedded monitors are integrated early in the design stage
  • Transport: Data from the monitors is gathered and transported to a unified SLM database
  • Analyze: The monitor data is analyzed throughout the device lifecycle
  • Act: Based on the analysis, insightful decisions are made in real time at any lifecycle stage
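
The four steps above can be sketched as a simple loop. This is a toy model only: the in-memory dictionary stands in for the unified SLM database, the `read_margin_ps` callable stands in for an embedded monitor, and the warning limit and action names are hypothetical.

```python
from collections import defaultdict


def transport(db, chip_id, sample):
    """Transport: append a monitor sample to the chip's history in the database."""
    db[chip_id].append(sample)


def analyze(history, warn_margin_ps):
    """Analyze: summarize the latest margin and its change since the first sample."""
    latest = history[-1]
    return {
        "latest_ps": latest,
        "change_ps": latest - history[0],
        "at_risk": latest < warn_margin_ps,
    }


def act(report):
    """Act: choose a response based on the analysis."""
    return "mitigate" if report["at_risk"] else "continue"


def run_cycle(db, chip_id, read_margin_ps, warn_margin_ps=20.0):
    """Run one Monitor -> Transport -> Analyze -> Act pass for a chip."""
    sample = read_margin_ps()                      # Monitor
    transport(db, chip_id, sample)                 # Transport
    report = analyze(db[chip_id], warn_margin_ps)  # Analyze
    return act(report)                             # Act


# Hypothetical usage with three canned monitor readings.
db = defaultdict(list)
readings = iter([40.0, 30.0, 18.0])
actions = [run_cycle(db, "chip_a", lambda: next(readings)) for _ in range(3)]
```

In this toy run the first two readings lead to "continue" and the third, which falls below the assumed 20 ps warning limit, leads to "mitigate".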

Since the monitors drive the entire four-step flow, a wide variety of embedded SLM IP is needed to achieve all the desired benefits. Key elements of the solution include a path margin monitor (PMM), a clock and delay monitor (CDM), a process, voltage, and temperature (PVT) monitor, a signal monitor, an AXI bus monitor, ring oscillators, and error correcting code (ECC) logic. This IP, along with the supporting analytics software, enables the capabilities to:

  • Monitor the health of the chip
  • Detect symptoms of a degrading fault
  • Predict an SDC error before it occurs
  • Take the necessary corrective action to improve availability

Based on the calculated RUL, the SLM solution can identify the point at which a component or system is likely to fail and take action to prevent it. Identifying potential issues earlier, before they lead to an SDC event, improves the system’s reliability and availability. This helps demanding applications such as automobiles and hyperscaler data centers to achieve their target metrics while reducing maintenance costs and improving overall operational efficiency. This, in turn, satisfies consumer demand for better RAS despite all the challenges outlined earlier.

In summary, traditional manufacturing test cannot find all defects or prevent SDC events in the field. High-performance and mission-critical applications demand increased resiliency of hardware components, with enhanced RAS capabilities. Large, complex, deep submicron designs make it hard to meet this challenge, requiring mitigations in design, architecture, and test while employing best practices throughout the chip lifecycle.

An effective silicon lifecycle management solution can help address these challenges by improving silicon health and operational metrics. The Synopsys SLM Family, including SLM IP, helps meet the performance and RAS requirements for demanding applications and provides the monitoring and detection capabilities needed to enhance manufacturing quality and product integrity in the field. Learn more about the Synopsys Silicon Lifecycle Management solution.


