Enhancing Silicon Reliability With In-System Test And SLM Data

A convergence of DFT techniques and the proliferation of in-silicon monitors can flag potential failures before they occur.

Innovation in semiconductor development and manufacturing shows no signs of slowing down. Ever-larger chips at ever-smaller geometries create new challenges all the time. At the same time, competitive pressures are shrinking time to market (TTM) and putting enormous pressure on project teams. Furthermore, the wide use of electronics in safety-critical applications demands better reliability, availability, and serviceability (RAS). Chip test, from lab bring-up and production manufacturing all the way into the field, is one domain strongly affected by these trends.

I recently had the pleasure of presenting the tutorial “What Can In-System Test and SLM Data Bring to the Reliability Community” at the IEEE International Reliability Physics Symposium (IRPS). The tutorial focused on how the latest innovations in chip test are enabling a new generation of more reliable chips and systems. In the process of preparing my talk and slides, I gave some thought to how this critical part of the semiconductor industry has evolved. I observed that design for test (DFT) techniques are converging with a proliferation of in-silicon monitors and sensors, and that this convergence is helping to meet the challenges.

The chip test concepts familiar from previous generations of silicon are still in use. During manufacturing, chips are tested at the wafer and package level using automated test equipment (ATE). For 3D chip designs, where dies are stacked, test also occurs at the stack level. On the manufacturing floor, testers run patterns that are carefully crafted to excite possible silicon defects and propagate the effects so that they can be observed. Good chips are shipped, while failing chips undergo failure diagnosis to locate the defects so that their root causes can be corrected and yield improved.

Most of the patterns are produced by automatic test pattern generation (ATPG) tools. These use fault models (stuck-at, path delay, IDDQ, transition, etc.) to represent potential defects and coverage metrics to help estimate what percentage of defects could be caught. Compression techniques minimize test time and enable high volume production. The use of scan chains makes ATPG practical. Flip-flops are strung together into one or more chains, so that desired values can be loaded serially and results can be scanned out.
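
To make the scan idea concrete, here is a minimal Python sketch of a three-flop scan chain. It is a toy model, not a description of how any ATPG tool works: the `ScanChain` class, the `good_logic` and `faulty_logic` functions, and the test pattern are all hypothetical. A pattern is shifted in serially, one functional clock captures the logic response, and the response is shifted out for comparison.

```python
from typing import Callable, List

Logic = Callable[[List[int]], List[int]]


class ScanChain:
    """Toy model of flip-flops stitched into one serial scan chain."""

    def __init__(self, length: int) -> None:
        self.flops: List[int] = [0] * length

    def shift(self, scan_in_bits: List[int]) -> List[int]:
        """Shift bits in serially and return the bits seen at scan-out."""
        scan_out_bits = []
        for bit in scan_in_bits:
            scan_out_bits.append(self.flops[-1])   # last flop drives scan-out
            self.flops = [bit] + self.flops[:-1]   # each flop takes its neighbor's value
        return scan_out_bits

    def capture(self, logic: Logic) -> None:
        """One functional clock: the flops capture the logic outputs."""
        self.flops = logic(self.flops)


def good_logic(state: List[int]) -> List[int]:
    a, b, c = state
    return [a & b, b | c, a ^ c]


def faulty_logic(state: List[int]) -> List[int]:
    a, b, c = state
    return [a & b, 0, a ^ c]   # hypothetical stuck-at-0 fault on the OR output


def run_pattern(logic: Logic, pattern: List[int]) -> List[int]:
    chain = ScanChain(len(pattern))
    chain.shift(pattern)                      # load the stimulus
    chain.capture(logic)                      # capture the response
    return chain.shift([0] * len(pattern))    # unload the response


expected = run_pattern(good_logic, [1, 1, 0])    # [1, 1, 0]
observed = run_pattern(faulty_logic, [1, 1, 0])  # [1, 0, 0]
fault_detected = expected != observed            # True
```

The chosen pattern sets the OR inputs so that a fault-free output is 1, which excites the stuck-at-0 fault, and the scan-out stream makes the wrong value observable.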

In addition to ATPG-based scan test and compression, logic built-in self-test (BIST) and memory BIST are widely used. Together, these DFT techniques cover a large portion of typical chip test needs. Many of the same ideas carry over naturally to post-manufacturing checks and in-field operation. Parity bits and error-correcting codes (ECC) are common in memories and bus paths to detect errors and correct them when possible. If an alpha particle flips a bit, the error can be detected, and with ECC often corrected, before it compromises system functionality. With BIST, the chip internally generates patterns to test logic (LBIST) and memory (MBIST) and checks the results. BIST might run every time the chip boots or periodically during its mission, even in the field.
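
As an illustration of how ECC recovers from a single flipped bit, the sketch below implements the textbook Hamming(7,4) code in Python. This is an assumption chosen for brevity: production memories generally use wider codes, and the function names here are hypothetical. The syndrome computed on readback gives the position of the flipped bit, which is then inverted.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (corrected data bits, syndrome).

    The syndrome is the 1-based position of a single flipped bit,
    or 0 when no error is detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
upset = list(stored)
upset[4] ^= 1                              # simulate a particle strike on position 5
recovered, position = hamming74_decode(upset)
# recovered == [1, 0, 1, 1] and position == 5
```

This code corrects any single-bit error per codeword. A double-bit error would be miscorrected, which is one reason wider codes with an extra overall parity bit are common in practice.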

Beyond these well-established techniques lies silicon lifecycle management (SLM), the “other” focus of my tutorial. SLM spans the complete life of a chip, from the earliest phases of design through bring-up lab, manufacturing, and in-field mission usage. For example, analysis of parts that failed in the field provides data fed back to manufacturing for improved reliability, and then to the design stage to enhance future revisions or derivatives of the chip or its technology library. SLM is a relatively recent concept that continues to grow in importance.

One important aspect of SLM is the use of in-system test during field usage. DFT, monitors, and sensors embedded within the chip can detect a wide range of failures. Perhaps more importantly, they can detect silicon degradation due to aging or environmental stress, flagging potential failures before they occur. Proactive preventive maintenance is then possible. This is especially important in safety-critical applications, where chip failures can lead to system crashes or other catastrophic consequences. SLM IP combined with DFT creates extensive observability for an SoC in the field.

One example of an IP block supporting SLM is the ring oscillator. A loop with an odd number of inverting gates creates a series of clock pulses, and a counter keeps track of them. The number of pulses in a given time interval provides a measure of silicon performance. If in-field test shows that the performance at the same operating conditions is decreasing over time, this may be a sign of degradation due to aging, and preventive maintenance is recommended. Synopsys provides a broad portfolio of SLM IP, as summarized in the figure below.
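
The ring oscillator measurement can be sketched in a few lines of Python. An oscillator with N inverting stages, each with delay t_d, has a period of 2 × N × t_d, so slower gates produce fewer pulses in a fixed window. The stage count, gate delays, counting window, and 3% threshold below are illustrative assumptions, and the function names are hypothetical rather than part of any product.

```python
def ring_oscillator_count(stages: int, gate_delay_s: float, window_s: float) -> int:
    """Number of full oscillator periods counted in the measurement window."""
    if stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverting stages")
    period_s = 2 * stages * gate_delay_s   # an edge must travel the loop twice
    return int(window_s / period_s)


def degradation_fraction(baseline_count: int, current_count: int) -> float:
    """Fractional drop in count relative to the baseline measurement."""
    return (baseline_count - current_count) / baseline_count


def needs_maintenance(baseline_count: int, current_count: int,
                      threshold: float = 0.03) -> bool:
    """Flag the part when the count has dropped by more than the threshold."""
    return degradation_fraction(baseline_count, current_count) > threshold


# Assumed numbers: 31 stages, 1 microsecond window, same voltage and temperature.
fresh = ring_oscillator_count(31, 20e-12, 1e-6)   # 20 ps gate delay when new
aged = ring_oscillator_count(31, 21e-12, 1e-6)    # 21 ps gate delay after aging
flag = needs_maintenance(fresh, aged)
```

Comparing counts only makes sense at the same operating conditions, as noted above, because voltage and temperature also change gate delay.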

The SLM process can be summarized as:

  • Monitor: Embedded monitors and structures are integrated early in the design phase
  • Collect: Data from the monitors is gathered and transported to a unified SLM database
  • Analyze: Data from the embedded monitors is analyzed throughout the device lifecycle
  • Act: Based on this analysis, insightful decisions can be made in real time at any lifecycle stage
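
The four steps above can be sketched as a small data flow in Python. This is only a schematic under stated assumptions: the `Reading` record, the `SlmDatabase` class, the `act` function, the monitor name, and the temperature limit are all hypothetical and do not reflect any real SLM product or database schema.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Reading:
    """One sample produced by an embedded monitor (the Monitor step)."""
    chip_id: str
    monitor: str
    value: float


class SlmDatabase:
    """Stand-in for a unified store of monitor data (the Collect step)."""

    def __init__(self) -> None:
        self.readings: List[Reading] = []

    def collect(self, reading: Reading) -> None:
        self.readings.append(reading)

    def analyze(self, monitor: str) -> Dict[str, float]:
        """Analyze step: average value per chip for one monitor type."""
        per_chip: Dict[str, List[float]] = {}
        for r in self.readings:
            if r.monitor == monitor:
                per_chip.setdefault(r.chip_id, []).append(r.value)
        return {chip: mean(values) for chip, values in per_chip.items()}


def act(averages: Dict[str, float], limit: float) -> List[str]:
    """Act step: list the chips whose average exceeds the limit."""
    return sorted(chip for chip, avg in averages.items() if avg > limit)


db = SlmDatabase()
for chip, value in [("A", 70.0), ("A", 74.0), ("B", 95.0), ("B", 99.0)]:
    db.collect(Reading(chip, "temperature_c", value))

flagged = act(db.analyze("temperature_c"), limit=90.0)   # ["B"]
```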

DFT, SLM IP, and supporting software enable many interesting capabilities. One important use case is better prediction of the minimum operating voltage (Vmin) for the chip. Vmin is predicted during chip development, with a guard band added to account for aging effects. If silicon performance is underpredicted, the guard band is larger than necessary and the chip is overdesigned. With accurate field data from SLM IP, especially when coupled with AI and machine learning techniques, Vmin prediction is more accurate and chip design is closer to optimal. In this case, SLM sets the margins and DFT (or functional loads) confirms operation.
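
A short worked example shows why a tighter guard band matters. The Python sketch below compares a fixed design-time guard band with one derived from measured Vmin drift. Every number, name, and the "worst observed drift plus a safety margin" policy is an assumption made for illustration, not a method from the tutorial. The only physical relation used is that dynamic power scales with the square of supply voltage at fixed frequency and capacitance.

```python
from typing import Sequence


def operating_voltage_mv(vmin_mv: int, guard_band_mv: int) -> int:
    """Supply voltage: predicted Vmin plus the guard band."""
    return vmin_mv + guard_band_mv


def field_guard_band_mv(drift_samples_mv: Sequence[int], safety_margin_mv: int) -> int:
    """Hypothetical policy: worst observed Vmin drift plus a safety margin."""
    if not drift_samples_mv:
        raise ValueError("need at least one drift measurement")
    return max(drift_samples_mv) + safety_margin_mv


def relative_dynamic_power(v_new_mv: int, v_old_mv: int) -> float:
    """Dynamic power ratio at fixed frequency and capacitance (P ~ V^2)."""
    return (v_new_mv / v_old_mv) ** 2


PREDICTED_VMIN_MV = 650            # assumed design-time prediction
DESIGN_GUARD_BAND_MV = 80          # assumed worst-case aging allowance
FIELD_DRIFT_MV = [12, 18, 25, 21]  # assumed drift reported by monitors
SAFETY_MARGIN_MV = 15              # assumed extra margin kept on top

v_design = operating_voltage_mv(PREDICTED_VMIN_MV, DESIGN_GUARD_BAND_MV)
v_field = operating_voltage_mv(
    PREDICTED_VMIN_MV, field_guard_band_mv(FIELD_DRIFT_MV, SAFETY_MARGIN_MV)
)
power_ratio = relative_dynamic_power(v_field, v_design)
```

With these assumed values the supply drops from 730 mV to 690 mV, and dynamic power falls to roughly 89% of the original. Any reduced margin would still need to be confirmed by DFT or functional loads.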

Another example of SLM and DFT in action is better library characterization. All phases of chip design are based on libraries intended to model silicon. As soon as process test wafers are fabricated, SLM IP can begin collecting actual data. Analyzing deviations between predicted and measured values, for example of threshold voltage (Vth), provides insight into how the library model can be improved. By the time that production chips are in development, they are using more accurate libraries based on the test chips. Further data from chips in the field can be used to refine libraries even further, improving future designs.

I concluded my tutorial by discussing how some chips can repair themselves in the field based on feedback from SLM, DFT, and other in-system test methods. Many recent standards for interfaces and buses, including UCIe, AIB, and HBM4, provide spare lanes or pins. If SLM data indicates a current or impending failure, swaps can be performed automatically to bypass the problem. Some chips also duplicate cores for dual modular redundancy (DMR) or provide three registers for triple modular redundancy (TMR), with a failing flop “outvoted” by the other two.
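
Both repair ideas are simple to express in code. The Python sketch below shows a bitwise majority voter for TMR and a generic spare-lane remap. The remap is a simplified illustration that just skips failing physical lanes. It is not the repair procedure defined by UCIe, AIB, or HBM4, and the function names are hypothetical.

```python
from typing import Dict, Set


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three register copies; one bad copy is outvoted."""
    return (a & b) | (b & c) | (a & c)


def repair_lanes(num_lanes: int, spare_lanes: int, failing: Set[int]) -> Dict[int, int]:
    """Map each logical lane to a healthy physical lane, skipping failing ones."""
    healthy = [p for p in range(num_lanes + spare_lanes) if p not in failing]
    if len(healthy) < num_lanes:
        raise ValueError("not enough spare lanes to repair the link")
    return {logical: physical for logical, physical in enumerate(healthy[:num_lanes])}


# One of three copies has two flipped bits; the other two copies win the vote.
voted = tmr_vote(0b1010, 0b1010, 0b0110)

# Four logical lanes, one spare, physical lane 2 flagged as failing.
lane_map = repair_lanes(num_lanes=4, spare_lanes=1, failing={2})
```

In this example `voted` is `0b1010` and `lane_map` routes logical lanes 2 and 3 to physical lanes 3 and 4, leaving the failing lane unused.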

Given that memories take up half of many chip designs, techniques have been developed to keep them reliable. These techniques include parity, error-correcting codes (ECC), built-in self-repair (BISR), and column/row redundancy.

The growing use of redundancy and the rise of 3D designs are increasing the importance of robust DFT and SLM solutions. The combination of SLM and traditional DFT provides a rich trove of data for detailed analysis. Gathering this data from “fleets” of deployed systems and not just individual chips enables even more actionable results. AI tools are finding new uses for DFT and SLM data and generating new metrics to improve chips across their lifecycles. I am certain that every future tutorial and talk I present will contain exciting new results.


