
Detecting Defect-Induced Silent Data Corruptions in CPUs (Stanford, Google)

Researchers from Stanford University and Google have published “ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions”.

Abstract

“Hyperscaler reports of silent data corruptions (SDCs)—presumed to be caused by silicon manufacturing defects—have motivated the development of functional tests for detecting defective CPUs and their use in hyperscaler fleet studies. Interestingly, all such tests seem to assume that defects induce consistent errors: two instances of the same instruction within the same thread, given the same architectural inputs, always produce the same wrong architectural output. We find that this assumption unnecessarily restricts which programs can serve as tests—biasing which defect-induced errors can manifest and get detected—and limits identification of affected instructions to those impacted by errors that short or targeted tests can reproduce—biasing how errors are characterized.

We present ITHICA, which automatically generates functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight, challenging the assumption above, is that the most pernicious defects—those most likely to escape manufacturing testing—cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and identifies affected instructions upon error detections, overcoming both aforementioned limitations of prior functional tests. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. ITHICA error checks detect 39% more defective servers than native checks within the ITHICA tests derived from our baseline programs, and enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.”
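The duplicate-and-compare idea described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's implementation: ITHICA inserts checks at the instruction level, whereas the code below models an "instruction" as a Python callable. The names `checked`, `FlakyAdder`, `ConsistentlyWrongAdder`, and `run_demo` are hypothetical, and the call counter is a stand-in for the execution context that the abstract says can influence a defect's behavior.

```python
from typing import Callable, List, Tuple


def checked(op: Callable[[int, int], int], a: int, b: int) -> Tuple[int, bool]:
    """Run `op` twice on the same inputs and compare the two outputs.

    Returns (first_result, mismatch_detected).
    """
    first = op(a, b)
    second = op(a, b)  # duplicated "instruction" with identical inputs
    return first, first != second


class FlakyAdder:
    """Hypothetical defective adder with an inconsistent error.

    The error depends on execution context (modeled here as the call
    count), not on the inputs.
    """

    def __init__(self, fail_on_call: int) -> None:
        self.calls = 0
        self.fail_on_call = fail_on_call

    def __call__(self, a: int, b: int) -> int:
        self.calls += 1
        result = a + b
        if self.calls == self.fail_on_call:
            result ^= 1  # flip the lowest bit on one specific call only
        return result


class ConsistentlyWrongAdder:
    """Hypothetical defective adder that is always wrong in the same way."""

    def __call__(self, a: int, b: int) -> int:
        return (a + b) ^ 1


def run_demo() -> Tuple[List[bool], List[bool]]:
    """Apply three checked additions to each adder and collect detections."""
    flaky = FlakyAdder(fail_on_call=3)
    flaky_detections = [checked(flaky, 2, 3)[1] for _ in range(3)]

    consistent = ConsistentlyWrongAdder()
    consistent_detections = [checked(consistent, 2, 3)[1] for _ in range(3)]
    return flaky_detections, consistent_detections


if __name__ == "__main__":
    print(run_demo())
```

In this toy model the duplicated check flags the inconsistent error on the one occasion it occurs and also pinpoints which operation was affected, while the consistently wrong adder passes every duplicate comparison. Duplication alone therefore only catches errors that differ between the two executions.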

Find the technical paper via the DOI link in the citation below. Published May 2026.

Vavelidou, Ioanna, Subho S. Banerjee, Eric X. Liu, Mike Fuller, Subhasish Mitra, and Caroline Trippel. “ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions.” arXiv:2605.15638, May 2026. https://doi.org/10.48550/arXiv.2605.15638.

 


