Why Your NoC Verification Strategy Must Consider Using Formal


By Ashish Darbari and Bing Xue It’d be inconceivable these days to design a modern high-performance SoC without a network-on-chip (NoC) fabric. AI hyperscalers are inherently multi-threaded and rely on using hundreds of processing elements (PEs). Crossbar-based fabric would just not scale. What also changes with the adoption of the NoC is how to handle coherency between PEs. ACE is no long... » read more

Detecting Defect-Induced Silent Data Corruptions in CPUs (Stanford, Google)


Researchers from Stanford University and Google have published “ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions”. Abstract “Hyperscaler reports of silent data corruptions (SDCs)—presumed to be caused by silicon manufacturing defects—have motivated the development of functional tests for detecting defective CPUs and their use in h... » read more

Ensuring AI Reliability: Mitigating Silent Data Corruption Risks


Silent Data Corruption (SDC) is an industry challenge affecting data centers worldwide with increasing frequency. This phenomenon stems from untraceable hardware failures that make detection notoriously difficult. SDCs don’t leave any record in system logs or trigger exception mechanisms. The corrupted data they produce can propagate unnoticed, causing cascading failures that often demand ext... » read more

Silent Data Corruption: A Major Reliability Challenge in Large-Scale LLM Training (TU Berlin)


A new technical paper, "Exploring Silent Data Corruption as a Reliability Challenge in LLM Training," was published by researchers at Technische Universitat Berlin. Abstract "As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults... » read more

Ensuring AI Reliability: Mitigating OCP’s Silent Data Corruption Risks


Silent Data Corruption (SDC) is an industry challenge affecting data centers worldwide with increasing frequency. This phenomenon stems from untraceable hardware failures that make detection notoriously difficult. SDCs don’t leave any record in system logs or trigger exception mechanisms. The corrupted data they produce can propagate unnoticed, causing cascading failures that often demand ext... » read more

Keeping The Lights On: How Digital Twins And Smart Semiconductor Management Power Our 24/7 World


Hey there, tech enthusiasts and digital pioneers! Have you ever stopped to think about the tiny, intricate components that keep our modern world humming? From the advanced safety features in your car to the massive data centers powering AI, semiconductors are truly the unsung heroes. But what happens when these tiny titans face immense pressure, like the non-stop demands of AI workloads? That's... » read more

Critical Optimization Factors For GenAI Chipmakers


Today’s GenAI arms race is fought with novel chip architectures and packaging. Specialized hardware designs are proliferating in the form of GPUs, TPUs, NPUs, and more, all tuned for parallelism and matrix-heavy AI math. In this hyper-competitive landscape, chip vendors scramble to differentiate their products on multiple fronts. They promise some mix of better performance, efficiency, or ... » read more

The Severity Of Test Escapes And SDCs Caused By Them (Google)


A new technical paper titled "Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing" was published by Google. Abstract "Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unadd... » read more

Efficient Failure-Detection Methods for GPU Control-Logic (Hitachi, Osaka Univ., Kyoto Univ.)


A new technical paper titled "A Hardware-Aware Failure-Detection Method for GPU Control-Logic" was published by researchers at Hitachi, Ltd., Osaka University, and Kyoto University. Excerpt "Various failure detection methods have been proposed for SDCs caused by faults in data units such as registers. However, effective methods for detecting SDCs resulting from faults in control logic, such... » read more

Silent Data Corruption


Everyone expects their compute systems to generate the correct answer. When they don't, it's cause for alarm, because it's not always clear how long the problem has persisted. Even worse, chips and systems are now so complex that it may require a unique sequence of operations to trigger a silent data error, and they may show up only occasionally, and maybe only after months or years of use in t... » read more

← Older posts