Silent Data Corruption: A Major Reliability Challenge in Large-Scale LLM Training (TU Berlin)


A new technical paper, "Exploring Silent Data Corruption as a Reliability Challenge in LLM Training," was published by researchers at Technische Universität Berlin. Abstract: "As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults...

Ensuring AI Reliability: Mitigating OCP’s Silent Data Corruption Risks


Silent Data Corruption (SDC) is an industry challenge affecting data centers worldwide with increasing frequency. This phenomenon stems from untraceable hardware failures that make detection notoriously difficult. SDCs don’t leave any record in system logs or trigger exception mechanisms. The corrupted data they produce can propagate unnoticed, causing cascading failures that often demand ext...

Keeping The Lights On: How Digital Twins And Smart Semiconductor Management Power Our 24/7 World


Hey there, tech enthusiasts and digital pioneers! Have you ever stopped to think about the tiny, intricate components that keep our modern world humming? From the advanced safety features in your car to the massive data centers powering AI, semiconductors are truly the unsung heroes. But what happens when these tiny titans face immense pressure, like the non-stop demands of AI workloads? That's...

Outsmarting Silent Data Corruption In AI Processors With Two-Stage Detection


Silent data corruption is on the rise following advancements in semiconductor technology. The explosion in AI for speech, image, video, and text processing leads to growing complexity and diversity of hardware systems, bringing an increased risk to data integrity. The SDC rate is much higher than software engineers expect, undermining the hardware reliability they used to take for granted. Rec...

How To Build Resilience Into Chips


Disaggregating chips into specialized processors, memories, and architectures is becoming necessary for continued improvements in performance and power, but it's also contributing to unusual and often unpredictable errors in hardware that are extremely difficult to find. The sources of those errors can include anything from timing errors in a particular sequence, to gaps in bonds between chi...

Hunting For Hardware-Related Errors In Data Centers


The semiconductor industry is urgently pursuing design, monitoring, and testing strategies to help identify and eliminate hardware defects that can cause catastrophic errors. Corrupt execution errors, also known as silent data errors, cannot be fully isolated at test, even with system-level testing, because they occur only under specific conditions. To sort out the environmental condit...