Silent Data Corruption Considerations For Advanced Node Designs


Ensuring reliability, availability, and serviceability (RAS) has long been an important consideration for many types of electronic systems, with major implications for chip design. Clearly, military hardware must be very reliable, and servers and automotive systems are also expected to be available constantly. Some amount of failure is inevitable, so being able to repair, avoid, or mitigate fau... » read more

Strategies For Detecting Sources Of Silent Data Corruption


Engineering teams are wrestling with how to identify the root causes of silent data corruption (SDC) in a timely and cost-effective way, but the solutions are turning out to be broader and more complex than simply fixing a single defect. This is particularly vexing for data center reliability, availability, and serviceability (RAS) engineering teams, because even the best tools and methodolo... » read more

Memory’s Future Hinges On Reliability


Experts at the Table: Semiconductor Engineering sat down to talk about the impact of power and heat on off-chip memory, and what can be done to optimize performance, with Frank Ferro, group director, product management at Cadence; Steven Woo, fellow and distinguished inventor at Rambus; Jongsin Yun, memory technologist at Siemens EDA; Randy White, memory solutions program manager at Keysight; a... » read more

AI Becoming More Prominent In Chip Design


Semiconductor Engineering sat down to talk about the role of AI in managing data and improving designs, and its growing role in pathfinding and preventing silent data corruption, with Michael Jackson, corporate vice president for R&D at Cadence; Joel Sumner, vice president of semiconductor and electronics engineering at National Instruments; Grace Yu, product and engineering manager at Meta... » read more

How To Build Resilience Into Chips


Disaggregating chips into specialized processors, memories, and architectures is becoming necessary for continued improvements in performance and power, but it's also contributing to unusual and often unpredictable errors in hardware that are extremely difficult to find. The sources of those errors can include anything from timing errors in a particular sequence, to gaps in bonds between chi... » read more

Hunting For Hardware-Related Errors In Data Centers


The semiconductor industry is urgently pursuing design, monitoring, and testing strategies to help identify and eliminate hardware defects that can cause catastrophic errors. Corrupt execution errors, also known as silent data errors, cannot be fully isolated at test — even with system-level testing — because they occur only under specific conditions. To sort out the environmental condit... » read more

Mitigating Silent Data Corruptions in High Performance Computing


A new technical paper titled "Mitigating silent data corruptions in HPC applications across multiple program inputs" was published by researchers at University of Iowa, Baidu Security, and Argonne National Lab. The paper was a Best Paper finalist at SC22. The researchers "propose MinpSID, an automated SID framework that automatically identifies and re-prioritizes incubative instructions in a... » read more

Silent Data Corruption


Defects can creep into chip manufacturing from anywhere, but the problem is getting worse at advanced nodes and in advanced packages where reduced pin access can make testing much more difficult. Ira Leventhal, vice president of U.S. Applied Research and Technology at Advantest America, talks about what’s causing these so-called silent data errors, how to find them, and why it now requires ma... » read more

Assuring Reliable Processor Performance At Scale


In today’s data center environment, resilience is key. Cloud providers are built on as-a-service business models, where uptime is critical to ensure their customers’ business continuity. Reputation and competitiveness require service at extremely high performance, low power, and increasing functionality, with zero tolerance for unplanned downtime or errors. If you’re a hyperscaler, o... » read more