Droop And Silent Data Corruption


By Aakash Jani and Lee Vick Let me set the scene. You are a child psychologist (played by, let’s say, Bruce Willis for illustrative purposes), and you are sitting next to a frightened kid. He turns to you and whispers, “I see dead bits.” Okay, I grant you that’s not exactly the quote, but data center operators are seeing transient errors at an alarming rate, and at scale. These error... » read more

Chip Aging Becoming Key Factor In Data Center Economics


Chip aging is becoming a much bigger concern inside of data centers, where it can impact server uptime, utilization rates, and the amount of energy needed to drive signals and cool entire server racks. Aging in chips is the result of both higher logic utilization and increasing transistor density. This is problematic for data centers, in general, but especially for AI chips where digital log... » read more

Silent Data Corruption Considerations For Advanced Node Designs


Ensuring reliability, availability, and serviceability (RAS) has long been an important consideration for many types of electronic systems, with major implications for chip design. Clearly, military hardware must be very reliable, and servers and automotive systems are also expected to be available constantly. Some amount of failure is inevitable, so being able to repair, avoid, or mitigate fau... » read more

Strategies For Detecting Sources Of Silent Data Corruption


Engineering teams are wrestling with how to identify the root causes of silent data corruption (SDC) in a timely and cost-effective way, but the solutions are turning out to be broader and more complex than simply fixing a single defect. This is particularly vexing for data center reliability, accessibility and serviceability (RAS) engineering teams, because even the best tools and methodolo... » read more

Pinpointing Timing Delays Can Improve Chip Reliability


Growing pressure to improve IC reliability in safety- and mission-critical applications is fueling demand for custom automated test pattern generation (ATPG) to detect small timing delays, and for chip telemetry circuits that can assess timing margin over a chip's lifetime. Knowing the timing margin in signal paths has become an essential component in that reliability. Timing relationships a... » read more

What Data Center Chipmakers Can Learn From Automotive


Automotive OEMs are demanding their semiconductor suppliers achieve a nearly unmeasurable target of 10 defective parts per billion (DPPB). Whether this is realistic remains to be seen, but systems companies are looking to emulate that level of quality for their data center SoCs. Building to that quality level is more expensive up front, although ultimately it can save costs versus having to ... » read more

How To Build Resilience Into Chips


Disaggregating chips into specialized processors, memories, and architectures is becoming necessary for continued improvements in performance and power, but it's also contributing to unusual and often unpredictable errors in hardware that are extremely difficult to find. The sources of those errors can include anything from timing errors in a particular sequence, to gaps in bonds between chi... » read more

Hunting For Hardware-Related Errors In Data Centers


The semiconductor industry is urgently pursuing design, monitoring, and testing strategies to help identify and eliminate hardware defects that can cause catastrophic errors. Corrupt execution errors, also known as silent data errors, cannot be fully isolated at test — even with system-level testing — because they occur only under specific conditions. To sort out the environmental condit... » read more

Silent Data Corruption


Defects can creep into chip manufacturing from anywhere, but the problem is getting worse at advanced nodes and in advanced packages where reduced pin access can make testing much more difficult. Ira Leventhal, vice president of U.S. Applied Research and Technology at Advantest America, talks about what’s causing these so-called silent data errors, how to find them, and why it now requires ma... » read more

IC Stresses Affect Reliability At Advanced Nodes


Thermal-induced stress is now one of the leading causes of transistor failures, and it is becoming a top focus for chipmakers as more and different kinds of chips and materials are packaged together for safety- and mission-critical applications. The causes of stress are numerous. In heterogeneous packages, it can stem from multiple components composed of different materials. “These materia... » read more

← Older posts