Advanced silicon lifecycle analytics and on-die telemetry are needed to counter minor but still impactful voltage and frequency perturbations.
By Aakash Jani and Lee Vick
Let me set the scene. You are a child psychologist (played by, let’s say, Bruce Willis for illustrative purposes), and you are sitting next to a frightened kid. He turns to you and whispers, “I see dead bits.” Okay, I grant you that’s not exactly the quote, but data center operators are seeing transient errors at an alarming rate, and at scale. These errors are colloquially known as silent data corruption (SDC) and are a serious threat to reliability, availability, and serviceability (RAS) in data centers today, as cited independently by Meta [1], Google [2], and Intel [3]. Much of the research on SDCs has focused on data centers, but the same threats appear in many other designs at scale, including automotive and edge compute.
The really spooky part of SDCs is that they are not caught in testing during manufacturing, hence the “silent.” In fact, as figure 1 shows, they are almost impossible to capture because they are ephemeral and depend on particular combinations of on-chip variation, age, workload, and other factors. But at the scale of thousands of trillions of operations per second, operators are seeing hundreds of SDCs per day, which can lead to data loss or even broader impacts on strategic business operations.
Fig. 1: Rate of defect screening (with Intel’s datacenter diagnostic tool test) on third-generation Intel Xeon Scalable SoCs. (Source: Intel and SemiEngineering)
The root causes of SDCs can vary from physical to electrical. Without causality, we are left studying correlation, which leads me to my third three-letter acronym: SLM, for silicon lifecycle management (or analytics). As an industry, we are moving towards more sensors providing more on-die telemetry to feed more analytics engines, all to capture the factors that correlate with SDC events. Two of those correlating factors happen to be voltage and frequency [1,2].
A transient and measurable voltage fluctuation is colloquially known as a droop [4]. Droops have been shown to create setup and hold violations, which can affect the functionality of your circuitry. Traditionally, design teams mitigate droop either by adding voltage margin (which burns more energy and reduces reliability, lifetime, and competitiveness) or by scheduling workloads so they don’t flip all the transistors in a localized region simultaneously (which reduces performance and increases the software burden and risk).
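To make the setup-violation mechanism concrete, here is a minimal sketch of how a droop can eat the timing slack on a single path. It assumes the textbook alpha-power delay model, and every name and number in it (supply voltage, threshold voltage, path delays, clock period) is a hypothetical value chosen for illustration, not data from any real design. It also scales only the logic delay with voltage, which is a simplification.

```python
def gate_delay_scale(v, v_nominal=0.75, v_th=0.35, alpha=1.3):
    """Logic delay at supply v relative to delay at v_nominal.

    Uses the alpha-power law, delay ~ V / (V - Vth)**alpha.
    All default values are illustrative assumptions.
    """
    if v <= v_th or v_nominal <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def raw(x):
        return x / (x - v_th) ** alpha

    return raw(v) / raw(v_nominal)


def setup_slack_ps(v, clock_period_ps=500.0, logic_delay_nominal_ps=430.0,
                   clk_to_q_ps=30.0, setup_ps=25.0):
    """Setup slack in picoseconds for one hypothetical path at supply v.

    Negative slack means the data arrives too late: a setup violation.
    """
    logic_ps = logic_delay_nominal_ps * gate_delay_scale(v)
    return clock_period_ps - (clk_to_q_ps + logic_ps + setup_ps)


if __name__ == "__main__":
    for droop_pct in (0, 2, 5):
        v = 0.75 * (1 - droop_pct / 100)
        print(f"{droop_pct}% droop: slack = {setup_slack_ps(v):.1f} ps")
```

Under these assumed numbers the path meets timing at nominal voltage but goes negative at a 5% droop. Closing that gap with extra voltage margin is exactly the energy cost described above.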
Larger voltage droops can be modeled and are often fixed in the pre-silicon stage, but smaller perturbations in voltage, heat, or resistance can also lead to SDCs and aren’t as easily modeled [5]. These minor perturbations need to be monitored at-test and in-field with clock- and droop-specific metrics to develop a clearer picture for SDC characterization.
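As a rough idea of what a droop-specific metric might look like downstream of a sensor, the sketch below scans a series of sampled supply voltages and reports each excursion below a threshold, along with its duration and depth. The sample format, the 3% threshold, and the `DroopEvent` and `detect_droops` names are all assumptions made for this example; they do not describe any particular vendor's telemetry interface.

```python
from dataclasses import dataclass


@dataclass
class DroopEvent:
    start_index: int       # index of the first sample below the limit
    duration_samples: int  # number of consecutive samples below the limit
    min_mv: float          # deepest voltage seen during the event


def detect_droops(samples_mv, nominal_mv, threshold_fraction=0.03):
    """Group consecutive below-limit samples into droop events.

    The limit is nominal_mv * (1 - threshold_fraction); the 3% default
    is an arbitrary illustrative choice.
    """
    limit = nominal_mv * (1.0 - threshold_fraction)
    events = []
    current = None
    for i, v in enumerate(samples_mv):
        if v < limit:
            if current is None:
                current = DroopEvent(i, 1, v)
            else:
                current.duration_samples += 1
                current.min_mv = min(current.min_mv, v)
        elif current is not None:
            events.append(current)
            current = None
    if current is not None:  # trace ended while still in a droop
        events.append(current)
    return events


if __name__ == "__main__":
    trace = [750, 748, 720, 705, 731, 749, 751, 726, 750]  # made-up samples, mV
    for event in detect_droops(trace, nominal_mv=750):
        print(event)
```

Per-event depth and duration are the kind of features an analytics engine could then line up against workload, temperature, and SDC reports when looking for correlations.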
At this point, it’s fair to assume that you will need to go back to your IP vendors and ask for all these specific sensors in addition to your foundational IP. That may not necessarily be your only path forward. The more modern iterations of foundational IP are shifting towards integrating observability that feeds into large silicon analytics models, providing our best hope for finding these mercurial gremlins infesting our advanced designs.
For example, at Movellus we offer clock health telemetry as part of your clock source to provide insight into localized clocking behavior across your chip. Our integrated droop response system provides a full suite of advanced droop telemetry to monitor, and mitigate, your power delivery network’s behavior. IP for the silicon analytics world is undergoing a paradigm shift to provide core functionality with additional observability, and soon IP without these features will be seen as very outdated and incapable of satisfying modern design requirements.
The challenges posed by silent data corruption (SDC) require a multi-faceted approach involving advanced silicon lifecycle analytics and on-die telemetry. Voltage and frequency perturbations, significant contributors to SDCs, need precise monitoring through functional and observable IP solutions that offer application-specific telemetry.
Let us help you see the dead bits instead of the ghostly child psychologist. And if you are already seeing ghosts, don’t call us. Call a psychologist, because in the movie that vision was very rare and was the source of many problems. But as we have shown, that level of insight into your silicon is no longer a nice-to-have. It is a critical need, and not having it is where the true terror lies.
Lee Vick is vice president of strategic marketing at Movellus, responsible for partnerships, ecosystem, and business development efforts. Previously he managed business development for all SLM products at Synopsys after the acquisition of Moortec, where he was vice president of North America sales. Prior to that, he spent 15 years at Tensilica (acquired by Cadence in 2013), and has served in strategic planning, engineering, and architecture roles at Intel, Compaq, TI, and the Dept. of Defense. Vick has a BSEE from the University of Texas.