Use data to manage data center infrastructure.
In today’s data center environment, resilience is key. Cloud providers are built on as-a-service business models, where uptime is critical to their customers’ business continuity. Reputation and competitiveness demand extremely high performance, low power, and ever-increasing functionality, with zero tolerance for unplanned downtime or errors.
If you’re a hyperscaler, or a chip supplier to hyperscalers, then you already understand that any system with hundreds of thousands of servers will occasionally encounter a CPU that doesn’t do what it should. In some cases, these errors happen in a way that cannot immediately be detected or flagged. Sometimes a calculation gives the wrong result. Other times an instruction doesn’t behave exactly as it should. In certain cases, the error is inconsistent, making it even harder to find than a problem that “always happens when I do this!”
When such issues occur, they can be extraordinarily time consuming to track down.
Such issues aren’t the fault of the chip’s design, or of the way it’s used, but are an artifact of the semiconductor industry’s use of ever-shrinking process geometries to squeeze costs down and drive performance ever higher. As processes shrink, the behavior of any chip becomes increasingly variable. Add to this the fact that certain parameters change after extended use, and you have a recipe for issues that are as difficult to pinpoint as they are costly to endure.
Take the case of Facebook, as described in the company’s February 2021 paper, “Silent Data Corruptions at Scale” [1]. A firmware upgrade suddenly started to cause random issues in one of the company’s data centers. Over the course of 18 months the problem was eventually isolated to one core in one processor chip, which caused a math routine written in Scala to return an erroneous result for one very specific calculation. The integer part of 1.1^53 was returned as zero when it should have been 156. Nothing in the system flagged this as an error. That core gave the correct result for 1.1^52, but answered zero when the exponent was 53. Since this number was returned as the size of a file after compression, the system software decided that the file didn’t exist and it was effectively lost. Not nice!
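To make the failure mode concrete, a known-answer check for that exact computation might look like the minimal Scala sketch below. The expected value comes from Facebook’s account; the check itself is an illustration, not Facebook’s actual tooling.

```scala
// A minimal sketch of a known-answer self-test for the computation described in the paper.
// On a healthy core the integer part of 1.1^53 is 156; the defective core silently returned 0.
val expected = 156
val result = math.pow(1.1, 53).toInt
if (result != expected)
  println(s"Silent data corruption suspected: got $result, expected $expected")
```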
Google shared a similar story in June 2021 with the paper “Cores that Don’t Count” [2]. In Google’s case, “mercurial” cores executed instructions differently than expected, yet each problem was confined to a specific core within a processor chip, making it slightly easier to locate than Facebook’s culprit. Google named these “Corrupt Execution Errors,” or CEEs, a newly identified cause of some of the silent data corruptions (SDCs) reported by Facebook. They can produce wrong answers that are never detected. CEEs are difficult to remedy because, unlike errors in storage, memory, or I/O, error correction cannot be applied. Google expects the problem to worsen as silicon processes shrink. After “many engineer-decades,” the team investigating the issue was able to track down the faulty core and isolate it.
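Because such cores fail intermittently, screening for them amounts to repeating known-answer computations and watching for occasional disagreement. The following Scala sketch is illustrative only, not Google’s methodology:

```scala
// Illustrative screening sketch, not Google's tooling: repeat a known-answer computation
// many times and count intermittent mismatches, the signature of a "mercurial" core.
// Pinning the run to one core at a time would be done at the OS level (e.g. taskset on Linux).
val expected = 156
val trials = 1000000
val failures = (1 to trials).count(_ => math.pow(1.1, 53).toInt != expected)
if (failures > 0)
  println(s"$failures of $trials trials disagreed with the known answer; flag this core")
```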
Microsoft presented its thoughts at the OCP Global Summit in May 2020 [3]. The company didn’t disclose a specific issue it had tackled, but instead proposed a uniform way of reporting errors, because large systems experience issues on a regular basis. It outlined the challenges of current cloud-scale hardware fault diagnostics, citing the lack of clear and comprehensive error reporting, gaps in the telemetry needed to root-cause new failures, and custom error log formats that stem from the absence of standards-based error reporting. These lead to high rates of “no problem found,” high replacement rates of hardware components, long lead times for identifying mitigations for hardware failures, and the need to integrate multiple vendor tools that are not cloud-ready. Microsoft hopes the industry can come together to report errors in a uniform way, as a matter of course, so that they can be corrected by software.
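As a rough illustration of what such uniform reporting could look like, the sketch below defines a vendor-neutral error record in Scala; the field names are assumptions made for illustration, not an actual OCP or Microsoft schema.

```scala
import java.time.Instant

// Sketch of a uniform, vendor-neutral hardware error record (field names are illustrative).
case class HardwareErrorReport(
  timestamp: Instant,
  hostId: String,
  component: String,                // e.g. "cpu", "dimm", "nic"
  location: String,                 // e.g. "socket0/core17"
  errorClass: String,               // e.g. "corrected", "uncorrected", "silent-data-corruption"
  vendor: String,
  firmwareVersion: String,
  rawTelemetry: Map[String, String] // vendor-specific detail, preserved but clearly separated
)

val report = HardwareErrorReport(Instant.now(), "host-1234", "cpu", "socket0/core17",
  "silent-data-corruption", "VendorX", "fw-2.1.7", Map("testId" -> "pow-1.1-53"))
```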
As semiconductor processes shrink, the issues facing chip providers get increasingly complex. Noise worsens, metal lines change their resistance over time, and trapped charges alter the behavior of transistors the more they are used. Probability looms large as these effects combine to cause unanticipated issues in chips that do well against very thorough production test suites. Meanwhile, semiconductor process disciplines must continually grow stricter, since it becomes increasingly difficult to make transistors work as planned at today’s process geometries. The variability from one chip to the next increases enough to make it difficult to predict whether a chip will perform to its specifications, or whether it can even be tested for reliability at all.
These issues are new and increasing in frequency. The hyperscalers mentioned above, as well as others, are the first to identify this growing concern and report on it. But it is certainly not a one-time enigma to be shrugged off; the industry will need to address it sooner rather than later. It’s no wonder that today’s data center networks are built on an architecture of redundancy, and a highly expensive one at that.
To make matters worse, issues don’t always show up immediately. In some cases, a data center that has been performing perfectly will suddenly start to misbehave when a small and well-tested software patch is deployed. In other cases, a smooth-running data center will start experiencing random errors, at first very rarely, but then becoming more frequent over time.
Let’s walk through some examples:
At the October 2019 Electronic Design Process Symposium, a different Google team [4] reported on the sundry error mechanisms that cause chip failures over time. The presenter listed seven “Main Reliability Degradation Mechanisms”: NBTI, PBTI, RTN, HCI, TDDB, EM, and SER. Without going into detail, note that four of these, NBTI, PBTI, HCI, and TDDB, are all caused by trapped charges, which accumulate the longer a chip is powered on. RTN, Random Telegraph Noise, is inevitable in any electronic system and is purely random; it cannot be avoided, and measures must be taken during a chip’s design to reduce the likelihood that it will disrupt the processor. EM, electromigration, increases the resistance of a metal line over time and is caused by high current densities in the chip’s metal interconnect. As chip processes shrink, current density increases and the problem worsens. Chipmakers moved from aluminum to copper to address this issue over a decade ago, but it has once again become problematic, so there is talk of moving to other materials. Finally, SER, the soft error rate, is caused by radiation and is also becoming a problem in terrestrial systems.
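To give a feel for how one of these mechanisms is reasoned about, electromigration lifetime is commonly estimated with Black’s equation, MTTF = A · J^(−n) · exp(Ea / kT). The Scala sketch below uses purely illustrative constants, not any foundry’s data, to show how sensitive lifetime is to current density.

```scala
// Back-of-the-envelope sketch of Black's equation for electromigration lifetime:
// MTTF = A * J^(-n) * exp(Ea / (k * T)). All constants are illustrative, not process data.
def emMttf(currentDensity: Double, tempKelvin: Double,
           a: Double = 1.0, n: Double = 2.0, activationEnergyEv: Double = 0.9): Double = {
  val boltzmannEv = 8.617e-5 // Boltzmann constant in eV/K
  a * math.pow(currentDensity, -n) * math.exp(activationEnergyEv / (boltzmannEv * tempKelvin))
}

// Relative lifetime after a 30% increase in current density at 85 °C (358 K):
val ratio = emMttf(1.3, 358.0) / emMttf(1.0, 358.0)
println(f"Lifetime falls to roughly ${ratio * 100}%.0f%% of its previous value") // ~59%
```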
Hyperscalers, as well as chipmakers, are paying a lot of attention to all of these mechanisms, because data centers can be brought to their knees by these statistical deviations. Facebook explained that SDC issues can be caused by device errors, early life failures, degradation over time, and end-of-life wear-out. Google said that “Faulty cores typically fail repeatedly and intermittently, and often get worse with time.”
These hyperscalers want their systems to be able to recognize these errors and report them. It would be even better if the errors could be predicted. With appropriate monitoring techniques, this could very well be achieved.
How can such monitoring be done?
Up until now, companies could not detect the SDC and CEE failures described by Facebook and Google. But today, highly sophisticated SoCs can be made to serve as sensors of the systems they power. They can be internally monitored for everything from variations across production lots to material degradation, aging, software effects, application stress, latent production defects, and environmental impact.
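As a drastically simplified illustration of the idea (not proteanTecs’ implementation), on-chip margin measurements gathered over time can be compared against a fleet baseline, and a device flagged when its margin drifts too far:

```scala
// Illustrative sketch only: flag devices whose latest on-chip timing-margin reading
// has drifted too far below a fleet baseline. Names and thresholds are assumptions.
final case class Reading(deviceId: String, epochDay: Long, timingMarginPs: Double)

def flagDrifting(history: Seq[Reading], fleetBaselinePs: Double, maxDropPs: Double): Set[String] =
  history.groupBy(_.deviceId).collect {
    case (id, readings) if fleetBaselinePs - readings.maxBy(_.epochDay).timingMarginPs > maxDropPs => id
  }.toSet

val readings = Seq(
  Reading("soc-01", 1, 92.0), Reading("soc-01", 200, 88.5),
  Reading("soc-02", 1, 91.0), Reading("soc-02", 200, 71.0)) // soc-02 has degraded sharply
println(flagDrifting(readings, fleetBaselinePs = 90.0, maxDropPs = 10.0)) // Set(soc-02)
```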
Today, this kind of solution is available. Proteus, the proteanTecs predictive analytics engine, collects deep data over time through ML-driven on-chip telemetry, identifying high-risk issues and proactively tracking them based on learnings. Leveraging chip telemetry data, historical data, predictive modeling, and machine learning algorithms, the platform enables manufacturers and brand owners to identify systematic issues and take action with swift root cause analysis, extending system lifetime and preventing epidemics. Through continuous monitoring, it is also possible to validate software upgrades on existing hardware to ensure reliable performance.
SDCs and CEEs can be correlated with other parameters within the chip and the system it’s embedded in, so processor behavior can be predicted and field errors flagged even before they occur. This gives everyone, from chip developers introducing a new processor to system administrators who have been using it for years, the ability to understand how their systems perform today and are likely to perform tomorrow. Proteus integrates with data center management systems to deploy automated actions based on the learned insights.
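A toy version of that kind of prediction, again an illustration rather than the Proteus algorithm, is to fit a linear trend to a device’s margin history and estimate when it would cross an assumed failure threshold, so maintenance can be scheduled ahead of the first error:

```scala
// Illustrative sketch: least-squares linear trend over a device's margin history,
// extrapolated to estimate the day it would cross an assumed failure threshold.
def predictedCrossingDay(days: Seq[Double], marginsPs: Seq[Double], thresholdPs: Double): Option[Double] = {
  val n = days.size.toDouble
  val (meanDay, meanMargin) = (days.sum / n, marginsPs.sum / n)
  val slope = days.zip(marginsPs).map { case (d, m) => (d - meanDay) * (m - meanMargin) }.sum /
              days.map(d => (d - meanDay) * (d - meanDay)).sum
  val intercept = meanMargin - slope * meanDay
  if (slope >= 0) None                         // no downward trend, nothing to flag
  else Some((thresholdPs - intercept) / slope) // day the fitted line hits the threshold
}

// Margin shrinking from ~90 ps toward an assumed 80 ps failure threshold:
println(predictedCrossingDay(Seq(0.0, 100.0, 200.0), Seq(90.0, 85.0, 80.2), 80.0)) // Some(~203)
```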
Full lifecycle health and performance monitoring for reliability assurance.
In addition, monitoring similar devices provides accurate per-device analysis for liability assessment. By detecting faulty CPUs and correlating them with other CPUs that share the same characteristics, and are therefore potentially prone to the same errors, faults can be prevented.
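A simple sketch of that correlation step, with hypothetical field names rather than an actual product interface, might look like this:

```scala
// Hypothetical sketch: once a CPU is confirmed faulty, find fleet peers that share its
// production lot and operating profile so they can be proactively screened or replaced.
case class CpuProfile(id: String, lot: String, speedBin: String, avgTempC: Double)

def peersOf(faulty: CpuProfile, fleet: Seq[CpuProfile], tempToleranceC: Double = 5.0): Seq[String] =
  fleet.filter(c => c.id != faulty.id &&
                    c.lot == faulty.lot &&
                    c.speedBin == faulty.speedBin &&
                    math.abs(c.avgTempC - faulty.avgTempC) <= tempToleranceC)
       .map(_.id)
```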
With targeted applications for performance monitoring, degradation monitoring, and root-cause analysis, predictive maintenance is enabled with a feedback mechanism to the supply chain. Designs, manufacturing processes, test programs, and in-field health tracking become hyper-personalized and predictive.