Data Centers Need High Reliability Semiconductors

Nvidia CEO Jensen Huang: “…you’ve got to be perfect.”


“In the world of designing cars, planes, AI factories … you’ve got to be perfect,” said Nvidia CEO Jensen Huang on CNBC last month. “And the reason for this is because there is so much at stake.”

Cars and planes need to be extremely reliable because people die if they aren’t. In AI data centers, no one dies when systems fail, but the economic impact is enormous. Amazon, Google, and Microsoft are trillion-dollar market cap companies, and their customers rely on them to power giant economic engines, which make no money when systems are down.

For example, in early December, a 10-hour outage at an Illinois data center halted global trading in currency and commodity markets, from gold to oil to interest rates.

Data center reliability standards & strategies

Cloud Service Providers operate hundreds of huge data centers across the globe connected by thousands of miles of fiber optics. These are the world’s largest and most complex computers.

Fig. 1: Google data centers — 29 locations in 11 countries. Source: Google

Data center infrastructure is designed for very high reliability, with several options. For example, Google offers options ranging from 99.9% uptime, with a maximum of 43 minutes of downtime per month, to 99.999% uptime, with a maximum of just 26 seconds of downtime per month. My laptop crashes more than that. The higher reliability is achieved by straddling several regions (data centers), with software that can quickly shift loads between data centers to avoid a single point of failure. Storage is duplicated, so if one copy is lost or unavailable, operation continues with the other. This redundancy has a cost in parallel compute and storage. If you are hosting a global trading platform, the cost is worth it.
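The downtime figures follow directly from the uptime percentages. A minimal sketch of the arithmetic, assuming a 30-day month (the function name and the month length are illustrative choices, not taken from Google's service terms):

```python
# Maximum downtime implied by an uptime target.
# Assumption: a 30-day month; real service terms may define the period differently.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def max_downtime_seconds(uptime_percent: float) -> float:
    """Return the maximum downtime per 30-day month, in seconds."""
    return SECONDS_PER_MONTH * (1 - uptime_percent / 100)


for pct in (99.9, 99.99, 99.999):
    secs = max_downtime_seconds(pct)
    print(f"{pct}% uptime -> {secs / 60:.1f} min ({secs:.1f} s) per month")
```

With these assumptions, 99.9% works out to roughly 43 minutes per month and 99.999% to roughly 26 seconds, in line with the numbers quoted above.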

Data centers are much more than just semiconductors. For the highest reliability, data centers have redundant cooling systems. If one fails, the other takes over. Power distribution is also redundant, with spare units that kick in if needed. And if grid power fails, batteries or generators take over.

Fig. 2: Third-generation cooling units for Ironwood TPU Superpods. Source: Google

The high-level strategy for semiconductor reliability is similar to other parts of the data center:

  1. Design components for very high reliability;
  2. Design components and systems to identify early signs of failure so they can be fixed before a failure occurs; and
  3. Add redundancy so that if a component does fail during operation, it is identified rapidly and a backup takes over.

Semiconductor architecture strategies for data center reliability

Data center chips need to be designed to be as reliable as possible, but failures happen. So data center chips and subsystems need to be architected to be fault tolerant.

Data centers have thousands of identical servers, switches, etc. If a server or rack isn’t functional, workloads can be mapped around it.

ECC: Data center CPUs use ECC memory for high reliability. Ever since HBM2, HBM has incorporated on-die ECC. HBM3 uses more robust Reed-Solomon codes. HBM also has redundant data bus lanes, so if a lane fault occurs during operation, the data can be remapped to a spare functional lane.
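To make the idea of error correction concrete, here is a toy Hamming(7,4) code, which stores 4 data bits with 3 parity bits and can correct any single flipped bit. This is a teaching sketch only. It is not the code used in server memory or HBM, which rely on wider codes such as the Reed-Solomon codes mentioned above, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a single-bit fault in memory
recovered, position = hamming74_decode(stored)
print(recovered == word, "corrected bit at position", position)
```

The same principle, detecting which bit is wrong from the pattern of failed parity checks, scales up to the stronger codes used in production memory.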

Scale-up Network Redundancy: NVLink is a major competitive advantage for Nvidia, allowing much bigger pod sizes with very low latency between GPUs. But why does Nvidia’s NVL72 have 72 GPUs and not 64? Nvidia’s recommendation is to run with 64 of the GPUs, keeping 8 as spares (or on stand-by, running lower priority, pre-emptible workloads). Similarly, there are 18 switches, even though only 16 of them are needed for 64 GPUs. In NVLink, every switch is connected to every GPU. This allows the bandwidth between GPUs to be modulated, but it also means that a failed switch can be mapped around without compromising performance. While the NVL72 keeps running, the failing switch or compute tray can be hot swapped to restore full redundancy for maximum reliability.
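The value of the 8 spare GPUs can be estimated with a simple "at least k of n" calculation. The sketch below assumes each GPU is independently available with the same probability, and the 99% figure is a made-up number for illustration, not Nvidia data. Real failures are often correlated, so treat the result as a rough intuition only.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent units are up, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_gpu = 0.99  # hypothetical per-GPU availability, for illustration only

no_spares = prob_at_least(64, 64, p_gpu)    # all 64 of 64 must work
with_spares = prob_at_least(64, 72, p_gpu)  # any 64 of 72 is enough

print(f"64 of 64 available: {no_spares:.4f}")
print(f"64 of 72 available: {with_spares:.6f}")
```

Under these assumptions, a pod that needs every one of 64 GPUs is fully available only about half the time, while a pod that needs any 64 of 72 is available almost always.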

A few months ago, SemiAnalysis reported that signal integrity issues with the NVL72 backplane, at least as of that time, were resulting in data errors that can take hours to isolate and fix. NVL72 fixes take an order of magnitude longer than fixes in the earlier generation. As electrical frequency rises to deliver higher pod performance, the reliability of data transmission decreases due to signal integrity issues. Optical transmission has no cross-coupling or electromagnetic signal integrity issues, so switching to optics will be required to improve reliability. Optics’ longer reach will also be needed to increase pod size.

Scale-out Network Redundancy: Scale-out systems today are primarily based on Ethernet, which is packet-based and ensures delivery of packets with retries and alternate routings, if required. Error checking and correction of the data payloads is done on every packet. RSTP (rapid spanning tree protocol) enables shifting from a failing primary path to an alternate in milliseconds. The robustness of the network is very high, but it has a cost in latency. Still, this is how all data centers connect racks and rows today.
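The check-then-retry idea can be sketched in a few lines. The example below appends a CRC-32 checksum to each payload, verifies it on receipt, and falls back to an alternate path when the check fails. It only detects errors (it does not correct them), and the path functions and names are hypothetical stand-ins for real links, not a model of any particular Ethernet stack.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def send_with_failover(payload: bytes, paths):
    """Try each path in order; return (payload, attempt number) on first clean delivery."""
    frame = make_frame(payload)
    for attempt, path in enumerate(paths, start=1):
        delivered = check_frame(path(frame))
        if delivered is not None:
            return delivered, attempt
    raise ConnectionError("all paths failed")


def flaky_path(frame: bytes) -> bytes:
    """Simulated link that flips one bit in transit."""
    return bytes([frame[0] ^ 1]) + frame[1:]


def clean_path(frame: bytes) -> bytes:
    """Simulated link that delivers the frame unchanged."""
    return frame


print(send_with_failover(b"trade order", [flaky_path, clean_path]))
```

Each failed attempt adds a round of checking and resending, which is one way to see where the latency cost of this robustness comes from.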

Optical Circuit Switches: At the UBS Technology Conference in December, Coherent CEO Jim Anderson said, “We love OCS.” OCS = Optical Circuit Switches. Google pioneered OCS, deploying it with its TPU super-pods. A circuit switch allows hundreds of fiber inputs to be rerouted in milliseconds to hundreds of fiber outputs. This has many benefits. One such benefit is enabling rapid re-routing of high-bandwidth data around failed chips.

Fig. 3: Google Cloud OCS enables rapid link reconfiguration. Source: Google
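Conceptually, an optical circuit switch is a reconfigurable map from input fibers to output fibers. The toy model below shows re-routing a circuit away from a failed endpoint to a spare. The class, method, and port names are all hypothetical, and this is not a description of Google's implementation.

```python
class OpticalCircuitSwitch:
    """Toy cross-connect: each input fiber maps to at most one output fiber."""

    def __init__(self) -> None:
        self.connections = {}  # input port -> output port

    def connect(self, in_port: str, out_port: str) -> None:
        # An output fiber can carry only one circuit at a time.
        for other_in, other_out in self.connections.items():
            if other_out == out_port and other_in != in_port:
                raise ValueError(f"{out_port} is already in use by {other_in}")
        self.connections[in_port] = out_port

    def reroute_around(self, failed_out: str, spare_out: str) -> list:
        """Move every circuit that ends at failed_out over to spare_out."""
        moved = [i for i, o in self.connections.items() if o == failed_out]
        for in_port in moved:
            self.connect(in_port, spare_out)
        return moved


ocs = OpticalCircuitSwitch()
ocs.connect("chip0_tx", "chip1_rx")
ocs.connect("chip2_tx", "chip3_rx")

# chip1 fails: point its incoming circuit at a spare chip instead.
moved = ocs.reroute_around("chip1_rx", "spare_rx")
print(moved, ocs.connections)
```

Because only the mapping changes, the rest of the circuits are untouched while the failed chip is bypassed.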

Hot Swappable: If at all possible, systems should be designed to be modular and hot swappable, so a part that needs to be replaced can be swapped quickly and easily with minimum disruption.

Semiconductor component design for reliability

Unlike in most other semiconductor applications, mechanical engineering is increasingly important for data center reliability. AI accelerators now have multiple XPU and HBM dies on a CoWoS interposer, on an organic substrate, in a package that is soldered to the PC board. The variation in materials and temperatures between the elements of this “sandwich,” and the thousands of bonds between the layers, create a risk of warping and of breaks in physical connections.

Some aspects of data center operation are less demanding with respect to reliability:

Operating Temperature: Nvidia Blackwell GPUs have a peak operating junction temperature (Tj) of 85C. (Junction temperature is the temperature of the transistor.) AMD Epyc processors typically operate at a maximum of 95C Tj, but can run briefly at up to 105C Tj. These temperatures are much lower than in automotive (up to 125C Tj) because 1) leakage power grows exponentially with temperature; 2) reliability goes down as temperature rises due to metal migration and other mechanisms; and 3) expensive cooling systems are economically viable in a data center to keep power down and reliability up.

Operating Life: Autos require an operating life of 10, 15, or 20 years. But in data centers, the operating life is much shorter. The Wall Street Journal recently discussed the useful lives that the major hyperscalers estimate for their equipment for accounting purposes, which are in the range of 5 to 6 years. In this sense, data centers are like iPhones. In 5+ years there will be something much better, so it is economical to upgrade rather than run old technology, especially given the power constraints in most parts of the world. Even with a short life, design for reliability is critical to ensure failures during the operating life are as low as possible.

Extensive Reliability Data: On the flip side, an operating life of 5 years means that when new accelerators/CPUs/networks are deployed, they must be brought up into operation very rapidly. This is like an iPhone ramp.

Hyperscalers want the best technology, but will only deploy it when there is extensive reliability data available.

For every semiconductor component, customers will want to see extensive reliability and stress testing resulting in very low FIT rates (failures in time, meaning failures per billion device-hours). This can involve high-temperature, at-frequency testing of thousands or tens of thousands of devices over months, at high cost.
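As a rough illustration of how a FIT rate falls out of a stress test, the sketch below computes a point estimate from failures and device-hours. It also applies an Arrhenius acceleration factor, a common model for temperature-accelerated testing that is brought in here as an assumption and is not described in this article. The activation energy, temperatures, and sample counts are made-up values, and real reliability reports typically use statistical confidence bounds rather than a simple point estimate.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(t_use_c: float, t_stress_c: float, ea_ev: float) -> float:
    """Acceleration factor of a stress temperature relative to a use temperature."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1 / t_use - 1 / t_stress))


def fit_rate(failures: int, devices: int, hours: float, af: float = 1.0) -> float:
    """Point-estimate FIT: failures per billion equivalent device-hours."""
    return failures / (devices * hours * af) * 1e9


# Hypothetical test: 10,000 devices for 2,000 hours at 125C, 1 failure,
# use condition 85C, assumed activation energy 0.7 eV.
af = arrhenius_af(t_use_c=85, t_stress_c=125, ea_ev=0.7)
print(f"acceleration factor: {af:.1f}")
print(f"FIT at use conditions: {fit_rate(1, 10_000, 2_000, af):.2f}")
```

The calculation shows why these tests are expensive: driving the FIT estimate down requires either more devices, more hours, or harsher stress conditions.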

Failure Prediction & Isolation: But that is not enough. Customers will want to have on-chip telemetry that tracks leading indicators of failures, which can be monitored to determine when a device should be proactively replaced BEFORE failure. For example, on a communications device, an increasing BER (bit error rate) could be an early warning indicator.
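A leading-indicator check of this kind can be very simple. The sketch below flags a link for proactive replacement when its BER has risen on consecutive readings and the latest reading exceeds a limit. The threshold, window size, and function name are illustrative assumptions, not values from any vendor's telemetry.

```python
def should_replace(ber_samples, threshold=1e-12, window=3):
    """Flag a device when the last `window` BER readings rise steadily
    and the most recent one exceeds `threshold`."""
    if len(ber_samples) < window:
        return False  # not enough history to judge a trend
    recent = ber_samples[-window:]
    rising = all(a < b for a, b in zip(recent, recent[1:]))
    return rising and recent[-1] > threshold


healthy = [2e-15, 1e-15, 3e-15, 2e-15]
degrading = [2e-15, 5e-14, 8e-13, 4e-12]

print(should_replace(healthy))    # stable, low BER
print(should_replace(degrading))  # steadily rising BER past the limit
```

Requiring both a trend and a threshold is a design choice that reduces false alarms from a single noisy reading.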

If a device does fail, it should self-diagnose and raise a flag so that the location of the error can be quickly isolated and fixed. Today in data centers, it can take hours to trace back failures to their root cause.

Vendors to data centers will want access to telemetry data on their chips so they can improve their ability to predict failures. They also need failure analysis experts who can determine what failed and why, both to provide feedback on design improvements for higher reliability and to adjust firmware settings to reduce wear-out and/or improve failure prediction.

Aim for perfection

Data centers are the biggest market today for semiconductors. To win, you need high performance at low power and low cost. But without a high-reliability architecture, firmware, and design, you won’t get designed in.


