Taking Data Center Serviceability To The Next Level

How hyperscalers can successfully bring structural System Level Test to the data center.

It is no secret that Artificial Intelligence (AI) workloads are driving exponential growth in the scale of supercomputers and data centers. Training the latest Large Language Model (LLM), for instance, typically requires thousands of specialized processing cores running at full speed. As these models grow more advanced with each generation, they need additional compute performance to absorb and process enormous amounts of data. These tasks take time (on the order of months), and companies that need large-scale supercomputers and data centers to remain competitive are investing anywhere from a few hundred million to billions of dollars to build them. As expected, the industry is focusing most of its efforts on creating ever-faster hardware to handle the ever-growing AI workloads, pushing the limits of performance and power with each new product release.

However, this race for performance is not the only challenge data centers are facing. Reliability, availability, and serviceability of the data center (also known as “RAS”) are also critical. Considering that each minute of downtime in a hyperscale environment can cost upwards of $10,000, and that data centers need to meet Service Level Agreements (SLAs) on the order of 99.999% availability (a.k.a. “five 9’s”), it is no wonder that some data center operators see RAS as a main concern.
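To put the two figures above side by side, the following minimal sketch converts an availability target into a yearly downtime budget and prices it at the $10,000-per-minute figure quoted above. The function names and the 365-day year are assumptions made for illustration, and since the article states the cost as "upwards of" $10,000, the result is best read as a lower bound.

```python
# Back-of-the-envelope downtime budget for a given availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_cost(minutes: float, cost_per_minute: float = 10_000.0) -> float:
    """Price a downtime duration; the default rate is the article's quoted lower bound."""
    return minutes * cost_per_minute


five_nines_budget = downtime_budget_minutes(0.99999)
print(f"Five 9's budget: {five_nines_budget:.2f} minutes per year")
print(f"Cost of using the whole budget: at least ${downtime_cost(five_nines_budget):,.0f}")
```

Under these assumptions, a five 9's target leaves only a little over five minutes of downtime per year, which is why diagnosing a failing device without pulling it from the system matters.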

One way to address this concern is to increase the amount of testing done on the large-scale Systems-on-Chip (SoCs) running the AI workloads while optimizing the Design-for-Test (DFT) methodologies. However, these advanced chips are increasingly adopting chiplet-based designs to help reduce size and power requirements while improving performance, with the unfortunate side effect (at least from a testing perspective) of reducing the number of pins available for test pattern application. Combine this reduced pin availability with growing design sizes and the result is an effective decrease in test bandwidth, which in turn leads to longer test times and ultimately higher test costs.

To address this challenge, hyperscalers are turning to a new approach that uses high-speed I/O (HSIO) interfaces already available on the SoC, such as PCI Express (PCIe) or Universal Serial Bus (USB), to deliver structural/scan test data using the native functional protocol of the interface. To enable seamless communication over these HSIO interfaces, Synopsys provides the Silicon Lifecycle Management (SLM) High-Speed Access & Test (HSAT) IP, which can be instantiated to connect to the chip’s DFT infrastructure. The result is a solution that allows for high-bandwidth test capabilities while removing the traditional dependency on slower General Purpose Input/Outputs (GPIOs). In practice, this means manufacturing test patterns can now be reused and applied not only during System Level Test (SLT) but also while the device is in use in its functional application (In-System Test, or IST).
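The link between pin count, test bandwidth, and test time can be illustrated with a first-order model: test time is roughly the volume of scan data divided by the bandwidth available to move it. The sketch below is only that, a rough model. Every number in it is made up for illustration, the function names are hypothetical, and it ignores protocol overhead, test compression, and capture cycles.

```python
# First-order model: test time ~ scan data volume / delivery bandwidth.
# All figures are invented for illustration; they are not measurements.


def gpio_bandwidth_bps(num_scan_pins: int, shift_clock_hz: float) -> float:
    """Raw scan bandwidth, assuming each pin moves one bit per shift clock."""
    return num_scan_pins * shift_clock_hz


def scan_test_time_s(test_data_bits: float, bandwidth_bps: float) -> float:
    """Time needed to move the given amount of scan data at the given bandwidth."""
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return test_data_bits / bandwidth_bps


TEST_DATA_BITS = 100e9  # hypothetical scan data volume

scenarios = {
    "8 scan pins @ 100 MHz": gpio_bandwidth_bps(8, 100e6),
    "4 scan pins @ 100 MHz": gpio_bandwidth_bps(4, 100e6),
    # Hypothetical effective payload rate of a high-speed functional link.
    "high-speed link @ 8 Gbit/s": 8e9,
}

for label, bandwidth in scenarios.items():
    print(f"{label}: {scan_test_time_s(TEST_DATA_BITS, bandwidth):.1f} s")
```

In this model, halving the number of scan pins doubles the test time for the same data volume, while a faster link shortens it in proportion to its effective bandwidth.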

Fig. 1: Synopsys SLM High-Speed Access & Test IP.

Instantiating HSAT IP for in-system use brings many benefits, one of which is that the full diagnostic capabilities of scan patterns are now available during SLT and IST, without having to take the device out of the system. This is particularly important for improving serviceability (the “S” in RAS) and addresses the crucial need of hyperscalers to continuously track and monitor the health of their data centers. Failures occurring in the field can now be root-caused efficiently with minimal disruption.

There are additional advantages to enabling data center serviceability through HSAT. For instance, early indications of degradation or premature aging can now be deduced by comparing the results of tests done during manufacturing with the results of the same tests done while the device is in the system, in the field. In addition, scan patterns could be augmented over time to improve fault detection throughout the lifecycle of the device. Thanks to HSAT, this type of scan testing can be done at scale.
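The comparison described above can be sketched as a diff between a manufacturing baseline and a later in-field run of the same patterns. The record layout (`PatternResult`, with a per-pattern miscompare count) and the sample data are assumptions made for illustration; they do not reflect the format of any particular tool or of the HSAT IP.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PatternResult:
    """Hypothetical per-pattern record; 0 failing bits means the pattern passed."""
    pattern_id: str
    failing_bits: int


def new_failures(baseline: Iterable[PatternResult],
                 in_field: Iterable[PatternResult]) -> List[str]:
    """Return IDs of patterns that show more miscompares in the field than at manufacturing."""
    reference = {result.pattern_id: result for result in baseline}
    flagged = []
    for result in in_field:
        ref = reference.get(result.pattern_id)
        if ref is None:
            # Pattern added after manufacturing: no baseline to compare against.
            continue
        if result.failing_bits > ref.failing_bits:
            flagged.append(result.pattern_id)
    return flagged


# Made-up sample data for illustration only.
manufacturing = [PatternResult("scan_0001", 0), PatternResult("scan_0002", 0)]
field_run = [
    PatternResult("scan_0001", 0),
    PatternResult("scan_0002", 3),  # passed at manufacturing, now miscompares
    PatternResult("scan_0100", 1),  # newer pattern, no baseline available
]

print(new_failures(manufacturing, field_run))
```

Patterns without a baseline are skipped rather than flagged, so that patterns added later in the device's life do not show up as apparent degradation.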

Deploying HSAT for the first time can be done via a multi-step approach. As a first step, HSAT can be brought up during manufacturing test on automatic test equipment (ATE), using ATE vendor-specific hardware that drives the high-speed interface. The next step is to enable HSAT at SLT. In the case of PCIe, this could be done, for example, by using a local microcontroller on a board that acts as the host system (the PCIe Root Complex, or RC) and runs Automatic Test Pattern Generation (ATPG) patterns through the high-speed interface; in the case of USB, a standard laptop or desktop computer can act as the host. The patterns from ATE are reused, and the results stored by the microcontroller can be compared to show any discrepancies between test at ATE and test at SLT. The last step is to perform in-system test, in which case the microcontroller typically connected to each SoC running in the data center can be used to trigger HSAT-based scan patterns. This could be done either ad hoc in the context of a diagnostic session, or scheduled to run at regular intervals or on specific events, such as every power-up of the system. The patterns from ATE and SLT can (and should) be reused here, and the test results compared to understand how the device is performing and aging in the system. Last but not least, additional patterns could also be introduced to improve fault coverage, further improving the reliability of the system.
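The triggering options for in-system test mentioned above (ad hoc during a diagnostic session, at regular intervals, or on events such as power-up) can be sketched as a small policy function that a host controller might evaluate. Everything here is hypothetical: the `Event` names, the `IstPolicy` fields, and the decision logic are illustrative assumptions, not part of the HSAT IP or any vendor API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Event(Enum):
    """Hypothetical events a host controller might observe."""
    POWER_UP = auto()
    DIAGNOSTIC_REQUEST = auto()  # ad hoc run requested during a diagnostic session
    TICK = auto()                # periodic timer event


@dataclass(frozen=True)
class IstPolicy:
    """Hypothetical scheduling policy for in-system test runs."""
    interval_s: float
    run_on_power_up: bool = True


def should_run_ist(policy: IstPolicy, event: Event, now_s: float,
                   last_run_s: Optional[float]) -> bool:
    """Decide whether to trigger a scan test run for the given event."""
    if event is Event.DIAGNOSTIC_REQUEST:
        return True  # ad hoc requests always run
    if event is Event.POWER_UP:
        return policy.run_on_power_up
    # Periodic tick: run if there has been no run yet or the interval has elapsed.
    if last_run_s is None:
        return True
    return now_s - last_run_s >= policy.interval_s


daily = IstPolicy(interval_s=24 * 3600)
print(should_run_ist(daily, Event.TICK, now_s=90_000.0, last_run_s=0.0))
print(should_run_ist(daily, Event.TICK, now_s=3_600.0, last_run_s=0.0))
```

Keeping the decision separate from the code that actually drives the interface makes the schedule easy to change, for instance to test more often on devices that have already shown discrepancies.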

Hyperscalers must test increasingly complex systems and satisfy data center RAS requirements while dealing with an effective decrease in test bandwidth that inevitably results in higher test costs. An effective approach is to leverage existing high-speed interfaces, which can be done by implementing the Synopsys SLM HSAT IP. Doing so opens the door to new capabilities that allow for extensive in-system testing.

An example of such deployment by Amazon Web Services (AWS) is provided in the whitepaper published by the IEEE, titled: Novel Technique for Manufacturing, System-Level, and In-System Testing of Large SoC Using Functional Protocol-Based High-Speed I/O.

For more information, visit https://www.synopsys.com/solutions/silicon-lifecycle-management/high-speed-access-and-test.html


