Providing assurance of data center system uptime with silicon lifecycle management.
By Charlie Matar, Rita Horner, and Pawini Mahajan
While once the domain of large data centers and supercomputers, high-performance computing (HPC) has become rather ubiquitous and, in some cases, essential in our everyday lives. Because of this, reliability, availability, and serviceability, or RAS, is a concept that more HPC SoC designers should familiarize themselves with.
RAS may sound like a self-explanatory term, but what does it really involve when it comes to HPC SoCs? Data center operators have long maintained service level agreements with customers to provide assurance of system uptime. RAS complements these agreements and can now be supported with new technology that ultimately yields actionable insights. In this article, originally published on the “From Silicon to Software” blog, you’ll learn why silicon lifecycle management (SLM), embedded monitoring IP, and the right design and verification tools can be your allies in achieving high levels of RAS for your HPC designs.
The video footage captured by a home’s security doorbell or a building’s surveillance system. Financial and business operations modeling. Scientific and medical research. Augmented reality and virtual reality. The list goes on when it comes to application areas that rely on HPC. The proliferation of data collected by our devices and systems, AI-fueled analytics, the availability of massive amounts of compute resources, and the cloud are converging to make it possible to quickly derive useful, actionable insights, making HPC an integral part of a much wider array of applications than when the very first supercomputers emerged in the 1960s.
A typical HPC infrastructure today consists of three key elements: compute, network, and storage. Each requires a certain level of performance, latency, power efficiency, scalability, productivity, and security. Let’s take a closer look at each element:
There are two primary types of HPC systems: homogeneous machines and hybrid ones. Homogeneous machines have only CPUs. Hybrids, by contrast, have both GPUs and CPUs, with GPUs running the tasks and CPUs overseeing the computation.
HPC clusters can be composed of large numbers of servers, where the total physical size, energy use, or heat output of the computing cluster might become a serious issue. Furthermore, there are requirements for dedicated communications among the servers that are somewhat unique to clusters.
Because small design differences amount to large benefits when multiplied by the number of servers in the clusters, we are seeing the emergence of server designs that are optimized for HPC. Sometimes these are designs targeted at large, public Web operators, such as search engine firms, that deliver similar benefits in HPC clusters. However, they can also offer features only appropriate for HPC users. For example, if the system were designed to provide the cluster interconnect differently, there might be potential for significant cabling reductions.
The utility of HPC lies in its ability to process an enormous amount of data—petabytes or even zettabytes—and to run complex models in real time (or near real time). It goes without saying that anytime an HPC system goes down, it leads to lost money and business discontinuity. The ramifications become steeper with mission-critical applications. At advanced nodes, with large monolithic dies or complex architectures such as multi-die, it becomes even more challenging to meet RAS requirements.
Depending on the criticality of the application at hand, systems can be built with backups that provide redundancy in the event of failures. Beyond redundancy, there’s much more you can do at the system and chip levels to meet RAS targets. This is where SLM plays a big role, providing intelligent, automated in-chip monitoring IP and methodologies to generate actionable insights through each phase of a system’s life.
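To make the redundancy point concrete, here is a minimal sketch of the standard availability arithmetic. It assumes units that fail independently and a system that stays up as long as any one unit is up; the 99% figure and the function names are illustrative assumptions, not values from any particular service level agreement.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent units when any one unit suffices.

    Assumes failures are independent, which real systems only approximate.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Illustrative only: a unit that is up 99% of the time, with 1 to 3 copies.
    for copies in (1, 2, 3):
        a = redundant_availability(0.99, copies)
        print(f"{copies} unit(s): availability {a:.6f}, "
              f"about {annual_downtime_hours(a):.2f} hours down per year")
```

Under these assumptions, a second copy shrinks the expected downtime by a factor of 100. Correlated failures, such as shared power or cooling, erode that gain in practice, which is one reason redundancy alone is not the whole answer.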
Designers have been embedding monitors and sensors into their chips for a number of decades; however, the technology has evolved to a point where it can now provide much more accurate data with higher levels of granularity. This enables greater visibility into the device’s real-time environmental, structural, and functional conditions. Examples include the monitoring of thermal hotspots, process variation, and voltage supply, as well as accurate measurement of timing margins.
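As a rough illustration of how readings from such monitors might be screened, the sketch below checks temperature, supply voltage, and timing-slack samples against fixed limits. Everything in it is hypothetical: the monitor names, the `Reading` type, and the limit values are assumptions made for the example and do not describe any real monitoring IP or its interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Reading:
    monitor: str   # hypothetical monitor instance name
    kind: str      # "temp_c", "vdd_v", or "slack_ps"
    value: float


# Illustrative (low, high) limits only; None means no bound on that side.
Limits = Dict[str, Tuple[Optional[float], Optional[float]]]
LIMITS: Limits = {
    "temp_c": (None, 105.0),   # flag thermal hotspots above 105 C
    "vdd_v": (0.72, 0.88),     # flag supply droop or overshoot
    "slack_ps": (20.0, None),  # flag timing margin below 20 ps
}


def out_of_range(reading: Reading, limits: Limits = LIMITS) -> bool:
    """Return True if the reading falls outside its configured limits."""
    low, high = limits[reading.kind]
    if low is not None and reading.value < low:
        return True
    if high is not None and reading.value > high:
        return True
    return False


def flag_violations(readings: List[Reading], limits: Limits = LIMITS) -> List[Reading]:
    """Return the subset of readings that violate their limits, in order."""
    return [r for r in readings if out_of_range(r, limits)]


if __name__ == "__main__":
    samples = [
        Reading("core0_temp", "temp_c", 92.5),
        Reading("core1_temp", "temp_c", 108.0),
        Reading("core0_vdd", "vdd_v", 0.70),
        Reading("core0_path", "slack_ps", 35.0),
    ]
    for r in flag_violations(samples):
        print(f"{r.monitor}: {r.kind} = {r.value} is out of range")
```

Fixed thresholds are the simplest possible policy. Finer-grained monitor data makes it possible to go further, for example by comparing each reading against its own history rather than a single global limit.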
Thanks to embedded and cloud-based analytics, along with the availability of a unified SLM solution, design teams will be able to build up a continuous, real-time picture of the silicon health of their device, not only during the in-design, in-ramp, and in-production phases but also during in-field operation. They can get a better understanding of the root cause of a problem, then debug and repair it right away, reducing costs and potential harm. Issues that SLM can address include transistor aging and delay faults. To get a picture of the benefits this brings, consider a satellite that has a bug. Normally, it can take many weeks to get a repaired board back from the lab to install in the satellite, taking it out of commission for an extended period for troubleshooting and repair. By moving fault detection and fault repair into the field through SLM technologies, teams can keep their systems up and running with fewer and shorter disruptions.
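One way to picture how in-field data could turn into an actionable insight is to extrapolate a trend in a monitored quantity. The sketch below fits a straight line to timing-margin samples and estimates how many operating hours remain before the margin reaches a chosen floor. The linear-degradation assumption, the function names, and the sample values are all illustrative; real aging behavior is generally not linear, and this is not a description of how any SLM product performs its analytics.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_floor(hours: Sequence[float],
                      margins_ps: Sequence[float],
                      floor_ps: float) -> Optional[float]:
    """Estimate operating hours left before the margin trend hits `floor_ps`.

    Returns None when the fitted trend is flat or improving, and 0.0 when
    the fitted line has already crossed the floor at the last sample time.
    """
    slope, intercept = fit_line(hours, margins_ps)
    if slope >= 0:
        return None
    crossing_hour = (floor_ps - intercept) / slope
    return max(0.0, crossing_hour - hours[-1])


if __name__ == "__main__":
    # Illustrative samples: margin in picoseconds at cumulative operating hours.
    hours = [0.0, 1000.0, 2000.0, 3000.0]
    margins = [50.0, 48.0, 46.0, 44.0]
    remaining = hours_until_floor(hours, margins, floor_ps=30.0)
    print(f"Estimated hours until margin reaches the floor: {remaining}")
```

An estimate like this could be used to schedule maintenance during a planned window rather than reacting to a failure, which is the kind of shorter, less frequent disruption described above.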
Taking a look at a data center, we can see another example highlighting how SLM can facilitate meeting RAS requirements.
Rather than a disconnected set of point tools, an end-to-end solution that brings together everything from design calibration analytics to in-chip monitoring and system performance optimization can make the process of addressing RAS targets a more seamless one. Synopsys is unique in providing such an end-to-end flow, with our Silicon Lifecycle Management Family, complemented by a broad portfolio of low-latency, silicon-proven IP and design and verification technologies for HPC applications. Within this set of solutions are features such as physically aware silicon monitors, cloud analytics, and embedded analytics and optimization technologies. SoC sensor IP and process monitors, also part of the family, are used for in-design, in-ramp, in-production, and in-field optimizations. Reliable silicon at the manufacturing phase and in the field, where monitors can collect real-time data about the silicon, combined with comprehensive test and debug solutions, helps ensure high RAS.
Given the increasingly wide range of applications that now rely on HPC, maintaining high levels of reliability, availability, and serviceability in these systems is a critical consideration across the board. Achieving optimal levels of RAS to support everything from streaming video to climate change modeling is another important element in keeping our digital, smart-everything world running at top speed.
Charlie Matar is senior VP of engineering in the Synopsys Solutions Group.
Rita Horner is product marketing director in the Synopsys Solutions Group.
Pawini Mahajan is staff product marketing manager in the Synopsys EDA Group.