
Choosing The Right Server Interface Architectures For High Performance Computing

Some of the less common considerations for assessing the suitability of a system for high-performance workloads.


The largest share of the bulk and cost of a modern high-performance computing (HPC) installation is the acquisition or provisioning of many identical systems, interconnected by one or more networks, typically Ethernet and/or InfiniBand. Most HPC experts know that server manufacturers offer many choices of form factor, CPU, RAM configuration, out-of-band management protocol, and connectivity, all of which weigh heavily in choosing the “right” or best system. What is less commonly considered is SoC and system design: balancing each of the elements discussed below and adopting forward-looking interface standards, both of which can significantly affect overall performance and endurance. As an example, several years ago I built an HPC cluster from a server with a pair of high-performance processors and a large memory footprint. That server had insufficient memory bandwidth to support the use of all installed RAM without significant performance degradation for the workloads it hosted. System designers always need to make tradeoffs, using the best-value components available rather than always the best-performing ones, to design cost-competitive systems that will be compelling in the marketplace. The result is a balanced architecture for some, but probably not all, of the configurations possible with that server model. This article outlines some of the less common considerations you might use to assess the suitability of a computer system and configuration for your high-performance workload.

First and foremost is a characterization of the workloads themselves. What is the size of the memory working set of a given machine’s most common computational load? Is the workload distributed as a tightly coupled parallel computation that relies on a high-speed network fabric for inter-process communication or does the computing take place independently in each server (embarrassingly parallel) pushing each processor’s cores, local memory bandwidth and capacity, PCIe or other internal bus, and installed accelerators to the limits of their capabilities? Can the system operate at peak performance within the power budget without overheating or throttling back on the clock rates or bandwidth? Are all the paths between critical elements used in a computation adequate to support latency, bandwidth, and throughput requirements and capabilities of the participating devices? If you plan to maximize or skimp on the processor specifications, amount of RAM, number of installed devices, or choose any other configuration options that are on the edge of the system design’s norms, you may need to consider if doing so will compromise the ability of the system to operate at peak efficiency and accommodate the intended workloads well.

Fig. 1: Examples of HPC workloads.

Once you have considered the work that the system will do and configured the primary elements (processor type and core count, RAM capacity and installed accelerators/coprocessors), there are some additional considerations that could have a significant impact on the lifetime and performance of your cluster.

The PCIe bus is a critical interface for internal communication between system processors and the I/O devices installed in a server. GPGPUs and other accelerators, NICs, analog-to-digital interfaces, and even GPUs designed to drive video displays often rely on this bus for all communication with the CPU and system memory, as well as among the devices themselves. PCIe has gone through several generational revisions and has lately begun to take on the added complexity and functionality of CXL. Each generation of PCIe has maintained backward compatibility and approximately doubled the per-lane bandwidth of the prior generation. Today’s processors and servers use the PCIe Gen 4.0 bus, with each lane signaling at up to 16 GT/s. A designer on the cusp of planning a new datacenter and HPC installation, however, should consider the next generation of PCIe Gen 5.0 systems, which will be capable of signaling at 32 GT/s. Processors supporting PCIe Gen 5.0 are in development now, with servers likely to make their debut in 2023. The most recently ratified specification, PCIe 6.0, supports a signaling rate of 64 GT/s per lane and is the first generation of PCIe to use PAM4 signaling to double the achievable bandwidth yet again, but it will probably be a few more years before it sees daylight as an available server technology. PCIe Gen 6.0 system motherboards are also expected to have a much higher intrinsic cost, so hybrid servers that interconnect boards of different generations may be in our slightly longer-term future.

Fig. 2: Development cycle for PCIe generations.

To accommodate the intricacies of PAM4, a low-latency forward error correction (FEC) scheme was developed for PCIe Gen 6.0. This error correction relies on a fixed-size packet encapsulation scheme called Flow Control Unit (FLIT) encoding instead of the variable-size TLPs used in prior PCIe generations. Additionally, PCIe Gen 6.0 introduces a new low-power state, L0p, which allows link bandwidth to be scaled down in exchange for energy savings.
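The generational doubling described above can be turned into rough link bandwidth figures. The following is a minimal sketch, not a definitive model: the function and table names are illustrative, the 128b/130b encoding efficiency is applied to Gen 4.0 and 5.0, and Gen 6.0 is treated as if it had no FLIT or FEC overhead, so its figure is an upper bound.

```python
# Per-lane signaling rates in GT/s, as quoted in the text above.
SIGNALING_GT_PER_S = {"4.0": 16, "5.0": 32, "6.0": 64}

# Line-encoding efficiency. Gen 4.0 and 5.0 use 128b/130b encoding.
# Gen 6.0 is modeled as 1.0 here, which ignores FLIT/FEC overhead
# (an assumption that makes the Gen 6.0 figure an upper bound).
ENCODING_EFFICIENCY = {"4.0": 128 / 130, "5.0": 128 / 130, "6.0": 1.0}


def link_bandwidth_gbytes(gen: str, lanes: int = 16) -> float:
    """Approximate one-direction bandwidth in GB/s for a PCIe link."""
    gbits = SIGNALING_GT_PER_S[gen] * ENCODING_EFFICIENCY[gen] * lanes
    return gbits / 8  # convert gigabits to gigabytes


if __name__ == "__main__":
    for gen in SIGNALING_GT_PER_S:
        bw = link_bandwidth_gbytes(gen)
        print(f"PCIe Gen {gen} x16: ~{bw:.1f} GB/s per direction")
```

These numbers exclude packet header and flow-control overhead, so delivered throughput to a device will be somewhat lower than what the sketch reports.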

CXL, mentioned earlier, is a cache-coherent protocol that operates over the physical layer of PCIe and takes advantage of the alternate protocol negotiation capability that is intrinsic to the hosting PCIe architecture. CXL also eliminates some of the latency intrinsic to the PCIe protocol. Like PCIe Gen 6.0, it is a FLIT-based protocol. The first instance of CXL will be seen as version 1.1, and this, along with the second-generation CXL 2.0, will first be introduced on PCIe Gen 5.0. The CXL 3.0 specification is already nearing completion. Unlike some earlier attempts to standardize a symmetric cache-coherent protocol, such as CCIX or Gen-Z, CXL offers asymmetric coherency, typically under the supervision of the host processor. An upside of this architecture is much lower communication latency over the native PCIe interconnect. All versions of CXL define three device types in the protocol specification. Each device type is defined by the subset of the three communication protocols (CXL.io, CXL.mem, and CXL.cache) that it supports:

  • Type 1: Accelerators/NICs without independent memory supporting CXL.io and CXL.cache
  • Type 2: Accelerator devices that have onboard memory like GPGPUs and more advanced SmartNICs supporting all three protocols, CXL.io, CXL.mem, and CXL.cache
  • Type 3: Devices for memory expansion such as solid-state storage, persistent memory, or DRAM devices supporting CXL.io and CXL.mem
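The three device types can be expressed as a small lookup table. This is only an illustrative sketch: the class and function names are hypothetical and not part of any CXL software stack. It relies on the fact that the CXL specification requires CXL.io for every device type, with a Type 3 memory expander pairing it with CXL.mem.

```python
from enum import Flag, auto
from typing import Optional


class CxlProtocol(Flag):
    """The three CXL sub-protocols (names here are illustrative)."""
    IO = auto()     # CXL.io: discovery, configuration, PCIe-style I/O
    CACHE = auto()  # CXL.cache: device coherently caches host memory
    MEM = auto()    # CXL.mem: host accesses device-attached memory


# Protocol set for each device type, following the list above.
DEVICE_TYPES = {
    1: CxlProtocol.IO | CxlProtocol.CACHE,
    2: CxlProtocol.IO | CxlProtocol.CACHE | CxlProtocol.MEM,
    3: CxlProtocol.IO | CxlProtocol.MEM,
}


def classify(protocols: CxlProtocol) -> Optional[int]:
    """Return the device type whose protocol set matches exactly, else None."""
    for dev_type, required in DEVICE_TYPES.items():
        if protocols == required:
            return dev_type
    return None


if __name__ == "__main__":
    memory_expander = CxlProtocol.IO | CxlProtocol.MEM
    print("Memory expander is CXL Type", classify(memory_expander))
```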

To use CXL, the system processors and at least one device must be capable of supporting the protocol, but not every device needs to be CXL aware. The native PCIe protocol will still function on a CXL-capable system with devices that are unaware of the CXL protocol. Some of the sharing and/or pooling capabilities of CXL may require external fabric switching, and systems looking ahead might build in externally facing PCIe or CXL ports with this in mind. While further details of this specification are out of scope for this article, more information can be found at https://www.computeexpresslink.org.

Memory architecture (controller and bus) is another area that bears additional examination. As mentioned earlier, memory bandwidth limits how much memory capacity each core can use effectively, and many codes have large working sets that are very sensitive to this interface. Unlike PCIe, DDR memory maximum speeds are fixed to a given generation and DIMM type. Servers today support DDR4 memory, but next year the promise of DDR5 looms large, supporting up to 6400 Mbps per pin for an overall maximum bandwidth of 51 GB/s per DIMM, as well as higher DIMM densities (up to 2Tb / DIMM!). Because modern processors can operate on data even faster than this memory bus can deliver it, the practical amount of local memory that one CPU can support is limited. NUMA architectures, developed in the early 1990s as a refinement of symmetric multiprocessing (SMP) architectures, are therefore common in multi-CPU servers. Because direct memory attachment is divided among multiple CPUs, performance relies heavily on a fast inter-CPU interconnect when data is neither cached nor located in the local NUMA domain. This has far-reaching implications for applications, depending on whether the application working set fits into a single NUMA domain and/or a given set of CPU cores within a single processor. The connectivity required of a CPU to support multiple channels of a DDR4/5 memory bus pushes socket technology to its limits, requiring 1700 or more pins per processor!
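The DDR5 figure quoted above follows from simple arithmetic: 6400 Mbps per pin across a 64-bit data bus is 51.2 GB/s. The sketch below reproduces that calculation and then estimates the bandwidth available per core. The channel and core counts in the example are hypothetical values chosen for illustration, not figures for any particular processor, and the DDR4-3200 comparison point is an assumption added here rather than a number from the article.

```python
def dimm_bandwidth_gbytes(mbps_per_pin: float, data_bits: int = 64) -> float:
    """Peak bandwidth in GB/s of one DIMM with a 64-bit data bus."""
    return mbps_per_pin * data_bits / 8 / 1000  # Mbit -> MByte -> GByte


def bandwidth_per_core(mbps_per_pin: float, channels: int, cores: int) -> float:
    """Peak memory bandwidth per core in GB/s, one DIMM per channel."""
    return dimm_bandwidth_gbytes(mbps_per_pin) * channels / cores


if __name__ == "__main__":
    print(f"DDR5-6400 DIMM: {dimm_bandwidth_gbytes(6400):.1f} GB/s")
    print(f"DDR4-3200 DIMM: {dimm_bandwidth_gbytes(3200):.1f} GB/s")
    # Hypothetical socket: 8 memory channels shared by 64 cores.
    per_core = bandwidth_per_core(6400, channels=8, cores=64)
    print(f"Per core (8 channels, 64 cores): {per_core:.1f} GB/s")
```

Comparing the per-core figure with what a workload's inner loops actually consume is one quick way to spot a configuration in which added cores or added RAM cannot be fed by the memory bus.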

Apart from DRAM considerations, solid-state memory or NVMe storage can leverage the PCIe bus for connectivity. CXL Type 3 (memory) devices hold great promise to change the paradigm of system-attached memory. They impose a common set of 64-byte addressable semantics for memory expansion across a variety of memory-class devices, and they enable remote access through a form of disaggregation known as “pooling,” in which a system borrows unused memory from other systems on the same switched CXL fabric. CXL 3.0 introduces the ability to share memory among multiple systems using these same semantics. When operating systems develop more sophisticated tiering support in the memory subsystem, this capability can not only enhance application performance but also yield significant savings, if systems can rely on aggregate memory capacity rather than depending exclusively on local capacity for computation. Lastly, there is a standard for High Bandwidth Memory (HBM), often found within accelerators. It is currently at HBM3, which supports bandwidth as high as 1.075 TB/s! While this seems a very attractive way to better match memory performance to processor bandwidth, form factor standardization challenges, lower capacities, and cost currently prevent HBM from completely replacing DDR as primary system memory.

Ethernet is another important technology to consider. A server designed for HPC typically has built-in Ethernet data ports that support at least two functions. One provides access to the out-of-band management processor interface (e.g., IPMI or Redfish), through which a server can be turned on or off, monitored, and managed. The second is typically used to access the OS resident on the system processor(s) interactively. These are generally gigabit Ethernet ports that provide important functionality at low cost. They are, however, too slow to serve as a high-speed cluster interconnect for access to shared storage and for Inter-Process Communication (IPC) such as MPI or SHMEM. The solution, of course, is to provide additional, faster communication ports via a PCIe-attached network interface card (NIC).

An Ethernet fabric not only enables communication within the cluster; it can also extend connectivity among several cluster locations in one or more data centers, either across dedicated connections or via the internet. Ethernet is certainly needed to access cloud-hosted HPC resources, which are becoming more popular for hosting or bursting HPC workflows, both as a way to control infrastructure cost and as a response to pandemic-induced limitations on in-person contact. These factors give Ethernet a big advantage as a primary cluster interconnect technology over InfiniBand, which would require a gateway to access any internet-provided connectivity. One port of 200 Gbps Ethernet can be fully supported on an x16 PCIe Gen 4.0 bus, but switched Ethernet speeds of up to 400 Gbps are possible today, with 800 Gbps to be available soon and 1.6 Tbps already in scope for standardization. If you choose Ethernet as your primary cluster interconnect fabric, you will want at least one 100 Gbps port on each server, possibly using breakout cables to feed up to four systems at 100 Gbps from a single 400 Gbps switch port, which limits the realm of full-bandwidth high-speed Ethernet to the switch fabric and data center interconnect domains. For 800 Gbps, there will almost certainly be 8×100 Gbps cabling solutions available. An HPC cluster interconnect requires performance offload features such as RoCEv2 (RDMA over Converged Ethernet), server access to differentially VLAN-tagged traffic, and flexible packet size support such as jumbo frames for storage (data) access. If there is a need for privacy or for compliance with regulations such as HIPAA or GDPR, MACsec support might be a priority.

The complexity and cost of low-latency, low-power, full-featured, and interoperable Ethernet ports with flexible packet sizes demand SoCs built with proven Ethernet IP, in both adapters and switching infrastructure, to deliver and optimize a fully functional HPC fabric at an affordable cost across all of the server-resident PCIe-connected adapters, cables, and switches required.
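The sizing statements above (a 200 Gbps port fitting an x16 PCIe Gen 4.0 slot, and a 400 Gbps switch port breaking out to four 100 Gbps servers) can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the function names are illustrative, only 128b/130b line encoding is accounted for, and PCIe packet overhead is ignored, so a real adapter will have somewhat less headroom than the calculation suggests.

```python
def pcie_gbps(gt_per_s: float, lanes: int = 16,
              efficiency: float = 128 / 130) -> float:
    """Approximate one-direction PCIe link rate in Gbps (128b/130b assumed)."""
    return gt_per_s * lanes * efficiency


def slot_supports(port_gbps: float, ports: int,
                  gt_per_s: float, lanes: int = 16) -> bool:
    """True if the slot can carry all NIC ports at full line rate."""
    return ports * port_gbps <= pcie_gbps(gt_per_s, lanes)


def breakout_count(switch_port_gbps: int, server_port_gbps: int) -> int:
    """Number of server links one switch port can feed via breakout cable."""
    return switch_port_gbps // server_port_gbps


if __name__ == "__main__":
    print("200G on Gen 4.0 x16:", slot_supports(200, 1, gt_per_s=16))
    print("400G on Gen 4.0 x16:", slot_supports(400, 1, gt_per_s=16))
    print("400G on Gen 5.0 x16:", slot_supports(400, 1, gt_per_s=32))
    print("Servers per 400G switch port at 100G:", breakout_count(400, 100))
```

Under these assumptions a single 400 Gbps port exceeds what an x16 Gen 4.0 slot can carry, which is one reason to weigh PCIe Gen 5.0 readiness when planning for faster Ethernet.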

Fig. 3: Ethernet fabric for HPC in a datacenter.

In conclusion, the server architecture selected for your HPC cluster may endure for several years, and its limitations may become more significant as it ages and the hosted workflows change. It is usually logistically impossible to upgrade the original servers while they are in production because of the volume of user jobs that are running. Newer systems added to expand the cluster must integrate with the network, space, power, and cooling infrastructure already in place. Planning and choosing technology and standards that will both endure and fit all hosted application workflows while delivering accelerated performance is worth some serious analysis up front. The considerations mentioned in this article, while not exhaustive, are certainly crucial to the longevity of your design. There is no reason that a well-chosen server should not extend cluster life up to and well beyond the warranty coverage of the systems themselves.


