Enabling New Server Architectures With The CXL Interconnect

Sharing memory resources across CPUs and accelerators.


The ever-growing demand for higher-performance compute is motivating the exploration of new compute offload architectures for the data center. Artificial intelligence and machine learning (AI/ML) workloads are just one example of the increasingly complex and demanding applications pushing data centers away from the classic server computing architecture. These demanding workloads can benefit greatly from lower-latency, coherent memory architectures. This is where the Compute Express Link (CXL) standard comes in.

CXL was first introduced in 2019 and has emerged as a new enabling technology for interconnecting computing resources. It provides a means of interconnecting, in a memory cache-coherent manner, a wide range of computing elements including CPUs, GPUs, Systems on Chip (SoCs), memory, and more. This is particularly compelling in a world of heterogeneous computing, where purpose-built accelerators offload targeted workloads from the CPU. As workloads become increasingly challenging, more and more memory resources are deployed with accelerators. CXL provides a means to share those memory resources across CPUs and accelerators for greater performance, efficiency, and improved total cost of ownership (TCO).

CXL adopted the ubiquitous PCIe standard for its physical layer protocol, harnessing the standard’s tremendous industry momentum. When CXL was first launched, PCIe 5.0 was the latest standard, and CXL 1.0, 1.1 and the subsequent 2.0 generation all used PCIe 5.0’s 32 GT/s signaling. CXL 3.0 was released in 2022 and adopted PCIe 6.0 as its physical interface. CXL 3.0, like PCIe 6.0, uses PAM4 signaling to boost data rates to 64 GT/s.
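The headline bandwidth numbers follow directly from those signaling rates. A minimal sketch of the arithmetic, ignoring encoding and FLIT/packet overhead (so real delivered bandwidth is somewhat lower), with a hypothetical helper function:

```python
def raw_bandwidth_gbs(gt_per_s: float, lanes: int = 1) -> float:
    """Raw link bandwidth in GB/s per direction.

    Each transfer carries one bit per lane, so GT/s divided by 8
    gives GB/s per lane; multiply by the lane count for the link.
    """
    return gt_per_s / 8 * lanes

# CXL 1.x/2.0 on PCIe 5.0: 32 GT/s NRZ signaling
print(raw_bandwidth_gbs(32, lanes=16))  # 64.0 GB/s per direction on a x16 link

# CXL 3.0 on PCIe 6.0: 64 GT/s via PAM4 (2 bits per symbol at 32 Gbaud)
print(raw_bandwidth_gbs(64, lanes=16))  # 128.0 GB/s per direction on a x16 link
```

Note that PAM4 doubles the data rate without doubling the channel's Nyquist frequency, which is what makes 64 GT/s practical over similar physical channels.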

To support a broad number of use cases, the CXL standard defines three protocols: CXL.io, CXL.cache and CXL.mem. CXL.io provides a non-coherent load/store interface for IO devices and can be used for discovery, enumeration, and register accesses. CXL.cache enables devices such as accelerators to efficiently access and cache host memory for improved performance. With CXL.io plus CXL.cache, the following use model is possible: an accelerator-based NIC (a Type 1 device in CXL parlance) would be able to coherently cache host memory on the accelerator, perform networking or other functions, and then pass ownership of the memory to the CPU for additional processing.

The combination of the CXL.io, CXL.cache and CXL.mem protocols enables a further compelling use case. With all three protocols, a host and an accelerator with attached memory (a Type 2 device) can share memory resources in a cache-coherent manner. This provides enormous architectural flexibility by offering processors, whether hosts or accelerators, access to greater capacity and memory bandwidth across their combined memory resources. One application that benefits from lower-latency coherent access to CPU-attached memory is natural language processing (NLP). NLP models require a large amount of memory, typically more than can fit on a single accelerator card.
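The device types and the protocol subsets they run can be summarized in a short sketch. The type-to-protocol mapping follows the CXL specification (the spec also defines Type 3 memory-expander devices, which run CXL.io plus CXL.mem); the class and function names here are illustrative, not a real API:

```python
from enum import Flag, auto

class Proto(Flag):
    """The three CXL protocols."""
    IO = auto()     # CXL.io: non-coherent load/store, discovery, config
    CACHE = auto()  # CXL.cache: device coherently caches host memory
    MEM = auto()    # CXL.mem: host accesses device-attached memory

# Per the CXL spec, each device type runs a fixed subset of the protocols.
DEVICE_TYPES = {
    "Type 1": Proto.IO | Proto.CACHE,              # e.g. accelerator NIC, no exposed memory
    "Type 2": Proto.IO | Proto.CACHE | Proto.MEM,  # accelerator with attached memory
    "Type 3": Proto.IO | Proto.MEM,                # memory expander, no device cache
}

def host_can_use_device_memory(device_type: str) -> bool:
    """The host can map device-attached memory only if the device runs CXL.mem."""
    return Proto.MEM in DEVICE_TYPES[device_type]

print(host_can_use_device_memory("Type 2"))  # True: memory pooling/sharing works
print(host_can_use_device_memory("Type 1"))  # False: Type 1 exposes no memory to the host
```

The design point this illustrates: only Type 2 devices run all three protocols, which is why the host/accelerator memory-sharing use case described above requires a Type 2 device.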

Rambus offers a CXL 2.0 Interface Subsystem (Controller and PHY) as well as a CXL 3.0 PHY (PCIe 6.0 PHY) that are ideal for performance-intensive devices such as AI/ML accelerators. These Rambus solutions benefit from over 30 years of high-speed signaling expertise, as well as extensive experience in PCIe and CXL solutions.
