CXL Thriving As Memory Link

Adoption of the Compute Express Link protocol is spreading as a way to connect and share memory.


CXL is emerging from a jumble of interconnect standards as a predictable way to connect memory to various processing elements, as well as to share memory resources within a data center.

Compute Express Link is built on a PCI Express foundation and supported by nearly all the major chip companies. It is used to link CPUs, GPUs, FPGAs, and other purpose-built accelerators using serial communication, but it also allows memory to be pooled across devices to improve utilization of resources.

While CXL often has been compared with NVIDIA’s NVLink, a faster high-bandwidth technology for connecting GPUs, its mission is evolving along a different path. “A couple of years ago, we were all thinking that lots of accelerators would be using CXL as a cheap, lightweight way to get to host memory,” said Richard Solomon, senior staff technical product manager for PCI Express Controller IP for the Solutions Group at Synopsys. “We’re not seeing as much of that. There’s definitely less interest than we had expected and hoped. What we are seeing is overwhelming interest in CXL as a connection to memory technology.”

In fact, CXL was never intended to cover all the bases. “The primary use cases in the CXL early stages are in memory expansion with increased capacity and bandwidth, memory reuse, and intelligent memory tiering applications,” said Zaman Mollah, senior product marketing manager at Rambus. “These solutions are easy to integrate using existing plug-and-play PCIe physical interfaces. CXL will co-exist with other similar interconnect technologies with its own applications and use cases. CXL may not provide the bandwidth needed for GPU-centric AI applications, but it has its usefulness in accelerators and CPU-based AI use cases. It allows for a composable data center architecture, and provides a flexible and cost-effective way to increase memory capacity and bandwidth with acceptable latency.”

Further, Christopher Browy, senior director of the VIP product line at Siemens Digital Industries Software, said CXL likely will be used for memory pooling and sharing between hosts and coherent accelerators with remote scale-out over UALink and UltraEthernet fabrics. “Switches supporting both CXL and UALink are likely. CXL will also factor into advanced multi-level memory, storage-class memory, and caching solutions and optimizing computational storage. CXL is closer to reality now than a few years ago, and it will have its place amongst the latest emerging AI and HPC connectivity standards of the future. CXL is the best ticket, as it uniquely enables memory- and cache-based expansion. While UCIe stands to redefine chip and IP markets based on multi-die chiplet designs, CXL is key to how they will work together at the module and rack level.”

Increased functionality in 3.1, including enhanced security
The CXL standard, originally proposed in March 2019, has seen a number of revisions over the past few years. CXL 3.0 was released in 2022, and the newest CXL 3.1 specification was detailed last November. Many new features were added in 3.0, with engineering change notices (ECNs) in 3.1 enhancing those features and the fabric capability.

CXL has three sub-protocols: CXL.io, CXL.cache, and CXL.memory. It was CXL.memory that created the most excitement, as designers realized what it could do for memory expansion in the data center and for advanced AI topologies. Memory build-out relies on CXL 2.0-based processors from Intel, AMD, and Arm-based hyperscalers.
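As a rough illustration of how those sub-protocols combine, the sketch below is a hypothetical Python model, not anything defined by the CXL specification. It maps the sub-protocol combinations a device advertises onto the familiar Type 1/2/3 device classes: caching accelerators, accelerators with their own memory, and memory expanders.

```python
# Illustrative sketch only: how the three CXL sub-protocols combine
# into the standard device types. Names here are invented for clarity.
from enum import Flag, auto

class SubProtocol(Flag):
    IO = auto()      # CXL.io: discovery, configuration, DMA (PCIe-like)
    CACHE = auto()   # CXL.cache: device coherently caches host memory
    MEM = auto()     # CXL.memory: host load/store access to device memory

# CXL.io is mandatory for every device; the other two define the type.
DEVICE_TYPES = {
    SubProtocol.IO | SubProtocol.CACHE:                   "Type 1 (caching accelerator)",
    SubProtocol.IO | SubProtocol.CACHE | SubProtocol.MEM: "Type 2 (accelerator with memory)",
    SubProtocol.IO | SubProtocol.MEM:                     "Type 3 (memory expander)",
}

def classify(protocols: SubProtocol) -> str:
    """Map an advertised sub-protocol combination to a CXL device type."""
    return DEVICE_TYPES.get(protocols, "invalid combination")

print(classify(SubProtocol.IO | SubProtocol.MEM))  # Type 3 (memory expander)
```

Memory expanders, the Type 3 case at the bottom, are where most of the current interest lies.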

According to the CXL Consortium, the 3.1 spec “improved fabric manageability to take CXL beyond the rack and enable disaggregated systems. The CXL 3.1 Specification builds on previous iterations to optimize resource utilization, create trusted compute environments as needed, extend memory sharing and pooling to avoid stranded memory, and facilitate memory sharing between accelerators.”

CXL 3.1 includes additional features to further reduce latency in connected endpoints and hosts across a CXL fabric, as well as a new security protocol. “Some of these features include CXL.io peer-to-peer (P2P), unordered I/O (UIO), and CXL.mem P2P, as well as the addition of trusted execution environment (TEE) operation code to extend integrity and data encryption (IDE) support across the PCIe transport layer, which provides a protected path between host and endpoints through switches and retimers in the path,” said Lou Ternullo, senior director of product marketing for silicon IP at Rambus. “P2P allows PCIe/CXL devices to directly access memory in other PCIe/CXL devices on the flex bus without requiring the host processor. UIO allows transactions to pass through the transport layer without being required to manage ordering rules.”
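A highly simplified model of what P2P changes, using invented class and function names rather than real CXL interfaces: with P2P, a device reads a peer's memory directly across the fabric instead of staging the data through the host processor.

```python
# Hypothetical sketch contrasting host-mediated access with the CXL 3.1
# peer-to-peer (P2P) path described above. Nothing here is a CXL-defined API.
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    name: str
    memory: dict = field(default_factory=dict)   # address -> data

def host_mediated_read(host: Endpoint, target: Endpoint, addr: int):
    # Without P2P, the host CPU stages the transfer: an extra hop
    # and host involvement for every access.
    staged = target.memory.get(addr)
    host.memory[addr] = staged
    return staged

def p2p_read(initiator: Endpoint, target: Endpoint, addr: int):
    # With P2P, the initiator reaches the target's memory directly,
    # bypassing the host processor entirely.
    return target.memory.get(addr)

gpu = Endpoint("accelerator")
expander = Endpoint("memory-expander", memory={0x1000: b"weights"})
assert p2p_read(gpu, expander, 0x1000) == b"weights"
```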


Fig. 1: The Trusted Security Protocol (TSP) enhances security. Source: CXL Consortium

The biggest change in CXL 3.1 is the improvement in switch fabric capabilities. “Now it can handle port-based routing, which allows for a scale-out deployment. The switching fabric is not subject to traditional tree-based hierarchies,” said Ternullo. “It allows cross-domain access for hosts and devices, with a device capable of accessing up to 4,096 hosts or other devices. So a designer now can implement a large system or topology with any-to-any communication.”
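A back-of-the-envelope model of port-based routing, with invented data structures (the real mechanism lives in switch hardware and the spec's PBR message formats): each edge port gets a 12-bit ID, which is where the 4,096-endpoint figure comes from, and switches forward by destination ID rather than walking a PCIe-style tree.

```python
# Illustrative sketch of port-based routing (PBR). The class and methods
# are hypothetical; only the 12-bit ID space reflects the spec.
PBR_ID_BITS = 12
MAX_ENDPOINTS = 2 ** PBR_ID_BITS   # 4,096 hosts or devices

class PBRSwitch:
    def __init__(self):
        # destination PBR ID -> egress port on this switch
        self.routing_table: dict[int, int] = {}

    def add_route(self, dest_id: int, egress_port: int) -> None:
        if not 0 <= dest_id < MAX_ENDPOINTS:
            raise ValueError("PBR ID must fit in 12 bits")
        self.routing_table[dest_id] = egress_port

    def forward(self, dest_id: int) -> int:
        # Any-to-any: any source can reach any of the 4,096 IDs, with no
        # requirement that the path follow a tree-based hierarchy.
        return self.routing_table[dest_id]

sw = PBRSwitch()
sw.add_route(dest_id=0x7FF, egress_port=3)
print(sw.forward(0x7FF))   # -> 3
```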


Fig. 2: CXL versions 3.0 and 3.1 expanded the features of the protocol. Source: CXL Consortium

There are other improvements incorporated into the CXL 3.1 spec, as well. “For example, host-to-host communication for fabric-attached memory access with a global integrated memory (GIM) concept,” said Rambus’ Mollah. “CXL 3.1 also introduces the trusted security protocol (TSP) for enhanced security. With scaled-out deployment capabilities, where so many VMs will be connected via the fabric, security is a very important element that needs to be taken into consideration. The extended metadata capability in CXL 3.1 (up to 34 bits of metadata) allows more diagnostic data and information to be monitored. All of these are major breakthroughs for a scaled-out, large-topology deployment.”


Fig. 3: New global integrated memory (GIM). Source: CXL Consortium

As originally envisioned, CXL was primarily targeted at heterogeneous computing, explained Anil Godbole, marketing working group co-chair for the CXL Consortium and senior marketing manager for the Xeon product planning and marketing group at Intel. CXL was largely developed by Intel.

“Coherence is what really differentiates CXL from PCIe, because CXL runs on the same I/Os as PCIe,” said Godbole. “The base PHY of both protocols is the same. But at link-up time, a CXL device will link up as CXL and the host will talk CXL to it, while a PCIe device, if it were put in the same slot on the motherboard, would come up as PCIe. When it links up, the device says, ‘I must speak CXL from here.’ PCIe could never give coherent memory. So fast forward to today, and increasing the memory footprint is the biggest use case of CXL.”
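The link-up behavior Godbole describes can be summarized in a few lines of pseudo-logic. The sketch below is a hypothetical model of the outcome of link training, not the actual negotiation mechanism (which happens in the PCIe/CXL alternate-protocol flow): the same slot comes up as CXL only when both sides support it, and otherwise falls back to plain PCIe.

```python
# Hedged sketch of protocol selection at link-up. Field and function
# names are invented for illustration.
from dataclasses import dataclass

@dataclass
class Port:
    supports_pcie: bool = True    # every CXL port also speaks PCIe
    supports_cxl: bool = False

def negotiate(host: Port, device: Port) -> str:
    if host.supports_cxl and device.supports_cxl:
        return "CXL"     # coherent CXL.cache/CXL.memory traffic possible
    return "PCIe"        # same PHY and slot, but no coherent memory semantics

print(negotiate(Port(supports_cxl=True), Port(supports_cxl=True)))   # CXL
print(negotiate(Port(supports_cxl=True), Port(supports_cxl=False)))  # PCIe
```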

On the other hand, Godbole said, “If you don’t need coherency, then you’re better off not having the overhead of the protocol. You can simply pass data from one point to another.”

Countering the objections that CXL doesn’t work with GPUs, Godbole noted that GPUs need CPUs. “You have to understand how a GPU actually gets its workload. In the end, it’s always the CPU that starts executing the neural network workload. A GPU has no brain of its own. It is simply a matrix multiplier behemoth, which is always fed by the CPU. When I’m in meetings, people are always asking what we can do to suck more bandwidth out of the CPU. That’s something we’ll be addressing in the next spec through aggregation of CXL links.”

How CXL handles memory pooling has also been questioned, said Arif Khan, senior product marketing group director at Cadence. “Between the time the spec was released and the introduction of initial OEM platforms supporting the 1.1 standard, the spec had advanced much further. Despite this, there is significant interest from implementers as they seek to build memory expanders and pooling devices. A critique of memory pooling has been offered in an ACM paper by Levis et al., primarily around the cost, software complexity, and utility. The limited set of publicly available datasets to compare makes this an interesting read. However, the market still demonstrates demand for this standard as implementers are building solutions around it.”

The economics of memory pooling appear to be especially attractive, even if the data is sparse. “CXL will continue to be used for capacity and bandwidth expansion in the future, as well as with memory tiering via compressed memory,” said Mollah. “With lower cost per byte, advanced applications such as memory pooling use cases will become more attractive and allow for a disaggregated data center infrastructure with a lower TCO.”
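A simple, entirely hypothetical calculation illustrates why pooling is attractive: when every server must be provisioned for its own worst case, much of that DRAM sits stranded, while a shared CXL pool can be sized closer to aggregate demand. All numbers below are invented for illustration.

```python
# Back-of-the-envelope illustration of stranded memory vs. pooling.
# Every figure here is hypothetical, not from the article.
servers = 16
provisioned_per_server_gb = 1024        # worst-case DRAM in each server
peak_demand_gb = [512, 768, 256, 1024, 640, 384, 896, 512,
                  704, 448, 832, 576, 320, 960, 480, 608]

# Dedicated model: every server carries worst-case capacity.
dedicated_total = servers * provisioned_per_server_gb

# Pooled model: servers keep a smaller local footprint and borrow
# the remainder from a shared CXL-attached pool sized to demand.
local_per_server_gb = 256
pooled_total = servers * local_per_server_gb + sum(
    max(d - local_per_server_gb, 0) for d in peak_demand_gb
)

print(f"dedicated: {dedicated_total} GB, pooled: {pooled_total} GB")
print(f"capacity saved: {dedicated_total - pooled_total} GB")
```

In this toy example the pooled configuration needs roughly 40% less total DRAM, which is the kind of cost-per-byte argument behind the disaggregated, lower-TCO data center Mollah describes.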

But CXL isn’t the only game in town. “The standard has seen some recent skepticism with the rise of alternate standards for specialized use cases,” Khan said. “In the five years since the specification was first publicly announced, we’ve seen exponential growth in the AI accelerator space. With LLM use cases dominating the hype cycle, systems are being optimized for these applications. The CXL coherency model is being challenged by other scale-up standards that are now being conceived. Proprietary standards in use by GPU makers are in place already, and have set the benchmark for those specific applications. In any case, the standard space is still new, and it will take some time for the market to determine which standards serve which segments the best. Often, commercial dynamics play an outsize role.”

Solomon said CXL is best seen as part of a spectrum of choices, similar to how memory has a broad range of offerings, which answer specific needs. “Clearly, there are places where it makes sense not to be the fastest, especially because of the tradeoffs with price and capacity. If you look at every modern computer architecture in the last 30-odd years, there’s a hierarchy even within caches, so clearly there is a use for memory that’s not the fastest thing around,” he said. “If you’re on the absolute bleeding edge, CXL might not be fast enough for you. But if you’re building a fast, economical device, then CXL might be your best choice. There’s no one perfect technology for everything. It’s about balance.”

Siemens’ Browy agreed. “CXL, being based on PCIe SerDes, has a lower error rate, lower latency, and commensurately lower bandwidth. NVLink and UALink’s use of Ethernet-style SerDes results in a higher error rate, higher latency, and higher bandwidth, so for the highest-performance, bandwidth-limited cases, such as GPU-to-GPU, these have an advantage. In a world where modules increasingly are based on UCIe-connected chiplets serving as the basic building blocks of general- and special-purpose computing, the real CXL advantage is in meeting the need for a robust, low-latency method to provide an intelligent cache-based hierarchy, including what is now local main memory, storage-class memory, and evolving pooling/sharing and computational storage solutions. This will be key to database, computational storage, general-purpose computing, scientific computing, and AI, where a GPU can be viewed as unified, intelligent memory.”


Fig. 4: CXL use cases and validation solutions. Source: Siemens EDA

The future
Ternullo believes there will be a growing role for CXL, and he expects continued use for memory expansion. “In addition, it will further enable heterogeneous compute and data center disaggregation, helping to minimize server over-provisioning and enable on-demand access to memory, storage, acceleration, and more.”

Looking ahead, Yole Research predicts a $16 billion market for CXL by 2028, given its potential to improve memory utilization, management, and access through disaggregation and composability.

And CXL Consortium’s Godbole said this is just the beginning. “Last year was really about kicking the tires, because it was the first time you could add memory and attach it to your server. We had limited SKUs that could support CXL, which limited the market adoption. As we launch CXL, every single CPU will have a CXL feature, so this is now going mainstream.”

Related Reading
CXL: The Future Of Memory Interconnect?
Why this standard is gaining traction inside of data centers, and what issues still need to be solved.
Memory Fundamentals For Engineers
eBook: Nearly everything you need to know about memory, including detailed explanations of the different types of memory; how and where these are used today; what’s changing, which memories are successful and which ones might be in the future; and what are the limitations of each memory type.


