CXL: The Future Of Memory Interconnect?

Why this standard is gaining traction inside of data centers, and what issues still need to be solved.

Momentum for sharing memory resources between processor cores is growing inside data centers, where the explosion in data is driving the need to scale memory up and down in a way that roughly mirrors how processors are used today.

A year after the CXL Consortium and JEDEC signed a memorandum of understanding (MOU) to formalize collaboration between the two organizations, support for the memory interface standard has skyrocketed. Dozens of companies are participating in the organization, and Arm, IBM, Intel, Google, Meta, and Rambus are on the board.

Arif Khan, product marketing group director for PCIe, CXL and interface IP at Cadence, summed up the protocol’s significance: “CXL enables resource utilization in a more efficient way.”

Not all workloads require CXL (Compute Express Link). “There are still a lot of workloads that can work within the memory space of direct attached DRAM,” said Frank Ferro, product marketing group director, IP group at Cadence. “What’s driving a lot of memory-intense applications is the use of artificial intelligence, machine learning, and natural language processing, which are creating demand for higher density memory. CXL will help potentially reduce the total cost of ownership (TCO) in the data center.”

Others agree. “As the relentless demand for more memory bandwidth and capacity grows, increasing the pressure on server memory systems, CXL enables new architectures to address these challenges,” said Larrie Carr, vice president of engineering at Rambus and president of the CXL Consortium. “With widespread industry support of this new standard, Rambus believes that CXL will help revolutionize the data center, taking computing performance to a whole new level.”

There’s good reason for that optimism. “CXL reduces stranded memory while improving data center TCO (total cost of ownership) for hyperscalers, and it enables software innovations at the OS and application levels,” said Vijay Nain, senior director of CXL product management at Micron. “These capabilities will lead to the reduction of traffic flowing from host (CPU) to guest (GPU/IPU), resulting in increased performance per watt.”

CXL also is expected to spur the growth of memory appliances, which can be shared across multiple applications.

“This is similar to how storage appliances previously evolved to move away from storage area networks (e.g., PURE, NetApp, Nutanix, Dell Power Store, among others),” Nain said. “We see hyperscalers, virtualization software (hypervisors and containers), and SaaS applications adapting to multiple tiers of memory with larger capacity for in-memory databases and more bandwidth for AI applications. With CXL 3.0, we also see the adoption of switching infrastructure leading to industry-wide adoption as it reduces datacenter TCO.”

Despite its many innovations, CXL is familiar to many engineering teams. That has helped spur adoption. “Like a lot of these PCI Express-based standards, it was driven predominantly by Intel,” said Lou Ternullo, senior director of product marketing at Rambus. “The physical layer is essentially PCI Express, in order to leverage the entire ecosystem and the main connectivity to the server CPUs, which is the PCI Express bus.”

This PCI heritage gives CXL Consortium Chairman Jim Pappas reason to believe CXL will stick around, which makes implementation efforts worthwhile. “CXL is an alternate protocol,” said Pappas. “When you first power up the system, partway through the sequence the device says, ‘I also know how to speak CXL.’ At that point, the enumeration takes a turn and it connects to a different protocol, but the wires are the exact same wires. CXL is a novel and previously unachievable way of dynamically adding memory into platforms as simply as adding a PCI card.”

This is a new way to look at increasing the utilization of memory without actually adding more resources. “What CXL adds over PCI Express is that it’s a cache-coherent interconnect standard that extends system RAM to external CXL-based memory,” said Christopher Browy, senior director for verification IP product line management at Siemens EDA. “Now you can think about how to program differently than the way you did before, because you have so much more memory available to you and it can grow dynamically on the fly as needed.”
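
In practice, a CXL memory expander typically surfaces to the operating system as additional physical memory, on Linux usually as a CPU-less NUMA node. Below is a minimal, hedged sketch of targeting that capacity with libnuma; the node number (2) and the 1 GiB size are illustrative assumptions that vary by platform.

```c
/* Sketch: allocating from a CXL-backed NUMA node with libnuma.
 * Assumes the CXL memory expander appears as NUMA node 2 on this host
 * (platform-specific; check `numactl --hardware` on the target).
 * Build with: gcc cxl_alloc.c -o cxl_alloc -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE   2               /* assumed CXL-backed node */
#define ALLOC_SIZE (1UL << 30)     /* 1 GiB */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Bind this allocation to the CXL-backed node only. */
    void *buf = numa_alloc_onnode(ALLOC_SIZE, CXL_NODE);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", CXL_NODE);
        return 1;
    }

    /* Touch the pages so they are actually faulted in on that node. */
    memset(buf, 0, ALLOC_SIZE);
    printf("1 GiB backed by node %d at %p\n", CXL_NODE, buf);

    numa_free(buf, ALLOC_SIZE);
    return 0;
}
```

Tiering software can then keep hot data in direct-attached DRAM and push colder or capacity-hungry structures to the CXL node, which matches the tiered-memory usage model described here.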

Fig. 1: CXL sub-protocols. Source: CXL Consortium

CXL has three sub-protocols: CXL.io, CXL.cache, and CXL.mem. “CXL.io is necessary,” Pappas said. “You need to be able to do I/O instructions just to be able to talk to the accelerator, registers, etc., replicating the function of the PCI bus. If you move to type 2, it supports all three of the protocol modes, .io, .cache and .mem. In CXL 1.0, memory can be directly attached. In CXL 2.0, it can attach memory to a pool of processors. This could be used for bandwidth expansion or capacity expansion, letting you use new types of memory, like storage-class memory or persistent memory, or tiers of memory with different performances and cost structures.”
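
For reference, the three CXL device types map onto those sub-protocols roughly as follows. The C sketch below is purely illustrative bookkeeping, not an API from the CXL specification.

```c
/* Illustrative mapping of CXL device types to the sub-protocols they use.
 * Type 1: caching accelerator (no host-managed memory) -> CXL.io + CXL.cache
 * Type 2: accelerator with its own memory              -> CXL.io + CXL.cache + CXL.mem
 * Type 3: memory expander / pooled memory              -> CXL.io + CXL.mem
 */
#include <stdio.h>

enum cxl_proto { CXL_IO = 1 << 0, CXL_CACHE = 1 << 1, CXL_MEM = 1 << 2 };

struct cxl_device_type {
    const char *name;
    unsigned    protocols;   /* bitmask of enum cxl_proto */
};

static const struct cxl_device_type types[] = {
    { "Type 1 (caching accelerator)",   CXL_IO | CXL_CACHE },
    { "Type 2 (accelerator w/ memory)", CXL_IO | CXL_CACHE | CXL_MEM },
    { "Type 3 (memory expander)",       CXL_IO | CXL_MEM },
};

int main(void)
{
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++)
        printf("%-33s io:%d cache:%d mem:%d\n", types[i].name,
               !!(types[i].protocols & CXL_IO),
               !!(types[i].protocols & CXL_CACHE),
               !!(types[i].protocols & CXL_MEM));
    return 0;
}
```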

CXL.mem has generated most of the excitement as engineers realize what it can do for memory expansion in the data center and for advanced AI topologies.

“The people who are raving about it are excited about the opportunities using the .mem protocol,” explained Richard Solomon, technical marketing manager for PCI Express controller IP at Synopsys. “Every single SSD maker on the planet is looking at this as a way to map their expertise in whatever non-volatile memory they’re used to onto what has been the holy grail of storage forever, which is byte-addressable storage. Obviously you’re not going to just take an NVM SSD and magically slap a CXL interface on it. But if you’re an expert at managing NAND flash, you can replace the front side of that with a CXL controller and your own logic to now say, ‘I’m building CXL.mem NVM.’”
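
The appeal of “byte-addressable” is that, once the CXL-attached memory is mapped into an application’s address space, it is accessed with plain loads and stores rather than block I/O. Here is a conceptual sketch, assuming the memory is exposed to Linux as a device-DAX node; the /dev/dax0.0 path and the 1 GiB size are assumptions for illustration, and real systems discover them through the kernel’s CXL and DAX subsystems.

```c
/* Sketch: byte-addressable access to CXL-attached memory exposed as a
 * device-DAX node. The device path and region size are assumptions;
 * real systems enumerate them via the kernel's CXL/DAX subsystems.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t region = 1UL << 30;           /* assumed 1 GiB region */
    int fd = open("/dev/dax0.0", O_RDWR);      /* assumed device path  */
    if (fd < 0) { perror("open"); return 1; }

    char *mem = mmap(NULL, region, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Plain loads and stores -- no read()/write() block I/O path. */
    strcpy(mem, "hello from CXL-attached memory");
    printf("%s\n", mem);

    munmap(mem, region);
    close(fd);
    return 0;
}
```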

Additionally, the .mem path is architected to be lower latency than traditional PCI Express. “There’s a lot of protocol overhead associated with it,” said Rambus’ Ternullo. “That was one motivation to get a lower latency interface. So now you have this alternate means of connecting different types of memory or memory to the server CPU.”

As for current implementations, Cadence’s Khan said the first use cases of CXL are a host CPU connected to a CXL controller with all kinds of different device memories. “You can swap out these memories for different kinds. It could be different types of DDR, different rates, or you could have persistent memories and so on. There are different memory hierarchies that connect to different expanders. The expanders are designed with hot-plug functionality, so CXL DRAM can be swapped out on live servers without requiring them to be shut down. These are some of the things that are being built into systems today.”
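
Behind that hot-plug capability, the newly added capacity ultimately appears to the host operating system as memory blocks to be onlined. The sketch below is a simplified illustration for Linux, assuming the standard memory-hotplug sysfs interface; in practice udev rules or tooling such as daxctl usually handle this step automatically.

```c
/* Sketch: onlining hot-added memory blocks on Linux, as would happen
 * after a CXL memory expander is hot-plugged. Uses the standard
 * memory-hotplug sysfs interface and requires root; normally udev
 * rules or daxctl perform this automatically.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *base = "/sys/devices/system/memory";
    DIR *dir = opendir(base);
    if (!dir) { perror("opendir"); return 1; }

    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        if (strncmp(e->d_name, "memory", 6) != 0)
            continue;                       /* skip non-block entries */

        char path[512], state[32] = "";
        snprintf(path, sizeof(path), "%s/%s/state", base, e->d_name);

        FILE *f = fopen(path, "r+");
        if (!f)
            continue;

        if (fgets(state, sizeof(state), f) && strncmp(state, "offline", 7) == 0) {
            rewind(f);                      /* reposition before writing */
            fprintf(f, "online\n");         /* bring the block online    */
            printf("onlined %s\n", e->d_name);
        }
        fclose(f);
    }
    closedir(dir);
    return 0;
}
```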

CXL timeline
CXL 1.0 was released in March 2019. Revision 1.1 was released three months later, and 2.0 in November 2020. Less than two years later (Aug. 2, 2022), CXL 3.0 was published. “CXL 1.0 was almost as simple as building a regular PCI device, and in CXL 2.0 there was a restriction where you had to either support type one or type two, and then you could also support memory devices. Those restrictions have been broken free with 3.0, so you can now have multiple different device types,” said CXL Consortium’s Pappas.

And when looking at the memory cost in the data center, Cadence’s Ferro noted that about 50% of data center spending goes to memory. “The utilization of that memory is not high, so there’s memory that never gets touched. You’ve got over-provisioning, and you’ve got workloads that just can’t take advantage of all those resources. CXL 3.0 will be nirvana for those who want a fully composable data center.”

But while the advances in CXL are highly anticipated by the design community, the speed of revisions has left the industry trying to catch its breath. “The fast pace of the spec revisions has hindered the deployment,” said Browy. “As an example, we’re at revision 3.0, but the initial deployments haven’t even happened in large volume. A lot of vendors, hyperscalers, and others are essentially delaying deployments until enough of the ecosystem is available. How do they decide to make the investment? How do they make that tradeoff in terms of where the state of the revision is, versus what they want to provide to their end customers to be competitive?”

The CXL Consortium is responding to users clamoring for more features and flexibility. “I can understand why some people are saying, ‘How do I keep up with this?’ It’s growing in complexity as the demand is growing for capability,” said Pappas.

Some accelerators need a symmetric coherence interface, and current CXL doesn’t have that. Synopsys’ Solomon noted the original idea behind CXL was asymmetric coherence. “The folks it doesn’t work for are pushing inside the CXL Consortium to add more symmetric coherency.”

“CXL 3.0 breaks the link of the CPU being the absolute control point, moving functions down to make things more symmetrical,” said Pappas. “It has a much more complex switching topology than CXL 2.0. We could also have nodes, for example. We could have accelerators performing similar types of functions that the host performs. You can get more devices attached to the host. You can also cascade switches and use a tree topology, for example. This is a big improvement over 2.0. In addition, devices can directly talk to each other in a virtual hierarchy without being mapped to particular hosts. The host can still control who can talk to whom. Now they’re mapped to the same coherency domain.”

Fig. 2: Representative CXL usages. Source: CXL Consortium

Pooling and sharing
One of the biggest changes CXL 3.0 allows is shared memory. As Pappas explained, using Figure 3 as his example, “If Host 2 modifies its data or even its cache in S1, it would update the data in Device 1, the memory, but it would also update the cache in Host 1. So those two are now maintaining coherency over CXL and through the switch. It’s the same thing within S2, with H2 and H3 using the magenta copy of S2.”

Fig. 3: Memory pooling. Source: CXL Consortium

Still, he expects that will take a lot longer to materialize. “Let’s say that you are building a supercomputer to run a specific application. You could build into the application the ability to do shared memory, and that may be able to accelerate your application in many ways. It will also likely help with AI applications. Think of an AI machine where you have hundreds or thousands of machines accessing the same database, rather than replicating it everywhere. You just access the database. It’s copied memory, shared memory, or even just a small cluster of machines working on part of the problem, sharing data for its part of the solution. These are all very, very powerful constructs.”

This configuration opens up options like some machines having pooled memory that’s uniquely theirs, while others can have memory that they share with other machines in a totally separate coherency domain. The coherency is done at the physical address level, and it works even though the coherency domain is different across the different CPUs.
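
There is no standardized application-level API for multi-host CXL shared memory yet, but the programming model is the familiar one of mapping a shared region and communicating through ordinary loads and stores. As a single-host analogy only, the POSIX shared-memory sketch below shows that model; with CXL 3.0, the sharers would be separate hosts and coherency would be maintained in hardware over CXL and through the switch, as Pappas describes above. The region name and size are arbitrary.

```c
/* Single-host analogy for the CXL shared-memory model: map a region,
 * then communicate through plain loads and stores. With CXL 3.0 the
 * sharers would be separate hosts, with coherency kept by hardware.
 * Link with -lrt on older glibc versions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/shared_segment_demo";   /* arbitrary demo name */
    const size_t size = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

    volatile long *counter = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (counter == MAP_FAILED) { perror("mmap"); return 1; }

    /* Any other process mapping the same region (another host, in the
     * CXL case) observes this update through ordinary memory accesses. */
    counter[0] += 1;
    printf("counter is now %ld\n", counter[0]);

    munmap((void *)counter, size);
    close(fd);
    shm_unlink(name);
    return 0;
}
```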

Compute fabric
CXL 3.0 also introduces a compute fabric, allowing for non-tree architectures, in which hosts become nodes. “The biggest thing I would say for CXL 3.0, in addition to the speed doubling, is the ability to build fabrics,” said Pappas. “We can have up to 4K nodes in a fabric, so this would let you connect entire racks or maybe even entire pods. Just think of all these nodes being able to talk to each other in a similar manner to what two adjacent processors do in a dual-processor system.”

Fig. 4: Compute fabric. Source: CXL Consortium

And while there can be a cost in latency, Pappas said it’s worth the tradeoff. “Of course, as you scale out, there’s going to be lots of switches, networks, and other items in between. The latency will go up, but you still have very close load store capabilities between these devices. This is unprecedented.”

Varun Agrawal, product manager for verification IP, virtual system adaptors, and protocol solutions for hardware-assisted platforms at Synopsys, agreed. “If you want to balance between latency and a little bit more of memory, then that is where I think CXL is going to fit.”

“It’s slower, but it’s not that much slower,” said Solomon. “Today people say, ‘I could do terabytes of memory, but I’m going over Ethernet to get to it. I’m going box to box, and that’s slower than snails.’ Being able to put memory on CXL and now utilize the CXL fabrics, all that stuff becomes closer and quicker to the CPU. What some people are pushing is CXL as an enabler for those huge memory pools that those kinds of architectures want. A fabric allows the same memory to be shared by multiple servers and multiple CPUs, which does get us into that sort of parallel computing model where now you can have different boxes sharing the same memory, as opposed to just different elements on an AI chip.”

It also provides yet another way of keeping DRAM alive well into the future. “For the past 25 years, there were always these replacement technologies coming for DRAM and flash,” said Omar Ma, DRAM marketing manager at Winbond. “There’s always the latest memory technology du jour, and it sounds like a good idea. But they can never reach the costs or the production capacity requirements, so those remain niche markets and eventually die off.”

DRAM isn’t standing still, either, Ma said. It continues to scale, allowing it to store more per DIMM or per module, which makes it even more attractive in a shared setting. This is important in hyperscale data centers, but it’s also important in the growing number of edge data centers where CXL can play a significant role.

Global fabric attached memory (GFAM)
CXL 3.0 also introduces global fabric attached memory (GFAM), which moves away from a traditional processor-centric architecture. Now, the memory is disaggregated from the processing unit, and there is an enormous memory pool that is an entity in itself.

“There could be different types of memory, different speeds of memory, different characteristics of memory, like non-volatile memory,” Pappas explained. “When you think of AI-centric machines, now you have the memory that’s in the middle, and it’s accessed by lots and lots of CPUs and GPUs in a fabric-centric world.”

Fig. 5: Compute Fabric. Source: CXL Consortium

GFAM adds another option for data centers. “Lots of topologies and configurations could be built,” Pappas said. “How do the platform architects who are building these systems and the data center architects want to have this? Will the middle of the rack have a big memory/switch box, with all of the computation, whether it be CPU or GPU computation, connected into that switch memory? It’s opening up the gate for how data center architecture will evolve.”

Verification
As CXL adoption grows, one major remaining challenge is verification. “A lot of companies struggle with finding a realistic validation environment, which could be a real board to test it against or a real server available on the market, which they can put in and run a real application load on,” Agrawal said. “You don’t have those commercially available validation setups. You just close it based on whatever you have.”

Siemens already has solutions to address verification questions, as shown in Figure 6, according to Richard Pugh, director of product management at Siemens EDA.

Fig. 6: CXL use cases and validation solutions. Source: Siemens EDA

Conclusion
Memory manufacturers recognize that CXL systems will be composed of vast amounts of DRAM, as well as flash memory, to provide volatile and storage class memory.

Siemens’ Browy said this means the whole storage class memory market and SSD market is going to be adopting CXL into their product architectures. “You’ve got the ability to vastly expand the amount of memory, along with the flexibility of how to provision it and allocate it to different jobs. It’s got huge potential to change the way people do programming in computing going forward.”

Still, that doesn’t mean CXL will be the only interconnect going forward. Solomon recalled that two years ago at the Flash Memory Summit, people were asking, “Why would you put an NVM drive on CXL?” “Well, you wouldn’t,” he said. “SSD makers may rave about CXL, but they’re not going to sell you a CXL NVM drive. Classic non-smart NIC networking and similar things are going to stay on PCI Express, probably forever.”

Nevertheless, the opportunity is growing, and Micron has gone all in. “CXL is here to stay and will be the de facto standard for universal interconnects for memory disaggregation in the future,” said Micron’s Nain. “It is driven by the commitment from the entire ecosystem of x86, Arm, storage, and memory vendors. For any new technology to be successful, ecosystem adoption is critical. Of particular importance for CXL is the evolution of the software ecosystem to ensure the most optimal usage of this technology. We are partnering with ISVs to enable progress in this area. Furthermore, as we move forward from direct attach use cases in CXL 2.0 toward the CXL 3.0 timeframe, switching infrastructure and management software play an increasingly critical role. Micron is engaged in active collaborations with our ecosystem partners in all these areas to enable broad CXL adoption.”

Related Reading
CXL Picks Up Steam In Data Centers
Market opportunities for shared resources balloon, but verification/validation remains a challenge.
HBM’s Future: Necessary But Expensive
Upcoming versions of high-bandwidth memory are thermally challenging, but help may be on the way.


