Which Chip Interconnect Protocol Is Better?

Experts at the Table Part 1: CXL and CCIX are different, but it’s not always clear which is the best choice.

Semiconductor Engineering sat down to discuss the pros and cons of the Compute Express Link (CXL) and the Cache Coherent Interconnect for Accelerators (CCIX) with Kurt Shuler, vice president of marketing at Arteris IP; Richard Solomon, technical marketing manager for PCI Express controller IP at Synopsys; and Jitendra Mohan, CEO of Astera Labs. What follows are excerpts of that conversation. (Part 2 is here.)


L-R: Kurt Shuler, Richard Solomon, Jitendra Mohan

SE: Where do you see CXL working versus CCIX?

Shuler: CXL customers are looking at it primarily as the link to a companion chip for an x86 server chip. From a coherency standpoint, in the communication that goes back and forth, it's actually simpler than CCIX.

SE: This was originally an Intel invention, right?

Shuler: Correct. It was meant as an easier way to make a tighter coupling with a Xeon, and it's used by Amazon or Microsoft or some of these other companies doing either FPGA-based designs or custom ASICs. And so there's been a lot of interest in that. Everybody is still asking, 'Have you heard anything? What are your customers doing?' Everybody is circling around and trying to figure out what everybody else is doing. CCIX is a little different. The idea there was that you would have one or more chips and they would all be one cache-coherent system. So in the case of CXL, the coherency is all managed on the Xeon side, and that companion chip is always a slave. It's different with CCIX. If you do the bi-directional coherency, which is what people are interested in, it's one big cache-coherent system. On the digital controller side, you have to duplicate snoop filters and some of the coherency logic to be able to understand what has happened on the other dies that are attached. CCIX is more complex from a digital logic standpoint than CXL. Some large chip companies have used CCIX, but I'm not sure how widely it's been adopted.

Solomon: CCIX is fully symmetric, which has some definite advantages if you're trying to use it as a socket-to-socket kind of interface. So if you want to take multiples of your CPUs and put them down, you've got to have a fully symmetric interface. For accelerators, maybe it's not quite as important. It depends a bit on the workload and the kind of acceleration you're doing, and whether you even care about symmetric cache coherency. With CCIX, everybody's an equal. With CXL, I always describe it as, 'Hey Dad, can I have the car keys?' Dad deals with all the coherency stuff in making the transition. He says, 'Okay, here are the keys, and now you have unfettered access to your memory, the host memory, and things like that.' It's definitely a simpler approach. A lot of accelerator folks don't need fully symmetric coherency. An interesting offshoot is that with fully symmetric coherency, if I'm the accelerator guy and I mess up, the system is coming down. With asymmetric coherency, if I mess up, the rest of the system isn't necessarily corrupted. If I foul up a symmetric coherency protocol, really bad things happen. Another difference is that when they developed CXL, there was a huge focus on low latency. In simulation we're seeing latencies that are a fifth or a tenth of what we're seeing in other places, and that's really been a compelling story for CXL, along with the simplicity. There are still plenty of people looking at CCIX and using CCIX, and it sounds as if the CCIX Consortium is going to go after improving latency. If you think back to when CCIX first came out, its real strength was being built on top of PCI Express. That was what made it easy for everybody to implement. All the transaction layers were already there. You really just added coherency on top, and bam, away you went. But that hasn't paid off well in the latency area, which means that the socket folks got more latency than they really wanted. Even some of the accelerator folks are saying, 'We like the latency of CXL. Couldn't you guys do something like that?' CXL has really been a big kick for CCIX, and clearly CCIX helped CXL come out. So they're improving each other just by being out there.
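To make the symmetric-versus-asymmetric distinction concrete, here is a minimal C sketch of how the two models divide responsibility, assuming a highly simplified view of the protocols: in the host-managed (CXL-style) case the device asks the host to resolve coherency, while in the peer (CCIX-style) case every agent snoops its peers and therefore duplicates that logic on each die. The types, function names and messages are illustrative assumptions, not actual CXL.cache or CCIX transactions.

```c
#include <stdio.h>

/* Conceptual sketch only: it shows who resolves coherency in the two
 * models, not the real CXL.cache or CCIX protocol messages. All names
 * here are hypothetical. */

typedef enum { LINE_INVALID, LINE_SHARED, LINE_MODIFIED } line_state_t;

/* Asymmetric (CXL-style): the host owns the snoop filter. The device must
 * ask the host before it can cache a line ("can I have the car keys?"). */
static line_state_t device_request_line(unsigned long addr) {
    printf("device -> host: request ownership of 0x%lx\n", addr);
    /* The host resolves coherency against every other cache, then grants
     * the line. The device never snoops anyone itself. */
    printf("host  -> device: coherency resolved, line granted\n");
    return LINE_MODIFIED;
}

/* Symmetric (CCIX-style): every agent keeps its own snoop filter and
 * snoops its peers directly, so the logic is duplicated on each die. */
static line_state_t peer_request_line(unsigned long addr, int num_peers) {
    for (int p = 0; p < num_peers; p++) {
        /* Every peer must be snooped; a bug here can corrupt the whole
         * coherent system, which is the failure mode described above. */
        printf("agent -> peer %d: snoop 0x%lx\n", p, addr);
    }
    return LINE_MODIFIED;
}

int main(void) {
    device_request_line(0x1000UL);   /* host-managed: simpler device logic  */
    peer_request_line(0x2000UL, 3);  /* fully symmetric: more logic per die */
    return 0;
}
```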

Mohan: The fact that CXL was originally an Intel invention is actually a key reason why the CXL ecosystem has evolved so quickly. CCIX has been around for some time, as have OpenCAPI and NVLink. All these standards were trying to solve the coherency and latency challenges. However, other than NVLink, which is natively supported and adopted by Nvidia, the other standards did not see widespread adoption, largely because none of the big CPU vendors were participating and adopting them in a big way. In contrast, Intel built the initial version of the spec, included it in their next-generation CPUs, partnered with leading companies from an early stage, and eventually open-sourced CXL. It's clear to me that CXL will become the dominant cache-coherent, low-latency server interconnect.

SE: Are there specific markets that one is heading into versus another, or specific architectures? So if you’re doing an AI chip, would it be obvious which one you would use?

Solomon: Not necessarily. For chip-to-chip on your AI chip, you might want the symmetric coherency of CCIX. There also will be some instruction set architecture partitioning. Certain instruction sets are going to use certain versions. One of the things I do think CXL did right is that it's completely instruction set-agnostic. Everyone is saying it's Intel's answer to CCIX, but you don't have to use an x86 instruction set. With CXL, you can do it with an Arm instruction set or even a Motorola 68000 instruction set. It is very agnostic. Both protocols have worked hard to make themselves not be tied to a particular architecture.

Shuler: Both protocols were created with the assumption that they're going to be in a chip that's in a device that's plugged into a wall, and it's already going to have PCIe on it, so let's ride on top of that for the physical layer and some of the transport stuff. That is one thing. What we're seeing is that for folks who are doing either edge AI devices or automotive (there is an overlap, because most of the recent automotive ADAS architectures are AI, and there's a bunch of inference going on in parallel), the latency issue is a huge deal. Those chips generally are not designed with PCIe on board, so they're looking for alternatives. Something that came up was XSR, the extra short reach standard. Synopsys, Cadence and AlphaWave have SerDes for this. And people are asking, 'If latency is low enough, can we create our own cache-coherent system with two different chips designed in context with each other, and use that instead of CCIX and CXL for these other use cases?' That's something we're dealing with now with our customers. Some of them license the XSR SerDes from other people, some of them do their own. So now, how do you do the PHY adaptation layer from a digital logic standpoint so you don't have to do custom work for every customer?
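A rough idea of the adaptation-layer question Shuler describes can be sketched in C: if the coherent link logic only ever talks to a small, PHY-agnostic interface, the same digital logic can sit on top of different vendors' XSR SerDes without per-customer rework. Everything below is hypothetical, including the struct and function names; it is a sketch of the design pattern under those assumptions, not any particular IP vendor's interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PHY adaptation layer: the coherent link logic talks only to
 * this small interface, so swapping one vendor's XSR SerDes for another (or
 * for an in-house PHY) does not touch the digital coherency logic above it. */
typedef struct phy_ops {
    int   (*init)(void *ctx);      /* bring up lanes (omitted in the stub) */
    int   (*tx_flit)(void *ctx, const uint8_t *flit, size_t len);
    int   (*rx_flit)(void *ctx, uint8_t *flit, size_t max_len);
    void  *ctx;                    /* vendor-specific driver state         */
} phy_ops_t;

/* The link layer is written once against phy_ops_t, not against a PHY. */
static int send_snoop(const phy_ops_t *phy, uint64_t addr) {
    uint8_t flit[16] = {0};
    /* Pack a (hypothetical) snoop request into a flit and hand it to
     * whichever PHY was plugged in at integration time. */
    for (int i = 0; i < 8; i++)
        flit[i] = (uint8_t)(addr >> (8 * i));
    return phy->tx_flit(phy->ctx, flit, sizeof flit);
}

/* A do-nothing stand-in for one vendor's SerDes driver. */
static int stub_tx(void *ctx, const uint8_t *flit, size_t len) {
    (void)ctx; (void)flit; (void)len;
    return 0;
}

int main(void) {
    phy_ops_t vendor_a = { .tx_flit = stub_tx };   /* init/rx left NULL */
    return send_snoop(&vendor_a, 0x4000UL);
}
```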

Mohan: A very important application for CXL will be to realize a truly composable server. A composable server architecture is based on resource disaggregation, where you can configure the amount of resources using software. The concept has been around for years, but there was never really a good interconnect solution that delivered the needed throughput and low latency. With CXL-based hardware, cloud service providers can use a single server configuration to provision multiple virtual instances customized for specific workloads that require varying amounts of memory, AI acceleration, GPUs and networking capabilities.
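As a loose illustration of what software-defined composition might look like above CXL-attached resource pools, the sketch below describes two instance shapes carved from the same physical server. The descriptor and its fields are hypothetical assumptions for illustration; actual provisioning interfaces differ by cloud provider.

```c
#include <stdio.h>

/* Hypothetical descriptor for one composed instance: the same physical
 * server configuration is carved into different shapes purely in software. */
typedef struct {
    const char *name;
    unsigned    cpu_cores;
    unsigned    memory_gb;       /* pooled, CXL-attached memory          */
    unsigned    accelerators;    /* AI accelerators or GPUs on the link  */
    unsigned    nic_gbps;        /* networking capacity                  */
} composed_instance_t;

int main(void) {
    /* Two very different workloads provisioned from one hardware pool. */
    composed_instance_t shapes[] = {
        { "ai-inference", 16,  128, 4, 100 },
        { "in-memory-db", 64, 2048, 0, 200 },
    };

    for (size_t i = 0; i < sizeof shapes / sizeof shapes[0]; i++)
        printf("%s: %u cores, %u GB memory, %u accelerators, %u Gb/s NIC\n",
               shapes[i].name, shapes[i].cpu_cores, shapes[i].memory_gb,
               shapes[i].accelerators, shapes[i].nic_gbps);
    return 0;
}
```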

SE: Are there cases where both of these protocols are included in a design, or is it one or the other?

Solomon: We get asked that question a lot. For a CPU vendor, that is an attraction because you want to be able to play in multiple markets. Maybe you have some people who need the symmetric coherency protocol, some who don’t, and from an IP standpoint it’s not that painful to include both. The hard part is on the customer side, where they have to implement the coherency protocol inside the SoC. Once customers looked at that, none of them opted to do both. A couple of the CPU guys are headed that way. They already have a bunch of CCIX engagements that they are committed to, but they’re also looking at CXL saying, ‘The adoption of this is faster than anything we’ve seen before.’ So some of them are trying to do both. But the typical implementation is going to be one or the other.

Shuler: It’s one or the other. The system vendors and the hyperscaler companies already have Xeons in their data centers. Now they’re adding accelerator chips and things like PCIe cards. There are other guys starting from scratch. That includes companies in China and researchers in the United States, and some of the hyperscaler folks. There are no x86 chips in those, or at least in that part of the system within a big data center. Those are the guys who are excited about CCIX and being able to create that scalable, multi-die architecture within a package. Some of these AI chips have a gazillion Arm cores and processing elements in some kind of replicable tile. They’re usually arranged in some kind of mesh on reticle-size chips. CCIX is of interest to those guys.

Related
Choosing Between CCIX And CXL
Experts at the Table Part 2: What’s right for one design may not be right for the next. Here’s why.
CXL Vs. CCIX
How the Compute Express Link compares with the Cache Coherent Interconnect for Accelerators.
The New CXL Standard
The Compute Express Link standard, why it’s important for high bandwidth in AI/ML applications, where it came from, and how to apply it in current and future designs.
CXL and CCIX Knowledge Centers
Top stories, white papers, videos and blogs on CXL and CCIX.


