Experts at the Table, Part 2: What’s right for one design may not be right for the next. Here’s why.
Semiconductor Engineering sat down to discuss the pros and cons of the Compute Express Link (CXL) and the Cache Coherent Interconnect for Accelerators (CCIX) with Kurt Shuler, vice president of marketing at Arteris IP; Richard Solomon, technical marketing manager for PCI Express controller IP at Synopsys; and Jitendra Mohan, CEO of Astera Labs. What follows are excerpts of that conversation. To view part one of this discussion, click here.
SE: One of the advantages of CCIX is that it allows you to connect heterogeneous components together more easily, right?
Shuler: Yes, it connects the logical hardware.
Solomon: People don’t want to get all of their silicon from the same vendor. Coherency protocols aren’t new. We’ve had them for decades. They were just always closed. Whatever brand of CPU, it had its own coherency protocol and its own coherent interconnect to do multi-package, multi-socket. What’s different here is this is now open. It’s like LEGOs that allow you to connect those things.
SE: So what are some of the main challenges that you run into when you’re creating a design and you’re now working with either CCIX or CXL?
Solomon: The first thing for people to consider is whether they need a coherent interconnect or not. We have a large number of people who come to us and say, ‘CXL is hot. How do we design that in?’ We ask what the application will be. If it’s for basic block storage, you don’t need a coherent interconnect. So the first real barrier ought to be architectural: ‘Do you need a coherent interconnect?’ Then, if you do, you need to look at whether you need the symmetry that CCIX provides. If you do, you’re pretty much locked into CCIX. CXL is fundamentally asymmetric. You’re not going to go do CXL if your design and your system implementation depend on a symmetric coherent interconnect. If an asymmetric interconnect is okay, then you can look at whether latency is important and who your partners are in the system space. I tend to look at things from the device side, because that’s where the bulk of the designs in the market are. There are fewer server and host implementers than there are add-in device implementers. So those guys need to look at who they’ve partnered with. If you’re going to use an x86 processor, at least today, you’re probably only going to get CXL on that device. If you’re going to use a different architecture, then maybe CCIX is an option. After that, you’re into the details of the coherency protocol, and this is not as simple as going from PCI to PCI Express. It’s not a straightforward transition, and you had better be aware of what it means to be coherent. And that leads to power management of the whole system and all the different signaling that has to occur.
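To make that decision order concrete, here is a minimal sketch in Python of the checklist Solomon describes. The function and argument names are illustrative assumptions, not part of any CXL or CCIX specification or vendor tool.

```python
# Illustrative only: a toy encoding of the selection criteria described above.
# The names and categories are hypothetical.

def pick_interconnect(needs_coherency: bool,
                      needs_symmetry: bool,
                      host_is_x86: bool) -> str:
    """Rough decision order: coherency first, then symmetry, then host ecosystem."""
    if not needs_coherency:
        # e.g., basic block storage: plain PCIe is sufficient
        return "PCIe"
    if needs_symmetry:
        # CXL is fundamentally asymmetric, so a symmetric design points to CCIX
        return "CCIX"
    # Asymmetric coherency is acceptable; latency needs and partner support decide
    return "CXL" if host_is_x86 else "CXL or CCIX (check partner roadmaps)"

print(pick_interconnect(needs_coherency=True, needs_symmetry=False, host_is_x86=True))
# -> CXL
```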
Shuler: One of the first questions is how you go about doing this. Is your architecture symmetric or not? Do you control both chips? Are you developing both chips? Or do you imagine you’re going to be hooked up to something from Xilinx or somebody else? The next question is where you’re getting your PHY from. There’s a standard for everything, but the PHYs all have different capabilities and different ways of integrating with them.
Mohan: There is another key challenge when it comes to distributing high-data-rate signals within systems. At CXL data rates of 32Gbps, soon going to 64Gbps, the signal cannot reach very far given the signal loss budget allowed by the standards. For example, once we account for signal loss in the CPU and endpoint packages, we can only ship 32Gbps signals 4 inches on FR-4 material. And even then, a PCIe Gen-5 x16 link operating at the standard’s specified loss limits will have a bit error every second. So we need a solution that can restore the signals and dramatically improve bit error rates, allowing a CPU to reliably reach the farthest PCIe slot or NVMe drive in a server. Fortunately, the PCIe standard defines a retimer class of products to address just this challenge. When it comes to cache-coherent interfaces like CXL, it is critical to have very low latency through retimer components. Note that a round trip through a retimed link incurs 2X the retimer latency. Retimer vendors are beginning to address these challenges.
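As a back-of-the-envelope check on those numbers, the sketch below multiplies the Gen-5 per-lane rate by 16 lanes and by the PCIe raw bit-error-rate target of 1e-12, and doubles a placeholder retimer latency to show the round-trip penalty. The retimer latency value is hypothetical, not a quoted product figure.

```python
# Back-of-the-envelope check of the error-rate and latency points above.
# Assumes the PCIe raw BER target of 1e-12; the retimer latency value is a
# placeholder for illustration, not a product specification.

GT_PER_SEC = 32e9          # PCIe Gen-5 / CXL raw rate per lane (32 GT/s)
LANES = 16                 # x16 link
BER = 1e-12                # spec-level raw bit error rate target

bits_per_second = GT_PER_SEC * LANES          # ~5.12e11 bits/s across the link
errors_per_second = bits_per_second * BER     # ~0.5 errors/s, i.e. on the
print(f"{errors_per_second:.2f} raw bit errors per second")  # order of one every second or two

RETIMER_LATENCY_NS = 10    # hypothetical per-pass retimer latency
round_trip_penalty = 2 * RETIMER_LATENCY_NS   # request and response each cross the retimer
print(f"{round_trip_penalty} ns added to a coherent round trip per retimer")
```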
SE: People have been looking at asymmetric data flows for a very long time, but that’s not always so easy to implement. There are lots of choices about what stays on one chip, what goes off to another, and whether coherence is necessary. How does the interconnect protocol affect that?
Shuler: When CCIX first came out, there was a lot of discussion about doing larger-scale, symmetric cache-coherent systems. But as you add die or separate chips, you have to increase the memories and caches, and track and locally store data for what’s going on in the different die. There’s an architectural line beyond which it doesn’t make much sense anymore. Are you actually losing more than you’re gaining? It’s really, really hard for architects to figure out where that hump is. Even if you have 20 years of experience as a cache-coherent architect, you can’t figure this out anymore in your head or by using Excel. Those methods don’t work with CCIX and CXL.
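The toy model below is one way to see why that hump is hard to pin down: the benefit of adding die grows roughly linearly, while the cost of keeping them coherent grows with the number of agent pairs. Every coefficient here is invented for illustration; as Shuler notes, a real answer requires system-level simulation, not a spreadsheet-style model like this.

```python
# A deliberately crude illustration of the tradeoff described above. All
# coefficients are made up; real decisions need system-level simulation.

def net_benefit(num_die: int,
                sharing_gain_per_die: float = 1.0,
                snoop_cost_per_pair: float = 0.15) -> float:
    """Toy model: linear gain from adding die, pairwise cost from keeping them coherent."""
    gain = sharing_gain_per_die * num_die
    pairs = num_die * (num_die - 1) / 2
    cost = snoop_cost_per_pair * pairs
    return gain - cost

for n in range(2, 10):
    print(n, round(net_benefit(n), 2))
# The benefit peaks and then falls off -- the "hump" the architects are trying to locate.
```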
Solomon: As you go to this kind of asymmetric data flow, where data is unevenly distributed throughout the system, it’s easy to rely on a symmetric coherency protocol and assume that solves it. Everybody has a cache, everybody’s equal. But it’s also easy to back yourself into a corner. Let’s say you pick CXL. It’s asymmetric, but it’s got low latency and widespread adoption. If you don’t understand the system architecture, you could make some really bad decisions. At the same time, symmetric coherency is inherently more complicated. So there’s no simple answer. It’s definitely tradeoffs. The AI guys are struggling because they’re building these really cool, massively parallel processing elements, and they’re saying, ‘You could do this or you could do that.’ The problem is that you need symmetric multiprocessing and symmetric coherency, but you don’t want that overhead. So the hardware guys are running around in circles.
SE: So now you have these very complex chips for AI, and suddenly you have multiple elements that may be used differently, age differently, and be used for longer periods of time than in the past. Is the choice to only move as fast as the slowest element in the system, or is it better to set it up so that each one functions more or less independently?
Solomon: That’s a software question. If I have these heterogeneous elements that inherently have asymmetric capability, what’s the best way to use them? It depends on the problem. If I’m looking at a surveillance camera, maybe the first part of my problem is to find all the faces in the frame. I may not need the most cutting-edge device to do that, so I want to architect my system and put that lower-capability device closer to the camera. For the 12 hours a day that nobody walks by, I can leave the rest of the compute elements shut off. But the chip designer isn’t picking that. It’s really the system designer and the software designer.
Shuler: One of the things that does happen when you start with these multi-die designs is you have to deal with the physical effects. It’s great if you stack different things within a package, but in markets like automotive you have to deal with temperature and vibration and how to dissipate heat evenly within those die and across the substrate. That’s a huge challenge. If you’ve got an ADAS system, the camera is on all the time. Cameras have gotten smarter. They have some silicon in them, and they’re doing some inference and object recognition. The object recognition is converting data into some kind of metadata, which goes to the central brain. But that stuff is running all the time, and that affects the floor plan of what they do on the chip and what gets located where. Something that doesn’t look right from a digital logic standpoint may be there because the heat can’t dissipate up or down when there’s a memory chip below it or another die on top of it.
Mohan: You bring up a good point. These server systems are becoming increasingly complex to design, build and maintain. With shortened design cycles, a server design needs to have the flexibility to quickly and seamlessly upgrade from Gen 4 to Gen 5 to CXL. At the same time, cloud customers have a laser focus on uptime. Reliability, availability and serviceability are key for modern systems. Semiconductor chips used in modern data centers need to be purpose-built from the ground up and offer not only high performance and reliability, but also ease of use and smart diagnostics capabilities.