But figuring out which ones to use, and when to use them, isn’t always clear.
Semiconductor Engineering sat down to discuss memory interfaces, interconnects, and memory access scaling with Madhumita Sanyal, senior director of technical product management at Synopsys; Swadesh Choudhary, senior principal engineer at Intel; Siamak Tavallaei, senior principal engineer at Samsung SSI; and Mohsen Asad, senior director of technology at Credo. What follows are excerpts of a discussion at the IMAPS Memory Summit in Santa Clara, California.

L-R: Synopsys’ Sanyal; Intel’s Choudhary; Credo’s Asad; and Samsung’s Tavallaei
SE: In the real world, data movement is messier than the interconnect standards would suggest. Not all data moves at the same speed, there are multiple channels, and all channels are not created equal. And nothing ages equally over time. How do we deal with this?
Sanyal: One way would be to perform end-to-end simulation before actually building the overall system. That means getting the models of the system in terms of the interfaces, the channel itself, and making sure you take into account performance or any discontinuity in the data path from one accelerator to another accelerator or the host. End-to-end simulation would mitigate the risk and increase visibility into the whole system.
SE: We’re in an age where we’re dealing with AI agents, and they are continually adapting and changing. There are thermal gradients due to changing workloads. So while simulation is absolutely critical, will it effectively address these issues over time?
Asad: That’s exactly the problem. The systems may be working pretty well right now, but then under huge workloads they heat up, and sometimes they break. It’s very common. The real world is not as digitized as zeros and ones. There are lots of fluctuations. During product development we have to be able to iterate quickly, build things quickly, and test them quickly to see errors as soon as possible before sending millions of units to the customer. We have to be the master of finding errors, and when you do find them, you need error correction mechanisms and equalizers. Other times something may look like an error, and because it’s a capacitive system, maybe you’re providing eight times more capacity for the core architecture than it needs. But that also can open up a whole new business opportunity.
Sanyal: If there is continuous health monitoring of the overall system, you can predict a failure before it occurs.
Choudhary: There’s increased importance in simplicity and levels of abstraction to all of this. RAS (reliability, availability, and serviceability) gets messy very quickly. The goal is to come up with simple models that scale with the messier systems, and which are able to isolate a problem and buy you enough time to not bring the entire system down. That gives you time to service the part that’s failing or recover from the failure.
SE: As more commercial chiplets are included in designs, do you know how they’re going to behave and what impact that will have?
Choudhary: It’s very tough to predict, especially when you mix in different technology nodes and different packaging technologies. When you look at compliance and interoperability, that’s definitely at the top of the list. When you say your DDR is X, how much margin do you have on that DDR? We are increasingly looking at different features, doing eye margining, runtime monitoring, trying to make sure that we have enough time to at least send a notification before the system crashes. Serviceability is hard to do when everything is packaged together. So it comes down to having redundancy or other capabilities within the chiplet itself, or having alternatives that can come online if your system is going down. We’re looking at it from a package level, and you need to identify common open-grid signals and variables that we can use to notify and broadcast to everybody who needs to take action.
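As an illustration of the runtime monitoring idea described above, here is a minimal sketch in Python. It assumes a link periodically reports a single eye-margin figure; the class name `MarginMonitor`, the threshold, the window size, and the sample readings are all hypothetical and are not taken from any product or specification mentioned in the discussion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MarginMonitor:
    """Tracks a link's reported margin and flags a downward trend early.

    All values are hypothetical, in arbitrary units.
    """
    warn_threshold: float          # notify when the moving average drops below this
    window: int = 4                # number of most recent samples to average
    samples: List[float] = field(default_factory=list)

    def record(self, margin: float) -> bool:
        """Store one sample; return True when a notification should be sent."""
        self.samples.append(margin)
        recent = self.samples[-self.window:]
        return sum(recent) / len(recent) < self.warn_threshold


# Made-up readings from a link whose margin degrades over time.
monitor = MarginMonitor(warn_threshold=10.0)
readings = [20.0, 18.0, 14.0, 10.0, 6.0, 4.0]
alerts = [monitor.record(r) for r in readings]
print(alerts)
```

Averaging over a short window is one simple way to avoid reacting to a single noisy sample. A real system would choose its own metric and policy, for example raising the notification early enough to bring a redundant path online.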
Tavallaei: As the volume of these products goes up, Murphy’s Law becomes more relevant. If anything can go wrong, it will go wrong. This started with a question about specification standards. If we write it down, it’s supposed to be done this way. If things go wrong, we do it this way. A specification provides a blueprint, an architectural framework. Then comes some sort of design. Somebody dreams up a particular use case to implement using a particular specification. That specification has a lot of optional features that people choose based on business requirements, customer requirements, and which portion of the specification to implement based on the value they get out of it. So that is the base specification. After that, there is some sort of design specification, and after that comes a product. When somebody builds a product for the sake of making money, maintaining it, then that particular company is going to be very careful about demonstrating the design points. Why? Because they don’t want to receive a call from a very angry customer saying, ‘Those are not specifications. There’s no standardization.’
Audience Question: Will CXL and PCIe extend across the entire rack?
Tavallaei: There are a number of layered elements with any kind of interconnect. The bottom layer is the physical layer. Then comes the link layer, the transaction layer, and then other things on top of it. CXL follows PCIe, which followed PCI, which followed EISA (extended industry standard architecture), which followed ISA many years ago. We borrowed these things from the IBM PC, and then what did we do with it? Beyond the physical layer, people started to build firmware, build debug solutions and protocol analyzers, and then later, management for a number of software layers. The bottom layer can change. If CXL devices are out there today, and someone implements memory pooling, which device do we have for that person to use today? We don’t have UALink. NVLink does something for itself. But CXL elements are available now. CXL controllers through memory are available now. Switches are available. They start building something based on what’s available. They build value in their software. Later, they pick up a different kind of interconnect. But composing memory on top of the physical layer, security, RAS orchestration, those things at the higher layers need not change.
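The layering argument above can be sketched in code. This is a minimal Python illustration, assuming the lower layers (physical, link, transaction) can be reduced to a plain read/write interface. The names `Interconnect`, `DictBackedLink`, and `MemoryPool` are hypothetical and do not correspond to any CXL, UALink, or NVLink API.

```python
from typing import Dict, Protocol


class Interconnect(Protocol):
    """Hypothetical stand-in for everything below the software layers."""
    def read(self, addr: int) -> int: ...
    def write(self, addr: int, value: int) -> None: ...


class DictBackedLink:
    """Toy fabric: a dict plays the role of memory reached over the link."""
    def __init__(self, name: str) -> None:
        self.name = name
        self._mem: Dict[int, int] = {}

    def read(self, addr: int) -> int:
        return self._mem.get(addr, 0)

    def write(self, addr: int, value: int) -> None:
        self._mem[addr] = value


class MemoryPool:
    """Higher-layer pooling logic that only sees the Interconnect interface."""
    def __init__(self, link: Interconnect, size: int) -> None:
        self.link = link
        self.size = size
        self.next_free = 0

    def alloc(self, n: int) -> int:
        """Bump allocator: hand out n consecutive addresses."""
        if self.next_free + n > self.size:
            raise MemoryError("pool exhausted")
        base = self.next_free
        self.next_free += n
        return base

    def store(self, addr: int, value: int) -> None:
        self.link.write(addr, value)

    def load(self, addr: int) -> int:
        return self.link.read(addr)


def run(link: Interconnect) -> int:
    """The same pooling code, whichever link sits underneath."""
    pool = MemoryPool(link, size=16)
    base = pool.alloc(4)
    pool.store(base + 1, 42)
    return pool.load(base + 1)


result_a = run(DictBackedLink("fabric-a"))  # stands in for today's interconnect
result_b = run(DictBackedLink("fabric-b"))  # stands in for a later replacement
print(result_a, result_b)
```

The `MemoryPool` code runs unchanged whichever link object is passed in. That is the sense in which software built on what is available today could later sit on a different interconnect.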
Audience Question: Do you see the coexistence of multiple parts within the same system, or do you build the value around the ecosystem that is available?
Tavallaei: Azure has many, many hundreds of thousands of elements in one data center. The poor guy who has to debug this thing if they’re different in every rack. Technically, it’s possible, but it’s easier to spec a design and then qualify it. You don’t have time to qualify everything, so it’s easier to qualify one from A to Z, and then replicate.
Sanyal: There are interfaces from host to SSDs, or host to accelerator, which are PCIe and CXL. That’s what we see in the market today. Two years from now, people who are starting their design right now with the new design stuff, accelerator-to-accelerator switch, may use UALink, but with CXL. I don’t see UALink addressing CXL or UCIe in the very near future.
Tavallaei: Are you saying CXL memory devices will be out there, but CXL accelerators might not be?
Sanyal: I’m talking about the connectivity between the host and the accelerator, and between the host and SSDs. Those will be CXL. But if a customer is designing an accelerator, it will have CXL, PCIe, and UALink. They may be splitting one accelerator into multiples because they need a lot of lanes, so now it will be multiple reticle-sized dies. UCIe will be there, but when the accelerator is talking to the host, it will be CXL.
Tavallaei: Specialization is required when people want to utilize their hardware fully. They will specialize in different areas, and when money is to be made by specializing and customizing a particular interface, people will definitely do that. But I don’t think there will be a superset. There will be niche solutions for everyone, because nowadays every one of those special things can be high volume because the hyperscale customers have a need for many things of the same type.