Navigating a sea of standards and options in a rack and between racks.
Semiconductor Engineering sat down to discuss challenges and solutions for data center build-out and build-up with Gordon Allan, Siemens EDA director of verification IP; Rishi Chugh, vice president of product marketing for network switching at Marvell; Saravanan Kalinagasamy, senior director of ASIC design and validation at Astera Labs; and Jalaj Gupta, product engineering lead at Siemens EDA. What follows are excerpts of that panel discussion.

L-R: Siemens’ Allan; Marvell’s Chugh; Astera Labs’ Kalinagasamy; Siemens’ Gupta.
SE: For data center architectures, which is easier to verify, scaling up or scaling out? And what kinds of problems do you run into?
Gupta: Scale-up is more on the memory semantics. It has its own differentiation, and when you get to verification and look at the switch topology, it would have a management layer and a software layer. You would be concerned about low latency and high bandwidth. You would want to verify the packets at different layers. Scale-out is more on the packet side. With scale-out, it would be more related to the integrity of the packets, not what’s inside them.
Kalinagasamy: Both have their own challenges. Scale-up may have hundreds of GPUs or accelerators, and they have to work in a scale-up network. Scale-out is point-to-point. They cannot make that many ports in the scale-out switches. Scale-up can have hundreds of ports, so the verification challenge is greater in scale-up than in scale-out.
SE: These systems grow more complex with each new node and each new server that you add in, and with what you’re doing with these servers. What does that mean from the verification side?
Allan: I’m a UVM guy. I like sequencing and constrained random. But increasingly we’re seeing the challenges of applying the right stimulus at the right level, at all levels of the hierarchy, applying the big picture stimulus. We’re investing a lot in software-driven flows so we can attach our verification IP and our testbenches to software — real-world traffic going on there. When you move from scale-up to scale-out, the semantics are different. You need to think about multiple orthogonal stimuli happening at the system level.
Chugh: Scale-up and scale-out are each going through a different kind of revolution. Verification is important even for scale-out, because you have UEC (Ultra Ethernet Consortium) coming in. Scale-out right now is entering into the phase of AI, where we have dynamic load balancing, Link Layer Retry (LLR) and Credit-Based Flow Control (CBFC). These are heavy-hitting protocols, especially when you do dynamic load balancing and packet spraying, and they have a lot of impact on verification. AI is like an octopus for scale-out. You have high-radix switches running at a high bandwidth, running at 1.6 G on the port density, and you have a large number of them. So it’s quite complex. On the other side, scale-out is a bit more stabilized because you’ve been using it. You know the end points and the players. There are NIC cards and xPU cards, so there’s a little bit of stability in the system. But the system is evolving. There is an incremental jump, and it’s quite steep. On the scale-up side, the end points don’t exist today, so you are heavily dependent on the models that are there. That becomes more critical because you are basically running a race with a non-existent endpoint. There isn’t a standard checklist you can run against and say, ‘I’m fully compliant.’ So it’s more challenging for scale-up. Plus, you’re dealing with memory semantics.
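Editor's sketch: the idea behind credit-based flow control can be illustrated with a toy model in which a sender spends one credit per packet and gets a credit back only when the receiver frees a buffer slot. This is a minimal illustration of the general technique, not the CBFC mechanism defined by UEC; the class and method names (`CreditedLink`, `try_send`, `receiver_drain`) are hypothetical.

```python
from collections import deque


class CreditedLink:
    """Toy credit-based flow control (illustrative only, not the UEC spec).

    The sender starts with one credit per receiver buffer slot. Sending
    costs one credit; draining a packet at the receiver returns one.
    """

    def __init__(self, rx_buffer_slots):
        self.credits = rx_buffer_slots   # credits currently held by the sender
        self.rx_buffer = deque()         # packets waiting at the receiver

    def try_send(self, packet):
        # With no credits left, the sender must wait instead of overrunning
        # the receiver buffer, so nothing is ever dropped for lack of space.
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def receiver_drain(self):
        # Freeing a buffer slot hands one credit back to the sender.
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


link = CreditedLink(rx_buffer_slots=2)
sent = [link.try_send(p) for p in ("a", "b", "c")]  # third send is blocked
drained = link.receiver_drain()                      # frees one slot
retry = link.try_send("c")                           # now succeeds
```

Even in this toy form, a testbench has to cover credit exhaustion, credit return, and the ordering between them, which hints at why such mechanisms add verification work.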
SE: There are a bunch of new interface standards for moving data — UCIe, Bunch of Wires, UALink, UEC, ESUN from OCP, and a couple proprietary ones like NVLink from Nvidia and UB-Mesh from Huawei. What are the tradeoffs between open and proprietary?
Gupta: This is more on the application side, and what you’re concerned with is bandwidth, power consumption, and latency. Those will determine which interface you go with. If you are scaling up, you have UALink and NVLink. But NVLink is proprietary for Nvidia GPUs. UALink is an open standard, and AMD is supporting it. Those are the things that would define which interface you go for. UCIe is for the chiplet. The process you’re going to follow will depend on those factors and determine which interface you’re going to end up with.
Kalinagasamy: NVLink and UALink are very different things. NVLink is proprietary. If you use that, you are locked into one customer. UALink is an open standard, so you’re bringing all the collective knowledge of the industry. The one good thing about NVLink is that it’s mature. It’s proven. But all the industry leaders are coming together to define UALink and take it to the next level.
SE: The EDA industry has a long history of competing standards, and it didn’t always go well. How do we sort through these?
Allan: It takes time. We have a history of being patient in a competitive environment and taking our time to develop technology incrementally. If you think back to CXL, which is a different application domain from UALink and Ethernet, CXL really took three full iterations of the spec before it became a production-worthy standard. Along the way, CXL absorbed some other standards into its specification. We can expect something similar with UALink and NVLink. In EDA, we have the luxury of investing in all of them, supporting all of them for our customers, whether they’re your peers or your competitors. Verifying these systems with multiple cores from different sources, with multiple networking infrastructures from different sources, is a really interesting challenge. But it comes down to solid specifications and solid verification IP.
SE: How does this play out in the chiplet world?
Chugh: We have custom designs and we have standard products. With custom, many things are proprietary. Intel conceived UCIe, but they don’t play in networking or switches. UCIe is like a benchmark, because we are not building CPUs. We are building a network fabric. It’s not like anybody can just throw something out there and it has to be used. When we do xPUs for our end customers, which are system houses or hyperscalers, they have their own set of requirements because they are building systems. They are not printing Intel Inside on it. For Intel it was there for their usage, and it was perfect for them. They did a good thing by putting it in the public domain and telling people that if they want to use it they can. But from our point of view, when a specific customer is looking for something, they will choose what works best for their system and their application at a certain cost point. If some things are not working, they will not absorb it just because it’s an industry standard. They’ll take the goodness of it, modify it or customize it, and then take it into the mainstream.
SE: Some standards, like Ethernet, have been around for 35 years. UALink has been around since 2025. What sort of issues do you run into with verification of the newer standards?
Gupta: There are different challenges. It’s not that one is hard and the other is easy. Ethernet has been there for a long time, which means we have been working with it for a long time. We’ve hit issues in the past. UALink is new. AMD has been backing it, and has been developing it in-house. They came up with an open standard that everyone can use. But it has its own challenges. It has memory semantics, which were not there for Ethernet, although it is using the same Ethernet PHY. It’s going to be a challenge to verify UALink, because when you come to the ecosystem that UALink creates, there will be hundreds or thousands of ports that are connecting together.
SE: The common denominator here is moving more data and moving it faster and securely. Where does co-packaged optics fit into this?
Allan: We’re working with a number of customers who are using optics for the connection. Whenever we have a standard like PCIe, the next generation doubles in throughput, doubles in speed, and it’s already out of date. People are hungry for more bandwidth and better latency, so there is always an optical option to sit on top of this data. It’s a moving target, though. There is lots of silicon involved in connecting our copper-based standards to the optical world, and there’s infrastructure that optics has for proprietary switching applications. It’s less likely to be open, but we see several customers who are pushing the boundaries of standards in CPO.
SE: Do existing tools work for CPO?
Chugh: The tools work for CPO, but it’s more than that. Where we hit a hurdle with CPO is the packaging technology. The fundamental question is, why CPO? There are multiple reasons. One, what’s driving CPO is what’s after 200 gig. What comes next is 400 gig, which is PAM4, PAM6, and PAM8. It’s PAM4 in the optics, PAM6 and PAM8 in the host. That means you’re not dealing with a unified PAM. Number two, at the host point, your package losses are going to be more than 12 dB: 12 to 15 dB to launch, 12 to 15 dB to receive. Where is the remaining margin? You have PCBs, you have connectors. Using 400 gig for a longer-distance transmission creates a bottleneck. We have come to the point where copper cannot take us to the finish line. CPO is something that everybody on the networking and switching side will have to consider.
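Editor's sketch: the margin question above is simple subtraction once the package-loss figures are read as insertion loss in dB (an assumption about the quoted numbers). The total end-to-end budget below, `ASSUMED_TOTAL_BUDGET_DB`, is a hypothetical placeholder and not a value from any standard or from the panel; only the 12 and 15 figures come from the discussion.

```python
def remaining_margin_db(total_budget_db, launch_loss_db, receive_loss_db):
    """Loss budget left for PCB traces, connectors, and cable after
    subtracting the package loss at both ends of the link."""
    return total_budget_db - (launch_loss_db + receive_loss_db)


# Hypothetical end-to-end loss budget, chosen only for illustration.
ASSUMED_TOTAL_BUDGET_DB = 40.0

# The low and high ends of the per-side package loss quoted in the discussion.
for package_loss_db in (12.0, 15.0):
    margin = remaining_margin_db(
        ASSUMED_TOTAL_BUDGET_DB, package_loss_db, package_loss_db
    )
    print(f"package loss {package_loss_db} dB per side -> {margin} dB left")
```

Under this assumed budget, the two packages alone consume more than half of it, and everything else on the board has to fit in what remains.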
SE: How about the verification of CPO?
Allan: Verification here means multiple things and different kinds of physics. We have teams and products covering every kind of physics, whether it’s fluid, thermal, or mechanical stress, and these techniques apply to many domains. For example, chiplet-to-chiplet connections need all of these things, not just functional verification. It includes heat, thermal/mechanical effects, and optics as a scale-out medium, and even as a scale-up medium. There are challenges, but those challenges are already being addressed by the rack makers. They’re dealing with liquid cooling, physical constraints, and thermal constraints.
SE: What are the most important metrics here, and have they shifted?
Chugh: Thermal is the main one, because of the power consumption and the heat being emitted from the rack. That’s number one. There is a huge challenge cooling these platforms today. Second, you want to make sure the links are stable when you’re connecting these devices to the endpoints. You have servers from different companies, switches from different companies. And third is the network protocol, which is much simpler.
Kalinagasamy: Another thing is the system uptime. How long is the power so high that you have to cool it down? System stability and uptime are critical.
Allan: Economics and security, as well as all the technological things. It’s critical that we bake in security from the lowest level to the interface protocols.
Gupta: It depends on the application. If you’re streaming things, then you want higher bandwidth. On the latency side, when you have close connections, you have caches involved. That’s where you want lower latency. In terms of scale-up networks, we have both latency and higher bandwidth, but we don’t have coherent systems. That’s something we give up. So there are tradeoffs. If you go for one corner, you might have to give up something else.