New Data Center Protocols Tackle AI

UALink scales up, while Ultra Ethernet scales out.


Compute nodes in AI and HPC data centers increasingly need to reach out beyond the chip or package for additional resources to process growing workloads. They may commandeer other nodes in a rack (scale-up) or employ resources in other racks (scale-out).

The problem is that there is currently no open scale-up protocol. So far this task has been dominated by proprietary protocols, because much of the highest-performance computing is done in large data centers using custom chips and architectures. And while Ethernet is popular for scale-out, it’s sub-optimal for AI and high-performance computing workloads.

But two new protocols — UALink and Ultra Ethernet — aim to address deficits in current scale-up and scale-out communications. UALink is a completely new scale-up protocol, while Ultra Ethernet builds on Ethernet for scale-out.

Multiple communications duties
A “compute node” is an abstract notion describing some locus of computing. It has a finite capacity, with access to a finite amount of memory and other possible resources such as accelerators. By itself, it’s inadequate for high-intensity workloads and relies on other nodes over which to distribute the overall problem. The protocols that provide the communication necessary to exchange data and coordinate operation can generally be split into three categories.

The lowest-level protocol is the die-to-die interconnect, and it’s relevant today because of advanced packaging. What looks like a single compute node in a package may be multiple chiplets working together. The protocols that enable this are UCIe and Bunch of Wires (BoW), as well as some proprietary ones. But all of these communications are invisible outside the package.

A fully loaded compute node can be thought of as a server board with computing, memory, and accelerators attached. There may be more than one processor on the board, however, so system software determines which workloads operate on which processors. But that’s insufficient for the kinds of tasks required for training AI models. That requires reaching out into the rack or pod to leverage more resources.

The goal is to assemble multiple compute nodes while maintaining the feel of a single compute space — multiple processors and accelerators acting as a single large processor or accelerator with unified addresses. This middle communication level is scale-up, and it’s where UALink fits into the picture. It works alongside PCIe and CXL, but only UALink has the effect of unifying the allocated resources.

“UALink is designed to connect your main GPU units for GPU-to-GPU scaling,” said Michael Posner, vice president of product management for high-performance computing IP solutions at Synopsys. “And it is designed to increase bandwidth and reduce latency for that connection.”

GPUs are just one type of accelerator, and UALink will work broadly with any type. UALink then abstracts away the divisions between accelerators.

“The idea is to interconnect AI processors to look like one large processor within this pod,” said Jon Ames, principal product manager at Synopsys.

Memory access is an important part of UALink’s role. “UALink optimizes xPU-to-xPU memory communication across accelerators in a pod, either directly connected or through a fully connected high-radix switch,” said Arif Khan, senior product marketing group director for design IP, Silicon Solutions Group at Cadence in a blog post.

Looking beyond the rack
Beyond the resources in the rack lie similar resources in other racks. But those racks aren’t accessible over the same interconnect holding a single rack together. Ethernet typically handles rack-to-rack communication, and this is scale-out — the top communication level. It resembles scale-up, but with a broader reach than scale-up can provide. This architecture has one network (e.g., PCIe) within the rack and another (or another level to the network) outside the rack. That is the primary distinction between scale-up and scale-out.

“Ultra Ethernet addresses scale-out,” said Posner. “It’s built on top of traditional Ethernet.”

Khan agreed. “Expansion across pods relies on Ultra Ethernet to accelerate data-center Ethernet (essentially a replacement for bulk transfers that rely on remote DMA/RoCE today),” he said.

Fig. 1: Four levels of data-center interconnect. Across the data center, moving from rack to rack, constitutes scale-out communication. Within the same rack is scale-up. Within an advanced processor package, die-to-die interconnects handle inter-die communication. Source: Bryon Moyer/Semiconductor Engineering

One fundamental difference between the die-to-die protocols and the others is the nature of the link — serial vs. parallel. UCIe and BoW are both parallel interfaces, typically with forwarded clocks. That provides the lowest latency, but it requires many more pins and makes skew a much more significant issue.

UALink and Ultra Ethernet employ serial links. That drastically reduces the number of necessary signals, but it adds overhead for extracting the clock and resolving symbol values for formats other than non-return-to-zero (NRZ). This extra processing is what raises the link latency over what die-to-die protocols provide. “The parallel interfaces, like UCIe and BoW, give a very low NoC-to-NoC latency compared to any interface,” noted Pratyush Kamal, director of central engineering solutions at Siemens EDA.

Scale-up: A green field
Today, PCIe and CXL can operate at the rack level, but they don’t provide the semantics that the creators of UALink are designing. The incumbent technology thus consists of a wide range of proprietary solutions. Each company implementing scale-up must dedicate resources to designing a protocol, and multiple companies doing the same thing is an efficiency drain on the industry.

“We see UALink replacing a lot of proprietary interconnects,” said Ron Lowman, strategic marketing manager for IP at Synopsys. “[Designers creating proprietary versions] have used anything from PCIe to Ethernet and everything in between, with customizations to handle the scale up, and UALink is addressing that.”

The UALink Consortium was formally convened last fall with the stated goal of “developing interconnect technical specifications that facilitate direct load, store, and atomic operations between AI Accelerators.” In fact, the UA in UALink stands for Ultra Accelerator. It doesn’t obviate PCIe or CXL, and there’s overlap among the duties of those three. However, UALink is being optimized specifically for AI and HPC workloads.

It consists of three primary layers — a transaction layer on top that manages full transactions, a data-link layer in the middle that manages each hop, and a physical layer (PHY) that deals with signaling. The first two are new, but the PHY layer leverages what’s already in place to speed up implementation and adoption.

To some extent, scale-up has been the domain of PCIe without being optimized for AI. “What you’ll see in PCIe is lots of different chips doing lots of different tasks, whereas UALink is really trying to take an AI accelerator, and scale it from 1 to 1,000 to handle a single workload,” said Lowman. “UALink doesn’t have all the features and backward compatibility of PCIe, but it addresses specific AI workload needs such as global memory addressing and shared memories.”
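To make the idea of global memory addressing concrete, here is a minimal Python sketch. The names and the fixed region size per accelerator are invented for illustration and are not taken from the UALink spec; the point is simply that one global address space is partitioned across devices, so a load or store is routed to whichever accelerator owns that address range.

```python
# Illustrative sketch only -- not the UALink spec. It models a single global
# address space partitioned across accelerators, so a load or store is routed
# to whichever device owns that address range.
REGION_SIZE = 1 << 30  # assume each accelerator exposes 1 GiB (hypothetical)

def route(global_addr: int) -> tuple[int, int]:
    """Map a global address to (accelerator_id, local_offset)."""
    return global_addr // REGION_SIZE, global_addr % REGION_SIZE

class Pod:
    def __init__(self, num_accelerators: int):
        # Each accelerator's local memory, modeled as a plain dict.
        self.mem = [dict() for _ in range(num_accelerators)]

    def store(self, global_addr: int, value: int) -> None:
        dev, off = route(global_addr)
        self.mem[dev][off] = value        # lands on the owning device

    def load(self, global_addr: int) -> int:
        dev, off = route(global_addr)
        return self.mem[dev].get(off, 0)  # fetched from the owning device

pod = Pod(num_accelerators=4)
pod.store(3 * REGION_SIZE + 0x40, 42)     # transparently targets device 3
print(pod.load(3 * REGION_SIZE + 0x40))   # -> 42
```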

Two initial versions of UALink will debut, one at 224 Gbps and one that can be relaxed to half that speed (the -200 and -100 versions). Both will feature an Ethernet PHY. After the initial release, a -128 version is planned that will leverage the PHY from PCIe Gen 7.

The consortium developed UALink not to be ideal, but to be realizable quickly because the industry is evolving so fast. “The software for AI hardware is moving much more rapidly than hardware can respond,” said Lowman. “So getting something out that helps with scale up as soon as possible will be beneficial for the whole industry.”

That means re-using as much as possible from existing standards. “The idea wasn’t that Ethernet and PCI were the absolute best options out there,” Lowman said. “The idea was that we could get to market fast with a standardized protocol that did the basic things required for scale-up architectures. So the consortium took existing technologies. UALink 128 is leveraging a PCIe-like PHY, and UALink 200 is leveraging an Ethernet-based PHY.”

UALink isn’t expected to challenge PCIe or CXL. “We’ve had lots of conversations on positioning of PCIe, CXL, and UALink, and we firmly believe they all have their niche within the market,” he said.

The UALink 1.0 specification should be released within the next quarter and will be available for free download.

Scale-out: Building on Ethernet
Ethernet has been widely adopted thanks to its ability to handle a broad range of applications well enough. But some of its policies hurt performance, largely due to tail latency.

Communication latency in Ethernet isn’t fixed or predictable. One transaction may complete with no problems while another may suffer a congested link with dropped packets, necessitating a resend. Even while most transactions may complete in minimal time, these workloads need all nodes to sync up before proceeding, and one link taking longer than the others can hold everything up. It’s the latency caused by these (hopefully) few transactions that the term tail latency refers to. They’re the tail of the latency distribution.
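A toy Python model makes the tail-latency point concrete. The latency and drop-rate numbers below are invented for illustration; what matters is that a synchronized step finishes only when the slowest of its many transactions does, so a handful of retried packets sets the pace for everything.

```python
# Minimal sketch of why tail latency dominates a synchronized step.
# All numbers are made up for illustration.
import random

random.seed(0)

def step_time(num_links: int, drop_prob: float, base_us: float = 10.0,
              retry_penalty_us: float = 500.0) -> float:
    """Every node must finish before the step completes, so the step time
    is the maximum latency across all links, not the average."""
    latencies = []
    for _ in range(num_links):
        lat = base_us
        if random.random() < drop_prob:   # a dropped packet forces a resend
            lat += retry_penalty_us
        latencies.append(lat)
    return max(latencies)

# Even a 1% drop rate means most steps end up waiting on a retried transaction.
times = [step_time(num_links=1024, drop_prob=0.01) for _ in range(100)]
print(f"median step time: {sorted(times)[50]:.0f} us")
```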

When considering latency, it’s also important to realize that the latency added by a die-to-die connection is more than just the physical layer delay. “The latency that matters is the NoC-to-NoC latency, not the PHY-to-PHY latency,” said Kamal.

This issue is particularly acute for AI and HPC workloads because of their communication pattern. Ethernet is most often used to pass streams of data east-west or north-south. There’s a directionality and a sense of, “We completed that flow and that’s the last we’ll see of it.” But AI/HPC workloads send data out for computation and bring results back. It’s not just a stream that disappears. It’s data out and results back, over and over. It’s more like respiration than a flow, with each send-out of data being the exhale and the returning results being the inhale. Each “breath” in or out involves multiple transactions between nodes.

“Ethernet was developed specifically for being a general-purpose network,” said J Metz, steering committee chair for the Ultra Ethernet Consortium. “It’s good if you’ve got north/south traffic or east/west traffic. It’s not so good if you’ve got clustered traffic doing all-to-all, or all-reduce, or any of the other collectives. When you’re passing the messages back and forth so they can do their own processing and then send it back, that’s more like that breathing environment.”

Fig. 2: Ultra Ethernet’s position in the data-center network. Scale-up happens within the node, making the collection of resources look like one virtual node. Ultra Ethernet scales those nodes out. Although not illustrated here, CPUs can participate as well as GPUs. Source: Ultra Ethernet Consortium

Although Ultra Ethernet can connect via a network interface card (NIC), that’s not necessary. “A fabric endpoint (FEP) could be any device that acts with a fabric address, and that could be a suitable Ethernet point on an accelerator itself,” said Metz. “The magic happens at the FEP, which includes congestion, semantic, and packet delivery control.”

Figure 2 illustrates a simplified data-center network with a focus on GPUs. But CPUs also can participate. “AI workflows aren’t monolithic,” said Metz. “There are many stages that boomerang between CPUs and GPUs of different clusters, and even within clusters. Some of the work is best done in CPUs, some in GPUs.”

The Ultra Ethernet Consortium (UEC) is specifically targeting this type of communication with a few mandatory features and several optional ones. Given a transaction, only the endpoints have mandatory behaviors. That’s intentional, so that Ultra Ethernet networks can be built with standard Ethernet switches. Such networks won’t provide all the benefits of Ultra Ethernet, but endpoint installation can proceed without having to wait for new switches.

Adding layers to Ethernet
Standard Ethernet specifies functionality for layer 2 (the data links) and below. It has no knowledge of transactions or endpoints. It’s simply moving data hop by hop. Ultra Ethernet builds on this by adding layers 3 (network) and 4 (transport). It’s the transport layer that manages the semantics of a transaction. Must it be secure? Must all packets arrive in order? Must it be reliable?
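As a rough illustration of what per-transaction semantics means in practice, the short Python sketch below (invented names, not the UEC API) shows a sender choosing reliability, ordering, and security independently for each transaction.

```python
# Illustrative sketch only -- these flags and names are assumptions, not the
# UEC API. The idea is that the sender picks delivery semantics per transaction.
from dataclasses import dataclass

@dataclass(frozen=True)
class TransportSemantics:
    reliable: bool   # must every packet be delivered?
    ordered: bool    # must packets arrive in the order they were sent?
    secure: bool     # must the payload be protected?

# A bulk data transfer can tolerate out-of-order arrival...
bulk = TransportSemantics(reliable=True, ordered=False, secure=False)
# ...while a control message might insist on ordering and protection.
control = TransportSemantics(reliable=True, ordered=True, secure=True)
print(bulk, control, sep="\n")
```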

“The transport piece is a big part of what’s in Ultra Ethernet,” said Ames. “It gives you mechanisms that can reduce the overall system latency.”

The sanctity of the layers hasn’t been well respected in traditional Ethernet. Additional features have crept into some layers that might have fit more neatly into others. Ultra Ethernet is trying to avoid that. “You want to make sure that when you do something in layer two, it does layer two,” said Metz. “You want to do something in layer three, it does layer three. You don’t do routing protocols at the MAC layer.”

Layer 3 simply employs the internet protocol (IP), unchanged. “The networking layer is not currently being addressed [by us],” he said. “That’s good in the sense that it helps simplify the process and makes things very easy for traditional data center environments using Clos or leaf-spine configurations. Once you start getting into things like dragonfly, megafly, or torus [network topologies], which you’ll see more often in HPC environments, we’re not focusing on that. We will have to address that in the future.”

The transport layer is the mandatory portion of the standard, implemented in the endpoints. “The source endpoint is going to be the core decision maker, and then the receiving endpoint is going to give the feedback that’s necessary [for those decisions],” said Metz. In the case of a problem packet, instead of sending what would normally be an ACK (acknowledge), the destination sends a NACK (negative acknowledgment) along with some diagnostic information.

“You identify the packet that was either missing or slow and send that back to the source,” explained Metz. “The source marries that to whichever path it had originally chosen and chooses a different path in the resubmission.”
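A minimal Python sketch of that retry loop, using invented names rather than anything from the spec, looks like this: the receiver’s NACK identifies the failed packet, the sender recalls which path it originally used, and the resubmission deliberately avoids it.

```python
# Illustrative sketch of NACK-driven path re-selection (assumed names, not the
# spec). The sender remembers the path per packet and retries on a new one.
sent_paths = {}                 # packet sequence number -> path used

def send(seq, paths, avoid=None):
    path = next(p for p in paths if p != avoid)   # pick any path except 'avoid'
    sent_paths[seq] = path
    return path

def on_nack(seq, paths):
    bad_path = sent_paths[seq]                    # path chosen originally
    return send(seq, paths, avoid=bad_path)       # resubmit on a different path

paths = ["spine-A", "spine-B", "spine-C"]
send(7, paths)                  # first attempt goes over spine-A
print(on_nack(7, paths))        # retry avoids spine-A -> spine-B
```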

Fig. 3: The Ultra Ethernet stack includes transport and network layers, with the transport layer being mandatory. So far, the network layer employs IP with no changes. The data-link and physical layers add new optional features. Elements in blue are required, those in green are unchanged from Ethernet, and those in beige are optional. Source: Ultra Ethernet Consortium.

New features help reduce tail latency
Four features that demonstrate Ultra Ethernet’s approach to reducing latency are out-of-order delivery, link-level retry, flow control, and packet spraying. Many AI/HPC transactions are merely moving data from one place to another, and as long as it all gets there, the order in which it arrives doesn’t matter. One can still choose in-order delivery, but it’s not required.

If some of the data didn’t arrive, it’s not necessary to resend the entire transaction. The destination endpoint can identify any missing packets, with only those being resent. In addition, if along the path an intermediate node receives a bad packet, it can immediately ask for a retry of that one packet without having to move up the stack and deal with it at the transaction level.
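The selective-resend idea can be sketched in a few lines of Python (purely illustrative): the destination tracks which sequence numbers arrived and asks only for the gaps, rather than forcing the whole transaction to be resent.

```python
# Illustrative sketch: only the missing sequence numbers are requested again.
def missing(received: set, total: int) -> list:
    return [seq for seq in range(total) if seq not in received]

received = {0, 1, 2, 4, 5, 7}       # packets 3 and 6 never arrived
print(missing(received, total=8))   # -> [3, 6]: only these are resent
```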

“Link level retry prevents a protocol further up the stack from having to determine if something needs to be retransmitted,” said Ames, pointing to the benefits of faster response at lower levels, as well as the need to resend only bad packets rather than the entire transaction.

Because link-level retry is an optional feature, early Ultra Ethernet networks won’t have it until switches are upgraded with the new link layer.

Another link-layer modification relates to flow control. “There’s a flow control mechanism down at the link level [that’s] credit-based,” said Ames.
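Credit-based flow control can be illustrated with a small Python toy (an assumption-laden sketch, not the spec): the receiver grants credits for its free buffers, and the sender transmits only while it holds credits, so packets aren’t dropped for lack of buffer space.

```python
# Illustrative sketch of credit-based link-level flow control.
class CreditLink:
    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers   # initial grant = receiver's free buffers

    def try_send(self) -> bool:
        if self.credits == 0:
            return False                  # back-pressure: wait for a credit
        self.credits -= 1                 # one receive buffer now spoken for
        return True

    def buffer_freed(self) -> None:
        self.credits += 1                 # receiver hands a credit back

link = CreditLink(receiver_buffers=2)
print(link.try_send(), link.try_send(), link.try_send())  # True True False
link.buffer_freed()
print(link.try_send())                                     # True again
```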

Finally, standard Ethernet typically picks a path for a flow or transaction and sticks with it for the duration of the transaction. If a congested or otherwise compromised path is chosen, that transaction may take a long time to arrive fully after any necessary retries. Packet spraying, an optional feature, allows the source to make a separate path decision for each packet.

Ames described it by comparing it to standard Ethernet. “If node A talks to node Q, that will go over one path, and if node A talks to node X, that might take a different path,” he explained. “That’s the way the multi-pathing works in regular Ethernet. With packet spraying, you can send the packets over different links, and the network will handle reassembly at the far end. But often this is just a data transfer, so it doesn’t matter if things arrive out of order.”
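A short Python sketch (illustrative only) captures the difference: paths are chosen per packet rather than per flow, and sequence numbers let the far end reassemble whatever arrives out of order.

```python
# Illustrative sketch of packet spraying with reassembly by sequence number.
from itertools import cycle

def spray(packets, paths):
    rr = cycle(paths)                       # round-robin path choice per packet
    return [(seq, next(rr), pkt) for seq, pkt in enumerate(packets)]

def reassemble(arrivals):                   # arrivals may be out of order
    return b"".join(pkt for _, _, pkt in sorted(arrivals))

pkts = [b"AI ", b"all-", b"reduce ", b"data"]
in_flight = spray(pkts, ["spine-A", "spine-B"])
in_flight.reverse()                         # simulate out-of-order arrival
print(reassemble(in_flight))                # -> b'AI all-reduce data'
```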

Ultimately these features provide options for moving packets faster, and with fewer or more limited retries. Some of the features (e.g., security) may add to the latency of a typical transaction, but when the system is waiting on the last packet’s arrival, tail latency is the limiter rather than nominal latency. Yes, each transaction may arrive a little more slowly, but everyone can get going sooner thanks to the earlier arrival of that last packet.

Similar timing to UALink
Ultra Ethernet’s 1.0 spec is imminent. “We’re looking at an April or May release,” said Metz. “It’s going to be open for everybody to download.” Once released, endpoints can be created quickly, whereas the switches along a route may take longer to upgrade.

“It winds up being quicker to do ASICs for endpoints than it does for switches,” said Metz. “Generally speaking, switching ASICs are not single-purpose, and the development cycle is considerably longer than that of endpoints. They have more functional requirements than endpoints and have to undergo considerable regression testing.”

Even though the UEC is using standard Ethernet, which IEEE manages, it plans to maintain control of Ultra Ethernet on an ongoing basis rather than turning its results over to IEEE to handle. “UEC is a standards organization,” explained Metz. “We do have a relationship with IEEE to work with them and share information, but Ultra Ethernet is a UEC protocol.”

The challenge there is that IEEE could, say, make some changes to its link layer after Ultra Ethernet 1.0 is locked down. Now the IEEE and Ultra Ethernet versions of the link layer are different, and they could remain different. The organization is aware of this challenge and is approaching it by staying in communication with organizations that have a relationship with Ethernet.

“We’re working with IEEE, OCP, OIF, SNIA, the Ethernet Alliance, and the UALink Consortium, and we’re all working together to make sure this kind of forking doesn’t happen,” said Metz. The UALink Consortium confirmed that they’re working similarly.

In fact, one aspect is already in play — preparing for a 400 Gbps PHY, expected maybe in the 2028/9 timeframe. That may seem distant, but discussions already are underway to coordinate efforts across any organizations that will rely on that PHY. Ultimately, the goal is one uniform set of basic Ethernet features that all the derivatives can build on.

Conclusion
It’s unclear whether HPC on its own could justify the kind of effort going into these new protocols, but AI is everywhere, and it’s acting more as the killer app than HPC. HPC can certainly ride the coattails, even though the specifics of the transactions being sent may differ from AI’s. And even AI will have different styles of transaction at different times. It’s for this reason that the various options exist, with Ultra Ethernet allowing senders to pick the semantics that best suit a given transaction.

Interestingly, both of these efforts are coming due at approximately the same time, even though there was no coordination between the two organizations. Given spec availability in the first half of 2025, there’s likely to be a review period during which companies evaluate the specs before adopting. Working them into silicon will then take at least another year, so these protocols are likely to start showing up in data centers in the late 2026 timeframe.

Related Reading
Architecting Chips For High-Performance Computing
Data center IC designs are evolving, based on workloads, but making the tradeoffs for those workloads is not always straightforward.
The Secret Life Of Accelerators
Unique machine learning algorithms, diminished benefits from scaling, and a need for more granularity are creating a boom for accelerators.
Mass Customization For AI Inference
The number of approaches to process AI inferencing is widening to deal with unique applications and larger and more complex models.


