As data center infrastructures adapt to evolving workloads, parts of Ethernet can be found in scale-up approaches.
Artificial intelligence (AI) workloads are very different from those traditionally run inside of data centers, and while the current infrastructure can accommodate those needs, there is a constant demand for higher performance and better power efficiency.
It can take months to train a large language model, even with a huge number of processing elements. Typically this involves commandeering multiple servers within a rack, a process known as scale-up, with the intent that all the GPUs in the project work together as if they were a single GPU. “Everything’s trying to act as one compute engine, scaling up from tens to hundreds to a thousand GPUs,” said Rob Kruger, product management director at Synopsys.
When that’s insufficient, one must look outside the rack for further resources. That’s called scale-out. Ultra Accelerator Link (UALink), a new industry standard introduced this year, addresses scale-up, while Ultra Ethernet supports scale-out. But Ethernet has begun to compete for scale-up, as well. A new effort, called ESUN (for Ethernet Scale-Up Network), parallel to UALink, has been initiated by the Open Compute Project (OCP) based on contributions from Broadcom. It leverages more of Ethernet than UALink does, and although it’s not yet complete, it resembles Ultra Ethernet in many of the expected modifications to standard Ethernet. There’s also speculation that ESUN and Ultra Ethernet ultimately could merge.
“A lot of people are trying to take other industry interfaces and make them applicable to scale-up,” noted Peter Onufryk, president of the UALink Consortium.
These protocols have largely been inspired by Nvidia’s proprietary NVLink, creating three options, two of which are open. “You have UALink, you have ESUN, and you have what Nvidia is doing,” Kruger said.
UALink is the quasi-incumbent
Although it’s a new protocol, UALink is the first open protocol dedicated to AI scale-up. As a blank-sheet protocol, it’s generally described as clean and simple, which also lends it some performance benefits. UALink switches also can be simpler than typical network ones. “It’s much more like a PCI Express switch than it is a more complex switch, like something from Ethernet,” said Onufryk.
UALink 200 is based on the 200 Gbps Ethernet physical layer (PHY). The consortium recently released a version that can ride atop the PCIe Gen7 PHY at 128 Gbps.
For reliability, lane redundancy can be managed by using two-lane links, which is optional rather than mandatory. “Two lanes per link is probably going to be the most used setup,” said Kurtis Bowman, chairman of the UALink Consortium. “If you lose a link, you drop off in bandwidth. That could be routed to another port, but you could also use the 200 gigabit per second you’d have left for that link.”
A typical network setup also promotes reliability, with every accelerator connected to every switch. “The reason that you do that is to ensure that you have a very high bandwidth, a very low latency path, and essentially no congestion,” said Bowman. “Even if you had an entire switch fail, you’d be able to take advantage of it.”
Some of UALink’s performance gains come from how it deals with packets. A typical network operation would take each transaction, wrap it as a packet, and send it through a switch to its designated destination. All that packetizing effort takes time, and with AI, these are typically small transactions.
Instead, UALink forms a 640-byte payload from individual 64-byte transactions. Each transaction includes a source ID and a destination ID, but lacks the long headers that would otherwise be necessary. Once the packet is full, it’s sent out to a switch, which pulls apart the transactions and sends them to their destinations.
Packets and flits appear to have fixed, predictable sizes, although the terminology can be confusing. “UALink has 64-byte flits at the transaction layer and 640-byte flits at the data link layer,” said Bowman. “The transaction layer also supports half flits for short requests and responses. Full and half flits are the only sizes supported by the transaction layer.”
That means the first transaction in the packet will have slightly longer latency than the last transaction, but the consortium says that average latency is lower than what Ethernet can provide by several hundred nanoseconds, although its data didn’t consider Broadcom’s lower latencies with ESUN.
“What you care about is your average latency, and your average latency would have been higher [had transactions been packetized one at a time],” said Onufryk.
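The packing scheme described above can be sketched in a few lines of Python. This is a toy model, not the UALink specification: the `Transaction` class, the `pack` and `switch_unpack` helpers, the flushing of a partial payload, and the assumption that a 640-byte payload holds exactly ten 64-byte transactions with no extra framing are all illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List

TX_BYTES = 64        # transaction size cited in the article
PAYLOAD_BYTES = 640  # payload size cited in the article
# Assumption: no per-payload framing, so exactly ten transactions fit.
TX_PER_PAYLOAD = PAYLOAD_BYTES // TX_BYTES


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction: short IDs instead of long network headers."""
    src: int     # source ID
    dst: int     # destination ID
    data: bytes  # body of the transaction in this toy model


def pack(transactions: Iterable[Transaction]) -> Iterator[List[Transaction]]:
    """Group transactions into payloads of TX_PER_PAYLOAD entries.

    Assumption: a partially filled payload is flushed when input runs out.
    """
    batch: List[Transaction] = []
    for tx in transactions:
        batch.append(tx)
        if len(batch) == TX_PER_PAYLOAD:
            yield batch
            batch = []
    if batch:
        yield batch


def switch_unpack(payload: List[Transaction]) -> Dict[int, List[Transaction]]:
    """Model the switch: pull a payload apart and group it by destination ID."""
    by_dst: Dict[int, List[Transaction]] = {}
    for tx in payload:
        by_dst.setdefault(tx.dst, []).append(tx)
    return by_dst


if __name__ == "__main__":
    txs = [Transaction(src=0, dst=i % 3, data=bytes(TX_BYTES)) for i in range(25)]
    for payload in pack(txs):
        routed = switch_unpack(payload)
        print(len(payload), {dst: len(group) for dst, group in routed.items()})
```

In this model, the first transaction in a full payload waits for nine more to arrive before the payload leaves, while the tenth leaves immediately. That is the per-transaction latency spread the consortium trades for a lower average.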
The UALink Consortium also released a roadmap with three new upcoming features. The first brings in-network collectives. The second will introduce a management spec. And the third provides separation between core and I/O chiplets, letting the core handle the upper layers while choosing which PHY to employ according to which I/O chiplet is used.
Priyank Shukla, director of product management, interface IP at Synopsys, explained that third feature. “UALink was accelerator-to-accelerator interconnect,” he said. “But then a lot of accelerators started having I/O chiplets, which are separate from the core chiplet. So UALink has gone one step ahead and drafted how the standard should be implemented between I/O die and core die.”
“We have people who build with switches or other devices,” said Onufryk. “They may want to support 128G and 200G. This allows them to build a common SoC and then swap out the I/O chiplet for the different physical layers.”
These features will play out over the next year. “In-network collectives (INC) and the management spec will be released to members in 4Q25, and on our public website in 1Q26,” said Bowman. “The chiplet spec is targeted for 3Q26.”
Ethernet everywhere
In its 200-Gbps version, UALink relies on the Ethernet PHY only. Ultra Ethernet adopts more of the stack, which occupies layers 1 and 2 of the OSI networking model. This keeps the physical coding sublayer (PCS) but changes some features above to minimize tail latency and jitter. “It’s about low latency and no jitter, so you can’t drop packets anymore because then all the other GPUs have to wait,” said Kruger.
But at the recent OCP Summit, the OCP announced the new ESUN (Ethernet Scale-Up Network) project.
“The goal is not to develop new standards,” said Hasan Siraj, vice president of software ecosystem at Broadcom. “What will come out is, ‘If you are trying to deploy this, here are the guidelines. Here is the profile that you should be using.’”
Broadcom, as a vendor deeply involved with Ethernet, has contributed technology to the OCP, including for ESUN. But there’s also a prior piece called SUE-T, or Scale-Up Ethernet – Transport. With the announcement of ESUN came immediate questions as to its relationship to SUE, especially since the names are the same, minus some word-scrambling.
In fact, the two efforts address different problems. SUE-T is a host-side protocol enabling reliable transport. ESUN handles the lower networking layers and would be implemented in accelerators and switches. The two are complementary.
“OCP transport, which is layer four, is already defined by SUE,” said Shukla. “Now layer two and layer three will be modified with ESUN so that they go well with this SUE transport.”
The three major modifications to vanilla Ethernet are link-layer retry (LLR), credit-based flow control (CBFC), and optimized headers. These features are common across the protocols, fueling some speculation about whether protocols may eventually merge. The purpose of the changes is to reduce average latency and jitter, which is effectively latency variation.
“The Ultra Ethernet MAC layer has link-layer retry and credit-based flow control,” said Arif Khan, senior product marketing group director for design IP in Cadence’s Silicon Solutions Group. “Those are the primary innovations in the L1 and L2 stack of the networking piece [referring to layers 1 and 2 of the OSI networking model]. That’s common for scale-out using Ultra Ethernet and scale-up Ethernet.”

Fig. 1: Comparison of Ethernet-based scaling protocols. All leverage the Ethernet PHY (not counting the new PCIe PHY UALink supports in addition). Modifications to layer 2 above the PCS include LLR, CBFC, and header changes — features UALink also includes. Source: Bryon Moyer/Semiconductor Engineering
Technically, ESUN is still a work in progress. But based on its goals, it’s already made progress toward a first version. “We might get a peek at it by the end of this week, but it won’t be public,” noted Khan at the end of October.
Link-layer retry
Ethernet is a link protocol, not a network protocol. Full networks rely on layer 3 (typically IP) to define their behavior. Layer 4 is for transport, dealing with such issues as reliability and security. Ethernet defines what happens at a switch or router when a packet comes in for forwarding on yet another hop. Ideally, it does not know the starting or ending point.
Ethernet is also a best-effort protocol, meaning that if some node along the route overflows and can’t handle the traffic, it will drop packets. Those packets will never arrive at the endpoint. Eventually, the sender will figure that out and have to resend. But depending on the flow-control mechanism, that may not happen until the endpoint decides it’s missed something, or some time window for acknowledgment passes without an indication that the endpoint received the packet.
In a conventional stack, this end-to-end retry is a transport function, classically associated with layer 4. These scaling standards typically omit layer 3, focusing on layers 1 and 2 while adding their own reliable transport.
When a packet is dropped at an intermediate node, that node knows it and can immediately ask for a resend without waiting for the endpoint to figure it out. Standard Ethernet doesn’t allow that capability, so it’s being added in these variants as LLR.
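The hop-level retry idea can be illustrated with a small Python sketch. It is not taken from any of the specifications: the `LlrSender` and `LlrReceiver` classes are hypothetical, and the use of sequence numbers, cumulative acknowledgments, and go-back-N replay is an assumption chosen for simplicity.

```python
from collections import OrderedDict
from typing import Any, List, Tuple


class LlrSender:
    """Hypothetical link sender that keeps frames until the next hop acks them."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.replay: "OrderedDict[int, Any]" = OrderedDict()  # seq -> frame

    def send(self, frame: Any) -> Tuple[int, Any]:
        seq = self.next_seq
        self.replay[seq] = frame  # hold a copy in case the link loses it
        self.next_seq += 1
        return seq, frame

    def on_ack(self, seq: int) -> None:
        # Cumulative ack: everything up to and including seq has arrived.
        for s in [s for s in self.replay if s <= seq]:
            del self.replay[s]

    def on_nack(self, seq: int) -> List[Tuple[int, Any]]:
        # Assumed go-back-N policy: resend from the missing frame onward.
        return [(s, f) for s, f in self.replay.items() if s >= seq]


class LlrReceiver:
    """Hypothetical next hop that detects a gap and asks for a resend at once."""

    def __init__(self) -> None:
        self.expected = 0
        self.delivered: List[Any] = []

    def receive(self, seq: int, frame: Any) -> Tuple[str, int]:
        if seq == self.expected:
            self.delivered.append(frame)
            self.expected += 1
            return ("ack", seq)
        if seq < self.expected:
            return ("ack", self.expected - 1)  # duplicate, already delivered
        return ("nack", self.expected)         # gap: request the missing frame


if __name__ == "__main__":
    tx, rx = LlrSender(), LlrReceiver()
    frames = [tx.send(name) for name in ("a", "b", "c")]
    tx.on_ack(rx.receive(*frames[0])[1])
    kind, missing = rx.receive(*frames[2])   # frame "b" was lost on the link
    for seq, frame in tx.on_nack(missing):
        tx.on_ack(rx.receive(seq, frame)[1])
    print(rx.delivered)
```

The point of the sketch is where recovery happens: the gap is noticed and repaired on the single link where the loss occurred, rather than after an end-to-end timeout.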
Credit-based flow control
Flow control attempts to balance traffic and loads so that no switch or server has to drop a packet. It can’t be guaranteed, but it can help. There are, however, various ways of implementing flow control, and they have an impact on latency.
Without flow control, one node will send a packet to the next one without really knowing whether the recipient has room for it. One way to determine that is to send it and wait for acknowledgment. If the sender doesn’t get one, it can assume the packet buffer overflowed or something else happened, and resend. This is similar to the scenario for LLR, and it takes time.
Another approach is for the two ends of a link to negotiate a data rate that they can both handle. This works fine for a steady data stream coming at a mostly predictable pace, but that’s not how AI traffic works. AI traffic is bursty and uneven. Were such a rate negotiation to be employed, the busy times might cause a cycle of renegotiation, only to renegotiate down when things slowed, and then repeat. That’s hugely inefficient.
Credit-based flow control takes a different tack. The recipient will provide the sender with an allocation of buffer space. The sender then can decide what to send, doing so only if it knows there is room. That makes the process more predictable and reduces the number of necessary retries.
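A minimal Python sketch shows the credit mechanism in isolation. The `CreditedLink` class and its method names are hypothetical, and it collapses sender and receiver into one object for brevity; real implementations exchange credit updates over the link.

```python
from collections import deque
from typing import Any, Deque, List


class CreditedLink:
    """Toy model of one link with credit-based flow control."""

    def __init__(self, buffer_slots: int) -> None:
        self.credits = buffer_slots          # sender's view of free receiver slots
        self.rx_buffer: Deque[Any] = deque()  # receiver's buffer
        self.pending: Deque[Any] = deque()    # packets waiting at the sender

    def submit(self, packet: Any) -> None:
        """Queue a packet at the sender and transmit if credit allows."""
        self.pending.append(packet)
        self._pump()

    def _pump(self) -> None:
        # Send only while credit remains, so the receiver never overflows.
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.rx_buffer.append(self.pending.popleft())

    def drain(self, count: int = 1) -> List[Any]:
        """Receiver consumes packets and returns one credit per freed slot."""
        drained: List[Any] = []
        for _ in range(min(count, len(self.rx_buffer))):
            drained.append(self.rx_buffer.popleft())
            self.credits += 1
        self._pump()
        return drained


if __name__ == "__main__":
    link = CreditedLink(buffer_slots=2)
    for i in range(5):
        link.submit(i)
    print(list(link.rx_buffer), list(link.pending), link.credits)
    print(link.drain(1), list(link.rx_buffer), list(link.pending))
```

A burst of five packets into a two-slot buffer leaves three waiting at the sender instead of being dropped and retried, which is the behavior that suits bursty AI traffic.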
Optimizing headers
Packets are encased in headers and potentially footers that provide metadata about the packet, such as destinations and features. Ethernet is a broad protocol addressing an incredible variety of systems. Therefore, it’s not optimized for any of them and must accommodate them all.
But scale-up is a limited application. Essentially, there are a number of processors (probably GPUs today), each of which is connected to some amount of memory. Most common lately has been for processors to include four ports for four HBM3E stacks. With HBM4, we may start seeing eight ports per processor.
Each of these HBM stacks will provide data to its assigned processor, but with scale-up, it also provides data to other processors. As a result, some of a processor’s memory bandwidth will be consumed not by getting its own data, but by forwarding data on to someone else.
The idea of “memory semantics” is that each processor sees the entire range of memory with the same address space. Scaling memory semantics is tough, but scale-up will likely move from dealing with 100 or so processors to 1,000 or more. This means addressing must be able to accommodate the amount of memory connected to all those processors.
The good news is that this is a simpler addressing scheme than what’s used for standard networking. Packet headers intended for crossing the globe to answer an internet query are overkill for scale-up.
For this reason, Ethernet-based approaches to scaling are trying to simplify the headers. “You can optimize headers since you don’t need these very large addressing schemes,” said Broadcom’s Siraj. “Even in UEC, there is this concept of an AFH [AI fabric header].”
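Some quick arithmetic shows why smaller headers are attractive. The Python sketch below is only a rough comparison: the helper names are hypothetical, the Ethernet/IPv4/UDP stack is used as a familiar reference point rather than as what any of these protocols carry, and the figures ignore the preamble, frame check sequence, and inter-frame gap.

```python
ETHERNET_HEADER_BYTES = 14  # destination MAC (6) + source MAC (6) + EtherType (2)
IPV4_HEADER_BYTES = 20      # minimum IPv4 header, no options
UDP_HEADER_BYTES = 8
MAC_ADDRESS_BITS = 48


def id_bits(endpoints: int) -> int:
    """Smallest number of bits that gives every endpoint a unique ID."""
    if endpoints < 1:
        raise ValueError("need at least one endpoint")
    return max(1, (endpoints - 1).bit_length())


def payload_fraction(payload_bytes: int, header_bytes: int) -> float:
    """Share of the bytes on the wire that are payload rather than header."""
    return payload_bytes / (payload_bytes + header_bytes)


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, "endpoints need", id_bits(n), "ID bits vs", MAC_ADDRESS_BITS)
    stack = ETHERNET_HEADER_BYTES + IPV4_HEADER_BYTES + UDP_HEADER_BYTES
    print("64-byte payload fraction:", round(payload_fraction(64, stack), 3))
```

Under these assumptions, 1,000 processors can be identified with a 10-bit ID, compared with a 48-bit MAC address, and a 64-byte transaction wrapped in 42 bytes of conventional headers spends roughly 40% of its bytes on headers.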
Hardware architectures evolving
Networking folks are not the only ones paying attention to data-center workload scaling. Startup Upscale AI has a new packet-processing architecture it describes as being drawn on a clean sheet. The details aren’t public yet, but the company is not defining a new protocol. Instead, it’s defining a new hardware architecture that it expects to excel at handling any of these scaling protocols.
The adaptability they tout contrasts with more traditional fast-path architectures. “If you look at the traditional Ethernet switches in the data centers, they are strictly wire-speed and hard-coded,” said Aravind Srikumar, vice president of product at Upscale AI. “You can skip a complete block and go to another one, but if there is even a slight variation in the packet, the Ethernet switch will drop it.”
The goal is to provide higher flow adaptability without burdening performance. The company hasn’t made the architecture public yet; that’s expected early next year, while product announcements should occur late next year.
Baya Systems also announced a scale-up switch fabric earlier this year. Rather than focusing on die-to-die interconnect, it provides a network-on-chip fabric and protocol.
“For 100 times the port count and 100 times the bandwidth, you can’t do that with a single or multi-chip crossbar solution because traditional crossbars don’t scale that way,” explained Nandan Nayampally, chief commercial officer at Baya Systems. “Our goal is to get 100× the bandwidth while keeping the latency extremely low.”
The company used to describe it as 3.5D, although it has backed off on that because the term caused confusion. “We have the standard 2D matrix,” said Nayampally. “We’re also using a Z router, as we call it, a third dimension, to route around that. We also have an offset matrix, so you could create clusters that you can jump across, over, and above the three dimensions that we’re describing. That was the half-D.”
Baya’s protocol is part of what makes the fabric flexible. “The base transport is a message-passing one, built to be extremely flexible from a topology perspective, i.e., not locked to tree or mesh, etc. This enables custom protocols to be developed for AI acceleration, and other protocols such as AXI and CHI can be built on top very efficiently.”
Winners, losers unclear
There’s not much in the way of competition for scale-out (except for proprietary schemes), but scale-up now has at least ESUN and UALink. With UALink, the good news is that it was designed without the burden of backwards compatibility, it’s simple, and it has demonstrated good performance. “UALink has certain latency advantages,” said Khan. “The protocol is much simpler than PCIe or anything that has to deal with legacy.”
The bad news is that it requires a new switch. That means a data center upgrade would have to replace not just servers, but switches as well.
With the Ethernet-based protocols, performance is best with an Ethernet switch that understands the new layer-2 features, but it will still work with a legacy Ethernet switch. The performance with an older switch may be slower due to, for instance, the lack of LLR, but upgrading a server doesn’t require upgrading the switch at the same time.
This is likely to play into the battle between the two approaches. “You’ve got to remember the things that make a protocol succeed,” said Khan. “Is the protocol simple, easy to adopt? Will there be software applications for it? But the bigger question is, if it relies on a third piece of equipment [the switch] to change, when is that available, and how easy is it to make that upgrade happen? Take the analogy of CXL. The CXL protocol evolved rapidly in the last four years. You don’t see many data centers fully upgraded with CXL switches because it’s a huge uplift to try and get so much equipment changed at one time. What scale-up Ethernet provides is a degree of backwards compatibility with standard Ethernet. Your performance will fall, but you can have these endpoints fall back to a less optimal protocol until the switches get upgraded.”
Ethernet’s latency so far appears longer than that provided by UALink, but both deliver latency of less than 500 ns — even between 100 and 300 ns. It could be that the difference doesn’t matter much at the higher system level, but it’s unclear how that will play out.
In addition, rather than this being a one-or-the-other situation, some suggest that different workloads may benefit from different protocols. “Different models can be optimized using different protocols, each offering innovative opportunities for machine learning model designers,” explained Kruger. “It’s not clear how that would play out. Would different data centers or clusters specialize in one protocol or another? Would all switches need to support both so that, when switching workloads, both are available? And how would a developer know which one to use? All these options are actively being explored.”
What is clear is that each cluster can provide differentiated performance for one kind of model and workload. “The models have evolved so much that the type of cluster we would need in 2027 needs all these options. Maybe machine learning engineers can come up with one unified way, but they need flexibility at this time because they want to experiment a lot with data parallelism,” Shukla said.
Others question whether the time is right for standards. “What we see is that some companies always go forward with proprietary formats,” said Andy Heinig, group leader, advanced system integration, department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “It seems some of the standards are not developed fast enough, or it doesn’t allow companies to add enough unique selling points.”
But they also help protect against lock-in by companies with proprietary protocols. “Once you have a proprietary protocol, the customer has to buy your chips,” Heinig said. “The very big companies, like AWS and Google, have their own teams doing hardware development. And my assumption is, if they see this lock-in effect too much, then they will design something on their own.”
These questions will require industry experience to answer. We should have a better sense of it in the next couple of years.