For reaching farther into another data center, developers are now talking about scale-across.
Key Takeaways:
As today’s data center workloads — especially for AI and HPC — outgrow the physical, power, and architectural limits of a single rack or single data center, two types of data center scaling approaches are gaining prominence. Scale-up is generally limited to within a rack, while scale-out extends beyond the rack within a data center. If more uncommitted resources are needed than are available in the data center, scale-across is the new buzzword for reaching farther into another data center.
Scale-up focuses on minimizing latency, while scale-out focuses more on jitter. Scale-across has challenges resembling those of scale-out, but it gets a new name because at least some companies address jitter differently over longer distances.
These concepts apply primarily to AI and HPC (high-performance computing), the workloads that require enormous quantities of computing and memory to be mustered. HPC has had this issue for a while, whereas AI, in both training and inference, has slammed up against it at its breathtaking pace of evolution.
“HPC is almost an identical match in terms of technical requirements to AI scale-up, and to some extent, AI scale-out,” said Robin Grindley, product line manager, core switching group at Broadcom.
Some think of scale-up as a north/south network, with scale-out being an east/west network. Scale-across can be thought of as a longer-distance scale-out, so it would also be east/west. “The east/west and north/south networks in the data center are very different,” noted Arif Khan, senior product marketing group director for design IP in Cadence’s Silicon Solutions Group.
That said, all three use cases do the same thing — at least at first glance. They rope in resources from elsewhere. So why do we need three different categories for such similar functions? The answer lies in the details.
Start with scale-up
Scale-up attempts to aggregate compute resources (GPUs, for the most part) to look like one big processor rather than a collection of smaller processors. “The whole idea of this machine learning model is to create a cluster of compute facilities,” explained Priyank Shukla, director of product management, interface IP at Synopsys. “It could be accelerators, it could be GPUs, and you throw a lot of data at this compute facility, and you get a trained model out of it.”
Scale-up is characterized by four primary attributes.
“Latency is key here,” said Gilad Shainer, senior vice president of networking at Nvidia. “High message rate is also important. It’s essentially doing load/store operations, in-network computing, and different levels of reduction on the computation results. You need something that supports massive bandwidth compared to a scale-out infrastructure — 10X the bandwidth. You’re moving data between GPUs, and you want them completely synchronized to become one unit.”
Memory semantics effectively allow direct access to local memory, which provides the lowest latency. Jitter will still be present, just as it is in any cached system. If a requested piece of data is already in cache, latency will be far lower than if it’s fetched from HBM or DRAM. But any extra delays incurred on a trip to the memory instead of the cache would still be far smaller than what scale-out might experience.
“Memory semantics means the memory space one accelerator sees is accessible through other accelerators,” said Shukla. “There is memory coherency, meaning whatever an accelerator sees, other accelerators see the same thing. Different cluster architectures are used to train different models, and in some of these clusters, you compromise on power and area to get this kind of memory efficiency.”
Minimizing hops
Using an interconnect fabric to bring the GPUs together may add delays, depending on how data is routed from a memory to the processor requesting it. Early versions need only one hop to get data, but some are looking at multi-tier switches that may require more than one hop. In that case, jitter rises a little, but in the grand scheme of things, it remains very low.
“UALink 1.0 is defined as a single-hop protocol, and there are discussions within the consortium that a single hop may not be good enough,” said Cadence’s Khan. “You may need to factor in multi-layer switch topologies, which you can populate with static routes if you know what your data-center topology is for scale-up and how the jobs are partitioned.”
The orchestration of the appropriate resources also happens at program-load time. The fact that it’s a static configuration eliminates additional delays that might result if resources were acquired in real-time.
It’s also a domain that is still resisting the temptation to move to optical interconnect. For shorter distances, copper can be a lower-power technology because the drivers must move data only as far as the height of the rack (or half the rack if using a middle switch). Optical requires power simply to generate a beam before the data travels anywhere, making short connections inefficient compared to copper. Longer copper lines would need stronger drivers, and that’s where optical becomes competitive.
“We’re trying to utilize copper as much as we can in that area,” said Shainer. “Copper is most cost-effective, it’s highly reliable, and it consumes zero power.”
We’ve seen at least two standards for scale-up (UALink and a new variant of Ethernet). The good news is that the physical layer is the same; it’s the protocol that differs.
“Regardless of the technology or the protocol, it’s a 224G serdes (if you ignore the UALink-128 that AMD is trying to do as a side project), but everything else is based on the Ethernet specification,” Khan noted. “That helps, because now your physical medium is the same regardless of the protocol. What varies is the protocol stack above it.”
“The protocols are all trying to accomplish the same thing, having a slightly different tack on features and the underlying details,” said Rob Kruger, product management director at Synopsys. “Different customers will pick one or the other for some reason, whether it be legacy or whether it be some inherent feature of that protocol that they think is valuable to them.”
Scale-out for more resources
Gordon Allan, director of verification IP products at Siemens EDA, provided an analogy for scale-up vs. scale-out: “You’ve got a human with a brain. They’re intelligent. If you need more intelligence, you can either hire a human with a bigger, better brain, or you can hire 10 people, put them in a room, and have them collectively be intelligent. The interface protocol within the brain is scale-up. The interface protocol from person to person is scale-out.”
Scale-out brings data from a different rack. It has the following characteristics:
This is more of a networking paradigm. Whereas scale-up involves addressing memory, scale-out means sending packets. Clearly, having low latency is good here, too, but the concern is more about jitter.
Specifically, if a packet is dropped, then it must be resent. If a computation has many entities awaiting data so that they can proceed in synchrony, then resending that one packet will force everyone else to wait. It’s not good to have expensive hardware sitting around waiting.
Protocols, therefore, must be lossless. Best effort isn’t good enough.
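The cost of a single resend can be illustrated with a minimal sketch. The worker count and timings below are hypothetical, chosen only to show that a synchronized step finishes when the slowest participant has its data, so one late packet stalls everyone.

```python
def step_time(arrival_times_us):
    """A synchronized step completes only when the last worker has its data."""
    return max(arrival_times_us)


def idle_time(arrival_times_us):
    """Total time the workers spend waiting for the slowest one."""
    finish = step_time(arrival_times_us)
    return sum(finish - t for t in arrival_times_us)


# Hypothetical numbers: eight workers whose data normally arrives in 10 us.
clean = [10.0] * 8
# One packet is dropped and resent, costing that worker an assumed extra 50 us.
with_resend = [10.0] * 7 + [60.0]

print(step_time(clean), idle_time(clean))
print(step_time(with_resend), idle_time(with_resend))
```

In this toy case, one delayed packet stretches the step from 10 to 60 microseconds and leaves the other seven workers idle for 50 microseconds each.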
Different clusters can share data from their respective memories, but they’re different memory spaces. “Based on the size of the model, some data can be worked in one cluster and some in another,” said Shukla. “They can share data — not memory, but pointers and tensors.”
Here, there are new machine-learning models. “We are not talking only about transformers,” Shukla said. “We are also talking about mixture-of-experts and agentic models. These models create one cluster for one model and have another cluster that runs another version of that model. And they need data to be transmitted between these two, and you can’t connect all of them with single-hop switches.”
Unlike scale-up, where the configuration is established at startup, scale-out resources are typically added dynamically. Some configurations may require coherence between racks, but not usually.
“Accelerators split work in parallel, and then they accumulate results together,” Shukla noted. “So you don’t need full memory coherency for most of the models being trained today.”
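The split-then-accumulate pattern Shukla describes can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the "work" is a trivial sum of squares, and real systems perform the accumulation with collective operations over the network rather than in a single Python process.

```python
def split(data, n_workers):
    """Divide a list into n_workers nearly equal contiguous chunks."""
    size, extra = divmod(len(data), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def local_work(chunk):
    """Stand-in for one accelerator computing on its own private memory."""
    return sum(x * x for x in chunk)


def accumulate(partials):
    """Stand-in for the reduction that combines the partial results."""
    return sum(partials)


data = list(range(10))
partials = [local_work(chunk) for chunk in split(data, 4)]
total = accumulate(partials)
print(partials, total)
```

Each worker touches only its own chunk, and only the small partial results cross between workers, which is why this pattern does not depend on a shared, coherent memory space.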
Different scale-out options agree on one thing
Ethernet dominates here, even though Nvidia initially focused on InfiniBand for scale-out. But the company now supports Ethernet, as well. “The reason we created an Ethernet version is we saw the need for AI to be used everywhere, including cases where people are not familiar with InfiniBand and have invested a lot in Ethernet,” explained Shainer. “For them, it would be easier to continue to use Ethernet. The problem was that there was no Ethernet designed for distributed computing workloads, so we built an Ethernet version for AI.”
The goal, then, is to manage flows and congestion to keep any one node or cluster from becoming a bottleneck.
“Super-low latency probably isn’t going to matter much, because you’re going through three or five hops,” said Grindley. “There’s a much bigger chance of queuing delays and traffic congestion across the bigger network. It’s going to be more about congestion control and traffic management. In a scale-up domain, it’s highly managed, orchestrated, and scheduled so that all the accelerators can directly talk to each other and go as fast as they possibly can. They’re not really worried about traffic management.”
Portions of the network with high congestion — also known as hot spots — must be avoided. “The only way for you to eliminate hot spots is to identify the case where there are many senders and a single receiver,” Shainer explained. “We reduce the amount of traffic those senders actually send to that receiver to what the receiver can handle. For example, if I have four senders and one receiver, I’d like those four senders to utilize 25% of the available bandwidth.”
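Shainer's four-senders example amounts to simple arithmetic, sketched below. The function names and the 800 Gbps receiver rate are assumptions for illustration. This is not a description of how any particular product computes its rates.

```python
def sender_share(n_senders):
    """Fraction of the receiver's bandwidth each sender may use."""
    if n_senders < 1:
        raise ValueError("need at least one sender")
    return 1.0 / n_senders


def throttled_rates_gbps(receiver_gbps, n_senders):
    """Per-sender rates that together match what the receiver can absorb."""
    share = sender_share(n_senders)
    return [receiver_gbps * share for _ in range(n_senders)]


# Hypothetical receiver that can absorb 800 Gbps, with four senders.
rates = throttled_rates_gbps(800.0, 4)
print(sender_share(4))    # 0.25
print(rates, sum(rates))  # four rates of 200.0, totaling 800.0
```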
Ethernet and other protocols have flow-control mechanisms to assist with this, but they require analyzing the packet headers to get warnings that things are congested, and that can take too long. “We have created a telemetry probe,” Shainer explained. “It’s a small element that runs on the wire between senders and receivers and is sent every once in a while. It gives you an indication of the latency between the sender and the receiver. If the latency measured is the latency of the wire, then I know that the path is clear. If the latency starts to increase, I know that congestion is building, and it’s much quicker than a switch marking packets.”
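The decision rule in that description can be sketched as a comparison between the measured probe latency and the known wire latency. Everything below is a hypothetical illustration: the function name, the 10% tolerance, and the sample latencies are assumptions, not details of Nvidia's implementation.

```python
def congestion_level(probe_latency_us, wire_latency_us, tolerance=0.10):
    """Return 0.0 if the path looks clear, else the fractional excess latency.

    A probe that takes roughly the bare wire latency met no queues on the way.
    Anything beyond the (assumed) tolerance suggests queues are building.
    """
    excess = (probe_latency_us - wire_latency_us) / wire_latency_us
    return 0.0 if excess <= tolerance else excess


WIRE_LATENCY_US = 2.0  # hypothetical unloaded latency of this path

for measured in (2.0, 2.1, 3.0, 5.0):
    print(measured, congestion_level(measured, WIRE_LATENCY_US))
```

A sender could use the returned value to decide how much to slow down, well before a switch along the path would have marked any packets.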
How data is managed en route also matters. “We wanted the switch not to worry about data ordering,” said Shainer. “In off-the-shelf Ethernet, the switch very much cares about data ordering, and that’s why you either use flowlets (which means that the switch will not change the route of a stream of data, except when the stream is over and there is enough time between one stream and another stream), or you use switch buffers to break up the packet and then put it back together. Using a flowlet means you’re not utilizing all the routes that exist. If you use buffers, then by definition they create jitter.”
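The flowlet rule can be sketched as follows: a flow keeps its path while packets arrive close together and may change paths only after a sufficiently long gap. The class name, the gap threshold, and the round-robin path choice are all assumptions made for illustration. Real switches choose paths in other ways.

```python
class FlowletRouter:
    """Toy flowlet switching: re-route a flow only after an idle gap."""

    def __init__(self, n_paths, gap_us):
        self.n_paths = n_paths
        self.gap_us = gap_us
        self.state = {}     # flow_id -> (current path, time last packet seen)
        self.next_path = 0  # round-robin pointer (an assumption)

    def route(self, flow_id, now_us):
        entry = self.state.get(flow_id)
        if entry is not None and now_us - entry[1] < self.gap_us:
            # Same flowlet: stay on the same path so packets arrive in order.
            path = entry[0]
        else:
            # New flowlet: earlier packets have drained, so a new path is safe.
            path = self.next_path
            self.next_path = (self.next_path + 1) % self.n_paths
        self.state[flow_id] = (path, now_us)
        return path


router = FlowletRouter(n_paths=4, gap_us=100)
paths = [router.route("flow-a", t) for t in (0, 10, 50, 500)]
print(paths)  # [0, 0, 0, 1]
```

The first three packets stay on one path even if the other three are idle, which is the underutilization Shainer points to.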
Instead, it’s possible to have the switches ignore the ordering and let the application direct where the data goes. “The NIC (network interface card) puts the data directly into the GPU memory,” Shainer said. “An easy way would have the NIC put the data in a side buffer, get it back in order, and then put it in the GPU memory. But that means the NIC needs to deal with double the traffic. Now the NIC can put the packets in the right place, not necessarily in order. If packet number four is received first, we’re going to put it in the fourth place.”
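Placing packets by sequence number rather than by arrival order can be sketched like this. The function name, the fixed payload size, and the use of a `bytearray` to stand in for GPU memory are assumptions for illustration only.

```python
def place_packets(arrivals, n_packets, payload_size):
    """Write each payload straight to its final offset, whatever the arrival order.

    arrivals is an iterable of (sequence_number, payload) pairs, numbered from 1.
    """
    memory = bytearray(n_packets * payload_size)  # stand-in for GPU memory
    for seq, payload in arrivals:
        if len(payload) != payload_size:
            raise ValueError("this sketch assumes fixed-size payloads")
        offset = (seq - 1) * payload_size
        memory[offset:offset + payload_size] = payload
    return bytes(memory)


# Packet four arrives first and still lands in the fourth slot.
arrivals = [(4, b"DDDD"), (1, b"AAAA"), (3, b"CCCC"), (2, b"BBBB")]
result = place_packets(arrivals, n_packets=4, payload_size=4)
print(result)  # b'AAAABBBBCCCCDDDD'
```

Each payload is written once, to its final location, so there is no side buffer and no second pass over the data.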
Optical connections, when implemented, can reduce latency. In this scenario, however, it’s racks talking to each other over fiber or copper. That’s switch-to-switch, so if it’s fiber, any co-packaged optics would be in the switches rather than the servers themselves.
Scale-up may scale out in some countries
The scale-up definition reviewed here applies to most of the world. But in some places, its definition overlaps a bit with that of scale-out. This impacts the expectation that scale-up involves only one rack.
In countries where GPUs aren’t as powerful, given geopolitical restrictions, companies may try to bring two or three racks into the scale-up cluster in order to have sufficient GPU power.
“In China, the GPUs are more performance-constrained,” said Maurice Steinman, vice president of engineering at Lightelligence. “If individual nodes have a fraction of the performance of a strong Nvidia or AMD node, the scale-up domain needs to be wider for a given unit of performance in a cluster.”
Assuming one would connect only to a rack on either side of the “main” rack, latency can still be kept low. But the networking would need to change, because going from one rack to another would require traveling through two switches, which kills the hope for one hop. But one hop isn’t essential. It’s just highly preferred to keep both latency and jitter low. It may be a necessary tradeoff to achieve the desired computing power.
Something along the same lines appears to be happening in Japan, but for a different reason. “I’ve heard anecdotally a similar thing happening in Japan, not necessarily because of the performance of each node, but the overall power budget of a rack is scaling slower there,” said Steinman. “They’re stuck on fewer kilowatts per rack, and it’s causing a similar constraint. If you want to have a cluster of X performance, you have to deploy more racks.”
Indeed, Japan is working on power regulations that should be posted any time now. “Japan’s energy efficiency legislation is being expanded to include minimum performance standards and information-reporting requirements for data centers,” wrote Peter Judge in a post on Uptime Intelligence. “…A task force led by the Ministry of Economy, Trade and Industry (METI) and the Agency for Natural Resources and Energy, with support from industry groups, including the Japan Data Center Council, is developing the regulations. The final version has not yet been published, but the law is scheduled for revision by March 31, 2026, with implementation expected in April 2026.”
Those power limits may result in insufficient computing for a desired cluster. For that reason, scale-up may here again reach out to a neighboring rack or two to bring together enough GPUs.
When a data center runs out of resources
Scale-up and scale-out have been around for some years, and we’ve looked at some of the newer protocols being employed, such as UALink for scale-up and Ultra Ethernet for scale-out. While scale-out might look the same as scale-up except for distance, we’ve seen that they operate differently.
But at some point, one runs out of resources in a data center. “I can have enough power to do 100k to 300k servers, but if I want to do a million, then I will have to have the data centers in different places, and I do need to connect them in a way that lets me run one single workload across,” said Shainer.
Dealing with that would seem to imply further scaling out to another campus. But apparently that distance matters. Scale-across works much like scale-out, but the algorithms and approaches for handling congestion change. That makes scale-out and scale-across far closer in nature than scale-out is to scale-up.
Here, Shainer provided an analogy. “If you’re driving inside the city, and you want to go quickly from one place to another, you’re going to drive very close to the car in front of you,” he said. “If the car in front of you hits the brakes, you’re going to hit the brakes too, because you have little space. So you’re going to be more aggressive in the way that you control traffic. If you go much longer distances, then there will be more of a gap [between vehicles], and therefore, if you see a car in front of you that brakes, there is enough time for you to react, so you’re going to be less aggressive and you control it differently.”
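One way to see why distance changes the control problem is to estimate how much data is already on the wire before any congestion feedback can return. The sketch below assumes roughly 5 microseconds of one-way propagation delay per kilometer of fiber (light travels at about two-thirds of its vacuum speed in glass) and a hypothetical 800 Gbps link. It is not a description of any vendor's algorithm.

```python
FIBER_US_PER_KM = 5.0  # approximate one-way delay in fiber (assumption)


def round_trip_us(distance_km):
    """Propagation-only round-trip time, ignoring switch and queuing delay."""
    return 2 * distance_km * FIBER_US_PER_KM


def bytes_in_flight(link_gbps, distance_km):
    """Data sent before the first feedback can arrive (bandwidth-delay product)."""
    rtt_s = round_trip_us(distance_km) * 1e-6
    return link_gbps * 1e9 / 8 * rtt_s


# Hypothetical 800 Gbps link within a campus versus between campuses.
for km in (1, 100):
    print(km, round_trip_us(km), bytes_in_flight(800, km))
```

Under these assumptions, a 100 km link holds about 100 MB in flight, compared with about 1 MB at 1 km, so a sender reacting to feedback is always acting on older information. That is consistent with the idea that the control loop has to be tuned differently over longer distances.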

Fig. 1. Scale-up, scale-out, scale-across. The copper-colored connections in scale-up represent copper wire. Scale-out and scale-across show blue fiber, although with scale-out, there is still a mix of copper and fiber. Source: Bryon Moyer/Semiconductor Engineering
Each data center will go its own way
Every data center — or at least every AI data center — will implement these scaling strategies, but they’ll likely do them differently.
“[When training GPT-3], Nvidia used NVLink as a scale-up protocol and InfiniBand as scale-out,” said Shukla. “At the same time, Google has always used ICI [inter-chip interconnect], which is based on PCIe, for scale-up, and they have used Ethernet for scale-out.”
It’s important to remember that these descriptions apply to networks and data centers today. The definitions aren’t necessarily fixed. We’ve already seen some blurring between scale-up and scale-out depending on the country. That may continue as data centers evolve. “The lines between scale-up, scale-across, and scale-out seem to be blurring,” cautioned Steinman.