Inference is reshaping data center architecture, introducing a new and less forgiving set of network requirements.
Recent industry trends, including the release of NVIDIA’s Rubin platform (developer.nvidia.com), point to a growing consensus that AI inference is reshaping data center architecture in a fundamental way. As inference workloads become dominant, the data center network is no longer just a communication layer between servers. It is increasingly part of a distributed memory and storage hierarchy, with direct impact on performance, efficiency, and cost.
To understand why this matters, it helps to look at how classic data centers were designed, how AI training changed those assumptions, and why inference introduces a new and less forgiving set of network requirements.

In the classic data center era, workloads were dominated by microservices and client-server interactions. Traffic patterns were largely probabilistic, bursty, and dominated by North-South flows. Memory accesses and data fetches were governed by cache miss probabilities rather than deterministic access patterns.
This probabilistic behavior shaped both on-die and network architecture. On-die fabrics such as SoC meshes became popular because they were cost-optimized for average-case behavior rather than worst-case determinism. At the network level, oversubscribed three-tier leaf/spine/core networks were widely deployed. Latency mattered, but tail-latency requirements were not stringent, and short-lived congestion events were acceptable because traffic was sparse and bursts were brief.
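As a rough illustration of what "oversubscribed" means here, the sketch below computes the ratio of server-facing capacity to uplink capacity on a single switch. The port counts and speeds are hypothetical values chosen for the example, not figures from any particular deployment.

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing capacity to uplink capacity on one switch.

    A value of 1.0 means non-blocking; anything above 1.0 means the
    uplinks cannot carry all downlinks running at line rate at once.
    """
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# Hypothetical access switch: 48 x 10G server ports, 4 x 40G uplinks.
ratio = oversubscription_ratio(48, 10, 4, 40)
print(f"{ratio:.1f}:1")  # 3.0:1
```

A 3:1 design like this works only as long as servers rarely transmit at line rate simultaneously, which is the average-case assumption described above.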
In this environment, buffering and statistical traffic distribution schemes such as ECMP worked well. Even when multiple flows converged, local congestion was typically short-lived, and link buffers absorbed small bursts of local overload, so system-level performance held up on average.
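The statistical distribution that ECMP relies on can be sketched as a hash over a flow's 5-tuple, reduced modulo the number of equal-cost paths. This is a minimal illustration only: the function name `ecmp_path` is hypothetical, and SHA-256 stands in for the vendor-specific hash functions that real switches implement in hardware.

```python
import hashlib
from collections import Counter


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int, num_paths: int) -> int:
    """Pick a path index for a flow by hashing its 5-tuple.

    SHA-256 is an assumption made for illustration; the point is only
    that the mapping is deterministic per flow and roughly uniform.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


# Many short flows (here distinguished only by source port) spread
# across 8 paths in roughly equal shares.
counts = Counter(
    ecmp_path("10.0.0.1", "10.0.1.1", port, 443, 6, 8)
    for port in range(10000, 20000)
)
for path in sorted(counts):
    print(path, counts[path])
```

With thousands of small flows, each path ends up with a similar share of the load. Every packet of a given flow follows the same path, which preserves ordering but also means a single large flow cannot be split across paths.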
Large-scale AI training fundamentally changed traffic patterns inside the data center. Training workloads introduced structured, server-server East-West communication dominated by collective operations. These collectives generated bursts of short-lived communication between GPUs that were highly synchronized and bandwidth intensive.
Training workload behavior drove changes across the stack. On-package, high-bandwidth memories such as HBM became critical to provide sufficient capacity and throughput for training algorithms. At the network level, rail-optimized fabrics emerged as a relatively cost-effective way to provide non-blocking, guaranteed bandwidth and low latency between tightly coupled GPUs in the Scale-Up domain.
An important and sometimes underappreciated point is that training fabrics rely on buffering and internal speedup to handle temporary incast, a traffic pattern in which multiple sources send line-rate traffic to the same destination and oversubscribe it. During collective phases, many GPU links may briefly target the same destinations, but the congestion is transient. Buffers absorb the bursts, traffic drains quickly, and subsequent phases shift communication patterns. Schedulers help align work, but they do not need to perfectly avoid incast for these systems to perform well.
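The way a buffer absorbs a transient incast can be sketched with a simple discrete-time queue model. The units and numbers below are hypothetical (arbitrary data units per tick), and the model ignores flow control and internal speedup; it only shows the arithmetic of a short burst being absorbed and then drained.

```python
def simulate_queue(arrivals, drain_per_tick, buffer_size):
    """Track one output queue over time.

    arrivals: data units arriving at the port in each tick.
    Returns (peak occupancy, total units dropped, final occupancy).
    """
    occupancy = 0
    peak = 0
    dropped = 0
    for arrived in arrivals:
        # Net change this tick: what arrives minus what the port drains.
        occupancy = max(0, occupancy + arrived - drain_per_tick)
        if occupancy > buffer_size:
            dropped += occupancy - buffer_size
            occupancy = buffer_size
        peak = max(peak, occupancy)
    return peak, dropped, occupancy


# Hypothetical incast: 4 senders at 1 unit/tick target a port that
# drains 1 unit/tick, for 5 ticks, followed by 20 quiet ticks.
burst = [4] * 5 + [0] * 20
print(simulate_queue(burst, drain_per_tick=1, buffer_size=16))
# (15, 0, 0): the buffer peaks below its limit, drops nothing, and empties
```

The burst overloads the port 4:1, yet nothing is lost because the overload ends before the buffer fills, and the quiet period that follows lets the queue drain completely.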
Inference workloads, especially for large language models with long context windows, introduce a very different stress pattern. Inference requires scaling memory and storage beyond what can fit on device, leading to frequent loading of KV cache state from pooled memory and storage systems.
These accesses generate sustained elephant flows. Unlike training bursts, these flows are long-lived and continuous. A single inference request may require tens of gigabytes of KV cache data to be streamed from remote DDR or flash-backed SSD tiers, and similar requests arrive continuously. The result is persistent pressure on the same network paths.
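To get a feel for the time scales involved, the sketch below compares how long one KV cache load occupies a link with how quickly a switch buffer fills under sustained overload. All figures (20 GB of KV cache, 400 Gb/s links, a 64 MB buffer) are hypothetical round numbers picked for the arithmetic, not measurements.

```python
def transfer_seconds(gigabytes: float, link_gbps: float) -> float:
    """Time to stream a payload over a link at full line rate."""
    return gigabytes * 8 / link_gbps


def buffer_fill_ms(buffer_megabytes: float, arrival_gbps: float,
                   drain_gbps: float) -> float:
    """Time until a buffer overflows when arrivals exceed the drain rate."""
    excess_gbps = arrival_gbps - drain_gbps
    if excess_gbps <= 0:
        return float("inf")  # the queue never builds up
    # MB * 8 = megabits; megabits / (Gb/s) = milliseconds.
    return buffer_megabytes * 8 / excess_gbps


# Hypothetical: one request streams 20 GB of KV cache over a 400 Gb/s link.
print(f"{transfer_seconds(20, 400):.2f} s")  # 0.40 s

# Hypothetical: two such flows converge on one 400 Gb/s port with 64 MB of buffer.
print(f"{buffer_fill_ms(64, 800, 400):.2f} ms")  # 1.28 ms
```

Under these assumptions a single flow lasts hundreds of milliseconds, while the buffer can absorb only about a millisecond of 2:1 overload, a gap of more than two orders of magnitude.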
In networks designed around short-burst training behavior or probabilistic microservice traffic, this leads to sustained congestion rather than transient congestion. Buffering no longer saves the system, because sustained flows eventually overflow any reasonably sized buffer. ECMP randomness does not help either: with only a few long-lived elephant flows carrying large KV cache loads and stores, hashing can pin several of them to the same path for the duration of the transfer while other paths sit idle.
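The ECMP problem with a small number of elephant flows can be quantified with basic probability. The sketch below assumes each flow is hashed independently and uniformly onto one of the equal-cost paths, which is an idealization of real hash behavior; the function names are hypothetical.

```python
from math import perm


def expected_idle_paths(flows: int, paths: int) -> float:
    """Expected number of paths that receive no flow at all."""
    return paths * (1 - 1 / paths) ** flows


def prob_no_collision(flows: int, paths: int) -> float:
    """Probability that every flow lands on its own path."""
    if flows > paths:
        return 0.0
    return perm(paths, flows) / paths ** flows


# Hypothetical case: 8 elephant flows hashed onto 8 equal-cost paths.
print(f"{expected_idle_paths(8, 8):.2f}")  # 2.75
print(f"{prob_no_collision(8, 8):.4f}")    # 0.0024
```

Even with exactly as many paths as flows, a perfect spread happens well under 1% of the time, and on average almost three of the eight paths carry nothing. With short flows the imbalance averages out quickly; with flows that last for the whole transfer, it persists.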
This is where a key architectural shift occurs. Storage is no longer an external service accessed occasionally through North-South paths. It becomes part of the high-performance network-based memory fabric itself, and the network must be designed to sustain parallel, long-lived throughput at scale.
At a high level, inference networks may still resemble rail-optimized designs. High bandwidth domains connect GPUs with nearby memory and storage resources, while connectivity across domains is more limited to control cost and complexity. In that sense, inference builds on lessons learned from training.
However, the performance contract inside these domains must change. Inference requires deterministic, sustained throughput rather than average throughput over time. Memory and storage endpoints must be treated similarly to on-die memory controllers in a Scale-Up system, with the fabric providing predictable access under continuous load.
High-radix switches, which became important for scaling training clusters, take on an even more critical role. By enabling flatter and wider fabrics, they reduce choke points, limit long-lived contention, and make it feasible to integrate large pools of memory and storage into a non-blocking high performance domain.
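The effect of radix on fabric depth follows from the standard sizing of a non-blocking folded-Clos (fat-tree) network, where a fabric built from radix-k switches with n tiers supports up to 2·(k/2)^n endpoints. The sketch below applies that formula, assuming an even radix, identical port speeds on every link, and full bisection bandwidth; the endpoint count in the example is hypothetical.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking folded Clos of identical switches."""
    return 2 * (radix // 2) ** tiers


def min_tiers(endpoints: int, radix: int) -> int:
    """Smallest number of switch tiers that can attach all endpoints."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


for radix in (64, 128, 512):
    print(radix, max_endpoints(radix, 2), max_endpoints(radix, 3))
# 64 2048 65536
# 128 8192 524288
# 512 131072 33554432

# Hypothetical domain of 8192 GPU, memory, and storage ports:
print(min_tiers(8192, 64), min_tiers(8192, 128))  # 3 2
```

Doubling the radix from 64 to 128 removes an entire switching tier for this domain size, which means fewer hops and fewer shared links where long-lived flows can contend.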
AI training taught the industry how to scale compute and data movement efficiently for short-burst communication patterns. AI inference is now teaching the industry how to scale memory and storage access under sustained load. As inference becomes the dominant workload, performance will increasingly be determined not by raw compute capability, but by how efficiently GPUs can access distributed memory and storage through the network.
This shift demands innovation on the network side. Higher link speeds, higher-radix switches, and flatter non-blocking fabrics are becoming essential to deliver strict performance guarantees without relying on statistical randomness, while still keeping infrastructure cost, power, and development schedules under control.
The data center network is no longer just connecting systems. It is becoming the memory fabric that defines AI performance.