Dynamically share memory, storage, and accelerators across multiple compute nodes.
As the demand for AI and machine learning accelerates, the need for faster and more flexible data interconnects has never been more critical. Traditional data center architectures face several challenges in enabling efficient and scalable infrastructure to meet the needs of emerging AI use cases.
The wide variety of AI use cases translates into different types of workloads: some require intensive compute, some need vast memory or storage resources, and many others have their own unique requirements. Servers are often over-provisioned to accommodate peak demand, leaving resources idle during less demanding workloads and creating inefficiency and unnecessary cost. With traditional architectures, communication between different peripherals (e.g., CPU to GPU or GPU to memory) must often pass through the CPU, adding latency. And when multiple compute devices are required to service a workload, data must often be replicated across those devices, further exacerbating bandwidth and storage overhead.
By leveraging PCI Express (PCIe) and Compute Express Link (CXL) to enable disaggregated compute, systems can dynamically share memory, storage, and accelerators across multiple compute nodes, increasing utilization and avoiding over-provisioning. For example, CXL-enabled memory pooling allows multiple CPUs to access a shared memory pool, thereby maximizing memory utilization.
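From the software side, a CXL-attached memory region ultimately appears as ordinary load/store-addressable memory. The following is a minimal sketch, assuming the region is exposed by Linux as a devdax character device; the /dev/dax0.0 path and the 1 GiB size are placeholders for whatever a given platform actually enumerates, and allocation or cross-host coordination of the pool is left to fabric-management software:

```c
/* Minimal sketch: map a CXL-attached memory region exposed as a Linux
 * devdax device and access it with ordinary loads and stores.
 * The device path and size are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1UL << 30;             /* assumed 1 GiB region */
    int fd = open("/dev/dax0.0", O_RDWR);     /* CXL memory as devdax */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Once mapped, the pooled memory is just addressable memory. */
    uint8_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); close(fd); return EXIT_FAILURE; }

    memset(mem, 0xA5, 4096);                  /* write into the pool  */
    printf("first byte: 0x%02x\n", mem[0]);   /* read it back         */

    munmap(mem, len);
    close(fd);
    return EXIT_SUCCESS;
}
```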
Compared to traditional Ethernet-based connectivity, which introduces latencies in the microsecond range, PCIe interconnects offer sub-microsecond latencies. For instance, PCIe Gen 5 interconnect through retimers and switches can achieve latencies on the order of a few hundred nanoseconds for device-to-device communication. Using PCIe peer-to-peer, devices like GPUs can access memory or storage directly without CPU intervention, further reducing latency across the PCIe fabric.
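To make the peer-to-peer idea concrete, the sketch below uses the CUDA runtime's peer-access API to copy a buffer directly between two GPUs; when both devices sit under a common PCIe switch or root complex with P2P enabled, the transfer traverses the PCIe fabric without staging through host memory. The device IDs 0 and 1 and the CHECK helper macro are assumptions for illustration:

```c
/* Sketch: direct GPU-to-GPU copy over PCIe using the CUDA runtime's
 * peer-access API (link against -lcudart). Devices 0 and 1 assumed. */
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do {                                      \
    cudaError_t err = (call);                                 \
    if (err != cudaSuccess) {                                 \
        fprintf(stderr, "%s\n", cudaGetErrorString(err));     \
        exit(EXIT_FAILURE);                                   \
    }                                                         \
} while (0)

int main(void)
{
    const size_t bytes = 64UL << 20;   /* 64 MiB test buffer */
    int can_access = 0;

    /* Verify the platform exposes a P2P path between the two GPUs. */
    CHECK(cudaDeviceCanAccessPeer(&can_access, 0, 1));
    if (!can_access) {
        fprintf(stderr, "No P2P path between devices 0 and 1\n");
        return EXIT_FAILURE;
    }

    void *src, *dst;
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));   /* device 0 -> device 1 */
    CHECK(cudaMalloc(&src, bytes));

    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    /* The copy moves device-to-device across the PCIe fabric,
     * without a bounce through host (CPU) memory. */
    CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));

    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    return 0;
}
```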
CXL interconnects can reduce latency further compared with traditional PCIe. The CXL.mem path offers lower latency when the CXL controller architecture is optimized for it: a typical PCIe controller roundtrip latency is 30-40 ns, and a latency-optimized CXL controller can cut this by 50% or more. With such controllers at both the root port and the endpoint of a representative disaggregated system, a rough estimate of 40 ns of latency reduction can be realized versus PCIe.
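In round numbers, that estimate follows directly: two controllers in the path (root port and endpoint), each saving about half of a ~40 ns roundtrip, gives 2 × (40 ns × 0.5) ≈ 40 ns of total latency reduction. (The 40 ns and 50% figures are taken from the ranges above for illustration.)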
The reach of PCIe signaling is also a significant constraint in traditional setups. PCIe Gen 5 and Gen 6, for instance, support trace lengths up to approximately 14 inches (~35 cm) on a PCB; extending connectivity beyond this range typically requires additional hardware such as retimers or switches, which adds complexity and cost. There is growing demand to extend PCIe and CXL beyond single servers to enable rack-to-rack or data-center-wide connectivity. Technologies like CopprLink for PCIe Gen 6 support up to 2 meters of copper cabling, and optical interconnects can stretch to 100 meters, sufficient to span entire racks or rows of a data center.
While data-centric disaggregation promises numerous benefits, several design and implementation challenges must still be addressed.
Rambus can help system and chip designers address these challenges with a broad suite of PCIe and CXL IP solutions enabling next-generation performance in data center computing.