Tackling the challenges of high-bandwidth, low-latency connectivity and efficient resource management.
By Ron Lowman and Jon Ames
AI workloads are a major driver of innovation in the interface IP market. The exponential increase in AI model parameters, doubling approximately every 4-6 months, stands in stark contrast to the slower pace of hardware advancement dictated by Moore’s Law, which follows an 18-month cycle. This gap demands hardware innovation to support AI workloads: greater compute capacity, more memory resources, and higher-bandwidth interconnects.
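To put rough numbers on that gap, here is a minimal sketch assuming a 5-month parameter-doubling period (the midpoint of the 4-6 month range above) against the 18-month hardware cycle; the three-year window is chosen only for illustration.

```python
# Minimal sketch: compare AI model growth vs. Moore's Law over three years,
# assuming a 5-month parameter-doubling period (midpoint of the article's
# 4-6 month range) and an 18-month transistor-doubling period.
months = 36
model_doubling_months = 5      # assumed midpoint of 4-6 months
hardware_doubling_months = 18  # Moore's Law cycle cited in the article

model_growth = 2 ** (months / model_doubling_months)
hardware_growth = 2 ** (months / hardware_doubling_months)

print(f"Model parameters grow ~{model_growth:,.0f}x in {months} months")
print(f"Transistor budget grows ~{hardware_growth:.0f}x in {months} months")
print(f"Gap left for architecture and interconnect to close: ~{model_growth / hardware_growth:,.0f}x")
```

Under those assumptions, model size grows roughly 150x while transistor budgets grow about 4x, leaving an order-of-magnitude gap that architecture and interconnect improvements must absorb.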
Compounding this, leading designs are running up against the standard reticle limit. Both CPU and GPU designs are pushing reticle size due to their extensive arrays of compute units and associated memory. AI accelerators and GPUs now require a new, ultra-efficient networking infrastructure to scale workloads beyond a single die, demanding low latency, high-density connections, and optimal energy efficiency in chip-to-chip communications.
This article examines why scaling up and out has become a critical need for HPC and AI chip developers, and how new standards such as Ultra Ethernet and Ultra Accelerator Link (UALink) aim to tackle the challenges of high-bandwidth, low-latency connectivity and efficient resource management.
Scaling chip-to-chip architectures up and out is essential, driven by the demands of AI workloads. Transitioning from monolithic dies to multiple dies and leveraging parallel interfaces such as HBM and UCIe has become necessary. These solutions support both homogeneous and heterogeneous compute architectures, with PCIe and CXL providing traditional connections for memory expansion and Ethernet providing the broader network fabric.
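For orientation, the sketch below lists the interconnect tiers described above; the role descriptions are generalizations for illustration, not definitions from any specification.

```python
# Illustrative summary of the interconnect tiers mentioned above; the
# roles are generalizations for illustration, not spec definitions.
interconnect_tiers = {
    "die-to-die":        {"example": "UCIe",           "role": "split monolithic dies into chiplets"},
    "local memory":      {"example": "HBM",            "role": "high-bandwidth memory next to compute"},
    "memory expansion":  {"example": "PCIe / CXL",     "role": "attach and pool additional memory"},
    "scale-up fabric":   {"example": "UALink",         "role": "accelerator-to-accelerator within a pod"},
    "scale-out network": {"example": "Ultra Ethernet", "role": "pod-to-pod across the cluster"},
}

for tier, info in interconnect_tiers.items():
    print(f"{tier:18} {info['example']:15} {info['role']}")
```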
Fig. 1: Scaling to hundreds of thousands of AI accelerators.
Two new standards have emerged to specifically address AI scaling needs:
Ultra Ethernet is an open, interoperable, high-performance architecture tailored for AI, supported by industry leaders across switch, networking, semiconductor, and system providers, as well as hyperscalers. UALink, on the other hand, connects accelerators directly with memory-sharing semantics, and is likewise backed by major players in the semiconductor industry.
As AI and HPC traffic grows, traditional networks based on RoCE or proprietary solutions are showing their limitations. These include strict in-order packet delivery, inefficient flow-based load balancing, and costly go-back-N-style retransmissions in RDMA operations when a packet is dropped, which is especially expensive for AI traffic. Ultra Ethernet Consortium (UEC) technology addresses these issues with a more efficient, scalable, and robust networking solution tailored to the high-performance demands of AI and HPC workloads.
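The cost of those retransmissions is easy to illustrate. The sketch below (not UEC or RoCE code) compares how many packets must be resent after a single drop under go-back-N-style recovery versus selective retransmission:

```python
# Minimal sketch (not UEC or RoCE code): estimate how many packets are resent
# when one packet in a 1,000-packet RDMA message is dropped, comparing
# go-back-N style recovery with selective retransmission.
def resent_packets(total_packets: int, drop_index: int, selective: bool) -> int:
    """Return the number of packets retransmitted after a single drop."""
    if selective:
        return 1                       # only the lost packet is resent
    return total_packets - drop_index  # the drop and everything after it

total = 1000
drop_at = 100  # packet 100 of 1,000 is lost

print("go-back-N resends:", resent_packets(total, drop_at, selective=False))  # 900
print("selective resends:", resent_packets(total, drop_at, selective=True))   # 1
```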
Fig. 2: Ultra Ethernet cluster diagram.
The Ultra Ethernet system is composed of clusters that include nodes and fabric infrastructure. Nodes connect to the network via Fabric Interfaces (network cards), which can host multiple logical Fabric End Points (FEPs). The network is organized into multiple planes, each containing interconnected FEPs typically linked through switches.
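As a rough structural model of that description, the following sketch uses illustrative class and field names (not taken from the UEC specification):

```python
# Rough structural sketch of the Ultra Ethernet cluster description above;
# class and field names are illustrative, not taken from the UEC spec.
from dataclasses import dataclass, field

@dataclass
class FabricEndPoint:
    fep_id: int
    plane: int            # the network plane this FEP attaches to

@dataclass
class FabricInterface:    # the "network card" hosting one or more FEPs
    name: str
    feps: list[FabricEndPoint] = field(default_factory=list)

@dataclass
class Node:
    name: str
    interfaces: list[FabricInterface] = field(default_factory=list)

# A node with one Fabric Interface exposing two FEPs on separate planes.
node = Node("node0", [FabricInterface("fi0", [FabricEndPoint(0, plane=0),
                                              FabricEndPoint(1, plane=1)])])
print(node)
```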
The clusters can operate in two main modes, addressing AI and HPC workloads respectively.
Fig. 3: Ultra Ethernet redefines Ethernet with a next-generation transport protocol designed specifically for AI and HPC applications. (Credit: The Ultra Ethernet Consortium)
As AI models grow larger, the need for compute and memory resources increases significantly, and traditional interconnects are not specialized for dedicated AI scale-up networks. UALink, a scale-up fabric, enables a standards-based network of extremely high-bandwidth connections among dozens to hundreds of dedicated AI accelerators. This is a significant advance for scale-up networks, moving away from ad-hoc configurations toward standardized networks that enable higher-radix systems built around dedicated UALink switches.
Fig. 4: UALink enables an open ecosystem for scale-up network and switches for AI accelerators. (Taken from: HiPChips at MICRO-2024)
UALink creates a high-speed, low-latency network that connects multiple accelerators (such as GPUs) within a pod, allowing each accelerator to directly access and modify the memory of any other accelerator in the same scale-up network. From a software perspective, the interconnected pod appears as a single, massive GPU.
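Conceptually, pod-wide memory sharing amounts to mapping a global address onto an (accelerator, local offset) pair. The sketch below invents an address layout purely for illustration; it is not the UALink addressing scheme, and the 192 GB per-accelerator capacity is an assumption.

```python
# Conceptual sketch of pod-wide shared memory addressing; the address layout
# is invented for illustration and is not the UALink addressing scheme.
ACCEL_MEM_BYTES = 192 * 2**30   # assumed 192 GB of local memory per accelerator

def global_to_local(global_addr: int) -> tuple[int, int]:
    """Map a pod-wide address to (accelerator id, local offset)."""
    return divmod(global_addr, ACCEL_MEM_BYTES)

def local_to_global(accel_id: int, offset: int) -> int:
    """Map an accelerator-local offset back to a pod-wide address."""
    return accel_id * ACCEL_MEM_BYTES + offset

addr = local_to_global(accel_id=7, offset=0x1000)
print(global_to_local(addr))   # (7, 4096)
```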
Synopsys offers UALink and Ultra Ethernet IP solutions designed to connect massive AI accelerator clusters.
Fig. 5: Ultra Ethernet and UALink IP solutions.
The Synopsys Ultra Ethernet IP solution offers up to 1.6 terabits per second of bandwidth and can scale to a million endpoints. The Synopsys UALink IP delivers up to 200 gigabits per second per lane and can connect more than a thousand accelerators. These solutions are optimized for AI scale-up and scale-out, providing the high bandwidth and lightweight protocols essential for AI communication.
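A quick back-of-the-envelope using those headline figures (the four-lane UALink configuration is an assumption for illustration, not a product configuration):

```python
# Back-of-the-envelope sketch using the headline figures above; lane counts
# are assumptions for illustration, not product configurations.
ultra_ethernet_port_tbps = 1.6          # per the figures above
ualink_lane_gbps = 200                  # per the figures above
assumed_ualink_lanes = 4                # hypothetical lane count per link

ualink_link_gbps = ualink_lane_gbps * assumed_ualink_lanes
print(f"UALink link (x{assumed_ualink_lanes}): {ualink_link_gbps} Gb/s "
      f"= {ualink_link_gbps / 8} GB/s raw")
print(f"Ultra Ethernet port: {ultra_ethernet_port_tbps} Tb/s "
      f"= {ultra_ethernet_port_tbps * 1000 / 8} GB/s raw")
```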
As the AI landscape continues to expand, the adoption of standardized interfaces will be crucial in driving innovation, reducing complexity, and enhancing overall system performance. The future of AI infrastructure lies in these collaborative, open-standard solutions that empower the industry’s growth and efficiency. Synopsys is at the forefront of AI and HPC design innovation, offering a broad portfolio of high-speed interface IP. With complete, secure IP solutions for PCIe 7.0, 1.6T Ethernet, CXL, HBM, UCIe, and now Ultra Ethernet and UALink, we are enabling new levels of AI and HPC performance, scalability, efficiency, and interoperability to help ensure our customers achieve first-pass silicon success.
Jon Ames is a senior staff product marketing manager for Ethernet IP at Synopsys.