How Ultra Ethernet And UALink Enable High-Performance, Scalable AI Networks

Tackling the challenges of high-bandwidth, low-latency connectivity and efficient resource management.


By Ron Lowman and Jon Ames

AI workloads are a major driver of innovation in the interface IP market. AI model parameters are growing exponentially, doubling approximately every 4-6 months, in stark contrast to the slower pace of hardware advances dictated by Moore’s Law, which follows roughly an 18-month cycle. Closing this gap demands hardware innovation to support AI workloads: greater computational capacity, more resources, and higher-bandwidth interconnects.
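To see how quickly that gap widens, consider a quick back-of-the-envelope sketch in Python. The doubling periods used here, roughly 5 months for model parameters and 18 months for transistor density, are illustrative assumptions drawn from the figures above, not measurements:

```python
# Back-of-the-envelope comparison of AI model growth vs. Moore's Law.
# Assumptions (illustrative only): parameters double every ~5 months,
# transistor density doubles every ~18 months.
PARAM_DOUBLING_MONTHS = 5
MOORE_DOUBLING_MONTHS = 18

def growth_factor(months, doubling_period):
    """Multiplicative growth after `months`, given a doubling period."""
    return 2 ** (months / doubling_period)

for years in (1, 2, 3):
    months = 12 * years
    params = growth_factor(months, PARAM_DOUBLING_MONTHS)
    moore = growth_factor(months, MOORE_DOUBLING_MONTHS)
    print(f"{years} yr: params x{params:,.0f}, transistors x{moore:.1f}, "
          f"gap x{params / moore:,.0f}")
```

After three years the model has grown by roughly two orders of magnitude more than the silicon under it, which is exactly the pressure that pushes designs toward multi-die scaling and faster interconnects.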

Compounding the challenge, die sizes have run up against the standard reticle limit. Both CPU and GPU designs are pushing the reticle boundary due to their extensive arrays of compute units and associated memory. AI accelerators and GPUs now require a new, ultra-efficient networking infrastructure to scale workloads beyond a single die, demanding low latency, high-density connections, and optimal energy efficiency in chip-to-chip communications.

This article delves into the technical aspects of how scaling up and out is becoming a critical need for HPC and AI chip developers, and how new standards such as Ultra Ethernet and Ultra Accelerator Link (UALink) aim to tackle the challenges of high-bandwidth, low-latency connectivity and efficient resource management.

Emergence of new standards

Driven by the demands of AI workloads, scaling chip-to-chip architectures both up and out has become essential. This means transitioning from monolithic dies to multi-die designs and leveraging parallel interfaces like HBM and UCIe. These solutions support both homogeneous and heterogeneous compute architectures, utilizing traditional connections through PCIe and CXL for memory expansion, and Ethernet for broader network architectures.

Fig. 1: Scaling to hundreds of thousands of AI accelerators.

Two new standards have emerged to specifically address AI scaling needs:

  • Ultra Ethernet for scaling out
  • UALink for scaling up

Ultra Ethernet is an open, interoperable, high-performance architecture tailored for AI, supported by industry leaders across switch, networking, semiconductor, and system providers, as well as hyperscalers. UALink, on the other hand, lets accelerators operate directly on one another’s memory through dedicated memory-sharing capabilities, and is likewise backed by major players in the semiconductor industry.

Ultra Ethernet: Scaling out AI workloads

As AI and HPC traffic grows, traditional networks using RoCE or proprietary solutions are showing their limitations. These include strict in-order packet delivery, inefficient flow-based load balancing, and cumbersome retransmissions when a packet is dropped during RDMA operations, which can be very costly for AI workloads. The Ultra Ethernet Consortium (UEC) technology addresses these issues by offering a more efficient, scalable, and robust networking solution tailored to the high-performance demands of AI and HPC workloads.
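The cost of those retransmissions is easy to quantify with a toy model. The sketch below contrasts go-back-N-style recovery, typical of RoCE transports, with the selective retransmit approach Ultra Ethernet adopts; the window size and loss position are made-up numbers, and this is not the UEC algorithm itself:

```python
# Toy model of retransmission cost after a single packet loss inside an
# in-flight window. Numbers are illustrative, not taken from any specification.

def retransmitted_packets(window, lost_index, selective):
    """Packets resent after one loss inside an in-flight window."""
    if selective:
        return 1                    # resend only the lost packet
    return window - lost_index      # resend the lost packet and everything after it

WINDOW = 256                        # packets in flight (assumed)
LOST = 10                           # position of the dropped packet (assumed)
print("go-back-N :", retransmitted_packets(WINDOW, LOST, selective=False))  # 246
print("selective :", retransmitted_packets(WINDOW, LOST, selective=True))   # 1
```

A single early drop in a large window forces hundreds of redundant retransmissions under go-back-N, while selective retransmit resends exactly one packet, which is why the transport-layer changes described below matter so much at AI scale.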

How Ultra Ethernet works

Fig. 2: Ultra Ethernet cluster diagram.

The Ultra Ethernet system is composed of clusters that include nodes and fabric infrastructure. Nodes connect to the network via Fabric Interfaces (network cards), which can host multiple logical Fabric End Points (FEPs). The network is organized into multiple planes, each containing interconnected FEPs typically linked through switches.

Clusters can operate in two main modes to handle different tasks (a code sketch follows the list below):

  • Parallel Job Mode: the system runs tasks to completion and allows multiple nodes to communicate at the same time. This is ideal for high-performance computing tasks that need a lot of parallel processing.
  • Client/Server Mode: the system is set up for storage tasks. Here, a server continuously handles requests from many clients, with communication happening between specific pairs of nodes. This mode is ideal for reliable and consistent data access and management.
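The sketch below models this terminology in code. The class and field names are illustrative only and do not come from the UEC specification:

```python
# Minimal sketch of the Ultra Ethernet cluster terminology described above.
# Class and field names are illustrative, not taken from the UEC specification.
from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    PARALLEL_JOB = "parallel_job"    # many nodes communicate, tasks run to completion
    CLIENT_SERVER = "client_server"  # a server handles requests from many clients

@dataclass
class FabricEndPoint:        # logical FEP hosted on a fabric interface
    fep_id: int
    plane: int               # each FEP attaches to one network plane

@dataclass
class FabricInterface:       # the network card connecting a node to the fabric
    feps: list[FabricEndPoint] = field(default_factory=list)

@dataclass
class Node:
    name: str
    interfaces: list[FabricInterface] = field(default_factory=list)

@dataclass
class Cluster:
    nodes: list[Node]
    planes: int              # planes of interconnected FEPs, typically linked through switches
    mode: Mode

# Example: a two-node cluster with two planes running a parallel job.
n0 = Node("node0", [FabricInterface([FabricEndPoint(0, 0), FabricEndPoint(1, 1)])])
n1 = Node("node1", [FabricInterface([FabricEndPoint(0, 0), FabricEndPoint(1, 1)])])
cluster = Cluster(nodes=[n0, n1], planes=2, mode=Mode.PARALLEL_JOB)
```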

Key technical features of Ultra Ethernet

Fig. 3: Ultra Ethernet redefines Ethernet with a next-generation transport protocol designed specifically for AI and HPC applications. (Credit: The Ultra Ethernet Consortium)

  1. Physical Layer: Compatible with IEEE 802.3 standard Ethernet, with optional performance monitoring based on FEC (Forward Error Correction) codewords. Metrics like UCR (Uncorrectable Codeword Rate) and MTBPE (Mean Time Between Packet Errors) provide insights into transmission performance and reliability.
  2. Link Layer: Introduces LLR (Link Level Retry) protocol for lossless transmission without relying on Priority Flow Control (PFC). This ensures faster error recovery, eliminates unnecessary end-to-end retransmissions, and reduces tail latency.
  3. Packet Rate Improvement (PRI): Compresses Ethernet and IP headers to improve packet rates, addressing inefficiencies caused by legacy features and redundant protocol fields.
  4. Link Negotiation Protocol: Extends LLDP with negotiation capabilities to detect and enable supported features like LLR and PRI.
  5. Transport Layer: Designed to address the limitations of traditional RDMA networks, featuring selective retransmit, out-of-order delivery, packet spraying, and advanced congestion control mechanisms. It supports multiple transmission modes, including Reliable Ordered Delivery (ROD), Reliable Unordered Delivery (RUD), and Unreliable Unordered Delivery (UUD).
  6. Congestion Control: Implements features like incast management, accelerated rate adjustment, telemetry-based control, and adaptive routing via packet spraying to minimize tail latency and enhance network performance (a toy load-balancing comparison follows this list).
  7. Security: Incorporates job-based security at the transport layer, leveraging IPSec and PSP capabilities to minimize encryption overhead and support hardware offloading.
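To make the load-balancing point concrete, the toy model below contrasts classic flow-based (ECMP-style) path selection, where every packet of a flow hashes onto the same path, with per-packet spraying. The path count and flow tuple are made up, and this is not the UEC algorithm:

```python
# Illustrative contrast between flow-based load balancing (ECMP-style hashing)
# and per-packet spraying. Toy model only; values are assumptions.
import itertools

PATHS = 8  # equal-cost paths through the fabric (assumed)

def ecmp_path(flow_tuple):
    """Flow-based: every packet of a flow hashes onto the same path."""
    return hash(flow_tuple) % PATHS

_spray_counter = itertools.count()

def sprayed_path():
    """Per-packet spraying: consecutive packets rotate across all paths."""
    return next(_spray_counter) % PATHS

flow = ("10.0.0.1", "10.0.0.2", 4791)   # src IP, dst IP, UDP port (illustrative)
print("ECMP   :", [ecmp_path(flow) for _ in range(8)])   # one path repeated
print("Sprayed:", [sprayed_path() for _ in range(8)])    # all eight paths used
```

Spraying keeps every fabric path busy even when a single large flow dominates, and the out-of-order delivery and selective retransmit features above are what make it safe to do so.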

UALink: Scaling up AI workloads

As AI models grow larger, the need for compute and memory resources increases significantly, and traditional interconnects are not specialized for dedicated AI workload networks. UALink, a scale-up fabric, enables a standards-based network of extremely high-bandwidth connections among dozens to hundreds of dedicated AI accelerators. This is a significant advancement for scale-up networks, moving the market away from ad-hoc network configurations toward standardized, higher-radix systems built around dedicated UALink switches.

How UALink works

Fig. 4: UALink enables an open ecosystem for scale-up network and switches for AI accelerators. (Taken from: HiPChips at MICRO-2024)

UALink creates a high-speed, low-latency network that connects multiple accelerators (such as GPUs) within a pod. Each accelerator can directly access and modify the memory of any other accelerator in the same scale-up network, so from a software perspective the entire pod functions like a single, massive GPU.
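A conceptual sketch of that programming model is shown below. The Pod and Accelerator classes and their method names are ours, chosen for illustration; they are not UALink APIs:

```python
# Conceptual sketch of the scale-up programming model described above:
# every accelerator in a pod can load, store, and perform atomics on any other
# accelerator's memory. Classes and method names are illustrative only.

class Accelerator:
    def __init__(self, rank, mem_words):
        self.rank = rank
        self.memory = [0] * mem_words   # local HBM, word-addressed for simplicity

class Pod:
    """A scale-up domain: one flat view over all accelerators' memory."""
    def __init__(self, accels):
        self.accels = accels

    def load(self, rank, addr):
        return self.accels[rank].memory[addr]

    def store(self, rank, addr, value):
        self.accels[rank].memory[addr] = value

    def atomic_add(self, rank, addr, value):
        # A single remote atomic in hardware; modeled serially here.
        self.accels[rank].memory[addr] += value
        return self.accels[rank].memory[addr]

pod = Pod([Accelerator(r, mem_words=1024) for r in range(4)])
pod.store(rank=3, addr=0, value=42)     # accelerator 0 writes into accelerator 3's memory
print(pod.load(rank=3, addr=0))         # any accelerator reads it back: 42
```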

Key technical features of UALink

  1. High Bandwidth: UALink delivers up to 200 Gbps per lane, enabling efficient data transfer between accelerators.
  2. Lightweight Protocols: The protocol is designed to be lightweight, reducing overhead and ensuring efficient communication.
  3. Efficiency: Sub-microsecond latency improves inference performance and allows scaling beyond eight GPUs without partitioning workloads.
  4. Open Standard: UALink is an open industry standard, promoting interoperability and reducing vendor lock-in.
  5. Memory Sharing: Dedicated memory-sharing capabilities allow accelerators to access shared memory resources efficiently, supporting load, store, and atomic operations across hundreds of GPUs while minimizing end-to-end latency and reducing power consumption.
  6. Synchronization Features: UALink includes synchronization features to ensure coherence and efficient operation across multiple accelerators (a toy barrier example follows this list).
  7. Complementary with UEC: Works well with the Ultra Ethernet Consortium for broader scalability.
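To make the synchronization point concrete, here is a toy barrier built on an atomic counter held in pod-shared memory. It is illustrative only; UALink hardware provides its own coherence and synchronization mechanisms:

```python
# Toy synchronization barrier over pod-shared memory, as in item 6 above.
# Illustrative only; names and mechanics are ours, not UALink's.

class PodSharedCounter:
    """A counter living in one accelerator's memory, updated by remote atomics."""
    def __init__(self):
        self.value = 0

    def atomic_add(self, delta):
        self.value += delta          # a single remote atomic on real hardware
        return self.value

N_ACCELS = 4
counter = PodSharedCounter()

def arrive(rank):
    """One accelerator arrives at the barrier; True once all have arrived."""
    return counter.atomic_add(1) == N_ACCELS

# Modeled serially here; real accelerators arrive concurrently over the fabric.
for rank in range(N_ACCELS):
    if arrive(rank):
        print(f"rank {rank} was last in; all {N_ACCELS} accelerators synchronized")
```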

Enabling massive AI clusters with Ultra Ethernet and UALink IP solutions

Synopsys offers UALink and Ultra Ethernet IP solutions designed to connect massive AI accelerator clusters.

Fig. 5: Ultra Ethernet and UALink IP solutions.

The Synopsys Ultra Ethernet IP solution offers up to 1.6 terabits per second of bandwidth, enabling up to a million endpoints. Additionally, the Synopsys UALink IP delivers up to 200 gigabits per second per lane, connecting more than a thousand accelerators. These solutions are optimized for AI scale-up and scale-out, providing the high bandwidth and lightweight protocols essential for AI communication.
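As a quick sanity check on those headline figures, the arithmetic below assumes 200 Gbps serial lanes throughout; the port widths shown are derived numbers for illustration, not product specifications:

```python
# Quick arithmetic behind the headline bandwidth figures above, assuming
# 200 Gbps serial lanes. Port widths are derived numbers, not specifications.

LANE_GBPS = 200

def lanes_needed(port_gbps, lane_gbps=LANE_GBPS):
    """Serial lanes required to reach a given aggregate port bandwidth."""
    return port_gbps // lane_gbps

print("1.6T Ethernet port       :", lanes_needed(1600), "x 200G lanes")  # 8
print("Example 800G UALink port :", lanes_needed(800), "x 200G lanes")   # 4 (illustrative width)
```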

Conclusion

As the AI landscape continues to expand, the adoption of standardized interfaces will be crucial in driving innovation, reducing complexity, and enhancing overall system performance. The future of AI infrastructure lies in these collaborative, open-standard solutions that empower the industry’s growth and efficiency. Synopsys is at the forefront of AI and HPC design innovation, offering a broad portfolio of high-speed interface IP. With complete, secure IP solutions for PCIe 7.0, 1.6T Ethernet, CXL, HBM, UCIe, and now Ultra Ethernet and UALink, we are enabling new levels of AI and HPC performance, scalability, efficiency, and interoperability to help ensure our customers achieve first-pass silicon success.

Jon Ames is a senior staff product marketing manager for Ethernet IP at Synopsys.


