Integrating Ethernet, PCIe, And UCIe For Enhanced Bandwidth And Scalability For AI/HPC Chips

Efficiently connecting the multiple CPUs and accelerators, various switches, and numerous NICs in modern data centers.

By Madhumita Sanyal and Aparna Tarde

Multi-die architectures are becoming a pivotal solution for boosting performance, scalability, and adaptability in contemporary data centers. By breaking down traditional monolithic designs into smaller dies, either heterogeneous or homogeneous (also known as chiplets), engineers can fine-tune each component for specific functions, resulting in notable improvements in efficiency and capability. This modular strategy is especially advantageous for data centers, which demand high-performance, reliable, and scalable systems to process large volumes of data and complex AI workloads.

Hyperscale data centers, with their intricate and continually evolving architectures, can leverage various types of multi-die designs:

  • Compute Dies: These are responsible for core processing tasks, including general-purpose CPUs, GPUs for parallel processing, and specialized accelerators for AI and machine learning.
  • Memory Dies: These provide the essential storage and bandwidth for data-intensive applications, supporting various memory types such as DDR, HBM, and new non-volatile technologies.
  • IO Dies: These manage input and output operations, ensuring efficient data transfer between compute dies and external interfaces like memory, networking, and storage, thus guaranteeing high data throughput and low latency.
  • Custom Dies: These can be tailored to meet specific needs or optimize certain functions, including security dies for enhanced data protection, power management dies for efficient energy consumption, and networking dies for advanced communication capabilities.

This article explores how integrating multi-die designs with PCIe & Ethernet, in conjunction with UCIe IP, maximizes bandwidth and performance, facilitating the scaling up and out of modern AI data center infrastructures.

Why scaling up and scaling out are key for data center connectivity

One of the most significant challenges in constructing an AI infrastructure lies in interconnecting tens of thousands of servers spread across multiple data centers to form a vast network capable of handling AI workloads. An AI data center is complex, featuring multiple CPUs and accelerators, various switches, numerous NICs, and a host of other devices. Connecting these components seamlessly requires an efficient network. This is where scale-up and scale-out technologies become key, and IO disaggregation provides an opportunity to address both strategies. In a scale-up scenario, PCIe, paired with UCIe IP for die-to-die connectivity, can act as the internal network fabric. Meanwhile, in a scale-out scenario, Ethernet paired with UCIe IP can enable high-speed, low-latency links between servers.

Multi-die designs with Ethernet and PCIe

As shown in Figure 1, there are many opportunities for multi-die designs to enable scaling up and out. Multi-die designs with PCIe, Ethernet, and UCIe IP are essential to address time-to-market, cost, and risk-reduction challenges while offering full architectural flexibility. Let’s dive into the main types of IO chiplets for multi-die designs, including very large AI training chips, switch SoCs, and retimers.

1. Very large AI training chips

AI chips must become significantly more efficient at both computation and data management to handle the massive data models of today. Specialized AI training chips are designed to meet these immense computational and data processing demands, integrating multiple processing units, memory, and interconnects on a single silicon die to deliver unparalleled performance and efficiency. This is where multi-die designs, integrating 40G UCIe and 224G Ethernet, step in to enable efficient AI training. Instead of relying on thousands of huge GPUs, data centers could run their AI training on SoCs that use significantly less beachfront (the die-edge area available for IO) while achieving unprecedented bandwidth and extended reach with the least latency and power overhead.

224G Ethernet PHY IP provides a robust and customizable interface. With CEI-224G in development, achieving 224Gbps per lane while maintaining ecosystem interoperability and optimizing power is critical for AI training operations. Additionally, UCIe IP can deliver up to 40Gbps of high-speed, low-latency, energy-efficient data transfers across multiple dies, significantly enhancing the scalability and modularity of these chips.
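To put the two rates side by side, the short sketch below estimates how many UCIe lanes it takes to carry the raw bandwidth of a given number of 224G Ethernet lanes. It is a back-of-the-envelope calculation under stated assumptions: the 40Gbps figure is read as a per-lane rate, only raw line rates are compared, and FEC, encoding, and protocol overhead are ignored. The function name `ucie_lanes_needed` is hypothetical and not part of any vendor API.

```python
import math


def ucie_lanes_needed(ethernet_lanes, eth_gbps_per_lane=224.0, ucie_gbps_per_lane=40.0):
    """Estimate the UCIe lanes required to match raw Ethernet lane bandwidth.

    Assumptions (not from the article): 40 Gb/s is treated as a per-lane rate,
    and no FEC, encoding, or protocol overhead is modeled.
    """
    total_ethernet_gbps = ethernet_lanes * eth_gbps_per_lane
    # Round up: a partially used UCIe lane still has to be provisioned.
    return math.ceil(total_ethernet_gbps / ucie_gbps_per_lane)


if __name__ == "__main__":
    for lanes in (1, 4, 8):
        print(f"{lanes} x 224G Ethernet -> {ucie_lanes_needed(lanes)} UCIe lanes at 40G")
```

In this simplified model a single 224G lane needs six 40G UCIe lanes behind it, which is one way to see why die-to-die interfaces are built wide and parallel rather than fast and narrow.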

Fig. 1: 224G/UCIe multi-die design for AI training chips.

2. 100T switch SoCs with electrical or optical co-packaged interfaces

AI accelerators are of course a big part of the equation, but how do you connect them together? It takes a lot of switches. Switch SoCs are emerging as another solution for scaling out AI and HPC data centers while maintaining power efficiency, and they can provide either an electrical reach of 3-4 meters or an optical reach of 10-100 meters. These SoCs integrate both electrical and optical interconnects directly into CPUs and GPUs, enabling the scalable and efficient network optimizations essential for resolving connectivity bottlenecks as cluster sizes rapidly grow. Electrical I/O supports high bandwidth density and low power but is limited to short reaches, whereas optical interconnects can extend reach significantly. Pluggable optical transceiver modules can increase reach, but at unsustainable cost and power levels for large-scale AI workloads. In contrast, co-packaged optical I/O solutions can support higher bandwidths with improved power efficiency, low latency, and extended reach, which is precisely what AI/ML infrastructure scaling demands.

Optical and electrical IOs can support multiple high-speed channels running at 224Gbps while consuming significantly less power compared to traditional pluggable QSFP-DD or OSFP transceiver modules. Furthermore, integrating advanced standards like UCIe and high-speed Ethernet addresses the limitations of traditional interconnects by facilitating high-speed, low-latency communication with the main die.
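The scale of a 100T-class switch is easier to appreciate as a lane count. The sketch below works through the arithmetic under assumptions that are not stated in the article: "100T" is read as 102.4 Tb/s of aggregate throughput, each 224G-class lane is assumed to carry a nominal 200 Gb/s of payload, and ports are assumed to be 1.6T wide. The helper `switch_lane_budget` is a hypothetical name used only for illustration.

```python
def switch_lane_budget(switch_gbps, port_gbps=1600, payload_gbps_per_lane=200):
    """Break a switch's aggregate throughput into lanes and ports.

    Assumptions (illustrative only): 200 Gb/s nominal payload per 224G-class
    lane and 1.6T ports. All rates are integers in Gb/s to avoid float error.
    """
    # Ceiling division so a partially filled lane is still counted.
    lanes = -(-switch_gbps // payload_gbps_per_lane)
    ports = switch_gbps // port_gbps
    lanes_per_port = port_gbps // payload_gbps_per_lane
    return {"lanes": lanes, "ports": ports, "lanes_per_port": lanes_per_port}


if __name__ == "__main__":
    # 102.4 Tb/s expressed in Gb/s; an assumed reading of "100T".
    print(switch_lane_budget(102_400))
```

Under these assumptions the switch needs several hundred high-speed lanes at the package edge, which is why per-lane power, whether electrical or optical, dominates the design trade-off.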

Fig. 2: 100T optical/electrical switch SoCs.

3. High BW IO for retimers or extended reach

Retimers and extended reach solutions are also becoming indispensable due to their critical role in maintaining signal integrity and reducing latency over long distances. Retimers support advanced protocols like PCIe and CXL, ensuring seamless integration into modern data center architectures and enabling substantial memory expansion without requiring an overhaul of existing systems. This compatibility is essential for handling memory-intensive AI inference operations and overcoming signal integrity challenges posed by newer standards like PCIe 7.0.

The convergence of PCIe and CXL protocols is reshaping data center architectures by enabling memory pooling and dynamic, cost-effective memory allocation. For retimers to be effective in this new landscape, they must be protocol-aware and capable of adapting to the rapidly evolving CXL standards. Features such as on-chip diagnostics, secure boot capabilities, and low power consumption are critical to ensuring security, ease of debugging, and sustainability. The industry’s shift towards multi-die designs further underscores the necessity for versatile, high-bandwidth I/O solutions, which simplify system design and accelerate time-to-market. These technological advancements are not only crucial for meeting the current demands of AI and high-performance computing but also for future-proofing data centers against the ever-increasing computational and bandwidth requirements.

Fig. 3: Retimers or extended reach IO design.

Example of a multi-die implementation with Ethernet, PCIe, and UCIe IP

Figure 4 shows an example of a multi-die design with a 224G Ethernet PHY and integrated 1.6T PCS and MAC Ethernet controllers, PCIe 6.x or 7.0 PHY and controllers, security IP, sensors, DFT, and UCIe PHY and controller IP. This design can be reconfigured to deliver 1.6T/3.2T/6.4T of throughput over a variety of channels, including 45dB LR, MR, and VSR Ethernet as well as PCIe 6.x and 7.0 reaches.

  • 45dB Long Reach Ethernet & UCIe retimer die-to-die design
  • Combo PCIe/CXL/Ethernet and UCIe die-to-die design
  • 1.6T/3.2T/6.4T scalable IO design for switches

Fig. 4: Multi-die design block diagram.

This multi-die design supports a configurable number of lanes for 224G data transmission in both directions, accommodating up to 45dB insertion loss. It aims to meet the increasing demands of AI infrastructure for higher bandwidth, reduced power consumption, and extended reach. This example implementation enhances scalability for CPU/GPU cluster connectivity and innovative compute architectures, such as coherent memory expansion and resource disaggregation.
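Two of the numbers above lend themselves to a quick worked example: the lane count behind each throughput tier, and what 45dB of insertion loss means for the signal arriving at the receiver. The sketch below is illustrative only. It assumes a nominal 200 Gb/s of payload per 224G lane (an assumption, not a figure from the article) and uses the standard decibel relation for amplitude, ratio = 10^(-dB/20). Both function names are hypothetical.

```python
def lanes_for_throughput(throughput_gbps, payload_gbps_per_lane=200):
    """Lanes needed for a throughput tier, rounded up.

    Assumes 200 Gb/s nominal payload per 224G lane (illustrative assumption).
    """
    return -(-throughput_gbps // payload_gbps_per_lane)


def amplitude_ratio(insertion_loss_db):
    """Fraction of the transmitted amplitude left after the given loss in dB."""
    return 10 ** (-insertion_loss_db / 20)


if __name__ == "__main__":
    for tier_gbps in (1600, 3200, 6400):
        print(f"{tier_gbps / 1000:.1f}T -> {lanes_for_throughput(tier_gbps)} lanes")

    remaining = amplitude_ratio(45)
    print(f"45 dB loss leaves about {remaining * 100:.2f}% of the signal amplitude")
```

With well under one percent of the transmitted amplitude surviving a 45dB channel, the receiver's equalization does most of the work, which helps explain why long-reach capability is called out as a distinguishing feature of the PHY.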

Summary

To enhance bandwidth for multi-die designs, integrating high-speed interfaces such as PCIe and Ethernet, along with UCIe IP and link health monitoring features, is crucial. Synopsys offers comprehensive IP solutions that support UCIe up to 40Gbps, featuring signal integrity monitors and testability, 224G Ethernet, and PCIe 7.0. These solutions ensure maximum bandwidth, low latency, and scalability. Synopsys IP solutions for multi-die designs are aligned with evolving standards, ensuring interoperability with ecosystem products and achieving successful silicon implementations across various technologies. This makes them a reliable choice for developing the next generation of AI chips for data centers.

Synopsys’ comprehensive and scalable multi-die solution, encompassing EDA and IP products, enables early architecture exploration, fast software development and validation, efficient die/package co-design, robust die-to-die connectivity, and improved manufacturing and reliability.

Aparna Tarde is a senior staff technical product manager at Synopsys.


