SPONSOR BLOG

UEC-LLR: The Future Of Loss Recovery In Ethernet For AI And HPC

Link layer retry fixes packet loss locally and avoids expensive recovery mechanisms.

As Artificial Intelligence (AI) and High-Performance Computing (HPC) systems become the backbone of modern data centers, they generate and consume a massive amount of data. Traditional Ethernet was not built for such high-bandwidth traffic.

In HPC and AI workloads, computation is distributed across nodes, and data is shared in real time over links that must be low-latency and lossless. Because all the processes are synchronized with each other, a slight delay or a single lost packet can slow down the whole system. Packet loss therefore causes major performance degradation, and traditional recovery mechanisms at higher layers are too slow.

Hence, the Ultra Ethernet Consortium (UEC) is introducing a link-layer retry (LLR) mechanism to avoid costly recovery at higher layers.

UEC – Link Layer Retry

As noted above, traditional Ethernet usually leaves packet loss to higher layers such as TCP, which leads to high latency. Alternatives exist, such as RDMA over InfiniBand or RoCE, but they are complex and vendor-specific.

LLR offers a middle ground: reliable loss recovery over existing Ethernet, without RDMA, in a simpler solution.

In LLR, packet loss is detected and the frame is retried at the local link layer instead of involving the whole protocol stack and the CPU, which improves throughput and lowers latency.

How UEC-LLR works

In simple terms:

  • Frame transmission with sequence numbering: Every frame sent between two nodes is tagged with a link-local sequence number. These sequence numbers are not visible to the upper layers.
  • Frame buffering: The sender holds each transmitted frame in a local retry buffer until the receiver acknowledges it.
  • Acknowledgment: The receiver sends an ACK back to the sender for each frame it receives successfully.
  • Loss detection: If a frame is dropped or corrupted, the receiver notices a gap in the sequence numbers and sends a negative acknowledgment (NACK) back to the sender. If the sender receives neither an ACK nor a NACK within a set time window, it initiates a timeout-based retry.

The sender retransmits only the lost or corrupted frames from the local retry buffer, avoiding transport-level retransmission. Once the sender receives the ACKs, it removes the corresponding entries from the retry buffer.
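The exchange described above can be sketched as a small simulation. This is an illustration of the behavior as this article describes it, not the UEC frame format or state machine: the names `LlrSender`, `LlrReceiver`, and `simulate`, the tick-based clock, and the `timeout` value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LlrSender:
    timeout: int = 3                                   # hypothetical, in abstract ticks
    next_seq: int = 0
    retry_buffer: dict = field(default_factory=dict)   # seq -> (payload, last_sent_at)

    def send(self, payload, now):
        """Tag a frame with a link-local sequence number and buffer it."""
        seq = self.next_seq
        self.next_seq += 1
        self.retry_buffer[seq] = (payload, now)
        return seq, payload

    def on_ack(self, seq):
        """The receiver has the frame: drop it from the retry buffer."""
        self.retry_buffer.pop(seq, None)

    def on_nack(self, seq, now):
        """The receiver saw a gap: resend just that frame from the buffer."""
        if seq not in self.retry_buffer:
            return None                                # already acknowledged
        payload, _ = self.retry_buffer[seq]
        self.retry_buffer[seq] = (payload, now)
        return seq, payload

    def on_tick(self, now):
        """Timeout path: resend frames with no ACK for `timeout` ticks."""
        resend = []
        for seq, (payload, sent_at) in list(self.retry_buffer.items()):
            if now - sent_at >= self.timeout:
                self.retry_buffer[seq] = (payload, now)
                resend.append((seq, payload))
        return resend


@dataclass
class LlrReceiver:
    expected: int = 0                                  # next in-order sequence number
    held: dict = field(default_factory=dict)           # frames waiting behind a gap
    nacked: set = field(default_factory=set)           # gaps already reported
    delivered: list = field(default_factory=list)      # what the upper layer sees

    def on_frame(self, seq, payload):
        """Return control messages such as [("ACK", 2), ("NACK", 1)]."""
        replies = [("ACK", seq)]                       # duplicates are re-ACKed
        if seq >= self.expected and seq not in self.held:
            self.held[seq] = payload
            for missing in range(self.expected, seq):  # gap => earlier frame lost
                if missing not in self.held and missing not in self.nacked:
                    self.nacked.add(missing)
                    replies.append(("NACK", missing))
            while self.expected in self.held:          # hand over frames in order
                self.delivered.append(self.held.pop(self.expected))
                self.expected += 1
        return replies


def simulate(payloads, drop_once=frozenset()):
    """Send payloads over a link that loses each seq in `drop_once` one time."""
    tx, rx = LlrSender(), LlrReceiver()
    already_dropped = set()

    def deliver(frame, now):
        seq, payload = frame
        if seq in drop_once and seq not in already_dropped:
            already_dropped.add(seq)                   # frame lost on the wire
            return
        for kind, s in rx.on_frame(seq, payload):
            if kind == "ACK":
                tx.on_ack(s)
            else:
                retry = tx.on_nack(s, now)
                if retry:
                    deliver(retry, now)

    now = 0
    for p in payloads:
        deliver(tx.send(p, now), now)
        now += 1
    while tx.retry_buffer:                             # let timeouts flush the rest
        now += 1
        for frame in tx.on_tick(now):
            deliver(frame, now)
    return rx.delivered, tx.retry_buffer
```

Calling `simulate(["a", "b", "c", "d"], drop_once={1, 3})` exercises both recovery paths: the loss of frame 1 is revealed by the arrival of frame 2 and repaired through a NACK, while the loss of the final frame 3 leaves no later frame to expose the gap, so it is repaired by the timeout. In both cases the upper layer only ever sees the frames in their original order.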

UEC outlines a layered Ethernet stack consisting of:

  • Standard Ethernet PHYs (100G/200G/400G/800G)
  • LLR Layer – Link-level recovery from transient packet loss
  • UET (Ultra Ethernet Transport) – Manages ordering, congestion control, and flow control
  • Application Layer – AI frameworks and HPC workloads.

LLR sits above the PHY and below the UET layer, providing loss resilience before the higher layers get involved.

By fixing packet loss locally, Link Layer Retry avoids expensive recovery mechanisms and provides a more open, scalable, and efficient solution for AI networking and HPC.

With the availability of the Cadence Verification IP for Ethernet UEC, adopters can start working with these specifications immediately, ensuring compliance with the standard and achieving the fastest path to IP and SoC verification closure. Incorporating the latest protocol updates, the mature and comprehensive Cadence Verification IP (VIP) for the Ethernet protocol provides a complete bus functional model (BFM), integrated automatic protocol checks, and a coverage model. Designed for easy integration in test benches at IP, system-on-chip (SoC), and system levels, the VIP for Ethernet helps you reduce the time to test, accelerate verification closure, and ensure end-product quality. The VIP for Ethernet runs on all major simulators and supports the SystemVerilog and e verification languages and associated methodologies, including the Universal Verification Methodology (UVM).


