Physical I/Os can be a chokepoint for high-performance chips and high-speed interconnect protocols, requiring design tradeoffs and extra reliability measures.
The growing adoption of AI everywhere, from artificial general intelligence (AGI) to drug and materials discovery, is shifting the focus from just building the fastest chip at the latest process node to architecting a system around the rapid movement of huge amounts of data.
Systems need to be balanced across processors, memory, and interconnects, and they need a level of orchestration, as well, to ensure data gets to where it needs to go at the right time. Blazing-fast processors developed at 18 angstroms are wasted if they’re sitting idle waiting for data to be retrieved from memory or because some lower-priority job is clogging the data paths.
“Not only do you want a crazy amount of low-latency communication between all your chips, but you want it to be deterministic,” said Saurabh Gayen, chief solutions architect at Baya Systems. “When you’re typing to your AI chatbot, you don’t want to sit around waiting for it to think forever. You want it to start talking to you and have that conversation going. That fundamentally changes how you are thinking about network and I/O connectivity.”
Multi-die assemblies and advanced packaging have multiplied the number of decisions that need to be made around I/Os and interconnects, and that’s only exacerbated by complex and rapidly changing markets.
“With new packaging technologies, a lot of system-level analysis and budget-type analysis are important,” said Hee Soo Lee, high-speed digital design segment lead at Keysight EDA. “Also, at an engineer’s level, being able to make really clean channels at the physical layer and ensuring those I/Os get data out in a cleaner way is very important. That is where a lot of learning and adapting is happening, using new technologies, EDA solutions, and tools. The learning curve is demanding, but that’s the core factor for success — getting into a new market, and getting market share out of these harsh market conditions.”
Balancing tradeoffs
Numerous I/O tradeoffs must be correctly balanced to ensure the commercial success of an AI chip. “What tradeoff you make will have an impact on your airflow, cooling, rack design, the power coming into the rack, and so on,” said Arif Khan, vice president of product management and marketing for the Silicon Solutions Group at Cadence.
There is no one-size-fits-all solution, however. Data movement patterns change over time, and they vary by workload.
“There are tools that are helping, but the decisions are not at that micro level,” Khan said. “Agentic AI and other AI capabilities are getting added to pretty much all the tools in the design flow. In some places, they’re more mature than others. The complex, physics-based AI is not all there yet. Some of these are very hard physics problems to solve in terms of system implementations, including thermal. As of today, there are some uses, but not to the degree that it can significantly accelerate things.”
Others agree. “The nastiest I/O design problems today show up where physics and integration economics collide,” said Andy Nightingale, vice president of product management and marketing at Arteris. “This includes chiplets, or multi-die in 2.5D/3D packages, plus leading-edge compute tiles that push power density and clocking. Advanced packaging shortens interconnect distances, but it also multiplies interfaces, clock domains, power islands, and ‘unknown unknowns’ in signal/power integrity, thermal gradients, and test/bring-up.”
Heterogeneous integration only compounds those issues. “Chiplets and 3D multi-die packaging present the hardest I/O challenges due to heterogeneous interfaces, signal integrity constraints, and rapidly scaling bandwidth requirements,” said William Wang, CEO of ChipAgents. “Engineers must understand signal integrity, power delivery, retry mechanisms, protocol stacking, and thermal-bandwidth tradeoffs as AI massively increases data movement pressure across dies.”
Solving one problem at a time is possible, but they all need to be solved at once. “The coupling of multiple elements is the hardest challenge,” said Ashish Darbari, CEO of Axiomise. “Any individual solver — thermal, mechanical, electrical — has gotten genuinely good. The problem is that the coupling is bidirectional and spans time and length scales that are orders of magnitude apart, and the tools involved don’t naturally talk to each other. Signal integrity in multi-die systems has the same character. UCIe and BoW links running at 32, 48, or 64 giga-transfers per second are going into bumps whose impedance shifts with temperature and mechanical strain. Heterogeneous integration makes the bookkeeping painful — a compute die on N2, I/O on N5, SRAM on N3, analog on N16 — different PDKs, different reliability models, different thermal coefficients. Making them produce a coherent picture under one workload is where most multi-die programs quietly lose months.”
I/O and interconnect designers need to make a series of choices, each impacting the next. “Chiplets and 3D do not just add another integration option. They multiply the number of boundaries you have to manage,” said Lou Ternullo, senior director of product management for silicon IP at Rambus. “Now you are choosing where protocols terminate, where coherency lives, and what traffic stays on-package versus what has to survive board-level channels. You also inherit new physical realities, like thermal gradients, power integrity, and tighter signal budgets that can change what looks ‘best’ on a block diagram. The result is that interconnect is no longer a single choice. It is a hierarchy of choices across package, board, and rack.”
In data center design, decisions are made in layers. “You start out knowing how much energy is delivered to a data center, to a rack, and you’ve got power budgets and cooling budgets and so on that you’re going to operate in,” Khan said. “Then there is a budget that gets handed down to a system maker. Not all the components are coming from the same company, either. The budgets are apportioned, then the person building a system may acquire multiple pieces of the equipment from various vendors. How the decisions then get made at each level is going to be a little bit different based on the technical budget and PPA, in addition to cost.”
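The layered budgeting Khan describes can be illustrated with a short sketch. Everything in it is hypothetical: the wattages, the component names, and the simple proportional split are chosen only to show how a rack-level power budget might be handed down and then checked against what each supplier's part requests. A real program would also weigh cooling, PPA, and cost.

```python
def apportion(budget_w, shares):
    """Split a power budget (watts) among children in proportion to their shares."""
    total = sum(shares.values())
    return {name: budget_w * share / total for name, share in shares.items()}


def over_budget(allocations, requests):
    """Return the components whose requested power exceeds their allocation."""
    return sorted(name for name, req in requests.items() if req > allocations[name])


# Hypothetical numbers for illustration only.
rack_budget_w = 120_000
rack_alloc = apportion(
    rack_budget_w,
    {"compute_trays": 80, "switch_trays": 15, "fans_and_misc": 5},
)

# One level down: the compute-tray budget is split across vendor-supplied parts.
tray_alloc = apportion(
    rack_alloc["compute_trays"],
    {"accelerators": 70, "cpus": 15, "nics_and_io": 15},
)

# What each (hypothetical) vendor says its parts will draw.
tray_requests = {"accelerators": 66_000, "cpus": 14_000, "nics_and_io": 16_500}
violations = over_budget(tray_alloc, tray_requests)
print(violations)
```

In this made-up case the I/O parts ask for more than their share, which is the kind of result that forces a tradeoff at the next level up: either the split changes or the I/O choice does.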
Distance between devices is a central consideration. “The main challenge is routing,” said Satish Radhakrishnan, head of GTM for semiconductor and electronics at Vinci. “Interconnect protocols span very different physical distances, from millimeters inside a package to meters within a rack. As more devices are connected and brought closer together, routing becomes much harder. Designers must manage congestion, signal integrity, power delivery, thermal impact, and the physical limits of the package, board, or rack.”

Fig. 1: AI scaling architectures with 1.6T Ultra Ethernet, UALink, and OSFP (Octal Small Form Factor Pluggable) I/O connectors. Source: Synopsys
I/O reliability and redundancy
In high-performance computing systems, both I/O subsystems and interconnects are significant sources of faults and performance degradation.
“Reliability of fault-prone I/Os is tied to the physical implementation,” said Vinci’s Radhakrishnan. “The protocol may define how data moves, but the system still has to support that movement reliably under real power, thermal, mechanical, and manufacturing conditions.”
Redundancy is essential. “In HPC, things that were nice to haves before, like silicon life cycle management, now are must-haves,” said Rob Kruger, product management director for multi-die strategy and 3D IP at Synopsys. “Reliability is a key factor, and we follow OCP (Open Compute Project) standards for reliability, but we also add features such as redundant links in there.”
Further, I/Os can fail during assembly or in the field. “Say you’re doing a 3D link and there’s a failed hybrid bond, which could be a problem during assembly. We have redundant links to replace those in the system,” Kruger explained. “The same is true for UCIe links that connect to chiplets. You could have redundant links to repair broken links during manufacturing, or in the field five years from now. Software can monitor, test, and repair these links.”
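A toy model makes the idea of repairing a link with spare lanes concrete. This is a sketch under stated assumptions, not the repair procedure defined by UCIe or used by any vendor: the function name `repair_lanes`, the lane counts, and the "skip the failed lane and shift the rest" policy are all hypothetical.

```python
def repair_lanes(num_logical, num_physical, failed):
    """Map logical lanes onto healthy physical lanes, skipping failed ones.

    Returns a dict {logical_lane: physical_lane}, or None when too few
    healthy physical lanes remain to carry every logical lane.
    """
    healthy = [p for p in range(num_physical) if p not in failed]
    if len(healthy) < num_logical:
        return None
    return {logical: healthy[logical] for logical in range(num_logical)}


# Hypothetical link: 16 data lanes routed over 18 physical lanes (2 spares).
one_fault = repair_lanes(16, 18, failed={3})
three_faults = repair_lanes(16, 18, failed={3, 7, 9})

print(one_fault[3])    # logical lane 3 now rides on physical lane 4
print(three_faults)    # None: more faults than spares, link cannot be repaired
```

The same mapping could be recomputed at assembly test or years later in the field, which is why monitoring software or hardware needs a way to report which physical lanes have failed.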
Telemetry plays a significant role here. “You might have sensors for process, voltage, and temperature, and signal integrity, for example. Then you aggregate that data and send it off to higher levels of the network,” said Kruger. “How do you aggregate that data? Do you do that with software? Software is fine, but maybe you have thousands of I/Os in the data center, all running software. You might choose a hardware-first approach instead, and the software is there as a backup. In that case, coordinating with the higher-level system is another challenge.”
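The aggregation step Kruger raises can be sketched in a few lines. The example below is illustrative only: the link names, sensor names, readings, and the 95°C limit are invented, and it shows the software-side version of the job. A hardware-first design would perform the same reduction in logic and forward only the summaries.

```python
from collections import defaultdict


def aggregate(readings):
    """Collapse raw (link_id, sensor, value) samples into min/max/mean per link and sensor."""
    grouped = defaultdict(list)
    for link_id, sensor, value in readings:
        grouped[(link_id, sensor)].append(value)
    return {
        key: {"min": min(vals), "max": max(vals), "mean": sum(vals) / len(vals)}
        for key, vals in grouped.items()
    }


def flag_links(summary, limits):
    """Return links whose worst-case reading exceeds the limit for any monitored sensor."""
    return sorted(
        {link for (link, sensor), stats in summary.items()
         if sensor in limits and stats["max"] > limits[sensor]}
    )


# Hypothetical samples: (link, sensor, value).
readings = [
    ("link0", "temp_c", 71.0),
    ("link0", "temp_c", 73.0),
    ("link1", "temp_c", 96.0),
    ("link1", "temp_c", 98.0),
    ("link1", "voltage_v", 0.75),
]
summary = aggregate(readings)
flagged = flag_links(summary, limits={"temp_c": 95.0})
print(flagged)
```

Only the summary and the flagged list need to travel to higher levels of the network, which keeps the telemetry traffic small compared with shipping every raw sample.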

Fig. 2: A simplified data center network showing the connectivity required, with UCIe as an option for 1.6T interconnects in I/O chiplets for AI data centers. Source: Synopsys
Clusters for giant models
One of the problems the HPC ecosystem is trying to solve is how to make a whole cluster of compute nodes act as a single computer, with I/Os a central consideration.
“If you see the evolution of computing, there was an era before 2012 where, within a processor, you had multiple cores integrated,” said Priyank Shukla, director of product management for interface IP at Synopsys. “Then there was a time that within a rack or unit of a server, you had multiple processors, increasing the throughput. But after ChatGPT, we realized a lot of unstructured data can be processed with a different accelerator. You don’t just need a processor. You need a workload-specific accelerator, and that needs to be connected. The scaling laws for large language models — not for CMOS — dictate that if you throw a large quantity of unstructured data at it, you get a very well-trained model. We are trying to create a large cluster that can act as a single, unified compute facility, which is different from what we are doing now. When the whole cluster has to act as a single computer, you need to provide memory to each compute node. You have to pass information across chips, and these offer different kinds of challenges for interconnects and I/O, along with other components.”
This passing of data is critical. “When we say interconnect, we generally talk about the physical layer, or just one layer above,” said Shukla. “These are very fundamental at the protocol level, as well. The idea here is you can pass data — not just simple data, but coherent memory data — to different nodes. There is no end to how much data.”
A unified cluster can significantly boost performance. “From an I/O point of view, what it means is you have to escape as much as possible from one chip,” Shukla explained. “The limitation is you can’t stuff more compute within a chip, so two chips have to act together. The bottom line is how fast they can talk. The chip has a limited edge on the die, so you want to utilize as much bandwidth from a limited beachfront.”
Beachfront density refers to the edge of the die and how much data can be transferred per millimeter. “If you have an accelerator, the code is matrix multiplication or TensorFlow, but then the accelerator needs to talk with others, so they need an I/O, and that’s why people say chiplet I/O,” explained Shukla. “But how do you integrate it? There are different considerations. If you co-package them within a package and your I/O die is at the top, it gets a chance to be at the faceplate of die. With liquid cooling, you can cool this die. But the base die, which is buried under it, doesn’t have a path to extract heat, and that brings a different challenge.”
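The arithmetic behind beachfront density is simple enough to write down. In the sketch below, the edge length, lanes per millimeter, and per-lane rates are placeholder values rather than figures for any real PHY or standard. The point is only that escape bandwidth scales with usable edge length times density times lane speed.

```python
def escape_bandwidth_tbps(edge_mm, lanes_per_mm, gbps_per_lane):
    """Aggregate bandwidth (Tb/s) that can leave a die through a given length of edge."""
    return edge_mm * lanes_per_mm * gbps_per_lane / 1000


def edge_needed_mm(target_tbps, lanes_per_mm, gbps_per_lane):
    """Die-edge length (mm) required to reach a target escape bandwidth."""
    return target_tbps * 1000 / (lanes_per_mm * gbps_per_lane)


# Placeholder numbers for illustration only.
bw = escape_bandwidth_tbps(edge_mm=20, lanes_per_mm=50, gbps_per_lane=32)
edge_at_32g = edge_needed_mm(target_tbps=64, lanes_per_mm=50, gbps_per_lane=32)
edge_at_64g = edge_needed_mm(target_tbps=64, lanes_per_mm=50, gbps_per_lane=64)

print(bw, edge_at_32g, edge_at_64g)
```

Under these assumptions, doubling the per-lane rate halves the edge needed for the same bandwidth, which is why lane speed and bump pitch matter so much when the die edge is fixed.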

Fig. 3: AI connectivity in the data center. Source: Synopsys
Particularly in the AI space, these innovations are necessary because designers recognize that compute is not the only limitation to growth. “Data movement and memory access are really the problem within a chip, as well as across chips,” Baya’s Gayen said. “How are we able to make sure that these crazy, giant AI models, which have gigabytes and gigabytes of footprint, perform efficiently? This is where you get a lot more emphasis on the rack scale design that Nvidia has, for example, with the NVL72 exascale system.”
NVL72 was a milestone to show connectivity is king, Gayen observed. “How do you move the data across the GPUs? The whole idea is that you don’t want a single GPU — you want a ton of GPUs to act as if they’re one giant GPU. That’s where NVLink, and the NV switches associated with it, allowed Nvidia to build a comprehensive system that was not just hyper-focused on compute.”
From a chip architect’s perspective, clusters raise four practical concerns, according to Axiomise’s Darbari:
Congestion challenges and specifications
Networks in the AI era must carry conventional internet and cloud traffic, such as video-on-demand and voice commands, along with new types of traffic from AI training data centers and bursty traffic from AI inference.
“GPU clusters will process the data, and then at specific times, they exchange the results to traffic patterns called collective communication library (CCL), which generates huge amounts of traffic, requiring a huge number of ports at high speeds,” said Razvan Arhip, product manager, AI and Network Test Solutions, at Keysight Technologies, in a recent webinar.
With this traffic approach, designers need to avoid idle GPUs caused by network traffic bottlenecks. “GPUs are expensive, and the clusters are expensive, so you don’t want to idle them because of the network,” said Arhip. “You have to have latencies very low and loss close to zero to avoid retransmissions, which eat up time. You can no longer rely on the upper protocols like TCP (Transmission Control Protocol) in the data centers to fix the loss. You need to minimize the congestion that is generating the loss, so you need to deal with the loss at as low a layer as you can, and this is why new congestion control mechanisms were adopted, like the DCQCN (Data Center Quantized Congestion Notification). This is also why the Ultra Ethernet Consortium released the Link Layer Retry (LLR), which performs retransmissions at layer two. These and CBFC (Credit-Based Flow Control) are driven by big companies in this space.”
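The general principle behind credit-based flow control can be shown with a toy simulation. This is a minimal sketch of the idea, in which a sender transmits only when it holds a credit for a free receiver buffer slot. It is not the CBFC mechanism as specified by the Ultra Ethernet Consortium, and the buffer size, burst size, and drain rate are invented.

```python
def simulate(num_packets, buffer_slots, burst_per_tick, drain_per_tick, use_credits):
    """Return (dropped, ticks) for a sender bursting into a finite receive buffer."""
    pending = num_packets      # packets the sender still has to send
    buffered = 0               # packets currently held by the receiver
    credits = buffer_slots     # sender's view of free receiver slots
    dropped = 0
    ticks = 0
    while pending > 0 or buffered > 0:
        for _ in range(burst_per_tick):
            if pending == 0:
                break
            if use_credits and credits == 0:
                break          # no credit: hold the packet rather than risk a drop
            pending -= 1
            if use_credits:
                credits -= 1
            if buffered < buffer_slots:
                buffered += 1
            else:
                dropped += 1   # buffer overflow: the packet is lost
        freed = min(drain_per_tick, buffered)
        buffered -= freed
        if use_credits:
            credits += freed   # receiver returns one credit per freed slot
        ticks += 1
    return dropped, ticks


with_credits = simulate(100, buffer_slots=8, burst_per_tick=4, drain_per_tick=2, use_credits=True)
without_credits = simulate(100, buffer_slots=8, burst_per_tick=4, drain_per_tick=2, use_credits=False)
print(with_credits, without_credits)
```

With credits, the sender is paced to the receiver and nothing is dropped. Without them, the burst overruns the buffer, and every lost packet would have to be retransmitted, a cost the simulation does not model but that Arhip identifies as the thing to avoid.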
Finally, to mitigate network interconnect failures and associated I/O congestion in large-scale AI training clusters, the OCP recently published an open-source Multipath Reliable Connection (MRC) protocol [1].
According to the related technical paper [2], “a new RDMA (Remote Direct Memory Access)-based transport protocol, MRC, sprays across many paths and actively load-balances between them, eliminating the issue of flow collisions.” Further, it lowers latency, makes many more nodes reachable in one hop, reduces cost and power consumption, greatly lessens the impact of an in-network failure, and allows a NIC-T0 link to be lost without bringing down the training job.
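Why spraying avoids flow collisions can be seen in a toy comparison. The sketch below is not MRC's load-balancing algorithm, which the paper describes as active. It only contrasts pinning each flow to one randomly chosen path with spreading every flow's packets round-robin over all paths. The flow sizes, path count, and random seed are arbitrary.

```python
import random


def per_flow_hashing(flows, num_paths, rng):
    """Pin every packet of a flow to one randomly chosen path."""
    load = [0] * num_paths
    for packets in flows:
        load[rng.randrange(num_paths)] += packets
    return load


def packet_spraying(flows, num_paths):
    """Spread the packets of every flow round-robin across all paths."""
    load = [0] * num_paths
    next_path = 0
    for packets in flows:
        for _ in range(packets):
            load[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return load


# Arbitrary example: 8 equal flows of 1,000 packets over 8 paths.
flows = [1000] * 8
hashed = per_flow_hashing(flows, num_paths=8, rng=random.Random(0))
sprayed = packet_spraying(flows, num_paths=8)

print("hashed :", hashed, "busiest path:", max(hashed))
print("sprayed:", sprayed, "busiest path:", max(sprayed))
```

When two large flows land on the same path, that path becomes the bottleneck while others sit idle. Spraying keeps the busiest path at the average load, at the price of packets arriving out of order, which the transport then has to tolerate.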
Conclusion
The competition to deliver accurate, lightning-fast AI capabilities across every sector is fierce. Every element of the HPC system or cluster is under pressure to perform optimally without failure.
“These chips are not being built for an academic purpose,” said Cadence’s Khan. “You want to have the best performance, showcase great systems, sell a lot, and make a lot of money, so the tradeoffs are multi-pronged.”
That means chip architects must weigh the options in terms of I/Os and packaging. “When you go back to even the simplest case of an SoC being disaggregated, there’s a cost,” Khan noted. “You add some latencies. You add power bonds at the interface. Now, is this going to fit in your budget, or are you better off with a monolithic solution?”
The bottom line is there are many options. The challenge is to build a balanced system that optimizes whatever is most important for the end user. “At the end of the day, design teams are trying to solve a multi-dimensional problem, and each type of I/O and advanced packaging has its own challenges and advantages,” Synopsys’ Shukla said.
References
[1] Multipath Reliable Connection (MRC) Specification (Open Compute Project)
[2] Resilient AI Supercomputer Networking using MRC and SRv6 (AMD, Broadcom, Intel, Microsoft, NVIDIA, and OpenAI)