More Data, More Redundant Interconnects

Circuits are being pushed harder and longer, particularly with AI, speeding up the aging of data paths. Photonics adds its own complications.

The proliferation of AI dramatically increases the amount of data that needs to be processed, stored, and moved, accelerating the aging of signal paths through which that data travels and forcing chipmakers to build more redundancy into the interconnects.

In the past, nearly all redundant data paths were contained within a planar chip built on a relatively thick silicon substrate. But as chipmakers migrate from planar SoCs to multi-die assemblies, many of the data paths in a package are external. Chiplets need to communicate with other chiplets and various memories scattered throughout a package, and they need to move more data back and forth, which generates heat due to the resistance of the wires. Compounding that, the substrates need to be thinned to speed up the signals, and thinner substrates are less able to spread that heat.

Potential problems only increase from there. Advanced packages — especially those used in large data centers — are packed with different processing elements, many of which are running at full speed for longer periods of time. Those processors collectively generate more heat, and when added to the resistance of the interconnects, the buildup of heat accelerates electromigration, which reduces or completely closes off the data paths more rapidly than in the past.
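That temperature sensitivity can be put in rough numbers using Black's equation, the standard first-order model for electromigration lifetime. The sketch below is illustrative only; the activation energy and current-density exponent are assumed values that vary with the metallization and process.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def em_mttf_ratio(t1_c: float, t2_c: float, j1: float = 1.0, j2: float = 1.0,
                  ea_ev: float = 0.8, n: float = 2.0) -> float:
    """Ratio of electromigration MTTF at condition 2 vs. condition 1, using
    Black's equation MTTF = A * J**(-n) * exp(Ea / (k*T)). The technology-
    dependent prefactor A cancels out in the ratio."""
    t1, t2 = t1_c + 273.15, t2_c + 273.15           # junction temperatures in kelvin
    current_term = (j1 / j2) ** n                   # effect of current density
    thermal_term = math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t2 - 1.0 / t1))
    return current_term * thermal_term

# Same current density, but the junction runs 20 degrees C hotter under sustained AI load.
print(f"Lifetime shrinks to {em_mttf_ratio(85, 105):.2f}x of its original value")
```

With those assumed values, a 20°C rise in sustained junction temperature cuts the expected interconnect lifetime to roughly a quarter of what it was, which is why redundant paths become attractive insurance.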

To extend the lifetime of these very expensive multi-die systems, chipmakers have started to add in more signal paths than will be used at any one time, or possibly even during the lifetime of a chip.

“In machine learning training, the implications of errors are surprisingly catastrophic,” said Kevin O’Buckley, senior vice president and general manager of Intel Foundry. “We marvel at all the innovation that’s gone into scaling from a compute socket to a board to a rack to an entire data center acting as one coherent system. It’s amazing. But then, from an error standpoint, you start going to the second power to the third power, to the power to the power, and those errors multiply. So there’s an incredible amount of extra attention being applied now that I haven’t seen outside of certain very unique HPC applications. That’s for things like memory correction. We’re adding a lot more redundancy, a lot more parity, a lot more fancy algorithms and codes for doing memory checking. But it’s also extending to the connectivity links. Having redundancy and parity in some of our connecting links is really, really important.”
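The "codes for doing memory checking" O’Buckley mentions span everything from simple parity to far more elaborate schemes. As a minimal illustration of the underlying idea, the sketch below implements the textbook Hamming(7,4) code, which adds three parity bits to every four data bits so a single flipped bit can be located and corrected. It is not intended to represent what any foundry or chipmaker actually ships.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword (1-indexed
    positions 1, 2, 4 hold parity; 3, 5, 6, 7 hold data)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c: list[int]) -> list[int]:
    """Recompute parity, locate a single flipped bit from the syndrome,
    and return the corrected 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3        # 0 means no error; otherwise the 1-indexed position
    if pos:
        c = c.copy()
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                          # flip one bit in flight
assert hamming74_correct(word) == [1, 0, 1, 1]
```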

It’s also something most chipmakers never had to consider before stacking chiplets on interposers and on top of other chiplets.

“There is more connectivity, more wires within the package, 3D stacking, so we need to create a redundant structure for repair during manufacturing or during the device’s lifetime,” said Letizia Giuliano, vice president of IP product marketing at Alphawave Semi. “We need to think about lane repair, redundancy for that die, and we need to think about correction. You create more points of failure, and now you need to find a way to correct when they fail. Advanced packages are going to fail more than standard packages because you have more lanes and because the package interconnect is more complex.”

Manufacturing and assembly processes contribute to the need for redundancy. Those processes can take years to mature when there is sufficient volume. But there are so many processes and packaging options available, and so many one-off types of designs developed by large systems companies, that many of those processes will never fully mature. As a result, design rules need to be more conservative in order to yield sufficiently, which in turn requires more redundancy.

“There always have been mechanical issues with multi-die, especially as the size of the package gets very big,” said Mick Posner, senior group product director at Cadence. “You get flexing and things like that. But once you add a 3D stack to that, your mechanical stresses become really legitimate. On CoWoS and EMIB, the manufacturing of those links is still a maturing process. UCIe itself defines redundancy. So in an advanced package with UCIe, you have 64 transmits and 64 receives, and part of the specification defines four redundant links per group. Fundamentally, it boots up, and when the link comes up it looks to see if that link is alive, or if it’s a dead link. UCIe can remap one of those interconnect lines, and that’s built into the specification. What we’ve seen is that you need to go above that.”
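A simplified picture of what that lane repair looks like in practice is sketched below, using the 64-data-lane, 4-spare-lane numbers from Posner's example. The shift-based reassignment and the data structures are illustrative assumptions, not the UCIe repair procedure itself.

```python
from typing import Optional

NUM_DATA_LANES = 64   # logical data lanes, per the UCIe example above
NUM_SPARE_LANES = 4   # redundant physical lanes available for repair

def remap_lanes(dead_physical: set[int]) -> Optional[dict[int, int]]:
    """Assign each logical lane to a working physical lane, steering around
    any dead lanes found during link bring-up. Returns None if more lanes are
    dead than the spares can cover, i.e. the link cannot be repaired."""
    total = NUM_DATA_LANES + NUM_SPARE_LANES
    good = [p for p in range(total) if p not in dead_physical]
    if len(good) < NUM_DATA_LANES:
        return None
    return {logical: good[logical] for logical in range(NUM_DATA_LANES)}

# Example: physical lanes 7 and 40 fail the bring-up test, so traffic shifts onto spares.
mapping = remap_lanes({7, 40})
print(mapping[6], mapping[7], mapping[40])   # prints: 6 8 42
```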

Physics considerations
Heat-induced electromigration is well understood, and can be effectively simulated early in the design flow. Less clear is the amount of heat generated by specific workloads and the impact of customization. And there are some new elements thrown in here, as well, particularly hybrid bonding.

“If you look at the TSMC roadmap, they started with CoWoS-S, and moved to CoWoS-R and CoWoS-L,” said Steven Tsai, senior vice president at ASE. “Because the die size has become bigger and bigger, the interposer cannot operate with cost efficiency, so they have to move to RDL and also bridges. That is easier for routing for the design, and in terms of cost, probably better than silicon in the process. It’s an industry trend. But hybrid bonding will create more issues because the pitch is getting narrower and narrower.”

Different materials also have different coefficients of thermal expansion. Stress caused by that mismatch can be magnified by high utilization of transistors and signal paths. Likewise, time-dependent dielectric breakdown, although well understood, can accelerate with data-intensive AI activity.
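The scale of that stress can be estimated with the standard first-order expression for biaxial thermal stress in a thin film on a much thicker substrate. All of the material values and the temperature swing in the sketch below are assumptions chosen only to show the order of magnitude.

```python
def thermal_stress_mpa(e_gpa: float, poisson: float, cte_film_ppm: float,
                       cte_sub_ppm: float, delta_t_k: float) -> float:
    """First-order biaxial thermal stress in a thin film on a thick substrate:
    sigma = E / (1 - nu) * (alpha_film - alpha_substrate) * delta_T.
    Ignores plasticity, geometry, and underfill."""
    e_mpa = e_gpa * 1e3
    delta_alpha = (cte_film_ppm - cte_sub_ppm) * 1e-6
    return e_mpa / (1.0 - poisson) * delta_alpha * delta_t_k

# Assumed values: copper (CTE ~17 ppm/K) on silicon (~2.6 ppm/K),
# with a 70 K swing between idle and sustained full load.
print(f"~{thermal_stress_mpa(120, 0.34, 17.0, 2.6, 70):.0f} MPa of thermal stress")
```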

In the past, most of the thermal-related redundancy was confined to mil/aero and automotive applications, where the impact of a failure was safety critical. Until recently, nearly all of the chips developed for those markets relied on older, time-proven process technologies. But as AI takes on a larger role for those applications, those markets are shifting toward the most advanced process nodes and multi-die assemblies of chiplets, and redundancy and resiliency are playing a much bigger role in those designs.

Economic considerations
No matter which end market is targeted, there is a growing demand for chips that can process more data in less time due to the AI-driven data explosion.

All of this adds cost, but comparing a multi-die design with a monolithic planar SoC developed at the reticle limit, or stitched together to exceed it, is not a simple formula. Yield, reliability, and life expectancy each carry a price tag, and the requirements for each can vary by application and workload.

“If you think of a compute tray in a data center, there’s an argument that it should all be on one monolithic motherboard,” said John Koeter, senior vice president and head of Synopsys’ IP Group. “Daughter cards create signal integrity issues and other issues, and potentially a performance limitation. But if a daughter card goes bad, you can remove it and plug in a new one, rather than throwing out the whole tray. So there are all these interesting dynamics going on here. Lane redundancy and repair for die-to-die, given the number of parallel signals, is absolutely mandatory.”

Others agree. “In the past, people had 100, 300, 500 SerDes on a chip,” said Ramin Farjadrad, CEO and co-founder of Eliyan. “It’s still a big number. But if you start going to thousands of them, thousands of ports, because die-to-die gives you this fine-pitch bump that is standard in advanced packages and more, then if one of them fails, you don’t want to throw away the whole chip. And so you start to think, this is just another interconnect. In the past, if something failed, it wasn’t that significant. But if you have 10,000 of these interfaces, the chances of them failing is 10,000X. That’s where you want to add redundancy, and it’s a very efficient way to reduce that probability.”
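Farjadrad's point is straightforward binomial math: the more lanes there are, the more likely it is that at least one fails, and a handful of spares collapses that probability. The per-lane failure probability in the sketch below is an assumed, illustrative number.

```python
import math

def p_unrepairable(n_lanes: int, p_fail: float, n_spares: int = 0) -> float:
    """Probability that more lanes fail than the spares can replace, assuming
    independent lane failures with per-lane probability p_fail (binomial)."""
    p_ok = sum(math.comb(n_lanes, k) * p_fail**k * (1 - p_fail)**(n_lanes - k)
               for k in range(n_spares + 1))
    return 1.0 - p_ok

p = 1e-4  # assumed per-lane failure probability over the product lifetime
print(f"10,000 lanes, no spares: {p_unrepairable(10_000, p):.3f}")      # ~0.63
print(f"10,000 lanes, 8 spares:  {p_unrepairable(10_000, p, 8):.1e}")   # ~1e-6
```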

It’s also one of the reasons hardware-accelerated verification is becoming so ubiquitous, whether on-premise or in the cloud.

“Emulation is becoming extremely important from an IP vendor perspective to simulate those long packets, because you just can’t find them in dynamic simulation,” said Koeter. “Another thing that emulation is really useful for is to inject an enormous amount of noise into the link and make sure the link either recovers gracefully or goes into a detect mode. So emulation as an IP vendor has multiple benefits. There’s the standard verification acceleration. But there’s also a real-world emulation in terms of noise and signal integrity. And all of the physical IP that we ship comes with firmware. We’re constantly tweaking that firmware to make sure we’re getting the best performance out of the PHY. We’re working with some of the lead hyperscalers, and they’re being very clear about, ‘Let’s say the channel length described in PCI Express is 32dB [channel loss budget]. Your IP better be able to handle 35dB.’ And if the spec says 10⁻⁶ bit error rates, we have to be able to handle two orders of magnitude above that. This is so the IP is not just compliant with the spec, which is a bare minimum and, frankly, insufficient. It needs to have orders of magnitude of margin to work in a very robust data center environment.”
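Why a few extra dB of channel loss matters so much can be seen from a back-of-the-envelope NRZ link model, where bit error rate falls off steeply with the ratio of eye height to noise. The swing and noise figures below are hypothetical, and the model ignores equalization, jitter, and crosstalk; it is only meant to show how quickly margin evaporates.

```python
import math

def nrz_ber(tx_swing_mv: float, channel_loss_db: float, noise_rms_mv: float) -> float:
    """Rough NRZ bit-error-rate estimate for additive Gaussian noise:
    BER = Q(eye_half / sigma), with Q(x) = 0.5 * erfc(x / sqrt(2)).
    No equalization, jitter, or crosstalk is modeled."""
    rx_swing_mv = tx_swing_mv * 10 ** (-channel_loss_db / 20.0)   # dB of voltage loss
    snr = (rx_swing_mv / 2.0) / noise_rms_mv                      # half eye over RMS noise
    return 0.5 * math.erfc(snr / math.sqrt(2.0))

# Hypothetical link: 800 mV transmit swing, 2 mV RMS noise at the receiver slicer.
for loss_db in (32.0, 35.0):
    print(f"{loss_db:.0f} dB channel: BER ~ {nrz_ber(800.0, loss_db, 2.0):.1e}")
```

Under those assumptions, the extra 3dB of loss pushes the error rate up by roughly three orders of magnitude, which is exactly the kind of cliff the added design margin is meant to keep at a distance.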

Photonics
Moving all of that data also requires orders-of-magnitude improvements in performance, which has provided a huge incentive for photonics. All of the major foundries and OSATs are now talking about photonics as a very real future direction, rather than a vague possibility.


Fig. 1: TSMC future platform for HPC/AI. Note the photonics on the right side of the image. Source: TSMC


Fig. 2: Samsung Foundry’s roadmap for photonics. Source: Samsung


Fig. 3: Intel Foundry’s photonics roadmap. Source: Intel

“People are going to be able to transfer more data faster and with bigger bandwidth,” said Michal Siwinski, chief marketing officer at Arteris. “That’s happening today between server racks. What that means for chip design is that the bandwidth of the lanes gets much wider, because all of a sudden you’re not just limited based on what the physical layers can do. The demand is to go ultra-wide and ultra-fast. So basically you’re moving from a two-lane highway to a five-lane highway. As you go wider with bigger fiber-optic, more optics, better connections, better A-to-D and D-to-A converters, the actual amount of computing being done can increase because you’re no longer limited.”

The upshot is that it doesn’t matter whether the signals are all electrical, or whether they start out electrical, then get converted to optical and back to electrical. Processing and storage are still electrical. So while photonics is faster, and uses less power to move optical signals, sooner or later those signals must be converted back to electrical. And more data leads to more thermal issues, and the need for more redundancy.

“If you’re going from a billion parameters to a trillion parameters, the data bits of information you’re computing are increasing,” Siwinski said. “The buses on the chip are getting larger because you need more address space to keep in line with the exponential growth in the number of parameters on the software side, but you’re linearly expanding the widths. So now you have this gigantic highway of all the data that is moving at the same time, and it has to be at the same frequency, same latency, same everything. And you’re doing more and more of that because you’ve got all these LLM models and you have to compute them within the die and across multiple dies.”
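A rough calculation shows why the widths have to grow so aggressively. The per-lane data rate, time budget, and weight precision below are all assumptions picked only to illustrate the gap between exponential parameter growth and linear bus widening.

```python
BYTES_PER_PARAM = 2      # assumed bf16/fp16 weights
LANE_GBPS = 32           # assumed per-lane die-to-die data rate, in Gb/s

def lanes_needed(params: float, budget_ms: float) -> float:
    """Lanes required to stream every weight across the interconnect once
    within the given time budget."""
    total_bits = params * BYTES_PER_PARAM * 8
    bits_per_lane = LANE_GBPS * 1e9 * (budget_ms / 1e3)
    return total_bits / bits_per_lane

# A billion parameters vs. a trillion, with a 10 ms budget to move the weights once.
for params in (1e9, 1e12):
    print(f"{params:.0e} parameters -> ~{lanes_needed(params, 10):,.0f} lanes")
```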

Photonics adds other issues, as well. “The photon is going to be a better mode to get data off the chip,” said Todd Bermensolo, product marketing manager at Alphawave Semi. “The question now is how close you can put silicon photonics next to electronic components when you’re stacking them on an interposer. As you slam two wafers together, you have to have some kind of interconnect that allows you to stitch them together, like a build-up layer. But that ends up being a very challenging interconnect. The SerDes has to now deal with those very different kinds of elements and make sure you have the reach to get it to where it needs to go.”

Conclusion
Redundancy in data paths isn’t a new concept. What’s different now, though, is that redundancy is moving off-chip due to immature processes and higher utilization of all components. Circuits are aging faster under heavy use, and so are their connections.

This adds yet another level of complexity to advanced designs, because full or partial closures of data paths need to be flagged, and the system needs to be smart enough to reroute data to other circuits. And all of that needs to be designed in up front and monitored throughout a chip’s lifetime, which, if this is done wrong, may be significantly shorter than design teams expect.


