Problems and solutions for improving performance with more data.
Demand for new and better AI models is driving an insatiable appetite for more processing power and higher data throughput, but it's also creating a slew of new challenges for which there are not always good solutions.
The key here is figuring out where bottlenecks might crop up in complex chips and advanced packages. This involves a clear understanding of how much bandwidth is required to move data between specific elements in a design and under specific workloads, as well as a collective assessment of the throughput of that data across multiple elements, such as processors, memories, I/Os, and even racks of servers in a data center.
“When our customers talk about throughput and bandwidth, sometimes they talk about the whole cluster, and the whole network is acting like a compute cluster to train large AI models,” observed Priyank Shukla, director of product management, interface IP at Synopsys. “What these engineering teams want to understand is the throughput of a cluster and the bandwidth of the different interconnects within that cluster. They want to find the bottlenecks. There are multiple rack units in a data center, and if you open one of the racks, you will find those rack units. In each one you will find a processor, which connects to other racks through a network interface card (NIC). You have some connectivity between each SoC.”
Understanding how data will flow is essential to optimizing it.
“When compute-cluster companies train AI models, they get a lot of diagnostic information from the network and they know precisely where the delay is,” Shukla said. “For example, you may have a big workload that gets thrown into this cluster, and that workload gets processed in a GPU or accelerator. Somehow a piece of that workload has to reach a GPU, and it does that through a bunch of storage, which will be connected somewhere else in this network. From that storage, data is pulled through the direct memory access (DMA) engine of a CPU. The CPU runs these algorithms or any process linearly, like first line, second line, third line, so it gets the chunk of data and then passes it over a scale-up network to one of the accelerators, and the accelerators work together to process this data.”
There are thousands of interconnects in a complex design. “When the data needs to be pulled from storage, that will be over an Ethernet network that will go to a CPU and through a DMA, or through a GPU in a scale-up network,” Shukla noted. “And all of these have different interconnect bandwidth. If they are running over, let’s say, PCIe 7.x, they are running at 128 Gbps. If they are running over Ethernet, they are running at 112 or 224 Gbps per lane. That will be the bandwidth of each of those interconnects, but the total throughput will be how the whole cluster performs.”
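As a rough illustration of how those per-lane rates add up, the back-of-the-envelope sketch below multiplies the figures Shukla cites by hypothetical lane counts. The lane counts, and the decision to ignore encoding and protocol overhead, are assumptions made purely for illustration.

```python
# Back-of-the-envelope aggregate bandwidth for two interconnect types.
# Per-lane rates are the ones cited above; lane counts are illustrative
# assumptions, and protocol/encoding overhead is ignored.

def aggregate_gbps(per_lane_gbps: float, lanes: int) -> float:
    """Raw aggregate bandwidth of a multi-lane link, in Gbps."""
    return per_lane_gbps * lanes

# Hypothetical PCIe 7.x x16 link at 128 Gbps per lane (one direction).
pcie_x16 = aggregate_gbps(128, 16)    # 2,048 Gbps = 256 GB/s

# Hypothetical 8-lane Ethernet port at 224 Gbps per lane.
eth_port = aggregate_gbps(224, 8)     # 1,792 Gbps = 224 GB/s

for name, gbps in [("PCIe 7.x x16", pcie_x16), ("8 x 224G Ethernet", eth_port)]:
    print(f"{name}: {gbps:.0f} Gbps ({gbps / 8:.0f} GB/s)")
```

Per-link numbers like these are only the starting point; as Shukla notes, what matters is how the whole cluster performs once every hop is chained together.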
Fig. 1: Different architectures to solve different problems in a data center. Source: Synopsys
SerDes
One of the workhorses in this scenario is serializer/deserializer (SerDes) technology, which converts wide parallel data into a high-speed serial stream and back again so it can move through a limited number of physical channels, or pins.
“Data goes faster, but you don’t have to do it through a large path,” said Todd Bermensolo, product marketing manager at Alphawave Semi. “You can have a minimal number of pins that give you as much throughput as possible, so it’s more economical. And you can have a lot more of those interfaces if you need higher data throughput.”
However, this also adds complexity on the transmit and receive ends. “The goal is to reduce this data into the smallest number of physical channels you can,” Bermensolo said. “At the point that it’s going to speed up, you’re going to do a lot of tricks to compress it into there, and then when it comes back, you have to reverse it. You have to take this very fast, but very physically efficient data transfer, and then spread it back out to your slow and wide again. This process started as we started integrating compute, and it became very important physically because we can’t cable everything together the way we want. We started out back in the 1 Gbps regime, when we had something like a transmitter driving a simple receiver, and we could go pretty fast. It gave us a good parallel-to-serial conversion. But now we’re shipping 100 Gbps, so we’ve had to extend this very simplistic view over the past 15 years to where we are now. We’ve got 200 Gbps coming next, after which is 400 Gbps. And with the AI applications, this can’t come fast enough.”
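To make the parallel-to-serial idea concrete, here is a toy sketch of what a serializer and deserializer do at the logical level: a wide parallel word is shifted out as a single bit stream and reassembled on the far side. The 16-bit word width is an arbitrary assumption, and everything a real PHY actually earns its keep on (encoding, clock recovery, equalization) is deliberately left out.

```python
# Toy model of the SerDes idea: a wide, slow parallel word is shifted
# out as a fast serial bit stream over one lane, then regrouped on the
# far end. Word width is an illustrative assumption; no encoding,
# clocking, or equalization is modeled.

WORD_WIDTH = 16  # parallel bus width (assumption for illustration)

def serialize(words):
    """Flatten parallel words (MSB first) into a single serial bit stream."""
    bits = []
    for w in words:
        bits.extend((w >> i) & 1 for i in reversed(range(WORD_WIDTH)))
    return bits

def deserialize(bits):
    """Regroup the serial bit stream back into parallel words."""
    words = []
    for i in range(0, len(bits), WORD_WIDTH):
        chunk = bits[i:i + WORD_WIDTH]
        words.append(sum(b << (WORD_WIDTH - 1 - j) for j, b in enumerate(chunk)))
    return words

data = [0xDEAD, 0xBEEF, 0x1234]
assert deserialize(serialize(data)) == data
print("round trip OK:", [hex(w) for w in deserialize(serialize(data))])
```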
However, that kind of speed adds new problems.
“On the serializer side (the transmitter) and on the deserializer side (the receiver), we add more complexity on the transmitter, and maybe that burns a little more power,” he said. “Then what we do is look for more advanced silicon process nodes to make it smaller in order to get this extra performance, but not spend any more power while using similar wires or similar cables. On the other side, we add more processing. Maybe we won’t do a simple differential compare. Maybe we have to add some gain stage. And maybe we have to add some advanced equalization, like a feed-forward equalizer or a decision feedback equalizer, or some newer tricks like maximum likelihood sequence detection. So you hold this physical channel, trying to make sure that people can do their one-to-five-meter cabling, or in optical, the kilometer range. You’re still trying to hold that form factor. But if you’re trying to drive the data faster, doubling every generation, we have to put more compute and more smarts in to help recover that signal while going at these crazy speeds.”
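To give a sense of what that equalization step involves, below is a minimal feed-forward equalizer sketch: a short FIR filter that roughly inverts a toy inter-symbol-interference channel. The channel coefficients and tap values are invented for illustration (the taps approximately invert this particular toy channel); real PHYs adapt their taps and, as Bermensolo notes, combine FFE with DFE or MLSD.

```python
# Minimal feed-forward equalizer (FFE) sketch. The channel and tap
# values are illustrative assumptions, not figures from any real PHY.

import random

def channel(symbols, isi=(1.0, 0.7, 0.4)):
    """Toy lossy channel: each received sample mixes in earlier symbols (ISI)."""
    return [sum(c * symbols[n - k] for k, c in enumerate(isi) if n - k >= 0)
            for n in range(len(symbols))]

def ffe(samples, taps=(1.0, -0.7, 0.09, 0.217, -0.188)):
    """Feed-forward equalizer: a short FIR filter that roughly inverts the channel."""
    return [sum(t * samples[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(samples))]

def slicer(sample):
    """Decide which NRZ symbol (+1/-1) a received sample represents."""
    return 1 if sample > 0 else -1

random.seed(1)
tx = [random.choice((-1, 1)) for _ in range(2000)]   # random NRZ symbols
rx = channel(tx)
raw_errs = sum(slicer(s) != b for s, b in zip(rx, tx))
eq_errs = sum(slicer(s) != b for s, b in zip(ffe(rx), tx))
print(f"symbol errors without FFE: {raw_errs}, with FFE: {eq_errs}")
```

With this toy channel the unequalized receiver misjudges roughly a quarter of the symbols, while the equalized path recovers them all, which is the whole point of spending transmitter and receiver complexity to hold the same cable and form factor.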
Caches
A commonly cited bottleneck is off-chip memory, particularly in data-intensive applications such as AI training. The inability to scale SRAM quickly enough has forced chipmakers to rely on high-bandwidth memory — stacks of DRAM with more data lanes in the interconnects — for L3 cache. And while this is an improvement over other forms of DRAM, it’s still not as fast as SRAM, which in turn creates the so-called memory wall.
While not perfect, there are options to improve performance short of full 3D-ICs (which are just now starting to be designed). “This is typically mitigated using on-chip caches, which bring frequently used data into very fast on-chip memory that can be accessed between 10X and 100X faster than going to off-chip DRAM,” said Rick Bye, director of product management and marketing at Arteris. “Modern SoCs may have a hierarchical cache architecture, where small and very fast single-cycle-access caches are embedded in the CPU core designs, often with separate caches for program and data storage (L1 caches). There also may be a larger and slightly slower second-level (L2) cache that combines program and data, and in multi-core systems, perhaps a third-level (L3) cache serving several CPU cores. Beyond that, there may be a last-level cache (LLC) or system-level cache (SLC) that is shared by the entire SoC, including CPU cores, GPUs, NPUs, display processors, and image processors connected to cameras.”
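To put rough numbers on that 10X-to-100X gap, the sketch below computes an average memory access time (AMAT) for a hypothetical three-level hierarchy. The latencies and hit rates are invented for illustration only, not measurements of any particular SoC.

```python
# Average memory access time (AMAT) for a hypothetical cache hierarchy.
# Latencies (ns) and hit rates are illustrative assumptions.

LEVELS = [
    # (name, access latency in ns, hit rate)
    ("L1",   1.0, 0.90),
    ("L2",   4.0, 0.70),
    ("LLC", 15.0, 0.60),
]
DRAM_NS = 100.0

def amat(levels, dram_ns):
    """Expected latency per access, walking down the hierarchy on misses."""
    total, reach_prob = 0.0, 1.0
    for _name, latency, hit_rate in levels:
        total += reach_prob * latency     # every access reaching this level pays its latency
        reach_prob *= (1.0 - hit_rate)    # fraction that misses and falls through
    return total + reach_prob * dram_ns   # the remainder pays the full DRAM penalty

print(f"AMAT: {amat(LEVELS, DRAM_NS):.2f} ns vs. {DRAM_NS:.0f} ns for DRAM-only")
```

Even with these invented numbers, the average access lands in the low single-digit nanoseconds rather than the DRAM round trip, which is why the hierarchy exists in the first place.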
That works most of the time. But sometimes data that needs to be read is not available in an on-chip cache (a cache ‘miss’), which forces a slow off-chip DRAM access. “Similarly, data writes, such as camera data, may fill up the caches more quickly than they can complete background writes of data out to slower off-chip DRAM,” Bye said. “This bottleneck can be mitigated by increasing the number of DRAM channels. For example, instead of having only one off-chip DRAM there can be four, which quadruples the DRAM bandwidth but not necessarily the throughput. But this is only effective if the data can be evenly distributed across all DRAM channels with memory interleaving, which complicates system design. Standalone cache IP can be used for any cache level in the hierarchy, but especially the LLC/SLC. Or, a combined cache and network-on-chip (NoC) IP can manage the coherency of cached data shared by two or more processors, ensuring that processors don’t read or write ‘stale’ data in the cache, and subsequently off-chip, that has since been updated by other processors.”
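The interleaving caveat is easy to see with a toy address-to-channel mapping. In the sketch below, four hypothetical DRAM channels are interleaved on a 256-byte granule; a sequential stream spreads evenly across them, while a stride that happens to match the interleave collapses onto one channel and forfeits the bandwidth gain. Channel count, granule size, and access patterns are all illustrative assumptions.

```python
# Toy address-interleaving sketch: spread addresses across four DRAM
# channels at a fixed granule so sequential traffic uses all channels.
# The channel count and 256-byte granule are illustrative assumptions.

NUM_CHANNELS = 4
GRANULE_BYTES = 256

def channel_for(addr: int) -> int:
    """Pick the DRAM channel that owns this physical address."""
    return (addr // GRANULE_BYTES) % NUM_CHANNELS

# A sequential 4 KB stream of 64-byte cache lines touches every channel equally...
counts = [0] * NUM_CHANNELS
for addr in range(0, 4096, 64):
    counts[channel_for(addr)] += 1
print("sequential stream, accesses per channel:", counts)

# ...but a strided pattern that matches the interleave lands on a single
# channel, and the 4x bandwidth gain disappears.
counts = [0] * NUM_CHANNELS
for addr in range(0, 64 * 1024, NUM_CHANNELS * GRANULE_BYTES):
    counts[channel_for(addr)] += 1
print("pathological stride, accesses per channel:", counts)
```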
Multi-die integration
The disaggregation of planar SoCs into chiplets, coupled with a huge increase in the amount of data that needs to be processed with AI, has dramatically increased the focus on data movement.
“You move from a paradigm where wires on chip are effectively free, and then you break it apart into multiple chips,” said Kevin Donnelly, vice president of strategic marketing at Eliyan. “Wires on chip are tiny, and wires that go across the package are necessarily bigger, but you can’t get as many, so they restrict the amount of data you can move between chiplets. If there are two chiplets with a limited number of wires, you’re trying to get as much bandwidth as possible between those two chiplets. With a standard UCIe or Bunch of Wires interface, where each wire carries data in one direction, you have a transmitter sending data to one chiplet, and another transmitter sending data back to the other. That’s very common, and it gives you a certain amount of bandwidth per wire. But we’ve got a lot more data to deal with now.”
For chips used to train AI models, or those developed for high-performance computing, utilization of those wires is much higher than it was in the past. “This tremendous explosion of data needs to be communicated between chips, and your options are either to add more wires or to try to get more bandwidth per wire,” Donnelly said. “To get more bandwidth per wire, you have to think about the signal integrity of the connection and of all the signaling, whether it’s SerDes or die-to-die connections. You need to look at the Nyquist rate, meaning how fast you can run a given interconnect depending on its distance, capacitance, and resistance. From that you can figure out how much bandwidth you can communicate over that medium.”
One solution is to transmit in both directions at the same time on each wire, which provides twice as much bandwidth over the same number of interconnects. “It’s like a two-lane highway on each road, as opposed to individual split roads,” Donnelly said. “Ultimately, in all PHYs, you’re going to take parallel data, serialize it, and transmit it. Whether it’s uni-directional or, in our case, simultaneous and bi-directional, it looks identical to the user. It’s just more parallel wires for a given area than you would see before, so you get better bandwidth efficiency for your silicon area, but no other difference. As soon as you disaggregate chips and have to connect them, you have to look at the high-speed connections between them, along with their signal integrity and power integrity concerns, and it really is more analog than digital on those interconnects.”
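The arithmetic behind that claim is simple, and the sketch below compares a hypothetical unidirectional die-to-die interface with a simultaneous-bidirectional one over the same wire budget. The wire count and per-wire rate are illustrative assumptions, not figures for any specific PHY.

```python
# Same wire budget, two signaling schemes. Wire count and per-wire
# rate are illustrative assumptions only.

WIRES = 64            # data wires available on this slice of beachfront
GBPS_PER_WIRE = 32    # per-wire signaling rate

# Unidirectional: the wires are split between the two directions.
uni_each_way = (WIRES // 2) * GBPS_PER_WIRE
uni_total = 2 * uni_each_way

# Simultaneous bidirectional: every wire carries data both ways at once,
# doubling aggregate bandwidth for the same beachfront.
bidi_each_way = WIRES * GBPS_PER_WIRE
bidi_total = 2 * bidi_each_way

print(f"unidirectional: {uni_total} Gbps total ({uni_each_way} Gbps each way)")
print(f"simultaneous bidirectional: {bidi_total} Gbps total ({bidi_each_way} Gbps each way)")
```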
Adding chiplets in a package helps overcome the limitations on processing clock speeds, but it does add other challenges. “When you move away from the monolithic die so that you can do chiplet-oriented design, you get the best silicon process for your SerDes, for your compute, and maybe for your memory,” said Alphawave’s Bermensolo. “You don’t have to use one process for everything. You can specialize and you can use chiplet-based integration to allow you to bring it together. But then you’re adding in some new interfaces. And when you do the die-to-die, you’re not doing the SerDes. Rather, you’ll do a standard like UCIe, which adds a little bit more complexity. So even while it solves some problems, it introduces others.”
Intertwined challenges
It’s hard enough to solve each of these issues individually, which was the classic divide-and-conquer approach used for SoC design. But with multi-die integration, issues need to be addressed concurrently and early in the design flow. That includes larger simulations, and more of them, in order to map the flow of data, as well as trading off some performance for flexibility to future-proof designs.
“We want to move toward heterogeneous integration, where you can buy chiplets and integrate them just like you buy IP,” said Vidya Neerkundar, Tessent product manager at Siemens EDA. “As an industry, we need to work together on how we can accomplish that. In an IP world, you have an interface, and through that interface you can check whether the IP is active or alive. After that, the IP owner gives you patterns that you can run. But for a chiplet, we don’t have that yet. The design kits are coming into play. We need to iron out what the minimum things are that we need.”
It’s one thing to establish a data path. It’s another to ensure it will work as expected. “You may be connecting by through-silicon vias, through interposer layers, EMIB, you name it,” Neerkundar said. “There are so many options for connecting to them, and you’re not using the same path that you used for testing at the wafer level. At the wafer, you were using sacrificial probe pads, but now you’re using through-silicon vias. It’s a different path by which you’re accessing the chiplets. Now there is scan fabric, which is a highway that’s used to send test data between the chiplets and collect test data back from the chiplets. You can think of it as a bus, but you can do only very minimal things. You can scan out to smaller widths, and you can use that to get access to the data from across the different chiplets. But industry-wide, we need a general solution.”
Throughput in a 2.5D or 3D-IC is extremely complex. In addition to the number of elements that need to be considered, it can vary by workload, and it can be affected by various types of noise and physical effects such as heat.
“When you assemble the stack, you need to make sure that you’re validating everything, that the signal is coming in and going out, and that you’re modeling it correctly,” Neerkundar said. “It’s a lot of pieces, especially when there might be 150 different vendor aspects that you need to get going in order to get the whole stack in play, from the fab through the TSVs. The assembly could be done by somebody else. It’s the same for the software and the micro-inspection. Maybe you’re buying the bumps from a different vendor, and material from another vendor. There could be a lot of things in play. I do believe agentic AI can at least check that the connectivity is good, allowing you to do the next step.”
It’s not just confined to electrical signals, either. “Instead of an electrical die connected over UCIe, it could just as well be a photonic die, a photonic integrated circuit,” said Synopsys’ Shukla. “That way, from the same beachfront, you can go very large distances. We needed 200 Gbps because we wanted to go 1m or 2m, and that’s why the beachfront was higher. But now photonics provides a very beachfront-efficient manner of escape. From a UCIe beachfront, you can go a longer distance using photonics, and that’s a new technology that will improve beachfront density, and total bandwidth and throughput correspondingly.”
Additionally, when trying to maximize beachfront density, there are a number of silicon-related complexity issues that need to be accounted for. “For example, if you’re placing them so close to each other, the effect of crosstalk on multiple lanes needs to be modeled and analyzed,” Shukla said. “So if an architect is just focused on cramming in as many of these as possible, and the verification flow or the signoff flow doesn’t involve these things, it becomes challenging in due course, when the design goes for production.”
Dependencies and interactions all need to be considered up front. “It is pointless to have the latest, fastest CPU cores with large, fast caches if the interconnect does not have the throughput capacity to supply the data that the CPU needs,” Arteris’ Bye said. “Traditional crossbar interconnect architectures do not scale well as the number of CPUs and other IP increases, and cascading smaller crossbars quickly introduces bottlenecks. The solution is to use a packetized NoC that is provisioned with sufficient throughput to ensure that no IP is ever starved of the data it needs or gets stuck waiting to store data it has produced.”
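As a crude sketch of what "provisioned with sufficient throughput" means, the snippet below totals hypothetical peak bandwidth demands from several IP blocks and checks them against per-link interconnect capacity. It is a toy sanity check under invented numbers, not a representation of how Arteris or any other NoC tooling actually provisions an interconnect.

```python
# Trivial provisioning sanity check for a shared interconnect: compare
# aggregate peak demand from the IP blocks against per-link capacity.
# All names, demands, widths, and clocks are illustrative assumptions.

import math

LINK_WIDTH_BYTES = 32    # NoC data width per beat
NOC_CLOCK_GHZ = 2.0
link_capacity_gbs = LINK_WIDTH_BYTES * NOC_CLOCK_GHZ   # GB/s per link

peak_demand_gbs = {      # hypothetical peak read + write demand per IP
    "CPU cluster": 30,
    "NPU":         80,
    "GPU":         50,
    "ISP/camera":  12,
    "Display":      8,
}

total = sum(peak_demand_gbs.values())
links_needed = math.ceil(total / link_capacity_gbs)
print(f"aggregate demand: {total} GB/s, per-link capacity: {link_capacity_gbs:.0f} GB/s")
print(f"parallel links (or memory channels) needed to avoid starving any IP: {links_needed}")
```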
Conclusion
Optimizing data movement has always been a challenge, but it’s becoming much more so.
“Whether it’s a smartphone being able to access something in the data center in real time, or AI applications that are spread from one data center to another, the system is expanding and there are pieces that need to be improved, but it’s all being graded at a higher aggregate level,” said Alphawave’s Bermensolo. “How do we focus on bandwidth, reach, power, and latency to make that very big, macro-scale problem better? It’s no longer an individual experience. We’re now relying on groups that haven’t had to interact before. They need to talk, and they need to collaborate, because we no longer have a good feeling for what success is. When you have these big data center developers, they know their electricity bill. They know exactly what success is, and they can tell us what that translates to, down to each one of these little SerDes interconnects or each of these silicon chiplets. They know what they need to see, because when they scale it up by a million or a billion, it becomes very clear to them what’s critical for the next generation of development.”
For everyone else, putting these pieces together is going to require a lot more work, more standards, and more interactions with groups that traditionally have been siloed from each other.
—Ed Sperling contributed to this report.
Related Reading
Future-proofing AI Models
The rate of change in AI algorithms complicates the decision-making process about what to put in software, and how flexible the hardware needs to be.