The chip industry is exploring multiple avenues for simplifying multi-die integration, but difficulties remain for optimizing designs.
Managing chiplet resources is emerging as a significant and multi-faceted challenge as chiplets expand beyond the proprietary designs of large chipmakers and interact with other elements in a package or system.
Poor resource management in chiplets adds an entirely new dimension to the usual power, performance, and area tradeoffs. It can lead to performance bottlenecks, because communication across chiplet boundaries inherently incurs more latency than communication within a single die. It also can drive up development costs, because each chiplet added to a system adds complexity on multiple levels. And it can impact power consumption, which becomes harder to manage as the number of chiplets in a design increases and those chiplets must continually communicate with each other.
The largest system and processor vendors have used this approach effectively, improving performance by adding more compute density and lowering costs by improving yield. But optimizing these systems using third-party chiplets is a much harder problem, and one that will require time to solve.
“Chiplets are brand new, and every company besides the big guys like NVIDIA is doing it for the first time,” noted John Lupienski, vice president of product engineering at Blue Cheetah. “You could argue a lot of people are trying to catch up. One big mistake people are making right now is they’re not working from the concept of inside-out, so from inside of their chiplet out. A lot of people are getting really caught up on interoperability and being universal, and they’re not focusing on the power, performance, and bandwidth they need. I’ve seen companies that focused on the I/O interconnect, basically spun their wheels for a year, and when they finally realized they needed to work on the system side, they had to restart everything from scratch. To do that correctly requires working from the system bus and NoC, so from inside that core out. What you’re doing is optimizing your NoC and system bus, as well as the protocol. It could be CHI or AXI, or whatever you’re using for that specific application and bandwidth, which can be configured in many different packet sizes. There are all kinds of different variables you can customize for CHI and AXI, so there’s a lot of flexibility there. It’s no different in the chiplet space. But you need to tune that for the specific application, to the marketplace you’re targeting, and to hit the power, performance, and area. As soon as you nail that down, then you can work from the inside out.”
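That inside-out flow can be sketched in code. The minimal Python illustration below derives the die-to-die link width from the system-bus configuration, rather than picking the interconnect first. All of the names, parameters, and numbers are hypothetical, not from any real vendor tool or spec.

```python
# Hypothetical sketch of "inside-out" tuning: derive die-to-die link
# parameters from the NoC/system-bus protocol settings, not the other
# way around. All names and values are illustrative.
from dataclasses import dataclass

@dataclass
class BusConfig:
    protocol: str          # e.g. "AXI" or "CHI"
    data_width_bits: int   # payload width the NoC is tuned for
    max_outstanding: int   # outstanding transactions the fabric supports
    clock_mhz: int

@dataclass
class DieToDieLink:
    lanes: int
    lane_rate_gbps: float

def size_link_for_bus(bus: BusConfig, lane_rate_gbps: float) -> DieToDieLink:
    """Pick the minimum lane count that carries the bus's peak bandwidth."""
    peak_gbps = bus.data_width_bits * bus.clock_mhz / 1000.0  # Gb/s
    lanes = -(-peak_gbps // lane_rate_gbps)  # ceiling division
    return DieToDieLink(lanes=int(lanes), lane_rate_gbps=lane_rate_gbps)

# Example: a 256-bit CHI fabric at 2 GHz needs ~512 Gb/s across the boundary.
link = size_link_for_bus(BusConfig("CHI", 256, 64, 2000), lane_rate_gbps=32.0)
print(link)  # DieToDieLink(lanes=16, lane_rate_gbps=32.0)
```

The point of the sketch is direction, not the arithmetic: the bus and protocol requirements are fixed first, and the I/O interconnect is sized to serve them.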
The system buses must be the same for all chiplets. “In a perfect world the I/O interconnects are universal and connect up perfectly, which one day we hope to be the case. But it is not the case today,” Lupienski said.
This adds complexity and new challenges for IP suppliers in the chiplet subsystem space, who must constantly adapt to new permutations of the customer base.
Target markets
There are three main markets for chiplets — captive, local ecosystem, and open marketplace.
“The captive market is just one vendor talking to itself,” said Manmeet Walia, executive director, product management at Synopsys. “At the other extreme is the open marketplace, which is the grandiose vision that many vendors believe will come with chiplets, and they’ll be connecting and talking to each other. In the middle is this local ecosystem, where a group of five to seven companies come together such that, ‘You work on this one, I’ll work on this one, and we’ll work on this one.’ They come up with very tight specs to interoperate with each other. Great examples within the local ecosystem are some of the automotive vendors in Japan and Europe. The other such local ecosystem that we have seen is with the RISC-V processors, where a bunch of RISC-V companies that want to compete with Arm came together.”
Today, roughly 95% to 99% of the chiplet market is captive, which may be one vendor or multiple vendors designing to a spec. “What they demand is the best resources,” Walia said. “They care about key performance metrics (KPMs). These are typically the big market makers — the top four hyperscalers in the U.S., the top few hyperscalers in China, and so on. They want the best KPMs. They love UCIe, but they don’t want to use the term UCIe because they want to go above and beyond anything that’s out there in the standard. They want to excel over everyone else. That means they want the most optimized die-to-die, and they want the best metrics in terms of power, bandwidth per beachfront, latency, and performance, and they want to go into advanced packages. For them, we have three classes of UCIe. We have the compliant version. Then we have the custom version, where we go above and beyond the spec. And because they do not care about the UCIe interoperability, a lot of them want lower drive strength or a different way of lowering latency, and so on. Hyperscalers want the best KPMs, the best in class. They want to be one generation ahead (OGA), so getting it first-time-right is absolutely important because the spec cycle is so short. If we do not get it first-time-right, then we lose the market window.”
Partitioning for chiplets
The industry is just beginning to evolve to the next stage of chiplets, from the captive model to the local ecosystem. At the same time, chiplet developers are looking for the best ways to architect their chiplets.
“We’re now at the stage of evolving to the local ecosystem, but it’s still limited,” said Simon Rance, general manager and business unit leader, process and data management at Keysight Technologies. “An example of this is the Cadence-Arm partnership. When you start bringing in more, the complications become more challenging. The beauty of Cadence choosing to do this with Arm is that most IP that’s used in today’s SoCs and chiplets is Arm IP. There are common buses. The protocol is known, the handshaking is known. So is the timing — even test cases and verification. Tools are geared toward how to leverage those IPs and integrate them. Then you’re only having to deal with the other pieces of IP, whether that’s from Cadence or others, that you’re integrating with, whether it’s homegrown or otherwise. When you get out to other IP, especially RISC-V IP, that could be a little bit more all over the place, which would be a really big challenge. I haven’t seen the right playbook for how to do this, how to manage the resources, and how to ensure the interoperability. At the moment, it’s more about what has known protocols, and the known timing and sharing and interoperability.”
One way to approach the complexity of chiplet design is through partitioning. “As a first step, it’s simplified,” Rance said. “This means partitioning the chiplet by technology — taking any parts that are analog, which can be at a higher process node. It’s a simple level of partitioning. Based on that, as you partition those pieces out, what are the ideal buses and protocols for those IPs to work together? While the analog and mixed signal are at higher process nodes, you have to deal with thermal issues and energy issues with radio frequency and electromagnetic-type designs. At the lower technology nodes, like 4nm, you have the devices such as CPUs, GPUs, and AI accelerators, all of which are energy-consuming. They all need to be close to each other. But when you have all of these things together, now you’ve got a big energy problem, because it is sucking the life out of that part of the chiplet. With SoCs, we just had processors and GPUs, and we needed some sort of cache coherency between those devices to ensure synchronization of video and audio. Bringing in AI/ML chips to go alongside these other big power horses adds a challenge we haven’t really dug into enough yet to know how to solve it. We can partition it, but then all the stuff that’s power-hungry and on a low process node is still going to be there in one part of the chiplet.”
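As a rough illustration of that first partitioning pass, the sketch below assigns IP blocks to chiplets purely by technology type. The block names and node assignments are invented for the example, and the closing comment echoes Rance's caveat.

```python
# Illustrative first-pass partitioning: split blocks by technology,
# putting analog/RF on a mature node and power-hungry digital compute
# on an advanced node. Names and node choices are hypothetical.
BLOCKS = [
    ("rf_transceiver", "analog"),
    ("pmic",           "analog"),
    ("cpu_cluster",    "digital"),
    ("gpu",            "digital"),
    ("ai_accelerator", "digital"),
    ("serdes_io",      "mixed"),
]

NODE_FOR_TYPE = {"analog": "28nm", "mixed": "12nm", "digital": "4nm"}

def partition(blocks):
    """Group blocks into one chiplet per process node."""
    chiplets = {}
    for name, kind in blocks:
        chiplets.setdefault(NODE_FOR_TYPE[kind], []).append(name)
    return chiplets

for node, contents in partition(BLOCKS).items():
    print(node, "->", contents)

# Note: this still co-locates all the power-hungry compute on the 4nm
# die, so thermal and energy budgeting has to follow the partition.
```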
Others agree. “When we envision chiplets, it’s a means to build bigger systems without necessarily putting all the systems on the same silicon,” said Moshiko Emmer, distinguished engineer at Cadence. “That’s the mindset. If we had a reference of what we want to build on a single SoC, it would always be faster than putting it in separate chiplets in terms of the interaction and connection of the components within that chip. But it’s not going to be better from a cost perspective, from a time-to-market perspective, from a modularity perspective. And the way we envision chiplets is to build bigger systems without necessarily putting more on the same SoC. The way we want to approach this is by splitting it based on functionality.”
Consider what’s happening in automotive design. “ADAS is a good example of physical AI, which blends compute capabilities with camera, vision, and DSP processing AI, as well as all the interfaces that are required,” Emmer said. “Obviously, you need that for cars. You need that for robotics, for example, or drones, and aerospace and defense systems. Ideally, you want to have as many functions as possible on the same system, and if you think about a car today, and you look at the number of ECUs in the car, there are approximately 30 in a low-end car and more than 100 for a high-end car. There are a lot of costs around that, because it requires different platforms. In addition, the software is becoming very complicated to manage, and sometimes you want the ability for over-the-air software updates and patches to upgrade and verify the software, and all of this becomes very complicated. In thinking about chiplets, you have the ability to bridge more functionalities in a single system or single platform. I can say, ‘These are all the functionalities I need to have on an ADAS solution.’ I can group around specific functionality. For instance, I can group the CPU as a single chiplet, design it in a dedicated process technology, then connect it to the other chiplets. And this could serve as the heart of the system, and contain the system-level components, all the control from the system perspective — security, safety, power, clock, DFT, debug, you name it. The system-level connects or integrates with each of the other discrete chiplets that handle specific functionalities.”
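A minimal sketch of that hub-and-spoke grouping might look like the following. The service names come from Emmer's list; the class structure and chiplet names are invented for illustration.

```python
# Sketch of the grouping Emmer describes: one system chiplet owns the
# cross-cutting services, and each functional chiplet registers with
# it, bringing only its own function. All names are illustrative.
SYSTEM_SERVICES = ["security", "safety", "power", "clock", "dft", "debug"]

class HubChiplet:
    def __init__(self):
        # System-level control lives in the hub, once, for the package.
        self.services = {s: f"{s}-controller" for s in SYSTEM_SERVICES}
        self.spokes = []

    def attach(self, chiplet_name: str, functions: list[str]):
        # Functional chiplets delegate everything system-level to the hub.
        self.spokes.append({"name": chiplet_name, "functions": functions})

hub = HubChiplet()
hub.attach("vision_chiplet", ["camera ISP", "vision DSP"])
hub.attach("ai_chiplet", ["NPU inference"])
hub.attach("io_chiplet", ["PCIe", "Ethernet"])
print(hub.services["security"], "-", [s["name"] for s in hub.spokes])
```

The design choice the sketch captures is that security, safety, power, clock, DFT, and debug are implemented once in the hub rather than duplicated per chiplet, which is what makes the functional chiplets swappable.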
The local ecosystem approach also opens the door for innovation from smaller IP providers in the chiplet space. Sue Hung Fung, principal product line manager for chiplets at Alphawave Semi, explained that as an Arm Total Design (ATD) partner, her company is developing chiplets in partnership with Arm. “And with that kind of compliance with the compute chiplet, we have to be in alignment with the Arm Chiplet System Architecture (CSA). We need to be able to build our compute chiplet, our I/O chiplet, and our hub chiplet around those definitions and specs in that document, as well as their Base System Architecture (BSA). So for any chiplets that are defined, if other developers want to create something, they can connect to our compute chiplet. If they make an I/O chiplet, we want to be able to fit all that together. It’s more of this open chiplet ecosystem. Arm is helping to define that, and we’re being compliant to that by building around it. Once we build something that’s compliant to the specs that Arm puts out, then other people who also build in compliance to that can connect to us. The AMBA CHI Chip-to-Chip (C2C) coherent fabric between the compute and accelerator is a good example, because if we’re building one piece, someone else is going to build the other piece and want to connect to us.”
Misconceptions abound
Still, much remains to be done to make this all work. “When it comes to the interoperable chiplet marketplace, everybody’s talking about it, but it’s going to be a lot more difficult to achieve than many people realize,” said Ashley Stevens, director of product management at Arteris. “For the short term, we believe that if you’re not trying to achieve interoperability, there isn’t really any additional benefit to a worldwide standard, because you’re only interoperating with things that you’re doing yourself or with your close partners. The idea of interoperability is very attractive to many, but it’s going to take some time. Currently, whenever people design chiplet systems, they verify them together before they go to silicon to check whether it’s going to work. What people are talking about now with an open chiplet marketplace is designing a chiplet, or someone else designing a chiplet completely independently, then putting them together and expecting them to work.”
Stevens believes this will work only if there is very good, agreed-upon verification IP. “Rather than checking that you work with different chiplets, you check that you work with this verification IP, and if the other person checks their work with that verification IP, hopefully the same one, then it ought to work. But at the moment, there is no industry-wide standard verification IP that I know of for chiplets that is going to solve that problem, except perhaps at a very low level, an interface-level kind of thing like UCIe. But that’s just a sort of point-to-point connection. It doesn’t give you the actual protocol to communicate that you need to be able to interoperate and for the chiplets to understand each other. I heard someone say recently on a webinar that as long as you’ve got UCIe you can switch in another chiplet. That’s absolutely not true. That’s the lowest level of standard. It allows you to transfer ones and zeros from A to B, but it doesn’t mean you can understand what’s at the other end. Something needs to understand it. So it’s much more complicated than that, and for the short term we believe that global standards aren’t that important. But they will be later on.”
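The shared-VIP idea can be illustrated with a toy checker. In the hypothetical sketch below, both chiplet teams verify independently against the same reference check rather than against each other's silicon; the field names and opcodes are invented, and a real VIP would model far more than message formats.

```python
# Stevens' point, sketched: both teams verify against the same
# verification IP (VIP) model of the protocol, so chiplets that have
# never met can still be expected to interoperate at that layer.
REQUIRED_FIELDS = {"opcode", "address", "size", "source_id"}
LEGAL_OPCODES = {"READ", "WRITE", "ATOMIC"}

def vip_check(request: dict) -> bool:
    """Reference check each chiplet team runs independently."""
    return (REQUIRED_FIELDS <= request.keys()
            and request["opcode"] in LEGAL_OPCODES)

# Team A verifies that everything it emits passes; Team B verifies
# that it accepts anything that passes. If both hold, they ought to
# interoperate, but only at the level the VIP actually models.
assert vip_check({"opcode": "READ", "address": 0x1000,
                  "size": 64, "source_id": 3})
```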
Others agree. Kevin Donnelly, vice president of strategic marketing at Eliyan, said he knows of companies that purchased IP from other vendors (not Eliyan) and discovered it doesn’t work with anyone else’s IP. “They’ve already built the chip, and they’re disappointed,” he said. “They’re trying to figure out what to do in their next generation. PCI Express is a good example that people like to point to. They want interoperability at a physical layer, because there’s a whole bunch of software stacks that go on top of them that make it work. It’s the same with USB, HDMI, or whatever physical-layer chip-to-chip interconnection. The designer thinks they will just plug the chiplets in and it’s going to work. It doesn’t. Intel tried to tackle that for PCIe, but it’s just not a common application. The application people usually want to use AXI or CHI or some Arm-based processor, and they want to extend that to some chiplet. But none of that’s defined in the industry, so none of the chiplets that are built are going to be interoperable until there’s more software-level definition for how to make a chiplet interoperable. It will happen. We’ve been promoting the open chiplet marketplace. But it’s a common misconception that just choosing a particular PHY and nothing else on your chip is somehow going to make them talk to each other.”
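That layering argument can be made concrete. In the illustrative sketch below, two chiplets share a PHY but still fail to interoperate because a higher layer disagrees; the layer split follows the usual PHY/link/transport/protocol convention, and the chiplet entries are invented.

```python
# The misconception Donnelly flags, made concrete: a shared PHY moves
# bits, but interoperability requires agreement at every layer above it.
STACK = [
    ("protocol",  "e.g. CHI C2C / AXI mapping: what the bits mean"),
    ("transport", "flit packing, flow control, ordering"),
    ("link",      "CRC, retry, lane repair"),
    ("phy",       "electrical layer: moves 1s and 0s from A to B"),
]

def interoperable(chiplet_a: dict, chiplet_b: dict) -> bool:
    """Two chiplets interoperate only if *every* layer matches."""
    return all(chiplet_a[layer] == chiplet_b[layer] for layer, _ in STACK)

a = {"phy": "UCIe", "link": "UCIe", "transport": "streaming", "protocol": "AXI"}
b = {"phy": "UCIe", "link": "UCIe", "transport": "streaming", "protocol": "CHI"}
print(interoperable(a, b))  # False: same PHY, different protocol layer
```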
So why so many misconceptions? Tracing back the evolution of chiplet infrastructure, it all started with a die-to-die interface that was driven mainly by the semiconductor manufacturers. “They first came up with a way to integrate things together within a package. ‘Hey, we have a cool interposer, or we have this silicon bridge, why don’t you try it out?’ That’s how the industry really took to chiplets,” said Pratyush Kamal, director of central engineering solutions at Siemens EDA. “Vertical houses that were dealing with the reticle size limitations, or that were dealing with a very high known-good-die cost, because they had too big of a die to yield properly in some of the advanced nodes, were pushed in this direction. And a lot of work has been done in the vertical design houses, as far as chiplets go. But now we start to see, in the last couple of years, companies starting to talk about participating in an open chiplet economy. They look at a future where we can have plug-and-play of chiplets from multiple vendors in a single package. It’s a whole new science, peeling the onion layer by layer to get to the core of it, to the ideal architecture. When you look at UCIe 1.0 or BoW 1.0, there was a lot of emphasis on the physical layer itself initially, because that’s where the whole thing started — from the physical availability of a smaller bump pitch. It’s the same thing today. We are starting to learn in the industry that we will very soon have physical availability of a micron-pitch via or micron-pitch hybrid bond that we can use to do 3D stacking with chiplets. We are given these technology leapfrog moments by the foundries, and as a design house, we are just learning to gather ourselves and think about how to make the most use of it.”
Next steps
The chip industry is just at the start of that journey. When the industry was looking only at die-to-die interface definitions, it focused on data movement, not resource management. “A chip essentially has some functional cores, a CPU, a GPU, the memory controller that goes and talks to the SRAM,” Kamal explained. “Then there are external I/O peripherals to fetch data from the external world and send data back to it. But there is something very critical here. How do you build a robust chassis? Each core has its own demand. The clock has dynamic voltage and frequency scaling. All of that is tied into the architecture today. The cores are designed to do one single task. Then it’s up to the chassis manager. Some companies call them chassis, some companies call them simple resource managers, but they encompass the whole plane of everything that is non-functional data, everything that is about managing the cores and their interactions. Broadly speaking, you have test, you have debug, you have clock, you have power decisions, you have security. All of this forms a chassis. And when we’ve defined these die-to-die interfaces until now, we focused on data movement. We were just blindly looking at that, but we were prudent enough to leave scope for it, and when we come around to defining these chassis or resource management schemas, we will have the die-to-die interface to at least support that.”
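As a rough sketch of that chassis plane, a resource manager might own the non-functional state for every core, such as per-core DVFS settings, rather than leaving it inside the cores. The plane names below come from Kamal's list; the class, methods, and operating points are invented.

```python
# Minimal sketch of a "chassis": the non-functional plane that manages
# the cores rather than moving their data. All values are illustrative.
class Chassis:
    PLANES = ("test", "debug", "clock", "power", "security")

    def __init__(self, cores):
        self.cores = cores
        # Per-core DVFS state lives in the chassis, not in the core.
        self.dvfs = {c: {"freq_mhz": 1000, "volts": 0.75} for c in cores}

    def set_dvfs(self, core, freq_mhz, volts):
        assert core in self.cores
        self.dvfs[core] = {"freq_mhz": freq_mhz, "volts": volts}

    def quiesce_for_test(self, core):
        # Park the core at a safe operating point before scan test,
        # one example of the test plane coordinating with power/clock.
        self.set_dvfs(core, freq_mhz=400, volts=0.65)

chassis = Chassis(["cpu0", "gpu0", "npu0"])
chassis.set_dvfs("npu0", freq_mhz=1800, volts=0.85)
chassis.quiesce_for_test("gpu0")
print(chassis.dvfs)
```

The open question Kamal raises is what this looks like when the cores sit on different chiplets from different vendors, and the chassis has to reach across die-to-die interfaces that so far were defined only for data.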
But what if there are 1,000 chiplets in a package? “If you’re aware of what the government in the U.S. is trying to do through its National Advanced Packaging Manufacturing Program (NAPMP), the challenge they have thrown at the industry is to give them a fully automated 1,000-chiplet package design flow, analysis flow, and simulation flow,” Kamal said. “When you think about a system of that scale, just a simple boot process could be very time-consuming. There could be multiple thousands of dependencies there, so how do you define your root of trust, your chain of trust? Every chip will have different layers of authorization, and those have to be established and defined as part of the chiplet standards.”
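The boot-ordering piece of that challenge maps naturally onto a dependency graph. The sketch below, with an invented set of chiplets, uses a topological sort to produce an authenticate-then-boot order; a real flow would verify each chiplet's signature against the already-booted authority above it, and the sort would also surface any cyclic dependencies.

```python
# Sketch of many-chiplet boot ordering: each chiplet can only boot
# after its dependencies are booted and authenticated. Graph contents
# are invented for illustration.
from graphlib import TopologicalSorter  # Python 3.9+

# chiplet -> chiplets that must boot and be authenticated first
DEPS = {
    "root_of_trust": [],
    "hub":           ["root_of_trust"],
    "memory":        ["hub"],
    "cpu":           ["hub", "memory"],
    "npu":           ["hub", "memory"],
    "io":            ["hub"],
}

def boot_sequence(deps):
    for chiplet in TopologicalSorter(deps).static_order():
        # A real flow would check this chiplet's signature here,
        # extending the chain of trust one link at a time.
        print(f"authenticate + boot: {chiplet}")

boot_sequence(DEPS)  # root_of_trust, hub, then memory/io/cpu/npu
```

At 1,000 chiplets, with thousands of such dependencies and per-chiplet authorization layers, both the ordering and the trust relationships would need to be machine-readable, which is why Kamal argues they belong in the chiplet standards themselves.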
Related Reading
What Exactly Are Chiplets And Heterogeneous Integration?
New technologies drive new terminology, but the early days for those new approaches can be very confusing.
Chiplets Add New Power Issues
Well-understood challenges become much more complicated when SoCs are disaggregated.
Chip Architectures Becoming Much More Complex With Chiplets
Options for how to build systems increase, but so do integration issues.