Partitioning In The Chiplet Era

Understanding how chiplets interact under different workloads is critical to ensuring signal integrity and optimal performance in heterogeneous designs.

The widespread adoption of chiplets in domain-specific applications is creating a partitioning challenge far more complex than anything chip design teams have dealt with in the past.

Nearly all the major systems companies, packaging houses, IDMs, and foundries have focused on chiplets as the best path forward to improve performance and reduce power. Signal paths can be shortened with proper floor-planning, and connections between chiplets and memories can be improved to reduce resistance and capacitance, which in turn can reduce the overall power. But so many new features are being added into devices — specialized accelerators and memories, CPUs, GPUs, DSPs, NPUs — that mapping out optimal data paths, load balancing options, and workarounds for aging signal paths is an enormously complicated task, and one that may change from one design to the next.

Underlying this shift is a continuing slowdown in scaling benefits. Shrinking features to add more compute density into a planar SoC is no longer cost-effective for many applications, and it limits the overall number of features.

“Physical limitations on large designs are hitting the reticle limit,” said Arif Khan, senior product marketing group director at Cadence. “There are technology scaling limitations and cost drivers that propel designs towards ensuring the right transistor is targeted for the right process node for each design, among other considerations.”

Assembling chiplets into some type of advanced package is another way to achieve these goals. While most of the chiplets being used today are developed in-house for internal use, that balance is starting to change. Third-party chiplets, particularly HBM and standardized functions such as I/O, are being developed by multiple vendors for different sockets. That both simplifies and complicates the partitioning challenge, because these new chiplets need to be characterized in the context of proprietary architectures.

“For vertically integrated designs that have been the norm so far, a single entity has control over the entire specification and optimizes the partition to suit their intended purposes,” Khan said. “Manufacturability and process optimization, maximized reuse to generate future SKUs, control over the micro-architecture, power delivery, communication between chiplets, data transfers — that information is needed for a given project, and forward-looking roadmaps drive decisions for partitioning.”

Chiplets are a fundamental change in how chips are designed, manufactured, tested, and assembled. “In working with chiplets, one of the first things we need to understand is how we partition the system,” said Letizia Giuliano, vice president of product marketing and management at Alphawave Semi. “We have this big AI chip. We used to design 300mm² chips. Now we cannot fit everything in one chip, so we start partitioning. The first things that come naturally to disaggregate are I/O and connectivity. Those types of building blocks don’t scale with process nodes, so it’s easier to keep them in older process nodes and put your compute power in advanced technology nodes. The first thing we do with our customers is help them disaggregate the system correctly. We talk about I/O disaggregation, memory disaggregation, and then the compute. They can take advantage of the latest technology and the latest power and performance benefits using a leading-edge technology node.”

At the heart of any design involving chiplets are partitioning and prioritization. “This is because for something like an I/O, a SerDes, or a PCIe PHY, 7nm, for example, is really the optimal technology node for that particular function,” Giuliano said. “This can also be referred to as domain-specific disaggregation. We have a PCI Express PHY that can perform optimally in a 5nm or 7nm technology node. Then we have an AI accelerator that can take advantage of a 2nm or 3nm technology node.”

Partitioning determines how an application is mapped into the chip/chiplet architecture, where the processing occurs, whether that processing is asynchronous or synchronous, and how processing, data movement, and storage are prioritized. “Do you put the processing on a central compute, or an AI accelerator, or a GPU, or a very dedicated accelerator? Do you put it in off-chip or on-chip memory? What type of interconnect do you use? That’s all partitioning that an architect already needs to do for a system on chip,” said Tim Kogel, senior director, technical product management at Synopsys. “With chiplet design, designing a multi-die system means you have the additional dimension in your design space of which parts of the application do I group together on the same chiplet, or where do I need to map functions onto different chiplets. This is driving the data flows and the communication across the boundary of chiplets.”

These top-level decisions have significant implications for power consumption and performance, as well as for another set of decisions regarding which technologies and what type of package should be used. “Should a simpler type of integration with the classic organic substrates be used, or more advanced silicon interposers, or silicon bridges? All those decisions are design points,” Kogel said. “As a designer, you now have all these additional aspects to consider when you are designing or putting together a multi-die system. So even with chiplets, partitioning is still the process of mapping an application onto the resources that are supposed to execute that application. But it has become an even more multi-faceted problem, with many additional dimensions in the design space.”


Fig. 1: The multi-die design convergence path. Source: Synopsys
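
To make the scale of that design space concrete, here is a minimal sketch of the kind of back-of-the-envelope sweep an architect might run: it scores a few candidate block-to-chiplet assignments against a few integration styles by the power spent in die-to-die links. The block names, traffic figures, and energy-per-bit values are assumptions chosen purely for illustration, not figures from Synopsys or any vendor.

```python
# Illustrative sketch only: a toy design-space sweep for multi-die partitioning.
# All block names, traffic figures, and energy-per-bit values are assumptions
# chosen for demonstration, not vendor data.

from itertools import product

# Estimated traffic between function blocks, in GB/s (assumed values).
TRAFFIC_GB_S = {
    ("cpu", "npu"): 200,
    ("cpu", "io"): 20,
    ("npu", "hbm"): 800,
    ("cpu", "hbm"): 100,
}

# Candidate partitions: which blocks share a chiplet (die index per block).
PARTITIONS = {
    "hbm_only_off":   {"cpu": 0, "npu": 0, "io": 0, "hbm": 1},
    "io_off":         {"cpu": 0, "npu": 0, "io": 1, "hbm": 2},
    "npu_and_io_off": {"cpu": 0, "npu": 1, "io": 2, "hbm": 3},
}

# Rough cross-boundary energy cost per bit for each integration style (assumed, pJ/bit).
PACKAGE_PJ_PER_BIT = {"organic_substrate": 1.5, "silicon_bridge": 0.7, "silicon_interposer": 0.5}

def cross_boundary_power_w(assignment, pj_per_bit):
    """Power spent moving data between chiplets for one partition/package choice."""
    watts = 0.0
    for (a, b), gb_s in TRAFFIC_GB_S.items():
        if assignment[a] != assignment[b]:          # this link crosses a chiplet boundary
            bits_per_s = gb_s * 8e9
            watts += bits_per_s * pj_per_bit * 1e-12
    return watts

for (part_name, assignment), (pkg, pj) in product(PARTITIONS.items(), PACKAGE_PJ_PER_BIT.items()):
    print(f"{part_name:14s} on {pkg:18s}: {cross_boundary_power_w(assignment, pj):6.2f} W in D2D links")
```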

Ashley Stevens, director of product management at Arteris, agreed, pointing to two different approaches to partitioning. “Do you have a full picture of everything, a top-down view, or do you take a bottom-up approach where you do something and then connect it to something else? The top-down approach is much simpler, because you know what you’re going to talk to and you know how everything is partitioned within the system. For example, you know the memory map of the complete system, and you know what’s there. If, instead, you have a system where you intend to connect to arbitrary chiplets, third-party or otherwise, then it’s much more complicated for several reasons. One of them is verification, because with the top-down approach you can verify the whole thing together. But if you take the bottom-up approach and you don’t have the other part of the system, then you need very well-defined interfaces, both in hardware and software.”

However, one of the issues with this today concerns maturity of those interfaces. “AMBA CHI Chip2Chip is being pushed heavily by Arm and many others in the industry, and it’s probably a good technical solution,” Stevens said. “But it’s very new. It was first published earlier this year, and it’s never been in silicon, so it’s not proven in silicon. Trying to build a system where you’re interfacing to something that’s never been proven in silicon is inherently difficult.”

Future-proofing
One of the reasons chiplets are so popular is the flexibility they offer. As long as sockets for chiplets are well defined, chiplets designed for those sockets should be able to plug into a design with different features. This has made the concept particularly attractive to automotive OEMs, which are looking at chiplets as a way of customizing different SKUs and updating different technologies as they evolve throughout a vehicle’s lifetime.

And in theory, that should simplify the partitioning, enabling what are essentially plug-and-play chiplets. But the reality is quite a bit more complicated. In addition to electrical and thermal characterization, chiplets, interconnects, substrates, and packages need to be characterized for mechanical effects, as well. Thinner substrates and new use cases can cause thermal gradients, which can accelerate aging of different chiplets under different workloads. So instead of simply adding a new chiplet, the entire system may need to be re-partitioned. Mechanical issues include cracking, warping, and voids, and they can cause — or be caused by — variations in temperature, vibration, and signal interference.

David Fritz, vice president for Hybrid and Virtual Systems at Siemens Digital Industries Software, observed that for decades only a few companies controlled the processing within systems. But now, with the rise of HPC, IoT, and other segments, OEMs have decided they need much more control over that decision-making process in order to differentiate their products and protect their bottom line.

“The whole idea behind chiplets is the realization that if you had the right chiplets, and you got to choose which of those chiplets you were going to use at different process nodes, then you could put together a competitive, differentiated solution without having to hire thousands of engineers, pay fortunes, and deal with re-spins and all those other aspects,” Fritz said. “It sounds like nirvana, and of course there are some technical issues behind that. Is it PCIe over UCIe? Is it CHI chip-to-chip? What exactly are those things? Those are being sorted out right now in many research organizations.”

Fritz believes the crux of the issue is how chiplets should be partitioned, and says the semiconductor ecosystem is at risk of this embryonic approach being stifled because the wrong people are deciding the partitioning for the wrong reasons. “At a recent chiplet conference in Europe, I did a keynote in which I was talking about the why, the what, and the how of chiplets. I used to explain the why. Why is this important? Why is it worth doing? But what’s happening is those same companies that were making fortunes building complex SoCs are jumping straight to the how, and they’re forgetting that intermediate step of what exactly needs to be done, and therefore the connection back to the why is lost. The automotive Tier Ones are saying they love the chiplets idea, but they need chiplets that will do ‘this,’ and all they’re getting is chiplets that do ‘that.’ They don’t see that they will be any better off than they were before. And while it’s a different way to get there, and the value proposition will be different and all those sorts of things, the same companies are primarily influencing the limited set of choices. So it’s headed in the wrong direction.”

What gets partitioned
Certain functions are clear candidates for partitioning into separate chiplets, while other chiplets can be grouped together.

“When you’re going to move something off chip, what do you move? The pain points are going to be around the interconnect between those pieces that you move,” said Kevin Donnelly, vice president of strategic marketing at Eliyan. “You’ll hear often that for on-die connections, wires are pretty much free. You can add a lot of them and get huge amounts of bandwidth back and forth on a chip. However, as soon as you are connecting chiplets, you’re constrained in different ways. It depends on the technology you’re using and the package you’re using. But if you look at a lot of the partitioning decisions being made, there are things that are extremely tightly coupled and have to talk to each other all the time, so you’d rather keep them on the same chip. Then, the things that are more peripheral to the conversation, the historical peripherals of an SoC, you can move off chip.”

That includes such areas as I/O, as well as IP that talks to high speed SerDes or other connections outside the chip. “Memory is another one that got moved off because memory is often different process-wise,” Donnelly said. “It’s a different technology, and function-wise it is not as tightly coupled as an array of small GPUs or TPUs, or whatever the tiny cores are that are interconnected. Those are harder to break up. Big picture, the partitioning is looking at overall architectures and determining what we can move off based on the kinds of interconnect bandwidth we can get between chiplets versus on a monolithic die.”
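
As a rough illustration of that calculus, the sketch below checks whether a proposed cut fits within an assumed die-to-die bandwidth budget. The traffic matrix and the 500 GB/s budget are made-up numbers chosen for demonstration, not Eliyan data, but they show why tightly coupled compute tiles tend to stay on one die while memory controllers and I/O move off.

```python
# Illustrative sketch: checking whether a proposed cut is feasible given the
# die-to-die bandwidth the chosen interconnect/package can supply.
# Traffic numbers and the bandwidth budget are assumptions for demonstration only.

TRAFFIC_GB_S = {          # estimated block-to-block traffic (assumed)
    ("gpu_cluster", "gpu_cluster_2"): 1200,   # tightly coupled compute tiles
    ("gpu_cluster", "lpddr_ctrl"):     300,
    ("gpu_cluster", "serdes_io"):       40,
}

def cut_bandwidth(traffic, on_die_a):
    """Sum the traffic that would cross the boundary if only `on_die_a` stays on die A."""
    return sum(gb for (src, dst), gb in traffic.items()
               if (src in on_die_a) != (dst in on_die_a))

D2D_BUDGET_GB_S = 500     # assumed usable die-to-die bandwidth for this package

for candidate in [{"gpu_cluster", "gpu_cluster_2"},   # move memory controller + I/O off
                  {"gpu_cluster"}]:                    # also split the compute tiles
    need = cut_bandwidth(TRAFFIC_GB_S, candidate)
    verdict = "fits" if need <= D2D_BUDGET_GB_S else "exceeds"
    print(f"keep {sorted(candidate)} on die A -> {need} GB/s crosses the cut ({verdict} budget)")
```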

The chiplets themselves must be designed with partitioning in mind, too. “Due to the large number of standard interfaces, and because certain interfaces are better suited for certain applications, implementing the interface in a chiplet has many advantages,” said Ashraf Takla, founder and CEO of Mixel. “Interface functions typically don’t need to use the most advanced process technology, while processors do. In the most advanced technologies, the process does not necessarily provide thick-oxide transistors, so for drivers that need a higher voltage than the core voltage, the design becomes more complicated. Precipitating out the interface function into a chiplet using the most suitable process technology reduces the complexity of the design and provides added flexibility, and it also lowers the cost of the BOM.”

In a third-party chiplet ecosystem, which is the goal of the semiconductor industry, it’s necessary to create a reusable system architecture definition that applies to a wider range of designs. This is what’s behind Arm’s initiative to define a chiplet system architecture in conjunction with industry partners, addressing the needs of designs in various segments.

“Both on-chip and chip-to-chip interconnects play a vital role,” said Cadence’s Khan. “Off-chip interfaces like UCIe connect to on-chip interconnects, such as the AMBA AXI, using the CHI chip-to-chip standard. Designers will need tools that help capture their design intent, allow them to model, simulate, and verify that chiplet partitions accomplish their functional and performance goals. And while vertically integrated designs can create bespoke partitions as needed, third-party chiplets are more likely to be built based on projected demand and for commonly used abstractions such as I/O chiplets and the like.”

Today, except for some standard memory components, there is no consistency in partitioning of chiplet-based systems. “Each system design makes bespoke choices about where to partition that system between chiplets,” observed Rob Dimond, system architect and fellow at Arm. “Sometimes the partitioning scheme optimizes for the specifics of a particular system (for example, only putting the logic on a leading-edge process chiplet that benefits from that silicon process). Some partitioning decisions are arbitrary, simply because there are no suitable standards to reference. A changing industry with new silicon form factors and larger units of IP requires new architecture specifications to ensure the potential benefits aren’t lost due to unnecessary and avoidable non-differentiating fragmentation.”

Following work with partners, Arm released the AMBA CHI C2C specification that defines how data between chiplets is packetized. This builds on existing AMBA specifications that have been used in billions of devices. In addition to this, as the industry looks to re-use portions of designs, Arm is working with a group of more than 40 partners to develop the Arm Chiplet System Architecture (CSA) to standardize system design choices for different chiplet types. The goal is to enable partners to decompose an Arm-based system across multiple chiplets, in the same way a monolithic chip is composed of IP blocks.

Prioritization
In many cases, optimal partitioning is done along the lines of hierarchy of the design.

“There are two reasons for this,” according to Mayank Bhatnagar, director of product marketing at Cadence. “First is process/cost optimization. Different sections need different metrics, such as speed, leakage, and device-matching profiles. These sections can then be implemented in the process nodes that are optimal for them. For example, RF circuits may want to stay in older nodes to avoid redesign and save wafer costs, whereas CPUs move to newer and faster nodes. Second is scalability. If you split along design hierarchies, you can scale up the design long after implementation, such as splitting off the GPU as a chiplet of its own, allowing the user to instantiate as many of those as needed when implementing a full SoC.”
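
The cost side of that argument can be sketched with a first-order yield model. The defect densities, wafer costs, and die areas below are placeholder assumptions, not foundry numbers, but they illustrate why splitting a large die along its hierarchy can pay off on silicon cost even before packaging costs are counted.

```python
# Illustrative only: a first-order die-cost comparison for splitting along design
# hierarchy. Defect densities, die areas, and wafer costs are assumed placeholder
# values, not foundry data.

import math

def die_cost(area_mm2, wafer_cost, defect_density_per_mm2, wafer_area_mm2=70000):
    """Rough cost per *good* die using a simple Poisson yield model."""
    yield_fraction = math.exp(-area_mm2 * defect_density_per_mm2)
    dies_per_wafer = wafer_area_mm2 / area_mm2          # ignores edge loss
    return wafer_cost / (dies_per_wafer * yield_fraction)

# Monolithic: everything on a 600 mm^2 leading-edge die (assumed numbers).
mono = die_cost(600, wafer_cost=17000, defect_density_per_mm2=0.002)

# Split: 2x 250 mm^2 compute chiplets on the leading-edge node,
# plus a 150 mm^2 I/O die on a cheaper mature node.
split = (2 * die_cost(250, wafer_cost=17000, defect_density_per_mm2=0.002)
         + die_cost(150, wafer_cost=6000, defect_density_per_mm2=0.001))

print(f"monolithic silicon cost ≈ ${mono:.0f}, partitioned ≈ ${split:.0f} (packaging not included)")
```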

At the same time, depending on the application, the partitioning will be different. Arm’s Dimond points to two different scenarios. “In the first, peripherals get aggregated from across the motherboard as chiplets into a package,” he said. “The partitioning schemes look identical to the established board-level approaches, with a different physical layer to take advantage of lower power, latency, space, and cost. In the second, an SoC gets disaggregated into multiple chiplets. New standards are needed here because there are no established partitioning schemes, which is why we’re developing the Arm Chiplet System Architecture to enable consistency for the Arm ecosystem.”

Chiplet signal integrity considerations
Chiplet design also introduces unique considerations that impact partitioning, such as signal integrity challenges.

“You might ask why we were not doing this a while ago,” said Chun-Ting “Tim” Wang Lee, signal integrity application scientist and high-speed digital applications product manager at Keysight Technologies. “It’s because when you break things up, you increase the design complexity. The signal integrity between the die becomes the main issue, because signal integrity mainly concerns the interconnect that connects the transmitter to the receiver. In chiplets, because all the die are broken up, we have a lot of different die-to-die connections, which means the signal integrity becomes very important. Then, when you have different die, you have power that’s going to be on a different die. How are you going to distribute the power to all these other dies? That’s why power integrity becomes a problem. Then, once you have power, you have thermal, so it adds onto itself.”

There are three main challenges with signal integrity in the chiplet system, he said. “First is reflections, i.e., impedance mismatch. You don’t want things bouncing back. Second is loss, because if you’re operating at higher frequencies and going through different stacks and layers, you don’t want the signal loss to be too high. Third is cross-talk. Since everything is going to be jam-packed in a very small space, there will be cross-talk between the wires. And you don’t want to be solving these issues once you are in production. You want to do pre-silicon validation before you put it into production. You want to validate that what you are making is correct.”
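
Two of those three checks reduce to simple arithmetic that can be run long before production. The sketch below computes a reflection coefficient from an impedance mismatch and an insertion-loss budget for a short die-to-die route; cross-talk would be checked the same way against a coupling budget. All impedance, loss, and length values are assumed examples, not measurements of any real channel.

```python
# Illustrative sketch of two of the signal-integrity checks described above:
# reflection (impedance mismatch) and loss. All numbers are assumed example
# values, not measurements of any real die-to-die channel.

import math

def reflection_coefficient(z_load_ohm, z0_ohm=50.0):
    """Voltage reflection coefficient at an impedance discontinuity."""
    return (z_load_ohm - z0_ohm) / (z_load_ohm + z0_ohm)

def channel_loss_db(loss_db_per_mm, length_mm):
    """Total insertion loss for a routed die-to-die trace."""
    return loss_db_per_mm * length_mm

# Example: a 42-ohm bump field on a nominally 50-ohm channel (assumed values).
gamma = reflection_coefficient(42.0)
print(f"reflection coefficient ≈ {gamma:+.3f} (return loss ≈ {-20 * math.log10(abs(gamma)):.1f} dB)")

# Example: 0.4 dB/mm at the operating frequency over a 4 mm interposer route (assumed).
loss = channel_loss_db(0.4, 4.0)
print(f"insertion loss ≈ {loss:.1f} dB against, say, a 5 dB budget: {'OK' if loss < 5 else 'too lossy'}")
```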

Architectural partitioning
Knowing how to partition chiplets is very much application- and workload-dependent.

“The main thing that people need to realize if you’re using chiplets is that for the wire density you can get across that interface, whether it’s actual electrical wires or equivalent bandwidth, you’re going to have to pay something to maintain that across the chiplet boundary,” said Elad Alon, CEO and co-founder of Blue Cheetah. “It’s fundamental physics. Within a given chiplet, you get the 40nm or 100nm or 120nm pitch density of wires across any 2D boundary. Once you’re in this 2.5D or even 3D integrated design, you’re now in the microns — tens of microns to maybe 100-plus microns. If you’re going to introduce this data interface, it behooves one to do it in a place, architecturally, where whatever overhead you end up picking up — whether that overhead is higher power or higher area or higher latency, or some mix of all three — is an appropriate one to absorb. You need to think about this from the outside-in, where the outside is the chiplet boundary and you’re deciding what that boundary is. That’s the baseline starting point.”
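
The density gap Alon describes can be put in rough numbers. The sketch below compares connections per millimeter of die edge at an on-die wire pitch versus a micro-bump pitch; the pitches and per-wire data rates are assumed round figures for illustration, not numbers from Blue Cheetah.

```python
# Illustrative sketch of the wire-density point: compare how many connections per
# millimeter of die edge ("beachfront") you get on-die vs. across a die-to-die
# interface. Pitches and per-wire data rates are assumed round numbers.

def wires_per_mm(pitch_um):
    """Connections available per millimeter of edge at a given pitch."""
    return 1000.0 / pitch_um

def bandwidth_gbps_per_mm(pitch_um, gbps_per_wire):
    return wires_per_mm(pitch_um) * gbps_per_wire

# On-die routing at ~0.1 um wire pitch vs. micro-bump D2D at ~40 um pitch (assumed).
on_die = bandwidth_gbps_per_mm(pitch_um=0.1, gbps_per_wire=2)
d2d    = bandwidth_gbps_per_mm(pitch_um=40.0, gbps_per_wire=16)

print(f"on-die:     ~{on_die:,.0f} Gb/s per mm of boundary")
print(f"die-to-die: ~{d2d:,.0f} Gb/s per mm of boundary")
print(f"ratio:      ~{on_die / d2d:,.0f}x, the gap the D2D PHY and architecture must absorb")
```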

The chiplet architect also needs to consider that these things tend to be much more intimately linked with the details of the SoC itself.

“This isn’t like the PCIe world, where we can say, ‘I’ve got a PCIe interface and it’s fully interoperable. Everything is there, and it works,'” Alon said. “There’s a lot of overhead that one pays to do that type of interface and that type of design, whereas what you’re typically doing in a chiplet is saying, ‘This would have been a single SoC if I could have built it that way, but for one reason or another that doesn’t make sense. That’s not optimal in one way or another.’ When you’re starting to break things up in that way, you’re starting from a point where your latency or other overhead was very low. You had many wires and very high bandwidth. The point there is that the more you understand about what you’re doing with that bus, what the traffic patterns are, and what you can tolerate in terms of latency versus throughput and latency determinism, the more you can engineer the data interface to align as closely as possible with that specific application. So you really need to think from the outside-in about the overhead. This data interface will never be as good as what you did on the die, but the more you’re able to know and reflect what it is you’re trying to do, the better you can make it so that it looks as transparent as possible.”

In fact, there are multiple partitionings that must happen, and they affect each other dramatically, Siemens’ Fritz said. “You could say this is an NP-complete problem, and it is. If I change one thing here to optimize it, I’ve de-optimized something else. So that is, essentially, the hardware and the software, and the software stack all the way up and down the line. With the hardware, what do I need to accelerate? What can I buy off the shelf? There are too many decisions. The tooling must take the approach of iterative approximation. In other words, let’s make a best guess for both. Let’s measure it. Let’s see where it needs to improve, and then we measure it again. What’s cool about that is it’s essentially the idea behind software-defined vehicles. They are touch points of iteration. You’re doing them both at once. You could do it at an SoC level, which is becoming a higher percentage of the total end-product system because it’s consolidating all these different ECUs and LRUs into one big SoC. So that’s headed in the right direction.”
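
The “iterative approximation” Fritz describes can be sketched as a simple improvement loop: guess a partition, measure a cost, move one block at a time when it helps, and repeat. In the toy version below the “measurement” is just cross-boundary traffic under an area limit, with assumed figures; a real flow would measure power, latency, and thermal behavior from simulation rather than a toy cost model.

```python
# Illustrative sketch of an iterative partition-refinement loop. Traffic and area
# figures are assumptions; the cost model stands in for real measurements.

TRAFFIC = {("cpu", "npu"): 200, ("npu", "dsp"): 150, ("cpu", "io"): 20, ("dsp", "io"): 10}  # GB/s
AREA = {"cpu": 120, "npu": 200, "dsp": 80, "io": 60}                                        # mm^2
MAX_CHIPLET_AREA = 300                                                                       # per die

def cost(asg):
    """Traffic that crosses the chiplet boundary for a given block-to-die assignment."""
    return sum(gb for (a, b), gb in TRAFFIC.items() if asg[a] != asg[b])

def feasible(asg):
    """Reject assignments where either die exceeds the area limit."""
    return all(sum(AREA[b] for b in asg if asg[b] == die) <= MAX_CHIPLET_AREA for die in (0, 1))

assignment = {"cpu": 0, "npu": 1, "dsp": 0, "io": 1}        # initial best guess
improved = True
while improved:
    improved = False
    for block in list(assignment):
        trial = dict(assignment, **{block: assignment[block] ^ 1})   # try moving one block
        if feasible(trial) and cost(trial) < cost(assignment):
            assignment, improved = trial, True

print(assignment, "->", cost(assignment), "GB/s crossing the boundary")
```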

Conclusion
There has been much discussion, hope, and work toward building plug-and-play chiplets. That goal is still a long way off, and partitioning is a key element on the technical challenge side of the equation.

“What we very much believe is going to happen, and is happening already, is a multi-vendor ecosystem, where appropriate groups of companies will get together around a specific end product line that they are all going after,” said Blue Cheetah’s Alon. “They will define a chiplet socket, or sets of chiplet sockets, that in combination will be able to address these markets. If it’s multiple vendors, partitioning is done in a way that reflects the expertise, specialization, and business directions of the individual players. You must step into this world where you have these very concrete definitions of what each of the chiplets really is, so you can then say, ‘Now that I’ve defined this thing, if somebody came in and met that exact same socket spec, fine. It’s plug-and-play.’ HBM is very much the model of this, where it’s well defined from a thermal, mechanical, and electrical limitations perspective. We believe that things will evolve this way on the logic side of the world, as well.”
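
What such a socket definition might capture, and what plug-and-play compliance against it could look like, is sketched below. The field names and limits are assumptions made up for the sake of the example; a real socket specification, as with HBM, is far more detailed.

```python
# Illustrative sketch only: what a "chiplet socket" definition might capture, and a
# compliance check against it. Field names and limits are assumptions for the sake
# of the example, not any published socket specification.

from dataclasses import dataclass

@dataclass
class SocketSpec:
    d2d_protocol: str          # e.g. "UCIe" (assumed label)
    bump_pitch_um: float
    max_power_w: float
    max_tj_c: float            # junction temperature limit
    footprint_mm: tuple        # (width, height)

@dataclass
class ChipletReport:
    d2d_protocol: str
    bump_pitch_um: float
    power_w: float
    tj_c: float
    footprint_mm: tuple

def compliant(spec: SocketSpec, part: ChipletReport) -> bool:
    """A chiplet is plug-compatible only if it meets every limit the socket defines."""
    return (part.d2d_protocol == spec.d2d_protocol
            and part.bump_pitch_um == spec.bump_pitch_um
            and part.power_w <= spec.max_power_w
            and part.tj_c <= spec.max_tj_c
            and part.footprint_mm == spec.footprint_mm)

socket = SocketSpec("UCIe", 45.0, 12.0, 105.0, (6.0, 8.0))
vendor_a = ChipletReport("UCIe", 45.0, 10.5, 98.0, (6.0, 8.0))
print("vendor A fits socket:", compliant(socket, vendor_a))
```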

At the end of the day, partitioning must be part of the overall chiplet architecture. If it is treated as an afterthought, it leads to poor results compared with designs where partitions are planned during the definition of the multi-chiplet architecture, Cadence’s Bhatnagar said. “There are many parameters to consider that can impact yield, PPA, and overall cost.”

Arm’s Dimond agreed. “Partitioning is the first architecture to define if you want to build an ecosystem around chiplets. You need a common language across the value chain to drive the necessary compromises. As a chiplet consumer, if I use a standard chiplet type, others will use it as well, and I don’t have to pay all the non-recurring engineering costs. As a supplier, I can aggregate multiple customers under a common type. To get to something like a marketplace, there is much more architecture needed — protocols, physical layer, etc. Common partitioning is the first step to realizing some benefit. For example, you don’t have to discard your entire system design every generation, and a proven partitioning scheme reduces design risk and has an IP story behind it.”


