Disaggregation requires traffic cops and in-chip monitors to keep devices functioning as expected over time.
The shift from SoCs to multi-die assemblies requires more, and smarter, controllers distributed throughout a package to ensure optimal performance, signal integrity, and uninterrupted operation.
In planar SoCs, many of these functions are managed by a single CPU or MCU. But as logic increasingly is decomposed into chiplets, connected to each other and to memories by TSVs, hybrid bonds, or standard copper wires, there are many more interactions, a greater potential for data path slowdowns caused by process variation or uneven aging, and a growing need to manage where processing happens due to different workloads, domains, and physical effects such as heat and noise.
The overarching challenge is to deliver enough performance gain and power savings to warrant the investment in these kinds of chips, which can top $100 million for highly customized AI chips in hyperscale data centers, while also building in enough commonality to re-use various design and manufacturing processes, materials, and IP (which increasingly includes chiplets). That requires more management of more elements in a multi-chip device, and enough resilience to re-route data traffic as needed with minimal disruption.
“This is all about the ability for the system to tune itself over time based on what’s happening,” said Mike Ellow, CEO of Siemens Digital Industries Software. “Otherwise, everything will grind to a halt. When we build this complex system of systems, everything comes together at some integration point and it works. You roll it out, and it starts performing in the field. Then you run your first over-the-air update of everything. Will it all come together? Now you’re relying on either virtual models or hardware-in-the-loop models, so you have the actual physical system that you’re now taking a look at to evolve your software stack, which is an integration point from many different suppliers. That has its own ecosystem associated with it.”
Keeping these devices operating within spec throughout their expected lifetimes adds a whole new level of complexity to chip design. “We’re going to see microcontrollers or small processing units built into almost every chip and almost every function,” said John Koeter, senior vice president and head of Synopsys’ IP Group. “For example, all of our high-end physical interfaces, whether it’s 224 gigabit Ethernet, PCIe Gen 7, or DDR interfaces, have microprocessors built into them to directly control and authenticate and provide firmware updates to handle different channels. What we’re doing at a micro level here in the IP world is going to play out across different chips throughout the entire database.”
One of the biggest challenges in these multi-die designs is thermal dissipation, due partly to the density of the transistors, partly to the higher utilization of compute elements than in the past, and partly to the resistance associated with pushing more electrons through wires.
“Heat management is a big consideration,” said Ramin Farjadrad, CEO and co-founder of Eliyan. “For some solutions you build your logic in CMOS at 3nm or 2nm. Maybe you don’t care as much about heat, but you have HBM on one side, you have co-packaged optics on the other side, and you have to consider how heat impacts them. The industry is still trying to figure this out. You have to think of all the different components and make sure the same issues we had to follow and have in mind in building a system are now in a package. If you want to cool the whole system, it’s a lot of cooling. The question is how to design the cooling around that.”
Others agree. “Thermal issues, and more specifically heat density, stand out as the most critical concern, especially in 3D designs,” said Nir Sever, senior director of business development at proteanTecs. “One major concern often overlooked is testability, both pre- and post-assembly. Assembling an SiP, only to discover a fault in one of the dies, can be enormously costly as it risks scrapping the entire package.”
At least some of that scrap can be reduced by shifting testing and outlier detection left into wafer sort. In addition, chipmakers can mix and match chiplets, pairing them by selecting dies that are best suited for co-packaging, Sever said.
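As a rough illustration of that kind of shift-left flow, the sketch below screens parametric outliers at wafer sort and then pairs logic and I/O dies by speed so that co-packaged chiplets end up with similar timing margins. The die records, the ring-oscillator metric, and the 3-sigma limit are hypothetical choices for the example, not any vendor’s actual criteria.

```python
# Hypothetical shift-left screening and chiplet pairing at wafer sort.
from statistics import mean, stdev

def screen_outliers(dies, key="ring_osc_freq_mhz", sigma=3.0):
    """Drop dies whose parametric reading falls outside +/- sigma of the wafer mean."""
    values = [d[key] for d in dies]
    mu, sd = mean(values), stdev(values)
    return [d for d in dies if abs(d[key] - mu) <= sigma * sd]

def pair_by_speed_bin(logic_dies, io_dies, key="ring_osc_freq_mhz"):
    """Match the fastest logic die with the fastest I/O die, and so on down the list,
    so chiplets placed in the same package have comparable timing margins."""
    logic_sorted = sorted(logic_dies, key=lambda d: d[key], reverse=True)
    io_sorted = sorted(io_dies, key=lambda d: d[key], reverse=True)
    return list(zip(logic_sorted, io_sorted))
```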
Inside vs. outside
There is no standard for where to place this intelligence, but there are two general options for how to use it. In one scenario, monitors placed inside a package communicate externally to a centralized dashboard, where corrective, or in some cases preventive, action can be taken. This appears to be heading toward some type of virtual/digital twin model, which so far has been more a statement of direction than reality due to the complexity of these devices. Nevertheless, big EDA companies, equipment makers, and analytics companies all view this as a huge opportunity that goes well beyond just the chip.
“When you have enough things inside the package, we cannot test anymore using traditional methods,” said Letizia Giuliano, vice president of IP product marketing at Alphawave Semi. “For example, for our servers and data we have to monitor all this inside a device. You can detect everything happening within the die over its lifetime through just access to registers. So we have all these process, voltage and temperature sensors everywhere in the chip. It’s like a big network of monitors inside the chip that you can access through the register interface. A live monitor is very important, because all these server links, die-to-die links, need to be monitored throughout their lifetime. We have a live monitor while the chip is working.”
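As a hedged sketch of what that register-based access could look like in software, the example below polls a bank of on-die temperature monitors through a generic register interface. The base address, scaling factor, and read_register() transport are placeholders invented for illustration, not any vendor’s actual register map.

```python
# Illustrative polling of on-die PVT monitors through a register interface.
# Addresses and scaling are hypothetical; read_register() stands in for the
# real access mechanism (JTAG, APB, I2C, etc.).
TEMP_SENSOR_BASE = 0x1000   # assumed base address of the temperature monitor bank
TEMP_LSB_CELSIUS = 0.25     # assumed scaling from raw code to degrees C

def read_register(addr: int) -> int:
    """Placeholder for the platform's actual register-access transport."""
    raise NotImplementedError

def read_die_temperatures(num_sensors: int) -> list[float]:
    """Poll each temperature monitor and convert its raw code to degrees C."""
    temps = []
    for i in range(num_sensors):
        raw = read_register(TEMP_SENSOR_BASE + 4 * i)   # one 32-bit register per sensor
        temps.append(raw * TEMP_LSB_CELSIUS)
    return temps
```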
Others agree. “Diagnosis of problems is hard in monolithic die and even harder in chiplet-based designs. Access to the test and diagnosis information is very limited. ‘No Trouble Found’ is a common outcome of diagnosis or failure analysis when it’s not based on deep data measured when the SiP was still running in mission mode,” said proteanTecs’ Sever. “You need to have visibility inside of each chiplet that continuously monitors vital information while the chip is operating at test or during mission mode, processing that data inside the chip by hardware, storing key statistical data that can be extracted at any time for analysis, and issuing proactive alerts when an abnormal behavior is detected. Our applications are design-aware and leverage sophisticated algorithms, dashboards, and guided analytics to make the root cause analysis more systematic and fast, resulting in a much higher chance of accurate resolution, even in-context, not necessarily requiring the SiP to be sent back as an ‘RMA’.”
A second approach is to build enough logic into in-chip sensors that they can take action themselves. The advantage is automated real-time response, but it requires more area and power, and it behaves more like a black box, where it may be difficult to discern exactly what is going on inside a package.
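For illustration, the sketch below shows what one such self-contained control loop could look like, with a local monitor throttling a link when a temperature limit is crossed and restoring it once the die cools down. The thresholds, hysteresis band, and action names are assumptions made up for the example.

```python
# Hypothetical local controller that acts on its own monitor readings rather
# than forwarding them to an external dashboard. All limits are illustrative.
class LocalThermalGovernor:
    def __init__(self, throttle_c: float = 105.0, resume_c: float = 95.0):
        self.throttle_c = throttle_c   # start throttling above this temperature
        self.resume_c = resume_c       # resume full rate below this temperature (hysteresis)
        self.throttled = False

    def step(self, temp_c: float) -> str:
        """Return the action to apply for the latest temperature sample."""
        if not self.throttled and temp_c >= self.throttle_c:
            self.throttled = True
            return "reduce_link_rate"
        if self.throttled and temp_c <= self.resume_c:
            self.throttled = False
            return "restore_link_rate"
        return "no_change"
```

The hysteresis band is there so the controller does not oscillate between throttled and full-rate states when the temperature hovers near the limit.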
“We’re building a framework, because while the function of the multi-die or the chiplet changes, what’s not changing is the whole infrastructure around it,” said Mick Posner, senior group product director at Cadence. “Everyone’s going to have some sort of interconnect, like UCIe, and everyone needs some level of security for either chiplet authentication or secure boot. Everyone needs management processes. So we’re developing a platform that allows us to very rapidly deploy a chiplet. If a customer comes to us wanting physical AI, something for the edge, or for automotive, we’ve already announced that we did our tape-out of a physical AI chiplet, which is back in-house. We’re planning to do an extension of that base chiplet, and a Neo AI chiplet, which can be configured to work alongside other CPU chiplets. These are all based on a common framework. So each chiplet has its own management, security, and UCIe connectivity that we will deploy to shorten time-to-market. Standardization has been a bit all over the place. UCIe hasn’t solved the security challenges, and the interface doesn’t solve the problem.”
In some applications, some of this data can be collected through built-in self-test (BiST), which is a well-proven technology. The problem with using BiST in SoCs is that it consumes significant area, which is less of a problem in a package. But its use is limited in always-on applications, because it needs to take over circuitry, which may not be an option in AI data centers. Many of these complex multi-die devices are utilized more heavily than in the past, particularly if they are used for AI training and inferencing, so parts of the device would have to be disabled in order to test them.
In automotive and mil/aero systems, however, this is not a major constraint because testing can occur on shut-down and startup. Now, as automakers push into multi-die systems with chiplets, both to improve yield and shorten time to market with customized features, BiST looks increasingly attractive. Moreover, in safety-critical systems there is a higher level of redundancy required, which allows for testing even while other circuits are operational.
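In rough terms, the scheduling constraint being described might look like the sketch below: automotive and mil/aero parts run the full BiST suite at startup and shutdown, while always-on AI accelerators only test blocks whose traffic a redundant spare is currently carrying. The domain labels and block fields are invented for the example.

```python
# Illustrative BiST scheduling policy based on the constraints described above.
# Domain names and the 'spare_online' flag are assumptions for this sketch.
def schedule_bist(blocks, domain: str, event: str):
    """Return the blocks that are allowed to run BiST for a given event."""
    if domain in ("automotive", "mil_aero") and event in ("startup", "shutdown"):
        return list(blocks)   # whole device is offline, so full coverage is possible
    if domain == "ai_datacenter":
        # Always-on: only test a block if a redundant spare is carrying its traffic.
        return [b for b in blocks if b.get("spare_online")]
    return []                 # default: no in-mission BiST
```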
“People are trying to make sure there is more resilience in the way the logic designs are done,” said Michal Siwinski, chief marketing officer at Arteris. “The closest thing to making sure you actually build safe systems is having the right logic. So first of all, making sure the interconnects are pretty resilient is a step down from the old aerospace approach. That continues to evolve. What people are designing for data centers is actually a subset of that to make sure they have high reliability. So there’s reliability in the chip itself. And then, people build in redundancy, which adds more logic on the chip, and potentially on multiple chiplets. That means more chiplets, because that’s really how people are going to be solving this.”
That redundancy still needs a switch, though, to redirect data traffic when there is a slowdown in signals, and that requires additional monitoring.
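A minimal sketch of that switch logic is shown below, under the assumption that each redundant die-to-die lane exposes latency and eye-margin readings from its monitors. The field names and limits are hypothetical.

```python
# Hypothetical lane-selection logic: steer traffic to the first redundant lane
# whose monitors still report healthy latency and eye margin.
def select_lane(lanes, max_latency_ns: float = 2.0, min_margin_mv: float = 25.0):
    """Return the ID of the first healthy lane, or None if all have degraded."""
    for lane in lanes:
        if lane["latency_ns"] <= max_latency_ns and lane["eye_margin_mv"] >= min_margin_mv:
            return lane["id"]
    return None   # no healthy lane left; escalate to system-level handling
```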
Conclusion
Multi-die assemblies are much more complex than fitting everything into a planar SoC, but there are significant benefits to breaking apart single dies into chiplets. Advanced packages can be designed to accommodate more logic and memory, potentially achieving orders of magnitude better performance with less power. But different workloads for different domains also require much more extensive real-time monitoring of how these complex devices are behaving, to ensure performance remains continuous and optimized, and that these devices continue to function as expected over much longer lifetimes.