Observability Is Essential For Modern Silicon

What on-die visibility reveals, and why it’s especially important for AI, automotive, aerospace, and advanced packaging.

Experts At The Table: In-silicon observability — also known as on-die or on-chip visibility — is becoming increasingly important for managing the performance, reliability, and security of today’s high-performance systems. Semiconductor Engineering sat down to discuss this with Andy Nightingale, vice president of product management and marketing at Arteris; Nandan Nayampally, chief commercial officer at Baya Systems; Moshiko Emmer, distinguished engineer in the Silicon Solutions Group at Cadence; Pedro Merlo, manager of strategic planning, D2D and Edge Computing at Keysight EDA; Vikram Karvat, chief operating officer at Movellus; Lee Harrison, director, Tessent automotive IC solutions at Siemens EDA; Randy Fish, product management director at Synopsys; and Satish Radhakrishnan, head of GTM at Vinci. What follows are excerpts of that discussion.

Clockwise from top left: Arteris’ Nightingale; Baya Systems’ Nayampally; Cadence’s Emmer; Keysight EDA’s Merlo; Vinci’s Radhakrishnan; Synopsys’ Fish; Siemens EDA’s Harrison; Movellus’ Karvat.

SE: Why should on-die visibility be included in designs today? What are some recent, real-world examples?

Nightingale: Optimization is one of the key reasons. Once you have a level of visibility on the die, you can start to act on it.

Nayampally: Apart from optimization and efficiency gain, you have a regular need, as workloads constantly change, to be able to understand and adapt. Visibility is a strong part of it — not just the reliability debug story, but also from a security perspective with anomalous behavior. We look at it from those multiple angles.

Emmer: Nandan noted many aspects of on-die visibility. [With the growth of chiplets], this challenge becomes even bigger because you don’t necessarily have one fabric or one chain where you can see all internal states of the silicon. You must coordinate multiple silicon dies. Maybe not all of them are owned by you, and in a big ecosystem there are many controllability and visibility problems. In terms of real-life examples, there are two good examples I can think of. First, aerospace and defense, which are mostly focused on security parts in that sense, beyond optimization. Second, automotive, which obviously looks more toward the safety and reliability because a car must always operate correctly, no matter what.

Merlo: With on-die observability, everybody will agree that chiplet designs are extremely complex. It’s very challenging to qualify designs and then test during manufacturing. And the problems don’t end there. Once the silicon is deployed in the field, many things can go wrong. Continuous testing and monitoring of the silicon while it operates in real-world conditions is a must in these critical systems that we see are being built all around the world, and soon also out of the world in space.

Harrison: I’m quite heavily involved in the automotive space, and having on-die visibility is key to traceability within that whole automotive supply chain to avoid counterfeit products, repaired products, and non-genuine articles ending up in vehicles and making them unsafe. Traceability is essential across the automotive supply chain.

Fish: It’s all about the system. The classic examples we’re all citing now are these big training workloads, and so the visibility is across that whole space. What does a compute system look like? It’s a fabric that includes a lot. That scope is interesting. When you talk about sensing or monitoring, there’s a real level of definition needed. Some of us believe that DFT or test is a form of monitoring. It’s very constrained in some ways, and deterministic, but the data is very interesting, not just during manufacturing but during in-field use as well. As far as use cases, things like PVT sensors have been in place for many years, and virtually all finFET designs have something like them. Key applications are probably things like AVS or DVFS, and most companies are doing something there. The automotive example Lee mentioned mainly illustrates basic use cases, namely providing visibility into what’s happening inside the chip. What is the real mission profile? You know what you estimated early on, and you know that the workload is now different. So being able to see temperature or voltages or glitches across the life of that product has value. Even before any analysis of that real-world data, just the ability to have it, to see it, and to understand it is valuable.

Radhakrishnan: If you look at any current package or system, they’re called heterogeneous integrated systems, because they’re all 2.5D or 3D packages, chiplets, chip GPUs, HBM. Everything is coming from different companies, and each is designed by a single company on its own. But when they all come together, you have this cross-talk — multiple interactions between what’s happening in one chip and other chips. That’s another reason why you have designed one thing, but how it’s going to be used is going to be completely different. You want to be able to predict what’s happening in your chip or in the entire system when you’re looking from a data center or something. Ideally, you want to be in a predictive mode to see what’s going to happen here so you can make some decisions on it and not be in a reactive mode. And because not everything is in your control, you want to be able to find out what’s happening in the system, because you have designed it, and you want to be able to protect it, etc.

Karvat: Board and system-level visibility has an important role to play in the overall platform management stack. However, on-die visibility provides data with a high degree of spatial and temporal granularity, and in many cases, important ‘signals’ may simply get attenuated/aggregated out as you move from on-die to package to board levels, making these events ‘invisible.’ Real-world use cases include things like reactive droop mitigation, Vmin search, PDN optimization, and PDN-related debug.

SE: Are there different considerations for on-die visibility for chips versus chiplets versus the system?

Fish: For chiplet versus system, what you’re monitoring may be different for chiplets. Something we’re involved in is monitoring the interconnect. In things like UCIe, you’re not just testing or repairing, you’re also monitoring the degradation of the signal, or a partial view of the eye. That’s something you’re doing between dies. Then, you can identify trends or infer from that when a failure may happen and mitigate it somehow. The system level is interesting and having a coherent fabric that goes from chiplet to a heterogeneous chip to a system — and whether it’s Open Compute or somebody else driving that — that’s still an open issue for all of us. How do you really share data at the broad system level?
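As an editorial illustration of the trend-based approach Fish describes, here is a minimal Python sketch that fits a straight line to periodic eye-margin readings from a die-to-die link and estimates how long until the margin reaches a failure threshold. Everything in it is an assumption made for illustration: the function names, the units (hours and millivolts), the linear degradation model, and the sample values are hypothetical and do not come from the panel or from any UCIe tooling.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples, with matching lengths")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def hours_until_threshold(
    times_h: Sequence[float],
    margins_mv: Sequence[float],
    threshold_mv: float,
) -> Optional[float]:
    """Estimate hours from the last sample until the fitted margin hits the threshold.

    Returns None when the fitted trend is flat or improving (no crossing predicted).
    Assumes linear degradation, which is a simplification for illustration only.
    """
    slope, intercept = fit_line(times_h, margins_mv)
    if slope >= 0:
        return None
    crossing_time = (threshold_mv - intercept) / slope
    return max(0.0, crossing_time - times_h[-1])


# Hypothetical readings: eye-height margin in mV, sampled every 100 hours.
remaining = hours_until_threshold([0, 100, 200, 300], [40.0, 38.0, 36.0, 34.0], 30.0)
```

A real monitor would likely use a more robust model and account for measurement noise, but the sketch shows the basic idea of turning in-field link measurements into an early warning rather than waiting for a hard failure.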

Nightingale: In terms of systems moving to multi-die, the challenge is consistency. To maintain visibility across these boundaries, approaches that align observability with the communication fabric tend to scale more naturally because they follow the same data path regardless of whether the system is on a single die or multiple dies, which is really what enables reuse across different classes of systems.

Emmer: I will take a different aspect of that. One thing we must think of and take care of, which previously was done only after silicon, is the package side and the integration itself. Multi-physics adds a lot of new challenges, whether it’s how you craft dies side-by-side, or one on top of the other, in all the different types of integration. So even if the UCIe works perfectly, that integration level is very important for the reliability of the system. There are many new concerns that you need to be able to monitor and make decisions upon. It can be multi-physics related, thermal, and mechanical. It depends on the application, but there are many aspects to that. This is something you also have to think about ahead of time. If you take the system level, in a world where there will be multiple chiplet options and maybe multiple vendors providing such chiplets for others to integrate at the pre-design stage of the entire system, thinking of everything ahead may not necessarily exist, and you have to compromise with whatever you’re getting from a third-party chiplet. This is something that brings, again, new challenges of how to manage it, how to work with it, and so on. And the optimization point, obviously, can be different. It’s one thing to play with the timing margins that you have by tuning the frequency and the voltage on one die versus doing it in some orchestrated manner in multiple dies, when you have multiple dies in the system that share that same power budget.

Nayampally: You raised a good point here. Effectively, the EDA folks are also going into this multi-scale, multi-domain modeling to try and get a much bigger view of the system, which is necessary because we have moved from a monolithic chip to multiple chiplets, and potentially beyond in terms of capabilities. That makes the challenge that much more intense. Some interfaces have standards, especially when going across dies, and some don’t. In reference to what Andy said, from Baya’s standpoint, the fabric is consistent, and the telemetry that you put in is consistent, which will help you understand performance optimization and the traditional debug and trace that are seen on-die, as well as what happens across dies. With the right monitoring setup, if there is an issue when you go across chiplets, you can ask whether it is an ESD event or some other anomalous behavior. The complexity is going up, and you need to have tools to address it.

Merlo: Circling back to what Randy said about chiplets and on-die monitoring time, I’m thinking about the tiniest parts of the system. Can these parts communicate with each other? Are they communicating well? You monitor different things, and you try to uncover different problems. When we talk about chiplets, it’s more about the communication between the parts that compose the chip. When we talk about the chip itself, it could be temperature. Now we’re looking at it maybe at a PCB level. Is the chip operating well, or as expected, as part of the system? And then, if you extend the visibility to the system itself, you now define the system. Is it a PCB? Is it the entire data center? Having this distributed insight across hundreds of thousands of racks unlocks a level of insight that was never available before. So depending on what problems and optimizations you’re going after, that will define where to focus.

Harrison: Taking a slightly different angle on this, focusing more on the identity, if we think about an automotive example, and a system being an automotive ECU, you may have a common ECU that goes into many different vehicles. But that ECU might have different silicon from different vendors. Then, within that silicon, you could find that the silicon has chiplets from different vendors. In terms of optimization and making sure the software you’re running on these ECUs is the most efficient for those particular ECUs, you can pretty much customize your software almost down to an individual ECU. Based on the fact that you’ve got the identity of all the individual SoCs in the ECU, and all the individual dies within those SoCs, you could have many different software versions that are optimized for each of those. Because this is a challenge for the automotive market, they tend to second-source a lot of these things, so tying up the right software version with the right hardware version is always a challenge. Identity and visibility are critical.
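To make Harrison's point about matching software to hardware identity concrete, the following Python sketch looks up a software build from the set of die identities read back from an ECU. It is an illustrative assumption, not a description of any vendor's flow: the `DieId` fields, the catalog contents, and the build names are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical die identity: (vendor, part, revision) as read from on-die ID registers.
DieId = Tuple[str, str, str]

# Hypothetical catalog mapping a specific combination of dies to an optimized build.
CATALOG: Dict[FrozenSet[DieId], str] = {
    frozenset({("vendorA", "cpu-die", "r2"), ("vendorB", "io-die", "r1")}): "ecu-fw-2.1-ab",
    frozenset({("vendorA", "cpu-die", "r2"), ("vendorC", "io-die", "r3")}): "ecu-fw-2.1-ac",
}


def select_build(
    dies: Iterable[DieId],
    catalog: Dict[FrozenSet[DieId], str],
    fallback: str = "ecu-fw-generic",
) -> str:
    """Return the build tuned for this exact die combination, else a generic one."""
    # frozenset makes the lookup independent of the order in which dies are read.
    return catalog.get(frozenset(dies), fallback)


# The same ECU design with a second-sourced I/O die resolves to a different build.
build = select_build([("vendorC", "io-die", "r3"), ("vendorA", "cpu-die", "r2")], CATALOG)
```

Falling back to a generic build when the combination is unknown is one possible policy. A safety-critical flow might instead reject an unrecognized identity outright, which also serves the counterfeit-detection goal raised earlier in the discussion.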


