Observability Is A Missing Layer In AI-Era Chiplet Design

In next-generation silicon, AI can interpret system behavior at scale, but only if observability is designed into the fabric as a first-class architectural capability.


Key Takeaways:

  • In chiplet-based architectures, observability must be designed as a fabric-aligned, cross-die telemetry plane so architects can correlate traffic, latency, congestion, and fault behavior across package boundaries without losing system context.
  • AI can extract value from high-volume silicon telemetry only when the architecture provides consistent instrumentation, near-sensor reduction, programmable collection, and software-accessible data models that scale from on-die monitors to fleet-level analytics.
  • A viable multi-vendor chiplet ecosystem will require standardized, secure telemetry schemas and access frameworks that let integrators localize faults across die, package, and interconnect domains while protecting sensitive operational data.

Experts At The Table: In-silicon observability—also known as on-die or on-chip visibility—is becoming increasingly important for managing the performance, reliability, and security of today’s high-performance systems. Semiconductor Engineering sat down to discuss on-chip data analytics and resilience with Andy Nightingale, vice president of product management and marketing at Arteris; Ashish Darbari, CEO at Axiomise; Nandan Nayampally, chief commercial officer at Baya Systems; Moshiko Emmer, distinguished engineer in the Silicon Solutions Group at Cadence; Pedro Merlo, manager of strategic planning, D2D and Edge Computing at Keysight EDA; Vikram Karvat, chief operating officer at Movellus; Lee Harrison, director, Tessent automotive IC solutions at Siemens EDA; Randy Fish, product management director at Synopsys; and Satish Radhakrishnan, head of GTM at Vinci. What follows are excerpts of that discussion. To view part one, click here. Part two is here.


Top Row, L-R: Arteris’ Nightingale; Axiomise’s Darbari; Baya Systems’ Nayampally; Cadence’s Emmer; Keysight EDA’s Merlo.
Bottom Row, L-R: Movellus’ Karvat; Siemens EDA’s Harrison; Synopsys’ Fish; Vinci’s Radhakrishnan.

SE: How is artificial intelligence currently being implemented in the processes of collection, analysis, or action? Also, are these functions presently performed by AI, or are other types of software being utilized for these purposes? If AI integration is not yet available, what is the anticipated timeline for its implementation?

Radhakrishnan: AI is increasingly being used in chip design, making both the creation and verification processes faster. It’s applied not only in design verification and rule checking, but also to accelerate the overall chip development cycle. Another area where AI is making a difference is simulation. Think of robots or digital twins, which require speedy and accurate systems to predict outcomes effectively. In the long term, collecting and analyzing data helps move from reactive to predictive approaches. By continually training models, systems can learn and adapt to subtle differences, such as data degradation, allowing agentic AI to act proactively before issues arise. Ultimately, AI will impact all stages — design, validation, and ongoing monitoring.

Darbari: Collection remains largely conventional, and perhaps for a good reason. We need determinism in the capture phase. Deterministic hardware logic and firmware collect defined signals and logs. Nobody wants a ‘smart’ monitor that sometimes decides not to capture a critical error. I believe that AI has a great impact on the analysis phase. For example, we could use it for anomaly detection on high-dimensional telemetry, clustering of field incidents, predicting problem devices or lots from subtle patterns in test and early-life data, and recommending operating points such as per-die voltage optimization on the tester. The action side is still mostly rule-based at the moment. That pattern is likely to persist and increase in usage. In the near term, AI remains a powerful analyst and advisor, particularly in cloud and data center environments. Over the next few years, there will be more AI-assisted controllers, which are likely to be bounded by clear safety envelopes and overseen by traditional control logic. Formal methods would play an interesting role. Formal properties can capture the invariants that AI-derived policies must respect, and these could be validated and verified in a formal tool to provide a way to verify that no data-driven controller can take the system into provably unsafe states.
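As a rough illustration of the "safety envelope" idea Darbari describes, the sketch below has conventional, deterministic logic bound whatever an AI advisor recommends. The `SafetyEnvelope` class, the `bounded_setpoint` function, and the voltage figures are hypothetical and chosen only for this example. In practice the bounds would come from formally verified invariants rather than hand-picked constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyEnvelope:
    """Hypothetical invariant: supply voltage stays within [v_min, v_max] volts
    and changes by at most max_step volts per control update."""
    v_min: float
    v_max: float
    max_step: float


def bounded_setpoint(current_v: float, advised_v: float, env: SafetyEnvelope) -> float:
    """Return the setpoint closest to the AI advice that still respects the envelope.

    Assumes current_v already lies inside [v_min, v_max].
    """
    # Limit the per-update change first, then clamp to the absolute range.
    step = max(-env.max_step, min(env.max_step, advised_v - current_v))
    return max(env.v_min, min(env.v_max, current_v + step))


# Illustrative numbers only.
env = SafetyEnvelope(v_min=0.70, v_max=0.90, max_step=0.02)
print(bounded_setpoint(0.80, 0.60, env))  # aggressive advice is rate-limited
print(bounded_setpoint(0.71, 0.60, env))  # absolute floor applies
```

The advisor can propose anything, but the value applied to the hardware is always one the rule-based layer can justify.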

Nightingale: I agree. We’re seeing agentic AI integrated into the front-end design flow, and AI is increasingly used to manage observability data at scale. The main challenge now isn’t collecting data, but interpreting it. Modern systems produce vast telemetry from numerous probes that can’t be analyzed manually or with static rules. Machine learning helps identify patterns such as anomalies, performance issues, and bottlenecks, often moving toward optimization. Models recommend or even adjust settings, scheduling, and resources. Essentially, AI is augmenting engineering judgment, enabling teams to advance from reactive debugging toward predictive analysis and closed-loop optimization. It’s an exciting field, and I’m glad it’s being discussed.

Harrison: I’ve recently seen an example where, because you have all those different monitor variants, if you build an AI model around that in terms of reliability, it’s possible to start to predict in the context of a data center almost down to the day when the silicon is going to fail. That’s an effect of having all these different monitors, all different types of data. If you were to try to analyze that yourself, it’s just too much to do, so AI starts to look for trends and comparisons. It can start to get down to that level, which is what all the data center companies are looking for. They want to extend the useful life of data center hardware as much as possible.
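The kind of trend analysis Harrison mentions can be sketched in a very reduced form: fit a line to one monitor's drift and extrapolate to a failure threshold. This is a simplified sketch, not the model he describes, which would combine many monitor types. The function name, the path-delay readings, and the threshold are all hypothetical.

```python
def days_until_threshold(days, readings, threshold):
    """Fit a least-squares line to (day, reading) pairs and return how many days
    after the last sample the fitted line reaches `threshold`.

    Returns None if the fitted trend is flat or falling.
    """
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, readings))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # no upward drift, so no predicted crossing
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical path-delay monitor readings (picoseconds) sampled every 30 days.
sample_days = [0, 30, 60, 90]
delay_ps = [100.0, 101.0, 102.0, 103.0]
remaining = days_until_threshold(sample_days, delay_ps, threshold=110.0)
print(f"Estimated days until threshold: {remaining:.0f}")
```

A linear fit on a single signal is the simplest possible stand-in. Real aging behavior is generally nonlinear and workload-dependent, which is part of why the panelists point to learned models over many monitors.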

Fish: The LLMs and AI applied across EDA, as well as post-silicon for production analytics or yield analysis, is absolutely full-on happening. Another interesting area is the small language models, or TinyML, that are deeply embedded in chips. People have used perceptrons or neural net approaches for branch prediction and things like that over the years, and I’m assuming some of the people in the virtual room here may use them for some of their own needs. They are analyzing data on chip in a tiny footprint, whether it’s a hard implementation or whether it’s soft. It’s interesting, and there’s not a whole lot of content out there to consume in the literature.

Nayampally: In traditional design feedback, a lot is going on already. There was a discussion about fleet and cloud, which is the more standard model, where there’s a lot of data that has been scraped and then worked on. What is interesting is the platform. A lot of the production AI that works today on different questions of, ‘Do I understand these performance capabilities? Do I understand this? How do we manage it?’ That’s where there’s a lot more innovation in understanding anomalies. Is it a standards issue, an aging issue, a security issue, or a chiplet issue? Then, of course, at a nanosecond level, where you’re qualifying the workloads to understand the QoS and usability from those stages, everyone is amping up at this point.

Merlo: In the future, if not already, AI is going to orchestrate and manage all these systems. What is this level of insight providing? Think about a swarm of agents that manage data centers across the entire world. It’s a level of control and decision power that is unthinkable. We still don’t know how much these will accomplish. Getting insight from within the silicon and controlling the power grid, or other alternative energy sources, then optimizing the system, is amazing. How it couples into AI is how agentic orchestration and agentic management of this infrastructure is the only way going forward, because the data volume is just going to be massive. There’s not going to be any human behind these logs, this telemetry, to make decisions because the data volume is enormous, and decisions will be made in real-time.

Nightingale: We’re going to be seeing more self-managing chips that prioritize CPU versus accelerator traffic, dynamically detecting hot spots in the network-on-chip, and redistributing the load. To Nandan’s point, if there’s an error in the silicon somewhere, rerouting traffic around that dark silicon, ensuring real-time calls meet the deadlines that they need to meet, and then software adjusting bandwidth allocation based on this observed contention, is all great stuff that we can use agentic AI to manage. ‘Manage this for me. Here’s a skill profile. Go and look at the design and make sure it performs in that way and adapt if it doesn’t.’ There’s very interesting technology coming up.

Merlo: If you throw robotics into the mix, and you’re predicting that the chip is going to go out of life in about a month, you can make sure the robot is there with a backup to swap it out as soon as possible. This is going to happen.

Harrison: This is a silicon lifecycle management instrument, where you have this neat concept of essentially self-healing silicon. If you’re able to do core harvesting and swap out when they fail, then you’ve got this nice terminology, which is, ‘I’ve got a self-healing piece of silicon.’ That’s neat. I could manage all that functionality.

SE: How do we ensure that observability scales with system complexity, especially in multi-die or chiplet architectures? Is the answer AI?

Radhakrishnan: You will say that when you need a system that can handle that — a system that is pre-trained so it can run a task like, ‘Take the entire data and simulate it extremely fast.’ So, yes, you need that system to be able to handle the full complex design. The speed is important. Accuracy is important. So is the size of the system, because it must not be compartmentalized. It can’t do one system, etc., because all the crosstalk and interactions are happening. You need to be able to take everything as a whole. And to the point, yes, you have to make a decision right away and then leave the data, because you can’t keep storing the data. That’s why it has to be something that runs extremely fast, quickly, and accurately. And I’ll throw the S word out there: standards.

Fish: Standards have their advantages and disadvantages. Test has done a pretty good job over the years at standardizing across equipment, as well as DFT tools and things. But the monitor data and sensor data are all over the map. The hyperscalers want to see at least the telemetry standardized. Everything wants to go up to Redfish, but what about the actual formats of how that data is shared? Is there an opportunity to standardize on some things there? That’s a reasonable question, particularly at scale, as you start having multiple vendors using multiple IP solutions in a large system, and they’re assembled by an end user.

Merlo: Telemetry is going to become a first-class citizen in all these complex systems, just like in the past. Obviously, the CPU, the GPU, and the silicon are the focus in the data center. Now look at, for example, cooling, and how important it is in today’s data centers. If you think about systems that are going to space, you don’t hear about complaints about computing. You hear complaints about how to dissipate heat in space. Just like cooling is another first-class citizen, especially in AI data center designs, that pretty much shapes data centers. Telemetry is going to be as important, if not even more important, just because of the sheer size of these data center deployments. It’s so immense that it’s going to be a must. It’s going to be high up in the priority list for all hyperscalers, without question.

Nightingale: If we’re talking about scaling observability, as systems move into multi-die and chiplet-based design, the key requirement is still having a coherent view of a system’s behavior across die boundaries. The approaches that scale best are those that align observability with the communication fabric itself, so you’re effectively following the same paths the data takes, regardless of how many dies are involved. Standards are important there, of course, and that gives you a unified and reusable model of visibility, whether it’s a single SoC or a multi-die system. Once you have large-scale, system-wide telemetry, AI helps interpret the data, detects patterns, and guides optimizations. But it doesn’t solve the fundamental problem of how to instrument and collect the data in a scalable way. AI helps you understand the data, but scaling observability is really an architectural concern. If you don’t have consistent, fabric-aligned visibility across these die, AI doesn’t have much to work with in that context.

Nayampally: What Andy said is very relevant. It’s architectural. And how you create the kinds of things you measure must also be smart. How it is programmably upgradable and scalable becomes part of the architectural design for it. And how you make it accessible, real-time, or through other means, becomes the other piece that we’re looking at very carefully. Obviously, for anything that you need to solve in real time, what is the computation? Is there a next layer of simplification, rather than a lot of data generation? And if there is data generation, then there is compression. We talked about standards — but architecturally, from being the core of the transport and the monitoring across each die, plus across dies — that’s what we’ve been focusing on, and how to make it easily software-accessible or AI-usable, so that it’s more real-time.

Karvat: At scale, it’s the data movement that kills you. As we scale from chips to chiplets to systems, the amount of data generated will grow exponentially. Design teams will need to process, filter, and reduce the raw data near the sensor to minimize the amount of data that moves to an on-die service processor and eventually off-die to the system stack or cloud. The ability to process data at or near the sensor ensures scalability.
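A minimal sketch of the near-sensor reduction Karvat describes follows, assuming a simple fixed-size window and a single alert limit. The `WindowSummary` record and `reduce_near_sensor` function are hypothetical names, and the temperature samples are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class WindowSummary:
    count: int        # raw samples folded into this record
    minimum: float
    maximum: float
    mean: float
    over_limit: int   # samples that exceeded the alert limit


def _summarize(buf: List[float], limit: float) -> WindowSummary:
    return WindowSummary(
        count=len(buf),
        minimum=min(buf),
        maximum=max(buf),
        mean=sum(buf) / len(buf),
        over_limit=sum(1 for v in buf if v > limit),
    )


def reduce_near_sensor(samples: Iterable[float], window: int,
                       limit: float) -> Iterator[WindowSummary]:
    """Collapse each window of raw samples into one summary record, so only
    the summaries need to travel to a service processor or off-die."""
    buf: List[float] = []
    for sample in samples:
        buf.append(sample)
        if len(buf) == window:
            yield _summarize(buf, limit)
            buf = []
    if buf:  # flush a trailing partial window
        yield _summarize(buf, limit)


# Hypothetical temperature samples in degrees Celsius.
raw = [70.0, 71.0, 72.0, 95.0, 70.0, 70.0, 70.0, 70.0]
summaries = list(reduce_near_sensor(raw, window=4, limit=90.0))
for summary in summaries:
    print(summary)
```

Here eight raw samples become two records, and the one excursion above the limit is still visible in `maximum` and `over_limit`. The window size sets the trade-off between data volume and time resolution.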

Darbari: Scaling observability here is fundamentally an architecture and standards problem. You need a telemetry fabric that spans dies, a common schema for lifecycle data, and well-defined observability contracts for each chiplet so integrators get a coherent picture. AI helps once that foundation exists. It can correlate events across dies and highlight emergent patterns that humans would miss. But AI is not a substitute for structure. What closes the loop is combining that structured telemetry with formal, system-level properties. You formally specify cross-die guarantees (for example, around coherency, ordering, safety, and liveness), then design just enough cross-chip observability to monitor those guarantees in the field. AI can tell you, ‘This pattern looks suspicious.’ But only formal can tell you whether that pattern constitutes, or could lead to, a real property violation.
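To make the idea of a common schema concrete, the sketch below defines one record type shared by every die and a simple time-window correlation across dies. It is illustrative only. The field names, the event kinds, and the assumption that all dies timestamp against a shared timebase are hypothetical and do not reflect any published standard.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TelemetryEvent:
    """Hypothetical common record emitted by every chiplet."""
    timestamp_ns: int  # assumes all dies share one timebase
    die_id: str
    source: str        # monitor or probe that produced the event
    kind: str
    value: float


def correlate_across_dies(events: Iterable[TelemetryEvent],
                          window_ns: int) -> List[Tuple[TelemetryEvent, TelemetryEvent]]:
    """Return pairs of events from different dies that occur within window_ns."""
    ordered = sorted(events, key=lambda e: e.timestamp_ns)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.timestamp_ns - first.timestamp_ns > window_ns:
                break  # later events are even further away in time
            if first.die_id != second.die_id:
                pairs.append((first, second))
    return pairs


# Made-up events for illustration.
events = [
    TelemetryEvent(1000, "die0", "d2d_link", "link_retry", 1.0),
    TelemetryEvent(1040, "die1", "d2d_link", "crc_error", 1.0),
    TelemetryEvent(5000, "die0", "thermal", "temperature_c", 82.0),
]
pairs = correlate_across_dies(events, window_ns=100)
for first, second in pairs:
    print(first.die_id, first.kind, "<->", second.die_id, second.kind)
```

Because both dies report in the same structure, the retry on one side and the CRC error on the other can be matched without vendor-specific parsing. That is the sort of coherent picture an integrator would need before any AI-based correlation is applied.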

SE: Do these observability mechanisms have an impact on system performance or behavior? And if they do, how do we avoid that?

Nightingale: The impact is very low on our types of design. Keeping the observation path independent, using filtering to limit what’s captured, and avoiding intrusive modes unless you absolutely need to, ensures that observability reflects system behavior rather than influencing it. Basically, we have a sub-prime network that doesn’t perturb the data that you’re looking at. It’s the Heisenberg uncertainty principle, and we’ve got a different network to spy on or observe what’s going on to mitigate that.

Harrison: We’re doing something similar for our monitors to be completely unintrusive. They’re on a completely separate infrastructure, so they’re effectively passive. Someone brought up the point earlier that the only thing that could be a consideration is the impact on timing closure, but from a purely functional perspective, there’s no impact. There are no cycles that are being stolen from the functional operation. Having that completely independent infrastructure is key.

Fish: There is a broad spectrum of what people are doing in this area. They go from being very trivial, single-temperature sensors, to literally over 100 sense points or thousands of slack monitors, and all on a chip. A lot of that has been over the years. DFT used to be criticized for its area impact all the time, and now it’s looked at as part of design. You need to do it. So the breadth of the monitor infrastructure is looked at, there is a hierarchy, and it’s much more sophisticated than single points. But we’re past that. It’s a metric that is studied, but it doesn’t determine whether or not they will implement monitors.

Darbari: Observability can be intrusive if you are careless. Extra logic on critical paths, high-rate trace streams, and aggressive polling can all perturb the system you are trying to measure. In energy-constrained designs, always-on monitors and telemetry fabrics also consume real power and can affect thermals. Best practice is to tier the mechanisms. You have a low-overhead, always-on layer of health monitors and aggregated counters that are designed into the power and timing budget, and a heavyweight debug layer of rich traces and intrusive test modes used sparingly in bring-up or targeted field diagnostics. Sampling, triggers, and filtering ensure that detailed traces only fire when you actually need them, and telemetry traffic is either out-of-band or lower priority than latency-critical flows. Formal can be applied directly here. You can prove that adding the observability logic preserves key properties, including deadlock-freedom and safety, and you can reason about ‘observability coverage.’ How much of the logic relevant to your properties is actually visible through your monitors? That gives you a quantitative way to trade off performance overhead versus diagnosability, instead of relying on gut feel.
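The tiering Darbari outlines can be sketched in software terms as cheap always-on counters plus a ring buffer whose contents are exported only when a trigger fires. The `TieredMonitor` class, its latency trigger, and the sample latencies are hypothetical. A hardware implementation would look quite different, but the control flow is similar.

```python
from collections import Counter, deque


class TieredMonitor:
    """Always-on counters plus a triggered trace capture (illustrative only)."""

    def __init__(self, trace_depth: int, latency_trigger: int):
        self.counters = Counter()                  # low-overhead, always-on tier
        self._recent = deque(maxlen=trace_depth)   # ring buffer of recent transactions
        self.latency_trigger = latency_trigger
        self.captured_traces = []                  # heavyweight tier, filled on trigger

    def observe(self, txn_id: int, latency: int) -> None:
        self.counters["transactions"] += 1
        self._recent.append((txn_id, latency))
        if latency > self.latency_trigger:
            self.counters["slow_transactions"] += 1
            # Snapshot the recent history only when the trigger fires.
            self.captured_traces.append(list(self._recent))


monitor = TieredMonitor(trace_depth=3, latency_trigger=100)
for txn_id, latency in enumerate([10, 12, 11, 250, 9]):
    monitor.observe(txn_id, latency)

print(dict(monitor.counters))
print(monitor.captured_traces)
```

Most transactions touch only the counters. The detailed history leaves the buffer once, for the single slow transaction, which is the point of gating rich traces behind triggers.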

Merlo: If you think about the cost associated with operating a data center, especially for the hyperscalers, it’s hard to say if the net result is that you save power by using this advanced monitoring, because you can make really smart decisions and optimize the data center. But let’s say you focus exclusively on monitoring the aging of components. Bringing a training cluster down because one node is failing can cost hyperscalers thousands of dollars. So if paying a bit more for better telemetry helps avoid those catastrophic failures, most operators will consider it worthwhile. No one wants an entire cluster to stop because one die can no longer communicate with another.

SE: What would you like observability mechanisms to do that they don’t do today?

Darbari: Three things stand out. First, property-aware telemetry. Most monitors today expose low-level data — temperatures, voltages, counts, and generic error codes. A next step is monitors that are explicitly derived from formal properties. ‘This counter tracks near-misses for a forward-progress property,’ or ‘This flag indicates that the prerequisites for a data-integrity guarantee no longer hold.’ That makes field observability far more meaningful and directly actionable. Second, a bidirectional link between silicon and formal models. Telemetry today is rarely fed back into formal in a systematic way. In an ideal world, real traffic patterns, corner-case sequences, and observed fault modes would parameterize the formal environment and coverage goals. Conversely, formal tools would show you which critical behaviors are invisible to your current monitor set, guiding the next generation of instrumentation. That is a real opportunity for tighter integration between verification and silicon lifecycle management. Third, better support for silent data corruption and cross-layer causality. Silent data corruption is the problem nobody can afford to ignore anymore. Observability should be designed from the start to catch the early symptoms of possible SDC, using checkers and monitors directly instantiated from formal specifications, and to provide the cross-layer checks needed to reconstruct causes across IP blocks and dies. That is where formal and observability together can move the discussion beyond, ‘We saw an error,’ to ‘We know exactly which guarantee failed, why, and how to fix it.’
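One way to picture the property-aware telemetry Darbari asks for is a monitor whose counters are defined relative to a stated property, not raw signal values. The sketch below assumes a hypothetical forward-progress property of the form "every request is granted within `max_wait_cycles`". The class name, the thresholds, and the wait values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ForwardProgressMonitor:
    """Counts near-misses and violations of a hypothetical forward-progress bound."""
    max_wait_cycles: int   # bound stated by the property
    near_miss_cycles: int  # waits at or above this, but within the bound, are near-misses
    near_misses: int = 0
    violations: int = 0

    def record_wait(self, wait_cycles: int) -> None:
        if wait_cycles > self.max_wait_cycles:
            self.violations += 1
        elif wait_cycles >= self.near_miss_cycles:
            self.near_misses += 1


fp_monitor = ForwardProgressMonitor(max_wait_cycles=64, near_miss_cycles=48)
for wait in [5, 50, 63, 70, 12]:
    fp_monitor.record_wait(wait)

print(fp_monitor.near_misses, fp_monitor.violations)
```

A rising near-miss count tells an operator which guarantee is under pressure before it actually fails, which a raw latency histogram alone would not state.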

Karvat: For us, it’s not about additional capabilities. We want to see the industry move toward open standards. No single vendor can generate and process all the data needed for the chips of tomorrow. An open framework will: a) result in an improved customer experience by using best-of-breed components; b) protect sensitive telemetry using a structured security schema; and c) ensure interoperability across management stacks.

Merlo: If this were standardized, there would be much faster progress, much better solutions, and much more scalable solutions out there.

Harrison: I agree. Quite often, when we talk to customers, we talk about silicon lifecycle management, but this is a broad topic, and sometimes other people’s definition of silicon lifecycle management could be completely different from your own. Standardizing on exactly what this all means, and a common direction — especially with the hyperscalers, because they’re really pushing for some common strategy on this — is a direction that should be driven a bit harder.

Radhakrishnan: TSMC and other foundries are developing 3D chip solutions that require secure communication between system components from different companies. To enable decision-making by agents or AI, these systems need encrypted mechanisms at the component and system level, supporting standards for collective operation. Since issues like cracks or hotspots can impact multiple chips through crosstalk, it’s essential to establish methods for foundries and system components to communicate and resolve problems efficiently.

Fish: I’m a little afraid of pointing the finger at myself, because if I say we need this, it means we don’t have it. But for the security or privacy of data at the monitor, at first blush you’d think, who cares much about temperature data or voltage and stuff? Or transactional data. For some of you who have monitors, they’re hypersensitive about that data. In automotive and in the data center, everybody’s sensitive about this data. How and where you protect it consistently across all of us is something that really has to happen.

Nayampally: I’m going to go off a slightly non-fabric end. Again, Ann extended the definition of on-die visibility to in-package visibility, especially in the chiplet world. Things can look electrically healthy. They can look fine in isolation. But they break somewhere in between. Is it the bump? Is it the substrate? Those things are still unclear. And in fact, that’s where the whole multi-scale, multi-domain modeling is trying to go, but that’s not translating yet into what we can do in real time in the physical world. Maybe the analysis there helps you start triangulating how to get to that. But if we do that, we truly unlock the chiplet market, because you’re not pointing fingers or not taking indemnity for things that may or may not be your fault. You can isolate those things earlier. That’s probably the one thing that’s preventing a true merchant chiplet marketplace, apart from the usual liability of who puts it together.

Nightingale: Today, a lot of observability tells you what happened, but not always why, and this is always a question that our customers come back to us with. So there’s an opportunity here for us to correlate traffic behavior with workload, with software state, on system configuration, be it on-die or off-die, so the data becomes more meaningful and easier to act upon. That’s something our customers are asking us for today. ‘We can see what happened, but we’re not that much closer to working out how to fix it.’ Adding extra context is a good focus, as well as everything else we talked about.


