Designing Chips That Can Explain Themselves

On-die monitors, localized analytics, and lifecycle data are giving architects new ways to close the gap between design intent and silicon behavior.


Key Takeaways:

  • On-die telemetry gives architects a path to replace worst-case design margin with measured silicon behavior, improving PPA without compromising resilience.
  • As monitor density and control-loop speed increase, observability must be architected hierarchically across local hardware response, on-die processing, and fleet-level learning.
  • The real payoff is architectural: stronger model-to-silicon correlation, better post-silicon adaptation, and a tighter feedback loop between deployment data and next-generation design decisions.

Experts At The Table: Semiconductor Engineering sat down to discuss on-chip data analytics and resilience with Andy Nightingale, vice president of product management and marketing at Arteris; Nandan Nayampally, chief commercial officer at Baya Systems; Moshiko Emmer, distinguished engineer in the Silicon Solutions Group at Cadence; Pedro Merlo, manager of strategic planning, D2D and Edge Computing at Keysight EDA; Vikram Karvat, chief operating officer at Movellus; Lee Harrison, director, Tessent automotive IC solutions at Siemens EDA; Randy Fish, product management director at Synopsys; and Satish Radhakrishnan, head of GTM at Vinci. What follows are excerpts of that discussion. To view part one, click here.


Clockwise from top left: Arteris’ Nightingale; Baya Systems’ Nayampally; Cadence’s Emmer; Keysight EDA’s Merlo; Vinci’s Radhakrishnan; Synopsys’ Fish; Siemens EDA’s Harrison; Movellus’ Karvat.

SE: Today, when the data is gathered through on-die or in-system monitors, where does it go? How does it get analyzed and then optimized?

Karvat: We think about this in terms of data volume and latency. With the volume of data that sensors can generate (hundreds of GB/s) — forget about moving the data around on-die for processing — there needs to be near-sensor processing and filtering. The filtered (or reduced) data can then be moved more easily around the chip for additional analysis. Moving data off-chip poses a different set of issues, ranging from data ownership to security. The second issue is latency. For real-time or near real-time operations, the data must be analyzed on the die — and in some cases almost instantaneously — to take action. Droop mitigation, thermal events, and DFS operate on the scale of nanoseconds to a few microseconds. Data movement to an external analytics stack only makes sense if it’s for offline analysis, the data volume is relatively modest, and security and ownership issues have been addressed.
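As a rough illustration of the near-sensor filtering Karvat describes, the Python sketch below collapses each fixed-size window of raw samples into one summary record with an alarm flag, so only the reduced records need to move across the chip. The names `WindowSummary` and `summarize`, the window size, and the threshold are assumptions made for this example. In real silicon this logic would sit in hardware next to the sensor rather than in software.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class WindowSummary:
    """Reduced record that leaves the sensor instead of the raw samples."""
    start_index: int   # index of the first raw sample in the window
    minimum: float
    maximum: float
    mean: float
    alarm: bool        # True if any sample fell below the droop threshold


def _reduce(start: int, buf: List[float], droop_threshold: float) -> WindowSummary:
    lowest = min(buf)
    return WindowSummary(start, lowest, max(buf), sum(buf) / len(buf),
                         lowest < droop_threshold)


def summarize(samples: Iterable[float], window: int,
              droop_threshold: float) -> Iterator[WindowSummary]:
    """Yield one summary per `window` raw samples (hypothetical reduction policy)."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf: List[float] = []
    start = 0
    for i, value in enumerate(samples):
        if not buf:
            start = i
        buf.append(value)
        if len(buf) == window:
            yield _reduce(start, buf, droop_threshold)
            buf = []
    if buf:  # flush a partial final window
        yield _reduce(start, buf, droop_threshold)


# Eight raw voltage samples (volts, made-up values) become two records.
raw = [0.75, 0.74, 0.76, 0.75, 0.70, 0.75, 0.75, 0.75]
for record in summarize(raw, window=4, droop_threshold=0.72):
    print(record)
```

With a window of four, the data leaving the sensor shrinks from eight samples to two records, and the second record carries the alarm for the 0.70 V dip. The real reduction ratio and the choice of statistics would depend on the monitor and the control loop consuming it.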

Nightingale: Once you have a level of visibility, you can start to act on it. From our experience, teams use it to tune quality-of-service policies. They can rebalance workloads and refine scheduling decisions. It’s less about a single optimization and more about enabling informed tradeoffs between performance, power, and resource allocation. These are techniques that have been well understood in some domains and are now being applied more broadly. So QoS tuning, congestion control, latency bounding, feedback loops: all these things can now be applied at the system level.

Emmer: There are two main areas to consider here: optimization at the silicon validation stage, and optimization that can be done once the product is already in the customer’s hands. It’s common knowledge that designs always include guard bands, since not all chiplets, dies, or pieces of silicon are created equal. We must account for process variations in technology, which opens significant opportunities for further optimization. For example, when examining frequency targets and the new power-performance equations, there’s additional room for improvement, not just during silicon validation but also in real-world field applications. When dealing with systems comprising multiple chiplets, if you extend these optimizations beyond silicon validation to ongoing management in the field, you have to carefully consider how the chiplets interact. The operating flexibility differs depending on whether a chiplet functions alone or alongside others, because resources like the power budget must be shared. This scenario presents both challenges and opportunities, as balancing these factors can lead to greater overall efficiency.

Nayampally: That’s a solid starting point for me. If I begin with the question, ‘where does it go,’ traditionally, data ends up in trace buffers and RAMs, or in counters and registers, during the trace process. Performance tuning typically involves ongoing routines that manage this information, polling relevant counters. This approach is mostly batch-oriented. Certain events trigger counters, which is what Moshiko referred to. However, we’re seeing new possibilities for real-time capabilities. You can create software routines that overlay these processes and operate in real time, rather than relying on passive mechanisms. As you bring more data off-chip into your own AI repository, you’re essentially enabling field models to improve as they learn. Performance isn’t just a one-dimensional matter. For example, when errors occur in the field, you must find ways to work around them. When specific models behave in certain ways, such as causing thermal effects, you can redirect operations to manage thermals actively rather than letting issues persist. Telemetry and control give you the ability to build intelligence atop existing systems for continuous improvement, which also benefits aging and other reliability features over time.

Radhakrishnan: I agree with that perspective. Essentially, any system involved must have an operating system or equivalent component that can receive commands and process data. If the incoming data can be immediately read and analyzed using appropriate software, it enables the system to make timely decisions without needing to store large amounts of information, which is also important for data privacy. The process should work like this: as new data comes in, the system evaluates it right away and decides whether to act or not. This approach resembles a digital twin. It operates in real time, makes predictions, and only responds if there’s an issue. The core requirement is that whatever system handles these inputs must operate extremely quickly. This ensures that potential problems, like thermal issues, reliability concerns, or abnormal current spikes, can be identified early, allowing for predictive interventions.
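The evaluate-on-arrival pattern Radhakrishnan outlines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a running average stands in for the prediction a digital twin would supply, and the class name `StreamingMonitor`, the smoothing factor, and the tolerance are all hypothetical. The point is that the monitor keeps a single number of state and never stores the history.

```python
from typing import Optional


class StreamingMonitor:
    """Flags a reading the moment it departs from the expected value.

    Holds one float of state, so no raw history is retained.
    """

    def __init__(self, alpha: float = 0.1, tolerance: float = 5.0) -> None:
        self.alpha = alpha            # smoothing factor for the expectation
        self.tolerance = tolerance    # allowed deviation before acting
        self.expected: Optional[float] = None

    def observe(self, value: float) -> bool:
        """Return True if the system should act on this reading."""
        if self.expected is None:     # first sample seeds the expectation
            self.expected = value
            return False
        if abs(value - self.expected) > self.tolerance:
            return True               # anomaly: act, and do not learn from it
        # Normal reading: nudge the expectation toward it.
        self.expected += self.alpha * (value - self.expected)
        return False


# Made-up temperature readings in degrees C; only the spike triggers action.
monitor = StreamingMonitor(alpha=0.5, tolerance=5.0)
flags = [monitor.observe(t) for t in [60.0, 61.0, 60.0, 75.0, 61.0]]
print(flags)  # [False, False, False, True, False]
```

Skipping the update on anomalous readings keeps one spike from dragging the expectation upward, at the cost of never adapting to a genuine, lasting shift. A production system would need a policy for that case.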

Merlo: If I approach this question from the testing angle and look at the solutions out there for chiplet validation or wafer testing, most of the testing relies on external access. You test the interfaces of these components. There are specs and guard bands, so maybe you are designing in some margin to cover potential issues in the field. On-die monitoring will let you make those guard bands as thin as possible while still meeting the requirements for data transmission and being very power-efficient, or whatever your definition of efficiency is. It’s a huge step between testing these components at the boundary versus testing them from within. That’s one thing. The other is that when we think about the test conditions, we always try to replicate real-world stresses. We use thermal chambers to emulate different workloads, but it’s never quite the same. With SLM (silicon lifecycle management) becoming more powerful, we get to a level of insight in real-world conditions that was unthinkable before. And this is what matters. And then, to Satish’s point, this is the data that is then fed back to the digital twin. So your digital twin becomes better, you become smarter, and it’s the flywheel effect that we’ve all heard about.

Harrison: The increase in the number of monitors is changing the way that the data is handled. What we’re seeing now is people using more and different types of monitors. They’re putting more monitors into the design, so the amount of data that we’re able to extract is becoming huge. There’s an increasing need to do more processing on the edge, on the die itself. And as we’ve heard, that’s one type of approach. It’s taking immediate effect, and it’s reacting to exactly what’s happening on that particular die. Then we look at that data, take the important pieces of that data, and that’s the bit that gets put up into the cloud. Then you’re doing more fleet analysis on the data. There are two distinct types of analysis you’re doing. One is keeping the die that you’re using running efficiently. The other is looking more at the long-term effects and reliability. We’re seeing more need for localized data processing, and there’s even talk about using AI at that level. I know everybody wants to use AI for everything, but because of the amount of processing that’s going on at the die level, there’s a good case to make for that, as well.

Fish: From the in-field perspective, some people might call them management control processors, or system control processors. They’re deeply embedded. They’re invisible to the end user. They’ve been there for years, and they’re becoming increasingly prevalent. You want to compute data, analyze it, and turn it into insights as soon as possible. You don’t want to drag data around. Some of it can be done on-chip. Some of it has extremely low latency requirements, such as detectors whose output needs to be analyzed immediately, so that’s probably done in hardware as opposed to firmware. And then some happens on the application processor itself. There might be some power management there in some cases. But then, as you move back to manufacturing, we gather data during the manufacturing phase, and a lot of that analysis is done in the cloud across a large population for yield diagnostics or quality metrics. And then the other area where we analyze data is during design. It’s odd because with SLM, when you talk about monitors, you are in some sense post-silicon, yet that data is brought back into the design world for analysis for things like gap to target, or silicon to SPICE. Some people call it model-hardware correlation. This is where your silicon doesn’t necessarily match what you designed to, and being able to close that gap and refine it is heavily dependent on latency requirements and overall volume.

Darbari: On-die monitors feed a layered pipeline: silicon to firmware/OS to fleet-level analytics. On the chip, PVT, margin, error, and protocol/traffic monitors generate counters, histograms, and event traces. Most of this is aggregated locally rather than streamed in raw form. System software exposes this via MMIO, management controllers, debug/trace ports, or system management buses, and logs it into lifecycle or observability backends. Analytics stacks mine that data across devices and workloads to identify hot spots, marginal operating conditions, and systematic inefficiencies. From there, you see two feedback loops — a fast, local loop where firmware adjusts DVFS, throttling, routing, and redundancy based on live telemetry, and a slower lifecycle loop, where fleet data drives new guard bands, firmware policies, and even micro-architectural changes in the next tapeout. I see great potential for formal methods here in the analysis phase. Right now, most of this is ‘metrics-driven’ rather than ‘spec-driven.’ What is really needed is telemetry that is explicitly tied back to requirements already known or refined through a training and inference loop from AI agents to generate formal properties and invariants, so the analytics are not just asking, ‘What is the temperature?’ They’re asking, ‘Are we drifting toward a provable violation of a safety, coherency, or data-integrity property?’
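One way to read Darbari's distinction between asking for a value and asking about drift toward a violation is as a trend check against a stated limit. The Python sketch below is a simple stand-in, not a formal method: it fits a straight line to a monitored margin and estimates how long until the margin reaches the limit that a property requires. The function name `time_to_violation`, the linear-trend assumption, and the sample values are all assumptions for illustration.

```python
from typing import Optional, Sequence


def time_to_violation(times: Sequence[float], margins: Sequence[float],
                      limit: float) -> Optional[float]:
    """Estimate the time after the last sample until `margins` falls to `limit`.

    Uses an ordinary least-squares line through the samples. Returns None
    when there are too few samples or the margin is not trending downward.
    """
    n = len(times)
    if n < 2 or n != len(margins):
        return None
    mean_t = sum(times) / n
    mean_m = sum(margins) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (m - mean_m)
                for t, m in zip(times, margins)) / var_t
    if slope >= 0:
        return None                      # stable or improving: no projected violation
    intercept = mean_m - slope * mean_t
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - times[-1])


# Made-up timing margin in picoseconds, sampled once per time unit.
remaining = time_to_violation([0, 1, 2, 3], [50, 45, 40, 35], limit=20)
print(remaining)  # 3.0
```

Here the margin is still 15 ps above the limit at the last sample, but the trend projects a crossing three time units later, which is the kind of early signal a threshold on the raw value alone would miss.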

SE: Are there concerns about the added area cost for on-die visibility monitors and sensors as we continue increasing capabilities, monitors, and data processing?

Harrison: It really depends on who you talk to. There are people out there who are quite gung-ho about putting monitors everywhere, and it’s not an issue. But we’re still in a learning phase. What we’ll see over time is that once there’s more data out there, we will start to see a bit more refinement and a little bit more optimization in the way those monitors are placed, and the types of monitors that are used. Today, there are no hard and fast rules, as in, ‘This is the application. These are the monitors we should be using.’ It’s more a case of, ‘Let’s just put down as many monitors as we possibly can.’ But going forward, there’s going to be some consolidation and optimization in the way we do things, which is going to be good for everybody in terms of results.

Fish: On the question of whether we get hit with area impact questions, I would suppose that all of us get asked something about that. But what’s more disruptive is timing closure, power impact, all these other things. People sometimes want to put temperature sensors as close to the hot spots as possible, and those hot spots are often very dense logic. So you don’t want to disrupt the timing closure or other closures needed there. There is a lot of analysis done during the design phase to enable the appropriate monitor placement without disrupting the area or the overall closure of the design.

Merlo: Randy, I have a question. I understand that many of these monitors can either remain ‘on’ continuously or be activated on demand. Which approach is more commonly used in practice?

Fish: For things like temperature sensors, people are going to have thermal sensors ‘on’ almost all the time during mission mode because they’re really worried about that. But for some things, you may have ring oscillators, or what we would call process detectors, that you selectively turn ‘on’ or ‘off.’ And you may have some you leave on all the time to study aging. You want it running all the time to compare it against the baseline. There are also ones that you just compare infrequently. And so it is a question. Do you need to power things on and off?

Darbari: Area is always something chip designers and architects are looking to optimize. But the industry has quietly moved from, ‘Are monitors worth the area?’ to ‘What is the minimum observability infrastructure we can get away with?’ Some monitors are now treated as non-negotiable infrastructure, on a par with PLLs and DFT. Without PVT, margin, and basic functional health monitors, you cannot sensibly manage variability, reliability, or safety in advanced nodes. The real concern is the uncontrolled growth of nice-to-have observability — extra sensors, per-link counters, trace buffers, and telemetry fabrics that bloat area, power, and routing. The answer is to make monitoring selective and model-driven rather than opportunistic. That is where formal helps. Proof coverage and cone-of-influence analysis can tell you which parts of the design matter most to your key properties, and therefore where you truly need observability. Instead of sprinkling monitors everywhere, you place them where they give you maximum leverage against mathematically defined correctness and safety goals.

Merlo: Going back to the question, it’s all about the ROI. If you’re going to pay a slight area or power penalty, is it worth it or not? Are you getting the right information from your system so you can optimize it? Or are you over-measuring? I agree with Lee that as these systems are deployed, there will be a tendency to overmeasure. But then, we expect to see some consolidation, or the industry learning what is worth monitoring more closely, more often, versus longer-term effects.

Nayampally: When you consider what we’re trying to address, basic digital monitors, performance monitors, counters, and similar devices are relatively inexpensive. The real cost lies with specific sensors that require more analog components and memory. This is particularly relevant for AI data center infrastructure, where reliability and lifespan are crucial factors driving demand for advanced solutions. In these cases, it’s important to maximize capability while keeping RAM costs as low as possible. Real-time computation for tasks, as previously discussed, becomes increasingly vital. If you examine chip development expenses, building a new chip can cost hundreds of millions of dollars. Therefore, a slight increase in the per-chip price isn’t as significant as the risk of upfront failure, which can be much more costly.

Karvat: Design teams don’t implement visibility for visibility’s sake. It’s usually to improve another vector, such as power, performance, or reliability. Our latest on-die voltage telemetry platform is a good example. It gives designers a real measurement of on-die voltage behavior, which tightens guard bands that would otherwise be set conservatively, recovering performance or power. There is a minor area impact to including the sensors, but in today’s AI accelerator race, power and performance can trump minor area increases.
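The guard-band argument Karvat makes comes down to simple arithmetic, sketched below in Python. All numbers are invented for illustration, and the function names `measured_guard_band_mv` and `recovered_margin_mv` are hypothetical. The sketch assumes the guard band is the worst observed droop below nominal plus a fixed safety allowance, which is a simplification of how real margins are set.

```python
from typing import Sequence


def measured_guard_band_mv(nominal_mv: int, samples_mv: Sequence[int],
                           safety_mv: int) -> int:
    """Guard band from telemetry: worst observed droop plus a safety allowance."""
    if not samples_mv:
        raise ValueError("need at least one voltage sample")
    worst_droop = nominal_mv - min(samples_mv)
    return max(0, worst_droop) + safety_mv


def recovered_margin_mv(conservative_mv: int, measured_mv: int) -> int:
    """Margin freed up by replacing a conservative guard band with a measured one."""
    return max(0, conservative_mv - measured_mv)


# Hypothetical values in millivolts.
nominal = 750
telemetry = [748, 741, 715, 739, 746]      # lowest reading is 715 mV
band = measured_guard_band_mv(nominal, telemetry, safety_mv=10)
print(band)                                 # 45
print(recovered_margin_mv(75, band))        # 30
```

In this made-up case, a 75 mV conservative guard band shrinks to 45 mV once the real droop is measured, and the 30 mV difference can be spent on lower supply voltage or higher frequency.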

Radhakrishnan: The main issue is determining how much space you’ll allocate for each component or IP. If you can accurately simulate everything using a digital twin — with all the details included — the value increases significantly. This approach benefits both sides. If you have complete data and visibility at every location, you get a comprehensive understanding of the space.
