AI agents can be used to identify potential issues during operation and react before it’s too late.
Chipmakers are starting to use AI to manage data that is mined from different “dashboards,” many of which are already embedded in chips and systems and used to monitor everything from thermal gradients to voltage droop.
These dashboards are typically controlled by some type of processor, such as a CPU or MCU, and in most cases they are invisible to the user. But they are critical for tracking changes in low-level data generated by different blocks, sensors, and I/Os, triggering alerts, and making automatic adjustments when necessary, sometimes in fractions of a second. If a processor core is running too hot, for example, work can be shifted to another processing element to balance the load and reduce the heat. Or if one data lane to an HBM stack is blocked or running too slowly due to electromigration, signals can be rerouted through another lane.
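The monitor-and-react behavior described above can be sketched in a few lines. The Python below is a minimal illustration, not any vendor's firmware. The 95°C limit, the `Core` record, and the `rebalance` and `pick_lane` helpers are all hypothetical, and temperatures are treated as a single snapshot rather than a live sensor feed.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical thermal limit; real limits are product- and node-specific.
THERMAL_LIMIT_C = 95.0


@dataclass
class Core:
    name: str
    temp_c: float                              # latest sensor reading (snapshot)
    tasks: list = field(default_factory=list)


def rebalance(cores: list, limit_c: float = THERMAL_LIMIT_C) -> list:
    """Move one task off each over-limit core to the coolest core under the limit.

    Returns a log of actions and alerts. Temperatures are not updated here;
    a real controller would re-read its sensors before acting again.
    """
    log = []
    for hot in sorted(cores, key=lambda c: c.temp_c, reverse=True):
        if hot.temp_c <= limit_c or not hot.tasks:
            continue
        candidates = [c for c in cores if c is not hot and c.temp_c < limit_c]
        if not candidates:
            log.append(f"ALERT: {hot.name} at {hot.temp_c:.1f} C, no cool core available")
            continue
        cool = min(candidates, key=lambda c: c.temp_c)
        task = hot.tasks.pop()
        cool.tasks.append(task)
        log.append(f"moved {task} from {hot.name} to {cool.name}")
    return log


def pick_lane(lane_gbps: dict, blocked: set, min_gbps: float) -> Optional[int]:
    """Choose the fastest unblocked lane that still meets min_gbps, or None."""
    usable = {lane: bw for lane, bw in lane_gbps.items()
              if lane not in blocked and bw >= min_gbps}
    return max(usable, key=usable.get) if usable else None


cores = [Core("core0", 101.0, ["a", "b"]), Core("core1", 70.0, ["c"]), Core("core2", 80.0)]
print(rebalance(cores))                                                # ['moved b from core0 to core1']
print(pick_lane({0: 0.0, 1: 6.4, 2: 3.2}, blocked={0}, min_gbps=3.0))  # 1
```

When no core has thermal headroom, the sketch falls back to raising an alert instead of moving work. That mirrors the split between alerting and automatic adjustment described above.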
Historically, each of these functions was managed separately, remaining isolated because the data they collect is often incompatible. But with AI, different data types can be combined to find potential problems anywhere inside a device, allowing these systems to dig deeper into why heat is spiking in a particular area or why performance has fallen off in a server rack. Add in AI agents, and all of this can be done autonomously.
The resulting improvement in response time for preventing problems caused by heat or power, two of the biggest troublemakers in leading-edge designs, can be significant. “The fundamental problem in the power world is visibility. It has to be fast and granular enough to provide visibility into the entire power network,” said Mo Faisal, CEO of Movellus. “Once you know what’s going on, then you can analyze it and decide what to do on the backside. It doesn’t matter how you get the power onto the chip.”
The key here is shortening the time it takes to identify the root cause of problems during operation, such as thermal spikes or a slowdown in performance. “Thermal gradients is one area, IR drop is another, and L(di/dt) events are another,” Faisal said. “These are known as droop events, and they are a really big deal. L(di/dt) will set your margin, your Vmin, and you want visibility like workload-aware, workload-dependent visibility. General visibility doesn’t help. It may tell you the worst case happened, but you want to know exactly when it happened. What else was going on in the system when that happened? Then you can take action and start optimizing the workloads, whether that is by controlling the clock or the voltage, or by controlling the data coming in by slowing down the instructions per second. There are other knobs that come later, but first you need to know what’s going on.”
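Faisal's point about workload-aware visibility comes down to recording more than the fact that a droop occurred. A useful record also captures when it happened, how deep it went, and what was running at the time. The sketch below assumes a stream of timestamped supply-voltage samples, each tagged with the active workload. The `Sample` type, the threshold scheme, and the inductance and current values are illustrative assumptions. The sketch also includes the first-order inductive estimate, V = L·di/dt, behind the L(di/dt) term.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t_ns: int        # timestamp in nanoseconds
    vdd: float       # measured supply voltage in volts
    workload: str    # label for what was executing at that instant


def inductive_droop(l_henry: float, di_amps: float, dt_s: float) -> float:
    """First-order estimate of voltage drop from a current step: V = L * di/dt."""
    return l_henry * di_amps / dt_s


def find_droop_events(samples, v_nominal: float, margin: float) -> list:
    """Group consecutive samples below (v_nominal - margin) into droop events.

    Each event records when it started, its worst-case voltage, and which
    workloads were active, i.e. "what else was going on" at the time.
    """
    threshold = v_nominal - margin
    events, current = [], None
    for s in samples:
        if s.vdd < threshold:
            if current is None:
                current = {"start_ns": s.t_ns, "min_vdd": s.vdd, "workloads": set()}
            current["min_vdd"] = min(current["min_vdd"], s.vdd)
            current["workloads"].add(s.workload)
        elif current is not None:
            events.append(current)   # voltage recovered: close the event
            current = None
    if current is not None:
        events.append(current)       # trace ended mid-event
    return events


# Illustrative numbers: a 50 A step in 1 ns through 5 pH of effective inductance.
print(inductive_droop(5e-12, 50.0, 1e-9))   # roughly 0.25 V

trace = [Sample(0, 0.75, "idle"), Sample(1, 0.70, "matmul"),
         Sample(2, 0.68, "matmul"), Sample(3, 0.74, "idle")]
print(find_droop_events(trace, v_nominal=0.75, margin=0.03))
```

Tagging every sample with a workload label is the design choice that matters here. It turns a worst-case number into something that can be traced back to a specific activity, which is the precondition for acting on clock, voltage, or instruction rate.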
AI has simplified all of this considerably. “People have wanted to do this for a long time,” said William Wang, CEO of ChipAgents. “EDA vendors would go to the customer and actually write software and build a dashboard. For example, you could build a dashboard just for the fab to connect data from all of the manufacturing machines and test equipment. But this didn’t quite work out for SLM (silicon lifecycle management) because it’s very fragile. If you change the process, then suddenly the dashboard isn’t completely working. The revenue from this wasn’t great. It was very manual, time-consuming, and it didn’t generalize.”
AI agents fundamentally change this approach by raising the abstraction level to make sense of the data. “We have dashboards that can manage AI agents,” said Wang. “In debugging, for example, we could have a dashboard with five agents and see what they are doing. Some of them look at your log file, others look at your waveform, to determine what exactly is going on in the process. If I have 10 different projects, I can activate these different agents to aggregate data, and then I look at the result. There’s also a corporate aspect. With these enterprise features, how can you collaborate as a team to work on the same project? How do you aggregate the data? The answer is still a dashboard, but now we’re talking about a dashboard for managing AI agents that are aggregating data from different sources.”
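Structurally, the agent dashboard Wang describes is an orchestrator. It fans work out to specialized agents and collects their findings per project. In this hypothetical sketch, `log_agent` and `waveform_agent` are simple rule-based stand-ins for what would be LLM-driven agents in a real product. The project data format is invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def log_agent(project: dict) -> list:
    """Stand-in for an agent that reads a simulation log and reports errors."""
    return [{"agent": "log", "issue": line.strip()}
            for line in project.get("log", "").splitlines()
            if "ERROR" in line]


def waveform_agent(project: dict) -> list:
    """Stand-in for an agent that scans (time, signal, value) waveform samples
    and reports any value that is not a clean 0 or 1 (e.g. 'x' or 'z')."""
    return [{"agent": "waveform", "issue": f"{sig} is {val!r} at t={t}"}
            for t, sig, val in project.get("waveform", [])
            if val not in (0, 1)]


def run_dashboard(projects: dict, agents: list) -> dict:
    """Run every agent on every project concurrently; aggregate findings per project."""
    with ThreadPoolExecutor() as pool:
        futures = [(name, pool.submit(agent, proj))
                   for name, proj in projects.items()
                   for agent in agents]
    results = {name: [] for name in projects}
    for name, fut in futures:
        results[name].extend(fut.result())
    return results


projects = {
    "soc_a": {"log": "INFO start\nERROR AXI timeout at 1200 ns\n",
              "waveform": [(10, "valid", 1), (20, "ready", "x")]},
    "soc_b": {"log": "INFO start\nINFO done\n"},
}
for name, findings in run_dashboard(projects, [log_agent, waveform_agent]).items():
    print(name, findings)
```

Adding a project or an agent in this arrangement does not require rewriting the dashboard itself. That property is the contrast Wang draws with the fragile, hand-built dashboards of the past.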
Others agree. “What we are finding with AI, in general, is that things that used to be really difficult — by sheer complexity, or just being difficult like formal verification where the learning about properties is a tough job — are no longer so difficult,” said Frank Schirrmeister, executive director for strategic programs in systems solutions at Synopsys. “These dashboards were essentially like debug for hardware, where at ‘this point’ you have AI agents being able to take advantage of what used to be a visual inspection process, like inspecting waveforms. Now you can have an agent, or a set of agents, help you find the root cause much faster.”
System-level data
Leading-edge chipmakers appear to be fully on board with this approach. “Nvidia is building AI infrastructure, not chips,” said Hardik Kabaria, CEO of Vinci. “Infrastructure means something that’s always available and accessible. So you have an infrastructure that allows you to do reasoning, and today most of the reasoning is on language. That leads to the proliferation of data, which you now want to understand through dashboards. But anything that humans are building — whether it’s chips, systems, modes, data centers — is governed by the laws of physics. So you want to understand how part of the system is going to behave in the physical world, but you want to do it in a way that makes it accessible to everybody, not just a few people who have a PhD in mechanical engineering. Everybody who is part of the ecosystem wants to understand things like heat transfer, energy balance, momentum balance, and how do they affect the system? Is it going to lead to hot spots? What kind of workload is it going to create? What type of hotspot is going to affect memory? Is it going to affect co-packaged optics? Once you have enough data available at high resolution, at manufacturing scale, then you can use dashboards to make sense of it.”
These dashboards become particularly important as more segments of the design flow shift further left or extend right. This is essentially concurrent system-level design, and being able to access information in one place makes it easier to analyze and co-design.
“If you think about what was happening in the human silicon organization, each group responsible for delivering a block had to roll up certain data to their managers in design reviews, and they were rolling up different dashboards because different things needed to be measured in different ways,” said Rob Knoth, senior group director for strategy and new ventures at Cadence. “But as you keep marching up the stack, reports have to start getting merged. You have to start saying, ‘Hey, I’m doing formal verification on this block, and thermal and power measurement on this block, and DRC closure, but am I analyzing those all on a coherent data set? This is using RTL version 12 and this is using RTL version 10.’ This has not been correlated, and it is very difficult to read. So organizations wrote their own scripts and started data mining things, and some people tried to build the mother of all dashboards.”
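The RTL-version mismatch Knoth describes is the kind of check that homegrown scripts often performed. The minimal sketch below flags any block whose analyses were run on different RTL versions. It assumes each report carries hypothetical `block`, `analysis`, and `rtl_version` fields.

```python
from collections import defaultdict


def check_rtl_consistency(reports: list) -> dict:
    """Return {block: {rtl_version: [analyses]}} for every block whose
    reports were produced from more than one RTL version."""
    by_block = defaultdict(lambda: defaultdict(list))
    for r in reports:
        by_block[r["block"]][r["rtl_version"]].append(r["analysis"])
    return {block: dict(versions)
            for block, versions in by_block.items()
            if len(versions) > 1}


reports = [
    {"block": "noc", "analysis": "formal", "rtl_version": 12},
    {"block": "noc", "analysis": "thermal/power", "rtl_version": 10},
    {"block": "noc", "analysis": "drc", "rtl_version": 12},
    {"block": "pcie", "analysis": "formal", "rtl_version": 7},
]
print(check_rtl_consistency(reports))
# {'noc': {12: ['formal', 'drc'], 10: ['thermal/power']}}
```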
Engineers are now looking to AI to simplify all of this. “Now we’re looking at how deeply tools are unifying them,” Knoth said. “This is why we’ve moved from just focusing on chip design to looking at multi-physics and true system design. You can’t ignore certain physics as you move up the stack. And if you start building a modern 3.5D chip, you’ve got to be worried about thermal-induced stress and warpage. You’ve got to be worried about mechanical problems that are happening with bumps. So now to effectively design the system, that dashboard has to be incredibly rich, accessible, and involve multiple tools.”
This helps explain recent investments in startups, as well as M&A activity involving companies that have well-defined approaches and tools.
“In verification, we have dashboards more with agentic AI, because you can trace the evolution of KPIs,” said Jean-Marie Brunet, senior vice president and general manager for hardware-assisted verification at Siemens EDA. “So, for example, this could be performance or power metrics manageable by a dashboard. Agentic AI is helping in this regard, because it’s orchestrating all those steps, and you see that if you converge this correctly into your dashboard. It’s not new, but agentic AI is accelerating this.”
The next step is to segment this data, and that can be done vertically with deeper analysis, and horizontally across different tools and elements in a chip or system or flow. “Agentic AI has an eval phase,” said Ankur Gupta, executive vice president and head of EDA IC software at Siemens EDA. “There’s plan, there’s execute, and there’s eval. The evaluation is all dashboard data. If you look at the RTL-to-GDS flow, verification is one, and RTL to GDS is another one. Every semiconductor company has an RTL-to-GDS dashboard.”
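The plan-execute-eval cycle Gupta outlines can be expressed as a loop in which the eval step reads KPI data, the same data a dashboard would display. This is a toy sketch under stated assumptions. The starting configuration plays the role of the initial plan, and the hypothetical `toy_flow` stands in for a real RTL-to-GDS run. Power is the only KPI, and targets are treated as upper bounds.

```python
def evaluate(kpis: dict, targets: dict) -> dict:
    """Eval phase: return {kpi: amount over target} for every KPI that misses.
    Targets are treated as upper bounds (e.g. power in mW)."""
    return {k: kpis[k] - limit for k, limit in targets.items() if kpis[k] > limit}


def agent_loop(run_flow, plan_step, start_cfg: dict, targets: dict, max_iters: int = 5):
    """Alternate execute -> eval -> plan until all KPIs meet targets or budget runs out.
    Returns (passing config or None, history of (config, kpis))."""
    cfg, history = dict(start_cfg), []
    for _ in range(max_iters):
        kpis = run_flow(cfg)                  # execute
        history.append((dict(cfg), kpis))     # what a dashboard would plot
        misses = evaluate(kpis, targets)      # eval
        if not misses:
            return cfg, history
        cfg = plan_step(cfg, misses)          # plan the next attempt
    return None, history


# Toy stand-ins (hypothetical): power scales linearly with clock frequency,
# and the planner's only knob is to lower the clock by 100 MHz per miss.
def toy_flow(cfg: dict) -> dict:
    return {"power_mw": cfg["freq_mhz"] * 0.5}


def toy_plan(cfg: dict, misses: dict) -> dict:
    new = dict(cfg)
    if "power_mw" in misses:
        new["freq_mhz"] -= 100
    return new


cfg, history = agent_loop(toy_flow, toy_plan, {"freq_mhz": 1200}, {"power_mw": 450.0})
print(cfg, len(history))   # {'freq_mhz': 900} 4
```

The `history` list is the point of contact with a dashboard. Each iteration's configuration and KPIs are exactly the trend data a team would want to see converge.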
What AI enables is a consistent view, regardless of the chip or system architecture, provided the data is usable by AI.
“It doesn’t have to be the same kind of data, but it has to be structured data,” said Gupta. “There is a concept called ontology, where you define the ins and outs of every phase, like timing and power. The challenge is how do you get that uniformly across various tools? If you’re measuring power and somebody gives you a total power number and some other tool gives you a breakdown, and some tool doesn’t give you clock power, there goes your dashboard.”
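Gupta's power example illustrates why an ontology matters. The sketch below assumes a hypothetical canonical schema of four power components plus a total. Each tool's raw report is mapped onto that schema, and a total is derived only when the breakdown is complete. Any missing quantity is recorded as a gap instead of being silently filled. The tool names and report formats are invented.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical ontology: every power report is mapped onto these components.
CANONICAL_COMPONENTS = ("clock", "logic", "memory", "leakage")


@dataclass
class PowerRecord:
    tool: str
    total_mw: Optional[float]
    components_mw: dict = field(default_factory=dict)
    gaps: list = field(default_factory=list)     # quantities this tool did not report


def normalize(tool: str, raw: dict) -> PowerRecord:
    """Map a tool's raw power report (keys in mW) onto the canonical schema."""
    components = {k: raw[k] for k in CANONICAL_COMPONENTS if k in raw}
    gaps = [k for k in CANONICAL_COMPONENTS if k not in raw]
    total = raw.get("total")
    if total is None and not gaps:
        total = sum(components.values())   # derive a total only from a full breakdown
    if total is None:
        gaps.append("total")
    return PowerRecord(tool, total, components, gaps)


def dashboard_column(records: list, metric: str):
    """Collect one metric ('total' or a component) across tools.
    Returns (values by tool, tools that cannot supply it)."""
    values, missing = {}, []
    for r in records:
        v = r.total_mw if metric == "total" else r.components_mw.get(metric)
        if v is None:
            missing.append(r.tool)
        else:
            values[r.tool] = v
    return values, missing


records = [
    normalize("tool_a", {"total": 120.0}),                                  # total only
    normalize("tool_b", {"clock": 30.0, "logic": 50.0, "memory": 25.0, "leakage": 15.0}),
    normalize("tool_c", {"logic": 55.0, "memory": 20.0, "leakage": 15.0}),  # no clock power
]
print(dashboard_column(records, "total"))   # ({'tool_a': 120.0, 'tool_b': 120.0}, ['tool_c'])
print(dashboard_column(records, "clock"))   # ({'tool_b': 30.0}, ['tool_a', 'tool_c'])
```

The sketch deliberately refuses to infer `tool_c`'s total from a breakdown that lacks clock power. Reporting the gap explicitly keeps the dashboard from comparing numbers that do not mean the same thing.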
AI-driven dashboards
Dashboards have been around a long time. The original concept of a dashboard, as we know it, is derived from the automobile. While analog pressure gauges had been in use since the 1600s, it wasn’t until the 20th century that different gauges were literally mounted on a board to monitor vehicle speed, engine temperature, oil pressure, engine RPM, and fuel level. Today, the sensors that feed into those dashboards are still a mix of analog and digital, but the analytics are all digital.
In automobiles, AI can identify problem areas by collecting and processing data gleaned from different sensors that previously were isolated by function, allowing potential or actual problems to be dealt with as quickly as necessary. That requires data to be accessible, structured, and in the case of safety-critical applications, prioritized.
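At its simplest, prioritization can be a priority queue in which safety-critical messages always drain first. This is a minimal Python sketch. The three priority classes, the message fields, and the `TelemetryQueue` class are assumptions for illustration. A real vehicle would enforce priority with far stricter timing guarantees than this.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority classes: lower number is served first.
PRIORITY = {"safety": 0, "diagnostic": 1, "comfort": 2}


@dataclass(order=True)
class Message:
    priority: int
    seq: int                               # tie-breaker: FIFO within a class
    source: str = field(compare=False)
    payload: dict = field(compare=False)


class TelemetryQueue:
    """Sensor messages from formerly isolated functions, drained by priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, kind: str, source: str, payload: dict) -> None:
        heapq.heappush(self._heap, Message(PRIORITY[kind], next(self._seq), source, payload))

    def pop(self) -> Message:
        return heapq.heappop(self._heap)

    def __len__(self) -> int:
        return len(self._heap)


q = TelemetryQueue()
q.push("comfort", "hvac", {"cabin_c": 21})
q.push("safety", "brake", {"pressure_bar": 80})
q.push("diagnostic", "battery", {"state_of_health": 0.93})
q.push("safety", "steering", {"angle_deg": 3})
print([q.pop().source for _ in range(len(q))])
# ['brake', 'steering', 'battery', 'hvac']
```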
“You need more storage for that data,” said Oscar Camacho, application marketing manager at Infineon. “So we have memories like FRAM, which allow you to have multiple read/write cycles at the edge. And then you need to move this data around between a computer and the end nodes. Depending on the application, that node could be a motor drive or an actuator in a vehicle. This data moves through high-speed communication layers, and gets processed in real-time by a central computer. And within our own processors, we’re scaling up processing capabilities, adding parallel processing units to allow for some machine learning algorithms located directly at the module that is performing the function.”
The real change is less about the amount of data being generated, which continues to balloon, than what can be done with that data. “The cameras are faster and you have more comfort features, so the amount of data is definitely increasing, but maybe not as much as the usability of the data and the conclusions that you can draw out of it,” Camacho said. “AI can predict how the driver will behave. It can predict the maintenance of the car based on the degradation of the battery. It’s enabling a lot more intelligent decisions to be made with data, which is making it a smarter car.”
Pulling all of this data together into dashboards makes it more understandable. This is the general idea behind digital twins, but agentic AI may provide a more granular and potentially lighter version of the original concept. What gets used where will likely depend on cost, the amount of data that needs to be mined and monitored, and the criticality of that data.
Regardless, AI-driven dashboards will be especially useful at the edge, where power is more limited than in the data center, as well as inside data centers, where multi-die assemblies that include logic developed at leading-edge nodes need to be monitored more closely due to the limited margin for heat, noise, and accelerated aging.
“There’s no way to directly measure electromigration,” said Movellus’ Faisal. “It shows up in different ways. But you have to be able to measure that and then take action. And especially on 2nm, very large chips, the hardware instrumentation is going to be very, very important. Without that, I don’t think we can do power management.”
Conclusion
Dashboards are a familiar concept to nearly everyone, but the information they can provide, and how that information is used, are changing significantly. What matters here is less the amount of data being generated than the ability to extract actionable information from it. And the more data that needs to be mined, sorted, and accessed, the more important a dashboard becomes.
“It’s not about running one physics simulation,” said Vinci’s Kabaria. “Our customers are running half a million physics simulations and asking, ‘Can you help me create a dashboard so that I can direct my engineering team?’ Not just one person. A whole team. So they are focused on the right activity to create the next-gen, best fit for the best products.”
Making sense of that much data unaided is well beyond human capability. “We haven’t evolved enough to really grasp the combination of 28 dashboards for different characteristics in your design,” said Synopsys’ Schirrmeister. “AI will be able to make sense of it and help you find potential causalities and correlations. This is something machine learning and big data analytics could do before, but AI makes it much easier to use, just like it suddenly makes formal verification so much easier to use and apply.”
Put simply, future dashboards may be much more targeted, more customized, and much simpler to understand, and they could have a broad impact on how data is used for designing, manufacturing, and using chips.