Complexity mounts, driven by multiple dies, larger and more complex systems, and the incessant demand for performance improvements everywhere.
Simulations of semiconductors and systems are becoming bigger, more complex, and increasingly necessary, mirroring everything that is happening to the hardware itself — particularly in AI data centers.
The move beyond monolithic chips to multi-die assemblies now requires solving some thorny multi-physics challenges, such as thermal management and power delivery, which are increasingly difficult to model accurately. Multi-layer signal routing in high-bandwidth memory, for example, requires additional simulations. And verifying the entire system’s functionality, including the interactions between different integrated subsystems, is much more challenging than verifying a single chip.
Much of this is unfolding at the leading edge of design, where SRAM scaling has slowed, and logic scaling is no longer providing sufficient performance improvements. There are still more transistors, but there also are a variety of tricks being used to speed up the movement of data between memory and processing elements, as well as pre-processing to reduce the amount of data that needs to be moved in the first place.
Simulation must adapt to all of this, and it needs to do it quickly enough to meet production schedules. But there are three main problems:
“This simulation challenge falls under an observation that Anirudh Devgan, CEO of Cadence, made back in May, when he said, ‘Design for AI and AI for design,’” noted Rob Knoth, senior group director, Strategy and New Ventures at Cadence. “The whole point is that all of us in the industry are helping to build out the AI infrastructure, and that is difficult. The complexity involved in these systems and the timeframes that are required to deploy and then create the next generation are mind-numbing. Of course, it’s going to be hard. It’s always been hard. Why would it be easy now? ‘AI for design’ is the silver lining. We’re building these incredibly complex systems made up of incredibly complex chips, the likes of which we’ve never attempted in the past. We’re working our butts off to create these incredibly difficult computational systems. Then we use those same systems to design the next generation. That’s what’s kept us paid for generations. In many ways, the story hasn’t changed, but it’s deepened and evolved. There are new actors, new catastrophes and emergencies, and there are new things we’ve got to tackle.”
And all of this is happening amid a rapid build-out of AI infrastructure and accelerating product cycles. “To make these frontier models, everything’s getting harder, and the amount of computation required is tremendous,” Knoth said. “If you look at the power requirement to train frontier models, it’s the same sort of thing where we’re now getting to single data centers consuming as much as a major U.S. city. What does this mean for what’s going on inside there? State-of-the-art AI processors, which allow these subsystems to be more than three times the reticle size? That’s a lot of compute. That is where they’re predicting this is going. If you look way out, there are systems on wafer (SoW), wafer scale substrates over 40X reticle size, and this is where complexity comes in. When we start thinking about what’s truly involved for system architectures, implementation, integration, and analysis, this is daunting. And, of course, they’re all 3D-ICs.”
There are also device-level factors that make simulation more difficult. “Something is difficult to simulate if it takes too long because it has too much detail,” noted Roland Jancke, head of department design methodology at Fraunhofer IIS/EAS. “For a simulation, the question is always, ‘What is a good level of abstraction? How much information can you skip because it’s not important for the question at hand, and how much can you abstract to speed up the simulation?’ That’s always the key for a simulation task.”
Additionally, what makes something difficult to simulate is the difficulty in modeling it. “You need to model more, and you must build accurate models to be able to do that,” said Michael Munsey, vice president semiconductor industry at Siemens EDA. “In the world of semiconductors, the way we’ve virtualized processors has gone a long way to allow more simulation to happen early in the process. We had to create that virtualization technology. Otherwise, we would have been stuck in the world of emulation and developing RTL and running software at the same time. New technologies had to be developed to be able to model that. So, when we look at larger and more complex systems, it’s the ability to have high-fidelity models, to be able to represent each part of the system. When you look at an entire product, some things are ahead of the curve and are easier to digitalize than other parts of the product. That’s what makes things become even more difficult to simulate.”
Size plays a key role here. The larger and more complex the system, such as a car with all its integrated functions, the harder it is to model and to simulate. “The models become so large, and you need to be able to abstract them away, to be able to fit them into enough compute power to do the actual simulation,” Munsey said. “The long pole of the tent is being able to digitalize. Did you represent that in terms of a larger digital twin to be able to do the full system-level type simulations that you want to do?”
So does speed. Everything is moving faster, whether that’s product cycles or the signals within a device or system. That requires more current, and it generates more heat. “On top of that, electromagnetic simulation can give you things like local heating, and then you can do the physics of thermal, which is another set of equations of physics,” said Matt Commens, director of product management at Ansys (now part of Synopsys). “You can take those sources extracted from electromagnetic simulation to source a thermal analysis, and make some local assumptions about how many watts a chip might be dissipating, as well as heat, and you can use that to source a thermal simulation. It’s just that everything gets harder as things get packed tighter, and as things move faster, there’s no free lunch, so the engineer always has to deal with it. It comes down to what level of margin they can deal with. One of the things about simulation is that you can design up to the margins if you have good simulation. If you don’t have good simulation, and you have to put some padding in there, you’re going to lose real estate, you’re going to lose functionality. And so, the companies on the cutting edge are going bigger with simulation. They want to extract bigger parts of the system with as much rigor as possible. On our end, we’re being tasked to take advantage of that HPC computing, and the latest generation GPUs have changed the game in a lot of ways.”
Sometimes the CPU can still be quite valuable in the HPC space. “It depends on what techniques you use to accelerate the simulation and parallelize different layers of hierarchy in the simulation process,” Commens said. “It’s a real balancing act taking advantage of all this different hardware.”
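The electromagnetic-to-thermal handoff Commens describes can be illustrated with a toy model. The sketch below is only a minimal illustration, not any vendor's solver: it assumes a one-dimensional chain of cells, a single uniform thermal conductance, both ends tied to ambient, and hypothetical per-cell losses standing in for values extracted from an electromagnetic run. The function name and every number are assumptions made for the example.

```python
def solve_steady_state(power_w, conductance_w_per_k=2.0, ambient_c=25.0,
                       tol=1e-9, max_iters=100_000):
    """Return steady-state cell temperatures (deg C) for a 1D chain of cells.

    Each cell exchanges heat with its neighbours through the same thermal
    conductance, and both ends of the chain are tied to ambient.
    """
    n = len(power_w)
    temps = [ambient_c] * n
    for _ in range(max_iters):
        max_change = 0.0
        for i in range(n):
            left = temps[i - 1] if i > 0 else ambient_c
            right = temps[i + 1] if i < n - 1 else ambient_c
            # Energy balance per cell: G*(left - T) + G*(right - T) + P = 0
            new_temp = 0.5 * (left + right + power_w[i] / conductance_w_per_k)
            max_change = max(max_change, abs(new_temp - temps[i]))
            temps[i] = new_temp  # Gauss-Seidel: reuse updated values at once
        if max_change < tol:
            break
    return temps


# Hypothetical per-cell losses in watts, as if extracted from an EM run.
em_losses_w = [0.0, 0.0, 3.0, 0.0, 0.0]
temps_c = solve_steady_state(em_losses_w)
hottest = max(range(len(temps_c)), key=temps_c.__getitem__)
print(f"hottest cell: {hottest}, {temps_c[hottest]:.2f} C")
```

With these assumed values the middle cell settles at about 27.25 C. A production flow would replace the chain with a 3D mesh and material-dependent conductances, which is where the compute cost Commens mentions comes from.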
Leading-edge node challenges
Simulation challenges for chips developed at 3nm and below have their own unique set of issues. “There is a focus on the very real, very urgent challenge of dealing with emerging simulation issues of the angstrom-era physics effects at the transistor level,” said Steve Roddy, chief marketing officer at Quadric. “Proper circuit-level behavior requires understanding all of those device physics effects at an increasing number of signoff corners (for varying voltage and frequency operating points, etc.), but there is also a need to move up-level from the tried and tested gate-level simulations of old, which take entirely too long to run. Understanding the whole chip behavioral state — how many blocks are active and switching under various scenarios — is key to designing proper power grids and clock trees. In the era of 10 billion-gate devices, the building blocks need to be individually characterized for power and switching activity using high-level but cycle-accurate simulations like the cycle-based instruction set simulators (ISS) for processors.”
A complex chip with CPUs, GPUs, DSPs, and AI NPUs can be simulated with all the processors having power-accurate ISSs linked together in a larger system simulation. “Such a SystemC chip-level simulation can inform the designer what power state each block is in during various usage scenarios to allow right-sizing of the power grids without needing to always design for every element experiencing simultaneous worst-case activity,” Roddy said. “The obvious takeaway is that design teams should avoid, where possible, deploying large mega-gate logic blocks that lack higher-level simulation capability.”
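The right-sizing argument Roddy makes can be shown with simple arithmetic. The following sketch is a hypothetical illustration rather than a SystemC model: the block names follow the article, but the power numbers, power states, and usage scenarios are all invented for the example. It compares a grid budget sized for every block peaking at once against one sized for the worst usage scenario a system simulation would report.

```python
# Hypothetical per-block power in milliwatts for each power state.
BLOCK_POWER_MW = {
    "cpu": {"idle": 500, "active": 4000, "peak": 6000},
    "gpu": {"idle": 1000, "active": 8000, "peak": 12000},
    "dsp": {"idle": 200, "active": 1500, "peak": 2000},
    "npu": {"idle": 500, "active": 10000, "peak": 15000},
}

# Hypothetical usage scenarios: the power state each block sits in.
SCENARIOS = {
    "video_playback": {"cpu": "active", "gpu": "active", "dsp": "peak", "npu": "idle"},
    "ai_inference": {"cpu": "active", "gpu": "idle", "dsp": "idle", "npu": "peak"},
    "gaming": {"cpu": "peak", "gpu": "peak", "dsp": "active", "npu": "idle"},
}


def scenario_power(states):
    """Total chip power (mW) when each block is in the given state."""
    return sum(BLOCK_POWER_MW[block][state] for block, state in states.items())


def worst_case_sum():
    """Budget if every block is assumed to hit its maximum simultaneously."""
    return sum(max(states.values()) for states in BLOCK_POWER_MW.values())


def worst_scenario():
    """Name and total power (mW) of the most demanding usage scenario."""
    name = max(SCENARIOS, key=lambda n: scenario_power(SCENARIOS[n]))
    return name, scenario_power(SCENARIOS[name])


name, power_mw = worst_scenario()
print(f"all blocks at peak: {worst_case_sum()} mW")
print(f"worst scenario ({name}): {power_mw} mW")
```

In this made-up case the simultaneous-peak budget is 35,000 mW, while the most demanding scenario draws 20,200 mW. The gap is the over-design that scenario-aware simulation is meant to expose, provided the scenario list actually covers real usage.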
Abstracting details and creating models isn’t a pervasive skill set, however. “Not everyone has a good handle on how to do this today,” said Fraunhofer’s Jancke. “The developers themselves know about the problem at hand, what needs to be solved, and what needs to be implemented. But they aren’t necessarily experts in simulation methodology and modeling methodology. We see a need to be more knowledgeable in that area in order to have efficient models for efficient simulations. Sometimes they use the simulator because it’s free, because they can have it from an open source from the internet, without understanding that it’s completely the wrong simulation level that they’re using. That’s why more knowledge is necessary in the company, taking the developer by the hand and giving them the knowledge about simulation principles.”
Those skills are typically learned on the job. “In Dresden, it’s not taught,” Jancke said. “It’s not part of the curriculum in electrical engineering, unfortunately. Maybe in mechanical engineering, they get some knowledge about simulators in their background, how they work, how the different principles and simulation tools are, but it’s rarely found today in curricula and universities.”
This skills gap is widening as new challenges emerge at the forefront of technology, but that’s only part of the problem. There are infrastructure and practical hurdles in implementing advanced simulation strategies at scale.

Fig. 1: AI infrastructure challenges. Source: Cadence
“We know that the processors themselves increase 1.5X to potentially 3X in power per generation,” Cadence’s Knoth said. “That means analyzing becomes critical, and optimizing is even more critical. Then what gets challenging in terms of simulation is that we’re seeing over 4 terabytes per second per link in between chips. So now you must start thinking about every chip having photonics involved, which means a lot of the simplicity, a lot of the assumptions that you would normally make, go out the window. The scale of the simulations gets even harder. The fact that chips are getting that much more power-intensive, with that much more processing power going into each one, means the power density becomes tremendous. Liquid cooling is the default. So not only do we have heat and light, but now we’ve got fluids running around in these systems. We just keep adding on extra size, extra complexity in the simulations themselves. Then we get to the last dimension — time — which is funny since time is the fourth dimension. Here, not only is your job that much harder, but you’ve got to complete it that much faster.”
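To get a feel for how quickly the 1.5X to 3X per-generation range Knoth cites compounds, the short sketch below projects both ends of that range forward. The 1,000 W baseline and the three-generation horizon are assumptions chosen for round numbers, not figures from Cadence.

```python
def project_power(baseline_w, growth_per_gen, generations):
    """Per-generation power, compounding a fixed growth factor."""
    return [baseline_w * growth_per_gen ** g for g in range(generations + 1)]


BASELINE_W = 1000.0  # hypothetical starting point, not a figure from the article
low = project_power(BASELINE_W, 1.5, 3)
high = project_power(BASELINE_W, 3.0, 3)
for gen, (low_w, high_w) in enumerate(zip(low, high)):
    print(f"gen +{gen}: {low_w:,.0f} W to {high_w:,.0f} W")
```

After three generations the two ends of the range differ by a factor of eight, which is one reason power analysis has to be rerun rather than extrapolated.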
Technical strategies
Every component, every signal, and every system needs to move faster. To maximize the performance of SoCs designed for HPC, parallelism is used during both the compilation and simulation/run stages.
“By leveraging the multiple cores available in a machine or across a grid, significant performance improvements can be achieved,” noted Taruna Reddy, principal product manager at Synopsys. “Simulators have long incorporated technologies that utilize this parallelism effectively. Again, a trend in the architecture of HPC-targeted SoCs is the transition toward multi-die or chiplet-based designs. These integrated systems consist of multiple dies packaged together to function as an SoC, providing advantages such as scalability, high bandwidth, and energy efficiency. This approach involves independently manufactured dies that are interconnected through advanced communication fabrics. However, verifying the correctness and functionality of these complex systems presents significant challenges. Traditional simulation methodologies, whether in-house or otherwise, often struggle to scale adequately in terms of performance and capacity.”
Distributed simulation technologies are one option, because they enable large multi-die simulations to be executed as smaller, manageable components through multiple participating executables. “In the context of HPC SoCs tailored for the AI datacenter industry, there is an increased prevalence of timing exceptions due to complex data paths,” Reddy explained. “These timing exceptions are typically verified during gate-level simulations later in the verification cycle, resulting in low coverage. This delayed verification process can lead to compromised silicon quality. By verifying these constraints earlier at the RTL stage, higher coverage can be achieved using simulators, which can effectively parse constraints files.”

Fig. 2: The need for exception verification at RTL. Source: Synopsys
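The idea of running a large multi-die simulation as several cooperating pieces can be sketched in a few lines. This is a toy under stated assumptions, not how any commercial simulator is built: threads stand in for the separate participating executables, `DieModel` is a hypothetical placeholder for a per-die simulator, and synchronization is a simple lockstep in which messages cross die boundaries only at the end of each time window.

```python
from concurrent.futures import ThreadPoolExecutor


class DieModel:
    """Toy stand-in for one die's simulator (hypothetical, for illustration)."""

    def __init__(self, name, seed):
        self.name = name
        self.state = seed

    def step(self, inbound):
        """Advance one sync window: consume inbound messages, emit outbound."""
        self.state += sum(inbound)
        return [self.state]


def run_lockstep(dies, links, windows):
    """Step all dies in parallel, exchanging messages between windows.

    links maps each die name to the name of the die that receives its output.
    """
    inbox = {die.name: [] for die in dies}
    with ThreadPoolExecutor(max_workers=len(dies)) as pool:
        for _ in range(windows):
            # Each die touches only its own state, so the steps are independent.
            futures = {die.name: pool.submit(die.step, inbox[die.name])
                       for die in dies}
            outputs = {name: fut.result() for name, fut in futures.items()}
            # Barrier reached: deliver messages for the next window.
            inbox = {die.name: [] for die in dies}
            for src, messages in outputs.items():
                inbox[links[src]].extend(messages)
    return {die.name: die.state for die in dies}


final = run_lockstep([DieModel("A", 1), DieModel("B", 2)],
                     {"A": "B", "B": "A"}, windows=3)
print(final)
```

Because messages are exchanged only at window boundaries, the result does not depend on thread scheduling. The trade-off is that the window length bounds how tightly the dies can interact, which is the kind of abstraction decision discussed earlier in the article.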
And while AI is being used to address challenges like coverage closure in simulation, Reddy noted it would be interesting to see how AI can be further leveraged in the future to improve raw performance and facilitate the early detection of bugs.
Conclusion
Amid these escalating challenges, it’s important to recognize the broader perspective shaping the industry. While the technical hurdles may seem overwhelming, the drive to innovate persists, propelling both individuals and organizations toward new frontiers. This sense of determination and optimism is reflected in the way experts approach the future of supercomputing and AI infrastructure.
“A supercomputer is hope,” said Knoth. “It is this machine that you are using to do some kind of physics that was not possible with previous generations of the machine, so if we look at these four dimensions of the problem here, is it scary? Maybe. But is it hopeless? No. This is opening up tremendous opportunities, because when we talk about levels of abstraction, that’s one of the fundamental things that’s allowed electronic design automation to deliver the productivity advances that we’ve had generation to generation. We’re not pushing polygons by hand anymore. We’re not even placing standard cells by hand anymore. The levels of abstraction that we’ve been able to go up allow each engineer to design larger, more complex systems, and at the same time, the software then is able to start doing better optimization inside the scope that it’s abstracted that it’s responsible for. So the fact that now our world doesn’t start and stop at the boundary of the chip, package, PCB, or even the data center anymore means we’re going to be able to optimize the heck out of this problem, and that is where hope lies.”
Knoth said he can appreciate the opportunity that AI will bring. “I can see where it’s coming from, on the engineering side. For example, one of our R&D leads published a paper at a conference recently, where they talked about using surrogate models to increase the speed of thermal simulation for 3D-ICs. And I thought, ‘I can see where this is going.’ Also, LLMs, in some ways, are amazing and phenomenal, and I think about what they’re going to allow people to do. It’s like the transition from machine code to higher-level languages, where more people can start to put their ideas into practice. It makes it that much bigger. But I feel like the real revolutions, just like in engineering itself, are not going to come from these high-level, glitzier things, but rather from the lower-level breakthroughs like some of these old concepts that are new again, like neural networks. I expect we’ll start to see some better physics done because of the marriage of actual principled simulation, calculated things, and marrying that massive amount of high-quality physics-based data with training new models.”
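The surrogate-model idea Knoth mentions can be illustrated with a deliberately simple case. The sketch below is not the method from the paper he refers to, whose details are not given here. It assumes a toy "slow" thermal solver (a fixed-point loop with temperature-dependent leakage), samples it at a handful of power levels, and fits a straight line that can then be evaluated instantly. All function names and constants are hypothetical.

```python
def slow_peak_temp(power_w, ambient_c=25.0, r_th=0.5, leak_per_k=0.2, iters=200):
    """Toy stand-in for an expensive solver: leakage power rises with
    temperature and feeds back into heating until the loop settles."""
    temp = ambient_c
    for _ in range(iters):
        total_w = power_w + leak_per_k * (temp - ambient_c)
        temp = ambient_c + r_th * total_w
    return temp


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Run the slow solver a few times to produce training data.
train_power_w = [10.0, 50.0, 100.0, 150.0]
train_temp_c = [slow_peak_temp(p) for p in train_power_w]
slope, intercept = fit_line(train_power_w, train_temp_c)


def surrogate_peak_temp(power_w):
    """Cheap approximation of slow_peak_temp learned from the samples."""
    return slope * power_w + intercept


print(f"solver:    {slow_peak_temp(75.0):.3f} C")
print(f"surrogate: {surrogate_peak_temp(75.0):.3f} C")
```

The toy solver happens to be linear in power, so a line reproduces it almost exactly. Real 3D-IC thermal behavior is not that forgiving, and a practical surrogate would need a richer model plus validation against the physics-based solver it is meant to replace.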