Using Emulators For Power/Performance Tradeoffs

Chip design’s big iron is moving earlier in the design cycle.


Emulation is becoming the tool of choice for power and performance tradeoffs, scaling to almost unlimited capacity for the complex chips used in data centers, AI/ML systems and smartphones.

While emulation has long been viewed as an important but expensive asset for chipmakers trying to verify and debug chips, it is now viewed as an essential component for design optimization and analysis much earlier in the design process. Emulators today regularly take in an entire SoC’s worth of RTL, which is far more than standard simulators can handle.

“The analysis of the power-to-performance tradeoff takes place early in the design on the system level,” said Andreas Brüning, head of the Department for Efficient Electronics at Fraunhofer IIS/EAS. “But the analysis on the basis of emulators is not regarded as target-oriented. The realization of a system on an emulator has completely different characteristics to the target system due to its flexible structure, both in terms of power consumption and performance.”

The fact that this kind of hardware-accelerated technology is being used this early in the design process speaks to the complexity of these designs.

“The world is definitely changing, and the flow for power is quite interesting and changing,” said Frank Schirrmeister, senior group director for product management and marketing at Cadence. “The key is to have the data ready from the emulator for consumption by downstream power analysis, integrity and optimization tools, which create the more accurate data. How do you make the data available, and how do you choose the right set of data, because even an emulator is looking at the design from an architecture perspective? It gives you data, and you need to create a lot of data to actually become relevant over long enough cycles to find the right spot. One way to approach this is related to refinement, such as which areas to look at with the initial toggle data. Then, once the engineering team has honed in on the window they want to look at in more detail, they get more accurate data. As such, the switching format is used to drive the activity into the backend tools, and then you hone in on the window of most activity. The emulation data is connected down from an abstraction level into the data for the implementation flow with power analysis and integrity tools, which use the same engine later used to do power data. This comes down to things like how fast the user gets information back from the tool.”

In that situation the data could, in principle, be generated, saved, and then moved to the next step. But because of the sheer amount of data, that is not practical, because it takes too long.

“You have slices, so that instead of waiting to save everything, you transfer smaller slices much earlier. The user sees the data while you’re still running on the emulator, so that you get to the data fast, and you pipeline these items,” Schirrmeister said.
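
The mechanics of that pipelining are straightforward to picture. The sketch below is a minimal illustration, not any vendor’s actual interface: an “emulator” thread emits per-slice toggle counts while an “analysis” thread consumes them through a small queue, so downstream processing starts before the run finishes. The slice size, net count and toggle values are all assumed for the example.

```python
# Minimal sketch of sliced, pipelined hand-off from an emulator to a downstream
# consumer. All sizes and the random toggle data are illustrative assumptions.
import queue
import random
import threading

SLICE_CYCLES = 1_000   # cycles per slice (assumed)
NUM_SLICES = 50        # length of the run, in slices (assumed)
NUM_NETS = 10_000      # nets being tracked (assumed)

def emulator_side(out_q: queue.Queue) -> None:
    """Stand-in for the emulator: emit per-slice toggle counts as they happen."""
    for slice_idx in range(NUM_SLICES):
        toggles = [random.randint(0, SLICE_CYCLES) for _ in range(NUM_NETS)]
        out_q.put((slice_idx, toggles))
    out_q.put(None)  # end-of-run marker

def analysis_side(in_q: queue.Queue) -> None:
    """Stand-in for the downstream tool: process each slice as it arrives."""
    while True:
        item = in_q.get()
        if item is None:
            break
        slice_idx, toggles = item
        activity = sum(toggles) / (NUM_NETS * SLICE_CYCLES)  # average toggle rate
        print(f"slice {slice_idx:3d}: average activity {activity:.3f}")

q: queue.Queue = queue.Queue(maxsize=4)  # small buffer keeps the two sides pipelined
producer = threading.Thread(target=emulator_side, args=(q,))
consumer = threading.Thread(target=analysis_side, args=(q,))
producer.start()
consumer.start()
producer.join()
consumer.join()
```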

The amount of activity here, as well as how data is being generated, is far different today than it was in the past.

“It used to be that people would run software simulations, but now they have evolved into methodologies that are using emulators,” said Preeti Gupta, product manager for RTL power analysis tools at ANSYS. “Emulators give them much more realistic activity for the end application. They also give engineering teams a lot of activity for that end application, not just a snapshot of it.”

The challenge with emulators is the quantity of activity data they generate, and how the tools that are consuming that activity deal with the data volume. That activity has a first-order impact on power.

“The number of nets that are switching determines how much power the design will consume,” Gupta said. “If there are fewer nets switching, then the power consumption is lower than when there are a lot of nets switching within the design. Considering what the design targets and goals are, you may want to look at emulator activity in many different ways. For example, you may be a package designer, or you may be a power grid designer, or you may be an engineer concerned about thermal aspects of the chip and the system. You may be interested in how to reduce the power consumption. You may be a software engineer who wants to understand how the software that you’re writing impacts the hardware power consumption. There is a gamut of roles and functions that can each benefit from emulator power flows.”
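
As a rough illustration of why toggle activity is a first-order driver of power, the sketch below applies the standard dynamic-power relation P ≈ α·C·V²·f to a handful of nets. The net names, toggle rates, capacitances, voltage and frequency are all invented for the example; a real flow would pull activity from the emulator (e.g., a switching-activity file) and capacitances from the implementation data.

```python
# Back-of-the-envelope dynamic power from per-net toggle activity.
# All numbers below are made up for illustration.
VDD = 0.8        # supply voltage in volts (assumed)
FREQ = 1.0e9     # clock frequency in Hz (assumed)

# (net name, toggle rate alpha [toggles per cycle], switched capacitance [farads])
nets = [
    ("cpu_core/alu",      0.30, 2.0e-12),
    ("cpu_core/regfile",  0.10, 5.0e-12),
    ("noc/router0",       0.05, 1.5e-12),
    ("ddr_phy/txdata",    0.45, 3.0e-12),
]

def dynamic_power(nets, vdd, freq):
    """Sum alpha * C * V^2 * f over all nets (any 0.5 factor folded into alpha)."""
    return sum(alpha * cap * vdd**2 * freq for _, alpha, cap in nets)

print(f"Estimated dynamic power: {dynamic_power(nets, VDD, FREQ) * 1e3:.2f} mW")
```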

Emulation needs can vary greatly by user, by application, and by project. For example, a semiconductor company making power tradeoffs for chips designed for the mobile market has very different needs than a CPU developer doing power/performance tradeoffs during the migration from one generation to the next.

“For the CPU company, their needs included having very early on — two years before their chip was taped out — an understanding of the activity trends in terms of power, in terms of performance loops, when running real applications, and they wanted activity trends as accurate as power trends,” said Vijay Chobisa, product marketing director in the emulation division at Mentor, a Siemens Business. “They were working with an early version of RTL to see how the chip was behaving, and when and what kind of architecture adjustments they needed, either in hardware or software.”

For this company, there were two key concerns. “One aspect was to detail design activity in a graphical view so they could very quickly see the areas of concern in the hardware,” Chobisa said. “If they saw a peak of values in the design, they wanted to very quickly see what portion of the chip was contributing to that activity. But not only that, they wanted to go back and correlate that in the software and determine what kind of adjustments could be made to the software or hardware in order to keep the chip in the power budget while delivering the required performance.”

In this chip, there were several cores with varying levels of performance. “They wanted to use their best and fastest core so that the application comes up as quickly as possible,” he said. “In order to achieve that, they were using the best possible core to launch the application. But once that application was launched, just populating data and providing information didn’t need a high-performance core. They wanted to verify this scenario, and also make sure they were able to halt all of the processing when they turned the high-performing core over to a large application.”

The other important aspect was capturing the detailed design activity for the entire design, Chobisa said. To do that, the emulator engine needs to be able to capture the design activity on every clock and transfer that activity from the emulator to the host for processing and for generating user-friendly graphs.

Profiling power
Behind all of this is a need to profile power in devices, which has a big impact on everything from overall system performance to reliability. Emulators help in that regard because they can profile real application scenarios and provide a view of power consumption over time.

“That’s a basic feature, but there are end benefits,” Gupta said. “If I am doing an operating system boot, when does power go up, when does power go down, when does it stay the same? When is it a sustained worst-case power? If you have high power consumption for a long period of time, or high power consumption for a couple of cycles, these are very different power phenomena. The engineering team wants to understand these power phenomena for real-life application scenarios, because once they know which block of the design is consuming more power, they’re able to optimize it. But you first have to know where, why and when in order to fix it. Looking at these realistic activity scenarios helps.”

Another critical consideration in making power/performance tradeoffs is identifying power-critical windows. “If my design consumes a very high amount of power for 1 cycle versus 4 cycles versus 10 cycles, how does it stress my power grid? If I have a design that has a huge block for which I turn on the clock suddenly, and the clock goes to tens of thousands to millions of nets so the power suddenly shoots up, that sudden increase in power consumption could actually cause a huge amount of voltage drop, which then could cause a timing failure,” Gupta explained. “Understanding these peak power scenarios (di/dt), or just high power consumption, or even a ratio that says, ‘What is the peak versus what is the average power,’ all of these are telltale metrics in terms of how to define problem areas.”
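
Those window-based metrics are easy to express once a power-over-time profile exists. The following sketch, which uses a synthetic per-cycle trace rather than real emulator output, computes peak power over 1-, 4- and 10-cycle windows, the peak-to-average ratio, and a crude di/dt proxy (the largest cycle-to-cycle swing).

```python
# Window-based power metrics over a synthetic per-cycle power trace (in mW).
import random

random.seed(1)
trace_mw = [100 + random.uniform(-10, 10) for _ in range(2000)]
trace_mw[700:710] = [480 + random.uniform(-20, 20) for _ in range(10)]  # injected burst

def peak_window_avg(trace, width):
    """Highest average power over any contiguous window of `width` cycles."""
    return max(sum(trace[i:i + width]) / width for i in range(len(trace) - width + 1))

avg = sum(trace_mw) / len(trace_mw)
for w in (1, 4, 10):
    print(f"peak over {w:2d}-cycle window: {peak_window_avg(trace_mw, w):.1f} mW")
print(f"average power:            {avg:.1f} mW")
print(f"peak-to-average ratio:    {peak_window_avg(trace_mw, 1) / avg:.2f}")
didt = max(abs(b - a) for a, b in zip(trace_mw, trace_mw[1:]))
print(f"max cycle-to-cycle swing: {didt:.1f} mW/cycle (di/dt proxy)")
```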

The third power/performance tradeoff benefit from emulation involves the ability to do very fast power profiling of long vectors for optimizing software operation.

“Imagine a software person sitting in their own world writing algorithms for apps that are going to run maybe on a cellphone,” she said. “How are they going to know how much power the app will consume on that hardware when the hardware is being built independently? What happens is these software developers actually don’t care about the 5% accuracy difference, and it’s things like this we worry about for signoff analysis. They just want to know if they code their algorithm in a certain way whether they are consuming more power or less power. For that kind of audience, just the ability to have a visualization of a power profile early on is a huge benefit.”
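
For that audience the question is usually relative rather than absolute. A sketch of the kind of comparison involved, using placeholder numbers rather than measured profiles, might look like this:

```python
# Relative comparison of two software variants from coarse emulator power profiles.
# The profiles below are placeholder numbers, not measured data.
profile_v1_mw = [120, 135, 150, 160, 155, 140, 130]   # e.g. original algorithm
profile_v2_mw = [110, 115, 125, 130, 128, 120, 112]   # e.g. reworked loop/data layout

energy_v1 = sum(profile_v1_mw)  # proportional to energy if samples are equal-length
energy_v2 = sum(profile_v2_mw)
delta = 100.0 * (energy_v1 - energy_v2) / energy_v1
print(f"v2 uses roughly {delta:.0f}% less energy than v1 over this scenario")
```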

But that still leaves a challenge of sifting through all of the data generated by emulation and making sense of it.

“You can load an emulator, fire up your design in the hardware, throw the vectors at it, load the software, do hardware/software debug, check your software stack and everything,” said Shailander Sachdeva, applications engineer for power products in the verification group at Synopsys. “When it comes to power, the good and bad thing with the emulator is you can generate tons of data. But you have to actually analyze it and you have to store it. A 10-second run can generate 10 terabytes of data, but not many companies have the wherewithal to actually move that much data over the network, or store it and then analyze it. So it becomes tricky. Usually with emulators, when you run it you have to optimize the data coming out of the emulator, for two reasons. One, the more data you write out of the emulator, the more it slows the emulator, because every time, the emulator has to stop, flush the data out, and then resume operations. This slows down the emulator and means your emulator cost increases — and emulator bandwidth is a costly thing.”

For functional debug, there are set methodologies that allow much more granular use of the emulator. But when it comes to power, two basic things are being measured — peak power and average power. “Average power is the overall power consumption of my design,” Sachdeva said. “That helps the designer or the architects to size the battery capacity of a mobile device, for instance. On average, it takes 100 milliwatts of power, so I can accordingly pick a battery technology or a battery capacity. Peak power means that on average it’s taking 100 milliwatts, but then when there is traffic, or when the OS is loading up and the firmware gets fired up, the power consumption shoots up to maybe 500 milliwatts. That 500 milliwatt number is also very important because it will help you decide what the rush currents are and how to design the power supply. The power supply/battery capacity on the chip should be able to handle that much current. Both peak power and average power are important considerations.”
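
The arithmetic behind those two numbers is simple but worth making explicit. The sketch below uses Sachdeva’s 100 mW / 500 mW figures together with an assumed 3.7 V, 1,000 mAh battery to show how average power maps to runtime and peak power maps to the current the supply must deliver.

```python
# Illustrative arithmetic only; the battery voltage and capacity are assumptions.
AVG_POWER_W = 0.100    # average power from the profile
PEAK_POWER_W = 0.500   # peak power from the profile
VBAT = 3.7             # nominal battery voltage (assumed)
CAPACITY_MAH = 1000    # candidate battery capacity (assumed)

avg_current_ma = AVG_POWER_W / VBAT * 1e3
peak_current_ma = PEAK_POWER_W / VBAT * 1e3
runtime_h = CAPACITY_MAH / avg_current_ma

print(f"average draw: {avg_current_ma:.0f} mA -> ~{runtime_h:.0f} h on {CAPACITY_MAH} mAh")
print(f"peak draw:    {peak_current_ma:.0f} mA -> supply must deliver at least this")
```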

Peak power requires constant monitoring of the system to determine how much power is being consumed.

“Here, more intrusive analysis is needed, and emulation can help do this more intrusively and over a bigger period of time,” he said. “But if we were running peak power in a very short window, you’d run a very small synthetic test case to find out what the peak power is. Traditionally, what designers then do is pick a high-traffic scenario and tell the verification engineer to create a test case where all the transmit and receive activity is happening and all the ports are activated. Then they will measure the power consumption there. This worked fine, but there is still the possibility that when you have your software stack up and running, you have the power management running in software, enabling and disabling the different parts of the chip, and that might unveil a very different scenario. To cover that you actually run emulation, and during the emulation run you actually measure power without dumping out the data. You have to optimize it. And you measure the power. So you get a peak power profile graph. Then you can identify regions of interest where the power is picking up.”
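
That two-pass approach, a coarse profile captured without dumping full waveforms followed by detailed capture only where the power picks up, can be sketched as follows. The functions standing in for the emulator runs are hypothetical placeholders, and the threshold of twice the average power is an arbitrary choice for the example.

```python
# Two-pass flow: coarse power profile first, detailed capture only for hot windows.
import random

random.seed(7)

def coarse_profile(num_windows=200):
    """Pass 1 stand-in: one average-power number per coarse window, no waveform dump."""
    prof = [100 + random.uniform(-5, 5) for _ in range(num_windows)]
    prof[120] = 510  # pretend the OS boot / firmware burst lands here
    return prof

def detailed_capture(window_idx):
    """Pass 2 stand-in: re-run (or replay) just this window with full activity dumped."""
    print(f"  re-capturing window {window_idx} with per-cycle activity enabled")

profile = coarse_profile()
threshold = 2.0 * (sum(profile) / len(profile))   # flag anything above 2x average
for idx, p in enumerate(profile):
    if p > threshold:
        print(f"window {idx}: {p:.0f} mW exceeds threshold")
        detailed_capture(idx)
```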

Achieving optimal performance
The key in all of this is constant refinement and integration.

“When it comes to performance, you want to be able to make changes,” Schirrmeister said. “We suggest using less accurate but faster data to create toggle counts early over the full run. Find the area of highest activity, then create more accurate activity data in the downstream tools that have the estimation capability underneath; or you do it at the gate level, directly.”

One interesting challenge is always how to make changes if things don’t work out. “You may make changes to the power regions, you may make changes to the levels of clock gating, but the big effects come from moving things around, like between the hardware and software. When you’re at this point it’s pretty late to run RTL, so you want to connect this to the front-end flows. Twenty years ago we thought everything would be abstracted, and from those abstractions you go down into refinements and it would all be top-down. The reality, especially when it comes to interconnect and so forth, is that these tools autogenerate the interconnect, so you can just brute-force it instead of simulating it at a higher level. You can use the bigger run in the big iron, where you run it at the RTL level. In the past, we never thought that would be possible because you need to implement it first. But given that the implementation of the interconnect has been automated, and a lot of the things you do in DDR and so forth are part of the configuration, they don’t require full re-implementation.”


