Deciphering Performance Analysis

Improving system performance requires much more than simulation.


Simulation traditionally has been the go-to technology for improving system performance, but practices are evolving and maturing because engineering teams need to simulate in multiple domains and at multiple levels of abstraction. In addition, they need to match the level of simulation to the types of models they have available and the kinds of analysis they’re trying to do.

“We’re seeing a lot of [engineering teams] being willing to mix levels of abstraction and modeling styles, and understanding that’s needed in a lot of cases,” observed Jon McDonald, technical marketing engineer at Mentor Graphics. “It’s a lot more than the traditional , which is what people have been doing forever. That’s always the first step, but people have been driven much more recently by other kinds of issues like performance analysis. It’s difficult to decide ahead of time how you want to build something if you haven’t done some level of quantitative analysis.”

To be sure, performance analysis is one of the most frequently performed activities before bidding for projects or designing products. According to Vinay Mangalore, VP of engineering at PathPartner Technology, this is especially true if the product is high-volume and designed for multimedia or network processing. “Higher data rates, higher resolutions, and increasing penetration of connected devices make it imperative to design products that are future-proof to a reasonable degree and are designed for optimal requirements to keep the cost low. Performance would include real-time operation, memory throughput, latency requirements and power consumption.”

As a category of work, performance analysis happens at many different stages throughout the development of the electronic product. “It’s not just ‘early,’ and it is important as the hardware is developed,” said Pat Sheridan, director of product marketing for virtual prototyping at Synopsys. “It’s important even post-silicon, because you’re always wanting to understand how the product is going to perform in the hands of the customer.”

This is easier said than done, however, McDonald pointed out. “Increasingly, there are more design teams doing software analysis using virtual platforms to look at the performance, do some what-if analysis: ‘Okay, if I take this out and put it in the FPGA fabric, how fast does it have to be?’ If you take a reasonably complex system, it doesn’t take long before you just can’t keep up with all the variables in the spreadsheet. Really the only way to deal with it is in simulation.”

The key here is heterogeneous simulation. “It’s not just digital elements,” he said. “There is often some real world interaction, whether it be analog, mechanical or even user interaction that needs to be taken into account. As a result, the response times are important and often not purely modeled in the digital portion of the design, so being able to link to other simulation models or domains is important. The other big part of heterogeneous simulation is that there is no one simulation language that covers everything that everyone wants. You’ve got models in Simulink, you’ve got models in C++, you’ve got hardware-in-the-loop capabilities, and people are trying to pull all of these things together — even RTL at some level. They are trying to create a simulation model where they can do some of this analysis and performance tuning and architectural tradeoffs before they commit to a design.”

The big problem has always been models. Without models, engineers can’t do more simulation.

“By enabling heterogeneous simulation, there’s usually some model available for most of the elements in the system,” he added. “If we can find those models, pull them together, that really helps us enable the system-level simulation.”

In the past, this was done on an ad hoc basis. More standards are now being created for exchanging simulation models across domains, such as the Functional Mock-up Interface (FMI). That could help speed up the performance analysis process.

“Performance has always been an issue, where with abstraction we thought we could annotate enough information into the abstracted model to give feasible performance data,” said Frank Schirrmeister, senior group director, product management in the System & Verification Group at Cadence. “It turns out that’s not the case. The big example here is a performance model for the interconnect. If you have the interconnect being highly configurable—like the AMBA ACE implementations where you essentially need a tool to configure it and you have, let’s say, 400 parameters—if you build a model for it, then building the model itself takes time. But even worse, verifying the model takes even more time.”

Headaches will vary
For companies that support SoC-level integration, the problem is no longer just whether the chip will be functional. It’s whether it will function at the required level of throughput. There, performance becomes a critical element.

This is more than just cycles per second, said Drew Wingard, CTO at Sonics. “Am I going to be able to get enough bandwidth out of this DRAM in order to be able to do this and this at the same time? Now we are into performance modeling. There has been a general consensus that the right way to do most of that performance analysis as it applies to shared resources on the chip is to build a cycle-accurate model of the memory subsystem, which is the shared resource, and the interconnect that services that, and to run abstracted traffic representative of the system that needs to share. So we don’t have to build a full functional model of a microprocessor. We don’t have to run actual software. We need to understand in detail what are the statistics of the cache misbehavior of that processor—out of the video decoder, out of the display engine and out of the GPU, and out of all the other elements. To us they just become traffic sources with interesting statistical properties and with very interesting address patterns because the memory is address-dependent.”
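The idea of replacing processors and accelerators with statistical traffic sources can be illustrated with a small sketch. The Python below is not Sonics' method or any vendor tool. It is a toy cycle-stepped model, under stated assumptions, of several sources sharing one memory port with round-robin arbitration. The function name, the per-cycle request probabilities, and the fixed service times are all hypothetical, and a real model would also capture address-dependent DRAM behavior.

```python
import random
from collections import deque


def simulate_shared_memory(sources, cycles, seed=1):
    """Toy model of abstract traffic sources sharing one memory port.

    sources: dict mapping a source name to a tuple of
             (request probability per cycle, service cycles per request).
    Returns a dict mapping each source name to a list of request latencies
    (cycles from issue to completion).
    """
    rng = random.Random(seed)          # seeded so runs are repeatable
    queues = {name: deque() for name in sources}
    latencies = {name: [] for name in sources}
    order = list(sources)
    next_idx = 0                       # round-robin pointer
    busy_until = 0                     # cycle at which the port is free again

    for now in range(cycles):
        # Each source may issue one request this cycle.
        for name, (prob, _service) in sources.items():
            if rng.random() < prob:
                queues[name].append(now)

        # When the port is free, grant it to the next source with work pending.
        if now >= busy_until:
            for offset in range(len(order)):
                name = order[(next_idx + offset) % len(order)]
                if queues[name]:
                    issued = queues[name].popleft()
                    busy_until = now + sources[name][1]
                    latencies[name].append(busy_until - issued)
                    next_idx = (next_idx + offset + 1) % len(order)
                    break
    return latencies
```

Calling `simulate_shared_memory({"cpu": (0.05, 4), "video": (0.10, 4)}, 2000)` returns per-source latency lists that can be summarized as averages or worst cases. No software runs and no processor is modeled; each initiator is reduced to a request rate and a service cost.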

What’s not so well captured in normal flows are the actual requirements because shared resources are frequently over-subscribed.

“Then everything around arbitration and scheduling becomes really, really important,” Wingard said. “Physics doesn’t let me do this all at the same time, so who is allowed to wait, and how long are they allowed to wait? Without a clear, unambiguous description of who is going to have to wait, without a crisp definition of that, we can’t do the equivalent of static timing analysis. We can’t do static performance proving. We can’t formally prove that this system is going to meet its performance requirements without an understanding of what those requirements are.”
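One way to picture a "crisp definition" of who may wait and for how long is a machine-checkable requirement per initiator. The sketch below is an assumption for illustration only, not a format any tool in this article is known to use. The `Requirement` fields and the `check_requirements` helper are hypothetical names, and the latency lists are assumed to come from some simulation run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    initiator: str
    max_latency_cycles: int    # how long this initiator is allowed to wait
    min_requests_served: int   # throughput floor over the observed window


def check_requirements(requirements, latencies):
    """Compare observed latencies against requirements.

    latencies: dict mapping initiator name to a list of observed latencies.
    Returns a list of violation messages; an empty list means all were met.
    """
    violations = []
    for req in requirements:
        observed = latencies.get(req.initiator, [])
        if len(observed) < req.min_requests_served:
            violations.append(
                f"{req.initiator}: served {len(observed)} requests, "
                f"needs at least {req.min_requests_served}"
            )
        worst = max(observed, default=0)
        if worst > req.max_latency_cycles:
            violations.append(
                f"{req.initiator}: worst latency {worst} cycles exceeds "
                f"{req.max_latency_cycles}"
            )
    return violations
```

With requirements written down this way, each simulated use case either passes or produces a concrete list of misses, which is the precondition Wingard describes for any more formal analysis.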

Scaling performance analysis
Performance analysis on a block level is feasible, but scaling it to the full chip level in a complex SoC gets much murkier.

“The way you need to simulate to get this performance information is with dynamic simulation of the multicore behaviors, in contrast to a spreadsheet estimate,” said Synopsys’ Sheridan. “If the scope of your calculation is simple enough that you can get a good estimate with a static formula or the math, then that’s fine. But where the spreadsheet doesn’t work on its own is when you have dynamic multicore behavior. It’s not obvious how the system will perform under certain conditions, so this is where SystemC system-level simulation with the right detail in the transaction-level models is focused. It can give you an early look at the dynamic behavior of the system.”
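The gap between a spreadsheet estimate and dynamic behavior can be shown with a deliberately small example. In the Python sketch below, which uses hypothetical numbers and function names, two traffic patterns have the same average load on a shared resource, so a static utilization formula cannot tell them apart. A simple first-come, first-served simulation shows very different worst-case latencies.

```python
def static_utilization(requests_per_window, service_cycles, window_cycles):
    """Spreadsheet-style estimate: fraction of the window the resource is busy."""
    return requests_per_window * service_cycles / window_cycles


def worst_latency(arrival_cycles, service_cycles):
    """Dynamic view: one shared server, first-come first-served.

    Returns the worst latency from request arrival to completion.
    """
    free_at = 0
    worst = 0
    for arrival in sorted(arrival_cycles):
        start = max(arrival, free_at)      # wait if the server is still busy
        free_at = start + service_cycles
        worst = max(worst, free_at - arrival)
    return worst


# Both patterns issue 100 requests of 8 cycles each in a 1000-cycle window.
smooth = [i * 10 for i in range(100)]                      # one every 10 cycles
bursty = [b * 250 for b in range(4) for _ in range(25)]    # four bursts of 25

print(static_utilization(100, 8, 1000))   # 0.8 for either pattern
print(worst_latency(smooth, 8))           # 8
print(worst_latency(bursty, 8))           # 200
```

The static number says the resource is 80% loaded in both cases. Only the simulation reveals that the bursty pattern makes the last request in each burst wait 25 times longer than the smooth one.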

For complex multicore SoCs, one way to think about this is in the context of virtual prototypes, he said. “You are modeling using loosely timed SystemC TLM models and trying to go as fast as possible, because in that context you have to model all of the things in the system that the software can see. When you do performance analysis initially, you definitely do not have to model the entire system with accurate models. You can focus on the components within the block diagram of the product that are most relevant for the performance of the system. That would be the on-chip interconnect, the memory subsystem, and some representation of the application workload, which can be traffic that you are just replaying while watching how the different streams of traffic interact as the system scenario runs. Or it could be software running on a processor. In that case you do need more things in the model such that the software can boot, and those are things that may not be relevant for the measurement but are important just to get the software to boot. In both use cases there is this idea that you can iterate and refine what the model is doing, and the scope of the effort can be incremental. In the context of performance analysis you start with the most critical components of the interconnect and memory subsystem, and then you can expand that as required to check different subsystem performance.”

McDonald confirmed there are customers currently using approximate timing and performance analysis on very large complex SoCs and systems. “The more accurate you need the timing, the more work needed in the modeling, but we have seen examples with 95% timing accuracy when compared to the actual systems. One of the key attributes to make this manageable is a model that allows incremental refinement of the timing accuracy. You cannot afford to rewrite the model every time you want to improve the timing accuracy. You need models that allow for incremental refinement. We can start with no timing, then add some approximate timing very quickly. This approximate timing can be refined to provide more accuracy as needed. But this refinement does require additional investment in the models.”
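Incremental refinement is easier when timing is kept separate from function. The sketch below is one possible structure, not a description of Mentor's models. It is written in Python rather than SystemC for brevity, and the class name, the timing functions, and every cycle count are hypothetical. The functional memory stays the same while the timing policy is swapped from untimed to approximate to a somewhat more detailed, address-dependent estimate.

```python
class MemoryModel:
    """Functional memory with a pluggable timing policy."""

    def __init__(self, timing):
        self.data = {}
        self.timing = timing      # callable(addr, previous_addr) -> cycles
        self.prev_addr = None
        self.elapsed = 0          # accumulated cycles charged so far

    def _account(self, addr):
        self.elapsed += self.timing(addr, self.prev_addr)
        self.prev_addr = addr

    def write(self, addr, value):
        self._account(addr)
        self.data[addr] = value

    def read(self, addr):
        self._account(addr)
        return self.data.get(addr, 0)


def untimed(addr, prev):
    """Step 1: functional only, no timing."""
    return 0


def approximate(addr, prev):
    """Step 2: one assumed average latency for every access."""
    return 10


def refined(addr, prev, row_bits=10):
    """Step 3: cheaper when the access stays in the same (assumed) DRAM row."""
    same_row = prev is not None and (addr >> row_bits) == (prev >> row_bits)
    return 4 if same_row else 20
```

Running the same access sequence through `MemoryModel(untimed)`, `MemoryModel(approximate)` and `MemoryModel(refined)` gives identical data results and progressively more detailed cycle counts, without rewriting the functional part of the model.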

Wingard pointed out that in the simplest model all components communicate with each other through shared memory, and that shared memory is off chip in DRAM. But it would be a horrendous scaling problem from a simulation perspective. “You can’t afford to have that much memory bandwidth. As a result, a transition has taken place. There is a lot more use of an extra level of memory hierarchy on chip, whether that be SRAM based or software managed SRAM or hardware managed cache or some mix thereof. We are seeing the emergence of more true peer-to-peer traffic where the result of one stage of the algorithm might communicate with a buffer that’s at the next stage, instead of going through shared memory. All these things take pressure off the memory system. They make the overall system more complex, but from a modeling perspective, if you are willing to use abstraction, they actually make it simpler because it’s relatively inexpensive to guarantee peer-to-peer paths are going to make performance. And if you can guarantee they’re going to make performance then you don’t have to analyze them in your simulation, so you can abstract them away. You can take large chunks of this and separate them into a different problem. This is one of those few circumstances in which what would have been a horrible modeling complexity problem actually ends up being solved because it also happens to be an architectural problem.”

Accuracy is always an issue, he noted. In all cases engineers are looking at an abstraction of the actual traffic and dealing with the fact that they don’t have the real software, so there is no substitute for the time-honored practice of rounding up a little bit.

“We over-design the system a bit and a better system designer will be more confident about how much they should or shouldn’t round up and where,” said Wingard. “Marketing groups within these companies have engaged much more in the definition of the use cases that the silicon has to support. One of the interesting things about performance analysis is that it isn’t a singular thing. Performance analysis is across a range of use cases that the chip is expected to be able to operate well in at the system level, and across one or more end user devices. A really interesting challenge is how do you make the tradeoff between more use cases and more detail versus fewer use cases and less detail, but still have enough rounding to be safe.”


Bijoy P says:

Dynamic simulation with real use cases modelled brings out the FIFO depth requirement of the IPs at their boundary interface to the interconnect. IPs may be designed with oversized FIFO depth, but performance analysis with corner cases modelled and worst-case traffic scenarios helps SoC/IP designers resize their boundary FIFOs. We use this approach in our SoC.
