Software Debug Gets Tricky

When it comes to debugging software in multicore designs, traditional emulation needs a few tweaks.


By Ann Steffora Mutschler
As designs continue to grow in size and complexity, the number of processing cores keeps rising. Additional cores, in turn, allow more software to be run on those cores, and debugging that software becomes critical.

Traditionally, emulation has played a significant role in verifying software against RTL code, and it continues to do so. But with the advent of multicore architectures, the picture is evolving.

“Almost everything is multicore,” observed Jim Kenney, director of marketing for emulation at Mentor Graphics. “ARM is out there pushing their big.LITTLE architecture, which is at least two cores, and in some cases it’s eight. Almost everything we see is multicore. The emulators handle it just fine because the processors are modeled as RTL in the emulator. If you have multiples of them, you have multiples in the emulator. They connect up just the way they do in the design. All the clocks and all the synchronization works exactly as the real design does, so in terms of execution there’s no difficulty at all for the emulator to be able to run a multicore design.”

Lauro Rizzatti, senior director of marketing for emulation at Synopsys, agreed that if an emulator has enough capacity to accommodate big designs, it will take a multicore design without any problem. “With multiple cores, there is a lot of software being processed by the cores, and that software will reside in memory that typically is inside the emulator. Actually, this would be an ideal scenario because basically the whole design and the software are running inside the box at the maximum speed achievable, so there are no dependencies on the outside world. But then again, there could be a testbench in addition to the embedded software.”

Where multicore gets tricky is debugging software, Kenney asserted. “The standard method for debugging software on an emulator is a JTAG probe, so these cores all have a JTAG debug port on them. You connect up a hardware probe just like you would if the processor was on a board. That hooks to a software debugger, and you can debug your software as it’s executing on the RTL version of the processor core as it’s modeled in the emulator.”

The problem is JTAG makes the processor do a ton of work, he said. “Let’s say for example you want to read a register. ‘I’d like to know what the value of R0 is in my processor.’ You make that request to the processor via JTAG and the processor will run 150,000 clocks to come up with the answer. Step instruction: 1,000,000 clocks. JTAG when you’re running a single core on real hardware is not so bad. But when you are running at emulation speeds, a megahertz or so, it can be real sluggish making it go off and run millions of instructions. You picture single stepping through your code, and every time you click the step button it’s more than a second before it moves. That’s going to drive a software guy crazy,” Kenney said.
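The figures Kenney cites make the latency easy to quantify. Here is a back-of-the-envelope sketch in Python, assuming his numbers (150,000 clocks for a register read, roughly 1,000,000 for a step instruction) and an emulation clock of about 1 MHz:

```python
# Back-of-the-envelope latency for JTAG-driven debug at emulation speed.
# The clock counts are the figures cited above, taken as representative.

def jtag_latency_s(clocks_per_op: float, emu_clock_hz: float) -> float:
    """Seconds the emulated design spends servicing one JTAG debug operation."""
    return clocks_per_op / emu_clock_hz

EMU_CLOCK_HZ = 1e6  # "a megahertz or so"

register_read = jtag_latency_s(150_000, EMU_CLOCK_HZ)    # read a register
single_step   = jtag_latency_s(1_000_000, EMU_CLOCK_HZ)  # step one instruction

print(f"register read: {register_read:.2f} s")  # 0.15 s
print(f"single step:   {single_step:.2f} s")    # 1.00 s
```

At real-silicon clock rates of hundreds of megahertz, the same operations finish in fractions of a millisecond, which is why JTAG feels fine on a board yet sluggish on an emulator.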

The other big issue is multicore synchronization. Picture a multicore ASIC in which the cores are running synchronously, passing messages back and forth and doing whatever they do to stay synchronized. If one of them is asked to execute a step instruction and run the million clocks that takes, all of the synchronization goes out the window, he explained.

“If you’re trying to capture an event or trying to debug an event that you saw came up with a wrong answer in the emulator, in just asking the processor to do a single step, you clock the rest of your ASIC gates a million times and that event is gone. You probably walked all over it way before a million clocks and the other processors that were passing messages back and forth are just sitting there while you advance this other one a million clocks and so it’s very intrusive. JTAG debug is very intrusive and when it comes to multicore, it breaks all the synchronization. In the old days with a single core, JTAG was ok – a little sluggish but ok. But now with multicore, it doesn’t really work very well,” Kenney pointed out.
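A toy model makes the intrusiveness concrete. The sketch below is a hypothetical illustration, not any real emulator API; the clock count follows Kenney's figure. Two cores normally retire one instruction per clock in lock step, but a JTAG single step clocks the whole design a million times while the stepped core retires just one instruction:

```python
# Toy model of JTAG intrusiveness in a lock-step multicore design.
# Hypothetical illustration only; the 1,000,000-clock cost is the
# figure quoted above for one "step instruction" over JTAG.

STEP_COST = 1_000_000  # clocks consumed by one JTAG step instruction

def free_run(clocks):
    """Normal execution: both cores retire one instruction per clock.
    Returns (instructions retired on core A, on core B)."""
    return clocks, clocks

def jtag_step_core_a(clocks=STEP_COST):
    """Core A is held in debug mode and retires a single instruction,
    while the rest of the ASIC keeps clocking normally the whole time."""
    return 1, clocks

a_instrs, b_instrs = free_run(10)
print(a_instrs - b_instrs)  # 0 -> cores are in lock step

da, db = jtag_step_core_a()
print(db - da)              # 999999 -> synchronization is gone
```

Any event the debugger was chasing sits somewhere inside those million clocks, long since overwritten by the time the step completes.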

A third issue with debugging software in multicore designs on an emulator is the cost of software debug seats. “There are two reasons people build FPGA prototypes. One is they can run software faster. The other is they can build software seats cheaper than buying emulators to do it. There are a whole lot more software guys on these teams than hardware guys, and they all say they need access to the emulator, and the guys who are on the emulator say there is no way we have that much capacity for you,” he added.

“FPGA prototypes are popular for software debug because they run fast and they are cheap – as long as the design fits in the number of FPGAs on the board,” Rizzatti noted. Commercial FPGA prototyping boards are limited in the number of FPGAs they carry, while in-house boards, such as Qualcomm’s, may contain tens of FPGAs.

As a result of these software debugging issues in multicore designs, the practice of using transactors is gaining steam. Essentially, a transactor is a kind of verification IP (VIP) that combines complex software and hardware. It allows the emulation run to be recorded so the software can be debugged later, off the emulator, in a faster debug environment.
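The record-for-offline-debug idea can be sketched in a few lines. Everything below is a hypothetical illustration, not any vendor's transactor API: traffic crossing the transactor is logged during the emulation run, then replayed through a simple software model afterward, consuming no emulator time:

```python
# Minimal sketch of record/replay, the idea behind transactor-based debug.
# All class and function names here are invented for illustration.

from dataclasses import dataclass

@dataclass
class Transaction:
    cycle: int
    addr: int
    data: int

class RecordingTransactor:
    """Sits between the testbench and the emulated design, logging traffic."""
    def __init__(self):
        self.trace = []
    def write(self, cycle, addr, data):
        self.trace.append(Transaction(cycle, addr, data))
        # ...in a real flow the transaction is also forwarded to the emulator...

def replay(trace):
    """Offline replay: reconstruct memory state without touching the emulator."""
    mem = {}
    for t in trace:
        mem[t.addr] = t.data
    return mem

tx = RecordingTransactor()
tx.write(100, 0x1000, 0xAB)
tx.write(250, 0x1000, 0xCD)  # later write to the same address

print(hex(replay(tx.trace)[0x1000]))  # 0xcd
```

Because the debug session runs against the recorded trace rather than the live design, single-stepping no longer clocks the rest of the ASIC at all.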

This also speaks to the expanding role of emulation in general, and the ability of the technology to be used with other prototyping and simulation technology in a more integrated fashion.

Cadence fellow Mike Stellfox sees emulation as an integral part of the move toward incremental refinement flows in the context of system-level modeling and analysis. “There are two levels where people are doing coarse-grained, abstract modeling in SystemC and virtual platforms, and then there’s this type of analysis we’re now starting to see taking off with taking configured IPs and doing the analysis there. It’s a good opportunity for these two worlds to come together, where you should be able to quickly build environments of models of your chip that mix abstractions between SystemC/TLM and RTL in order to facilitate a more automated, incremental refinement flow. There’s always been this dream of ESL, and I think so far it hasn’t been realized because there’s been a disconnect between the ESL tools and the RTL flows that are there.”

He noted that some incremental refinement flows have been developed, but the trick is choosing the right abstraction for the right level of analysis, rather than insisting that everything should be TLM or everything should be RTL. Then, for those parts of the system that need to be refined, the analysis can be done at a greater level of refinement based on their use cases.

“The trick is being able to put in place a good flow, which will require a lot of automation because of all the different types of models that allow you to switch between abstractions and mix abstractions. The systems are so complex you do need accuracy, and one way we’re tackling that is with hardware acceleration and being able to combine platforms where you have some parts modeled in TLM and some modeled in RTL, perhaps running in an emulation machine like Palladium. That’s one way we’re tackling that today, and I see more and more that’s going to become the mainstream flow, where you’re going to have no choice but to do it that way,” Stellfox said.
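The mixed-abstraction platform Stellfox describes can be caricatured in a short sketch. The classes below are invented for illustration (a real flow would use SystemC/TLM models coupled to RTL in an emulator): a transaction-level block advances one whole transfer per call with no notion of clocks, while a cycle-level stand-in for RTL advances clock by clock, and a platform loop keeps the two in step:

```python
# Hypothetical caricature of mixing abstraction levels in one platform.
# Not a real TLM or emulation API; names are invented for illustration.

class TlmMemory:
    """Transaction-level model: one function call per whole transfer,
    no clocks simulated, just behavior."""
    def __init__(self):
        self.data = {}
    def write(self, addr, value):
        self.data[addr] = value

class CycleCounter:
    """Cycle-accurate stand-in for an RTL block: state advances per clock."""
    def __init__(self):
        self.count = 0
    def clock(self):
        self.count += 1

def run_platform(cycles):
    """Platform loop: the RTL-style block sees every clock, while the
    TLM-style block is invoked only when a transaction occurs."""
    mem, ctr = TlmMemory(), CycleCounter()
    for cycle in range(cycles):
        ctr.clock()                              # cycle-level side
        if cycle % 4 == 0:                       # a transfer every 4th cycle
            mem.write(0x100 + cycle, ctr.count)  # transaction-level side
    return mem, ctr

mem, ctr = run_platform(8)
print(ctr.count)      # 8  -> every clock was simulated on the RTL side
print(len(mem.data))  # 2  -> only two transactions hit the TLM side
```

The asymmetry is the point: accuracy is paid for only where the analysis needs it, which is what makes these combined platforms fast enough to be practical.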

It used to be that the primary use of emulation was to integrate software a little earlier, have a full in-circuit target environment before tapeout, and validate that some level of software and hardware worked together in a real target system. “That’s still a very important thing and people are still doing that, but emulation/acceleration has definitely moved more into the mainstream design and verification flow because of the need for performance. It’s become key to being able to scale the verification activities or just enable software to be developed against the real RTL and actually run well ahead of the silicon,” he concluded.