Multicore Debug Evolves To The System-Level

Complexity is making this process more difficult, but new and better approaches are being developed.


The proliferation and expansion of multicore architectures is making debug much more difficult and time-consuming, which in turn is increasing demand for more comprehensive system-level tools and approaches.

Multicore/multiprocessor designs are the most complex devices to debug. More interactions and interdependencies between cores mean more things can go wrong. In fact, so many problems are possible that engineering teams frequently wonder where to begin. Compared with single-core designs, multicore designs have more execution streams to debug at the same time. In addition, instructions are executing in every thread at once, and complex interactions occur between the cores.

“Microarchitectural features like out-of-order execution can aggravate the debug further as the intermediate register and memory states keep changing even for the re-executions of the very same test,” said Shajid Thiruvathodi, CTO at Valtrix Systems.

To simplify the debug, it’s important to first figure out at an algorithmic level where a failure happens.

“The execution traces must be annotated with debug prints which identify the transactions from different cores,” Thiruvathodi said. “This will make it easy to comprehend the trace and get a sense of what is happening at a global level. Debugging issues at a system level often requires matching a transaction with the master originating it. From a stimulus standpoint, it is very important for each core to write a shared memory only with unique values that are not repeated again in the test. This helps in disambiguating transactions when looking at a bus trace or cache contents.”
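The unique-value advice can be sketched in a few lines of Python. This is a minimal illustration, not Valtrix's method: the 8-bit core ID and 24-bit sequence split, and the names make_value, decode_value, and CoreStimulus, are all assumptions chosen for the example.

```python
# Hypothetical layout (an assumption for this sketch):
# upper 8 bits = core id, lower 24 bits = per-core sequence number.
CORE_BITS = 8
SEQ_BITS = 24


def make_value(core_id: int, seq: int) -> int:
    """Pack a core id and a per-core sequence number into one 32-bit store value."""
    if not 0 <= core_id < (1 << CORE_BITS):
        raise ValueError("core_id out of range")
    if not 0 <= seq < (1 << SEQ_BITS):
        raise ValueError("sequence number exhausted; values would repeat")
    return (core_id << SEQ_BITS) | seq


def decode_value(value: int):
    """Recover (core_id, seq) from a value seen in a bus trace or cache dump."""
    return value >> SEQ_BITS, value & ((1 << SEQ_BITS) - 1)


class CoreStimulus:
    """Hands out store values for one core, never repeating within a test."""

    def __init__(self, core_id: int):
        self.core_id = core_id
        self.next_seq = 0

    def next_store_value(self) -> int:
        value = make_value(self.core_id, self.next_seq)
        self.next_seq += 1
        return value


if __name__ == "__main__":
    cores = [CoreStimulus(c) for c in range(4)]
    # Interleave three stores from each of four cores, as a bus trace might show them.
    trace = [core.next_store_value() for _ in range(3) for core in cores]
    for v in trace:
        core_id, seq = decode_value(v)
        print(f"0x{v:08x} -> core {core_id}, store #{seq}")
```

Because every value carries its origin, any word found in shared memory or on the bus can be traced back to one core and one store without consulting the test source.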

As with all debug, it needs to happen in context. But that context can be tremendously complex in multicore designs.

“It is absolutely vital to take a system-level view, rather than looking at things from a processor-centric standpoint,” said Gajinder Panesar, fellow at Mentor, a Siemens Business. “You have to accept that most multicore systems, like many very complex systems, are impossible for the human brain to comprehend without significant technological help. You need to gather lots of data to understand the operation of a modern SoC, but you also need tools that can help you understand that data. That might include the right kind of visualizations to help human comprehension — perhaps as simple as the ability to view hardware and software operation simultaneously. Or it could be more advanced data analytics, heatmaps, and correlations that guide the engineer towards a diagnosis of the problem.”

There are some specific factors to consider when debugging multicore systems, including such time-tested concepts as “halt on error” and “stop on error.”

“In a multicore system, if one of the processors throws an error, it really does depend on the whole system,” Panesar said. “The likelihood is that one processor failing, for whatever reason, will eventually bring down the whole system. In debug terms, you have a choice. When you see that error you can halt the processor that’s in error, with all the processes executing within it halting, but all the other cores continuing. Generally, when you see an error, you don’t want that to happen. You want to bring the whole system gracefully to a stop. Stop on error is a more general solution. When an error occurs, the errant core will cause the system to come to a stop.”

The challenge is getting the halt/stop process to happen gracefully. That doesn’t work with the conventional method, in which a host-based debugger detects and traps an error, and then sequentially informs other cores to halt and extracts their state. It simply takes far too long to bring all of the cores in a many-core system to a halt. In contrast, a hardware on-chip approach can detect an error and stop the system immediately, so that state can be captured and the problem analyzed.
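A back-of-the-envelope model shows why sequential halting scales poorly. The sketch below is illustrative only: the function names and every cycle count are made-up assumptions, not measurements from any debugger or chip.

```python
def sequential_halt_overrun(num_cores, detect_cycles, per_core_halt_cycles):
    """Cycles each core keeps running after the error when a host halts cores one at a time."""
    return [detect_cycles + (i + 1) * per_core_halt_cycles for i in range(num_cores)]


def broadcast_halt_overrun(num_cores, trigger_cycles):
    """Cycles each core keeps running when on-chip logic signals all cores together."""
    return [trigger_cycles] * num_cores


def skew(overruns):
    """Spread between the first and the last core to stop."""
    return max(overruns) - min(overruns)


if __name__ == "__main__":
    cores = 64
    # Assumed figures: host notices the error after 10,000 cycles,
    # then needs 2,000 cycles per halt request over the debug link.
    host = sequential_halt_overrun(cores, detect_cycles=10_000, per_core_halt_cycles=2_000)
    # Assumed figure: an on-chip trigger reaches every core in 20 cycles.
    chip = broadcast_halt_overrun(cores, trigger_cycles=20)
    print("host-driven: last core overruns", max(host), "cycles, skew", skew(host))
    print("on-chip:     last core overruns", max(chip), "cycles, skew", skew(chip))
```

In this model the host-driven skew grows linearly with the number of cores, so the captured state of the last core may reflect a point long after the error, while the broadcast case stops all cores at roughly the same point.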

This isn’t a one-time fix, either. Debugging needs to be viewed as an ongoing activity throughout the design, development, and deployment. “The key difference between a single-core and multicore design is the increased number of cores, which compounds the debug effort due to interactions between the cores and peripherals,” said Simon Davidmann, CEO of Imperas Software. “A multicore design could be based on a uniform core configuration or heterogeneous with a range of configurations or even a mix of ISAs. Multicore also can include a design hierarchy of subsystems and clusters of cores in various array structures.”

It gets even more complicated. “A multicore design is likely non-deterministic on physical hardware without the ability to control one core in relation to another,” said Davidmann. “Limited access may even limit the options to stop some of the other cores while debugging another. This requires a comprehensive debug approach across the design phases, and a mixed tool set-up could be complex to synchronize.”

Moreover, debugging a system is not just debugging the code executing on the cores.

“If a system failure is detected — as opposed to some kind of error in an individual processor, such as a corruption on a memory access — the system needs to gracefully stop and capture state,” said Mentor’s Panesar. “What’s needed is the ability to say, ‘Stop now.’ This requires a hardware debug widget, and for the debug interfaces to the cores to be managed in a coherent and timely manner. What’s needed in the multicore system is a hardware infrastructure that supports that process with minimal processor intervention. That hardware infrastructure could be an embedded analytics bus monitor that allows you to look at transactions between on-chip blocks to analyze system behavior, not just CPU code execution. It is not confined to looking at processor code execution. It’s not just a dumb monitor. It is run-time configurable, allowing you to look for specific transactions or classes of transaction that may be of interest.”
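In software terms, a run-time configurable monitor behaves roughly like a set of named match rules that can be added or removed while transactions stream past. The sketch below is a conceptual model only; Transaction, BusMonitor, and the address ranges are hypothetical and do not describe any vendor's hardware or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Transaction:
    cycle: int
    master: str      # originating block, e.g. "cpu0" or "dma"
    addr: int
    is_write: bool
    data: int


class BusMonitor:
    """Records only the transactions that match a currently installed filter."""

    def __init__(self):
        self._filters: Dict[str, Callable[[Transaction], bool]] = {}
        self.hits: List[Tuple[str, Transaction]] = []

    def add_filter(self, name: str, predicate: Callable[[Transaction], bool]) -> None:
        self._filters[name] = predicate

    def remove_filter(self, name: str) -> None:
        self._filters.pop(name, None)

    def observe(self, txn: Transaction) -> None:
        for name, predicate in self._filters.items():
            if predicate(txn):
                self.hits.append((name, txn))


if __name__ == "__main__":
    mon = BusMonitor()
    # Assumed mailbox region 0x1000-0x10FF; watch for DMA writes into it.
    mon.add_filter(
        "dma_writes_to_mailbox",
        lambda t: t.is_write and t.master == "dma" and 0x1000 <= t.addr < 0x1100,
    )
    for txn in [
        Transaction(1, "cpu0", 0x1000, True, 0xA),
        Transaction(2, "dma", 0x1004, True, 0xB),
        Transaction(3, "dma", 0x2000, True, 0xC),
    ]:
        mon.observe(txn)
    for name, txn in mon.hits:
        print(name, txn)
```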

Single-core vs. multicore
This represents a break from the traditional uniprocessor approach, which initially served as the basis for multicore debug. But the limitations of leveraging the uniprocessor approach became apparent rather quickly.

“This works until you get to the tricky problems, which all seem to occur around shared resources between the cores,” said Larry Melling, director of product management at Cadence. “When you’re dealing with multiple cores, the shared resources are what create the stickier issues. The reason for that is with each core, software in general is written to say, ‘I don’t want to be timing-dependent upon the hardware.’ It wants to execute and be timing-agnostic. If there are two cores, each executing software that wants to be timing-agnostic, and they have a shared resource between them (whether it be a shared memory or some other device that they both need to have access to), now, all of a sudden, I have some dependency there. That dependency, if it’s a hardware resource, inevitably involves a timing dependency, as well. Certainly you can engineer things such that you can handshake and make sure that if one core is using a resource, the other one can’t.”

The old approach also may require over-engineering for multicore designs, which limits performance, as well as efficiencies and other benefits that come with multicore designs.

“Those are the places where the old tools and the adaptations that we’ve made are challenged,” Melling said. “Even on single-core designs with multiple masters in hardware, we’ve always faced problems of having to debug memory corruption types of issues where they both have access to some sort of shared memory, and the memory location that you need to do one thing gets corrupted and overwritten by somebody else. Trying to find and debug those can be really difficult because basically it’s a needle in a haystack. There are millions or billions of instructions being executed, and when it happened is anybody’s guess.”
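When a write trace is available, the needle-in-a-haystack search can be narrowed mechanically. The following sketch assumes a trace of write records sorted by cycle; the Write record, the master names, and both helper functions are hypothetical and are not taken from any tool mentioned here.

```python
from typing import List, NamedTuple, Optional


class Write(NamedTuple):
    cycle: int
    master: str
    addr: int
    data: int


def last_writer(trace: List[Write], addr: int, before_cycle: int) -> Optional[Write]:
    """Most recent write to addr strictly before before_cycle (trace sorted by cycle)."""
    for w in reversed(trace):
        if w.addr == addr and w.cycle < before_cycle:
            return w
    return None


def foreign_writes(trace: List[Write], lo: int, hi: int, owner: str) -> List[Write]:
    """All writes into [lo, hi) that did not come from the region's expected owner."""
    return [w for w in trace if lo <= w.addr < hi and w.master != owner]


if __name__ == "__main__":
    trace = [
        Write(100, "cpu0", 0x8000, 0x11),
        Write(250, "dma", 0x8000, 0xFF),   # the unexpected overwrite
        Write(300, "cpu1", 0x9000, 0x22),
    ]
    # cpu0 reads a bad value at cycle 400: who touched the location last?
    print(last_writer(trace, 0x8000, before_cycle=400))
    print(foreign_writes(trace, 0x8000, 0x8100, owner="cpu0"))
```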

Fig. 1: Virtual and hybrid platforms to improve software speed. Source: Cadence

What exactly is the best approach is often a matter of preference.

“Some people will do a live debug of their design, where they’ll either build an FPGA prototype and put it on the bench, or they’ll run a prototyping system, they’ll use an emulator, or they’ll use a simulator,” said Alex Wakefield, applications engineering scientist at Synopsys. “Also, different users run different amounts of software. Some are just running bare metal boot software or things that don’t run an operating system. Those test runs are typically relatively short, and that means it can be done with the simulator. Other users go all the way to booting Android on their device before that device is manufactured.”

Heterogeneous multicore
This gets more complicated when cores are not identical. Toolsets need to be able to handle multiple cores at the same time, as well as multiple cores with different architectures.

“There may be a cluster of Arm cores doing packet processing, or there may be a neural network type core that is doing image recognition,” Wakefield said. “There also may be a power controller core, and some other application cores — such as x86s, or a PowerPC, or another type of heavy-compute architecture, or a larger Arm core — doing an application piece. They’re all running at the same time, and all interacting with each other. If the tool focuses only on one particular core at a time, you’ll have a tough time trying to say one core sent a message to another core, but then something happened and it didn’t get an answer back. If you can only debug one at a time, you can’t see the interaction between the cores. The tool needs to handle multiple cores at the same time that have shared memory spaces or separate memory spaces so it is possible to debug all the cores, and they all step together in synchronization. Typically, live debug won’t work very well for this.”

Frequently, a technical limitation crops up with real hardware cores. “You have to stop them, and just because of how this whole thing works, it’s impossible to stop everything at the same time,” said Kai Schuetz, senior staff R&D engineer at Synopsys. “You stop one core, then you stop another core. Then you run into the same problem again when you start the cores because you cannot start them at the same time. Basically, you will lose precision since the debugging interacts. It’s like you’re making measurements on the system, and by doing that measurement, you can mess it up and bring the system out of sync.”

It’s an intrusive process, and live debug breaks the sequence of events. So a test may give a false sense of security that everything is working, or it may produce a different behavior than if the debugger was not used.

Virtual modeling is less intrusive, but not always as effective. “Right now we’re seeing a lot of interest in virtual modeling for RISC-V, and people are catching a lot of bugs with it,” said Louie De Luna, director of marketing at Aldec. “We’re also seeing people using multi-core debug with QEMU, which is a free, open-source emulator, and they’re using that same environment for debugging software and hardware. It’s an interesting idea, but it’s not quite there yet.”

One technique that helps to reach failure points quickly is co-simulation, which involves executing several instructions on a model and design at the same time and comparing the results obtained from them. The checking involves the registers, the memory accessed by the instruction, any control registers, etc. For a multicore design, the idea is the same, except that an instruction will be executed on each core.
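The comparison step can be illustrated with a small sketch. It assumes both the reference model and the design emit one record per retired instruction per core; the RetiredInstr record and the function names are hypothetical, and a real flow would hook into the simulator rather than compare lists after the fact.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RetiredInstr:
    pc: int
    regs: Dict[str, int]                                      # registers written
    mem_writes: Dict[int, int] = field(default_factory=dict)  # addr -> data


def first_mismatch(model: List[RetiredInstr], design: List[RetiredInstr]) -> Optional[Tuple]:
    """Return (index, field, model_value, design_value) for the first divergence, else None."""
    for i, (m, d) in enumerate(zip(model, design)):
        for name in ("pc", "regs", "mem_writes"):
            if getattr(m, name) != getattr(d, name):
                return i, name, getattr(m, name), getattr(d, name)
    if len(model) != len(design):
        return min(len(model), len(design)), "length", len(model), len(design)
    return None


def compare_cores(model_traces: Dict[int, List[RetiredInstr]],
                  design_traces: Dict[int, List[RetiredInstr]]) -> Dict[int, Tuple]:
    """Run the step-and-compare check per core; report only cores that diverge."""
    report = {}
    for core in sorted(model_traces):
        mismatch = first_mismatch(model_traces[core], design_traces.get(core, []))
        if mismatch is not None:
            report[core] = mismatch
    return report


if __name__ == "__main__":
    model = {0: [RetiredInstr(0x100, {"x1": 5}), RetiredInstr(0x104, {"x2": 7})],
             1: [RetiredInstr(0x200, {"x1": 1})]}
    design = {0: [RetiredInstr(0x100, {"x1": 5}), RetiredInstr(0x104, {"x2": 9})],
              1: [RetiredInstr(0x200, {"x1": 1})]}
    print(compare_cores(model, design))
```

Stopping at the first divergence is what lets this style of checking reach the failure point quickly: the mismatch is reported at the instruction that went wrong rather than at the end of the test.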

That introduces some potential problems. “Any memory that is ‘true’ shared by the cores cannot be checked, because the model and the design might yield different values,” Valtrix’s Thiruvathodi said. “In many co-simulation verification environments, true shared memories are not checked. Co-simulation is a very powerful way to debug multicore designs, as you will be able to reach the point of failure very quickly. But because it cannot check true shared values, we have to resort to the tricks mentioned above.”
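One way to picture the exclusion of true shared memory is a memory comparison that skips configured address ranges. This is a sketch under assumptions: the compare_memory name, the dictionary memory images, and the shared range are invented for illustration.

```python
def compare_memory(model_mem, design_mem, shared_ranges):
    """Compare two {addr: value} images, skipping addresses in true-shared ranges.

    shared_ranges is a list of (lo, hi) half-open address ranges whose final
    value legitimately depends on the order in which cores accessed them.
    """
    def is_shared(addr):
        return any(lo <= addr < hi for lo, hi in shared_ranges)

    mismatches = {}
    for addr in sorted(set(model_mem) | set(design_mem)):
        if is_shared(addr):
            continue  # ordering-dependent: the model cannot predict the winner
        m, d = model_mem.get(addr), design_mem.get(addr)
        if m != d:
            mismatches[addr] = (m, d)
    return mismatches


if __name__ == "__main__":
    model = {0x1000: 1, 0x2000: 0xAA, 0x3000: 7}
    design = {0x1000: 1, 0x2000: 0xBB, 0x3000: 8}
    # Assume 0x2000-0x20FF is written by more than one core.
    print(compare_memory(model, design, shared_ranges=[(0x2000, 0x2100)]))
```

The skipped range is exactly where the checker goes blind, which is why stimulus tricks such as unique per-core store values are still needed there.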

And it becomes particularly difficult where software workloads are huge. “There are lots and lots of cycles,” said Cadence’s Melling. “There are virtualization technologies such as virtual platforms, and instruction-set simulators that allow for software execution at even higher rates of performance than the CPU designs running the software on top of them. Running at the speed controlled by that hardware, you can use these abstractions of virtualization to execute that faster, and because you design your software to be independent of the hardware timing, except where there’s handshaking and things that have to occur to synchronize, you can basically run those independently. We’ve been doing hybrids with customers that are the combination of the virtual world and the hardware world, and seeing 40X to 50X improvements on things like OS boot. That’s the other side of co-simulation that is occurring, and it’s about jumping up another level of performance — being able to do more software, and more validation of that in a full system, pre-silicon environment.”

Co-simulation also can involve instruction accurate models at different abstraction levels. “The best example of this is a RISC-V processor verification method, based on comparing the processor hardware RTL functionality against a RISC-V reference model with instruction-level analysis for a step-and-compare methodology,” Davidmann said. “Running a combination of random, directed, and architectural test case scenarios as instruction streams (software) on the target processor (hardware) is supported by closely integrating the Imperas simulator with the RTL test bench in a SystemVerilog simulation. This leads directly into a debug arrangement to interactively investigate the same code running on the RTL and the reference model within a single-tool environment.”

Finding bugs faster
One of the big challenges is to get to bugs faster and earlier in the design process.

Synopsys’ Wakefield noted that formal verification can help, and has seen customers successfully use formal to find some problems, with certain types of designs better suited for formal. “When software is sitting on top with the hardware, that problem gets pretty big, and it’s probably too hard for formal. At least today, formal would need to make a big leap forward to be able to handle that.”

Still, formal has an important role to play. “Some big challenges with multicore designs involve performance analysis, and the memory subsystem and fabric,” said Nicolae Tusinschi, product specialist at OneSpin Solutions. “Debugging an SoC with a single core is complex enough, and things get exponentially more complex as the number of cores increases. During this type of analysis, it is highly inefficient to deal with design issues that could have been detected in earlier phases. This is one of the reasons using a low-quality open-source core might not be the best choice, even when building a demonstrator. RTL bugs could slow down the project or even mask performance issues. The first step is to have high-quality core and IP models and set up an automated flow to formally prove that IPs are wired up correctly. In many cases, it is also beneficial to do a formal verification of the fabric and memory subsystem to rule out deadlock and livelock conditions that are often missed in simulation,” he added.
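The reason exhaustive analysis catches deadlocks that simulation misses can be shown with a toy example. The sketch below is not how commercial formal tools work on RTL; it is a small explicit-state search over every interleaving of two hypothetical masters that each acquire two resources, with all names and the program encoding invented for illustration.

```python
from collections import deque


def find_deadlock(programs, num_resources):
    """Breadth-first search over all interleavings; return a deadlocked state or None.

    programs: one list per master of ("acq", r) / ("rel", r) steps.
    A state is (program counters, resource owners). A deadlock is a state
    where some master is unfinished and no master can take a step.
    """
    start = (tuple(0 for _ in programs), tuple(None for _ in range(num_resources)))
    seen = {start}
    queue = deque([start])
    while queue:
        pcs, owners = queue.popleft()
        successors = []
        for m, prog in enumerate(programs):
            if pcs[m] == len(prog):
                continue  # this master has finished
            op, r = prog[pcs[m]]
            if op == "acq" and owners[r] is not None:
                continue  # blocked: the resource is already held
            new_owners = list(owners)
            new_owners[r] = m if op == "acq" else None
            new_pcs = list(pcs)
            new_pcs[m] += 1
            successors.append((tuple(new_pcs), tuple(new_owners)))
        all_done = all(pc == len(p) for pc, p in zip(pcs, programs))
        if not successors and not all_done:
            return pcs, owners
        for s in successors:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return None


if __name__ == "__main__":
    a = [("acq", 0), ("acq", 1), ("rel", 1), ("rel", 0)]
    b = [("acq", 1), ("acq", 0), ("rel", 0), ("rel", 1)]  # opposite order
    print("opposite order:", find_deadlock([a, b], num_resources=2))
    print("same order:    ", find_deadlock([a, a], num_resources=2))
```

A random simulation reaches the deadlock only if the two masters happen to interleave in just the wrong way, whereas the exhaustive search visits that interleaving by construction.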

No matter how it is approached, debugging multicore designs is complicated.

“At the moment you introduce one CPU in your design it is complex, then with each new CPU it is increasing the complexity,” said Olivera Stojanovic, project manager at Vtool. “There is complexity in terms of open source that you need for trace during the debug, and also to bring up the whole system. For this, you need a little bit of magic. Also, there are always a lot of problems. You need knowledge from different areas. You need to understand the flow. You need to understand C. You need to understand what to include. Then, with UVM on the other side, how will everything work together? It’s really complicated. On top of that, you need to understand the whole system. What will your test cases or use cases be? And you usually need to do all of this with the architect, which requires a lot of different knowledge, and you need a really good engineer that can handle it.”

Overall, the most complex part of verification is the interaction between processors because there are so many files involved, said Darko Tomusilovic, director of verification at Vtool. “There are so many different things you need to do in order to run even the simplest software scenario. It always took me a month just to run, ‘Hello, World,’ on a new processor. Simply put, it’s complex in its own nature. You need to know a bit of software, a bit of hardware, a bit of compilation flow, you need to know a lot of different aspects.”

Best practices
Despite the hurdles, there are some best practices for debugging multicore designs.

Melling, for example, advises hardware teams to get closer to the software teams. “Engineers doing hardware need to start to develop a sense of what kind of software and software architectures are going on top of that hardware, along with the implications of all that. If you design software-aware, you have the best chance for success.”

Still, the fundamental assumption is that taking a single-core debug system and extending it to multiple cores is fraught with danger.

“You may be able to extend a uni-core approach to a small number of cores but modern systems have many, many cores,” said Mentor’s Panesar. “We have customers who are deploying hundreds and sometimes thousands of cores, and this is not uncommon. It’s also worth making the distinction between homogeneous and heterogeneous multicore. The challenges are slightly different dealing with multiple core architectures, as opposed to multiple identical cores, or at least cores from the same vendor,” he explained.

Panesar noted some guiding principles for debugging multicore designs. First, it’s always important that debug is considered at the architecture stage. That requires an understanding of what structures and design enhancements need to be put in place to facilitate debug.

The second principle is to discard the processor-centric view of debug. Most bugs these days are down to system-level interactions. You won’t spot them by performing multiple uni-processor debug sessions and hoping the system interactions will take care of themselves. It requires a system-level view. In addition, processors aren’t the only complex structures on today’s chips. The behavior of NoCs and on-chip buses, custom logic, memory controllers and so on can have a significant influence on system behavior — and if you can take steps to observe and analyze their operation, not just CPU code execution, you will not only spot more bugs, but it will also become dramatically easier to find the root cause and fix them.

In general, the debug process needs to be viewed as an opportunity to make the product better. Not all bugs are show-stoppers, and making the product as good as it can be is just as valuable as fixing something that’s broken.

What multicore debugging solutions will eventually evolve to look like is yet to be determined. Some believe conventional debug will move to system-level analytics on-chip with local smarts, with the external host left to visualize and handle higher-level correlation.

But no matter how it ultimately gets resolved, change is required. Software debug and bring-up takes about 50% to 75% of the total development time of an SoC, based on research that was carried out in the days when multicore designs were relatively few.

“The term, ‘multi’ itself signified relatively few cores in each design,” Panesar said. “This figure certainly will not reduce as we move to ‘real’ multicore designs, so it’s a problem that we need to take very, very seriously.”
