Using Processor Trace At The System Level

More on-chip interactions and design heterogeneity are rekindling interest in a well-worn technology.


The race to process more data faster using less power is creating a series of debug challenges at the system level, where developers need to be able to trace interactions across multiple and often heterogeneous processing elements that may function independently of each other.

In general, trace is a hardware debug feature that allows the run-time behavior of IP to be monitored. More specifically, processor-trace functionality is a hardware real-time monitor that non-intrusively captures events in the CPU and sends them to an external device, where they are saved and ultimately reconstructed into human-readable form.

“This provides valuable information that can be used to debug the CPU,” said Steven Yeung, vice president of CPU hardware engineering at Imagination Technologies. “Since there could be large amounts of data captured, compression is typically implemented in the CPU hardware, and thus does require system level components and software to decompress and reconstruct the data.”

Trace technology isn’t new. It has been in use since the late 1980s, when the microprocessor industry really began ramping. The problem is that it has not kept pace with the ongoing explosion of customized processors being developed for the cloud, edge and AI-related design. In fact, when it was first introduced, processor debug was known as static debug.

“You would use JTAG or you would go in, run the system, then stop and analyze the state of the processor, then run again,” said Rich Collins, product marketing manager for IP subsystems at Synopsys. “That worked, in general, for debugging bigger issues. But the whole idea of trace came about because people wanted to be able to see in real-time what was happening in the system, because when you stop on something like an instruction boundary, oftentimes you miss something that might happen if the processor was just running. That’s what started the whole idea of doing real-time trace, where you could get data off of the chip while the processor was running, and then use tools to analyze it in real-time to understand what’s happening in the system. Within that, people wanted to see tracing instructions that run in the processor, but they also wanted to see data transactions and the ID of the task that happened to be running at that time, along with the program counter. All those sorts of things they wanted to be able to see inside the processor in real-time.”

Given that a sizable portion of the time and effort in software development is spent on failure debug, it’s easy to see why processor trace functionality is now so valuable. “As the systems become more complex, the time spent on debug also scales exponentially,” said Shubhodeep Roy Choudhury, CEO of Valtrix Systems. “Processor trace functionality, in very simple words, provides visibility of program execution. It enables the users to determine the exact set of instructions executed by the processor, which can then be analyzed to get to the root of the failure.”

Technology for the algorithms and the features mainly came out of single-processor systems and systems with a small number of cores. “Moving into the 21st century, where there may be many-core systems, as well as some completely bonkers multiple-core systems, the concept of having processor trace is still attractive, but requires a shift in the algorithm to make it efficient in such systems,” said Gajinder Panesar, CTO of UltraSoC. “Having made that shift you can then work backwards to single core systems or 2 core systems or 8 core systems, but if you don’t address the performance and the bandwidth issue head on, you’re asking for trouble.”

Standards needed
The IEEE Nexus 5001 real-time trace standard was introduced to add some consistency into this process. One of the keys in this space is to make sure that the tool vendors have a standard interface that they can talk to, so they’re not trying to build custom tools for every single processor implementation, Synopsys’ Collins explained.

Other prominent processor providers developed their own trace capabilities. Arm, for instance, has CoreSight, which has become a de facto standard given the proliferation of Arm cores.

“Synopsys’ ARC processor was originally and continues to be a Nexus 5001-supported architecture, and generates trace according to that IEEE Nexus standard, but also supports the CoreSight interface, as well,” Collins said. “This is so that if a tool vendor wants to have an ARC core in a system with an Arm core, they can still support that with our trace and combine that with the Arm trace, because that’s what’s happening more often. Processor trace is getting extended to system trace or SoC-level trace.”

This is particularly useful for software development. “When trying to debug devices, SoC developers, chip suppliers, but mostly software developers, want to be able to see the entire system view because these devices are becoming super complex, with maybe a handful or even dozens of processors on a single die,” he noted. “There’s so much going on at the same time that they want to be able to get a picture of how things all play together — how different processors play together, system level functions, memory accesses, all those sorts of things. They want to be able to see it, and then get that data dumped off chip and be able to view it and understand how everything is playing together.”

How this looks inside different processors is specific to each one, Collins stressed. “It’s how we actually get information out for an ARC core, versus somebody’s RISC-V implementation versus an Arm core that’s proprietary or specific to that core. The key is to have it as it gets put together into a message that a tool on the outside of the chip can understand and interpret. Standardization is important so that an instruction trace message coming from a RISC-V core or an ARC core looks the same to a company like Lauterbach, for instance, that builds trace tools.”

Also, processor trace provides a detailed history of program execution, which is useful for debugging and performance optimization, and often has no performance impact on the program being traced, said Tim Whitfield, vice president of strategy, Automotive & IoT Line of Business at Arm.

As an example, trace can be used in performance profiling tools such as AutoFDO, which uses processor trace captured in main memory so that traces can be analyzed on the device without interrupting program execution too frequently. In small microcontrollers, where main memory is limited, export off-chip is often preferable.

“For Arm, processor trace captures many different pieces of information, such as basic program flow execution, cycle counts for detailed performance analysis, global timestamps for correlating program execution across multiple processors and debugging software coherency problems, and data trace in some processors to allow reconstruction of variables and memory contents,” Whitfield said. “Further, Arm’s CoreSight debug and trace technology provides extensive options for trace capture. Depending on the device, trace can be captured in many different ways, from being exported off-chip, to captured in the main memory, to captured in dedicated memory on the device. Regarding exporting off-chip, this is the method used to capture to boxes with large amounts of storage for extended periods of tracing. This is mostly used in early development of devices in the lab, as well as when debugging devices that have failed in the field.”
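To make the role of global timestamps concrete, here is a minimal sketch of how a host-side tool might interleave per-core trace streams into a single timeline. The `TraceEvent` record, the core names, and the event strings are hypothetical and are not taken from CoreSight or any other trace format. The sketch assumes each core's stream is already in time order and that all cores share one timestamp source.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class TraceEvent(NamedTuple):
    timestamp: int  # global timestamp, assumed common to every core
    core: str       # hypothetical core name
    event: str      # free-form description of what was traced


def merge_traces(*streams: Iterable[TraceEvent]) -> Iterator[TraceEvent]:
    """Interleave per-core streams (each already time-ordered) into one timeline."""
    return heapq.merge(*streams, key=lambda e: e.timestamp)


# Two made-up per-core streams.
core0 = [TraceEvent(100, "core0", "write flag"), TraceEvent(250, "core0", "send IPI")]
core1 = [TraceEvent(120, "core1", "read flag"), TraceEvent(260, "core1", "enter ISR")]

timeline = list(merge_traces(core0, core1))
for ev in timeline:
    print(f"{ev.timestamp:>6} {ev.core}: {ev.event}")
```

Once the events sit on one timeline, an ordering question such as "did core1 read the flag before or after core0 wrote it?" becomes a matter of reading the merged list.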

On-chip methodology includes capturing in the main memory, which helps re-purpose some RAM to act as local storage on the device. This means it does not need off-chip capture boxes, and also enables analysis of the trace on the device, which is useful for ‘on-the-fly’ performance optimization in the field. The other on-chip method is capturing in dedicated memory on the device, where the use of main memory is undesirable for the given application, such as with safety-critical or real-time system use-cases. Capturing the trace in dedicated memory allows non-intrusive capture and is useful for post-mortem debugging, Whitfield said.
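One way to picture post-mortem capture in a small dedicated memory is as a circular buffer that always holds the most recent records. That is an assumption made for illustration, not something Whitfield describes. The `TraceBuffer` class and the program-counter values below are hypothetical.

```python
from collections import deque
from typing import List


class TraceBuffer:
    """Fixed-size buffer that keeps only the most recent trace records."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently discards the oldest entry when full.
        self._records = deque(maxlen=capacity)

    def record(self, pc: int) -> None:
        self._records.append(pc)

    def dump(self) -> List[int]:
        """Return the surviving records, oldest first (what a post-mortem tool would read)."""
        return list(self._records)


buf = TraceBuffer(capacity=4)
for pc in range(0x1000, 0x1020, 4):  # eight made-up program-counter values
    buf.record(pc)

last_pcs = buf.dump()
print([hex(pc) for pc in last_pcs])
```

After a failure, the buffer holds only the last few records leading up to it. The trade-off is depth: a small buffer shows the final steps but not how the system got there.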

The method used for processor trace functionality will ultimately depend on the type of device or application you are working with, he added.

Debug as a process, not an individual step
There are a number of tools engineers use to debug complex processors. None by itself is sufficient, and many play critical but different roles in increasingly complex design debug.

“Processor tracing allows recording of the processor running, which is very useful in debugging crashes and illegal instructions,” said Shaun Giebel, director of product management at OneSpin Solutions. “However, it can take millions of cycles to hit the problem, and debugging such a long trace is challenging. How do you start tracing back to find the source of the problem? How long could it take? Formal verification can help. You specify a property to describe the state/failure that is the symptom of the bug, and then formal engines generate a ‘counter-example’ trace showing how the failure can occur. This will be the minimal possible path back to the source bug, likely many magnitudes shorter than the original processor trace. This approach is used in pre-silicon verification and in debugging errors found during post-silicon validation in the bring-up lab.”
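The idea of a minimal counter-example can be illustrated with a toy, using plain breadth-first search over an explicit state space. This is a far simpler technique than the formal engines Giebel describes, swapped in here only to show why the shortest path to a failure can be much shorter than a recorded trace. The buggy FIFO model, its `DEPTH`, and the operation names are all hypothetical.

```python
from collections import deque
from typing import List, Optional

DEPTH = 3                      # hypothetical FIFO depth
OPS = ("idle", "push", "pop")  # inputs the environment may apply each cycle


def step(count: int, op: str) -> int:
    """Toy design under test: a FIFO occupancy counter with a deliberate bug."""
    if op == "push":
        return count + 1       # bug: no check for a full FIFO
    if op == "pop":
        return max(count - 1, 0)
    return count


def violates(count: int) -> bool:
    """The property being checked: occupancy must never exceed DEPTH."""
    return count > DEPTH


def shortest_counterexample(start: int = 0) -> Optional[List[str]]:
    """Breadth-first search, so the first failing path found is a shortest one."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if violates(state):
            return path
        for op in OPS:
            nxt = step(state, op)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [op]))
    return None


trace = shortest_counterexample()
print(trace)
```

A recorded trace of the same failure might contain thousands of pushes, pops, and idle cycles before the overflow. The search returns only the handful of steps that matter.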

Much of this needs to be at least system-aware, if not entirely a system-level process. Valtrix’s Choudhury said the entire setup of trace functionality requires a processor with a trace interface/module that can generate packets of information and store them into memory. A software decoder is also required to decode the packets and reconstruct the program execution.
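The decoder half of that setup can be sketched in a few lines. The packet format here is invented for illustration and does not follow Nexus, CoreSight, or the RISC-V trace specification. It assumes the hardware reports only what cannot be inferred from the program image (the start address, conditional branch outcomes, and indirect jump targets), and that the decoder has a copy of the program. `PROGRAM`, `PACKETS`, and the packet kinds are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical static program image: address -> (mnemonic, kind, direct target).
# kind is "seq", "cond" (conditional branch), "indirect" or "halt".
PROGRAM: Dict[int, Tuple[str, str, int]] = {
    0x00: ("li   a0, 2", "seq", 0),
    0x04: ("addi a0, a0, -1", "seq", 0),
    0x08: ("bnez a0, 0x04", "cond", 0x04),
    0x0C: ("jr   ra", "indirect", 0),
    0x20: ("ebreak", "halt", 0),
}

# Hypothetical packet stream as it might arrive from the trace module.
PACKETS = [("start", 0x00), ("branch", True), ("branch", False), ("target", 0x20)]


def expect(packets: Iterator[Tuple[str, int]], kind: str) -> int:
    """Pull the next packet and check that it is the kind the decoder needs."""
    got_kind, value = next(packets)
    if got_kind != kind:
        raise ValueError(f"expected {kind!r} packet, got {got_kind!r}")
    return value


def decode(packet_list: List[Tuple[str, int]]) -> List[int]:
    """Replay the program image, consuming a packet only where flow is ambiguous."""
    packets = iter(packet_list)
    pc = expect(packets, "start")
    executed = []
    while True:
        _mnemonic, kind, target = PROGRAM[pc]
        executed.append(pc)
        if kind == "halt":
            return executed
        if kind == "cond":
            taken = expect(packets, "branch")
            pc = target if taken else pc + 4
        elif kind == "indirect":
            pc = expect(packets, "target")
        else:
            pc += 4  # sequential flow needs no packet at all


for addr in decode(PACKETS):
    print(f"{addr:#06x}  {PROGRAM[addr][0]}")
```

Four packets are enough to reconstruct seven executed instructions, which hints at why compressed trace needs both the packet stream and the program image on the decode side.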

But is there one methodology for all of this? Probably not.

“Think about anything at the system level,” Collins said. “System architects and design teams want to have visibility. An easy example is bus transactions. There can be fairly complex bus fabrics connecting all these different cores and logic elements, and the software developer wants to understand how transactions that go out onto the bus or onto the fabric are working, or not working in some cases, so system-level bus tracing is an important aspect. A lot of engineering teams are looking for solutions there. Also, memory and anything that accesses memory at the system level needs to be analyzed. So do more and more peripheral and high-speed I/Os, and things at the transaction level such as Ethernet packets. I’m going to want to be able to analyze that those Ethernet packets are being processed correctly or being transmitted correctly. Those are all examples of system-level features where people want to see trace.”

The challenge rises with the integration of different IP blocks. “I have seen a case where a customer had to integrate an Arm core, a RISC-V core, a Tensilica core, and a DSP core from a competitor all onto the same SoC,” said George Wall, product marketing director for Tensilica Xtensa at Cadence. “They also had to integrate non-processor IP blocks. Being able to debug the interaction of those, when things are interacting with independent tasks running on each of those cores is a tremendous challenge. Customers want the visibility they get with trace, and they want a standard because the software tools need to be able to decode that trace information and display it in a meaningful way.”

Planning ahead
Debug is an aspect of design that needs to be thought out at the architectural level. “It has to be built in from the initial specification,” Wall said. “It isn’t something that a designer easily can add in after the fact, so it really has to be planned up front. It does need to be planned out at the system level because there are resources such as memories that need to be allocated for storing the trace data. And there are I/Os that need to be specified for dumping that trace data out, as well as mechanisms made for controlling that trace. So it is a system-level design problem.”

Most designers today can recall a time when software development efforts really only started after silicon was available.

“In order to pull in software development and achieve faster product time to market, several trends emerged,” said Hagai Arbel, CEO of Vtool. “There was emulation using FPGA, which was used as another sort of verification, but mostly a software development platform, as well as emulators. Then people realized that similar scenarios were developed by the simulation testbench teams and embedded software teams, and Portable Stimulus was born. What was done in parallel by teams all over the world was trying to solve the debug problem.”

From a project perspective, there are two key steps. “First, replace the CPUs in a UVM testbench with VIPs to better control the scenarios you run,” Arbel said. “UVM teams are doing that. Second, run small pieces of software in simulation in order to validate the integration of the CPU and its subsystem. This is done, for example, to validate the boot ROM code. The combination of these is left for each team to decide, as there is no real methodology for how to split the two. Many times, they overlap. The second approach also assures good preparation for the silicon bring-up process.”

However, one of the main challenges for both the software team that uses simulation before emulation/FPGA is available, and the UVM simulation team, is debugging. The problem is that the software and the testbench need to be debugged at the same time.

“Processor trace functionality, as being addressed for good reason by RISC-V, is focused on better debugging capabilities for the software team,” Arbel said. “It is extremely important that verification teams use similar means in order for these two teams to communicate efficiently.”

Standardizing the debugging process should include visualization of the trace functionality, he said. “It’s how engineers can perceive what has happened in a given scenario of interest without diving into thousands of instructions. In addition, given traces of some software under debug, how can algorithms we run on that data provide insights to root-cause analysis of the software performance and coherency? These should be applied both in simulation and during on-board debugging.”

A key challenge for the verification/debug side of the design flow has been recognizing that the software and hardware teams are no longer as separate as in the past. These two disciplines have been increasingly dependent upon each other to improve performance and lower power consumption in complex designs, which in the past was handled by scaling of features in a new manufacturing process. But as those scaling benefits have decreased, particularly below 10nm, the solution increasingly has been new architectures and more reliance on hardware-software co-design.

“We have to develop the means of communication, i.e., technical project communication, that treats both as different aspects of the same project,” Arbel said. “Steps have been taken but very, very slowly, and people realized a long time ago that we cannot wait for the silicon in order to start a software process.”

Teams are continually reinventing how they are approaching debug. “For example, in every SoC project, the verification team is required to write and run software,” he said. “Some write software, some get it from the software team. They’re going to the software team saying, ‘Give me simple, short software that is doing this and that.’ They get it. It’s buggy by nature, and then they start a long correspondence cycle between the two of them. But the main time-consumer is the fact that they don’t understand each other. And when I talk about debug, I refer to debug on a wider aspect of communication, because when you debug something as an engineer and it’s code that you wrote, and you run some simulation and you debug a bit of hardware or software, you can go back to your code, your tracing. Today debug is seldom that. Debug is communicating with other people, other parts, and the whole process is becoming about communicating problems, communicating assumptions as for root cause, and trying to figure out together what is the shortest path in order to understand that and fix that.”

Trace has been around for a long time. As with anything, technology needs to evolve.

“If you look at some of the processor designs out there today where there is a strange system glitch, and the system goes into production, then suddenly there is excessive latency or some graphics glitches,” said Neil Hand, director of marketing, design verification technology at Mentor, a Siemens Business. “How do you track them down? The ability to do this becomes more important when you have something like RISC-V, which allows more degrees of freedom not specifically because it’s open source, but because you’re encouraging people to enhance it. You’re encouraging people to do new and interesting things in there. As a result, you’ve got to do more testing. One of the big challenges of RISC-V is that as soon as you change anything inside a processor, you have to revalidate that whole processor. And processor validation is a skill unto itself.”

Companies like Arm and Synopsys have large teams of experts for this. “They know exactly how to go about verifying that processor,” Hand observed. “A lot of the RISC-V people don’t. And yet once they change a single thing in that processor, it needs to be revalidated. And that doesn’t just go for the functionality. It goes for performance, as well. Let’s say you’ve added a memory fetch mode. What’s that going to do now to the whole bus and system level performance? This trace capability, regardless of whether it’s in software with system-level analysis, or whether it’s in hardware with these trace capabilities — it becomes even more critical than it was in the past.”

Along with that, some kind of standardization is essential.

“It’s not possible for tool vendors to build one-offs on every single device that gets created, so having some sort of standardization, with the ability to customize, that’s going to continue to be important,” said Collins. “Also, heterogeneous tracing is going to be important because there is going to be a mix of different instruction set architectures that are going to be on devices. It’s not realistic that any one architecture is going to dominate a device that has, say, 18 processors on it. It’s likely that they’re going to be different processors and different kinds of processors, such as convolutional neural network engines. They’re kind of a different animal when it comes to processing, but people are going to want to be able to trace transactions there as well.”

The bottom line is that as more processing gets pushed onto a single device, the need for tracing increases. There are more interactions, more dependencies, and far greater complexity. Being able to trace all of that is critical, but also far more difficult.
