Heterogeneous System Challenges Grow

How to make sure different kinds of processors will work in an SoC.

As more types of processors are added into SoCs—CPUs, GPUs, DSPs and accelerators, each running a different OS—there is a growing challenge to make sure these compute elements interact properly with their neighbors.

Adding to the problem, this mix of processors and accelerators varies widely between markets and applications. In mobile there are CPUs, GPUs, video and crypto processors. In automotive, there may be additional vision processing accelerators. In networking and servers there are various packet processing and cryptography accelerators. Server applications traditionally have relied on general-purpose CPUs, but the future brings more dedicated acceleration engines, which may be customized for specific applications and may be implemented using FPGAs.

While heterogeneous processing has been in use for some time, it is getting more complex. In 1991, Intel rolled out the 80487 math co-processor to pair with its 80486SX CPU. And in 2011, ARM introduced its power-saving heterogeneous big.LITTLE architecture. In between and since then, there has been a growing mix of CPUs, GPUs and many other types of accelerators.

“It’s common, for example, to offload common tasks to a dedicated hardware accelerator, for video compression, cryptographic acceleration and the like,” said Neil Parris, senior product manager at ARM. “This costs silicon area compared to using a general-purpose processor, but it delivers a higher-efficiency and lower-power solution.”
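
In software, that offload typically follows a simple pattern: the host CPU fills in a job descriptor for a memory-mapped accelerator, kicks it off, and waits for completion rather than doing the work itself. The sketch below illustrates that pattern; the register layout, bit names and the stand-in “hardware” function are hypothetical, not any particular vendor’s interface.

```c
/* Minimal sketch of the offload pattern described above: the host CPU
 * hands a job descriptor to a dedicated accelerator and polls for
 * completion instead of doing the work itself.  The register layout,
 * bit names and the stub "hardware" below are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    volatile uint32_t ctrl;     /* bit 0: GO                         */
    volatile uint32_t status;   /* bit 0: DONE                       */
    volatile uint64_t src;      /* address of input buffer           */
    volatile uint64_t dst;      /* address of output buffer          */
    volatile uint32_t len;      /* bytes to process                  */
} accel_regs_t;

/* On real silicon this would be a fixed MMIO address; here it is an
 * ordinary struct so the example runs on a host machine. */
static accel_regs_t accel;

/* Stand-in for the hardware: in this sketch it just copies the data. */
static void fake_hardware_step(void)
{
    if (accel.ctrl & 1u) {
        memcpy((void *)(uintptr_t)accel.dst,
               (const void *)(uintptr_t)accel.src, accel.len);
        accel.status |= 1u;     /* raise DONE */
        accel.ctrl   &= ~1u;
    }
}

/* Driver-side offload: fill the descriptor, kick the engine, poll DONE. */
static void offload_job(const uint8_t *in, uint8_t *out, uint32_t len)
{
    accel.src    = (uintptr_t)in;
    accel.dst    = (uintptr_t)out;
    accel.len    = len;
    accel.status = 0;
    accel.ctrl  |= 1u;                    /* GO */
    while (!(accel.status & 1u))          /* in practice: wait for IRQ */
        fake_hardware_step();
}

int main(void)
{
    uint8_t in[16]  = "offload me";
    uint8_t out[16] = {0};
    offload_job(in, out, sizeof in);
    printf("result: %s\n", (char *)out);
    return 0;
}
```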

One key difference from the past, though, is that these are no longer independent modules in a system.

“Where we find ourselves today is that all of that complexity is scaled down to fit inside a cell phone, but the same problem persists,” said Felix Baum, senior product manager for runtime solutions in Mentor Graphics’ Embedded Software Division. “You have an SoC. You have a chip. It has three or four cores, which used to be completely separate devices and boards in the past. And you have all these things interacting with each other. One core runs a real-time operating system, another core runs Linux, a third core is there to just do encryption and offload security operations, and a GPU is there for display graphics.”

Once complicated and sophisticated software is written, there may be timing issues or contention on the bus. But because it all now fits into one chip, those problems are practically impossible to expose and visualize, Baum said.

To address this, there are a number of different frameworks for software engineers to manage the software complexity. But all of that stops at the software level. “We do have agents. We do have JTAG tools and other tools to capture the traces, to latch onto whatever little information is possible on these systems to collect and provide to users for visualizing system behavior. For example, this operating system was doing this, and that operating system was doing that, and my touchscreen is non-responsive not because of the operating system that controls it, but because of some other thing that was happening in parallel. Those are really hard problems to catch,” he explained.

In the past, software engineers used to laugh at the hardware engineers who would create models to verify and validate devices. These days, it’s a different story. Now it is the software engineers who are eager to take advantage of tools that take a hardware design and turn it into a software model of the chip or board.

Connecting processors and software
On-chip connectivity, such as cache-coherent interconnects from companies such as Arteris, NetSpeed Systems, Sonics and ARM, provides a path to connect processors and accelerators across different architectures.

It is not uncommon for different OS and RTOS instances to be running on an SoC, Parris said. A multitude of processors will interact in different ways depending on the application, including semaphores, mailboxes and interrupts. In the server and networking realm, multiple OS instances may run under virtualization, and these instances will share common hardware. In automotive and intelligent-machine applications, multiple processors and OSes may need to interact. Then, in safety-related applications, care may be taken to provide isolation between different domains on the SoC. For example, a Cortex-A processor may run a rich OS like Linux for user interface and perception, while a Cortex-R takes care of safety-critical functions, including actuators.
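
The semaphore/mailbox/interrupt style of interaction Parris describes usually boils down to a small shared-memory handshake: one core writes a message, then rings a doorbell that interrupts the other core. The sketch below models that handshake with two host threads standing in for a Cortex-A and a Cortex-R; the message format and doorbell flag are illustrative assumptions, and a real SoC would use hardware mailbox registers and interrupts rather than polling.

```c
/* Minimal sketch of a shared-memory mailbox between two processors.
 * Two POSIX threads stand in for a Cortex-A and a Cortex-R core; on a
 * real SoC the "doorbell" would be a hardware mailbox register that
 * raises an interrupt on the receiving core.  Names are illustrative. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    char       payload[64];
    atomic_int doorbell;        /* 0 = empty, 1 = message pending */
} mailbox_t;

static mailbox_t mbox = { .doorbell = 0 };

/* "Cortex-A" side: post a request and ring the doorbell. */
static void *sender(void *arg)
{
    (void)arg;
    snprintf(mbox.payload, sizeof mbox.payload, "start actuator check");
    atomic_store_explicit(&mbox.doorbell, 1, memory_order_release);
    return NULL;
}

/* "Cortex-R" side: wait for the doorbell, consume the message, ack it. */
static void *receiver(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&mbox.doorbell, memory_order_acquire) == 0)
        ;                                  /* real code: sleep until IRQ */
    printf("safety core received: %s\n", mbox.payload);
    atomic_store_explicit(&mbox.doorbell, 0, memory_order_release);
    return NULL;
}

int main(void)
{
    pthread_t a, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&a, NULL, sender, NULL);
    pthread_join(a, NULL);
    pthread_join(r, NULL);
    return 0;
}
```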

Can simulation keep up?
For many years, simulation has been at the heart of verification. But as heterogeneous systems grow ever more complex, there are questions about whether all of it still can be simulated.

Simon Davidmann, CEO of Imperas Software, believes it can be simulated, but with some caveats. “The challenge is always the interaction with its environment. How do you actually talk to the real world, because in some applications you’ve got four or five different radio systems that you are interacting with? Each one is independent, but they all need to come together as different threads in the software. It’s clearly quite a challenge. You can use simulation to a point, but when it comes close to product release, you do need to get out there and try the system in the real world.”

At the same time, Davidmann pointed out that considering only real-world data when the system is tried out isn’t good enough, either, because it’s limited to how the system works at that moment.

A common goal for many design teams is getting Linux to boot up. But as more processing complexity is added into systems, that may be just the starting point for verification, where the challenge is understanding reliability in the context of the complete system with which it interacts.

“The real world doesn’t make it very easy or even make it possible for you to test things in the corner cases,” Davidmann said. “Simulation lets you move things into the corner cases. You’ve got much better controllability and visibility, so if you’ve got a system with a few different CPUs in it—accelerators and other active components in the real hardware—it’s very hard to configure and control it with specific corner-case values. And then it is very hard to see what’s going on. Whereas in a software simulator, if the simulator is designed right, it should be very easy to put the design and the software into a certain state and then control it. When we are testing a new processor, we use all of the technology in our simulator to set the processor up into different states, and let it step through a few instructions. You don’t have to run it for half an hour to get somewhere. You can basically inject all the values you need into a certain position, set it to a certain state, and then roll it forward, stimulate the new instruction, or the new interrupt, or the strange corner case, or whatever is required. Just the fact that it works today doesn’t say that it’s going to work tomorrow in the hardware, so you need to be able to explore it.”
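
That controllability is easiest to see against a toy instruction-set simulator. The sketch below is not Imperas’ API; it is a made-up, self-contained model showing how a simulator can inject register state and a pending interrupt, step a few instructions, and check the outcome, something that is hard to arrange on real silicon.

```c
/* Toy instruction-set simulator illustrating controllability: set the
 * processor into an exact state, inject a pending interrupt, step a few
 * instructions, and check the result.  This is a hypothetical model,
 * not any vendor's simulator API. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum { OP_NOP, OP_ADDI };                 /* tiny illustrative ISA */

typedef struct { uint8_t op, rd; int32_t imm; } insn_t;

typedef struct {
    int32_t  reg[4];
    uint32_t pc;
    int      irq_pending;                 /* the corner case to exercise */
    uint32_t irq_vector;
} cpu_t;

static const insn_t program[] = {
    { OP_ADDI, 0, 5 },                    /* 0: r0 += 5                   */
    { OP_ADDI, 1, 7 },                    /* 1: r1 += 7                   */
    { OP_NOP,  0, 0 },                    /* 2: skipped if IRQ taken first */
    { OP_ADDI, 2, 1 },                    /* 3: "IRQ handler": r2 += 1    */
};

/* Execute one instruction, taking a pending interrupt first if any. */
static void step(cpu_t *c)
{
    if (c->irq_pending) {                 /* interrupt wins over next insn */
        c->irq_pending = 0;
        c->pc = c->irq_vector;
    }
    const insn_t *i = &program[c->pc++];
    if (i->op == OP_ADDI)
        c->reg[i->rd] += i->imm;
}

int main(void)
{
    /* Inject an arbitrary starting state — the "strange corner case". */
    cpu_t c = { .reg = { 100, 0, 0, 0 }, .pc = 0,
                .irq_pending = 1, .irq_vector = 3 };

    step(&c);                             /* should vector to the handler */
    assert(c.reg[2] == 1 && c.pc == 4);

    printf("corner case exercised: handler ran before mainline code\n");
    return 0;
}
```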

All that is to say he believes the only way to construct heterogeneous systems is with simulation. “Though simulation is necessary, it is not sufficient. When you get to multiple processors, SMP is hard enough, but when you get to AMP — because they are heterogeneous and they’re different, and they’re all interacting — then it becomes much more of a challenge. You need a lot of other tools and technology that comes with a virtual platform, such as assertions, functional coverage, code coverage, that allow you to see how your verification is running and keep an eye on it. You can be monitoring with assertions, and you know that you’re not breaking your rules even though it might not be visible at this point in time,” Davidmann said.
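
The assertion and coverage instrumentation he refers to can be thought of as callbacks attached to the platform model: every transaction is checked against a rule and counted into coverage bins. The sketch below is a generic illustration of that idea; the monitor_access() hook, the address map and the “secure region” rule are invented for the example, not a specific virtual-platform API.

```c
/* Generic sketch of assertion and functional-coverage monitors of the
 * kind attached to a virtual platform.  The monitor_access() hook, the
 * address map and the coverage bins are invented for illustration. */
#include <stdint.h>
#include <stdio.h>

#define SECURE_BASE  0x40000000u
#define SECURE_LIMIT 0x40010000u

enum { MASTER_CPU, MASTER_GPU, MASTER_DMA, MASTER_COUNT };

static unsigned coverage[MASTER_COUNT][2];   /* [master][read/write] bins */
static unsigned violations;

/* Called by the platform model on every bus transaction. */
static void monitor_access(int master, uint32_t addr, int is_write)
{
    coverage[master][is_write]++;            /* functional coverage */

    /* Assertion: only the CPU may write the secure region. */
    if (is_write && addr >= SECURE_BASE && addr < SECURE_LIMIT &&
        master != MASTER_CPU) {
        violations++;
        fprintf(stderr, "rule broken: master %d wrote 0x%08x\n",
                master, (unsigned)addr);
    }
}

int main(void)
{
    /* Stand-in for traffic generated while a scenario runs. */
    monitor_access(MASTER_CPU, 0x40000100u, 1);
    monitor_access(MASTER_GPU, 0x80001000u, 0);
    monitor_access(MASTER_DMA, 0x40000200u, 1);   /* violates the rule */

    printf("violations: %u\n", violations);
    for (int m = 0; m < MASTER_COUNT; m++)
        printf("master %d: %u reads, %u writes\n",
               m, coverage[m][0], coverage[m][1]);
    return 0;
}
```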

Divide and conquer falls short
Heterogeneous systems require a rethinking of old methodologies, as well. While a divide-and-conquer method may have worked in the past, it leaves much on the table.

“There are more cons than pros to a divide-and-conquer method,” said Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “A divide-and-conquer method works really well if systems are isolated, if they don’t talk to one another. Think about computing 5 or 10 years ago. It was just a bunch of different CPU clusters doing their own processing and having their own versions of the OS running on them, with very little inter-compute-engine communication. For that, divide and conquer works. Simulation for that particular subsystem works. But anytime you have complex interactions, then the divide-and-conquer method breaks down because it doesn’t expose all of the holes you could have. This is exactly why emulation became popular.”

Mohandass recalled a situation where three years of simulation work was completed on a core, then the system was booted, and in essentially 15 seconds it exhausted those three years of simulation. “That used to be the level of timescale. With emulation, we’re getting better, but we are not there yet, especially when you talk about these heterogeneous system architectures where you have CPUs, GPUs, accelerators, DSPs, everything talking. You’ve got to look at the glue that is holding all of this together.”

This is typically a coherent interconnect, and if engineering teams do this in subsystems, they miss the bigger picture and expose themselves to a lot more problems, Mohandass said. “One of these problems is deadlock. Imagine a CPU subsystem waiting for something from, let’s say, a PCIe subsystem. The PCIe subsystem has different components, and one of those internal components is in turn waiting for a response from the CPU subsystem. You essentially have this cyclical loop, one waiting for another—CPU waiting for PCIe, and PCIe waiting for CPU, so the system is essentially deadlocked. If you do this divide-and-conquer method, you’ve verified your CPU subsystem to death. Great. It works perfectly. You’ve verified your PCIe subsystem to death and it works fantastically, and is even silicon-proven. Both of them have been silicon-proven in different contexts very well. You put them together, and the chip doesn’t work. Why doesn’t it work? Because there is this big deadlock that you failed to understand and failed to analyze because you were not looking at it from a system level. That is the pitfall of a divide-and-conquer method. The divide-and-conquer method worked really well a decade ago. Not anymore.”
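
The CPU-waits-on-PCIe, PCIe-waits-on-CPU situation is a cycle in the system’s wait-for graph, which is exactly the kind of condition that only shows up when the subsystems are analyzed together. The sketch below is a minimal, generic cycle check over a hypothetical wait-for graph, not NetSpeed’s analysis flow.

```c
/* Minimal wait-for-graph cycle check illustrating the cross-subsystem
 * deadlock described above: each node waits on another, and a cycle
 * means nobody can make progress.  The graph below is hypothetical. */
#include <stdio.h>

enum { CPU, PCIE_BRIDGE, PCIE_DMA, NODES };
static const char *name[NODES] = { "CPU", "PCIe bridge", "PCIe DMA" };

/* waits_on[a] == b means subsystem a is blocked waiting on b (-1: none). */
static const int waits_on[NODES] = {
    [CPU]         = PCIE_BRIDGE,   /* CPU read stalled at the bridge   */
    [PCIE_BRIDGE] = PCIE_DMA,      /* bridge waits on its DMA engine   */
    [PCIE_DMA]    = CPU,           /* DMA waits for a CPU response     */
};

/* Follow the wait chain from 'start'; a revisit means a deadlock cycle. */
static int has_cycle(int start)
{
    int seen[NODES] = { 0 };
    for (int n = start; n >= 0; n = waits_on[n]) {
        if (seen[n])
            return 1;
        seen[n] = 1;
    }
    return 0;
}

int main(void)
{
    for (int n = 0; n < NODES; n++) {
        if (has_cycle(n)) {
            printf("deadlock: wait-for cycle through %s\n", name[n]);
            return 1;
        }
    }
    printf("no cycles in wait-for graph\n");
    return 0;
}
```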

ARM’s Parris asserted that it’s important to develop the system as a whole, including processors, interconnect, memory system and debug capabilities, to get the best idea of how it behaves.

“For simulation there needs to be a mix of full-system and divide-and-conquer approaches. To get suitable performance for software development, system designers have been using ARM Fast Models to simulate a programmer’s view of all the hardware. To get more detailed performance, designers might mix in divide and conquer, for example, starting with traffic generators in place of processors to check the interconnect and memory system characteristics, or running cycle models to measure performance. It’s also possible to mix these techniques. For example, ARM offers a ‘Swap and Play’ technology that allows you to boot an OS quickly with fast models, then swap to cycle models for full accuracy. One key takeaway is that to get real performance data, it’s critical to run real code on a real processor attached to the system rather than to try to model its traffic in an artificial way,” he said.
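
A traffic generator of the kind Parris mentions replaces each processor with a model that issues a representative stream of reads and writes so the interconnect and memory system can be measured before real software exists. The sketch below is a self-contained illustration of that idea; the address map, burst mix and latency model are invented, and, per his caveat, artificial traffic only approximates what real code on a real processor would do.

```c
/* Minimal sketch of a traffic generator standing in for a processor,
 * used to exercise a memory-system model.  The address map, burst
 * pattern and latency model here are invented for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define DRAM_BASE   0x80000000u
#define DRAM_SIZE   0x10000000u
#define BURST_BYTES 64u
#define N_REQUESTS  10000

/* Toy memory-system model: row hits are fast, row misses are slow. */
static unsigned memory_latency(uint32_t addr)
{
    static uint32_t open_row = UINT32_MAX;
    uint32_t row = addr >> 12;               /* 4 KB "rows" */
    unsigned cycles = (row == open_row) ? 20 : 60;
    open_row = row;
    return cycles;
}

int main(void)
{
    uint64_t total_cycles = 0;
    uint32_t addr = DRAM_BASE;

    srand(1);
    for (int i = 0; i < N_REQUESTS; i++) {
        /* 75% sequential bursts, 25% random jumps — a crude CPU-like mix. */
        if (rand() % 4 == 0)
            addr = DRAM_BASE +
                   (((uint32_t)rand() % DRAM_SIZE) & ~(BURST_BYTES - 1u));
        else
            addr += BURST_BYTES;
        total_cycles += memory_latency(addr);
    }

    printf("average latency: %.1f cycles/request\n",
           (double)total_cycles / N_REQUESTS);
    printf("approx bandwidth: %.2f bytes/cycle\n",
           (double)N_REQUESTS * BURST_BYTES / total_cycles);
    return 0;
}
```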

Tools vs. methodology
So what does this mean for tools vendors? Frank Schirrmeister, senior group director for product management in the System & Verification Group at Cadence, believes the whole question of divide and conquer could be reversed to, “‘Is divide and conquer the evil that prevents further tool integration?’ My simple answer to all of this was, yes. It still works because a lot of people do divide and conquer, and we really haven’t been able to find a lot of issues. There are some that can only be looked at across different OSes.”

Schirrmeister drew an analogy to work being done in Accellera’s Portable Stimulus group, which is all about generating scenarios. “This is all about the type of effects, but it’s not even at the OS-to-OS level, but within the GPU-CPU-accelerator type of domain where you assess challenges like, ‘I have a cache coherency issue. Will my caches be correct if I’m performing a certain access through the cache, and right at that time I’m experiencing a power shut-off in one of the domains in the chip?’ Those are the types of scenarios which are so complex at the SoC level – not even at the multi-SoC level – that the modeling helps to express it.”

To further illustrate the challenges going forward, Mike Thompson, senior manager of product marketing for DesignWare ARC processors at Synopsys, pointed to the company’s vision processors, which are heterogeneous, with multiple different types of execution units. He observed there are still a lot of homogeneous implementations being done, and said in many ways the challenge isn’t that different between the two structures of processor design.

“But when you get outside the processor, or if you have multiple processors, you could have multiple processors that each have multiple execution units or processors inside,” Thompson said. “Our vision processor can be configured with eight or nine different execution units inside. They can operate independently. They can be configured independently. Or they can operate together. That’s going to become more common, especially with applications like vision. But as we see a merging of, for instance, radar and vision, we are working with customers that want to fuse the information coming in from radar and vision.”

A tin can on the road can look as large as a car with radar. “With vision there are obviously differences, and you can detect that,” he said. “When you bring the two together, now you’ve got a much more complete picture. But using that data together is a challenge because you’re going to process the radar information in a wide vector DSP structure, and you’re going to need the same capability to do vision. Now you need units with somewhat different capabilities, but also a lot of similar capabilities, and you have to bring the information together. We are going to see this more and more as we move forward.”
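
The tin-can example comes down to combining two imperfect estimates: radar gives accurate range (and a return strength that can exaggerate small metal objects), while the camera gives angular extent. The sketch below is a deliberately simplified fusion step with invented numbers and thresholds; real fusion pipelines use calibrated sensor models, tracking and far richer data.

```c
/* Simplified illustration of radar/vision fusion: radar alone can make
 * a tin can look like a car (strong return), but combining radar range
 * with the camera's angular width yields a physical size estimate.
 * All sensor values and thresholds are invented for this example. */
#include <stdio.h>

#define PI 3.14159265358979

typedef struct { double range_m; double rcs_dbsm; } radar_det_t;
typedef struct { double angular_width_rad; }        vision_det_t;

/* Physical width ≈ range × angular width (small-angle approximation). */
static double fused_width_m(radar_det_t r, vision_det_t v)
{
    return r.range_m * v.angular_width_rad;
}

int main(void)
{
    /* Strong radar return at 40 m — could be a vehicle or road debris. */
    radar_det_t  radar  = { .range_m = 40.0, .rcs_dbsm = 8.0 };
    /* Camera sees the object spanning only ~0.2 degrees. */
    vision_det_t vision = { .angular_width_rad = 0.2 * PI / 180.0 };

    double width = fused_width_m(radar, vision);
    printf("estimated width: %.2f m -> %s\n", width,
           width > 1.0 ? "vehicle-sized object" : "small object/debris");
    return 0;
}
```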

Related Stories
Heterogeneous Multi-Core Headaches
Using different processors in a system makes sense for power and performance, but it’s making cache coherency much more difficult.
How Many Cores? (Part 2)
Fan-outs and 2.5D will change how cores perform and how they are used; hybrid architectures evolve.
How Many Cores? (Part 1)
Design teams are rethinking the right number and kinds of cores, how big they need to be, and how they’re organized.
CPU, GPU, Or FPGA?
Need a low-power device design? What type of processor should you choose?


