Tools For Heterogeneous System Development

Final in a series: The amount of software that interacts with hardware is increasing, and no longer can applications ignore the execution platform. What is EDA doing to help?


System architects look to both heterogeneous and homogeneous computing when there are no other options available, but the current thinking is that a system-level software methodology could simplify the design, ease integration of various blocks, and potentially improve performance for less power. While the theory appears sound enough, implementing it has turned out to be harder than expected.

When a design has to meet timing or power budgets, dedicated hardware becomes necessary. The problem with that approach is that it limits the design’s flexibility. Specialty processing cores are the middle ground, but the programming environments for them are disjointed, and the software industry has proven slow in developing tools and methodologies to deal with these new architectures.

Still, the amount of software that is hardware-dependent is growing. Security and power are just two of the affected areas. “Most operating systems (OSes) have hooks in them to manage power states, mostly for the CPU cores, but beyond that it is an application-level exercise to make sure you are staying within power boundaries,” says Warren Kurisu, director of product management and marketing within Mentor Graphics’ Embedded Software Division. “While the OS should be able to provide a finer level of control than the application, we have not seen that in any OS so far.”

The software industry has been developing applications and methodologies for years, but it has been resistant to change. In response, the EDA industry has developed models of the hardware with sufficient accuracy and high enough execution speed to enable software development, bring-up and debug earlier in the development flow. That allows software to have some level of impact on the hardware.

These models come in several forms. “The main difference between them is the time of availability,” says Tom De Schutter, director of product marketing for physical prototyping at Synopsys. “The virtual prototype can be available before the RTL is stable. This enables more parallelism between the two teams. From the moment you get stable RTL, the advantage of the physical prototype is that you get more accuracy. It is more a matter of timeline rather than capability. The virtual prototype has easier debugging, so it makes it more likely that you would tackle things that would be harder to debug. But the physical prototype has access to real-world I/O, so you can run in the context of the interface, such as USB. The use cases are different even though it is the same team doing it.”

And sitting in the middle is emulation, which has a different set of tradeoffs attached to it. “One reason we have seen a rise in emulation is that semiconductor companies do not want to create untimed or loosely timed SystemC models for the virtual prototype when they have the RTL and can put it in an emulator,” says Larry Lapides, vice president of sales for Imperas. “This is a low-risk, even if low-payoff, approach to using emulation. But emulators are expensive, and it could pay for a lot of engineers developing models. But that is seen as being riskier.”

Lapides believes embedded systems companies tend not to use emulation at all. “While we see emulation being touted as hardware/software co-design, they are just a wide transom that is being used to throw things over rather than a unifying factor. A multi-million-dollar piece of hardware is not a unifying factor.”

But the industry has needs that have to be met. “Hardware teams are presented with the challenge of providing hardware execution platforms as early as possible to support the continuous integration efforts of hardware and software,” points out Frank Schirrmeister, senior group director for product management in the System & Verification Group at Cadence. “Virtual platforms, simulation, emulation and FPGAs are all required. In addition, techniques are needed to address issues that would take too long to catch in simulation.”

Virtual models
As soon as you start talking about models for virtual prototypes, a discussion starts about the appropriate level of abstraction. “There is a substantial difference between building a virtual platform model that is focused on function versus one that includes performance,” says Drew Wingard, chief technology officer at Sonics. “The vast majority of the software community is fine with function only. They would like to be able to write hardware-dependent software, without waiting for the chip to come back. The virtual prototype folks have almost done as much as could be asked of them, if only we had the models for all of the hardware.”

And that creates the second problem. “There is the dependency on the hardware team to create the virtual prototype, and that is one of the most fundamental issues,” says Synopsys’ De Schutter. “The hardware team has been used to using EDA tools, and they needed the tools to get their job done. In software, there has always been an aversion to understanding the hardware and creating prototypes for software development, so they were always dependent on another team, a CAD team, an EDA team or the hardware team to create the prototypes. That limits the availability and the notion and understanding of the capabilities available because they do not interact with it first-hand.”

IP companies, such as ARM and Imagination, ensure that models of their processors and sub-systems are available. They spend a lot of time and money on these models. In 2015, ARM bought Carbon Design Systems, one of the leaders in high-speed model creation. But the processor is only one small part of the total SoC, and most other IP providers do not currently provide abstract models that can be used for software development.

When the software team is left to do it themselves, they take a different path. “They stub the hardware out,” says Wingard. “It is not that the technology doesn’t work, but that fast enough models often don’t exist. It is a simple ROI issue. By the time the models are written, are there enough guys writing enough lines of software whose job will be sped up so that I am better off than waiting for the chip to come back, or running on an emulator? The answer often becomes that I am better off running on an emulator.”
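Stubbing can be as simple as replacing a peripheral with an object that answers register reads and records register writes. The Python sketch below assumes a hypothetical UART-like device with a STATUS register (bit 0 meaning the transmitter is ready) and a DATA register at invented offsets. The driver is written only against `read32`/`write32`, and the stub always reports ready instead of modeling any timing.

```python
# Hypothetical register map for an illustrative UART-like peripheral.
STATUS_OFFSET = 0x00  # bit 0: transmitter ready (assumed layout)
DATA_OFFSET = 0x04    # writing a byte here "transmits" it
TX_READY = 0x1


class StubUart:
    """Functional stub: no timing, no FIFO depth, always ready."""

    def __init__(self):
        self.sent = bytearray()  # everything the driver has transmitted

    def read32(self, offset):
        if offset == STATUS_OFFSET:
            return TX_READY  # the stub never makes the driver wait
        return 0

    def write32(self, offset, value):
        if offset == DATA_OFFSET:
            self.sent.append(value & 0xFF)


def uart_puts(dev, text):
    """Driver routine that only sees the register-access interface."""
    for byte in text.encode("ascii"):
        # Poll until the transmitter reports ready, then write the byte.
        while not (dev.read32(STATUS_OFFSET) & TX_READY):
            pass
        dev.write32(DATA_OFFSET, byte)


if __name__ == "__main__":
    uart = StubUart()
    uart_puts(uart, "boot ok\n")
    print(uart.sent.decode("ascii"), end="")  # prints "boot ok"
```

Because the driver depends only on the register interface, the same code could later run against a virtual prototype model or real memory-mapped hardware. What the stub cannot reveal is timing, back-pressure or error behavior, which is why teams eventually debug against a model, an emulator or the chip itself.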

Lapides sees this changing incrementally. “Models are becoming more available. In the teams with the best methodologies, they are still stubbing things out. They know that they will do hardware debug eventually and so the question is what is the ROI for building a model. It comes down to having a software verification plan that says this is best debugged on the virtual prototype, and this will be done on the hardware. So they may only do extremely software-centric debug on the virtual prototype with just a few peripherals, but in the next project they can re-use the models that exist. And reuse is a lot easier in virtual platforms because of the higher abstraction. In the next project they may add a few more models, and so the full slate of models gets built over time.”

When multiple pieces of hardware are integrated closely together, they interact, even if the software is supposedly independent. This makes software debug a multi-level problem. Hardware developments such as cache coherence actually make this problem worse. “When you have software running on CPUs and DSPs that have shared memory and shared buses, how do you know when one thing is affecting another?” asks Mentor’s Kurisu. “It is a difficult problem and it is hard to debug them. Ensuring separation, sharing devices such as I/O, access to GPUs – how do you enable communications across the system – these are the problems that the industry is facing.”

The deeper you look, the more problems that are uncovered. “Debugging of multi-threaded applications is hugely challenging, as the number of possible legitimate execution orders may be effectively infinite,” says James Aldis, architecture specialist for verification platforms within Imagination Technologies. “When the application is also running on a collection of heterogeneous processors, new problems arise. The debugger has to support all the processors and be able to present the potentially vast flux of information from them in a manner useful to the operator.”
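A quick calculation shows why the number of execution orders becomes effectively infinite. If k threads each execute a fixed sequence of n atomic steps, and only the relative order across threads varies, the number of distinct interleavings is the multinomial coefficient (kn)!/(n!)^k. The Python sketch below computes it for illustrative, assumed thread and step counts; real systems, with caches, interrupts and non-atomic operations, have more possible orders, not fewer.

```python
from math import comb


def interleavings(steps_per_thread):
    """Count the ways to interleave threads whose own step order is fixed.

    Adding a thread's n steps into a schedule that already holds
    (total - n) steps gives comb(total, n) new placements, so the
    multinomial coefficient is built up as a product of binomials.
    """
    total = 0
    count = 1
    for n in steps_per_thread:
        total += n
        count *= comb(total, n)
    return count


if __name__ == "__main__":
    print(interleavings([10, 10]))      # 184756
    print(interleavings([10, 10, 10]))  # 5550996791340
```

Two threads of only 10 steps each already admit 184,756 orders; adding a third such thread pushes the count past 5.5 trillion.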

No longer is this just a hardware-dependent software problem. “Today, a device has many layers of software stack on top of the hardware, and the approaches that need to be done to get those on board aren’t always necessarily the same,” says the director of models technology at ARM. “You don’t want to use the same methods and co-design techniques for bringing up an OS-level application as you would on the lowest-level driver, for example – very different speed and accuracy requirements. The biggest shift that I’ve seen, in the time that I’ve been doing this is that typical teams will use multiple techniques in order to get each one of these various needs addressed and to get the hardware designed to the software throughout the process.”

The industry is divided about the best way to do software debug. “One camp wants the debug environment to look exactly like it looks on the hardware using the same tools,” says Lapides. “This is nice from a continuity perspective but you end up with asynchronous debug because you have multiple heterogeneous processors and debuggers hooked up to different ones and when you set a breakpoint on one, the debugger cannot synchronously stop the other resources on the device.”

“Heterogeneous multicore simultaneous debug can be tricky,” admits Chris Jones, product marketing group director for Tensilica IP at Cadence. “Many chip designers choose industry-standard bus-based debug topologies. Many customers appreciate having each core appear as a memory-mapped peripheral rather than having to stitch cores together over a JTAG daisy-chain. JTAG remains popular within our customer base.”

Lapides puts the case for the alternative. “In the other camp are teams that have advanced their methodology and are willing to have things look different in order to be able to fully take advantage of the virtual prototype capabilities including the controllability and visibility that they provide. At the same time they get determinism, multi-processor heterogeneous debug, can synchronously break and control the whole device.”
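The synchronous break follows from the simulator owning the schedule of every core. The Python sketch below assumes a toy platform in which each core is just a list of instruction labels; the `Core` class and `run` scheduler are hypothetical, not any vendor's API. Cores are stepped round-robin in a fixed order, and breakpoints are checked before each instruction, so a hit on one core freezes all of them at the same point, identically on every run.

```python
class Core:
    """Toy core model: executes a fixed list of instruction labels in order."""

    def __init__(self, name, program):
        self.name = name
        self.program = program
        self.pc = 0  # index of the next instruction to execute

    def done(self):
        return self.pc >= len(self.program)

    def next_instr(self):
        return self.program[self.pc]

    def step(self):
        self.pc += 1


def run(cores, breakpoints, quantum=1):
    """Deterministic round-robin scheduler for a toy virtual platform.

    Each core executes up to `quantum` instructions per turn, always in the
    same order. Before an instruction executes, it is checked against the
    set of (core_name, instruction) breakpoints; on a hit the scheduler
    returns at once, so no other core advances any further.
    """
    while not all(c.done() for c in cores):
        for core in cores:
            for _ in range(quantum):
                if core.done():
                    break
                if (core.name, core.next_instr()) in breakpoints:
                    # Whole-system stop: report where every core is.
                    return core.name, {c.name: c.pc for c in cores}
                core.step()
    return None, {c.name: c.pc for c in cores}


if __name__ == "__main__":
    cpu = Core("cpu", ["init", "send_msg", "wait", "read_result"])
    dsp = Core("dsp", ["init", "poll", "filter", "post_result"])
    print(*run([cpu, dsp], {("dsp", "filter")}))  # dsp {'cpu': 3, 'dsp': 2}
```

With independent debuggers attached to real silicon, there is no single scheduler like this, so the other cores are not guaranteed to stop at the same moment, which is the asynchronous behavior described for the first camp.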

No matter which approach they try, some issues remain constant. “You can’t abstract the heterogeneity out of debug,” says Aldis. “The user is going to be confronted by GPU- or DSP-specific feedback; is going to need to understand when groups of threads advance in lockstep or are independent, will need to know about memory scope visibility and synchronization. The industry is still a long way from generic software engineers being able to debug effectively on complex heterogeneous, multi-vendor platforms.”

And the problems are growing beyond just the functional aspects. “Can the user check what bandwidth the GPU is actually getting during the critical processing phase?” asks Aldis. “Can they see the latency from dispatch of a task to the task code actually starting? Can they identify races between cache maintenance operations and start of processing tasks? Achieving the gains requires analysis and debug; profiling, measurement and optimization.”
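Measuring the latency from task dispatch to task start is straightforward once the platform can emit timestamped events. The Python sketch below assumes a hypothetical trace of (timestamp in ns, event kind, task id) tuples; a real tool would obtain such events from trace hardware or a virtual prototype, in its own format.

```python
def dispatch_latencies(events):
    """Pair each task's 'dispatch' event with its 'start' event.

    `events` is an iterable of (timestamp_ns, kind, task_id) tuples in any
    order (a hypothetical trace format). Returns {task_id: start - dispatch}
    for every task that has both events; tasks never started are skipped.
    """
    dispatched = {}
    started = {}
    for ts, kind, task in events:
        if kind == "dispatch":
            dispatched[task] = ts
        elif kind == "start":
            started[task] = ts
    return {t: started[t] - dispatched[t] for t in dispatched if t in started}


if __name__ == "__main__":
    trace = [
        (1000, "dispatch", "fft0"),
        (1450, "start", "fft0"),
        (1200, "dispatch", "fft1"),
        (2900, "start", "fft1"),
        (3000, "dispatch", "fft2"),  # never started, so excluded
    ]
    lat = dispatch_latencies(trace)
    print(lat)                    # {'fft0': 450, 'fft1': 1700}
    print(max(lat, key=lat.get))  # fft1, the worst-case task
```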

New approaches
Some aspects of the system need to be optimized during the development of the actual silicon. “Coherency cannot be resolved in software only,” says Schirrmeister. “For this, the software and scenario-driven verification offerings of the Accellera [Portable Stimulus Working Group] allow users to specify scenarios that otherwise would be extremely difficult to create manually.”

Portable Stimulus may be one thing that pulls the two sides together. “The only way to fully test some systems is to generate parallel test cases running on multiple processors,” says the chief executive officer of Breker. “It is too hard for a human being to write such test cases. Verification tools based on Portable Stimulus tools must be able to generate parallel test cases on heterogeneous systems.”
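The following Python sketch illustrates the underlying idea of generating parallel test cases; it is not Portable Stimulus syntax or any vendor's tooling. Two hypothetical cores each perform a non-atomic increment (a read step, then a write step) on a shared counter. The generator enumerates every legal interleaving, runs each one, and flags the schedules whose result differs from the sequential expectation of 2.

```python
from itertools import combinations


def schedules(n_a, n_b):
    """Yield every interleaving of core A's n_a steps with core B's n_b steps.

    Each schedule is a tuple of core labels such as ("A", "B", "A", "B");
    each core's own steps keep their program order.
    """
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        chosen = set(a_slots)
        yield tuple("A" if i in chosen else "B" for i in range(total))


def run_increment(schedule):
    """Run a non-atomic increment per core: step 0 loads, step 1 stores."""
    memory = {"counter": 0}
    regs = {"A": 0, "B": 0}  # per-core register holding the loaded value
    step = {"A": 0, "B": 0}  # next step index for each core
    for core in schedule:
        if step[core] == 0:
            regs[core] = memory["counter"]      # load
        else:
            memory["counter"] = regs[core] + 1  # store incremented value
        step[core] += 1
    return memory["counter"]


def failing_schedules():
    """Return the schedules that lose an update (final counter != 2)."""
    return [s for s in schedules(2, 2) if run_increment(s) != 2]


if __name__ == "__main__":
    bad = failing_schedules()
    print(len(bad))  # 4 of the 6 schedules lose an update
    print(bad[0])    # ('A', 'B', 'A', 'B')
```

Even this two-core, two-step case has six schedules, four of which fail. Exhaustive enumeration stops being practical very quickly as cores and steps are added, which is why generated rather than hand-written scenarios are needed.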

Another approach is to add on-chip instrumentation to help with these problems. One example is UltraSoC, which has a library of vendor-independent debug IP.

Industries such as automotive may also bring about changes. “You could also do fault injection within a virtual prototype to see how software reacts to errors,” adds De Schutter. “We are starting to see interest in that, especially for safety-critical applications, such as automotive. But even there, the uptake is slower than you might expect given the benefit that you can get from it.”
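A minimal Python sketch of fault injection in a model, assuming a hypothetical memory model with a harness-controlled bit-flip hook and a simple 8-bit additive checksum as the software's error check (production safety software would more likely use CRCs or ECC). The harness, not the software under test, decides exactly which address and bit to corrupt, which is much harder to arrange repeatably on physical hardware.

```python
class FaultyMemory:
    """Toy memory model that can flip one bit on reads of one address.

    `inject_bit_flip` stands in for a hypothetical virtual-prototype
    fault-injection hook driven by the test harness.
    """

    def __init__(self, data):
        self.data = bytearray(data)
        self.fault = None  # (address, bit) to corrupt on read, or None

    def inject_bit_flip(self, address, bit):
        self.fault = (address, bit)

    def read(self, address):
        value = self.data[address]
        if self.fault is not None and self.fault[0] == address:
            value ^= 1 << self.fault[1]  # corrupt the value seen by software
        return value


def checksum(mem, length):
    """Software-side 8-bit additive checksum over the first `length` bytes."""
    return sum(mem.read(a) for a in range(length)) & 0xFF


def verify_block(mem, length, expected):
    """Return True if the block still matches its stored checksum."""
    return checksum(mem, length) == expected


if __name__ == "__main__":
    payload = b"calibration"
    mem = FaultyMemory(payload)
    good = checksum(mem, len(payload))
    print(verify_block(mem, len(payload), good))  # True
    mem.inject_bit_flip(address=3, bit=5)
    print(verify_block(mem, len(payload), good))  # False
```

A single-bit flip changes the 8-bit sum by a power of two below 256, so this particular check catches every single-bit fault, though not every multi-bit one.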

Security is another area that may force change. “Architectures are constantly evolving,” says Wingard. “There is a battle between allowing everyone to talk to each other, which can enable things that were not considered when designing the chip, versus not allowing anyone to talk to anyone else until something allows it. You may lock everything down to start with and then unlock when a connection is needed and required. That also requires a new set of software models.”
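The "lock everything down, then unlock" model amounts to a default-deny permission table between bus initiators and targets. The Python sketch below uses hypothetical initiator and target names; in silicon such rules would typically be enforced by interconnect firewalls configured by trusted software, and a software model would also need to capture who is allowed to change the table.

```python
class AccessFirewall:
    """Default-deny access control between bus initiators and targets.

    Nothing may talk to anything until an (initiator, target, access)
    permission is explicitly granted. All names used are hypothetical.
    """

    def __init__(self):
        self._allowed = set()  # empty at reset: everything is locked down

    def grant(self, initiator, target, access):
        self._allowed.add((initiator, target, access))

    def revoke(self, initiator, target, access):
        self._allowed.discard((initiator, target, access))

    def check(self, initiator, target, access):
        return (initiator, target, access) in self._allowed


if __name__ == "__main__":
    fw = AccessFirewall()
    print(fw.check("gpu", "crypto_keys", "read"))    # False: locked at reset
    fw.grant("cpu0", "crypto_keys", "read")          # unlock one path only
    print(fw.check("cpu0", "crypto_keys", "read"))   # True
    print(fw.check("cpu0", "crypto_keys", "write"))  # False: never granted
```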

That may also imply that formal technologies will have a larger role to play in the future, because formal is the only technology that can actually prove some of these separation requirements.

The hardware and EDA industries have been attempting to make the transition to heterogeneous multi-core easier, but there is still a long way to go before clear methodologies emerge that encapsulate hardware, software and systems views and take into account functionality, performance, power, safety and security.



Kev says:

Heterogeneous System Development is really a macroscopic version of IC design – a bunch of diverse computing elements working in parallel to perform some function. The main problem is that the RTL methodology used by digital designers doesn’t work for anything other than the digital blocks on a chip (and not well at that).

At this point you need a top-down design flow that works at all levels and understands the analog world as well as 1s & 0s.

You also want correct-by-construction approaches for complex systems rather than a “build it then verify it” approach.

That means you want to ditch a lot of the existing methodology in favor of a “software defined hardware” approach so that you can algorithmically define your entire system ahead of hardware (an “executable spec”), that can be translated (by formal methods) into actual hardware or compiled for fast execution (GP-GPUs, FPGAs).

That leads to the problem that you can’t get software guys to do parallel stuff; C++ doesn’t support heterogeneous parallel processing. However, the new crop of machines for deep learning don’t work the same way as the old ones and need new languages – neural networks look a lot like analog circuits. So I would expect some major changes.
