Just adding more cores doesn’t guarantee better performance or lower power.
It’s one thing to pack multiple processor cores into a design, but it is much more difficult to ensure the hardware matches the software’s requirements, or that the software optimally uses the hardware. Both the hardware and software teams are now facing these issues, and there are few tools to help them fully understand the problems or to provide solutions.
Design teams continue to add more complexity, and they expect software to optimize the power/performance tradeoffs. But whose problem is this, and how will it be solved? The answers aren’t clear at this point.
Adding more cores can make the specs look good, but that approach rarely leads to the right solution. “It is an easy solution to just throw in more processors but there is difficulty harnessing all of the processing power from a software point of view,” says Tom de Schutter, director of product marketing for physical prototyping at Synopsys. “That is because of a lack of tools and methodologies that could help with the issues related to parallel processing. The SoC team is still trying to find out the best ways to work together and find a way to allow everything in the system to get used.”
Anush Mohandass, vice president of marketing at NetSpeed Systems, agrees. "You cannot just say, 'I have this great architecture that has a thousand processors,' and expect software teams to restructure their code to take advantage of it. That is not the way to do it. You need some common APIs, and from there you have to drive software adoption."
Most legacy software is sequential code. “The typical solution has been to say that you can’t get there from here just by writing sequential code and growing into those applications,” says Kumar Venkatramani, vice president of business development at Silexica. “You have to start doing some kind of parallel capture and specification. Even with that, the challenge becomes how to take multiple tasks and map them onto multicore platforms. This is not a simple scheduling problem and you can’t just approach it with round-robin scheduling. You have to put more intelligence into the mapping.”
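One way to see why round-robin scheduling falls short is a minimal sketch of a smarter mapper: place each task on whichever core finishes it earliest, given per-core speed factors and the work already queued there. The platform (one fast core, two slow cores) and all names here are invented for illustration; production mapping tools weigh far more factors, such as data movement and memory placement.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical platform: one fast core and two slower DSP-like cores.
// kSpeed[i] = relative throughput of core i (work units per time unit).
constexpr std::array<double, 3> kSpeed = {2.0, 1.0, 1.0};

// Greedy list scheduling: put each task on the core where its estimated
// finish time is earliest, accounting for work already queued there.
// Returns the chosen core index for each task.
std::vector<int> MapTasks(const std::vector<double>& work) {
  std::array<double, 3> busy_until = {0.0, 0.0, 0.0};
  std::vector<int> placement;
  for (double w : work) {
    int best = 0;
    double best_finish = busy_until[0] + w / kSpeed[0];
    for (int c = 1; c < 3; ++c) {
      double finish = busy_until[c] + w / kSpeed[c];
      if (finish < best_finish) { best_finish = finish; best = c; }
    }
    busy_until[best] = best_finish;
    placement.push_back(best);
  }
  return placement;
}
```

Unlike round-robin, this placement reacts to task sizes and core speeds: a heavy task lands on the fast core, and later tasks avoid cores that are already backed up.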
The embedded world has often looked at specialized cores to help with system optimization. “If we could get to the required level of features and performance at acceptable cost and power by building arrays of cache-coherent processors we would do it,” says Drew Wingard, chief technology officer at Sonics. “The point is that those are not sufficiently optimal solutions to the problem. The reason why we have heterogeneity is because we are trying to focus the hardware to better match the system application. So we find there are different styles of processors or hard wired pieces that do a far better job in the performance/feature/power/area matrix than we can get from general purpose programmable processors.”
Defining the problem
One way to help with mapping optimization is to have benchmarks that can show how effective the mapping is and thus operate as an optimization tool. That is a task the Embedded Microprocessor Benchmark Consortium (EEMBC) is tackling. “In the early days of EEMBC, we would write benchmark code and people would pick their compiler and they would run it on their processor and it was all very portable,” says Markus Levy, president of EEMBC. “This does not work with heterogeneous chips anymore. Today they include accelerators such as a GPU and DSP and possibly hardware accelerators. The level of heterogeneity is infinite, so you have to look at it from a higher level of abstraction. From a benchmarking standpoint, we develop a template that can run on a specific architecture. But the goal is that people will take that and modify it to work with their own custom implementation.”
Several programming models are under discussion. "They highly depend on the application domains," points out Frank Schirrmeister, senior group director for product management in the System & Verification Group of Cadence. "OpenCL is one alternative. OpenMP and MCAPI are other alternatives, as are other graphics-specific programming models like CUDA. The challenge is mapping the higher-level programming models into the actual hardware fabrics, which sometimes is done via compilation and run-time multicore communication frameworks. It is all about abstracting the hardware from the programming itself."
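OpenMP illustrates the "abstract the hardware from the programming" idea in its simplest form: the pragma only declares that iterations are independent, and the runtime decides how to spread them across whatever cores exist. A minimal sketch:

```cpp
#include <cassert>
#include <vector>

// The pragma expresses available parallelism; the OpenMP runtime maps
// iterations onto the cores that are actually present. Compiled without
// OpenMP support the pragma is ignored and the loop runs serially with
// identical results, which is part of the portability appeal.
double Dot(const std::vector<double>& a, const std::vector<double>& b) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < (long)a.size(); ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

The catch, as the quote notes, is that this abstraction says nothing about which core type should run the loop, which is exactly where heterogeneous mapping tools have to step in.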
There are several layers to the problem. “There are lots of initiatives in the industry,” says James Aldis, architecture specialist for the Verification Platforms Group of Imagination Technologies. “A programming language for capture of applications is always going to be needed, and many new/existing languages are making sure they support heterogeneous and parallel applications. Most interesting at the moment may be OpenCL-C and proprietary languages like C++-AMP, but C++ is possibly the long-term winner.”
Then there is the platform itself. “Support for language implementations on diverse platforms is also required,” continues Aldis. “Virtual architectures that attempt to abstract the underlying processors and their architectures so that hardware details are not required at compile-time are in their infancy.”
One such attempt is happening within the Multicore Association. "They announced the first version of SHIM (Software-Hardware Interface for Multi-many-core) about a year ago," says Maximilian Odendahl, CEO of Silexica. "We are now working on the next generation of the standard. The standard defines an XML file which would describe the number of cores, the type of cores, the different sizes of memory, the communications architecture. Is it a bus? Is it a NoC? What frequency and voltage islands exist? Then we can use compiler technologies to understand the software, not only in terms of computational requirements, but in terms of how much data is being sent at which points in time."
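The flavor of such a platform description can be sketched as XML. To be clear, the element names below are invented for illustration and do not follow the actual SHIM schema; they only show the kinds of facts the quote says the file captures (core counts and types, memory sizes, interconnect style, voltage islands).

```xml
<!-- Hypothetical sketch only: element names here are invented and do
     not follow the real SHIM schema. -->
<platform name="example_soc">
  <cores>
    <core id="cpu0" type="big" count="2" frequency="1800MHz"/>
    <core id="dsp0" type="dsp" count="1" frequency="600MHz"/>
  </cores>
  <memories>
    <memory id="sram0" size="512KiB" latency_cycles="2"/>
    <memory id="ddr0"  size="1GiB"   latency_cycles="40"/>
  </memories>
  <interconnect type="noc"/>
  <voltage_island cores="dsp0"/>
</platform>
```

With a machine-readable description like this, a mapping tool can weigh, say, a DSP's lower clock against its cheaper access to local SRAM when placing a task.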
Moving data around is just as important to overall performance. "In principle, it is the communications overhead that limits scalability," says David Kruckemyer, chief hardware architect for Arteris. "In the early days, when you wanted to share a piece of data you had to check in all of the caches. You had to check in a lot of different places because you made copies. They were very basic and they checked everywhere. As you scale up the system, you have order N² communications traffic. That placed an upper bound on scalability. So they implemented directories and snoop filters to track where the copies are throughout the system. There was a small tradeoff in terms of area, but it was a win in terms of scalability."
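The directory idea can be reduced to a toy model: keep one sharer bitmask per cache line, so a write sends invalidations only to the cores that actually hold a copy, rather than broadcasting to all N caches. This is a sketch under heavy simplifying assumptions (two states only, no real coherence protocol), not how any shipping interconnect is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy directory / snoop filter: one sharer bitmask per cache line.
class Directory {
 public:
  // A core reading a line becomes a sharer of it.
  void RecordRead(uint64_t line, int core) { sharers_[line] |= 1u << core; }

  // A core writing a line invalidates all other sharers. Returns the
  // number of targeted invalidations sent -- only to actual sharers,
  // instead of broadcasting to every cache in the system.
  int RecordWrite(uint64_t line, int core) {
    uint32_t others = sharers_[line] & ~(1u << core);
    sharers_[line] = 1u << core;  // writer becomes the sole holder
    return __builtin_popcount(others);
  }

 private:
  std::unordered_map<uint64_t, uint32_t> sharers_;
};
```

The area cost the quote mentions is visible here too: the bitmask storage grows with core count and tracked lines, but the per-write traffic drops from "everyone" to "actual sharers."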
Several organizations are tackling the mapping and optimization problem, and task graphs are at the core of their approaches. EEMBC first defines use cases, such as those needed for ADAS or embedded vision. Each use case is divided into individual processing steps at the kernel level, and each step is encapsulated in a µBenchmark so it can run on any of the available compute devices while capturing the processing performance and latency inherent in that step. The steps are then combined using a directed acyclic graph (DAG). The benchmark deploys µBenchmark nodes across the available compute devices on the heterogeneous architecture, identifying an optimal distribution and application flow.
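Combining per-step latencies through a DAG is mechanical once each µBenchmark has been measured: a node can start only after all of its predecessors finish, and the end-to-end latency is the critical path. A minimal sketch in that spirit (node numbering and latencies are invented; nodes must be supplied in topological order):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// finish[i] = latency[i] + latest finish time among i's predecessors.
// The overall pipeline latency is the maximum finish time, i.e. the
// critical path through the task graph.
double PipelineLatency(const std::vector<double>& latency,
                       const std::vector<std::vector<int>>& preds) {
  std::vector<double> finish(latency.size(), 0.0);
  double total = 0.0;
  for (size_t i = 0; i < latency.size(); ++i) {
    double start = 0.0;
    for (int p : preds[i]) start = std::max(start, finish[p]);
    finish[i] = start + latency[i];
    total = std::max(total, finish[i]);
  }
  return total;
}
```

A mapper can then re-evaluate this after moving a node to a different compute device (which changes that node's latency) and keep whichever placement shortens the critical path.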
Synopsys’ De Schutter explains how prototyping can enable that kind of optimization: “On the boundary between hardware and software, we are trying to help with prototyping by extracting the problem and looking at task graphs rather than software stacks. You can look at what kind of tasks you want to do, and you can explore how they are mapped and how that will impact the overall power and performance profile for a specific application. That can also help with architecting the system.”
The tools on the hardware and software sides remain a long way apart. "On the hardware side we have come up with a lot of tools to explore parallelism and to do tradeoffs between what goes into software and what needs to be implemented in hardware," continues De Schutter. "But when you get into the software world, we have some mechanisms to help, but it is more rudimentary and tends to be more about running tests over and over again. It seems to take the software community longer to embrace change and new methodologies and to understand the capabilities of new features or prototyping that can help with hardware-related problems."
There have been several attempts to take knowledge from the hardware world and develop tools to help software development. "Imperas 1.0 went down a path of trying to use simulation to help users optimize software for multiprocessor platforms, be they homogeneous or heterogeneous," says Larry Lapides, VP of sales for Imperas. "We were way ahead of the times and we did not have enough knowledge of the software side of it to build an effective solution. Other companies, such as Critical Blue, attacked it from a different angle: 'How could I parallelize software?' They did that by analyzing the software and finding potential areas that could be parallelized. Over the next 10 years, they switched focus because they had not managed to find a way to automate that, and instead provided more of a service model that took advantage of their tools."
Today, Silexica is attempting to do similar things. In addition to having tools that can map tasks onto processors, it is also attempting to deal with the legacy issue. "The parallelizer can take in legacy sequential C code and allows you to explore that space," says Silexica's Venkatramani. "There are techniques in compilers such as data-level parallelism or task-level parallelism, and these are inherently what we use to analyze the C code with no modification or pragmas. From that we can extract the possible parallelism. Armed with that information you can work out where to spend your time and get the most bang for the buck."
Silexica’s Odendahl explains how it works. “We do both static and dynamic analysis. Static analysis has been done before, but it is not tremendously successful because you are limited in the subset of C that you can actually understand. For that reason, we do both static and dynamic analysis, where we get a representative understanding of an execution of the application. We can provide additional information that we gathered from the static analysis. It is not just the identification of parallelism but also the target timing analysis. This has to be with the target platform in mind to be able to understand hotspots. Then, if you find places that are not critical it might not make sense to parallelize at that location. But if you find a hotspot, then you have to start looking at control and data dependencies. We can provide the risk analysis and the productivity increases.”
Most embedded systems have multiple layers of complexity, and this is compounded when multiple operating systems and user-defined applications run on the device. "When you build a new product with different OS instances, perhaps hypervisors, the booting and management of the various software pieces on the different cores is hardwired together," says Warren Kurisu, director of product management and marketing for Mentor Graphics' Embedded Software Division. "It is a big engineering effort, but it is not rocket science. The problem is that once you start to reconfigure that investment, you are just redoing the engineering all over again. That is the costly part of it."
Again, some notion of a framework is required. Kurisu lists some of the problems and issues that have to be addressed: "Configuring and deploying multiple operating systems and applications across heterogeneous processors; booting multiple operating systems efficiently and in a coordinated manner across heterogeneous processor cores; communicating between isolated subsystems on a multicore processor or between heterogeneous processors; visualizing interactions between heterogeneous operating systems on heterogeneous multicore platforms; and functionality that allows interoperability of open-source and proprietary environments with all of the above capabilities."
After all that, you can start to consider where tasks may actually get executed.
There are several solution types that are being attempted at the moment. Some of them create virtual instruction sets so that code can be mapped to a variety of processors. Others create APIs that allow each platform to appear somewhat generic. Additional approaches are looking at defining the platforms or the hardware/software boundary. And then there are several programming languages or extensions to existing ones.
What is clear is this is far from being a solved problem, and the industry is in its infancy. There is a long road ahead and the problem keeps getting harder. Hopefully, solutions will emerge at a faster rate than the problem changes.