Not All There: Heterogeneous Multiprocessor Design Tools

Tools are being developed to optimize a single processor for a given piece of software, but gaps remain for multicore systems.

The design, implementation, and programming of multicore heterogeneous systems are becoming more common, often driven by the software workloads, but the tooling that helps optimize the processors, interconnect, and memory remains disjointed.

Over the past few years, many tools have emerged that help with the definition and implementation of a single processor, optimized for a given set of software. While companies such as Cadence and Synopsys have had internal tools for their proprietary architectures for decades, RISC-V has created a new and open marketplace for such tools. With the breadth of options available and the necessary ecosystem to support them, the rate of adoption has been rapid.

But as soon as a design includes multiple processors, or heterogeneous compute environments, the tooling becomes a lot less available. There is a dearth of tools that can help partition software, or optimize a processor while considering the memory subsystem or the communications network. There are also gaps in tools that can analyze dataflow within the system. While there are a handful of tools that can do some of those tasks, there is nothing that can do them all, and there are no flows that combine multiple tools to accomplish the task.

Many questions get raised as soon as multicore enters the picture. “There is a RISC-V multicore frenzy,” says Frank Schirrmeister, vice president of marketing at Arteris IP. “You typically extend them in similar ways. Maybe you have multiple clusters. Within the cluster you may have multiple compute units, and perhaps you co-optimize the compute and the transport. You may decide to have an application accelerator associated with the cluster or attached to each core. Each processor could be a heavily optimized application-specific instruction processor (ASIP) using something like CodAL from Codasip, or LISATek/NML from Synopsys. You could use high-level synthesis to build an application accelerator.”

What’s missing
Some fundamental pieces of a flow simply do not exist, and others do not work together as seamlessly as required. But some progress is being made.

“Gaps in the design methodology can always use less automated, traditional RTL design approaches,” says Zdeněk Přikryl, CTO for Codasip. “As processor design automation tools develop further, we can expect gaps to be closed and a complete hardware/software co-design methodology to emerge.”

Huge gaps do remain in the flow. “There is nothing like an automatic parallelization tool, even for homogeneous multicore, even less so for a heterogeneous multicore system,” says Tim Kogel, principal engineer for virtual prototyping at Synopsys. “People are struggling to find the right data model. NVIDIA has a huge investment in their CUDA franchise, developing all kinds of programming models to address that. There’s a lot that still needs to happen. What I see today is you have separate tool chains for the different sub-systems. For example, you have the machine learning compiler tool chain for your neural networks, and then you have the traditional compiler tool chain for the CPU, and then it’s a bit of an effort to stitch them together.”

When restrictions are imposed, solutions become more tractable. “The ability of RISC-V to tightly integrate accelerators to a processor is a big advantage because it reduces the range of heterogeneity needed to be supported by Linux,” says Charlie Hauck, CEO at Bluespec. “It can do this without constraining the range of heterogeneity implemented by accelerators. This minimizes the scope of the tool changes needed to support domain-specific processors in a heterogeneous multiprocessing environment. This includes cycle-approximate system-level C/SystemC modeling for fast architectural exploration to refine specs before moving into implementation and validation.”

A software-driven flow has to start from the software and work down to the hardware. “With model-based systems engineering, they are starting with the overall functionality, which crosses multiple domains — software, hardware, physical,” says Neil Hand, director of strategy for design verification technology at Siemens Digital Industries Software. “As they start pushing down, they are looking at the division of hardware and the software — what goes into each processor. And generally, they want to go with the processor that will do the job in the most effective way. If they can get away with running the algorithm on a single monolithic processor, they would probably do that. If it’s not meeting the requirements, they want a way to peel that off and decide what functionality has its own embedded core, or its own accelerators.”

But as soon as you put any form of network between them, you have to analyze the communications overhead. “There are subsets of the flow that work,” says Arteris’ Schirrmeister. “For example, in our tool suite, we spit out a model that helps you to analyze the NoC itself. You figure out how many switches you need when you have 10 initiators and 15 targets. These are my priorities. You do that architecture analysis and then you figure that out in context. Then you export it to something like Platform Architect. Those relationships are in place, but the bigger problem is they are disconnected. Even if the technical flow is there, the capacity of a person or even a team to interact efficiently to make all these analysis optimizations based on the architecture feedback is very distributed.”

And they may not always be asking the right questions. “When you look at the offerings from the network on chip companies, they’re still viewing the world as systems that want to be able to perform any task,” says Russell Klein, program director for the Catapult HLS team at Siemens. “They look at the data you need to move, what latencies we have, what bandwidth is required. But they’re not considering the possibility of taking this memory and putting it ‘over here in this compute area.’ Then the data never needs to move or use the interconnect. If we sequester groups of data around where it’s needed, and group them with the compute elements, communications can be minimized. What about in-memory compute? Those types of things need to be considered when you’re looking at your interconnects. Which of these can we get off the interconnect and not even have to move it in the first place? I don’t think we have the tools that are able to understand that problem and offer up solutions.”

A lot more is required. “There is a class of exploration and partitioning and co-design capabilities that exist, and you can brute force your way through today,” says Siemens’ Hand. “But there’s more that is needed. We’ve got a lot of capabilities to help answer the question, ‘Have I met my goals?’ But one of the challenges with a top-down flow is you need to be able to make measurements, to do estimations, and for that you need models, you need performance information, you need the software to understand the overhead of the system, the system tradeoffs. That’s a really tough problem to solve. If you have a NoC, if you’ve got shared memory, or not shared memory — these are huge impacts on the overall performance. And you cannot abstract that away in a software container, unfortunately.”

Software readiness
Finding the best mapping of software onto available hardware, or defining the most appropriate hardware architecture for a given piece of software, requires extensive analysis capabilities.

“When you have a complex chip with fancy hardware features, how does the software developer make use of that? That’s where the biggest challenge lies,” says Synopsys’ Kogel. “Many people are developing hardware for AI, but the battleground is the ML compiler. You can design beautiful accelerator hardware, but if the compiler cannot make efficient use of it, it’s most likely not going to be used. That is a big problem to be solved.”

Today, those environments are distributed. “Having a software environment that allows you to do that analysis is valuable,” says George Wall, product marketing group director for Tensilica Xtensa processor IP at Cadence. “But that’s probably more than one software environment. There are probably some very processor-centered environments you could look at, like an ISS with a SystemC model around it using your processor vendor’s development tools. And then there’s probably a higher-level piece with the virtual platform.”

Even RISC-V is struggling to maintain the software ecosystem when complete freedom is exercised. “In order to keep the software complexity under control, you need to standardize on the instruction set variants,” says Kogel. “For the integer and the floating point, and for each set of instructions, they have a pre-defined set, which you should use in order to benefit from the general infrastructure. So instead of opening it up completely, standard profiles are defined. But if you choose not to use those, then you’re on your own.”
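To make that concrete, here is a minimal C sketch, assuming a toolchain that implements the RISC-V C API feature-test macros and the RVV intrinsics. A compile-time check selects a vector path when the V extension is part of the target, and a portable scalar fallback otherwise; the build flags and the function itself are illustrative, not drawn from any vendor flow.

```c
/* Illustrative only. Build targets (flags assume a RISC-V cross-compiler):
 *   -march=rv64gc   baseline integer/float target, scalar path
 *   -march=rv64gcv  adds the V (vector) extension, RVV path
 */
#include <stddef.h>
#if defined(__riscv_vector)   /* feature-test macro: V extension enabled */
#include <riscv_vector.h>
#endif

/* Scale a buffer by a constant, picking a code path per ISA variant. */
void scale(float *dst, const float *src, float k, size_t n) {
#if defined(__riscv_vector)
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);              /* strip-mining */
        vfloat32m1_t v = __riscv_vle32_v_f32m1(src + i, vl);  /* vector load  */
        v = __riscv_vfmul_vf_f32m1(v, k, vl);                 /* vector * k   */
        __riscv_vse32_v_f32m1(dst + i, v, vl);                /* vector store */
        i += vl;
    }
#else
    for (size_t i = 0; i < n; i++)   /* portable fallback for baseline targets */
        dst[i] = k * src[i];
#endif
}
```

Every optional extension multiplies the number of such branches that compilers, libraries, and test infrastructure must carry, which is exactly the growth the standard profiles are meant to bound.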

Profiles restrict the hardware to help simplify the software ecosystem. “The RISC-V software ecosystem is getting ready for it,” says Codasip’s Přikryl. “We have different groups dedicated to profiles and platforms that nicely summarize what is expected as a baseline and how the domain specific features may be used. With this in place, the software ecosystem can grow nicely, because it is clear how different parts of the software stack should be handled.”

Things become less clear when heterogeneity is involved. “We are seeing people put hypervisors on heterogeneous systems,” says Jeff Hancock, head of product management for Siemens embedded software. “Sure, that could be considered single-threaded on the application cores, but in automotive they’re running Linux and Android Auto, for example, on a hypervisor, on a quad A53, or something like that. And then they have these R cores that are on the same chip that are running AUTOSAR. So even on a single SoC with multiple cores, they’re actually running heterogeneous-type environments.”

Domain-specific solutions are being created. “We see the SOAFEE initiative from Arm in the automotive domain,” says Kogel. “This is an attempt to formalize the software development in a way that you can go seamlessly from cloud-native development to deployment, trying to encapsulate things with Docker containers and virtualization. It takes multi-company initiatives like this to come up with something at this level.”

But the software has to change. “Whether you’ve got eight Intel cores on your laptop, or a dozen cores in your phone, the programming model is still a single-threaded application,” says Siemens’ Klein. “And that’s the limiting factor. As long as the software folks are going to demand that we program with a single core, and the illusion that we’re the only thing on the processor — until we can get past that, we’re just going to have to keep trying to build larger processors and accelerate things within that model. As soon as you break that model, all kinds of really cool things happen. But that’s not a hardware technology issue. That’s a software, cultural issue.”

Attempts have been made to partition a single thread onto multiple cores. “The architects who build these systems are working out how they can break things down and do them in parallel,” says Simon Davidmann, founder and CEO for Imperas Software. “The applications are not one thread of C, because that would be very hard to do anything with. There’s no magic in parallelizing C code.”
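A small C example shows why. In the sketch below (function names are hypothetical, built with OpenMP enabled, e.g. -fopenmp), the first loop’s iterations are independent, so a single pragma can legally fan them out across cores. The second loop carries each iteration’s result into the next, so no tool can split it across cores as written. A parallel prefix-scan formulation of the second loop does exist, but discovering it means recognizing the algorithm rather than just analyzing the loop, which is exactly the missing magic.

```c
#include <stddef.h>

/* Independent iterations: a candidate for parallel execution. */
void saxpy(float *y, const float *x, float a, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* no iteration reads another's output */
}

/* Loop-carried dependency: iteration i needs the sum from iteration i-1,
 * so this loop is sequential as written. */
void prefix_sum(float *y, const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
        y[i] = acc;
    }
}
```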

Being able to take arbitrary software and target it to arbitrary hardware is the holy grail. Just one step of that is being able to derive a dataflow model, and that is a very hard problem to solve in the general case.

“It works for ML frameworks because you have a better-suited data model to begin with,” says Kogel. “TensorFlow has the inherent parallelism and data dependencies defined in a much nicer way than some arbitrary piece of C, C++ code. Currently, what we see is people are using tracing, trying to re-engineer these dependencies from running applications on virtual platforms or on target hardware to extract the actual dependencies. It is a hard problem to do that automatically based on static or dynamic code analysis. We have seen attempts in this direction, but it’s still more of a research topic. It would bring great benefit if it could be solved.”
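Pointer aliasing is one concrete reason the C case is so hard. In the hypothetical sketch below, a compiler cannot prove that dst and src do not overlap, so it must assume every iteration may depend on the one before it. C99’s restrict qualifier is the programmer manually supplying the very dependency information that a TensorFlow graph carries by construction in its edges.

```c
#include <stddef.h>

/* Without alias information, a tool must assume the worst: if
 * dst == src + 1, each iteration reads the previous iteration's write,
 * and the loop is inherently serial. */
void shift_copy(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* 'restrict' promises the buffers do not overlap, making the iterations
 * provably independent and the loop safe to vectorize or parallelize. */
void shift_copy_noalias(float *restrict dst, const float *restrict src,
                        size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```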

The problem has always been data dependencies. “One thing about neural networks is they’ve got a regular structure and they’re embarrassingly parallel,” says Klein. “There’s so much parallelism in there that it becomes a lot easier to go in and understand what can be partitioned off on different accelerators without introducing any data dependencies. As we move to general-purpose software, generalized algorithms, that’s still a nut that nobody has cracked. How do we take the C program that somebody wrote, with the illusion that they’re going to be the only one on the computer and it’s going to be a single-threaded application, and be able to move that into multiple small CPUs? That’s still a really hard problem. And again, the software community doesn’t seem to be embracing the potential benefits of going there.”

The problem exists in all application areas. “Most design teams and IP vendors within AI/ML have focused on rule number one — provide processing horsepower matched to the workload,” says Steve Roddy, chief marketing officer at Quadric. “But they have neglected rule number two — don’t burden the application software developer. Most ML solutions have offloaded a portion of the graph processing from a legacy CPU or DSP, but not the entire ML graph workload. This further complicates the life of the software developer.”

Change is hard. “We want to get there,” says Cadence’s Wall. “It is an immensely challenging problem to have a software development environment completely decoupled from the architecture. Even if one does take advantage of something like OpenMP, there’s always going to be some level of architectural dependency.”
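Even a deliberately portable OpenMP kernel shows where the architecture leaks back in. In the hedged sketch below, the schedule, chunk size, and thread count have no portable “right” values; the numbers chosen, which are purely illustrative, encode assumptions about one target’s core count, cache sizes, and memory topology.

```c
#include <stddef.h>

/* A 1-D smoothing stencil, assuming n >= 2. The OpenMP clauses are
 * tuning knobs: schedule(static, 1024) and num_threads(8) may be right
 * for one SoC and wrong for the next, so the architecture still shows
 * through the "portable" programming model. */
void smooth(float *out, const float *in, size_t n) {
    #pragma omp parallel for schedule(static, 1024) num_threads(8)
    for (size_t i = 1; i < n - 1; i++)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}
```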

So much is being left on the table. “What you leave on the table in terms of performance and efficiency — and the overall capability of the system when you design the hardware, and then chuck it over the wall or sell it to another company and then write the software — is huge,” says Klein. “There’s so much loss of capability, there’s so much loss of efficiency, so much loss of performance. I know of very few organizations where the necessary people sit down and design the hardware while thinking about the architecture of the software that’s going to run on it. What they’re almost always focused on is designing hardware to run any single-threaded application and trying to fit it in that mold. That may be starting to break down.”

It takes a team. “Understanding the issues takes a team effort,” says Schirrmeister. “It requires a set of architects and is an incredibly complex task. At the same time, system complexity is growing tremendously fast. We are approaching an era where the tools may become connected enough for this to actually work. We are connecting the tools vertically, and some flows are in place. There are some of the architecture optimization tools. We certainly don’t have the people who can run it all, so that’s an education problem. But if you combine it with machine learning and AI, then we have a chance — if you can afford the cycles to go through meaningful variations.”

Conclusion
For the past 25 years, people have been looking at the potential for system-level tools that could encompass hardware and software. Being able to optimize both and map one to the other has obvious advantages, but those advantages have clearly never been enough to overcome the issue of software productivity. Getting to market faster is more important than a product that operates faster, consumes less power, or is cheaper. Many in the industry think that change is inevitable, but they have been wrong for a long time.


