Software-Defined Hardware Architectures

Hardware-software co-design is established, but the migration to multiprocessor systems adds many new challenges.


Hardware/software co-design has been a goal for several decades, but success has been limited. More recently, progress has been made in optimizing a processor, as well as in adding accelerators, for a given software workload. While those two techniques can produce incredible gains, they are not enough.

With increasing demands being placed on all types of processing, single-processor solutions serve an ever-decreasing niche. As soon as multiple processors are required, the interconnect, communications, and memory architectures play a huge role in the overall efficiency and effectiveness of the solution.

Tools do exist for all of the necessary tasks, but no single tool exists that can do all of them. Flows are rudimentary, but with massively increasing attention — largely brought about by the RISC-V processor architecture — we can expect progress to be quick.

“We are at a very interesting point in time for chip design or product design,” says Simon Davidmann, founder and CEO for Imperas Software. “Today, pretty much all electronic products are defined by their software. And it’s not just one little bit of software. Think about the complexity of ADAS or inference engines. We have these software workloads, and now we’re having to design chips to implement them. You can’t have a single processor because we haven’t got the silicon technology. It ends up being fabrics of processors. ADAS needs orders of magnitude more processing than is available from a handful of SMP processors. Nobody can sell me a chip that will do what I want, so I’ve got to build my own chip. You need three things. You need the optimized processors, you need the interconnect, and then you need the accelerators with the processors.”

In preparation for this story, a number of interviews were conducted. Siemens put forth three experts, each from a different division of the company. “This call is a good indication of the challenge within the industry,” observed Neil Hand, director of strategy for design verification technology at Siemens Digital Industries Software. “Whoever you talk to, whatever their focus, they see the problem somewhat differently.”

State-of-the-art
While the notion of processor plus accelerator was first introduced with the Intel 8086/8087 in 1980, all of that functionality was crammed onto a single processor in PCs starting with the Intel 486DX in 1989. But in the mobile world, not all of that functionality was combined due to stringent power and area limitations.

“Mobile devices have looked like this for decades, with GPUs, video, display, ISP, and DSP all accelerating specific parts of the mobile workload,” says Peter Greenhalgh, senior vice president of technology at Arm. “We only have to narrow this view slightly to arrive at networking equipment, which consists of many accelerators (e.g. for packet processing) surrounding an applications processor.”

Two things are changing, though. First, the processor itself now can be modified for particular applications or use cases. And second, the software is increasingly driving the process.

New tools have been developed that are well suited for these goals, as well. “Domain-specific processors are more than just a processor plus accelerator,” says Zdeněk Přikryl, CTO for Codasip. “There are some domains where this approach still makes sense. But there are many others where the accelerator is not enough, and the processors themselves should be optimized or customized. Domain-specific processors may contain specific instructions, helping to significantly improve performance. Or, new security approaches, such as CHERI, can be adopted. Tools support not just an implementation, i.e. RTL code, but also software porting, such as C/C++ compilers, executable models, debuggers, and so on.”
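
To make the software side of that concrete, the sketch below shows one way a custom instruction could be exposed to C code on a RISC-V target, using the GNU assembler's `.insn` directive behind an inline function. The opcode, function fields, and the packed multiply-accumulate operation itself are invented for illustration, not taken from any of the tools or extensions mentioned above, and the fallback path shows what plain C would do without the custom hardware. In a real flow, the processor description tools would generate a compiler intrinsic and simulator support, so the inline assembly would not be written by hand.

```c
/*
 * Illustrative sketch only: exposes a hypothetical "dot-step" custom
 * RISC-V instruction (packed 8-bit multiply-accumulate) to C code.
 * The opcode and funct fields are placeholders, not a real extension.
 */
#include <stdint.h>

static inline int32_t dot_step(int32_t acc, uint32_t a, uint32_t b)
{
#if defined(__riscv) && defined(USE_CUSTOM_DOT)   /* USE_CUSTOM_DOT is hypothetical */
    int32_t result;
    /* custom-0 opcode (0x0B), funct3=0, funct7=0: rd = custom(rs1, rs2) */
    __asm__ volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
    return acc + result;
#else
    /* Portable fallback: what the compiler emits without the extension. */
    int32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i));
        int8_t bi = (int8_t)(b >> (8 * i));
        sum += (int32_t)ai * (int32_t)bi;
    }
    return acc + sum;
#endif
}
```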

Those processors are going to be operating in a larger context. “RISC-V is just at the right time and place for people who want to build their own processors and their own fabrics of processors,” says Imperas’ Davidmann. “This is part of this perfect storm, because suddenly everybody wants to build these new fabrics with RISC-V. They need virtual platforms. They need simulation of fabrics early on in their design process so they can run their software workloads and analyze them to get their software to be functionally correct. Then they work out if the performance needs are met before they implement it.”

There are some cases where an even larger context has to be considered. “If we go with the idea of a heterogeneous system that’s doing many things, but utilizing localized resources, you start to get into the question of, ‘How does the system behave?’” says Siemens’ Hand. “It’s not so much how do the software and the hardware get designed, but what is the overall architecture of the system. What do you need to do for heterogeneous system modeling and system analysis? That will help you identify where your bottlenecks are.”

Adding in too much flexibility can create problems, as is being demonstrated with RISC-V today. “Too much freedom creates unexpected disasters later,” says Frank Schirrmeister, vice president of marketing at Arteris IP. “You have RISC-V, where in principle you can do everything, but that creates a huge problem. If you create a fragmented software ecosystem, you don’t recreate the experience where your software will just work. That was the promise of the Arm partnership. RISC-V is now setting up guardrails that will help you get a critical mass of software to run on those profiles.”

That flexibility also impacts verification. “Modifying a CPU is a highly invasive effort and requires significant verification cycles to get right,” says Arm’s Greenhalgh. “Moreover, when modifying a CPU, it’s rarely a matter of just unpacking a standard verification suite and proving the CPU is correct. Instead, new tests and testbenches need to be written, and this can be a challenge if not done by the engineers who designed and verified the original CPU, as they will have a lot of existing knowledge about the CPU.”

This is where some of the dedicated processor tools shine. “You need tools at the level of ASIP Designer from Synopsys, or Codasip, or other open-source tools, which do instruction-level modeling of the target architecture, and then generate a compiler and simulator from that description of your architecture,” says Tim Kogel, principal engineer for virtual prototyping at Synopsys. “With those, the software developer can again benefit from all these beautiful custom instructions. That’s more the problem of instruction-level architecture optimization, the tweaking to make a core for a certain target architecture.”

At the same time, the industry is wanting to push into new areas. “Tenstorrent has done smaller RISC-V systems for AI accelerators in the past, but the roadmap for their architecture is a sea of CPU cores, with additional chiplets of AI accelerators with more memory and external I/O,” adds Kogel. “At SNUG, Mercedes Benz talked about a multi-SoC (mSoC) for autonomous driving. It is composed of chiplets that do different tasks — compute, inference, some specialist central fusion processing — again, a heterogeneous system, because you need to optimize for performance as well as low power. On a smaller scale, the latest generation of Infineon microcontrollers for embedded computing is also a multicore CPU plus a vector DSP.”

Specialized tools are emerging to deal with this. CacheQ, for example, splits software into what will run on a general-purpose processor and what will run on an accelerator. And recognizable architectural classes are emerging that help to sort these solutions.

“If we look outside of processors or processor clusters, we can see domain-specific solutions in interconnects or other system level parts, too,” says Codasip’s Přikryl. “For example, there are different AI engines that leverage tiled architectures with specific interconnects between tiles.”

Moving to multiprocessor solutions
Applications processors migrated to homogeneous multiprocessor solutions when it stopped being possible to economically increase performance. “There is a point where you reach diminishing returns by trying to get everything onto one core, and you really do have to distribute the workload over multiple cores,” says George Wall, product marketing group director for Tensilica Xtensa processor IP at Cadence. “That’s been true for quite some time, and it presents new design challenges. With domain-specific computing, deterministic execution is still critical. With a multi-core design, you lose some of that determinism due to the uncertainty and the variability in terms of communicating between processors and shared memory. The other big challenge with any multiprocessor is software. How do you adapt the software tools to be aware of the multicore, multiprocessor architecture? How can the software help you partition your workload so you can achieve that efficiency? And how well can the tools help identify the efficiency you’re getting out of the multiprocessor configuration?”
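
As a point of reference for that last question, established shared-memory frameworks already let the toolchain do some of the partitioning. The OpenMP sketch below is not the vendor tooling Wall is describing, just a minimal illustration of a compiler and runtime dividing a data-parallel loop across whatever cores are available.

```c
/*
 * Minimal OpenMP sketch: the runtime partitions loop iterations across
 * the available cores, instead of the developer hand-assigning work.
 */
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Iterations are distributed over however many cores are present. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %.1f (computed on up to %d threads)\n",
           N - 1, c[N - 1], omp_get_max_threads());
    return 0;
}
```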

While progress has been made for homogeneous multiprocessing, heterogeneous adds additional challenges. “The move to heterogeneous multiprocessing will become widespread once it is extended to support domain-specific processing natively,” says Charlie Hauck, CEO at Bluespec. “For instance, hardware accelerators connected to processor cores in multi-processing Linux systems can be made to look like software threads. This enables accelerators to be scheduled and executed with the same powerful task concurrency that software multithreading has provided for decades, eliminating lots of tedious and error-prone manual scheduling. It also enables Linux to address a host of other problems, like preventing accelerator malfunctions from crashing the system or affecting other processes. Supporting domain-specific computing by extending proven and familiar technologies like multi-processing, Linux, and multi-threading will facilitate adoption by minimizing methodology changes.”
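
A rough sketch of that idea, assuming a hypothetical character device exposed by an accelerator driver: the offload is wrapped in an ordinary POSIX thread, so the Linux scheduler can interleave it with CPU work and other processes. The device path and ioctl request code are placeholders, not Bluespec's implementation.

```c
/*
 * Conceptual sketch: an accelerator job wrapped in a POSIX thread so the
 * Linux scheduler treats it like any other unit of concurrency.
 * "/dev/my_accel" and ACCEL_RUN_JOB are hypothetical; a real driver
 * defines its own ioctl interface.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct accel_job {
    const void *input;
    void       *output;
    size_t      len;
};

#define ACCEL_RUN_JOB 0xA0   /* placeholder ioctl request code */

static void *accel_worker(void *arg)
{
    struct accel_job *job = arg;
    int fd = open("/dev/my_accel", O_RDWR);   /* hypothetical device node */
    if (fd < 0) {
        perror("open accelerator");
        return NULL;
    }
    /* Blocks until the hardware finishes; while this thread sleeps, the
     * scheduler is free to run other threads and processes. */
    if (ioctl(fd, ACCEL_RUN_JOB, job) < 0)
        perror("accelerator job");
    close(fd);
    return NULL;
}

int main(void)
{
    char in[4096] = {0}, out[4096];
    struct accel_job job = { in, out, sizeof(in) };
    pthread_t tid;

    /* The offload looks like spawning one more thread. */
    pthread_create(&tid, NULL, accel_worker, &job);
    /* ... CPU-side work can proceed here in parallel ... */
    pthread_join(tid, NULL);
    return 0;
}
```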

That requires a system-level type of thinking. “The two keys to successful heterogeneous multicore SoC design are (1) to provide function-optimized processing matched to the workload, and (2) to avoid forcing a software developer to think about partitioning code onto a heterogeneous target,” says Steve Roddy, chief marketing officer at Quadric. “Graphics cores have long had OpenGL APIs that cleanly allow a developer to run on the GPU without needing to dig into the peculiarities of a given SoC architecture. While there are frameworks like OpenCL for writing programs that magically execute across heterogeneous processors, they often rely upon immature compilers that are cumbersome for software developers, or that come with large performance overheads. As a result, there has been limited uptake in the market for OpenCL and its kin.”
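
For readers unfamiliar with what that looks like in practice, the condensed OpenCL host code below (error handling omitted) dispatches a trivial kernel to whichever device the runtime picks. The portability is real, but so are the boilerplate and the dependence on the device's OpenCL compiler that Roddy points to.

```c
/*
 * Condensed OpenCL host sketch (no error checking): the same kernel
 * source can be dispatched to whatever device the runtime exposes,
 * which is both the portability promise and the boilerplate cost.
 */
#include <stddef.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void scale(__global float *buf, float k) {"
    "    size_t i = get_global_id(0);"
    "    buf[i] *= k;"
    "}";

void run_scale(float *data, size_t n, float k)
{
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    /* The kernel is compiled for the chosen device at runtime. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "scale", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), data, NULL);
    clSetKernelArg(kern, 0, sizeof(buf), &buf);
    clSetKernelArg(kern, 1, sizeof(k), &k);

    clEnqueueNDRangeKernel(q, kern, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);

    clReleaseMemObject(buf);
    clReleaseKernel(kern);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
}
```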

The problems extend outside of the processor cores. “This is where the two worlds come together,” says Kogel. “You are exploring your CPU instruction sets at the fine-grain instruction level. And you are exploring the macro architecture, considering how you put multiple cores together in a cluster that is connected to a backbone that is connected to DDR memory. And what does that do to the overall performance? That’s exactly what we see as these two worlds come together. They have been separated for a long time.”

Decisions must be made about how heterogeneous the cores should be. “From a software design standpoint, it’s much simpler to have those accelerators be identical within a cluster,” says Cadence’s Wall. “However, there will be cases where that cluster of processors will share a common accelerator. Such accelerators are more likely to be loosely coupled, and maybe performing tasks that happen only periodically during the processing stream.”

In high-end applications, accelerators also can be programmable. FPGAs and embedded FPGAs are gaining traction alongside GPUs, largely because they can be reprogrammed as algorithms are updated.

“There are thousands of different UARTs (universal asynchronous receiver-transmitters) out there,” said Andy Jaros, vice president of sales at Flex Logix. “If somebody is building a chip with a UART, they want to have the flexibility to target any application. It’s getting to the point where developing chips is expensive, whether it’s at 90nm or 28nm. So they’re willing to sacrifice a little bit of area for FPGA reprogrammability. Even though it’s going to cost a little bit more, it allows them to respond to many more customers.”

Communications
Many compute systems are constrained by communications, not processing horsepower. “Data storage and the data movement become extremely important,” says Russell Klein, program director for the Catapult HLS team at Siemens EDA. “One of the things you find as you break these systems down is that the data movement becomes the limiting factor for how quickly you can compute things. Usually, we can build enough compute elements to perform really anything. But it’s getting the data to those compute elements, and draining the results away to where they need to be used, that becomes the real challenge within that design.”

That problem exists at multiple levels. “While there is progress in scaling up the computation inside a single chip, scaling out with fast interconnect technology becomes the most important means to expand the horsepower,” says Weifeng Zhang, chief architect and vice president of software for Lightelligence. “This could be from chiplet-to-chiplet at the chip level, or accelerator-to-accelerator at the compute node level and beyond.”

Data movement becomes key to throughput. “It’s not just mapping the processing to the different types of cores. You need to think about how to move the data from one subsystem to another,” says Kogel. “What looks easy in a spreadsheet becomes a lot more complex when you have a custom accelerator that can do a certain task ten times faster. But if it takes ten times longer to move the data to that subsystem, you haven’t gained much in the end. It’s a complex problem to design these systems with sufficient bandwidth and the right communication paradigms, whether that’s DMA engines that move data efficiently, or a coherent interconnect that automatically moves the data via a cache coherency protocol. There are many options in a large design space to explore.”
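
Kogel's ten-times example is easy to put into numbers. The toy model below uses invented figures for task time, accelerator speedup, traffic volume, and link bandwidth; with these particular values the transfer time swamps the compute gain, and the "accelerated" path ends up slower than staying on the CPU.

```c
/*
 * Back-of-the-envelope offload model: an accelerator only pays off if the
 * transfer time does not eat the compute speedup. All numbers below are
 * illustrative placeholders, not measurements.
 */
#include <stdio.h>

int main(void)
{
    double cpu_time_s      = 10e-3;   /* task runtime on the host CPU       */
    double accel_speedup   = 10.0;    /* accelerator is 10x faster          */
    double bytes_moved     = 64e6;    /* input + output traffic             */
    double link_bw_bytes_s = 4e9;     /* DMA/interconnect bandwidth         */

    double accel_compute_s = cpu_time_s / accel_speedup;
    double transfer_s      = bytes_moved / link_bw_bytes_s;
    double offload_time_s  = accel_compute_s + transfer_s;

    printf("CPU only   : %.2f ms\n", cpu_time_s * 1e3);
    printf("Offloaded  : %.2f ms (%.2f compute + %.2f transfer)\n",
           offload_time_s * 1e3, accel_compute_s * 1e3, transfer_s * 1e3);
    printf("Net speedup: %.2fx\n", cpu_time_s / offload_time_s);
    return 0;
}
```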

Communications require physical interconnect and may require higher-level protocols. “The trend is reflected in the active development of the Compute Express Link (CXL) and Universal Chiplet Interconnect Express (UCIe) consortia, which grew quickly to more than 100 members,” says Lightelligence’s Zhang. “Similarly, the Open Compute Project (OCP) Foundation plays an important role in establishing and promoting the open domain-specific architecture (ODSA) with die-to-die interconnects using Bunch of Wires (BoW). With all these open standards, compute and memory disaggregation, such as memory pooling and heterogeneous accelerator (including GPU) pooling, will likely become the dominant infrastructure in data centers very soon.”

The organization of the memory subsystem is critical. “With standards like CXL coming to the forefront, memory is no longer trapped in one place,” says Nick Ilyadis, senior director of product planning for Achronix. “You can share it in different ways, and you can cache it in different ways. It gives you a lot more flexibility in how the memory architecture is spread around the different processors. There are accelerators, and there’s how the memory is segregated, shared, pooled, or segmented. There are going to be private memories, shared memories, and pooled memories.”

It all needs to be captured in an executable model. “Companies need to take a high-level approach by building a model of the application, even before they have a specific architecture,” says Kogel. “With a high-level resource model, like a virtual prototype of the architecture, you can explore options and then try to map the application to that. Then you can analyze the impact of the communication overhead and the resource utilization. You need to do a high-level exploration phase in order to select the right target architecture.”
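
At its simplest, such a high-level resource model is little more than a structured spreadsheet. The sketch below, with invented task and processing-element figures, estimates the latency of one candidate mapping by summing compute and interconnect costs; a real virtual prototype would add contention, queuing, and scheduling effects, but even this level of model shows where a bottleneck sits.

```c
/*
 * Toy "spreadsheet" resource model for early exploration: map abstract
 * tasks onto candidate processing elements and sum compute plus
 * interconnect cost. Names and numbers are invented for illustration.
 */
#include <stdio.h>

struct task { const char *name; double ops; double bytes_moved; };
struct pe   { const char *name; double ops_per_s; };

static double map_cost(const struct task *t, const struct pe *p,
                       double bus_bytes_s)
{
    return t->ops / p->ops_per_s + t->bytes_moved / bus_bytes_s;
}

int main(void)
{
    struct pe cpu = { "cpu-cluster", 20e9 };
    struct pe npu = { "npu",        400e9 };
    struct task tasks[] = {
        { "pre-process",  2e8,  8e6 },
        { "inference",    4e9, 32e6 },
        { "post-process", 1e8,  4e6 },
    };
    double bus = 16e9;   /* shared backbone bandwidth, bytes/s */

    /* Candidate mapping: inference on the NPU, everything else on the CPU. */
    double total = map_cost(&tasks[0], &cpu, bus)
                 + map_cost(&tasks[1], &npu, bus)
                 + map_cost(&tasks[2], &cpu, bus);
    printf("Estimated frame time: %.2f ms\n", total * 1e3);
    return 0;
}
```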

A future story will describe suitable flows, as well as how some of the gaps need to be filled. Most importantly, it will examine whether software is ready to take advantage of these architectures.


