Automatically mapping software onto existing hardware, or using software to drive hardware design, is highly desirable but very difficult.
For the past 20 years, the industry has sought to deploy hardware/software co-design concepts. While that effort is making progress, software/hardware co-design appears to have a much brighter future.
In order to understand the distinction between the two approaches, it is important to define some of the basics.
Hardware/software co-design is essentially a bottom-up process, where hardware is developed first with a general concept of how it is to be used. Software is then mapped to that hardware. This is sometimes called platform-based design. A very recent example of this is Arm’s new Scalable Open Architecture for Embedded Edge (SOAFEE), which seeks to enable software-defined automotive development.
Software/hardware co-design, in contrast, is a top-down process where software workloads are used to drive the hardware architectures. This is becoming a much more popular approach today, and it is typified by AI inference engines and heterogeneous architectures. High-level synthesis is also a form of this methodology.
Both are viable design approaches, and some design flows are a combination of the two. “It always goes back to fundamentals, the economy of scale,” says Michael Young, director of product marketing at Cadence. “It is based on the function you need to implement, and that generally translates into response time. Certain functions have real-time, mission-critical constraints. The balance between hardware and software is clear in these cases, because you need to make sure that whatever you do, the response time is within a defined limit. Other applications do not have this restriction and can be done when resources are available.”
But there are other pressures at play today as Moore’s Law scaling slows down. “What’s happening is that the software is driving the functionality in the hardware,” says Simon Davidmann, CEO at Imperas Software. “Products need software that is more efficient, and that is driving the hardware architectures.”
Neither approach is better than the other. “We see both hardware-first and software-first design approaches, and neither of the two yields sub-optimal results,” says Tim Kogel, principal applications engineer at Synopsys. “In AI, optimizing the hardware, AI algorithm, and AI compiler is a phase-coupled problem. They need to be designed, analyzed, and optimized together to arrive at an optimized solution. As a simple example, the size of the local memory in an AI accelerator determines the optimal loop tiling in the AI compiler.”
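Kogel's loop-tiling example can be made concrete with a small sketch. The code below is illustrative only and is not taken from any AI compiler: it assumes a square-tiled matrix multiply in which one tile each of A, B, and C must sit in local memory at the same time, and the function names `largest_square_tile` and `tiled_matmul` are hypothetical.

```python
import math


def largest_square_tile(local_mem_bytes: int, bytes_per_elem: int = 1) -> int:
    """Largest T such that three T x T tiles (A, B, C) fit in local memory.

    Assumption: all three tiles are resident at once and nothing else
    competes for the local memory.
    """
    max_elems_per_tile = local_mem_bytes // (3 * bytes_per_elem)
    return math.isqrt(max_elems_per_tile)


def tiled_matmul(a, b, tile):
    """Matrix multiply on lists of lists, blocked into tile x tile chunks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # tile rows of C
        for j0 in range(0, m, tile):          # tile columns of C
            for k0 in range(0, k, tile):      # tiles along the reduction axis
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


# A 64 KB local memory with 8-bit elements vs. 32-bit elements.
tile_int8 = largest_square_tile(64 * 1024, bytes_per_elem=1)
tile_fp32 = largest_square_tile(64 * 1024, bytes_per_elem=4)
```

Under these assumptions, changing either the memory size or the element width changes the tile size the compiler should pick, which is why the hardware and compiler decisions cannot be made independently.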
Costs are a very important part of the equation. “Co-design is a very good approach to realize highly optimized hardware for a given problem,” says Andy Heinig, group leader for advanced system integration and department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “But this high level of optimization is one of the drawbacks of the approach. Optimized designs are very expensive, and as a result such an approach can only work if the number of produced devices is very high. Most applications do not need optimized hardware, instead using more flexible architectures that can be re-used in different applications. Highly optimized but flexible architectures should be the result of the next-generation hardware/software co-design flows.”
High-level synthesis
The automatic generation of hardware from software has been a goal of academia and industry for several decades, and this led to the development of high-level synthesis (HLS). “Software that is developed to run on a CPU is not the most optimal code for high-level synthesis,” says Anoop Saha, senior manager for strategy and business development at Siemens EDA. “The mapping is inherently serial code into parallel blocks, and this is challenging. That is the value of HLS and how you do it. We have seen uses of SystemC, which has native support for multi-threading, but that is hardware-oriented and not software-oriented.”
Challenges remain with this approach. “We have been investing in it continuously, and we have continued to increase the adoption of it,” says Nick Ni, director of marketing, Software and AI Solutions at Xilinx. “Ten years ago, 99% of people only wrote Verilog and VHDL. But more than half of our developers are using HLS today for one piece of IP, so we have made a lot of progress in terms of adoption. The bottom line is that I don’t think anything has really taken off from a hardware/software co-design perspective. There have been a lot of interesting proposals on the language front to make it more parallel, more multi-processor friendly, and these are definitely going in the right direction. For example, OpenCL was really trying to get there, but it has lost steam.”
Platform-based approach
Platform-based design does not attempt to inject as much automation. Instead, it relies on human intervention based on analysis. “Hardware/software co-design has been happening for quite a while,” says Michael Frank, fellow and system architect at Arteris IP. “People have been trying to estimate the behavior of the platform and evaluate its performance using real software for quite a while. The industry has been building better simulators, such as gem5 and QEMU. This has extended into systems where accelerators have been included, where you build models of accelerators and offload your CPUs by running parts of the code on the accelerator. And then you try to balance this, moving more functionality from the software into the hardware.”
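The balancing act Frank describes can be approximated on the back of an envelope before any simulator is involved. The sketch below applies an Amdahl's-law style estimate. It is not how gem5 or QEMU model a system, and the function name `offload_speedup` and its parameters are assumptions made for illustration.

```python
def offload_speedup(offload_fraction: float,
                    accel_speedup: float,
                    transfer_overhead: float = 0.0) -> float:
    """Estimate overall speedup from moving part of a workload to an accelerator.

    offload_fraction:  share of the original CPU runtime that is offloaded (0..1)
    accel_speedup:     how much faster the accelerator runs that share
    transfer_overhead: extra time spent moving data, as a fraction of the
                       original runtime (an assumed, lumped cost)
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if accel_speedup <= 0.0:
        raise ValueError("accel_speedup must be positive")
    new_time = ((1.0 - offload_fraction)
                + offload_fraction / accel_speedup
                + transfer_overhead)
    return 1.0 / new_time


# Offloading more of the code helps, but data movement can erase the gain.
half_offloaded = offload_speedup(0.5, accel_speedup=10.0)
mostly_offloaded = offload_speedup(0.9, accel_speedup=10.0)
with_transfer_cost = offload_speedup(0.5, accel_speedup=10.0, transfer_overhead=0.45)
```

Even this crude model shows why the partitioning has to be iterated: the gain depends on how much code moves, how fast the accelerator is, and what the data movement costs.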
Arm recently announced a new software architecture and reference implementation called Scalable Open Architecture for Embedded Edge (SOAFEE), and two new reference hardware platforms to accelerate the software-defined future of automotive. “To address the software-defined needs of cars, it is imperative to deliver a standardized framework that enhances proven cloud-native technologies that work at scale with the real-time and safety features required in automotive applications,” says Chet Babla, vice president of automotive at Arm’s Automotive and IoT Line of Business. “This same framework also can benefit other real-time and safety-critical use cases, such as robotics and industrial automation.”
This works well for some classes of applications. “We are seeing more hardware/software co-design, not just because the paradigm of processing has changed, but also the paradigm of hardware has changed,” says Siemens’ Saha. “In the past, the hardware was very general-purpose, where you had an ISA layer on top of it. The software sits on top of that. It provides a very clean segmentation of the boundary between software and hardware and how they interact with each other. This reduces time to market. But in order to change that, they have to change the software programming paradigm, and that impacts the ROI.”
A tipping point
It has been suggested that Nvidia created a tipping point with CUDA. While it was not the first time that a new programming model and methodology had been created, it is arguably the first time that it was successful. In fact, it turned what was an esoteric parallel-processing hardware architecture into something that approached a general-purpose compute platform for certain classes of problems. Without that, the GPU would still just be a graphics processor.
“CUDA was far ahead of OpenCL, because it was basically making the description of the parallelism platform agnostic,” says Arteris’ Frank. “But this was not the first. Ptolemy (UC Berkeley) was a way of modeling parallelism and modeling data-driven models. OpenMP, automatic parallelizing compilers — people have been working on this for a long time, and solving it is not trivial. Building the hardware platform to be a good target for the compiler turns out to be the right approach. Nvidia was one of the first ones to get that right.”
Xilinx’s Ni agrees. “It is always easiest if the user can put in explicit parallelism, like CUDA or even OpenCL. That makes it explicit and easier to compile. Making that fully exploit the pipeline, fully exploit the memory, is still a non-trivial problem.”
Impact of AI
The rapid development of AI has flipped the focus from a hardware-first to a software-first flow. “Understanding AI and ML software workloads is the critical first step to beginning to devise a hardware architecture,” says Lee Flanagan, CBO for Esperanto Technologies. “Workloads in AI are abstractly described in models, and there are many different types of models across AI applications. These models are used to drive AI chip architectures. For example, ResNet-50 (Residual Networks) is a convolutional neural network, which drives the needs for dense matrix computations for image classification. Recommendation systems for ML, however, require an architecture that supports sparse matrices across large models in a deep memory system.”
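The contrast Flanagan draws between dense and sparse workloads shows up directly in how the data is stored and traversed. The sketch below uses the common compressed sparse row (CSR) layout as an example. The article does not say which format any particular chip uses, and the helper names `to_csr` and `csr_matvec` are hypothetical.

```python
def to_csr(dense):
    """Convert a dense matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)      # keep only the nonzero entries
                col_idx.append(j)     # remember which column each came from
        row_ptr.append(len(values))   # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]   # indirect, data-dependent read of x
        y.append(acc)
    return y


dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 2],
]
values, col_idx, row_ptr = to_csr(dense)
result = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
```

A dense kernel walks memory in a regular pattern, while the sparse version does fewer multiplies but reads `x` through an index array. That irregular access is one reason the two workloads favor different memory systems.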
Specialized hardware is required to deploy the software when it has to meet latency requirements. “Many AI frameworks were designed to run in the cloud because that was the only way you could get 100 processors or 1,000 processors,” says Imperas’ Davidmann. “What’s happening nowadays is that people want all this data processing in the devices at the endpoint, and near the edge in the IoT. This is software/hardware co-design, where people are building the hardware to enable the software. They do not build a piece of hardware and see what software runs on it, which is what happened 20 years ago. Now they are driven by the needs of the software.”
While AI is the obvious application, the trend is much more general than that. “As stated by Hennessy/Patterson, AI is clearly driving a new golden age of computer architecture,” says Synopsys’ Kogel. “Moore’s Law is running out of steam, and with a projected 1,000X growth of design complexity in the next 10 years, AI is asking for more than Moore can deliver. The only way forward is to innovate the computer architecture by tailoring hardware resources for compute, storage, and communication to the specific needs of the target AI application.”
Economics is still important, and that means that while hardware may be optimized for one task, it often has to remain flexible enough to perform others. “AI devices need to be versatile and morph to do different things,” says Cadence’s Young. “For example, surveillance systems can also monitor traffic. You can count how many cars are lined up behind a red light. But it only needs to recognize a cube, and the cube behind that, and aggregate that information. It does not need the resolution of a facial recognition. You can train different parts of the design to run at different resolution or different sizes. When you write a program for a 32-bit CPU, that’s it. Even if I was only using 8-bit data, it still occupies the entire 32-bit pathway. You’re wasting the other bits. AI is influencing how the designs are being done.”
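Young's point about 8-bit data on a 32-bit pathway can be illustrated with a few lines of bit manipulation. This is a sketch of the general idea of packing narrow values into a wider word, not a description of any specific processor. The names `pack4_u8`, `unpack4_u8`, and `datapath_utilization` are made up for this example.

```python
def pack4_u8(values):
    """Pack four unsigned 8-bit values into one 32-bit word, lane 0 in the low byte."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for lane, v in enumerate(values):
        if not 0 <= v <= 0xFF:
            raise ValueError("each value must fit in 8 bits")
        word |= v << (8 * lane)
    return word


def unpack4_u8(word):
    """Split a 32-bit word back into four unsigned 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def datapath_utilization(data_bits, path_bits=32):
    """Fraction of the datapath carrying useful bits when one value uses the whole path."""
    return data_bits / path_bits


packed = pack4_u8([1, 2, 3, 4])
lanes = unpack4_u8(packed)
wasted_fraction = 1.0 - datapath_utilization(8)
```

One 8-bit value per 32-bit word leaves three quarters of the path idle. Packing four values per word, or building hardware with narrower lanes, recovers that capacity.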
Outside of AI, the same trend is happening in other domains, where the processing and communication requirements outpace the evolution of general-purpose compute. “In datacenters, a new class of processing units for infrastructure and data-processing tasks (IPU, DPU) have emerged,” adds Kogel. “These are optimized for housekeeping and communication tasks, which otherwise consume a significant portion of the CPU cycles. Also, the hardware of extreme low-power IoT devices is tailored for the software to reduce overhead power and maximize computational efficiency.”
Software/hardware reality
To make a new paradigm successful takes a lot of technology that insulates the programmer from the complexities of the hardware. “Specification and optimization of the macro architecture requires an abstract model of both the application workload and the hardware resources to explore coarse-grain partitioning tradeoffs,” explains Kogel. “The idea of the Y-chart approach (see figure 1) is to mate the application workload with the hardware resource model to build a virtual prototype that allows quantitative analysis of KPIs like performance, power, utilization, efficiency, etc.”
Fig. 1: Y-chart approach, mapping application workload on HW platform to build virtual prototype for macro-architecture analysis. Source: Synopsys
“The workload model captures task-level parallelism and dependencies, as well as processing and communication requirements per task in an architecture-independent way,” Kogel explains. “The hardware platform models the available processing, interconnect, and memory resources of the envisioned SoC. The practical applicability requires a virtual prototyping environment that provides the necessary tooling and model libraries to build these models in a productive way.”
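A toy version of the Y-chart idea fits in a few dozen lines. The sketch below is a deliberately simplified assumption, not the Synopsys tooling: tasks carry an architecture-independent operation count and dependencies, resources have a throughput, and a mapping ties the two together. Communication cost is ignored, and names such as `Task` and `evaluate_mapping` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    ops: float                                  # architecture-independent work
    deps: list = field(default_factory=list)    # names of tasks that must finish first


def evaluate_mapping(tasks, throughput, mapping):
    """Return (makespan, utilization per resource) for one workload-to-platform mapping.

    tasks:      list of Task in dependency (topological) order
    throughput: resource name -> operations per time unit
    mapping:    task name -> resource name
    """
    finish = {}
    free_at = {r: 0.0 for r in throughput}      # when each resource is next idle
    busy = {r: 0.0 for r in throughput}         # accumulated busy time
    for t in tasks:
        r = mapping[t.name]
        ready = max((finish[d] for d in t.deps), default=0.0)
        start = max(ready, free_at[r])
        duration = t.ops / throughput[r]
        finish[t.name] = start + duration
        free_at[r] = finish[t.name]
        busy[r] += duration
    makespan = max(finish.values())
    utilization = {r: busy[r] / makespan for r in throughput}
    return makespan, utilization


workload = [
    Task("load", ops=10),
    Task("conv", ops=100, deps=["load"]),
    Task("post", ops=20, deps=["conv"]),
]
platform = {"cpu": 10.0, "npu": 100.0}

cpu_only = evaluate_mapping(workload, platform,
                            {"load": "cpu", "conv": "cpu", "post": "cpu"})
with_npu = evaluate_mapping(workload, platform,
                            {"load": "cpu", "conv": "npu", "post": "cpu"})
```

Comparing `cpu_only` with `with_npu` is a miniature version of the quantitative tradeoff analysis a virtual prototype supports, with performance and utilization as the KPIs.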
Much of this remains a directed, manual approach. “There’s a gap in this area,” says Young. “Every major company is doing their own. What is needed is a sophisticated, or smart compiler, that can take the different applications, based on real-time constraints, and understanding the economics. If I have various processing resources, how do I divide that workload so that I get the proper response times?”
As processing platforms become more heterogeneous, the problem becomes a lot more difficult. “You no longer have a simple ISA layer on which the software sits,” says Saha. “The boundaries have changed. Software algorithms should be easily directed toward a hardware endpoint. Algorithm guys should be able to write accelerator models. For example, they can use hardware datatypes to quantize their algorithms, and they should do this before they finalize their algorithms. They should be able to see if something is synthesizable or not. The implementability of an algorithm should inherently be a native concept to the software developer. We have seen some change in this area. Our algorithmic datatypes are open source, and we have seen around two orders of magnitude more downloads of that than the number of customers.”
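What quantizing an algorithm with a hardware datatype means in practice can be sketched in software. The example below models a generic signed fixed-point type with saturation and round-half-up. It does not use or reproduce the open-source algorithmic datatypes Saha mentions, and `quantize_fixed` and its parameters are assumptions for illustration.

```python
import math


def quantize_fixed(x: float, total_bits: int = 8, frac_bits: int = 4) -> float:
    """Quantize x to signed fixed point with `total_bits` bits, `frac_bits` fractional.

    Values are rounded to the nearest step (ties round up) and saturate
    at the ends of the representable range instead of wrapping.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = math.floor(x * scale + 0.5)          # integer code the hardware would store
    code = max(lowest, min(highest, code))      # saturate on overflow
    return code / scale


# Apply the same datatype to a few coefficients and inspect the error.
coefficients = [1.30, 0.5, 100.0, -100.0]
quantized = [quantize_fixed(c) for c in coefficients]
errors = [abs(c - q) for c, q in zip(coefficients, quantized)]
```

Running an algorithm through a model like this before it is finalized shows early whether the chosen bit widths preserve enough accuracy, which is the kind of implementability check Saha argues should be native to the software developer.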
Ironically, automation is easier for AI than many other tasks. “We have compilers that not only compile the software for these AI models onto instructions that run on the processors within the chips, but we can recompile to a domain-specific architecture. The whole hardware design is based on the actual model,” says Xilinx’s Ni. “That is real software/hardware co-design. It is only possible because, while we have such a challenging problem like AI, it is also a well-defined problem. People have already invented AI frameworks and all of the APIs, all the plug-ins. TensorFlow or PyTorch have defined how you write your layers and things like that. The compiler has less things to take care of, and within those boundaries, we can do a lot of optimizations and adjust the hardware creation.”
Coming together
It is unlikely that a pure hardware-first or software-first approach will be successful long-term. It takes collaboration. “AI applications demand a holistic approach,” says Esperanto’s Flanagan. “This spans everyone from low-power circuit designers to hardware designers, to architects, to software developers, to data scientists, and extending to customers, who best understand their important applications.”
And automation is not yet as capable as humans. “AI-based methods will assist specialists to optimize algorithms, compilers, and hardware architectures, but for the foreseeable future human experts in each domain will be required ‘in the loop’,” says Kogel. “The most competitive products will be developed by teams, where different disciplines collaborate in an open and productive environment.”
Full automation may take a long time. “The human engineering aspect of this will always be involved, because it’s a very difficult decision to make,” says Young. “Where do you define that line? If you make a mistake, it can be very costly. This is why simulation, emulation, and prototyping are very important, because you can run ‘what if’ scenarios and perform architectural tradeoff analysis.”
Sometimes, it is not technology that gets in the way. “It requires an organizational change,” says Saha. “You can’t have separate software and hardware teams that never talk to each other. That boundary must be removed. What we are seeing is that while many are still different teams, they report through the same hierarchy or they have much closer collaboration. I have seen cases where the hardware group has an algorithm person reporting to the same manager. This helps in identifying the implementability of the algorithm and allows them to make rapid iterations of the software.”
Conclusion
New applications are forcing changes to the ways in which software is written, hardware is defined, and the two are mapped to each other. The log-jam of defining new software paradigms has been broken, and we can expect the rate of innovation in a combined hardware-software flow to accelerate. Will it extend back into the sequential C space running on single CPUs? Possibly not for a while. But ultimately, that may be a very small and insignificant part of the problem.
“They do not build a piece of hardware and see what software runs on it, which is what happened 20 years ago. Now they are driven by the needs of the software.”
Which to me means that traditional computers won’t cut it.
So how to find out what the software needs is step one.
Next is the fun part — design hardware that does what the software needs.
Basically it needs data and a means to process that data. The software is itself the algorithm to process the data. (Wow, this is so profound!) Well, sort of, because software and hardware do not process data in the same way. For example, hardware uses whatever opcode and data are on the inputs, so the software algorithm must apply the appropriate opcode and data. Well, duh. How do you do it with an HDL? Oh, you gave me a description of the hardware instead of the data...
Does this nonsense help to explain what poor tools are available?