Software-Defined Hardware Gains Ground — Again

AI applications are prompting chipmakers to take another look at different options for reconfigurable hardware.


The traditional approach of running generic software on x86-based CPUs is running out of steam for many applications due to the slowdown of Moore’s Law and the concurrent exponential growth in software application complexity and scale.

In this environment, software and hardware have been largely decoupled, thanks to the dominance of the x86 architecture. “The need for and advent of the hardware accelerator changed this relationship such that now the software programmer needs to be aware of the underlying hardware architecture in order to write efficient software,” said Russell James, vice president of AI & Compute Strategy at Imagination Technologies.

If this sounds familiar, it should. In the early 2000s there was an upswell of interest in software-defined hardware, with companies such as Chameleon, Trimedia, and a number of others attempting to spur adoption of reconfigurable computing approaches that could change the underlying compute quickly enough to adapt to the application.

“For AI, you essentially want the computational network side to be updated appropriately and fast enough,” said Frank Schirrmeister, senior group director, solutions marketing at Cadence. “In the early 2000s, when software-defined hardware was being discussed in earnest, people didn’t yet fully foresee the end of Dennard scaling, Amdahl’s Law, and all of the things which are now understood. Now, we are actually ready — it only took us 15 years — because we really need the domain specificity of the architecture.”


Fig. 1: Microprocessor data trends. Source: Cadence

But domain-specific architectures have a downside, too.

“The architecture may be fixed, and programmability is really hard, so you need to go back to this notion of the programming model with which you actually program this portion,” Schirrmeister said. “Having said that, it’s really tough to hone in on one specific architecture without giving it the freedom to reconfigure itself at least a little bit. Otherwise, you’re basically back to the ASIC/FPGA question again. The reconfigurable part gives you a target domain architecture, and then within that it gives you, because of reconfigurability, wiggle room to make updates. And that’s a very clear and interesting advantage. You want to be really flexible enough to be reconfigurable, which is why software-defined hardware/reconfigurable hardware is making this comeback. AI will be a big starting point to drive toward reconfigurability.”

Many others agree. Sergio Marchese, technical marketing manager at OneSpin Solutions, said that is the point of complex FPGAs such as Xilinx Versal, a device with programmable logic and lots of resources (AI engines, CPUs, DSPs, etc.). “Engineers write AI software (framework level), and there is a tool chain to automatically configure the hardware, optimizing it for that specific software,” he said. “This is great in principle, particularly when algorithms can change quickly, as is the case for AI and other cutting-edge applications. However, squeezing the last cycle out of the hardware to gain performance requires changes at the RTL level and engineers with that expertise. Further, any changes to the RTL require thorough re-verification, using the certainty of formal methods whenever possible. This includes formal equivalence checking to ensure that the FPGA implements the intended functionality.”

If reconfiguration is fast enough, in theory it could even be done at runtime in the field, but this flow is not yet a reality, Marchese said.

There are more ways than one to achieve this kind of configurability, though. General-purpose GPUs running NVIDIA’s CUDA, a proprietary parallel software platform, are a good example. “Today, we have many different types of hardware accelerator, from FPGAs to full custom ASICs such as the Google TPU,” James said. “The need for multiple compute types also has extended to the embedded SoC domain, where CPUs, GPUs, DSPs, NNPUs, etc., can all be integrated into a single chip. This leads to the desired paradigm of software-defined hardware, where the software defines the hardware to run the program. From cloud data center servers to mobile phones and other embedded devices, the desire is the same.”

Specifically, a software-defined hardware infrastructure (SDHI) extends this further, with the infrastructure choosing, from a virtually integrated set of compute processors, the right elements to run the software efficiently. “To operate efficiently, these hardware accelerators need a software compute framework to enable the underlying software language to effectively and efficiently utilize all available hardware,” said Imagination’s James. “An example is the popular software compute framework, OpenCL. This and other such frameworks work on two main types of hardware accelerator: fixed underlying hardware architectures, like GPGPUs or full custom ASICs, and FPGAs, which are hardware-programmable ICs that can achieve a percentage of the performance of the same design in a full custom ASIC.”

The OpenCL framework provides the necessary structures, APIs and resources to execute the underlying C-like kernel algorithms on the available hardware. This is important because much greater utilization and efficiency can be achieved by enabling parallel execution across both homogeneous and heterogeneous compute cores. The difference between GPGPUs and FPGAs is that in the FPGA case, the OpenCL kernels are mapped by HLS tools to hardware representations of the kernel, rather than executed on the underlying fixed-architecture hardware accelerator.
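To make that flow concrete, here is a minimal sketch using the Khronos OpenCL C++ bindings. The vadd kernel, buffer sizes, and default device selection are illustrative only; on a GPGPU the kernel string is compiled at runtime for the device’s fixed ISA, while an FPGA vendor flow would compile essentially the same kernel offline into a hardware implementation.

```cpp
// Minimal OpenCL host + kernel sketch (illustrative; not from any vendor kit).
// Build against the Khronos OpenCL C++ bindings and an installed OpenCL runtime.
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>
#include <iostream>
#include <vector>

// OpenCL C kernel: one work-item per vector element.
static const char* kSource = R"CLC(
__kernel void vadd(__global const float* a,
                   __global const float* b,
                   __global float* c) {
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
)CLC";

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // Default context/queue: whatever device the platform exposes
    // (GPU, CPU, or an FPGA accelerator card with a vendor runtime).
    cl::Context ctx = cl::Context::getDefault();
    cl::CommandQueue queue = cl::CommandQueue::getDefault();

    // On a GPGPU this compiles the kernel for the fixed ISA at runtime;
    // FPGA flows would instead load a pre-built hardware kernel.
    cl::Program program(ctx, kSource, /*build=*/true);
    cl::Kernel vadd(program, "vadd");

    cl::Buffer bufA(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * n, a.data());
    cl::Buffer bufB(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * n, b.data());
    cl::Buffer bufC(ctx, CL_MEM_WRITE_ONLY, sizeof(float) * n);

    vadd.setArg(0, bufA);
    vadd.setArg(1, bufB);
    vadd.setArg(2, bufC);

    // Launch n work-items; the runtime maps them onto the parallel hardware.
    queue.enqueueNDRangeKernel(vadd, cl::NullRange, cl::NDRange(n));
    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, sizeof(float) * n, c.data());

    std::cout << "c[0] = " << c[0] << std::endl;  // expect 3.0
    return 0;
}
```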

There have been several efforts to bring software closer to the hardware over the years, including:

  • OpenCL, an open standard whose kernels can be compiled and synthesized into actual hardware designs, or mapped to existing compute architectures (like GPUs).
  • CUDA and TensorRT, which bring software programming much closer to the hardware (although in this case that hardware architecture is GPU-based and fixed).
  • SYCL, an open rival to CUDA.
  • SystemC, a timing-annotated, C++-based modeling language.
  • MATLAB with HDL Coder, which generates HDL from high-level models.

With high-level synthesis (HLS) tools, all hardware ultimately is defined as software. The level of abstraction is what differentiates one approach from another. RTL (VHDL or Verilog) is a hardware description language used extensively in the semiconductor industry. RTL synthesis, place-and-route, and back-end layout tools map the RTL code into an IC physical layout file (GDSII or OASIS).

As a language, RTL operates very differently from a “software” language like C, the main difference being the order of execution.

“C, like many other software languages, executes sequentially, which is the way a CPU operates,” James said. “RTL executes all lines of code concurrently, and the designer must explicitly code for other execution order rules. This difference means that it generally takes someone with a hardware design background to properly understand and write RTL code. To write RTL code is a long and laborious process, and if behavioral or algorithmic models could be synthesized to RTL code, then this could enable a faster design process. HLS tools to do just this have been around for the past 30 years or so, but the sticking point has always been that more abstraction often leads to a less optimal hardware design (the abstraction penalty). FPGA vendors provide in-house OpenCL to FPGA implementation tools to enable a wider set of developers to use the FPGAs (without in-depth RTL design and coding experience).”
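As an illustration of that tradeoff, consider the minimal HLS-style sketch below. The algorithm reads as ordinary sequential C++, while the directives (written here in the style of AMD/Xilinx Vitis HLS pragmas, which other tools express differently, and which a standard compiler simply ignores) hint to the tool how to turn the loops into pipelined, concurrent hardware. The filter, constants, and pragma choices are assumptions for illustration, not a reference design.

```cpp
// Illustrative HLS-style C++ (assumption: Vitis-HLS-style pragmas; a plain
// C++ compiler ignores the unknown pragmas and simply runs it sequentially).
#include <cstdio>

constexpr int TAPS = 8;
constexpr int N = 16;

// 8-tap FIR filter. As software, one multiply-accumulate executes at a time.
// An HLS tool can unroll the inner loop into TAPS parallel multipliers and
// pipeline the outer loop so a new sample is accepted every clock cycle.
void fir(const int coeff[TAPS], const int x[N], int y[N]) {
    int shift_reg[TAPS] = {0};
#pragma HLS ARRAY_PARTITION variable=shift_reg complete   // registers, not RAM

    for (int n = 0; n < N; n++) {
#pragma HLS PIPELINE II=1       // target: accept one new input per cycle
        int acc = 0;
        for (int t = TAPS - 1; t > 0; t--) {
#pragma HLS UNROLL              // all taps evaluated concurrently in hardware
            shift_reg[t] = shift_reg[t - 1];
            acc += shift_reg[t] * coeff[t];
        }
        shift_reg[0] = x[n];
        acc += shift_reg[0] * coeff[0];
        y[n] = acc;
    }
}

int main() {
    int coeff[TAPS] = {1, 1, 1, 1, 1, 1, 1, 1};   // moving-sum filter
    int x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = i;
    fir(coeff, x, y);
    std::printf("y[8] = %d\n", y[8]);             // sum of x[1..8] = 36
    return 0;
}
```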

Why SDH is necessary
The term software-defined hardware specifically refers to the mapping of applications to FPGAs, as an alternative to the expensive development of application-specific SoCs. But the term also can be used more broadly for any domain-specific programmable and configurable SoC that is optimized for a selected set of applications. Tim Kogel, principal applications engineer at Synopsys, said the term applies to the general principle of form follows function, which in chip design means the target functions determine the compute architecture.

Software-defined hardware is needed whenever general-purpose CPUs, GPUs, or DSPs do not provide the necessary performance and/or computational efficiency, and when dedicated hardware does not provide the necessary flexibility. “Popular examples are IPs and SoCs for the acceleration of artificial neural networks, which require high flexibility to adjust to rapidly evolving neural network graphs, but also customized architectures to achieve the necessary performance and power efficiency,” Kogel said. “Depending on the target market requirements, this results in a variety of programmable and configurable computer architectures, ranging from general-purpose CPUs with vector extensions, optimized GPUs, vector DSPs, FPGAs, application-specific instruction-set processors, to register-programmable data paths.”

James considers SDH essential. “Compute frameworks abstract the developer, to a degree, from the underlying hardware architecture, and open compute platforms such as SYCL take this a step further and allow for greater abstraction from the underlying hardware (at the developer level),” he said. “Write OpenCL for GPGPU and this defines how the hardware executes the defined kernel. Write OpenCL for FPGA and this creates a fully custom accelerated kernel on FPGA. Write OpenCL for an SoC containing a CPU, GPGPU and a dedicated neural network accelerator and this can enable heterogeneous compute for the maximum efficiency of execution.”

But this approach is not without challenges, particularly when AI is involved.

“AI is a fast-paced, rapidly evolving technology, and so any products in this area need to be developed, verified, validated and deployed very quickly or they risk missing the boat,” he said. “Applying an SDH approach to ASIC design leads to a hardware architecture that has multiple types and numbers of computing elements. Some of these computing elements will be more general-purpose, and others more fixed in function, but the combination will give the best flexibility when it comes to developing software algorithms and applications both now and in the future, giving a degree of future-proofing to ride out the AI evolution storm for long enough until the next ASIC comes along.”

This may look like the complete democratization of compute hardware, but not all compute cores are created equal, and they don’t all perform equally. This is where hardware vendors can create differentiation by adding custom vertical optimizations into the open compute frameworks or, alternatively, completely custom optimized closed compute solutions.

Also, the uncontested advantage of software-defined hardware is its ability to deliver significant improvements in performance and computational efficiency. Kogel pointed to Google’s 2015 TPU v1, which ran deep neural networks 15 to 30 times faster, and with 30 to 80 times better energy efficiency, than contemporary CPUs and GPUs in comparable technologies. “AI is indeed the killer application for domain-specific architectures in that it provides an embarrassing amount of parallelism to take full advantage of tailored processing resources. As a result, we are seeing a new golden age of computer architecture, which has spurred hundreds of chip design projects, where before the number of design starts was constantly declining.”

But the big challenge coming with software-defined hardware is the complexity of the necessary software programming flows. “Sophisticated compilers and runtime environments are needed to map applications to the customized hardware and take full advantage of the available resources,” he said. “Developing a competitive software flow requires significant investment and close collaboration between hardware and software teams.”

Others identify similar issues. “If you look at what people are doing in neural networks, a neural network fits much better with an extended C++ description than most other things, because it’s an asynchronous network of small processes that you’re describing, and it fits with the computing paradigm called CSP,” said Kevin Cameron, consultant at Cameron EDA. “CSP is communicating sequential processes, which have been around since the ’70s but never were implemented well. One of the problems with a lot of this stuff in hardware/software tradeoffs is that getting a language that people like is difficult, and software engineers aren’t going to do Verilog or SystemVerilog. It’s expensive and it’s hard.”

Still, Cameron believes that if some of the concepts in C++ were extended to support what hardware description languages do, along with event-driven instructions and data channels, the engineering community might use them.
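To make the idea tangible, here is a toy sketch, not Cameron’s actual proposal, of CSP-style data channels expressed in standard C++: independent “processes” run as threads and interact only by passing values over a blocking channel, much the way hardware blocks exchange data over handshaked interfaces. The Channel class and all names are invented for illustration.

```cpp
// Toy CSP-style channel in standard C++ (illustrative sketch only).
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Channel {                      // unbounded channel with blocking reads
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> recv() {        // empty optional means the channel is closed
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    Channel<int> ch;

    // "Producer" process: analogous to a hardware block streaming samples.
    std::thread producer([&] {
        for (int i = 0; i < 4; i++) ch.send(i * i);
        ch.close();
    });

    // "Consumer" process: runs concurrently, synchronized only by the channel.
    std::thread consumer([&] {
        while (auto v = ch.recv()) std::cout << "got " << *v << "\n";
    });

    producer.join();
    consumer.join();
    return 0;
}
```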

“At one point people thought OpenCL was great, but it doesn’t really feel like it has made it,” said Schirrmeister. “There is still a large variety of programming models. There are very cool, interesting architectures with reconfigurability today, but how do we program them? What’s the programming model for it? Those are interesting questions to answer going forward.”

Other tools and approaches
Virtual prototyping adds yet another configurable knob to turn. During the specification of the computer architecture and the software development flow, virtual prototyping enables the joint optimization of algorithm, compiler, and architecture.

“The idea is to create a high-level simulation model that enables power/performance tradeoff analysis of application workloads, compiler transformations, and hardware resources,” Kogel said. “The goal is to come up with an optimal specification for the hardware and software implementation teams. To accelerate time to market in highly competitive areas like AI SoC design, virtual prototypes enable a shift-left of the development of the embedded software and compiler. Here, a simulation model serves as a virtual target for pre-silicon validation and optimization of the compiler, early development of firmware and drivers, as well as early integration of AI accelerators into the software stacks running on the host CPU.”
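As a flavor of the what-if analysis such a model supports, here is a deliberately tiny, hypothetical sketch, far simpler than a real virtual prototype, that estimates whether a candidate accelerator configuration is compute-bound or memory-bound for a given workload. The structures, parameter values, and the simple bound-based model are all invented for illustration and are not from any Synopsys tool.

```cpp
// Toy bound-based performance estimate (illustrative; all numbers hypothetical).
#include <algorithm>
#include <cstdio>

struct Accelerator {
    double macs_per_cycle;     // parallel multiply-accumulate units
    double bytes_per_cycle;    // external memory bandwidth per clock
    double clock_ghz;
};

struct Workload {
    double total_macs;         // e.g., one inference pass of a neural network
    double total_bytes;        // weights and activations moved off-chip
};

// Runtime is limited by the slower of the compute pipeline and the memory system.
double estimate_ms(const Accelerator& a, const Workload& w) {
    double compute_cycles = w.total_macs / a.macs_per_cycle;
    double memory_cycles  = w.total_bytes / a.bytes_per_cycle;
    double cycles = std::max(compute_cycles, memory_cycles);
    return cycles / (a.clock_ghz * 1e6);   // cycles -> milliseconds
}

int main() {
    Workload cnn_like{/*total_macs=*/4.0e9, /*total_bytes=*/5.0e7};
    Accelerator small{256, 16, 1.0}, large{2048, 16, 1.0};

    std::printf("small MAC array: %.2f ms\n", estimate_ms(small, cnn_like));
    std::printf("large MAC array: %.2f ms\n", estimate_ms(large, cnn_like));
    // If both configurations report similar times, the design is memory-bound
    // and adding more MAC units alone will not help.
    return 0;
}
```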

Conclusion
Which of these approaches ultimately succeeds isn’t clear yet. What is becoming obvious is the need for flexibility in a design to cope with continued changes in software and market needs, coupled with higher performance that hardware alone cannot provide. So rather than just building faster hardware, the focus is on tailoring the hardware to provide sufficient performance while still leaving enough flexibility to adapt to changes.

This is a tough balancing act, and it requires knowledge of some complex programming of both hardware and software. But done right, the results could be significant improvements in both speed and power consumption, with enough flexibility left over to add some future-proofing into designs.


