Synthesizing Hardware From Software

Can a software engineer create hardware? It may be possible, but not in the way that existing high-level synthesis tools do it.


The ability to automatically generate optimized hardware from software was one of the primary tenets of system-level design automation that was never fully achieved. The question now is whether that will ever happen, and whether it is just a matter of having the right technology or motivation to make it possible.

High-level synthesis (HLS) did come out of this work and has proven to be highly valuable, but its input is not software. Rather, it is a more abstract description of hardware. Most HLS tools can take C, SystemC or C++ as input.

“The common complaint about C synthesis has been and remains that it is not ‘really’ C or C++,” says Chris Jones, vice president of marketing for Codasip. “That the language is so instrumented with custom syntax that by the end of it, the user may as well be writing in Verilog.”

When talking about synthesis from C/C++, the term has two meanings. “Technically C++ is a ‘language,’ but in the context of ‘C/C++ to hardware’, it also means the algorithmic abstraction level,” says Dave Pursley, product management director at Cadence. “While HLS will synthesize the algorithms you throw at it, in order to get good hardware you need to write the algorithm from the hardware’s point of view.”

That means that the original C code often has to be modified. “Algorithms are usually written in C/C++ but are lacking the SystemC details necessary for hardware implementation,” says Rob van Blommestein, head of marketing at OneSpin Solutions. “Even the SystemC models are often too abstract for high-level synthesis and must be refined before they can be transformed into hardware. Such models, especially when written by software teams, rarely reflect important aspects of the hardware, such as pipelines, limited internal storage and bounded memory bandwidth.”

Others report similar hurdles. “There are several fundamental challenges to converting software into hardware using HLS. First, the code must be synthesizable by the HLS engine,” said Jordon Inkeles, vice president of product at Silexica. “This means that the code must be written or refactored into a hardware synthesizable format, which is not trivial for the software engineer that is accustomed to writing in standard C/C++. The coding synthesizability guidelines for HLS are substantial and an engineer must become familiar with these guidelines, covering hundreds of pages of documentation.”

“Once the code is synthesizable it also requires a degree of awareness of the underlying hardware,” Inkeles said, adding that “when you put something into hardware it is important to consider resource usage. For example, when writing software to run on a processor you may just use floating point for all datatypes and not incur a significant penalty, but when implemented in hardware it leads to extra consumption of valuable hardware resources, which can be costly and also decrease the performance of the algorithm.”
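
To make the datatype point concrete, here is a minimal sketch of replacing floating point with 16-bit fixed point, which in hardware typically maps to a plain integer multiplier rather than a floating-point unit. The Q8.8 format and the names kFracBits, to_fixed and fixed_mul are assumptions chosen for illustration. They are not taken from any HLS tool, and real tools usually supply their own fixed-point types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical Q8.8 format: 8 integer bits, 8 fractional bits in 16 bits.
constexpr int kFracBits = 8;

// Convert a real value to Q8.8, rounding to the nearest representable step.
// The caller is assumed to pass values that fit in the 16-bit range.
inline std::int16_t to_fixed(double x) {
    const double scaled = x * (1 << kFracBits);
    return static_cast<std::int16_t>(scaled + (scaled >= 0 ? 0.5 : -0.5));
}

// Convert a Q8.8 value back to a real value (for checking results only).
inline double to_double(std::int16_t q) {
    return static_cast<double>(q) / (1 << kFracBits);
}

// Multiply two Q8.8 values. The product is formed in 32 bits and then
// shifted back down by the fractional width, discarding the low bits.
// Right-shifting a negative value is arithmetic on mainstream compilers
// and guaranteed to be so from C++20 onward.
inline std::int16_t fixed_mul(std::int16_t a, std::int16_t b) {
    const std::int32_t wide = static_cast<std::int32_t>(a) * b;
    return static_cast<std::int16_t>(wide >> kFracBits);
}
```

The trade-off is range and precision: Q8.8 resolves steps of 1/256, so whether it is acceptable depends on the algorithm.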

So, is synthesis from software a fantasy? “High-level synthesis, be it from C++, SystemC, Matlab or other behavioral definition, has come a long way in the past decade,” says Stuart Biles, fellow and director of research architecture at Arm. “It’s a good candidate for a number of applications, particularly data processing pipelines that you might find in radio, audio or image processing domains. When coupled with an FPGA, it facilitates a rapid prototyping, verification and optimization design cycle with feedback from observed performance in development systems prior to deployment.”

It also has seen acceptance in other application spaces. “Other high-level description languages are being used, and gaining market acceptance, from established languages like NML from Synopsys or Cadence Tensilica’s TIE, to new languages such as Chisel and Codasip’s own CodAL,” says Jones. “These are best defined as ‘processor description languages’ and are used in a variety of ways, but mainly to customize instruction sets of established architectures or to design processors from the ground up.”

Market pressure is increasing. “The slowdown of Moore’s Law means that processors in general can no longer continue to drive performance like they have over the past 30 to 40 years,” says Clay Johnson, chief executive officer for CacheQ. “Other technologies are required. Those other technologies are GPUs, custom devices in areas such as machine learning and FPGAs. It is known that FPGAs can certainly accelerate things, but the development environment has traditionally, and is today, focused on hardware developers rather than software developers.”

The makeup of many development teams is changing. “One customer told me how many people they had on staff who can write RTL—basically zero, and how many can write C/C++—around 30,000,” says Raymond Nijssen, vice president and chief technologist at Achronix. “There are dinosaurs among us who say they can do better than a tool, but they are a dying breed. They are being replaced by people who want to push the button and don’t care if the solution is perfect. What is important to them is that they get a result fairly quickly. The turnaround times are beginning to drive this more and more. So, you pay a price in terms of efficiency and let the tools give you a result much faster.”

CacheQ’s Johnson agrees. “If I am a hardcore FPGA person, will I be able to take a function and design it and get a smaller area and higher performance—probably yes. If you look at large FPGAs, I would stipulate that the hardware person is not going to be able to sufficiently understand all of the algorithmic details enough to implement that at the detailed level.”

In the past, many co-processor solutions have been dogged by communications costs. “There was no good communication protocol between the accelerators that had low enough latency to justify moving the workload from one processing element to another,” points out Nijssen. “If it takes a millisecond to move some matrix from one place to another, then the CPU would have been able to complete the task in the same time with less energy due to the time and cost of moving things around.”

“Embedded FPGA is going to be interesting in this space,” says Russell Klein, HLS platform program director at Mentor, a Siemens Business. “If the FPGA is a separate device, going off-chip to the FPGA might nullify much of the benefit of moving from software to a hardware implementation. Keeping the FPGA on the SoC will mitigate those problems.”

Finding the right candidate
Not all software is suitable for mapping into hardware. “It’s possible to map something as complex as a processor from a high-level specification, though currently the results will not be optimal,” says Arm’s Biles. “Concepts like forwarding and control flow aren’t mapped quite as well as they could be. Data processing is one area where there are fewer control decisions, and this makes it easier to map effectively. Applications without huge numbers of configuration parameters or multiple modes of operation are easier for the designer to be satisfied that an efficient implementation has been synthesized and common functionality not duplicated.”

Performance comes from taking what is a serial process and exploiting concurrency. “The way you get performance is accelerating loops,” says Johnson. “If you look at C code, the vast majority of the time for complex algorithms is spent in loops. The way to accelerate a loop is to make the loop fully pipelined. In a processor, if I have a loop and that loop is n iterations and it takes C cycles to complete, then roughly the amount of time that it takes in a processor is (n x C)/clock rate. If you fully pipeline a loop, then the amount of time is (n + C)/clock rate.”
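
The two estimates in that quote can be written down directly. The sketch below is only a restatement of the quoted formulas, with hypothetical function names. It treats C as the cycles per iteration, which also serves as the pipeline depth, and it assumes a new iteration can start every cycle.

```cpp
#include <cassert>
#include <cstdint>

// Sequential estimate: n iterations, each taking c cycles.
constexpr std::uint64_t sequential_cycles(std::uint64_t n, std::uint64_t c) {
    return n * c;
}

// Fully pipelined estimate from the quote: roughly c cycles to fill the
// pipeline, then one result per cycle for the n iterations.
constexpr std::uint64_t pipelined_cycles(std::uint64_t n, std::uint64_t c) {
    return n + c;
}

// Wall-clock time for a cycle count at a given clock rate in hertz.
// The clock rate is a parameter because the processor and the
// accelerator need not run at the same frequency.
constexpr double seconds(std::uint64_t cycles, double clock_hz) {
    return static_cast<double>(cycles) / clock_hz;
}

// Ratio of the two cycle counts, i.e. the speedup at equal clock rates.
constexpr double speedup(std::uint64_t n, std::uint64_t c) {
    return static_cast<double>(sequential_cycles(n, c)) /
           static_cast<double>(pipelined_cycles(n, c));
}
```

With n = 1000 and C = 10, the sequential estimate is 10,000 cycles and the pipelined estimate is 1,010 cycles, a speedup of roughly 9.9x at equal clock rates. For large n the speedup approaches C.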

This is one of the places where HLS adds value. “High-level synthesis tools do a pretty good job of finding and introducing concurrency,” says Mentor’s Klein. “In cases where it cannot extract concurrency, due to the nature of the algorithm description, it will identify the dependencies for the developer, enabling them to manually modify the code to increase concurrency.”

And this is where the difficulties start, because hardware and software engineers have different outlooks and levels of experience. “Consider a complex multi-stage image filter,” says Cadence’s Pursley. “In pure software, the C++ will likely be several functions, one per filter stage, communicating by copying frames or passing pointers to frames. However, that isn’t the algorithm as it will be performed by optimized hardware. Usually, hardware will attempt to minimize storage, avoiding costly frame buffers as much as possible. Instead, to get good QoR, the hardware filters will pass around pixels perhaps via line buffers if it’s a 2D filter.”
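
As a rough illustration of the line-buffer style Pursley describes, the sketch below computes a 3-tap vertical sum over a pixel stream while storing only the two previous rows instead of a whole frame. The class name, the filter itself and the zero padding at the top of the image are assumptions made for the example. It is not code from any HLS vendor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Streaming 3-tap vertical sum for an image that is Width pixels wide.
// Pixels arrive one at a time in raster order. Storage is two rows,
// independent of the image height.
template <std::size_t Width>
class VerticalSum3 {
public:
    // Returns the sum of this pixel and the two pixels directly above it.
    // Rows that have not been seen yet contribute zero.
    std::uint16_t push(std::uint8_t pixel) {
        const std::uint16_t sum =
            static_cast<std::uint16_t>(pixel + line1_[col_] + line2_[col_]);
        line2_[col_] = line1_[col_];  // previous row ages by one
        line1_[col_] = pixel;         // current pixel becomes the previous row
        col_ = (col_ + 1) % Width;    // wrap to the next row at the line end
        return sum;
    }

private:
    std::array<std::uint8_t, Width> line1_{};  // row directly above
    std::array<std::uint8_t, Width> line2_{};  // row two above
    std::size_t col_ = 0;                      // current column
};
```

A frame-based software version would hold Width x Height pixels per stage. Here the storage is 2 x Width bytes per stage, which is the kind of saving the quote refers to.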

One of the tricks to getting the best performance from the hardware implementation is optimizing the parallelism in the application, said Silexica’s Inkeles. But this requires insight into the algorithm to guide the HLS tools on how to implement that parallelism. By using a tool to detect the implicit parallelism in the design and refactor the code to expose it, a developer can maximize the performance of the algorithm when implemented in hardware.

Data storage and movement is critical. “I can have tremendous amounts of compute, but I must have the ability to have data that supports that compute,” says Johnson. “This means if I do not have sufficient bandwidth into memory or sufficient access into memory, then I will have a loop that can operate extremely efficiently and fast. But if I am stalled by my data access, then I will not be able to deliver that performance.”

Many algorithms are constrained by memory access performance. “It is usually not computational time that is the bottleneck but getting the data to the computational units that takes most of the time and energy,” adds Klein. “Again, this is the responsibility of designers and not yet compilers.”

But there is an even bigger barrier and problem for the software engineers. “High-level synthesis does not have global visibility,” he says. “While it does a good job with the modules it is used on, it cannot take into account activity in other modules, such as those not using high-level synthesis, or in software.”

Pursley puts that in more concrete terms. “HLS is essentially a different type of compiler, so as a rule of thumb, the C++ must be fully resolved at compile time. It explicitly enforces that there are no hidden assumptions about timing synchronization between blocks. For example, you can’t have unbounded malloc()’s, because in the end, you need to implement a memory of some physical size on actual silicon.”
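
A minimal sketch of what "fully resolved at compile time" can mean for storage: the capacity is a template parameter, so the size of the memory is known before synthesis, and running out of space is an explicit, handled condition rather than a failed malloc(). The class name BoundedBuffer is hypothetical, and which C++ features a given HLS tool accepts varies, so this should be read as an illustration of the idea rather than as synthesizable code for any particular tool.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fixed-capacity buffer: all storage is sized at compile time.
template <typename T, std::size_t Capacity>
class BoundedBuffer {
public:
    // Appends a value. Returns false instead of growing when full.
    bool push(const T& value) {
        if (size_ == Capacity) {
            return false;
        }
        data_[size_++] = value;
        return true;
    }

    // Number of elements currently stored.
    std::size_t size() const { return size_; }

    // Read access to a stored element; the index must be in range.
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    std::array<T, Capacity> data_{};  // maps to a memory of known size
    std::size_t size_ = 0;
};
```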

That restriction does not sit well with software engineers. “If I am a software developer, what I want is to say malloc() on a variable, I want memory to get allocated, I want to point to that data and I want the data to return,” says Johnson. “With HLS I have to architect a memory structure based on what I am doing with HLS tool. That is an extremely complex and difficult thing to do, and that is why it often takes 9 or 12 months to develop something. This is a tremendous burden for someone who writes software. They are unable to develop a memory structure.”

Architecting a solution
This is where solutions may diverge for each development team. “As a hardware designer, you may want to explicitly model things like bit-widths and process-level concurrency,” says Pursley. “In those applications, the IEEE 1666 standard C++ class library, SystemC, should be used by your C++ models.”

The goal for hardware engineers is to arrive at the most efficient implementation possible. “A software engineer wants to take standard C code and be able to target a combination of x86 and an FPGA as an accelerator,” says Johnson. “They want to do that in a way that would be familiar to them as opposed to someone with a hardware background.”

Considering an FPGA as the target may change the equation. “If the goal for the end product is an FPGA rather than ASIC, then a high-level compilation approach may be desirable,” says Biles. “This is not only for rapid prototyping. Because it enables fast enhancement of the algorithm based on observed application behavior, this can lead to better real-world performance.”

Silexica’s approach to enabling software engineers to convert their C/C++ code to hardware is to help optimize their application code prior to performing HLS. “By performing and combining static, dynamic, and semantic code analysis before the HLS phase, we can guide the software engineer through the unfamiliarity of coding for hardware: essentially, guide software engineers to make their code synthesizable and identify parallelism in sequential code,” Inkeles said. “Afterward, the tool automatically inserts pragmas/directives into the code to guide the HLS compiler. Adding code analysis tools to the HLS design flow can significantly reduce the barriers that software engineers face when trying to convert their software into hardware.”

CacheQ is trying a solution that may enable software engineers to access this technology. “We generate a virtual machine,” explains Johnson. “This is a bunch of instructions, where instructions are what you would think of as microprocessor instruction set—conditions, Boolean functions, arithmetic—and we generate a connected virtual machine of these instructions. The way they connect is how data flows through the system. By doing this, it allows us to run performance simulation to determine how fast it can be. We can profile that across the processor and the FPGA, and this representation allows us to automatically and fully pipeline loops.”

The key is to look at the code and find the main loop that needs to get accelerated, and then determine the complementary functions to that loop. “My printf() will finish up in the processor and will not be in the FPGA,” Johnson says. “It is not sufficient to just have the ability to take code and generate something in an FPGA. You must have the ability to do performance simulation on that, do some kind of profiling so you can identify the hotspots in the code. That is what goes into the FPGA. All of the instructions operate in a way that when the data is available, they fire. If all of the inputs are not available, then the instruction will wait. This allows us to implement things where we do not know exactly what the return time will be for access to memory. Also, being able to determine from a synchronization point of view when operations are happening in the proper sequence — the virtual machine understands and takes care of that.”
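
The firing rule Johnson describes (an instruction executes when all of its inputs are available and otherwise waits) is the classic dataflow model. The sketch below shows that rule for a single two-input add. It is a hypothetical illustration, not CacheQ's virtual machine, and the names AddNode and try_fire are invented for the example.

```cpp
#include <cassert>
#include <optional>

// One dataflow instruction with two operand slots.
struct AddNode {
    std::optional<int> lhs;  // empty until the producer delivers a value
    std::optional<int> rhs;

    // Fires only when both operands are present. On firing, the operands
    // are consumed and the result is returned. If either operand is
    // missing (for example a memory read that has not returned yet),
    // the node simply waits and returns an empty result.
    std::optional<int> try_fire() {
        if (!lhs || !rhs) {
            return std::nullopt;
        }
        const int result = *lhs + *rhs;
        lhs.reset();
        rhs.reset();
        return result;
    }
};
```

Because the node waits on data rather than on a fixed schedule, a variable-latency memory access delays the result but does not change it.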

Another difference between these two worlds appears in verification. No software person thinks about verifying the output of a compiler, but this is standard practice for hardware.

“Regular synthesis can rely on formal equivalency checking tools that analyze design logic between static registers,” says Dave Kelf, vice president and CMO for Breker Verification Systems. “For untimed logic, this is much harder. It can be solved for simple synthesis using property checking over a restricted number of state transitions. Simulation is necessary for serious designs. Creating an intent-based testbench that mimics the DSP functionality can provide a rigorous solution for untimed to timed logic synthesis, which is yet another application of Portable Stimulus test suite synthesis tools.”

Formal can play a big role here. “Technology exists to share common SystemVerilog Assertions (SVA) between C++/SystemC models and post-HLS RTL code,” adds OneSpin’s van Blommestein. “Static analysis tools for C++/SystemC are now on par with those available for SystemVerilog RTL. These can detect problems that can compromise HLS, including excessive data length, race conditions and initialization errors. Even when the design process is not as automatic as one might wish, automated verification is key.”

Potentially, these issues are not as important for software targeting an FPGA because you can quickly apply updates if issues are found.

Good enough
While some may see this as a political debate, it really just comes down to performance and productivity.

“I often get asked, ‘Does HLS turn software into hardware?’,” says Pursley. “The answer is no—or more precisely, not into hardware you would want.”

Klein puts it into numbers. “Let’s say you can do a ‘bad’ implementation and make a function run 100x faster by moving it to hardware. You might find that the overhead of interfacing and moving data to the accelerator takes back most of that gain, so at the system level you’re only going 10X faster. But building a ‘great’ implementation, one that is comparable to handcrafted RTL goes, say, 10 times faster than the bad one, or 1000X faster than software. It still has that same overhead as the bad implementation, so if you do the math, the difference at the system level between the bad implementation and the great implementation may be less than 1%.”
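
One possible reading of Klein's arithmetic can be checked with a few lines. The model below is an assumption: it treats the whole workload as the accelerated function plus a fixed data-movement overhead, and it expresses every time as a fraction of the original software runtime. The function names are hypothetical.

```cpp
#include <cassert>

// Overhead, as a fraction of the original software runtime, inferred from
// the function-level speedup and the speedup observed at the system level.
inline double overhead_fraction(double function_speedup, double system_speedup) {
    return 1.0 / system_speedup - 1.0 / function_speedup;
}

// System runtime, as a fraction of the original software runtime, for a
// given function-level speedup and a fixed overhead fraction.
inline double system_time_fraction(double function_speedup, double overhead) {
    return 1.0 / function_speedup + overhead;
}
```

With the quoted numbers, the inferred overhead is 9% of the original software runtime. Under this model the "bad" implementation finishes in 10% of that runtime (10X) and the "great" one in 9.1% (about 11X), so the two differ by 0.9% of the original runtime.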

But the software team has a different perspective. “We’d like to get to a runtime that deploys portions of a workload, defined at a high level, to the most appropriate execution resource (based on functionality, performance and availability) while meeting high level constraints also provided by the developer,” says Biles. “The desired behavior would be achieved through a combination of software and hardware functions, with the capability to optimize and retarget them as and when required.”

Getting this kind of technology into the hands of software engineers would greatly expand the potential market, says Johnson. “How do you create something that will support the kinds of capabilities we are talking about because that is the way people write software? If you can’t do that, then you cannot expand out and say I support a user community that is not hardware. Software people do not do extensive dataflow analysis to figure out exactly what data goes where, how it gets loaded into memory, ensures that when I do a read it is there because if it isn’t, I get a failure as opposed to how a software person writes code which is to allocate memory, to which I get a pointer, and it accesses that and at some point in time it shows up. How the memory architecture supports that—as a software guy I don’t know.”

Software and hardware engineers remain in separate worlds and often have different objectives when it comes to tools. For hardware engineers, efficiency continues to be the driving factor. For software engineers, productivity is the goal. Tool vendors are trying to find ways to satisfy both sets of requirements.

“For many applications, even bad hardware is much better than great software,” says Klein. “Many designers underestimate how slow software really is, and how inefficient processors are—especially big processors. Exceeding the capabilities of software and a processor is really easy to do.”



Theodore wilson says:

I don’t know if this is a technical problem so much as an economic one. Software dev teams work to different economic realities than hardware dev teams. Brian’s notes on efficiency and productivity are at the heart of it.

An effective flow that takes a software product to hardware likely has to improve the product in ways sellable to the software team’s weekly needs before the hardware acceleration is delivered.

RTL is a highly restrictive coding style that de-risks a hardware project. When you add in the needs of effective simulation, formal checks, UPF… it gets more restrictive not less. Refactoring software that already ships into a restrictive acceleratable form should result in better testing, architectural flexibility and performance separate from the eventual acceleration. It seems critical to identify and freeze out some core pieces in the product so that a hardware acceleration effort is not chasing a moving target resulting from typical customer driven CI/CD.
