New directions for a commonly used technology.
Most computer algorithms today are developed in high-level languages on general-purpose computers. But someday they may be deployed in embedded systems, even though the development, verification, and validation of those algorithms are done in languages like Python, Java, and C++, or even in numerical frameworks like MATLAB.
This is the goal of high-level synthesis (HLS), which aims to solve a fundamental problem in system design today. The basic idea is to let hardware designers build and verify hardware, with better control over optimization of the design architecture, by describing the design at a higher level of abstraction while the tool implements the RTL. It also may be possible to use HLS to improve the algorithms that run on that hardware.
The ultimate description will be in a hardware description language (HDL) at the register transfer level (RTL). So it’s not so much the algorithm that will be affected by HLS. It’s the implementation of the algorithm.
“HLS allows a large variety of architectures to be explored quickly, and enables very different implementations of an algorithm to be created without a lot of coding and without the risk of breaking the algorithm or requiring a lot of verification and debug,” said Russell Klein, HLS platform program director at Mentor, a Siemens Business.
Sometimes the results of HLS are seen feeding back into the algorithm. Klein said he recently worked on a voice recognition system in which the original TensorFlow algorithm, as defined by the data scientists, called for a feature map of spectral data in a 99 x 40 array of floating-point numbers. “That worked great when the algorithm was implemented in software. But in hardware the design was more efficient (and produced more accurate results) using an array that is 128 x 32 words of fixed-point numbers. This is a rather simple example, but it illustrates how details discovered during the implementation phase can get fed back to the algorithm developers, causing them to modify the algorithm to produce a better implementation in hardware.”
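A hypothetical sketch of the change described above contrasts the feature map as the data scientists defined it with the hardware-friendly version. The dimensions come from Klein's example; the element types, Q-format, and array names are assumptions made purely for illustration.

```cpp
#include <cstdint>

// Algorithm as originally defined: 99 x 40 array of floating-point numbers.
float feature_map_sw[99][40];

// Hardware implementation: 128 x 32 words of fixed-point numbers.
// Power-of-two dimensions map more cleanly onto on-chip memories, and the
// narrower fixed-point words shrink the multipliers in the datapath.
int16_t feature_map_hw[128][32];   // e.g. Q1.15 fixed point (assumed format)
```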
Improving algorithms
Improving algorithms isn’t usually equated with HLS, and the terminology can mean a number of different things. However, this is one of the next possible directions for this technology.
“At the most basic level from a technology point of view, you can have one algorithm, and with high-level synthesis you can very quickly get area, power, performance, congestion,” said Dave Pursley, product management director at Cadence. “You can get feedback even for that front-end designer by pushing it through the flow very quickly, and that can help make decisions about micro-architectural changes — how many states, how big the data path should be, etc. If you extend that out a bit, you also can do things that are more like algorithm exploration. That would be things at the most basic level in DSP, which is almost everything these days. Even something as simple as changing the bit width can have a very significant impact on the power, performance, area, and accuracy, or signal-to-noise ratio, or whatever the exact metrics are. That would be sort of the most basic version of algorithmic exploration.”
Technically, this is possible because HLS starts from SystemC and C++, whose data types allow the designer to write the algorithm once and then change the data types of the variables, yielding completely different synthesis results.
“For example, if you’re doing a 32-bit floating point multiplication versus a 5-bit fixed point multiplication, you’ll have completely different results and completely different power, performance, and area tradeoffs,” he said. “Most likely the system-level metrics, such as the signal-to-noise ratio or image quality will be different too. In this way, it allows you to figure out those tradeoffs, and that would literally entail changing one line of code in the C++. On the other hand, if you were designing RTL by hand, significantly changing the bit widths, especially with fixed point and floating-point types, this would basically require an entire datapath rewrite. It’s something you generally can’t do. You have to take a guess on what is the correct bit width and then just build your RTL and hope you were right. That is a very simple change, and it’s something that HLS users do all the time — and a lot of them have been doing it for a long time.”
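As a rough illustration of that one-line change, consider a small C++ kernel written for HLS in which the datapath word type is defined in a single place. The function name, type alias, and comments below are hypothetical; a production flow would typically use an arbitrary-precision vendor type such as ac_fixed or ap_fixed rather than a native C++ type.

```cpp
#include <cstddef>

// One-line knob for exploring power/performance/area/accuracy tradeoffs.
using sample_t = float;   // change this single line to resize the datapath

// 4-tap FIR kernel; HLS maps the loop to multipliers and adders whose
// widths follow directly from sample_t. With a fixed-point sample_t, a
// wider accumulator and explicit rescaling would also be part of the
// exploration.
sample_t fir4(const sample_t coeff[4], const sample_t window[4]) {
    sample_t acc = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        acc += coeff[i] * window[i];
    }
    return acc;
}
```

Re-running synthesis after flipping that one alias is what produces the different power, performance, area, and signal-to-noise results Pursley describes; doing the equivalent in hand-written RTL would mean rewriting the datapath.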
Particularly for AI and machine learning applications, HLS may even play a role in determining if an algorithm is the correct one, not just whether the data types are correct, Pursley said. “This is where there are completely different architectures, completely different algorithms that you can use to address the same problem, and it’s really easy with things like Caffe or TensorFlow to do some analysis and figure out what the accuracy is, at least at full floating point or at specific bit widths. But it’s frankly impossible to figure out what the actual performance, area, congestion, etc., will be when you put it into hardware. With high-level synthesis, you have a path from this high-level algorithmic modeling. Similarly, with MATLAB in DSP, you can explore all sorts of different architectures and algorithms. And from them, there’s a path to get down to RTL and gates to get quick analysis of not just the theoretical impact, or the impact on picture quality or whatnot, but also the actual cost of that in terms of power, area or other constraints.”
How to leverage HLS
Really taking advantage of HLS requires a change in thinking within an organization. “In a typical large company flow, you may have an algorithm or architecture team, and they hand off exactly what they want built to the front-end design team,” Pursley said. “The design team then goes and dutifully builds that to the given specs to the best of their ability. One of the big things when you’re doing this type of optimization or exploration is that you need the algorithm people and the front-end design people generally working together. You need that hardware expertise, but you also need the algorithm expertise. This is why, frankly, we see this at a lot of AI- and 5G-type startups. No one knows exactly what the best algorithm is. And depending on what the application is, there’s a different best one. In those startup environments, there’s a lot more co-mingling of responsibilities and expertise. That’s why it’s catching on like wildfire, especially in those types of companies.”
Where HLS has a significant impact on the algorithm itself is when the algorithm simply cannot be made to meet the constraints of the application, Mentor’s Klein suggested. “Once the design’s RTL is developed and it is determined that it is either too slow, too big, too power-hungry, or too whatever, and there is nothing in the implementation at the RTL level that can address the problem, then a change to the underlying algorithm is needed. A development team using traditional RTL methods is going to be in a pretty bad spot here. They need to conceive a new algorithm that can address the shortcomings of the original one, and then do a whole new RTL development cycle. There’s rarely time for that. Teams using HLS are in a much better position to recover if this unfortunate circumstance occurs. They fully understand at the algorithmic level what failed, and can deploy a new RTL implementation very quickly.”
In the past, complex algorithms that were not well understood or might be likely to change were simply relegated to software. But today, the algorithms that are being deployed, especially in the machine learning/AI space, are simply too computationally heavy to leave as software.
“It is just not practical to run them on today’s embedded processors,” Klein said. “At the same time, we as an industry do not have enough experience with many of these algorithms to know which one should be used for a given problem. Should you use a deep neural network, or would a nearest neighbor algorithm work instead? And once you have selected an algorithm, is there an architecture implementing that algorithm that meets your requirements? Can you use an off-the-shelf TPU? Can you configure some open source IP to do the job? Or do you need something custom? HLS is a practical and proven way to explore these questions. A traditional RTL development process simply cannot react fast enough if you don’t make the right choices at the start of the project. HLS significantly reduces the risk that your project will fail if you select the wrong algorithm or choose an implementation that simply won’t meet your requirements.”
Hand optimization vs. tool-based optimization
The HLS market is an active one, with many threads of development. One area of discussion concerns whether an HLS optimization tool can outperform expert-level hand optimizations, wrote Zubair Wadood, technical marketing engineer at Silexica, in a recent blog post.
He pointed out that a recently published white paper examined a novel approach to optimizing a secure hash algorithm and compared the results to a competition-winning, hand-optimized HLS implementation of the same algorithm. The new approach showed a nearly 400X speed-up over the un-optimized implementation and outperformed the hand-optimized version by 14%. It was also more resource-efficient, consuming nearly 3.6 times fewer look-up tables and 1.76 times fewer flip-flops.
There are different views of how to best utilize HLS. Chris Jones, vice president of marketing at Codasip, believes that rather than employing HLS to improve an algorithm, it is much more beneficial to leave the algorithm as is and use HLS to perfect the hardware running the algorithm.
There are two approaches here. “The traditional approach would be to write an algorithm that is intended to run on a particular target, and then analyze, profile, optimize, and then go down to the assembly level to squeeze every last cycle,” Jones said. “That is a time-consuming task and contributes to vendor lock-in, as no one wants to repeat that laborious process for another target architecture. The other approach, which is what HLS tools offer — particularly processor description languages and synthesis tools — is to give the user, after profiling code and carefully analyzing where the cycles are being spent, the ability to generate a target optimized for the algorithm. This has many advantages. The programmer can stay at the C level without delving into assembly, and the SoC architect can minimize extraneous logic and produce a block that saves area and power as it has a single purpose. Code remains portable, as it is in C, and tailored hardware provides multiple cost benefits. Of course, this only holds true if the algorithm is stable and has reached a level of maturity where it is unlikely to change over time. For algorithms that are still evolving, HLS can still provide assistance in optimizing hardware, but not to the same degree.”
Indeed, many companies already use SystemC/C++ models in signal and image processing applications, including image scaling, image interpolation, and video codecs, said Sergio Marchese, technical marketing at OneSpin Solutions. “Although the adoption of SystemC and HLS is increasing steadily, some teams still generate the RTL manually, leaving those teams to perform verification at the RTL level. A far more productive approach is to do more verification before RTL generation. This is crucial to enable additional use cases for HLS and more widespread adoption. Some formal solutions allow users to do a lot of automated checks at the SystemC level, and also do formal verification with custom assertions, saving a lot of time-consuming HLS iterations.”
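As a concrete illustration of checking the model before RTL generation, the sketch below asserts a simple property of a hypothetical saturating adder at the C++ level. It is written as a simulation-style check; a formal tool would prove the same assertion exhaustively rather than relying on whatever inputs happen to be simulated.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical saturating adder model that HLS would later turn into RTL.
int16_t sat_add(int16_t a, int16_t b) {
    int32_t sum = static_cast<int32_t>(a) + b;
    if (sum >  32767) sum =  32767;
    if (sum < -32768) sum = -32768;
    return static_cast<int16_t>(sum);
}

// Property checked before any RTL exists: the output matches the ideal sum
// when it fits in 16 bits, and clamps to the nearest bound when it does not.
void check_sat_add(int16_t a, int16_t b) {
    int32_t ideal = static_cast<int32_t>(a) + b;
    int16_t out   = sat_add(a, b);
    if (ideal >= -32768 && ideal <= 32767) {
        assert(out == ideal);
    } else {
        assert(out == 32767 || out == -32768);
    }
}
```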
Speeding up emulation
Yet another interesting new use for high-level synthesis is to speed up emulation. “To get higher bandwidth out of your [emulator], one of the limiting factors is the communication between the verification environment and the testbench with whatever the design is inside the box,” said Pursley. “Given that HLS takes a fairly high-level algorithm along with its I/O and whatnot, and synthesizes it into RTL, that algorithm is now actually a verification. It’s whatever you’re trying to do in your verification testbench, using the same I/O that you’re using on your design side. That allows you to move much more, if not all, of that testbench into the box, assuming you have the capacity in your emulator box. Then you can get literally orders of magnitude bandwidth speed up in your emulator.”
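A hypothetical sketch of that idea: if the stimulus side of the testbench is written as plain, synthesizable C++, HLS can compile it into the emulator alongside the design, so transactions no longer have to cross the slow host-to-emulator link. The LFSR-based generator below is illustrative only, not any particular vendor's flow.

```cpp
#include <cstdint>

// 16-bit Fibonacci LFSR: a synthesizable pseudo-random stimulus source.
// The state must be seeded with a nonzero value.
uint16_t lfsr_next(uint16_t state) {
    uint16_t bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1u;
    return static_cast<uint16_t>((state >> 1) | (bit << 15));
}

// Produces one input transaction per call for the design under test.
// Because it is plain, bit-accurate C++, HLS can synthesize it into RTL that
// runs inside the emulator at full speed, next to the design it drives.
uint16_t generate_stimulus(uint16_t &state) {
    state = lfsr_next(state);
    return state;
}
```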
Conclusion
Today, the target applications getting the most attention for HLS include imaging, especially image signal processing, as well as 5G and AI.
“With AI, you end up having strong software experience, a little bit of hardware experience, and you need a way to get to hardware, especially for your first silicon, with a fairly streamlined hardware team, while at the same time doing algorithm exploration,” Pursley said. “It’s just the perfect storm for HLS.”