How close can we get to automated system optimization from a software function? The target keeps moving, but the tools keep getting more capable.
Building an optimal implementation of a system using a functional description has been an industry goal for a long time, but it has proven to be much more difficult than it sounds.
The general idea is to take software designed to run on a processor and to improve performance using various types of alternative hardware. That performance can be specified in various ways and for specific applications. The problems start when you try to automate the process, especially when the starting point is software written in plain C or C++. These are very general-purpose languages, which makes them difficult to analyze and manipulate for anything other than a defined instruction-set processor.
Today’s systems often contain multiple heterogeneous processors, including DSPs, GPUs and various sizes of CPUs. They also may contain programmable fabrics, and some pieces of the system require custom accelerators.
Sticking with software
Software has a lot of positive characteristics. “Most notably, it can be updated in the field,” says Russell Klein, HLS platform director at Mentor, a Siemens Business. “Today it can often be updated over the air, sometimes without any assistance from the end user. This means you can fix bugs and add features, which you can’t do with fixed hardware. Keeping things in software minimizes risk, not just during development but over the lifetime of the product. So anything that can be in software, should be in software.”
But that solution is not possible all of the time. “What drives us to put functions in hardware is that they cannot run fast enough in software, or they consume either too much power or too much energy when implemented in software,” adds Klein. “Software’s downside is that it is really inefficient. It is somewhat counterintuitive, but the larger the CPU the software is being run on, the more inefficient it is in terms of power and energy.”
Almost all solutions require a mix to be practical. “The big picture view is an ideal balance between hardware and software,” says Kevin McDermott, vice president of marketing for Imperas Software. “You want just the right amount of fixed hardware functionality that provides lower implementation costs and efficiencies, with the software adaptability offering flexibility as requirements change over time.”
It raises the question about what should remain in software. “Application code that is sequential or not in the critical path typically does not benefit from being put into specialized hardware, and should remain in software,” says Loren Hobbs, head of product and technical marketing at Silexica. “You can do frequency scaling of the clock if you need it to run faster for a particular function, or, for power consumption reasons, you could scale it down.”
Today, with RISC-V and an evolving infrastructure that enables instruction extensions to an ISA, plus a whole range of implementations for a given ISA, even knowing the performance of software on a candidate processor becomes more difficult. “Until now, the option to extend an ISA and add custom instructions was possible with some processors, but without a consistent software ecosystem and broad industry support this failed to attract widespread adoption,” adds Imperas’ McDermott. “But the promise of tuning a processor to better address the application remained a persistent concept.”
While this has been possible in the past, it is becoming more accessible. “In the RISC-V context, you must have an ASIP generator that provides high enough quality,” says Frank Schirrmeister, senior group director for product management and marketing at Cadence. “You can take the instruction set and modify it, add instructions, and see the impact on the high-level characteristics. Every time you make a change at that level, you need to move all of the tool chain as well. From there, you can either do a manual implementation or you use an ASIP generation tool to autogenerate from a template.”
Migrating to hardware
Sometimes, even an optimized processor does not provide you with enough performance or low enough power. “Finding candidates to migrate into hardware isn’t that hard,” says Klein. “Amdahl’s Law applies here. Whatever you move needs to comprise at least 50% of the load on the CPU, ideally more. Some basic profiling will allow you to identify that. Figuring out where to cut the algorithm is more challenging, as you need to consider what data is accessed by the function. Anything you move off the CPU needs to have any data that it accesses moved, as well. That has a big performance and power penalty, so you need to make sure you’re minimizing that data movement — even if it means pushing more of the algorithm into hardware.”
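As a rough, back-of-the-envelope illustration of why that 50% threshold matters, the sketch below applies Amdahl's Law to a hypothetical profiling result; the 80%, 40% and 10x figures are invented for illustration, not taken from any real design.

    #include <cstdio>

    // Amdahl's Law: overall speedup when a fraction 'p' of the runtime is
    // accelerated by a factor 's', and the remainder stays on the CPU.
    // speedup = 1 / ((1 - p) + p / s)
    static double amdahl(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main() {
        // Hypothetical profile: a hot function is 80% of CPU time, and a
        // hardware accelerator runs it 10x faster.
        std::printf("80%% offloaded, 10x faster: %.2fx overall\n", amdahl(0.80, 10.0));
        // If the candidate is only 40% of the load, the ceiling is much lower.
        std::printf("40%% offloaded, 10x faster: %.2fx overall\n", amdahl(0.40, 10.0));
        return 0;
    }

Even a 10x accelerator buys only about 1.6x overall when the candidate is 40% of the load, versus roughly 3.6x at 80%, which is why profiling for the dominant functions comes first.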
Choosing what goes where isn’t always so easy to define, though. “The idealized balance between the hardware and software tradeoffs suggest a near infinite flexibility with a smooth boundary transition,” says McDermott. “Of course, it’s not that simple and much more granular. The hardware could be completely fixed and hard-wired, or allow some configurable control options ranging from a state machine or a micro-coded processor hardware. Similarly, the software has abstractions moving from assembly language to C, or the merits of a micro kernel, RTOS or OS to manage the mix of tasks and requirements.”
What are good candidates for migration? “By identifying the parallelism within functions that are in the critical path of an application, a designer can then perform what-if analysis to determine the benefit of moving the function into an accelerator,” says Silexica’s Hobbs. “Tools can profile the application by performing dynamic analysis of code execution and display call graphs, which show the execution time of each function — including the execution time of each line of code and each memory access within the application. A visual representation helps you to find those critical paths. Within the call graph you can select a function and detect the parallelism within it that could be exploited in a custom accelerator.”
Parallelism is the key. “A hardware multiplier runs at the same speed as the multiplier inside the CPU,” says Mentor’s Klein. “What makes hardware faster is that it can do things in parallel. You can do 1,000 multiplies at the same time in hardware. Moving a serial algorithm off the CPU into hardware isn’t going to help much. In fact, it will probably make things worse. Anything moved into hardware has to be able to take advantage of parallelism.”
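A minimal C++ sketch of that distinction, with invented function names: the first loop's iterations are independent and can be flattened into many parallel multipliers, while the second is a true recurrence where each step waits on the previous one, so extra hardware buys little.

    // Independent multiplies: every iteration can run at the same time,
    // so hardware can instantiate many multipliers in parallel.
    void scale_parallel(const int *in, int *out, int n, int k) {
        for (int i = 0; i < n; ++i) {
            out[i] = in[i] * k;   // no dependency between iterations
        }
    }

    // True recurrence: y[i] depends on y[i-1], so iterations cannot overlap
    // and a bank of parallel multipliers would sit idle.
    void iir_serial(const float *x, float *y, int n, float a) {
        float prev = 0.0f;
        for (int i = 0; i < n; ++i) {
            prev = a * prev + x[i];   // each step waits on the previous result
            y[i] = prev;
        }
    }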
Cadence’s Schirrmeister has long proposed that there are eight ways in which a pure function can be turned into an implementation.
“Those are the ways to implement a block,” Schirrmeister says. “But the real question is one level above that and slightly orthogonal to it, which is how do I split things and what do I put where?”
Sometimes those choices are dictated by the market. “FPGAs have some of the benefits of software, but they are slower and more power hungry than ASICs,” says Klein. “Another thing you need to consider is future-proofing the design. With programmable fabric, you can accommodate changes down the road. Anyone putting together a 5G system before the specs are finalized should be putting any hardware in programmable fabric. This is also true anywhere that algorithms are rapidly changing, like we see in AI today.”
Optimize for what?
It is tough enough optimizing for one factor, but most designs have to consider several. “Performance is only one,” says Schirrmeister. “Just getting it to work is the most important. You may also want to consider cost, which translates into area, and power, which is affected by how much is put into hardware accelerators versus running in software.”
Optimization varies by design and application. “For some it is power, but for others power is not a key factor,” says Hobbs. “We constrain the clock frequencies, and that is how we derive our settings for what you put into the HLS compiler. We consider two things, the constraints for the clock frequency and a resource constraint level. When you combine those two things you can get an approximation of power consumption. We are not setting a wattage constraint, but you can approximate the power.”
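One hedged way to picture how those two constraints approximate power: first-order dynamic power scales with effective switched capacitance (which tracks the resources used) times voltage squared times clock frequency. The sketch below uses that textbook relationship with placeholder numbers; it is not any vendor's actual model.

    #include <cstdio>

    // Very rough first-order dynamic power model: P ~ C_eff * V^2 * f,
    // where C_eff scales with how many resources (LUTs, DSPs, etc.) switch.
    // All numbers here are placeholders, not calibrated to any device.
    static double approx_dynamic_power_w(double cap_per_unit_f,  // effective capacitance (F) per resource unit
                                         double resource_units,  // resource constraint level
                                         double voltage_v,
                                         double freq_hz) {
        double c_eff = cap_per_unit_f * resource_units;
        return c_eff * voltage_v * voltage_v * freq_hz;
    }

    int main() {
        // Same resource level at two clock constraints: power tracks frequency.
        std::printf("200 MHz: %.3f W\n", approx_dynamic_power_w(1e-12, 5000, 0.85, 200e6));
        std::printf("400 MHz: %.3f W\n", approx_dynamic_power_w(1e-12, 5000, 0.85, 400e6));
        return 0;
    }

Holding the resource level fixed and doubling the clock constraint roughly doubles the dynamic-power estimate, which is why the two constraints together bound the answer without setting an explicit wattage target.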
Tradeoffs need suitable platforms on which the analysis can be made. “Virtual platforms have long offered a way to test and develop software before hardware is available, and have been adopted extensively for hardware verification in professional DV flows for SoCs and processors,” says McDermott. “Architectural analysis was always seen as a potential area of interest, but it was held back by the lack of readily available real software to analyze.”
This is a problem because hardware is often designed before software is available, making the whole top-down flow impractical.
“We want to help users make decisions quickly and then focus on the implementation, rather than striving to get absolute performance and resource numbers,” says Hobbs. “We want to be able to provide it quickly. So you may sacrifice a little accuracy, but it is good enough to enable the decisions to be made. Then you can zero in on an implementation that you are comfortable with and get the more accurate numbers from those tools. On the processor execution side, that is a little easier, and that can be more accurate than the synthesis side. But we are not talking about huge differences in performance numbers or resource consumption.”
Some have suggested that AI may have a role to play in finding the perfect partition. “This is going to be an unpopular statement, but I don’t think you need AI to solve this problem,” says Klein. “Certainly, AI could be successfully applied to it, but, to me, this is more of a problem that big data and some basic statistics can solve, at least for most cases. You need to find parallelism and clustering of data to effectively partition the system. This can be done by tracing the system running under realistic conditions with typical data loads and analyzing those traces. This, of course, goes beyond your standard ‘gprof’ traces, but the tooling needed is generally already available.”
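A small sketch of the kind of trace analysis Klein describes, under the assumption of a made-up trace format: each record says which function touched which buffer and how many bytes moved, and the tally exposes which functions cluster around the same data and would have to move together if offloaded.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // One record from a hypothetical execution trace: which function
    // touched which data buffer, and how many bytes moved.
    struct TraceRecord {
        std::string function;
        std::string buffer;
        long bytes;
    };

    int main() {
        // Toy trace; a real one would come from instrumented execution.
        std::vector<TraceRecord> trace = {
            {"fir_filter", "samples", 4096}, {"fir_filter", "coeffs", 256},
            {"fft",        "samples", 4096}, {"fft",        "spectrum", 8192},
            {"log_stats",  "spectrum", 64},
        };

        // Total bytes each function exchanges with each buffer. Functions that
        // share heavily used buffers are candidates to move into hardware together,
        // keeping the data movement across the boundary small.
        std::map<std::string, std::map<std::string, long>> traffic;
        for (const auto &r : trace) traffic[r.function][r.buffer] += r.bytes;

        for (const auto &fn : traffic) {
            std::printf("%s:\n", fn.first.c_str());
            for (const auto &buf : fn.second)
                std::printf("  %-10s %ld bytes\n", buf.first.c_str(), buf.second);
        }
        return 0;
    }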
A practical flow
Assuming that a top-down flow was to be followed, there are several steps and transformations that have to happen. “Every time you make a split, you have to have a corresponding set of interface implementations if they cannot be automatically synthesized,” says Schirrmeister. “For every split you have to look at the interface between the two blocks and find a way to stitch them back together. That can be done with interface synthesis, which is really just a smart library of components that can be used for assembly.”
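The glue that such a library supplies can be pictured, very loosely, as a stream adapter between the two halves of a split. The sketch below is a software-only stand-in with invented names, not anything a tool would actually generate; in a real flow the same split point would get a FIFO, DMA or bus wrapper on the hardware side and a driver call on the software side.

    #include <cstdint>
    #include <deque>

    // Software-only stand-in for a stream interface between two partitioned
    // blocks (e.g. a CPU-side producer and an accelerator-side consumer).
    class StreamChannel {
    public:
        void write(uint32_t word) { fifo_.push_back(word); }
        bool read(uint32_t &word) {
            if (fifo_.empty()) return false;   // consumer must handle back-pressure
            word = fifo_.front();
            fifo_.pop_front();
            return true;
        }
    private:
        std::deque<uint32_t> fifo_;
    };

    // The producer stays in software; the consumer models the hardware block.
    void producer(StreamChannel &ch) { for (uint32_t i = 0; i < 8; ++i) ch.write(i * i); }
    uint64_t consumer(StreamChannel &ch) {
        uint64_t sum = 0; uint32_t w;
        while (ch.read(w)) sum += w;
        return sum;
    }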
Then you have to go back and ensure that system functionality has been maintained. “The interactions at the system level may still need to be analyzed,” says Hobbs. “You may need to manage timing and ensure that accesses are in sync and arbitrated.”
“The movement of the data, the parallelism, and the overall architecture are all still defined by the developer through changes to the source code and constraints on the compilation of the algorithm,” adds Klein. “Successfully migrating an algorithm from software to hardware requires someone who understands hardware design. It is not that software folks can’t learn how to do this, but they usually do not have the training and background needed to reach an efficient implementation.”
Each piece of functionality has to be transformed. “The worst way to do this is to hand a hardware developer some C++ source code, and hope for the best,” Klein says. “Any manual translation relies on the developer understanding, and then correctly implementing, the algorithm. This will almost certainly introduce subtle problems that are hard to weed out.”
A process is required. “We are helping the software developer identify parallelism, and then helping them with the interfaces,” says Hobbs. “Hardware experts know HLS and they use the tools to get deep inside the application, visualizing the complex functions, finding parallelism and identifying critical paths. But even finding the right mixture of pragmas and directives to guide the compiler has a really big effect on the implementation of the design, and there is not a lot of visibility into that. It is more of a trial-and-error process. We help by providing quick analysis so they can quickly arrive at the optimum settings.”
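As an illustration of how much leverage those directives have, the fragment below marks a small dot-product loop with pragmas in the style of one common HLS tool; the exact pragma names and placement vary from tool to tool, and the comments describe the general trade-off rather than measured results.

    // Same C++ source, different hardware depending on the directives.
    void dot8(const int a[8], const int b[8], int &result) {
        int acc = 0;
    dot_loop:
        for (int i = 0; i < 8; ++i) {
    #pragma HLS UNROLL        // fully unrolled: eight parallel multipliers, more area
    // #pragma HLS PIPELINE   // pipelined alternative: one multiplier reused each cycle
            acc += a[i] * b[i];
        }
        result = acc;
    }

Swapping the unroll directive for the pipeline directive typically changes the generated hardware from eight parallel multipliers to one multiplier reused across iterations, without touching the algorithm itself, which is exactly the trial-and-error space the analysis tools try to shrink.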
Automation minimizes the introduction of new problems. “Some kind of automated translation is ideal,” says Klein. “HLS tools enable algorithmic code to be translated into RTL. This bypasses the human interpretation problem. But you still need to thoroughly verify the resulting RTL. Fortunately, both the original algorithm and the RTL are machine readable and executable, and can be converted into control/data flow graphs. These are ideal for performing a formal equivalency check.”
In search of the dream
The dream of 25 years ago, where a functional description could be transformed into an implementation, has not died. The problem has become a lot more complex, but researchers still hope to get to a predictable flow from a high-level description down to an implementation. It requires that semantics are well-defined.
“From a high-level description, I can go into definable target architectures, where definable means my processors and FPGAs,” says Schirrmeister. “Under the hood, there is often an intermediate representation. CUDA works because you have a defined target architecture and then you can use a higher-level description. OpenMP works because you are fairly flexible with the implementation underneath. I have seen OpenCAPI, which is multi-core mapping, and there are tools that can take that description and automatically map it into hardware components and into operating systems.”
The problem is that we are trying to develop tools with the minimum number of constraints. “C expresses everything,” adds Schirrmeister. “OpenMP, OpenCL, etc., have a much narrower space. It is a pure market problem. C was generally adoptable, but we threw away properly defined semantics. However, if you go with better defined semantics you can only represent a smaller amount of the applications, and you will find fewer users that it is applicable to.”
Related Stories
Addressing Pain Points In Chip Design
Partitioning, debug and first-pass working silicon lead the list of problems that need to be solved.
Partitioning Drives Architectural Considerations
Experts at the Table, part 2: Biggest tradeoffs in partitioning.
Partitioning Drives Architectural Considerations Part 1
Experts at the Table, part 1: When and how do chip architects prioritize partitioning?
FPGA Design Tradeoffs Getting Tougher
As chips grow in size, optimizing performance and power requires a bunch of new options and methodology changes.
Scaling, Packaging, And Partitioning
Why choreographing better yield is so difficult.