Using FPGAs to optimize high-performance computing, without specialized knowledge.
For most scientists, what is inside a high-performance computing platform is a mystery. All they usually want to know is that the platform will run whatever advanced algorithm they throw at it. But what happens when a subject-matter expert builds a powerful model of an algorithm, and the C code generated automatically from that model runs too slowly? FPGA experts have created an answer.
More and more, the general-purpose processor found in server-class platforms is yielding to something more optimized for the challenges of high-performance computing (HPC). Advanced algorithms like convolutional neural networks (CNNs), real-time analytics, and high-throughput sensor fusion are quickly overwhelming traditional hardware platforms. In some cases, HPC developers are turning to GPUs as co-processors and deploying parallel programming schemes – but at a massive cost in increased power consumption.
A more promising approach to workload optimization, using considerably less power, is hardware acceleration with FPGAs. In their early days, FPGAs found homes as reconfigurable compute engines for signal processing tasks; the technology is now coming full circle, and the premise is again gaining favor. The challenge with FPGA technology in the HPC community has always been how a scientist with little to no hardware background translates a favorite algorithm onto a reconfigurable platform.
Interrupting the flow
Most subject-matter experts today first work out system-level algorithms in a modeling tool such as MATLAB. It’s a wonderful thing to be able to grab high-level block diagrams, state diagrams, and code fragments, and piece together a data flow architecture that runs like a charm in simulation. Using MATLAB Coder, C code can be generated directly from the model, and in many cases even brought back into the model as MEX functions to speed up simulation. MathWorks has been working diligently to optimize auto-generated code for particular processors, for example by leveraging Intel Integrated Performance Primitives.
The walls of scale and time soon appear, however. An algorithm proven out on a relatively small test data set can become bogged down when a large production data set is introduced. Reasonable execution time in simulation often explodes as the processor and memory get chewed up, and the entire HPC platform slows to an aggravating crawl trying to keep up with just one algorithm. After spending a lot of money on what seemed to be high-performance hardware, the last thing any project manager wants to hear is that they need more hardware to get the job done.
A smooth design flow is suddenly interrupted. While some algorithms vectorize well, many simply don’t, and more processor cores may not help at all unless a careful multi-threading exercise is undertaken. Parallel GPU programming is also not for the faint of heart. A more likely scenario is that one critical section of C code, if somehow made to run faster, would relieve the system bottleneck.
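To make the vectorization point concrete, here is a minimal C sketch contrasting two loops. The function names are illustrative only and do not come from any library or from MATLAB Coder output. The first loop has independent iterations; the second carries a value from one iteration to the next, which is the kind of structure that tends to resist simple vectorization or extra cores.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: each output depends only on its own inputs,
 * so a compiler is free to vectorize the loop or split it across cores. */
void scale_add(const float *x, const float *y, float *out, size_t n, float a)
{
    assert(x && y && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];
}

/* Loop-carried dependency: a first-order recursive filter.
 * out[i] needs the previous result, so iterations cannot simply be
 * run side by side without restructuring the algorithm. */
void recursive_filter(const float *x, float *out, size_t n, float alpha)
{
    assert(x && out);
    float prev = 0.0f;
    for (size_t i = 0; i < n; i++) {
        prev = alpha * x[i] + (1.0f - alpha) * prev;
        out[i] = prev;
    }
}
```

A routine shaped like the second function is a plausible candidate for the "critical section" that is worth accelerating some other way.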
Moving from the familiar territory of MATLAB models and C code to the unfamiliar regions of LUTs, RTL, AXI, and PCI Express surrounding an FPGA is a lot to ask of most scientists. Fortunately, other experts have been solving the tool stack issues surrounding Xilinx technology, facilitating a move from unaided C code to FPGA-accelerated C code.
High-level synthesis for Virtex-7
The Xilinx Virtex-7 FPGA offers an environment that addresses the challenges of integrating FPGA hardware with an HPC host platform. With a large array of programmable logic tied together by a high-performance interconnect, the Virtex-7 is flexible enough to implement complex algorithms. When deployed in a system-level solution with a configurable PCI Express x8 link (implemented in a separate Xilinx Kintex-7 FPGA), Virtex-7 devices form the basis for a powerful acceleration platform.
What is the magic behind the boxes labeled “Compute Device”? There are several approaches to compiling algorithms from C to FPGA-ready code. One such tool is supplied by Xilinx directly: Vivado High-Level Synthesis. A no-cost upgrade to Vivado HLx Editions, Vivado HLS understands the Virtex-7 architecture and interfaces and delivers quality of results (QoR) with direct compilation from C, C++, and SystemC.
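As a rough illustration of what C written for high-level synthesis can look like, the sketch below shows a small FIR filter kernel. The kernel itself, its name, and the tap count are hypothetical. The two directives follow Vivado HLS pragma conventions, but which directives a real design needs (including board-specific interface pragmas, omitted here) is an assumption that depends on the target.

```c
#include <assert.h>

#define N_TAPS 4  /* illustrative tap count, not from the article */

/* Hypothetical kernel: an N_TAPS-point FIR filter over n samples.
 * A standard C compiler ignores the HLS pragmas (possibly with a
 * warning), so the same source can be checked on the host first. */
void fir_kernel(const int *in, int *out, int n, const int coeff[N_TAPS])
{
    assert(n >= 0);

    int shift[N_TAPS] = {0};
    /* Ask the tool to keep the delay line in registers, not a memory. */
#pragma HLS ARRAY_PARTITION variable=shift complete dim=1

    for (int i = 0; i < n; i++) {
        /* Request one new sample per clock; the fixed-bound inner
         * loops are typically unrolled as a consequence. */
#pragma HLS PIPELINE II=1
        for (int t = N_TAPS - 1; t > 0; t--)
            shift[t] = shift[t - 1];
        shift[0] = in[i];

        int acc = 0;
        for (int t = 0; t < N_TAPS; t++)
            acc += coeff[t] * shift[t];
        out[i] = acc;
    }
}
```

The function stays ordinary C, so the scientist can validate it against the original model on the host before any hardware is involved.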
A typical acceleration flow partitions code into a host application running on the HPC platform, and a section of C code for acceleration in an FPGA. Partitioning is based on code profiling, identifying areas of code that deliver maximum benefit when dropped into executable FPGA hardware. The two platforms are connected via PCI Express, but a communication link is only part of the solution – more on this shortly.
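The profiling step can be approximated with nothing more than the standard C clock. The sketch below is a minimal illustration, not a replacement for a real profiler: it times candidate stages and picks the slowest. All names (`time_stage`, `slowest_stage`, `sum_squares_stage`) are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* A pipeline stage takes an opaque context pointer. */
typedef void (*stage_fn)(void *ctx);

/* Return the CPU time, in seconds, consumed by one call to fn. */
double time_stage(stage_fn fn, void *ctx)
{
    assert(fn);
    clock_t start = clock();
    fn(ctx);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Return the index of the stage with the largest measured time:
 * the first candidate for moving into the FPGA. */
int slowest_stage(const double *seconds, int n)
{
    assert(seconds && n > 0);
    int worst = 0;
    for (int i = 1; i < n; i++)
        if (seconds[i] > seconds[worst])
            worst = i;
    return worst;
}

/* Example stage: sum of squares up to a count passed through ctx.
 * volatile keeps the compiler from optimizing the loop away. */
void sum_squares_stage(void *ctx)
{
    const long *count = (const long *)ctx;
    volatile double acc = 0.0;
    for (long i = 0; i < *count; i++)
        acc += (double)i * (double)i;
}
```

In practice each stage of the host application would be timed on a production-sized data set, since the hot spot on a small test set may not be the hot spot at scale.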
Synchronization and simulation
To keep the two platforms synchronized, AXI messages can be used from end to end. Over a PCI Express x8 interface, AXI throughput between a host and an acceleration board exceeds 2 GB/s. Since AXI is a common protocol used in most Virtex-7 intellectual property (IP) blocks, it forms a natural high-bandwidth interconnect between the host and the Virtex-7 device, including the C-accelerated Compute Device block. A pair of Virtex-7 devices can also be interconnected easily using AXI, as shown.
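The throughput figure matters because data has to cross the link before and after the accelerated routine runs. A back-of-the-envelope check, sketched below with hypothetical helper names, compares host-only run time against accelerated run time plus transfer time. The 2 GB/s value is the lower bound quoted above; every other number a caller supplies is an assumption about their own workload.

```c
#include <assert.h>

/* Time to move a payload across the link, in seconds. */
double transfer_seconds(double bytes, double bytes_per_sec)
{
    assert(bytes >= 0.0 && bytes_per_sec > 0.0);
    return bytes / bytes_per_sec;
}

/* Return 1 if offloading is faster than staying on the host.
 * cpu_s:       measured run time of the routine on the host
 * fpga_s:      estimated run time of the routine in the FPGA
 * bytes_moved: total input plus output data crossing the link
 * link_bps:    sustained link throughput in bytes per second */
int offload_pays_off(double cpu_s, double fpga_s,
                     double bytes_moved, double link_bps)
{
    double total = fpga_s + transfer_seconds(bytes_moved, link_bps);
    return total < cpu_s;
}
```

For example, moving 4 GB at 2 GB/s costs 2 seconds, so a routine that takes 10 seconds on the host and 1 second in the FPGA still comes out well ahead, while one that takes only 2.5 seconds on the host does not.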
This is the same concept used in co-simulation, where event-driven simulation is divided between a software simulator and a hardware acceleration platform. Via a straightforward application programming interface (API), calls from the HPC host application can redirect execution to the Virtex-7 implementation of accelerated C code. Using an RTL simulator such as Aldec Riviera-PRO with support for AXI Bus Functional Models (BFMs), the integrated environment can be completely debugged, verified, and controlled.
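The idea of redirecting a call from the host to the accelerated implementation can be pictured as a small dispatch layer. The sketch below is purely illustrative: the types and function names are hypothetical and are not the vendor's API. The "hardware" path is a stub that stands in for whatever would marshal buffers across the link.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel processes n inputs and returns the number of outputs written. */
typedef int (*kernel_fn)(const int *in, int *out, size_t n);

typedef struct {
    kernel_fn software;  /* reference C implementation on the host */
    kernel_fn hardware;  /* accelerated path, or NULL if none is bound */
} kernel_dispatch;

/* Route the call to hardware when available, otherwise to software. */
int run_kernel(const kernel_dispatch *d, const int *in, int *out, size_t n)
{
    assert(d && d->software);
    kernel_fn fn = d->hardware ? d->hardware : d->software;
    return fn(in, out, n);
}

/* Reference software kernel: square each element. */
int square_sw(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
    return (int)n;
}

/* Counts how often the stubbed hardware path was taken. */
static int hw_calls = 0;

/* Stand-in for the accelerated path. A real implementation would send
 * the buffers to the FPGA; this stub reuses the software kernel so the
 * sketch runs anywhere. */
int square_hw_stub(const int *in, int *out, size_t n)
{
    hw_calls++;
    return square_sw(in, out, n);
}
```

Keeping the software kernel alongside the hardware one also gives a reference result to compare against while the accelerated version is being debugged in simulation.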
While it would be possible for a sophisticated design team to bring tools together for this job, the whole point of the concept is to make integration simple for scientific teams using HPC platforms. Aldec has done the heavy lifting of designing high-performance Virtex-7-based hardware and creating a software stack that configures everything in a run-time solution. Once the solution is installed on the HPC host, configuration of the programmable logic in the Virtex-7 is done automatically. No specialized knowledge of FPGAs is required to use the platform – scientists simply create code in C, either hand-programmed or auto-generated.
HLS solutions produce RTL that can be hosted directly in an FPGA, delivering good results through automation. Still, some situations demand attention to the last few percentage points of RTL optimization. With over 30 years of experience in designing and optimizing FPGA systems, Aldec offers its RTL Porting Services to achieve the highest possible performance. Scientists can generate an accelerated algorithm quickly, then turn it over to Aldec RTL Porting Services to squeeze every advantage from the implementation via hand-tuning with intimate knowledge of Xilinx FPGA architecture and tools.
Less stress, more success
Reconfigurable computing platforms used to require a lot of care and feeding. Today’s high-performance FPGAs and high-level synthesis tools continue to evolve, and their results are approaching the performance of hand-coded RTL, without a tedious translation from C to hardware primitives. Instead of an all-or-nothing solution, C acceleration of critical routines is an easier path that combines the benefits of an HPC platform with those of a Xilinx Virtex-7 platform.
Aldec’s HES-7 platform and Proto-AXI software, along with the Riviera-PRO simulator, have been proven in years of RTL design for FPGAs and ASICs. When fully integrated with Xilinx Vivado HLS or NEC CyberWorkbench, C code drops into place quickly on accelerated hardware. Instead of just bare FPGA hardware, Aldec offers the complete run-time environment, tools, and advisory services to ensure HPC projects succeed.